pymfe.statistical.MFEStatistical
- class pymfe.statistical.MFEStatistical[source]
Keep methods for metafeatures of the Statistical group.
The convention adopted for metafeature-extraction methods is to always start with the ft_ prefix, in order to allow automatic method detection. This prefix is predefined within the _internal module.
All method signatures follow the conventions and restrictions listed below:
For independent attribute data, X means every type of attribute, N means numeric attributes only, and C stands for categorical attributes only. It is important to note that the categorical attribute sets between X and C, and the numerical attribute sets between X and N, may differ due to data transformations performed while fitting data into an MFE model, enabled respectively by the transform_num and transform_cat arguments of fit (an MFE method).
Only arguments in the MFE _custom_args_ft attribute (set up inside the fit method) are allowed to be required method arguments. All other arguments must be strictly optional (i.e., have a predefined default value).
It is assumed that the user can change any optional argument, without any previous verification of either type or value, via the kwargs argument of the extract method of the MFE class.
The return value of all feature-extraction methods should be a single value or a generic list (preferably an np.ndarray) with numeric values.
There is another type of method adopted for automatic detection, marked with the precompute_ prefix. These methods run automatically while fitting data into an MFE model, and their objective is to precompute some common value shared between more than one feature-extraction method. This strategy is a trade-off between higher memory consumption and faster feature extraction. Their return value must always be a dictionary whose keys are possible extra arguments for both feature-extraction methods and other precomputation methods. Note that precomputed values are shared between all valid feature-extraction modules (e.g., class_freqs computed in the statistical module can freely be used by any precomputation or feature-extraction method of the landmarking module).
- __init__(*args, **kwargs)
Methods
__init__(*args, **kwargs)
ft_can_cor(N, y[, can_cors])  Compute canonical correlations of data.
ft_cor(N[, abs_corr_mat])  Compute the absolute value of the correlation of distinct dataset column pairs.
ft_cov(N[, ddof, cov_mat])  Compute the absolute value of the covariance of distinct dataset attribute pairs.
ft_eigenvalues(N[, ddof, cov_mat])  Compute the eigenvalues of the covariance matrix of the dataset.
ft_g_mean(N[, allow_zeros, epsilon])  Compute the geometric mean of each attribute.
ft_gravity(N, y[, norm_ord, classes, ...])  Compute the distance between the minority and majority classes' centers of mass.
ft_h_mean(N)  Compute the harmonic mean of each attribute.
ft_iq_range(N)  Compute the interquartile range (IQR) of each attribute.
ft_kurtosis(N[, method, bias])  Compute the kurtosis of each attribute.
ft_lh_trace(N, y[, can_cor_eigvals, can_cors])  Compute the Lawley-Hotelling trace.
ft_mad(N[, factor])  Compute the Median Absolute Deviation (MAD) adjusted by a factor.
ft_max(N)  Compute the maximum value of each attribute.
ft_mean(N)  Compute the mean value of each attribute.
ft_median(N)  Compute the median value of each attribute.
ft_min(N)  Compute the minimum value of each attribute.
ft_nr_cor_attr(N[, threshold, normalize, ...])  Compute the number of distinct highly correlated pairs of attributes.
ft_nr_disc(N, y[, can_cors])  Compute the number of canonical correlations between each attribute and class.
ft_nr_norm(N[, method, threshold, failure, ...])  Compute the number of attributes normally distributed based on a given method.
ft_nr_outliers(N[, whis])  Compute the number of attributes with at least one outlier value.
ft_p_trace(N, y[, can_cors])  Compute Pillai's trace.
ft_range(N)  Compute the range (max - min) of each attribute.
ft_roy_root(N, y[, criterion, can_cors, ...])  Compute Roy's largest root.
ft_sd(N[, ddof])  Compute the standard deviation of each attribute.
ft_sd_ratio(N, y[, ddof, classes, class_freqs])  Compute a statistical test for homogeneity of covariances.
ft_skewness(N[, method, bias])  Compute the skewness of each attribute.
ft_sparsity(X[, normalize])  Compute the (possibly normalized) sparsity metric of each attribute.
ft_t_mean(N[, pcut])  Compute the trimmed mean of each attribute.
ft_var(N[, ddof])  Compute the variance of each attribute.
ft_w_lambda(N, y[, can_cor_eigvals, can_cors])  Compute the Wilks' Lambda value.
precompute_can_cors([N, y])  Precompute canonical correlations and their eigenvalues.
precompute_statistical_class([y])  Precompute distinct classes and their absolute frequencies from y.
precompute_statistical_cor_cov([N, ddof])  Precompute the correlation and covariance matrices of numerical data.
- classmethod ft_can_cor(N: ndarray, y: ndarray, can_cors: Optional[ndarray] = None) ndarray [source]
Compute canonical correlations of data.
The canonical correlations are calculated between the attributes in N and the binarized (one-hot encoded) version of y.
- Parameters
- N : np.ndarray
Fitted numerical data.
- y : np.ndarray
Target attribute.
- can_cors : np.ndarray, optional
Canonical correlations between N and the one-hot encoded version of y. Argument used to take advantage of precomputations.
- Returns
np.ndarray
Canonical correlations of the data.
References
- 1
Alexandros Kalousis. Algorithm Selection via Meta-Learning. PhD thesis, Faculty of Science of the University of Geneva, 2002.
- classmethod ft_cor(N: ndarray, abs_corr_mat: Optional[ndarray] = None) ndarray [source]
Compute the absolute value of the correlation of distinct dataset column pairs.
- Parameters
- N : np.ndarray
Fitted numerical data.
- abs_corr_mat : np.ndarray, optional
Absolute correlation matrix of N. Argument used to exploit precomputations.
- Returns
np.ndarray
Absolute value of correlation between distinct attributes.
References
- 1
Ciro Castiello, Giovanna Castellano, and Anna Maria Fanelli. Meta-data: Characterization of input features for meta-learning. In 2nd International Conference on Modeling Decisions for Artificial Intelligence (MDAI), pages 457–468, 2005.
- 2
Matthias Reif, Faisal Shafait, Markus Goldstein, Thomas Breuel, and Andreas Dengel. Automatic classifier selection for non-experts. Pattern Analysis and Applications, 17(1):83–96, 2014.
- 3
Donald Michie, David J. Spiegelhalter, Charles C. Taylor, and John Campbell. Machine Learning, Neural and Statistical Classification, volume 37. Ellis Horwood Upper Saddle River, 1994.
- classmethod ft_cov(N: ndarray, ddof: int = 1, cov_mat: Optional[ndarray] = None) ndarray [source]
Compute the absolute value of the covariance of distinct dataset attribute pairs.
- Parameters
- N : np.ndarray
Fitted numerical data.
- ddof : int, optional
Degrees of freedom for the covariance matrix.
- cov_mat : np.ndarray, optional
Covariance matrix of N. Argument meant to exploit precomputations. Note that this argument value is not the same as this method's return value, as the method returns only the lower-triangle values from cov_mat.
- Returns
np.ndarray
Absolute value of covariances between distinct attributes.
References
- 1
Ciro Castiello, Giovanna Castellano, and Anna Maria Fanelli. Meta-data: Characterization of input features for meta-learning. In 2nd International Conference on Modeling Decisions for Artificial Intelligence (MDAI), pages 457–468, 2005.
- 2
Donald Michie, David J. Spiegelhalter, Charles C. Taylor, and John Campbell. Machine Learning, Neural and Statistical Classification, volume 37. Ellis Horwood Upper Saddle River, 1994.
- classmethod ft_eigenvalues(N: ndarray, ddof: int = 1, cov_mat: Optional[ndarray] = None) ndarray [source]
Compute the eigenvalues of covariance matrix from dataset.
- Parameters
- N : np.ndarray
Fitted numerical data.
- ddof : int, optional
Degrees of freedom for the covariance matrix.
- cov_mat : np.ndarray, optional
Covariance matrix of N. Argument meant to exploit precomputations.
- Returns
np.ndarray
Eigenvalues of the covariance matrix of N.
References
- 1
Shawkat Ali and Kate A. Smith. On learning algorithm selection for classification. Applied Soft Computing, 6(2):119 – 138, 2006.
- classmethod ft_g_mean(N: ndarray, allow_zeros: bool = True, epsilon: float = 1e-10) ndarray [source]
Compute the geometric mean of each attribute.
- Parameters
- N : np.ndarray
Fitted numerical data.
- allow_zeros : bool, optional
If True, then the geometric mean of any attribute with zero values is set to zero. Otherwise, those values are set to np.nan.
- epsilon : float, optional
A small value; any value whose absolute value is smaller than it is considered zero-valued. Used only if allow_zeros is False.
- Returns
np.ndarray
Attribute geometric means.
References
- 1
Shawkat Ali and Kate A. Smith-Miles. A meta-learning approach to automatic kernel selection for support vector machines. Neurocomputing, 70(1):173 – 186, 2006.
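The allow_zeros/epsilon behavior described above can be sketched in plain Python. This is an illustrative single-column helper, not the pymfe implementation (which operates column-wise on np.ndarray data):

```python
import math

def g_mean(values, allow_zeros=True, epsilon=1e-10):
    # Geometric mean of one column of positive values; zeros are
    # special-cased via allow_zeros/epsilon as described above.
    if not allow_zeros:
        # Near-zero values invalidate the result: return NaN.
        if any(abs(v) < epsilon for v in values):
            return float("nan")
    if any(v == 0 for v in values):
        return 0.0  # any zero forces the geometric mean to zero
    return math.exp(sum(math.log(v) for v in values) / len(values))

g_mean([2.0, 8.0])   # close to 4.0
g_mean([0.0, 5.0])   # 0.0 (allow_zeros=True)
```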
- classmethod ft_gravity(N: ndarray, y: ndarray, norm_ord: Union[int, float] = 2, classes: Optional[ndarray] = None, class_freqs: Optional[ndarray] = None, cls_inds: Optional[ndarray] = None) float [source]
Compute the distance between minority and majority classes center of mass.
The center of mass of a class is the average value of each attribute between instances of the same class.
The majority and minority classes cannot be the same, even if every class has the same number of instances.
- Parameters
- N : np.ndarray
Fitted numerical data.
- y : np.ndarray
Target attribute.
- norm_ord : numeric, optional
Minkowski distance parameter. The Minkowski distance has the following popular cases for this argument value:
norm_ord    Distance name
-> -inf     Min value
1.0         Manhattan/City Block
2.0         Euclidean
-> +inf     Max value (infinite norm)
- classes : np.ndarray, optional
Distinct classes of y.
- class_freqs : np.ndarray, optional
Absolute frequencies of each distinct class in target attribute y or classes. If classes is given, then this argument must be paired with it by index.
- cls_inds : np.ndarray, optional
Boolean array which indicates the examples of each class. The rows represent each distinct class, and the columns represent the instances. Used to take advantage of precomputations.
- Returns
- float
Gravity of the numeric dataset.
- Raises
- ValueError
If norm_ord is not numeric.
References
- 1
Shawkat Ali and Kate A. Smith. On learning algorithm selection for classification. Applied Soft Computing, 6(2):119 – 138, 2006.
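As an illustration of the idea (not the pymfe implementation, which also resolves which classes are the minority and majority), the distance between two hypothetical class centers of mass under a Minkowski norm can be sketched as:

```python
def center_of_mass(rows):
    # Average of each attribute over the instances of one class.
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def minkowski(a, b, norm_ord=2):
    # Minkowski distance; norm_ord=2 is the Euclidean case.
    return sum(abs(x - y) ** norm_ord for x, y in zip(a, b)) ** (1.0 / norm_ord)

# Hypothetical data: two classes in a 2-attribute dataset.
majority = [[0.0, 0.0], [2.0, 0.0]]   # center of mass: [1.0, 0.0]
minority = [[4.0, 4.0]]               # center of mass: [4.0, 4.0]

gravity = minkowski(center_of_mass(majority), center_of_mass(minority))
# Euclidean distance between [1, 0] and [4, 4] is 5.0.
```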
- classmethod ft_h_mean(N: ndarray) ndarray [source]
Compute the harmonic mean of each attribute.
- Parameters
- N : np.ndarray
Fitted numerical data.
- Returns
np.ndarray
Attribute harmonic means.
References
- 1
Shawkat Ali and Kate A. Smith-Miles. A meta-learning approach to automatic kernel selection for support vector machines. Neurocomputing, 70(1):173 – 186, 2006.
- classmethod ft_iq_range(N: ndarray) ndarray [source]
Compute the interquartile range (IQR) of each attribute.
- Parameters
- N : np.ndarray
Fitted numerical data.
- Returns
np.ndarray
Attribute interquartile ranges.
References
- 1
Shawkat Ali and Kate A. Smith-Miles. A meta-learning approach to automatic kernel selection for support vector machines. Neurocomputing, 70(1):173 – 186, 2006.
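A per-column sketch using only the standard library (pymfe computes this on np.ndarray data; the "inclusive" quantile method below matches NumPy's default linear interpolation, which is an assumption about the exact quartile convention):

```python
import statistics

def iq_range(column):
    # Q3 - Q1 of one column; method="inclusive" interpolates on the
    # sample itself, like numpy.percentile's default.
    q1, _, q3 = statistics.quantiles(column, n=4, method="inclusive")
    return q3 - q1

iq_range([1, 2, 3, 4, 5, 6, 7, 8])  # Q1 = 2.75, Q3 = 6.25 -> 3.5
```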
- classmethod ft_kurtosis(N: ndarray, method: int = 3, bias: bool = True) ndarray [source]
Compute the kurtosis of each attribute.
- Parameters
- N : np.ndarray
Fitted numerical data.
- method : int, optional
Defines the strategy used to estimate data kurtosis. Used for full compatibility with the R package e1071. The options must be one of the following:
Option  Formula
1       Kurt_1 = (m_4 / m_2**2 - 3) (default of the scipy.stats package)
2       Kurt_2 = ((n+1) * Kurt_1 + 6) * (n-1) / f_2, where f_2 = (n-2)*(n-3)
3       Kurt_3 = (m_4 / s**4 - 3) = ((Kurt_1 + 3) * (1 - 1/n)**2 - 3)
Where n is the number of instances in N, s is the standard deviation of each attribute in N, and m_i is the ith statistical moment of each attribute in N.
Note that if the selected method cannot be calculated due to division by zero, then the first method is used instead.
- bias : bool, optional
If False, then the calculations are corrected for statistical bias.
- Returns
np.ndarray
Attribute kurtosis.
References
- 1
Donald Michie, David J. Spiegelhalter, Charles C. Taylor, and John Campbell. Machine Learning, Neural and Statistical Classification, volume 37. Ellis Horwood Upper Saddle River, 1994.
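The three formulas above can be sketched on a single column in plain Python (illustrative only; pymfe operates on np.ndarray columns via scipy/numpy):

```python
def moment(xs, k):
    # kth central sample moment m_k.
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** k for x in xs) / n

def kurtosis(xs, method=3):
    # Kurtosis variants mirroring the e1071 conventions above.
    n = len(xs)
    m2, m4 = moment(xs, 2), moment(xs, 4)
    kurt_1 = m4 / m2 ** 2 - 3.0              # method 1 (scipy default)
    if method == 1:
        return kurt_1
    if method == 2:
        return ((n + 1) * kurt_1 + 6) * (n - 1) / ((n - 2) * (n - 3))
    return (kurt_1 + 3.0) * (1.0 - 1.0 / n) ** 2 - 3.0  # method 3

xs = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
kurtosis(xs, method=1)  # m_2 = 4, m_4 = 44.5 -> -0.21875
```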
- classmethod ft_lh_trace(N: ndarray, y: ndarray, can_cor_eigvals: Optional[ndarray] = None, can_cors: Optional[ndarray] = None) float [source]
Compute the Lawley-Hotelling trace.
The Lawley-Hotelling trace LH is given by:
LH = sum_{i} can_cor_i**2 / (1 - can_cor_i**2)
Where can_cor_i is the ith canonical correlation of N and the one-hot encoded version of y.
Equivalently, LH can be calculated from the eigenvalues related to each canonical correlation, due to the relationship:
can_cor_eigval_i = can_cor_i**2 / (1 - can_cor_i**2)
Therefore, LH is given simply by:
LH = sum_{i} can_cor_eigval_i
- Parameters
- N : np.ndarray
Numerical fitted data.
- y : np.ndarray
Target attribute.
- can_cor_eigvals : np.ndarray, optional
Eigenvalues associated with the canonical correlations of N and one-hot encoded y. This argument is used to exploit precomputations. The relationship between the ith canonical correlation can_cor_i and its eigenvalue is:
can_cor_i = sqrt(can_cor_eigval_i / (1 + can_cor_eigval_i))
Or, equivalently:
can_cor_eigval_i = can_cor_i**2 / (1 - can_cor_i**2)
- can_cors : np.ndarray, optional
Canonical correlations between N and the one-hot encoded version of y. Argument used to take advantage of precomputations. Used only if can_cor_eigvals is None.
- Returns
- float
Lawley-Hotelling trace value.
References
- 1
Lawley D. A Generalization of Fisher’s z Test. Biometrika. 1938;30(1):180-187.
- 2
Hotelling H. A generalized T test and measure of multivariate dispersion. In: Neyman J, ed. Proceedings of the Second Berkeley Symposium on Mathematical Statistics and Probability. Berkeley: University of California Press; 1951:23-41.
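The two equivalent formulations above can be checked numerically with hypothetical canonical correlation values:

```python
# Hypothetical canonical correlations (values in [0, 1)).
can_cors = [0.9, 0.5, 0.1]

# Eigenvalue associated with each canonical correlation.
eigvals = [r ** 2 / (1.0 - r ** 2) for r in can_cors]

lh_from_cors = sum(r ** 2 / (1.0 - r ** 2) for r in can_cors)
lh_from_eigvals = sum(eigvals)
# Both formulations give the same Lawley-Hotelling trace.
```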
- classmethod ft_mad(N: ndarray, factor: float = 1.4826) ndarray [source]
Compute the Median Absolute Deviation (MAD) adjusted by a factor.
- Parameters
- N : np.ndarray
Fitted numerical data.
- factor : float, optional
Multiplication factor for output correction. The default factor is 1.4826, approximately the reciprocal of the MAD of normally distributed data with a standard deviation of 1.0; this scaling makes the result comparable with the standard deviation of such data.
- Returns
np.ndarray
Attribute MAD (Median Absolute Deviation).
References
- 1
Shawkat Ali and Kate A. Smith. On learning algorithm selection for classification. Applied Soft Computing, 6(2):119 – 138, 2006.
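A single-column sketch of the scaled MAD (illustrative, not the pymfe implementation):

```python
import statistics

def mad(column, factor=1.4826):
    # Median of absolute deviations from the median, scaled by factor.
    med = statistics.median(column)
    return factor * statistics.median(abs(x - med) for x in column)

mad([1, 2, 3, 4, 100])  # median 3, abs. deviations [2, 1, 0, 1, 97] -> 1.4826
```

Note how the single extreme value 100 barely affects the result, which is the point of using a robust dispersion measure.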
- classmethod ft_max(N: ndarray) ndarray [source]
Compute the maximum value from each attribute.
- Parameters
- N : np.ndarray
Fitted numerical data.
- Returns
np.ndarray
Attribute maximum values.
References
- 1
Robert Engels and Christiane Theusinger. Using a data metric for preprocessing advice for data mining applications. In 13th European Conference on on Artificial Intelligence (ECAI), pages 430 – 434, 1998.
- classmethod ft_mean(N: ndarray) ndarray [source]
Compute the mean value of each attribute.
- Parameters
- N : np.ndarray
Fitted numerical data.
- Returns
np.ndarray
Attribute mean values.
References
- 1
Robert Engels and Christiane Theusinger. Using a data metric for preprocessing advice for data mining applications. In 13th European Conference on on Artificial Intelligence (ECAI), pages 430 – 434, 1998.
- classmethod ft_median(N: ndarray) ndarray [source]
Compute the median value from each attribute.
- Parameters
- N : np.ndarray
Fitted numerical data.
- Returns
np.ndarray
Attribute median values.
References
- 1
Robert Engels and Christiane Theusinger. Using a data metric for preprocessing advice for data mining applications. In 13th European Conference on on Artificial Intelligence (ECAI), pages 430 – 434, 1998.
- classmethod ft_min(N: ndarray) ndarray [source]
Compute the minimum value from each attribute.
- Parameters
- N : np.ndarray
Fitted numerical data.
- Returns
np.ndarray
Attribute minimum values.
References
- 1
Robert Engels and Christiane Theusinger. Using a data metric for preprocessing advice for data mining applications. In 13th European Conference on on Artificial Intelligence (ECAI), pages 430 – 434, 1998.
- classmethod ft_nr_cor_attr(N: ndarray, threshold: float = 0.5, normalize: bool = True, abs_corr_mat: Optional[ndarray] = None) Union[int, float] [source]
Compute the number of distinct highly correlated pair of attributes.
A pair of attributes is considered highly correlated if the absolute value of its correlation is equal to or larger than a given threshold.
- Parameters
- N : np.ndarray
Fitted numerical data.
- threshold : float, optional
The threshold value; correlation is assumed to be strong if its absolute value is equal to or greater than it.
- normalize : bool, optional
If True, the result is normalized by a factor of 2/(d*(d-1)), where d is the number of attributes (columns) in N.
- abs_corr_mat : np.ndarray, optional
Absolute correlation matrix of N. Argument used to exploit precomputations.
- Returns
- int | float
If normalize is False, this method returns the number of highly correlated pairs of distinct attributes. Otherwise, it returns the proportion of highly correlated attributes.
References
- 1
Mostafa A. Salama, Aboul Ella Hassanien, and Kenneth Revett. Employment of neural network and rough set in meta-learning. Memetic Computing, 5(3):165 – 177, 2013.
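The pair counting and the 2/(d*(d-1)) normalization can be sketched in plain Python on a hypothetical column-major dataset (illustrative; pymfe works from the precomputed absolute correlation matrix):

```python
import math

def pearson(a, b):
    # Pearson correlation of two equal-length columns.
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def nr_cor_attr(columns, threshold=0.5, normalize=True):
    # Count distinct pairs whose |correlation| >= threshold.
    d = len(columns)
    count = sum(
        1
        for i in range(d)
        for j in range(i + 1, d)
        if abs(pearson(columns[i], columns[j])) >= threshold
    )
    return count * 2.0 / (d * (d - 1)) if normalize else count

cols = [
    [1.0, 2.0, 3.0, 4.0],    # perfectly correlated with the next column
    [2.0, 4.0, 6.0, 8.0],
    [1.0, -1.0, 1.0, -1.0],  # |correlation| < 0.5 with the first two
]
nr_cor_attr(cols, normalize=False)  # 1 of the 3 distinct pairs
```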
- classmethod ft_nr_disc(N: ndarray, y: ndarray, can_cors: Optional[ndarray] = None) Union[int, float] [source]
Compute the number of canonical correlations between each attribute and class.
This method's return value is effectively the size of the return value of the ft_can_cor method. Check its documentation for more in-depth details.
- Parameters
- N : np.ndarray
Fitted numerical data.
- y : np.ndarray
Target attribute.
- can_cors : np.ndarray, optional
Canonical correlations between N and the one-hot encoded version of y. Argument used to take advantage of precomputations.
- Returns
- int or float
Number of canonical correlations between each attribute and class, if ft_can_cor is executed successfully. Returns np.nan otherwise.
References
- 1
Guido Lindner and Rudi Studer. AST: Support for algorithm selection with a CBR approach. In European Conference on Principles of Data Mining and Knowledge Discovery (PKDD), pages 418 – 423, 1999.
- classmethod ft_nr_norm(N: ndarray, method: str = 'shapiro-wilk', threshold: float = 0.05, failure: str = 'soft', max_samples: int = 5000) Union[float, int] [source]
Compute the number of attributes normally distributed based on a given method.
- Parameters
- N : np.ndarray
Fitted numerical data.
- method : str, optional
Select the normality test to be executed. This argument must assume one of the options shown below:
shapiro-wilk: from the scipy.stats.shapiro documentation: the Shapiro-Wilk test tests the null hypothesis that the data was drawn from a normal distribution.
dagostino-pearson: from the scipy.stats.normaltest documentation: it is based on D'Agostino and Pearson's test, which combines skew and kurtosis to produce an omnibus test of normality.
anderson-darling: from the scipy.stats.anderson documentation: the Anderson-Darling test tests the null hypothesis that a sample is drawn from a population that follows a particular distribution. In this method's context, that particular distribution is fixed as the normal/Gaussian.
all: perform all tests cited above. To consider an attribute normally distributed, all test results are taken into account with equal weight. Check the failure argument for more information.
- threshold : float, optional
Level of significance used to reject the null hypothesis of the normality tests.
- failure : str, optional
Used only if the method argument value is all. This argument must assume one value between soft and hard. If soft, an attribute is considered normally distributed if a single test fails to reject its null hypothesis (which states that the data follows a Gaussian distribution). If hard, the null hypothesis must fail to be rejected by every single normality test for the attribute to be considered normally distributed.
- max_samples : int, optional
Maximum number of samples used while performing the normality tests. The Shapiro-Wilk test p-value may not be accurate when the sample size is larger than 5000. Note that the instances are NOT shuffled before this cutoff; the very first max_samples instances of the dataset N are considered in the statistical tests.
- Returns
- int
The number of normally distributed attributes based on the method. If max_samples is non-positive, np.nan is returned instead.
- Raises
- ValueError
If method or failure is not a valid option.
References
- 1
Christian Kopf, Charles Taylor, and Jorg Keller. Meta-Analysis: From data characterisation for meta-learning to meta-regression. In PKDD Workshop on Data Mining, Decision Support, Meta-Learning and Inductive Logic Programming, pages 15 – 26, 2000.
- classmethod ft_nr_outliers(N: ndarray, whis: float = 1.5) int [source]
Compute the number of attributes with at least one outlier value.
An attribute has an outlier if some value is outside the closed interval [first_quartile - whis * IQR, third_quartile + whis * IQR], where IQR is the interquartile range (third_quartile - first_quartile), and the whis value is typically 1.5.
- Parameters
- N : np.ndarray
Fitted numerical data.
- whis : float, optional
A factor to multiply IQR and set up the non-outlier interval (as stated above). Higher values widen the interval, increasing the tolerance toward outliers, while lower values narrow the non-outlier interval and therefore decrease the tolerance toward possible outliers.
- Returns
- int
Number of attributes with at least one outlier.
References
- 1
Christian Kopf and Ioannis Iglezakis. Combination of task description strategies and case base properties for meta-learning. In 2nd ECML/PKDD International Workshop on Integration and Collaboration Aspects of Data Mining, Decision Support and Meta-Learning(IDDM), pages 65 – 76, 2002.
- 2
Peter J. Rousseeuw and Mia Hubert. Robust statistics for outlier detection. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 1(1):73 – 79, 2011.
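The boxplot-style fence test above can be sketched per column in plain Python (illustrative; the exact quartile interpolation used here is an assumption, as implementations differ):

```python
import statistics

def has_outlier(column, whis=1.5):
    # True if any value falls outside the Tukey fences.
    q1, _, q3 = statistics.quantiles(column, n=4, method="inclusive")
    iqr = q3 - q1
    lo, hi = q1 - whis * iqr, q3 + whis * iqr
    return any(not (lo <= x <= hi) for x in column)

def nr_outliers(columns, whis=1.5):
    # Number of attributes (columns) with at least one outlier.
    return sum(has_outlier(col, whis) for col in columns)

cols = [[1, 2, 3, 4, 100], [1, 2, 3, 4, 5]]
nr_outliers(cols)  # only the first column has an outlier -> 1
```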
- classmethod ft_p_trace(N: ndarray, y: ndarray, can_cors: Optional[ndarray] = None) float [source]
Compute the Pillai’s trace.
Pillai's trace is the sum of the squared canonical correlations of N and the one-hot encoded version of y.
- Parameters
- N : np.ndarray
Numerical fitted data.
- y : np.ndarray
Target attribute.
- can_cors : np.ndarray, optional
Canonical correlations between N and the one-hot encoded version of y. Argument used to take advantage of precomputations.
- Returns
- float
Pillai’s trace value.
References
- 1
Pillai K.C.S (1955). Some New test criteria in multivariate analysis. Ann Math Stat: 26(1):117–21. Seber, G.A.F. (1984). Multivariate Observations. New York: John Wiley and Sons.
- classmethod ft_range(N: ndarray) ndarray [source]
Compute the range (max - min) of each attribute.
- Parameters
- N : np.ndarray
Fitted numerical data.
- Returns
np.ndarray
Attribute ranges.
References
- 1
Shawkat Ali and Kate A. Smith-Miles. A meta-learning approach to automatic kernel selection for support vector machines. Neurocomputing, 70(1):173 – 186, 2006.
- classmethod ft_roy_root(N: ndarray, y: ndarray, criterion: str = 'eigval', can_cors: Optional[ndarray] = None, can_cor_eigvals: Optional[ndarray] = None) float [source]
Compute the Roy’s largest root.
Roy's largest root RLR can be computed using two distinct approaches (see references for further explanation).
1. Based on Roy's (ii) original hypothesis: formulated using the largest eigenvalue associated with the canonical correlations between N and the one-hot encoded version of y. That is, Roy's largest root RLR_a can be defined as:
RLR_a = max_{i} can_cor_eigval_i
It is in the range [0, +inf).
2. Based on Roy's (iii) original hypothesis: formulated using the largest squared canonical correlation of N and the one-hot encoded version of y. Therefore, Roy's largest root RLR_b can be defined as:
RLR_b = max_{i} can_cor_i**2
It is in the range [0, 1].
Note that the two statistics have different meanings and, therefore, will assume distinct values.
Which formulation is used can be controlled via the criterion argument (see below for more information).
- Parameters
- N : np.ndarray
Numerical fitted data.
- y : np.ndarray
Target attribute.
- criterion : str, optional
If eigval, calculate Roy's largest root as the largest eigenvalue associated with each canonical correlation (the first formulation described above). If cancor, calculate Roy's largest root as the largest squared canonical correlation (the second formulation above).
- can_cors : np.ndarray, optional
Canonical correlations between N and the one-hot encoded version of y. Argument used to take advantage of precomputations. Used only if criterion is cancor or, otherwise, if can_cor_eigvals is None.
- can_cor_eigvals : np.ndarray, optional
Eigenvalues associated with the canonical correlations of N and one-hot encoded y. This argument is used to exploit precomputations. The relationship between the ith canonical correlation can_cor_i and its eigenvalue is:
can_cor_i = sqrt(can_cor_eigval_i / (1 + can_cor_eigval_i))
Or, equivalently:
can_cor_eigval_i = can_cor_i**2 / (1 - can_cor_i**2)
This argument is used only if criterion is eigval.
- Returns
- float
Roy's largest root, calculated based on the criterion argument.
References
- 1
Roy SN. On a Heuristic Method of Test Construction and its use in Multivariate Analysis. Ann Math Stat. 1953;24(2):220-238.
- 2
A note on Roy’s largest root. Kuhfeld, W.F. Psychometrika (1986) 51: 479. https://doi.org/10.1007/BF02294069
- classmethod ft_sd(N: ndarray, ddof: int = 1) ndarray [source]
Compute the standard deviation of each attribute.
- Parameters
- N : np.ndarray
Fitted numerical data.
- ddof : int, optional
Degrees of freedom for the standard deviation.
- Returns
np.ndarray
Attribute standard deviations.
References
- 1
Robert Engels and Christiane Theusinger. Using a data metric for preprocessing advice for data mining applications. In 13th European Conference on on Artificial Intelligence (ECAI), pages 430 – 434, 1998.
- classmethod ft_sd_ratio(N: ndarray, y: ndarray, ddof: int = 1, classes: Optional[ndarray] = None, class_freqs: Optional[ndarray] = None) float [source]
Compute a statistical test for homogeneity of covariances.
The test applied is the Box’s M Test for equivalence of covariances.
The null hypothesis of this test states that the covariance matrices of the instances of every class are equal.
- Parameters
- N : np.ndarray
Fitted numerical data.
- y : np.ndarray
Target attribute.
- ddof : int, optional
Degrees of freedom for the covariance matrices calculated during this test.
- classes : np.ndarray, optional
All distinct classes in target attribute y. Used to exploit precomputations.
- class_freqs : np.ndarray, optional
Absolute frequencies of each distinct class in target attribute y or classes. If classes is given, then this argument must be paired with it by index.
- Returns
- float
Homogeneity of covariances test result.
Notes
For details about how this test is applied, check out Rivolli et al. (pag. 32).
References
- 1
Donald Michie, David J. Spiegelhalter, Charles C. Taylor, and John Campbell. Machine Learning, Neural and Statistical Classification, volume 37. Ellis Horwood Upper Saddle River, 1994.
- classmethod ft_skewness(N: ndarray, method: int = 3, bias: bool = True) ndarray [source]
Compute the skewness for each attribute.
- Parameters
- N : np.ndarray
Fitted numerical data.
- method : int, optional
Defines the strategy used to estimate data skewness. This argument is used for compatibility with the R package e1071. The options must be one of the following:
Option  Formula
1       Skew_1 = m_3 / m_2**(3/2) (default of scipy.stats)
2       Skew_2 = Skew_1 * sqrt(n*(n-1)) / (n-2)
3       Skew_3 = m_3 / s**3 = Skew_1 * ((n-1)/n)**(3/2)
Where n is the number of instances in N, s is the standard deviation of each attribute in N, and m_i is the ith statistical moment of each attribute in N.
Note that if the selected method cannot be calculated due to division by zero, then the first method is used instead.
- bias : bool, optional
If False, then the calculations are corrected for statistical bias.
- Returns
np.ndarray
Attribute skewness.
References
- 1
Donald Michie, David J. Spiegelhalter, Charles C. Taylor, and John Campbell. Machine Learning, Neural and Statistical Classification, volume 37. Ellis Horwood Upper Saddle River, 1994.
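The three skewness formulas above can be sketched on a single column (illustrative only; pymfe operates on np.ndarray columns via scipy/numpy):

```python
def moment(xs, k):
    # kth central sample moment m_k.
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** k for x in xs) / n

def skewness(xs, method=3):
    # Skewness variants mirroring the e1071 conventions above.
    n = len(xs)
    m2, m3 = moment(xs, 2), moment(xs, 3)
    skew_1 = m3 / m2 ** 1.5                      # method 1 (scipy default)
    if method == 1:
        return skew_1
    if method == 2:
        return skew_1 * (n * (n - 1)) ** 0.5 / (n - 2)
    return skew_1 * ((n - 1) / n) ** 1.5         # method 3

skewness([1.0, 2.0, 3.0, 4.0, 5.0])   # symmetric data -> 0.0
skewness([1.0, 2.0, 3.0, 10.0], 1)    # right-skewed -> positive
```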
- classmethod ft_sparsity(X: ndarray, normalize: bool = True) ndarray [source]
Compute (possibly normalized) sparsity metric for each attribute.
Sparsity S of a vector v of numeric values is defined as:
S(v) = (1.0 / (n - 1.0)) * ((n / phi(v)) - 1.0)
where:
n is the number of instances in dataset X.
phi(v) is the number of distinct values in v.
- Parameters
- X : np.ndarray
Fitted numerical data.
- normalize : bool, optional
If True, then the output is S(v) as shown above. Otherwise, the output is not multiplied by the (1.0 / (n - 1.0)) factor (i.e., the output is defined as S'(v) = (n / phi(v)) - 1.0).
- Returns
np.ndarray
Attribute sparsities.
References
- 1
Mostafa A. Salama, Aboul Ella Hassanien, and Kenneth Revett. Employment of neural network and rough set in meta-learning. Memetic Computing, 5(3):165 – 177, 2013.
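The formula above can be sketched per column in plain Python (illustrative helper, not the pymfe implementation):

```python
def sparsity(column, normalize=True):
    # S(v) = (1 / (n - 1)) * (n / phi(v) - 1), where phi(v) is the
    # number of distinct values in the column.
    n = len(column)
    phi = len(set(column))
    s = n / phi - 1.0
    return s / (n - 1.0) if normalize else s

sparsity([1, 1, 1, 1])  # a single distinct value -> maximal sparsity 1.0
sparsity([1, 2, 3, 4])  # all values distinct -> minimal sparsity 0.0
```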
- classmethod ft_t_mean(N: ndarray, pcut: float = 0.2) ndarray [source]
Compute the trimmed mean of each attribute.
- Parameters
- N : np.ndarray
Fitted numerical data.
- pcut : float, optional
Percentage cut from both the lower and upper tails. This value should be in the interval [0.0, 0.5); if 0.0, the return value is the default mean calculation.
- Returns
np.ndarray
Attribute trimmed means.
References
- 1
Robert Engels and Christiane Theusinger. Using a data metric for preprocessing advice for data mining applications. In 13th European Conference on on Artificial Intelligence (ECAI), pages 430 – 434, 1998.
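A single-column sketch of the trimmed mean (illustrative; the tail-count rounding below is an assumption, as implementations may round the cut differently):

```python
def t_mean(column, pcut=0.2):
    # Drop int(n * pcut) values from each tail, then average the rest.
    xs = sorted(column)
    cut = int(len(xs) * pcut)
    trimmed = xs[cut:len(xs) - cut] if cut else xs
    return sum(trimmed) / len(trimmed)

t_mean([1, 2, 3, 4, 100], pcut=0.2)  # drops 1 and 100 -> mean of [2, 3, 4] = 3.0
t_mean([1, 2, 3, 4, 100], pcut=0.0)  # plain mean = 22.0
```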
- classmethod ft_var(N: ndarray, ddof: int = 1) ndarray [source]
Compute the variance of each attribute.
- Parameters
- N : np.ndarray
Fitted numerical data.
- ddof : int, optional
Degrees of freedom for the variance.
- Returns
np.ndarray
Attribute variances.
References
- 1
Ciro Castiello, Giovanna Castellano, and Anna Maria Fanelli. Meta-data: Characterization of input features for meta-learning. In 2nd International Conference on Modeling Decisions for Artificial Intelligence (MDAI), pages 457–468, 2005.
- classmethod ft_w_lambda(N: ndarray, y: ndarray, can_cor_eigvals: Optional[ndarray] = None, can_cors: Optional[ndarray] = None) float [source]
Compute the Wilks’ Lambda value.
The Wilks' Lambda L is calculated as:
L = prod(1.0 / (1.0 + can_cor_eigval_i))
Where can_cor_eigval_i is the ith eigenvalue related to the ith canonical correlation can_cor_i between the attributes in N and the binarized (one-hot encoded) version of y.
The relationship between can_cor_eigval_i and can_cor_i is given by:
can_cor_i = sqrt(can_cor_eigval_i / (1 + can_cor_eigval_i))
Or, equivalently:
can_cor_eigval_i = can_cor_i**2 / (1 - can_cor_i**2)
- Parameters
- N : np.ndarray
Fitted numerical data.
- y : np.ndarray
Target attribute.
- can_cor_eigvals : np.ndarray, optional
Eigenvalues associated with the canonical correlations of N and one-hot encoded y. This argument is used to exploit precomputations. The relationship between the ith canonical correlation can_cor_i and its eigenvalue is:
can_cor_i = sqrt(can_cor_eigval_i / (1 + can_cor_eigval_i))
Or, equivalently:
can_cor_eigval_i = can_cor_i**2 / (1 - can_cor_i**2)
- can_cors : np.ndarray, optional
Canonical correlations between N and the one-hot encoded version of y. Argument used to take advantage of precomputations. Used only if can_cor_eigvals is None.
- Returns
- float
Wilks' lambda value.
References
- 1
Guido Lindner and Rudi Studer. AST: Support for algorithm selection with a CBR approach. In European Conference on Principles of Data Mining and Knowledge Discovery (PKDD), pages 418 – 423, 1999.
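The product formula and the eigenvalue relationship can be checked numerically with hypothetical canonical correlations: since 1 / (1 + can_cor_eigval_i) = 1 - can_cor_i**2, both forms agree:

```python
import math

# Hypothetical canonical correlations (values in [0, 1)).
can_cors = [0.9, 0.5, 0.1]
eigvals = [r ** 2 / (1.0 - r ** 2) for r in can_cors]

# Wilks' Lambda from the eigenvalues, as in the formula above.
wilks = math.prod(1.0 / (1.0 + ev) for ev in eigvals)

# Equivalent direct form: product of (1 - can_cor_i**2).
wilks_direct = math.prod(1.0 - r ** 2 for r in can_cors)
```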
- classmethod precompute_can_cors(N: Optional[ndarray] = None, y: Optional[ndarray] = None, **kwargs) Dict[str, Any] [source]
Precompute canonical correlations and their eigenvalues.
- Parameters
- N : np.ndarray, optional
Numerical fitted data.
- y : np.ndarray, optional
Target attribute.
- kwargs :
Additional arguments. May contain values previously precomputed by other precomputation methods, which can help speed up this precomputation.
- Returns
dict
- With the following precomputed items:
can_cors (np.ndarray): canonical correlations between N and the one-hot encoded version of y.
can_cor_eigvals (np.ndarray): eigenvalues related to the canonical correlations.
- classmethod precompute_statistical_class(y: Optional[ndarray] = None, **kwargs) Dict[str, Any] [source]
Precompute distinct classes and their absolute frequencies from y.
- Parameters
- y : np.ndarray, optional
The target attribute from fitted data.
- kwargs :
Additional arguments. May contain values previously precomputed by other precomputation methods, which can help speed up this precomputation.
- Returns
dict
- With the following precomputed items:
classes (np.ndarray): distinct classes of y, if y is not NoneType.
class_freqs (np.ndarray): absolute class frequencies of y, if y is not NoneType.
- classmethod precompute_statistical_cor_cov(N: Optional[ndarray] = None, ddof: int = 1, **kwargs) Dict[str, Any] [source]
Precompute the correlation and covariance matrices of numerical data.
Be cautious in allowing this precomputation method on huge datasets, as it may be very memory hungry.
- Parameters
- N : np.ndarray, optional
Numerical fitted data.
- ddof : int, optional
Degrees of freedom of the covariance matrix.
- kwargs :
Additional arguments. May contain values previously precomputed by other precomputation methods, which can help speed up this precomputation.
- Returns
dict
- With the following precomputed items:
cov_mat (np.ndarray): covariance matrix of N, if N is not NoneType.
abs_corr_mat (np.ndarray): absolute correlation matrix of N, if N is not NoneType.