pymfe.statistical.MFEStatistical
- class pymfe.statistical.MFEStatistical
Keep methods for metafeatures of the Statistical group.
The convention adopted for metafeature-extraction methods is to always start with the ft_ prefix, which allows automatic method detection. This prefix is predefined within the _internal module.
All method signatures follow the conventions and restrictions listed below:
For independent attribute data, X means every type of attribute, N means numeric attributes only, and C stands for categorical attributes only. Note that the categorical attribute sets of X and C, and the numerical attribute sets of X and N, may differ because of data transformations performed while fitting data into the MFE model. These transformations are enabled by, respectively, the transform_num and transform_cat arguments of fit (MFE method).
Only arguments in the MFE _custom_args_ft attribute (set up inside the fit method) are allowed to be required method arguments. All other arguments must be strictly optional (i.e., have a predefined default value).
It is assumed that the user can change any optional argument, without any previous verification of either type or value, via the kwargs argument of the extract method of the MFE class.
The return value of every feature-extraction method should be a single value or a generic List (preferably an np.ndarray) with numeric values.
A second type of method is also detected automatically: methods with the precompute_ prefix. These methods run automatically while fitting data into an MFE model, and their objective is to precompute common values shared by more than one feature-extraction method. This strategy trades higher memory consumption for faster feature extraction. Their return value must always be a dictionary whose keys are possible extra arguments for both feature-extraction methods and other precomputation methods. Note that precomputed values are shared between all valid feature-extraction modules (e.g., class_freqs computed in module statistical can freely be used by any precomputation or feature-extraction method of module landmarking).
- __init__(*args, **kwargs)
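The prefix conventions described above can be illustrated with a small plain-Python sketch. It is not pymfe's implementation: the class ToyStatistical and the helpers methods_with_prefix and extract_all are hypothetical names introduced only for illustration. The sketch finds methods by prefix, merges the dictionaries returned by precompute_ methods, and forwards to each ft_ method only the shared values it accepts.

```python
import inspect


class ToyStatistical:
    """Hypothetical stand-in for a metafeature group class."""

    @classmethod
    def precompute_sums(cls, N=None, **kwargs):
        # Precomputation methods return a dict of possible extra arguments.
        if N is None:
            return {}
        return {"col_sums": [sum(col) for col in zip(*N)]}

    @classmethod
    def ft_mean(cls, N, col_sums=None):
        # Optional argument with a default: reuse the shared value if given.
        if col_sums is None:
            col_sums = [sum(col) for col in zip(*N)]
        return [s / len(N) for s in col_sums]

    @classmethod
    def ft_max(cls, N):
        return [max(col) for col in zip(*N)]


def methods_with_prefix(klass, prefix):
    """Return {name: bound method} for every method whose name has the prefix."""
    return {
        name: method
        for name, method in inspect.getmembers(klass, predicate=inspect.ismethod)
        if name.startswith(prefix)
    }


def extract_all(klass, N):
    # 1) Run every precompute_ method and merge the returned dictionaries.
    shared = {}
    for method in methods_with_prefix(klass, "precompute_").values():
        shared.update(method(N=N))
    # 2) Call every ft_ method, passing only the shared values it accepts.
    results = {}
    for name, method in methods_with_prefix(klass, "ft_").items():
        accepted = inspect.signature(method).parameters
        extra = {k: v for k, v in shared.items() if k in accepted}
        results[name[len("ft_"):]] = method(N, **extra)
    return results
```

Calling extract_all(ToyStatistical, data) on a list of rows returns one entry per ft_ method, keyed by the method name without its prefix.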
Methods
__init__(*args, **kwargs)
ft_can_cor(N, y[, can_cors]) Compute canonical correlations of data.
ft_cor(N[, abs_corr_mat]) Compute the absolute value of the correlation of distinct dataset column pairs.
ft_cov(N[, ddof, cov_mat]) Compute the absolute value of the covariance of distinct dataset attribute pairs.
ft_eigenvalues(N[, ddof, cov_mat]) Compute the eigenvalues of the covariance matrix of the dataset.
ft_g_mean(N[, allow_zeros, epsilon]) Compute the geometric mean of each attribute.
ft_gravity(N, y[, norm_ord, classes, ...]) Compute the distance between minority and majority classes center of mass.
ft_h_mean(N) Compute the harmonic mean of each attribute.
ft_iq_range(N) Compute the interquartile range (IQR) of each attribute.
ft_kurtosis(N[, method, bias]) Compute the kurtosis of each attribute.
ft_lh_trace(N, y[, can_cor_eigvals, can_cors]) Compute the Lawley-Hotelling trace.
ft_mad(N[, factor]) Compute the Median Absolute Deviation (MAD) adjusted by a factor.
ft_max(N) Compute the maximum value from each attribute.
ft_mean(N) Compute the mean value of each attribute.
ft_median(N) Compute the median value from each attribute.
ft_min(N) Compute the minimum value from each attribute.
ft_nr_cor_attr(N[, threshold, normalize, ...]) Compute the number of distinct highly correlated pairs of attributes.
ft_nr_disc(N, y[, can_cors]) Compute the number of canonical correlations between each attribute and class.
ft_nr_norm(N[, method, threshold, failure, ...]) Compute the number of normally distributed attributes, based on a given method.
ft_nr_outliers(N[, whis]) Compute the number of attributes with at least one outlier value.
ft_p_trace(N, y[, can_cors]) Compute the Pillai's trace.
ft_range(N) Compute the range (max - min) of each attribute.
ft_roy_root(N, y[, criterion, can_cors, ...]) Compute the Roy's largest root.
ft_sd(N[, ddof]) Compute the standard deviation of each attribute.
ft_sd_ratio(N, y[, ddof, classes, class_freqs]) Compute a statistical test for homogeneity of covariances.
ft_skewness(N[, method, bias]) Compute the skewness for each attribute.
ft_sparsity(X[, normalize]) Compute (possibly normalized) sparsity metric for each attribute.
ft_t_mean(N[, pcut]) Compute the trimmed mean of each attribute.
ft_var(N[, ddof]) Compute the variance of each attribute.
ft_w_lambda(N, y[, can_cor_eigvals, can_cors]) Compute the Wilks' Lambda value.
precompute_can_cors([N, y]) Precompute canonical correlations and their eigenvalues.
precompute_statistical_class([y]) Precompute distinct classes and their absolute frequencies from y.
precompute_statistical_cor_cov([N, ddof]) Precompute the correlation and covariance matrices of numerical data.
- classmethod ft_can_cor(N: ndarray, y: ndarray, can_cors: Optional[ndarray] = None) -> ndarray
Compute canonical correlations of data.
The canonical correlations are calculated between the attributes in N and the binarized (one-hot encoded) version of y.
- Parameters
- N
np.ndarray Fitted numerical data.
- y
np.ndarray Target attribute.
- can_cors
np.ndarray, optional Canonical correlations between N and the one-hot encoded version of y. Argument used to take advantage of precomputations.
- Returns
np.ndarray Canonical correlations of the data.
References
- 1
Alexandros Kalousis. Algorithm Selection via Meta-Learning. PhD thesis, Faculty of Science of the University of Geneva, 2002.
- classmethod ft_cor(N: ndarray, abs_corr_mat: Optional[ndarray] = None) -> ndarray
Compute the absolute value of the correlation of distinct dataset column pairs.
- Parameters
- N
np.ndarray Fitted numerical data.
- abs_corr_mat
np.ndarray, optional Absolute correlation matrix of N. Argument used to exploit precomputations.
- Returns
np.ndarray Absolute value of correlation between distinct attributes.
References
- 1
Ciro Castiello, Giovanna Castellano, and Anna Maria Fanelli. Meta-data: Characterization of input features for meta-learning. In 2nd International Conference on Modeling Decisions for Artificial Intelligence (MDAI), pages 457–468, 2005.
- 2
Matthias Reif, Faisal Shafait, Markus Goldstein, Thomas Breuel, and Andreas Dengel. Automatic classifier selection for non-experts. Pattern Analysis and Applications, 17(1):83–96, 2014.
- 3
Donald Michie, David J. Spiegelhalter, Charles C. Taylor, and John Campbell. Machine Learning, Neural and Statistical Classification, volume 37. Ellis Horwood Upper Saddle River, 1994.
- classmethod ft_cov(N: ndarray, ddof: int = 1, cov_mat: Optional[ndarray] = None) -> ndarray
Compute the absolute value of the covariance of distinct dataset attribute pairs.
- Parameters
- N
np.ndarray Fitted numerical data.
- ddof
int, optional Degrees of freedom for covariance matrix.
- cov_mat
np.ndarray, optional Covariance matrix of N. Argument meant to exploit precomputations. Note that this argument is not the same as the return value of this method, which contains only the lower-triangle values of cov_mat.
- Returns
np.ndarray Absolute value of covariances between distinct attributes.
References
- 1
Ciro Castiello, Giovanna Castellano, and Anna Maria Fanelli. Meta-data: Characterization of input features for meta-learning. In 2nd International Conference on Modeling Decisions for Artificial Intelligence (MDAI), pages 457–468, 2005.
- 2
Donald Michie, David J. Spiegelhalter, Charles C. Taylor, and John Campbell. Machine Learning, Neural and Statistical Classification, volume 37. Ellis Horwood Upper Saddle River, 1994.
- classmethod ft_eigenvalues(N: ndarray, ddof: int = 1, cov_mat: Optional[ndarray] = None) -> ndarray
Compute the eigenvalues of the covariance matrix of the dataset.
- Parameters
- N
np.ndarray Fitted numerical data.
- ddof
int, optional Degrees of freedom for covariance matrix.
- cov_mat
np.ndarray, optional Covariance matrix of N. Argument meant to exploit precomputations.
- Returns
np.ndarray Eigenvalues of the covariance matrix of N.
References
- 1
Shawkat Ali and Kate A. Smith. On learning algorithm selection for classification. Applied Soft Computing, 6(2):119 – 138, 2006.
- classmethod ft_g_mean(N: ndarray, allow_zeros: bool = True, epsilon: float = 1e-10) -> ndarray
Compute the geometric mean of each attribute.
- Parameters
- N
np.ndarray Fitted numerical data.
- allow_zeros
bool, optional If True, the geometric mean of every attribute with zero values is set to zero. Otherwise, it is set to np.nan.
- epsilon
float, optional A small threshold: every value whose absolute value is less than it is considered zero-valued. Used only if allow_zeros is False.
- Returns
np.ndarray Attribute geometric means.
References
- 1
Shawkat Ali and Kate A. Smith-Miles. A meta-learning approach to automatic kernel selection for support vector machines. Neurocomputing, 70(1):173 – 186, 2006.
- classmethod ft_gravity(N: ndarray, y: ndarray, norm_ord: Union[int, float] = 2, classes: Optional[ndarray] = None, class_freqs: Optional[ndarray] = None, cls_inds: Optional[ndarray] = None) -> float
Compute the distance between minority and majority classes center of mass.
The center of mass of a class is the average value of each attribute over the instances of that class.
The majority and minority classes cannot be the same, even if every class has the same number of instances.
- Parameters
- N
np.ndarray Fitted numerical data.
- y
np.ndarray Target attribute.
- norm_ord
numeric, optional Minkowski distance parameter. Popular values for this argument are:
norm_ord    Distance name
-> -inf     Min value
1.0         Manhattan/City Block
2.0         Euclidean
-> +inf     Max value (infinite norm)
- classes
np.ndarray, optional Distinct classes of y.
- class_freqs
np.ndarray, optional Absolute frequencies of each distinct class in target attribute y or classes. If classes is given, then this argument must be paired with it by index.
- cls_inds
np.ndarray, optional Boolean array which indicates the examples of each class. The rows represent each distinct class, and the columns represent the instances. Used to take advantage of precomputations.
- Returns
- float
Gravity of the numeric dataset.
- Raises
ValueError If norm_ord is not numeric.
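As a rough illustration of the definition above, the following plain-Python sketch computes the Minkowski distance between the centers of mass of the most and least frequent classes. The function name gravity and the tie-breaking rule (first and last entries of Counter.most_common) are assumptions made for this example and may differ from pymfe's behavior; only finite, positive norm_ord values are handled.

```python
from collections import Counter


def gravity(N, y, norm_ord=2):
    """Minkowski distance between majority and minority class centers."""
    counts = Counter(y).most_common()  # sorted from most to least frequent
    maj_cls = counts[0][0]
    min_cls = counts[-1][0]  # distinct from maj_cls when there are >= 2 classes

    def center(cls):
        # Average of each attribute over the instances of one class.
        rows = [row for row, label in zip(N, y) if label == cls]
        return [sum(col) / len(rows) for col in zip(*rows)]

    diffs = [abs(a - b) for a, b in zip(center(maj_cls), center(min_cls))]
    return sum(d ** norm_ord for d in diffs) ** (1.0 / norm_ord)
```

With norm_ord=1 the result is the Manhattan distance between the two centers, and with norm_ord=2 the Euclidean distance.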
References
- 1
Shawkat Ali and Kate A. Smith. On learning algorithm selection for classification. Applied Soft Computing, 6(2):119 – 138, 2006.
- classmethod ft_h_mean(N: ndarray) -> ndarray
Compute the harmonic mean of each attribute.
- Parameters
- N
np.ndarray Fitted numerical data.
- Returns
np.ndarray Attribute harmonic means.
References
- 1
Shawkat Ali and Kate A. Smith-Miles. A meta-learning approach to automatic kernel selection for support vector machines. Neurocomputing, 70(1):173 – 186, 2006.
- classmethod ft_iq_range(N: ndarray) -> ndarray
Compute the interquartile range (IQR) of each attribute.
- Parameters
- N
np.ndarray Fitted numerical data.
- Returns
np.ndarray Attribute interquartile ranges.
References
- 1
Shawkat Ali and Kate A. Smith-Miles. A meta-learning approach to automatic kernel selection for support vector machines. Neurocomputing, 70(1):173 – 186, 2006.
- classmethod ft_kurtosis(N: ndarray, method: int = 3, bias: bool = True) -> ndarray
Compute the kurtosis of each attribute.
- Parameters
- N
np.ndarray Fitted numerical data.
- method
int, optional Defines the strategy used to estimate data kurtosis. Used for total compatibility with R package e1071. The options must be one of the following:
Option    Formula
1         Kurt_1 = (m_4 / m_2**2 - 3) (default of scipy.stats package)
2         Kurt_2 = (((n+1) * Kurt_1 + 6) * (n-1) / f_2), f_2 = ((n-2)*(n-3))
3         Kurt_3 = (m_4 / s**4 - 3) = ((Kurt_1+3) * (1 - 1/n)**2 - 3)
Where n is the number of instances in N, s is the standard deviation of each attribute in N, and m_i is the ith statistical moment of each attribute in N.
Note that if the selected method cannot be calculated due to division by zero, then the first method is used instead.
- bias
bool, optional If False, then the calculations are corrected for statistical bias.
- Returns
np.ndarray Attribute kurtosis.
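The three formulas in the table can be sketched for a single attribute in plain Python. This is only an illustration of the arithmetic, assuming biased central moments m_2 and m_4; the function name kurtosis is hypothetical, the bias argument is not modeled, and the fallback to method 1 on division by zero is left out.

```python
def kurtosis(values, method=3):
    """Kurtosis of one attribute using the formulas Kurt_1, Kurt_2 or Kurt_3."""
    n = len(values)
    mean = sum(values) / n
    # Biased central moments of order 2 and 4.
    m2 = sum((v - mean) ** 2 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    kurt_1 = m4 / m2 ** 2 - 3.0
    if method == 1:
        return kurt_1
    if method == 2:
        return ((n + 1) * kurt_1 + 6) * (n - 1) / ((n - 2) * (n - 3))
    if method == 3:
        return (kurt_1 + 3.0) * (1.0 - 1.0 / n) ** 2 - 3.0
    raise ValueError("method must be 1, 2 or 3")
```

Method 3 agrees with m_4 / s**4 - 3 because s**2 = m_2 * n / (n - 1) for the sample standard deviation s.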
References
- 1
Donald Michie, David J. Spiegelhalter, Charles C. Taylor, and John Campbell. Machine Learning, Neural and Statistical Classification, volume 37. Ellis Horwood Upper Saddle River, 1994.
- classmethod ft_lh_trace(N: ndarray, y: ndarray, can_cor_eigvals: Optional[ndarray] = None, can_cors: Optional[ndarray] = None) -> float
Compute the Lawley-Hotelling trace.
The Lawley-Hotelling trace LH is given by:
LH = sum_{i} can_cor_i**2 / (1 - can_cor_i**2)
Where can_cor_i is the ith canonical correlation of N and the one-hot encoded version of y.
Equivalently, LH can be calculated from the eigenvalues related to each canonical correlation due to the relationship:
can_cor_eigval_i = can_cor_i**2 / (1 - can_cor_i**2)
Therefore, LH is given simply by:
LH = sum_{i} can_cor_eigval_i
- Parameters
- N
np.ndarray Numerical fitted data.
- y
np.ndarray Target attribute.
- can_cor_eigvals
np.ndarray, optional Eigenvalues associated with the canonical correlations of N and one-hot encoded y. This argument is used to exploit precomputations. The relationship between the ith canonical correlation can_cor_i and its eigenvalue is:
can_cor_i = sqrt(can_cor_eigval_i / (1 + can_cor_eigval_i))
Or, equivalently:
can_cor_eigval_i = can_cor_i**2 / (1 - can_cor_i**2)
- can_cors
np.ndarray, optional Canonical correlations between N and the one-hot encoded version of y. Argument used to take advantage of precomputations. Used only if can_cor_eigvals is None.
- Returns
- float
Lawley-Hotelling trace value.
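The equivalence of the two routes to LH can be checked with a short plain-Python sketch. The function names lh_trace and eigvals_to_can_cors are hypothetical, and the canonical correlations or eigenvalues are assumed to be already available (for example, from a precomputation step).

```python
import math


def eigvals_to_can_cors(can_cor_eigvals):
    # can_cor_i = sqrt(eigval_i / (1 + eigval_i))
    return [math.sqrt(e / (1.0 + e)) for e in can_cor_eigvals]


def lh_trace(can_cor_eigvals=None, can_cors=None):
    """Lawley-Hotelling trace from eigenvalues, or from canonical correlations."""
    if can_cor_eigvals is None:
        if can_cors is None:
            raise ValueError("give can_cor_eigvals or can_cors")
        # eigval_i = can_cor_i**2 / (1 - can_cor_i**2)
        can_cor_eigvals = [r ** 2 / (1.0 - r ** 2) for r in can_cors]
    return sum(can_cor_eigvals)
```

As in the parameter description, can_cors is consulted only when can_cor_eigvals is None.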
References
- 1
Lawley D. A Generalization of Fisher’s z Test. Biometrika. 1938;30(1):180-187.
- 2
Hotelling H. A generalized T test and measure of multivariate dispersion. In: Neyman J, ed. Proceedings of the Second Berkeley Symposium on Mathematical Statistics and Probability. Berkeley: University of California Press; 1951:23-41.
- classmethod ft_mad(N: ndarray, factor: float = 1.4826) -> ndarray
Compute the Median Absolute Deviation (MAD) adjusted by a factor.
- Parameters
- N
np.ndarray Fitted numerical data.
- factor
float, optional Multiplication factor for output correction. The default factor is 1.4826, approximately the reciprocal of the MAD of normally distributed data with standard deviation 1.0 (and any mean). With this factor, the result is comparable to the standard deviation of such data.
- Returns
np.ndarray Attribute MAD (Median Absolute Deviation).
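A minimal plain-Python sketch of the scaled MAD for one attribute is shown below. The function name mad is chosen for illustration only.

```python
from statistics import median


def mad(values, factor=1.4826):
    """Median of absolute deviations from the median, scaled by factor."""
    med = median(values)
    return factor * median(abs(v - med) for v in values)
```

For the values 1, 2, 3, 4, 100 the median is 3 and the absolute deviations are 2, 1, 0, 1, 97, so the unscaled MAD is 1; the extreme value 100 has no effect on the result.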
References
- 1
Shawkat Ali and Kate A. Smith. On learning algorithm selection for classification. Applied Soft Computing, 6(2):119 – 138, 2006.
- classmethod ft_max(N: ndarray) -> ndarray
Compute the maximum value from each attribute.
- Parameters
- N
np.ndarray Fitted numerical data.
- Returns
np.ndarray Attribute maximum values.
References
- 1
Robert Engels and Christiane Theusinger. Using a data metric for preprocessing advice for data mining applications. In 13th European Conference on Artificial Intelligence (ECAI), pages 430 – 434, 1998.
- classmethod ft_mean(N: ndarray) -> ndarray
Compute the mean value of each attribute.
- Parameters
- N
np.ndarray Fitted numerical data.
- Returns
np.ndarray Attribute mean values.
References
- 1
Robert Engels and Christiane Theusinger. Using a data metric for preprocessing advice for data mining applications. In 13th European Conference on Artificial Intelligence (ECAI), pages 430 – 434, 1998.
- classmethod ft_median(N: ndarray) -> ndarray
Compute the median value from each attribute.
- Parameters
- N
np.ndarray Fitted numerical data.
- Returns
np.ndarray Attribute median values.
References
- 1
Robert Engels and Christiane Theusinger. Using a data metric for preprocessing advice for data mining applications. In 13th European Conference on Artificial Intelligence (ECAI), pages 430 – 434, 1998.
- classmethod ft_min(N: ndarray) -> ndarray
Compute the minimum value from each attribute.
- Parameters
- N
np.ndarray Fitted numerical data.
- Returns
np.ndarray Attribute minimum values.
References
- 1
Robert Engels and Christiane Theusinger. Using a data metric for preprocessing advice for data mining applications. In 13th European Conference on Artificial Intelligence (ECAI), pages 430 – 434, 1998.
- classmethod ft_nr_cor_attr(N: ndarray, threshold: float = 0.5, normalize: bool = True, abs_corr_mat: Optional[ndarray] = None) -> Union[int, float]
Compute the number of distinct highly correlated pairs of attributes.
A pair of attributes is considered highly correlated if the absolute value of its correlation is equal to or larger than a given threshold.
- Parameters
- N
np.ndarray Fitted numerical data.
- threshold
float, optional Threshold value: a correlation is assumed to be strong if its absolute value is equal to or greater than it.
- normalize
bool, optional If True, the result is normalized by a factor of 2/(d*(d-1)), where d is the number of attributes (columns) in N.
- abs_corr_mat
np.ndarray, optional Absolute correlation matrix of N. Argument used to exploit precomputations.
- Returns
- int | float
If normalize is False, this method returns the number of highly correlated pairs of distinct attributes. Otherwise, it returns the proportion of highly correlated pairs of attributes.
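The counting and normalization can be sketched in plain Python, assuming the absolute correlation matrix is already available as a nested list. The function name nr_cor_attr is used for illustration only; the normalization factor 2/(d*(d-1)) is the reciprocal of the number of distinct attribute pairs.

```python
def nr_cor_attr(abs_corr_mat, threshold=0.5, normalize=True):
    """Count (or give the proportion of) attribute pairs with |corr| >= threshold."""
    d = len(abs_corr_mat)
    # Visit each unordered pair (i, j) with i < j exactly once.
    count = sum(
        1
        for i in range(d)
        for j in range(i + 1, d)
        if abs_corr_mat[i][j] >= threshold
    )
    if normalize:
        return count * 2.0 / (d * (d - 1))
    return count
```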
References
- 1
Mostafa A. Salama, Aboul Ella Hassanien, and Kenneth Revett. Employment of neural network and rough set in meta-learning. Memetic Computing, 5(3):165 – 177, 2013.
- classmethod ft_nr_disc(N: ndarray, y: ndarray, can_cors: Optional[ndarray] = None) -> Union[int, float]
Compute the number of canonical correlations between each attribute and class.
The return value of this method is effectively the size of the return value of the ft_can_cor method. Check its documentation for more in-depth details.
- Parameters
- N
np.ndarray Fitted numerical data.
- y
np.ndarray Target attribute.
- can_cors
np.ndarray, optional Canonical correlations between N and the one-hot encoded version of y. Argument used to take advantage of precomputations.
- Returns
- int or float
Number of canonical correlations between each attribute and class, if ft_can_cor is executed successfully. Returns np.nan otherwise.
References
- 1
Guido Lindner and Rudi Studer. AST: Support for algorithm selection with a CBR approach. In European Conference on Principles of Data Mining and Knowledge Discovery (PKDD), pages 418 – 423, 1999.
- classmethod ft_nr_norm(N: ndarray, method: str = 'shapiro-wilk', threshold: float = 0.05, failure: str = 'soft', max_samples: int = 5000) -> Union[float, int]
Compute the number of normally distributed attributes, based on a given method.
- Parameters
- N
np.ndarray Fitted numerical data.
- method
str, optional Select the normality test to be executed. This argument must assume one of the options shown below:
shapiro-wilk: from scipy.stats.shapiro documentation: the Shapiro-Wilk test tests the null hypothesis that the data was drawn from a normal distribution.
dagostino-pearson: from scipy.stats.normaltest documentation: It is based on D’Agostino and Pearson’s test that combines skew and kurtosis to produce an omnibus test of normality.
anderson-darling: from scipy.stats.anderson documentation: The Anderson-Darling test tests the null hypothesis that a sample is drawn from a population that follows a particular distribution. In the context of this method, that particular distribution is fixed as the normal (Gaussian) distribution.
all: perform all tests cited above. To consider an attribute normally distributed, all test results are taken into account with equal weight. Check the failure argument for more information.
- threshold
float, optional Level of significance used to reject the null hypothesis of normality tests.
- failure
str, optional Used only if the method argument value is all. This argument must assume one of the values soft or hard. Every test has the null hypothesis that the data follows a Gaussian distribution. If soft, then a single test rejecting its null hypothesis for some attribute is enough to consider that attribute not normally distributed. If hard, then the rejection of the null hypothesis of every single normality test is necessary to consider the attribute not normally distributed.
- max_samples
int, optional Maximum number of samples used while performing the normality tests. The Shapiro-Wilk test p-value may not be accurate when the sample size is higher than 5000. Note that the instances are NOT shuffled before this cutoff. This means that only the very first max_samples instances of the dataset N are considered in the statistical tests.
- Returns
- int
The number of normally distributed attributes based on the method. If max_samples is non-positive, np.nan is returned instead.
- Raises
- ValueError
If method or failure is not a valid option.
References
- 1
Christian Kopf, Charles Taylor, and Jorg Keller. Meta-Analysis: From data characterisation for meta-learning to meta-regression. In PKDD Workshop on Data Mining, Decision Support, Meta-Learning and Inductive Logic Programming, pages 15 – 26, 2000.
- classmethod ft_nr_outliers(N: ndarray, whis: float = 1.5) -> int
Compute the number of attributes with at least one outlier value.
An attribute has an outlier if some value is outside the closed interval [first_quartile - WHIS * IQR, third_quartile + WHIS * IQR], where IQR is the interquartile range (third_quartile - first_quartile), and the WHIS value is typically 1.5.
- Parameters
- N
np.ndarray Fitted numerical data.
- whis
float, optional A factor that multiplies the IQR to set up the non-outlier interval (as stated above). Higher values widen the interval, increasing the tolerance against outliers, whereas lower values narrow the non-outlier interval and, therefore, decrease the tolerance against possible outliers.
- Returns
- int
Number of attributes with at least one outlier.
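The interval rule above can be sketched in plain Python as follows. The function names are illustrative, and the quartile convention (statistics.quantiles with method="inclusive", which interpolates linearly between data points) is an assumption; a different quartile definition can shift the interval slightly.

```python
from statistics import quantiles


def has_outlier(values, whis=1.5):
    """True if some value lies outside [q1 - whis * IQR, q3 + whis * IQR]."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - whis * iqr, q3 + whis * iqr
    return any(v < low or v > high for v in values)


def nr_outliers(N, whis=1.5):
    """Number of columns (attributes) of N with at least one outlier."""
    return sum(has_outlier(col, whis) for col in zip(*N))
```

Raising whis widens the interval, so fewer attributes are reported as having outliers.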
References
- 1
Christian Kopf and Ioannis Iglezakis. Combination of task description strategies and case base properties for meta-learning. In 2nd ECML/PKDD International Workshop on Integration and Collaboration Aspects of Data Mining, Decision Support and Meta-Learning (IDDM), pages 65 – 76, 2002.
- 2
Peter J. Rousseeuw and Mia Hubert. Robust statistics for outlier detection. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 1(1):73 – 79, 2011.
- classmethod ft_p_trace(N: ndarray, y: ndarray, can_cors: Optional[ndarray] = None) -> float
Compute the Pillai’s trace.
The Pillai’s trace is the sum of the squared canonical correlations of N and the one-hot encoded version of y.
- Parameters
- N
np.ndarray Numerical fitted data.
- y
np.ndarray Target attribute.
- can_cors
np.ndarray, optional Canonical correlations between N and the one-hot encoded version of y. Argument used to take advantage of precomputations.
- Returns
- float
Pillai’s trace value.
References
- 1
Pillai K.C.S (1955). Some New test criteria in multivariate analysis. Ann Math Stat: 26(1):117–21.
- 2
Seber, G.A.F. (1984). Multivariate Observations. New York: John Wiley and Sons.
- classmethod ft_range(N: ndarray) -> ndarray
Compute the range (max - min) of each attribute.
- Parameters
- N
np.ndarray Fitted numerical data.
- Returns
np.ndarray Attribute ranges.
References
- 1
Shawkat Ali and Kate A. Smith-Miles. A meta-learning approach to automatic kernel selection for support vector machines. Neurocomputing, 70(1):173 – 186, 2006.
- classmethod ft_roy_root(N: ndarray, y: ndarray, criterion: str = 'eigval', can_cors: Optional[ndarray] = None, can_cor_eigvals: Optional[ndarray] = None) -> float
Compute the Roy’s largest root.
The Roy’s largest root RLR can be computed using two distinct approaches (see references for further explanation).
1. Based on Roy’s (ii) original hypothesis: formulated using the largest eigenvalue associated with the canonical correlations between N and the one-hot encoded version of y. That is, the Roy’s Largest Root RLR_a can be defined as:
RLR_a = max_{i} can_cor_eig_val_i
It is in range [0, +inf).
2. Based on Roy’s (iii) original hypothesis: formulated using the largest squared canonical correlation of N and the one-hot encoded version of y. Therefore, the Roy’s Largest Root RLR_b can be defined as:
RLR_b = max_{i} can_cor_i**2
It is in range [0, 1].
Note that both statistics have different meanings and, therefore, will assume distinct values.
Which formulation is used can be controlled using the criterion argument (see below for more information).
- Parameters
- N
np.ndarray Numerical fitted data.
- y
np.ndarray Target attribute.
- criterion
str, optional If eigval, calculate the Roy’s largest root as the largest eigenvalue associated with each canonical correlation. This is the first formulation described above. If cancor, calculate the Roy’s largest root as the largest squared canonical correlation. This is the second formulation above.
- can_cors
np.ndarray, optional Canonical correlations between N and the one-hot encoded version of y. Argument used to take advantage of precomputations. Used only if criterion is cancor or, otherwise, if the can_cor_eigvals argument is None.
- can_cor_eigvals
np.ndarray, optional Eigenvalues associated with the canonical correlations of N and one-hot encoded y. This argument is used to exploit precomputations. The relationship between the ith canonical correlation can_cor_i and its eigenvalue is:
can_cor_i = sqrt(can_cor_eigval_i / (1 + can_cor_eigval_i))
Or, equivalently:
can_cor_eigval_i = can_cor_i**2 / (1 - can_cor_i**2)
This argument is used only if the criterion argument is eigval.
- Returns
- float
Roy’s largest root calculated based on the criterion defined by the criterion argument.
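The two formulations can be compared with a small plain-Python sketch that starts from already computed canonical correlations. The function name roy_root is hypothetical and the sketch does not reproduce pymfe's argument handling.

```python
def roy_root(can_cors, criterion="eigval"):
    """Roy's largest root from canonical correlations.

    criterion="eigval": largest eigenvalue, range [0, +inf).
    criterion="cancor": largest squared canonical correlation, range [0, 1].
    """
    if criterion == "eigval":
        # eigval_i = can_cor_i**2 / (1 - can_cor_i**2)
        return max(r ** 2 / (1.0 - r ** 2) for r in can_cors)
    if criterion == "cancor":
        return max(r ** 2 for r in can_cors)
    raise ValueError("criterion must be 'eigval' or 'cancor'")
```

For canonical correlations 0.5 and 0.8, the cancor criterion gives 0.64, while the eigval criterion gives 0.64 / 0.36, roughly 1.78, which shows that the two statistics live on different scales.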
References
- 1
Roy SN. On a Heuristic Method of Test Construction and its use in Multivariate Analysis. Ann Math Stat. 1953;24(2):220-238.
- 2
A note on Roy’s largest root. Kuhfeld, W.F. Psychometrika (1986) 51: 479. https://doi.org/10.1007/BF02294069
- classmethod ft_sd(N: ndarray, ddof: int = 1) -> ndarray
Compute the standard deviation of each attribute.
- Parameters
- N
np.ndarray Fitted numerical data.
- ddof
int, optional Degrees of freedom for standard deviation.
- Returns
np.ndarray Attribute standard deviations.
References
- 1
Robert Engels and Christiane Theusinger. Using a data metric for preprocessing advice for data mining applications. In 13th European Conference on Artificial Intelligence (ECAI), pages 430 – 434, 1998.
- classmethod ft_sd_ratio(N: ndarray, y: ndarray, ddof: int = 1, classes: Optional[ndarray] = None, class_freqs: Optional[ndarray] = None) -> float
Compute a statistical test for homogeneity of covariances.
The test applied is the Box’s M Test for equivalence of covariances.
The null hypothesis of this test states that the covariance matrices of the instances of every class are equal.
- Parameters
- N
np.ndarray Fitted numerical data.
- y
np.ndarray Target attribute.
- ddof
int, optional Degrees of freedom for covariance matrix, calculated during this test.
- classes
np.ndarray, optional All distinct classes in target attribute y. Used to exploit precomputations.
- class_freqs
np.ndarray, optional Absolute frequencies of each distinct class in target attribute y or classes. If classes is given, then this argument must be paired with it by index.
- Returns
- float
Homogeneity of covariances test result.
Notes
For details about how this test is applied, check out Rivolli et al. (pag. 32).
References
- 1
Donald Michie, David J. Spiegelhalter, Charles C. Taylor, and John Campbell. Machine Learning, Neural and Statistical Classification, volume 37. Ellis Horwood Upper Saddle River, 1994.
- classmethod ft_skewness(N: ndarray, method: int = 3, bias: bool = True) -> ndarray
Compute the skewness for each attribute.
- Parameters
- N
np.ndarray Fitted numerical data.
- method
int, optional Defines the strategy used to estimate data skewness. This argument is used for compatibility with R package e1071. The options must be one of the following:
Option    Formula
1         Skew_1 = m_3 / m_2**(3/2) (default of scipy.stats)
2         Skew_2 = Skew_1 * sqrt(n*(n-1)) / (n-2)
3         Skew_3 = m_3 / s**3 = Skew_1 * ((n-1)/n)**(3/2)
Where n is the number of instances in N, s is the standard deviation of each attribute in N, and m_i is the ith statistical moment of each attribute in N.
Note that if the selected method cannot be calculated due to division by zero, then the first method will be used instead.
- bias
bool, optional If False, then the calculations are corrected for statistical bias.
- Returns
np.ndarray Attribute skewness.
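The skewness formulas from the table can be written out for a single attribute in plain Python. As an illustration only, the sketch assumes biased central moments m_2 and m_3, uses the hypothetical function name skewness, does not model the bias argument, and omits the fallback to method 1 on division by zero.

```python
import math


def skewness(values, method=3):
    """Skewness of one attribute using the formulas Skew_1, Skew_2 or Skew_3."""
    n = len(values)
    mean = sum(values) / n
    # Biased central moments of order 2 and 3.
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    skew_1 = m3 / m2 ** 1.5
    if method == 1:
        return skew_1
    if method == 2:
        return skew_1 * math.sqrt(n * (n - 1)) / (n - 2)
    if method == 3:
        return skew_1 * ((n - 1) / n) ** 1.5
    raise ValueError("method must be 1, 2 or 3")
```

A symmetric sample such as 1, 2, 3 has zero third central moment, so all three methods return 0 for it.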
References
- 1
Donald Michie, David J. Spiegelhalter, Charles C. Taylor, and John Campbell. Machine Learning, Neural and Statistical Classification, volume 37. Ellis Horwood Upper Saddle River, 1994.
- classmethod ft_sparsity(X: ndarray, normalize: bool = True) -> ndarray
Compute (possibly normalized) sparsity metric for each attribute.
Sparsity S of a vector v of numeric values is defined as
S(v) = (1.0 / (n - 1.0)) * ((n / phi(v)) - 1.0),
where
n is the number of instances in dataset X.
phi(v) is the number of distinct values in v.
- Parameters
- X
np.ndarray Fitted numerical data.
- normalize
bool, optional If True, then the output will be S(v) as shown above. Otherwise, the output is not multiplied by the (1.0 / (n - 1.0)) factor (i.e., the new output is defined as S’(v) = ((n / phi(v)) - 1.0)).
- Returns
np.ndarray Attribute sparsities.
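A direct plain-Python transcription of S(v) for a single attribute is sketched below. The function name sparsity is used for illustration only.

```python
def sparsity(values, normalize=True):
    """S(v) = (1 / (n - 1)) * (n / phi(v) - 1), or S'(v) without the first factor."""
    n = len(values)
    phi = len(set(values))  # number of distinct values
    raw = n / phi - 1.0
    return raw / (n - 1.0) if normalize else raw
```

The normalized metric ranges from 0.0, when every value is distinct (phi(v) = n), to 1.0, when all values are equal (phi(v) = 1).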
References
- 1
Mostafa A. Salama, Aboul Ella Hassanien, and Kenneth Revett. Employment of neural network and rough set in meta-learning. Memetic Computing, 5(3):165 – 177, 2013.
- classmethod ft_t_mean(N: ndarray, pcut: float = 0.2) -> ndarray
Compute the trimmed mean of each attribute.
- Parameters
- N
np.ndarray Fitted numerical data.
- pcut
float, optional Proportion of values cut from both the lower and the higher ends. This value should be in the interval [0.0, 0.5); if it is 0.0, the return value is the ordinary mean.
- Returns
np.ndarray Attribute trimmed means.
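A trimmed mean for one attribute can be sketched in plain Python as follows. The function name t_mean is illustrative, and the number of values dropped at each end is assumed to be pcut * n truncated to an integer; pymfe may round differently.

```python
def t_mean(values, pcut=0.2):
    """Mean after dropping a fraction pcut of the lowest and highest values."""
    if not 0.0 <= pcut < 0.5:
        raise ValueError("pcut must be in [0.0, 0.5)")
    ordered = sorted(values)
    k = int(pcut * len(ordered))  # values dropped at each end (truncated)
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)
```

With pcut=0.0 nothing is dropped and the function reduces to the ordinary mean.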
References
- 1
Robert Engels and Christiane Theusinger. Using a data metric for preprocessing advice for data mining applications. In 13th European Conference on Artificial Intelligence (ECAI), pages 430 – 434, 1998.
- classmethod ft_var(N: ndarray, ddof: int = 1) -> ndarray
Compute the variance of each attribute.
- Parameters
- N
np.ndarray Fitted numerical data.
- ddof
int, optional Degrees of freedom for variance.
- Returns
np.ndarray Attribute variances.
References
- 1
Ciro Castiello, Giovanna Castellano, and Anna Maria Fanelli. Meta-data: Characterization of input features for meta-learning. In 2nd International Conference on Modeling Decisions for Artificial Intelligence (MDAI), pages 457–468, 2005.
- classmethod ft_w_lambda(N: ndarray, y: ndarray, can_cor_eigvals: Optional[ndarray] = None, can_cors: Optional[ndarray] = None) -> float
Compute the Wilks’ Lambda value.
The Wilks’ Lambda L is calculated as:
L = prod(1.0 / (1.0 + can_cor_eig_i))
Where can_cor_eig_i is the ith eigenvalue related to the ith canonical correlation can_cor_i between the attributes in N and the binarized (one-hot encoded) version of y.
The relationship between can_cor_eig_i and can_cor_i is given by:
can_cor_i = sqrt(can_cor_eig_i / (1 + can_cor_eig_i))
Or, equivalently:
can_cor_eig_i = can_cor_i**2 / (1 - can_cor_i**2)
- Parameters
- N
np.ndarray Fitted numerical data.
- y
np.ndarray Target attribute.
- can_cor_eigvals
np.ndarray, optional Eigenvalues associated with the canonical correlations of N and one-hot encoded y. This argument is used to exploit precomputations. The relationship between the ith canonical correlation can_cor_i and its eigenvalue is:
can_cor_i = sqrt(can_cor_eigval_i / (1 + can_cor_eigval_i))
Or, equivalently:
can_cor_eigval_i = can_cor_i**2 / (1 - can_cor_i**2)
- can_cors
np.ndarray, optional Canonical correlations between N and the one-hot encoded version of y. Argument used to take advantage of precomputations. Used only if can_cor_eigvals is None.
- Returns
- float
Wilks’ Lambda value.
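Substituting can_cor_eig_i = can_cor_i**2 / (1 - can_cor_i**2) into the product gives 1 / (1 + can_cor_eig_i) = 1 - can_cor_i**2, so the statistic can be computed from either quantity. The plain-Python sketch below illustrates this; the function name w_lambda is hypothetical and the inputs are assumed to be already computed.

```python
import math


def w_lambda(can_cor_eigvals=None, can_cors=None):
    """Wilks' Lambda from eigenvalues, or from canonical correlations."""
    if can_cor_eigvals is not None:
        return math.prod(1.0 / (1.0 + e) for e in can_cor_eigvals)
    if can_cors is None:
        raise ValueError("give can_cor_eigvals or can_cors")
    # 1 / (1 + eig_i) simplifies to 1 - can_cor_i**2
    return math.prod(1.0 - r ** 2 for r in can_cors)
```

Values close to 0 correspond to strong canonical correlations, while a value of 1 means every canonical correlation is zero.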
References
- 1
Guido Lindner and Rudi Studer. AST: Support for algorithm selection with a CBR approach. In European Conference on Principles of Data Mining and Knowledge Discovery (PKDD), pages 418 – 423, 1999.
- classmethod precompute_can_cors(N: Optional[ndarray] = None, y: Optional[ndarray] = None, **kwargs) -> Dict[str, Any]
Precompute canonical correlations and their eigenvalues.
- Parameters
- N
np.ndarray, optional Numerical fitted data.
- y
np.ndarray Target attribute.
- kwargs
Additional arguments. May contain values previously precomputed by other precomputation methods, which can help speed up this precomputation.
- Returns
dict With the following precomputed items:
can_cors (np.ndarray): canonical correlations between N and the one-hot encoded version of y.
can_cor_eigvals (np.ndarray): eigenvalues related to the canonical correlations.
- classmethod precompute_statistical_class(y: Optional[ndarray] = None, **kwargs) -> Dict[str, Any]
Precompute distinct classes and their absolute frequencies from y.
- Parameters
- y
np.ndarray The target attribute from fitted data.
- kwargs
Additional arguments. May contain values previously precomputed by other precomputation methods, which can help speed up this precomputation.
- Returns
dict With the following precomputed items:
classes (np.ndarray): distinct classes of y, if y is not None.
class_freqs (np.ndarray): absolute class frequencies of y, if y is not None.
- classmethod precompute_statistical_cor_cov(N: Optional[ndarray] = None, ddof: int = 1, **kwargs) -> Dict[str, Any]
Precompute the correlation and covariance matrices of numerical data.
Be cautious about enabling this precomputation method on huge datasets, as it may be very memory hungry.
- Parameters
- N
np.ndarray, optional Numerical fitted data.
- ddof
int, optional Degrees of freedom of covariance matrix.
- kwargs
Additional arguments. May contain values previously precomputed by other precomputation methods, which can help speed up this precomputation.
- Returns
dict With the following precomputed items:
cov_mat (np.ndarray): covariance matrix of N, if N is not None.
abs_corr_mat (np.ndarray): absolute correlation matrix of N, if N is not None.