pymfe.statistical.MFEStatistical

class pymfe.statistical.MFEStatistical[source]

Keep methods for metafeatures of the Statistical group.

The convention adopted for metafeature-extraction methods is to always start with the ft_ prefix, in order to allow automatic method detection. This prefix is predefined within the _internal module.

All method signatures follow the conventions and restrictions listed below:

  1. For independent attribute data, X means every type of attribute, N means numerical attributes only, and C stands for categorical attributes only. Note that the categorical attribute sets between X and C, and the numerical attribute sets between X and N, may differ due to data transformations performed while fitting data into the MFE model, enabled respectively by the transform_num and transform_cat arguments of fit (an MFE method).

  2. Only arguments in the MFE _custom_args_ft attribute (set up inside the fit method) are allowed to be required method arguments. All other arguments must be strictly optional (i.e., have a predefined default value).

  3. It is assumed that the user can change any optional argument, without any prior verification of type or value, via the kwargs argument of the extract method of the MFE class.

  4. The return value of all feature-extraction methods should be a single value or a generic list type (preferably an np.ndarray) with numeric values.

There is another type of method adopted for automatic detection: methods with the precompute_ prefix. These methods run automatically while fitting some data into an MFE model, and their objective is to precompute common values shared between more than one feature-extraction method. This strategy is a trade-off between higher system memory consumption and faster feature extraction. Their return value must always be a dictionary whose keys are possible extra arguments for both feature-extraction methods and other precomputation methods. Note that precomputed values are shared between all valid feature-extraction modules (e.g., class_freqs computed in the statistical module can freely be used by any precomputation or feature-extraction method of the landmarking module).
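As a minimal usage sketch (assuming scikit-learn is available for sample data; the dataset choice is illustrative only), the methods of this class are normally reached through the MFE facade rather than called directly:

    from sklearn.datasets import load_iris
    from pymfe.mfe import MFE

    # Load a small sample dataset (illustrative choice only).
    X, y = load_iris(return_X_y=True)

    # Fit the data into the MFE model restricted to the statistical group;
    # the precompute_* methods below run automatically during fit().
    mfe = MFE(groups=["statistical"])
    mfe.fit(X, y)

    # extract() dispatches to the ft_* methods detected by their prefix.
    names, values = mfe.extract()
    print(dict(zip(names, values)))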

__init__(*args, **kwargs)

Methods

__init__(*args, **kwargs)

ft_can_cor(N, y[, can_cors])

Compute canonical correlations of data.

ft_cor(N[, abs_corr_mat])

Compute the absolute value of the correlation of distinct dataset column pairs.

ft_cov(N[, ddof, cov_mat])

Compute the absolute value of the covariance of distinct dataset attribute pairs.

ft_eigenvalues(N[, ddof, cov_mat])

Compute the eigenvalues of covariance matrix from dataset.

ft_g_mean(N[, allow_zeros, epsilon])

Compute the geometric mean of each attribute.

ft_gravity(N, y[, norm_ord, classes, ...])

Compute the distance between minority and majority classes center of mass.

ft_h_mean(N)

Compute the harmonic mean of each attribute.

ft_iq_range(N)

Compute the interquartile range (IQR) of each attribute.

ft_kurtosis(N[, method, bias])

Compute the kurtosis of each attribute.

ft_lh_trace(N, y[, can_cor_eigvals, can_cors])

Compute the Lawley-Hotelling trace.

ft_mad(N[, factor])

Compute the Median Absolute Deviation (MAD) adjusted by a factor.

ft_max(N)

Compute the maximum value from each attribute.

ft_mean(N)

Compute the mean value of each attribute.

ft_median(N)

Compute the median value from each attribute.

ft_min(N)

Compute the minimum value from each attribute.

ft_nr_cor_attr(N[, threshold, normalize, ...])

Compute the number of distinct highly correlated pair of attributes.

ft_nr_disc(N, y[, can_cors])

Compute the number of canonical correlations between each attribute and class.

ft_nr_norm(N[, method, threshold, failure, ...])

Compute the number of attributes normally distributed based on a given method.

ft_nr_outliers(N[, whis])

Compute the number of attributes with at least one outlier value.

ft_p_trace(N, y[, can_cors])

Compute the Pillai's trace.

ft_range(N)

Compute the range (max - min) of each attribute.

ft_roy_root(N, y[, criterion, can_cors, ...])

Compute the Roy's largest root.

ft_sd(N[, ddof])

Compute the standard deviation of each attribute.

ft_sd_ratio(N, y[, ddof, classes, class_freqs])

Compute a statistical test for homogeneity of covariances.

ft_skewness(N[, method, bias])

Compute the skewness for each attribute.

ft_sparsity(X[, normalize])

Compute (possibly normalized) sparsity metric for each attribute.

ft_t_mean(N[, pcut])

Compute the trimmed mean of each attribute.

ft_var(N[, ddof])

Compute the variance of each attribute.

ft_w_lambda(N, y[, can_cor_eigvals, can_cors])

Compute the Wilks' Lambda value.

precompute_can_cors([N, y])

Precompute canonical correlations and their eigenvalues.

precompute_statistical_class([y])

Precompute distinct classes and their absolute frequencies.

precompute_statistical_cor_cov([N, ddof])

Precompute the correlation and covariance matrices of numerical data.

classmethod ft_can_cor(N: ndarray, y: ndarray, can_cors: Optional[ndarray] = None) ndarray[source]

Compute canonical correlations of data.

The canonical correlations are calculated between the attributes in N and the binarized (one-hot encoded) version of y.

Parameters
N : np.ndarray

Fitted numerical data.

y : np.ndarray

Target attribute.

can_cors : np.ndarray, optional

Canonical correlations between N and the one-hot encoded version of y. Argument used to take advantage of precomputations.

Returns
np.ndarray

Canonical correlations of the data.

References

1

Alexandros Kalousis. Algorithm Selection via Meta-Learning. PhD thesis, Faculty of Science of the University of Geneva, 2002.

classmethod ft_cor(N: ndarray, abs_corr_mat: Optional[ndarray] = None) ndarray[source]

Compute the absolute value of the correlation of distinct dataset column pairs.

Parameters
N : np.ndarray

Fitted numerical data.

abs_corr_mat : np.ndarray, optional

Absolute correlation matrix of N. Argument used to exploit precomputations.

Returns
np.ndarray

Absolute value of correlation between distinct attributes.

References

1

Ciro Castiello, Giovanna Castellano, and Anna Maria Fanelli. Meta-data: Characterization of input features for meta-learning. In 2nd International Conference on Modeling Decisions for Artificial Intelligence (MDAI), pages 457–468, 2005.

2

Matthias Reif, Faisal Shafait, Markus Goldstein, Thomas Breuel, and Andreas Dengel. Automatic classifier selection for non-experts. Pattern Analysis and Applications, 17(1):83–96, 2014.

3

Donald Michie, David J. Spiegelhalter, Charles C. Taylor, and John Campbell. Machine Learning, Neural and Statistical Classification, volume 37. Ellis Horwood Upper Saddle River, 1994.

classmethod ft_cov(N: ndarray, ddof: int = 1, cov_mat: Optional[ndarray] = None) ndarray[source]

Compute the absolute value of the covariance of distinct dataset attribute pairs.

Parameters
N : np.ndarray

Fitted numerical data.

ddof : int, optional

Degrees of freedom for covariance matrix.

cov_mat : np.ndarray, optional

Covariance matrix of N. Argument meant to exploit precomputations. Note that this argument value is not the same as this method's return value, as this method returns only the lower-triangle values of cov_mat.

Returns
np.ndarray

Absolute value of covariances between distinct attributes.

References

1

Ciro Castiello, Giovanna Castellano, and Anna Maria Fanelli. Meta-data: Characterization of input features for meta-learning. In 2nd International Conference on Modeling Decisions for Artificial Intelligence (MDAI), pages 457–468, 2005.

2

Donald Michie, David J. Spiegelhalter, Charles C. Taylor, and John Campbell. Machine Learning, Neural and Statistical Classification, volume 37. Ellis Horwood Upper Saddle River, 1994.

classmethod ft_eigenvalues(N: ndarray, ddof: int = 1, cov_mat: Optional[ndarray] = None) ndarray[source]

Compute the eigenvalues of covariance matrix from dataset.

Parameters
N : np.ndarray

Fitted numerical data.

ddof : int, optional

Degrees of freedom for covariance matrix.

cov_mat : np.ndarray, optional

Covariance matrix of N. Argument meant to exploit precomputations.

Returns
np.ndarray

Eigenvalues of N covariance matrix.

References

1

Shawkat Ali and Kate A. Smith. On learning algorithm selection for classification. Applied Soft Computing, 6(2):119 – 138, 2006.

classmethod ft_g_mean(N: ndarray, allow_zeros: bool = True, epsilon: float = 1e-10) ndarray[source]

Compute the geometric mean of each attribute.

Parameters
N : np.ndarray

Fitted numerical data.

allow_zeros : bool, optional

If True, then the geometric mean of all attributes with zero values is set to zero. Otherwise, these values are set to np.nan.

epsilon : float, optional

A small value below which (in absolute value) all values are considered zero-valued. Used only if allow_zeros is False.

Returns
np.ndarray

Attribute geometric means.

References

1

Shawkat Ali and Kate A. Smith-Miles. A meta-learning approach to automatic kernel selection for support vector machines. Neurocomputing, 70(1):173 – 186, 2006.

classmethod ft_gravity(N: ndarray, y: ndarray, norm_ord: Union[int, float] = 2, classes: Optional[ndarray] = None, class_freqs: Optional[ndarray] = None, cls_inds: Optional[ndarray] = None) float[source]

Compute the distance between minority and majority classes center of mass.

The center of mass of a class is the average value of each attribute over the instances of that class.

The majority and minority classes cannot be the same, even if every class has the same number of instances.

Parameters
N : np.ndarray

Fitted numerical data.

y : np.ndarray

Target attribute.

norm_ord : numeric, optional

Minkowski distance parameter. Popular cases for this argument value are:

norm_ord    Distance name
--------    -------------
-> -inf     Min value
1.0         Manhattan/City Block
2.0         Euclidean
-> +inf     Max value (infinite norm)

classes : np.ndarray, optional

Distinct classes of y.

class_freqs : np.ndarray, optional

Absolute frequencies of each distinct class in target attribute y or classes. If classes is given, then this argument must be paired with it by index.

cls_inds : np.ndarray, optional

Boolean array which indicates the examples of each class. The rows represent each distinct class, and the columns represent the instances. Used to take advantage of precomputations.

Returns
float

Gravity of the numeric dataset.

Raises
ValueError

If norm_ord is not numeric.

References

1

Shawkat Ali and Kate A. Smith. On learning algorithm selection for classification. Applied Soft Computing, 6(2):119 – 138, 2006.
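The idea above can be sketched in plain NumPy (an illustrative sketch of the described computation, not pymfe's exact implementation; gravity_sketch is a hypothetical name):

    import numpy as np

    def gravity_sketch(N, y, norm_ord=2):
        # Identify the majority class and, excluding it, the minority class.
        classes, freqs = np.unique(y, return_counts=True)
        maj = classes[np.argmax(freqs)]
        rest = classes != maj
        minc = classes[rest][np.argmin(freqs[rest])]
        # Center of mass of a class: attribute-wise mean of its instances.
        center_maj = N[y == maj].mean(axis=0)
        center_min = N[y == minc].mean(axis=0)
        # Minkowski distance between the two centers of mass.
        return float(np.linalg.norm(center_maj - center_min, ord=norm_ord))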

classmethod ft_h_mean(N: ndarray) ndarray[source]

Compute the harmonic mean of each attribute.

Parameters
N : np.ndarray

Fitted numerical data.

Returns
np.ndarray

Attribute harmonic means.

References

1

Shawkat Ali and Kate A. Smith-Miles. A meta-learning approach to automatic kernel selection for support vector machines. Neurocomputing, 70(1):173 – 186, 2006.

classmethod ft_iq_range(N: ndarray) ndarray[source]

Compute the interquartile range (IQR) of each attribute.

Parameters
N : np.ndarray

Fitted numerical data.

Returns
np.ndarray

Attribute interquartile ranges.

References

1

Shawkat Ali and Kate A. Smith-Miles. A meta-learning approach to automatic kernel selection for support vector machines. Neurocomputing, 70(1):173 – 186, 2006.

classmethod ft_kurtosis(N: ndarray, method: int = 3, bias: bool = True) ndarray[source]

Compute the kurtosis of each attribute.

Parameters
N : np.ndarray

Fitted numerical data.

method : int, optional

Defines the strategy used to estimate data kurtosis. Used for full compatibility with the R package e1071. The options must be one of the following:

Option    Formula
------    -------
1         Kurt_1 = m_4 / m_2**2 - 3 (default of the scipy.stats package)
2         Kurt_2 = ((n+1) * Kurt_1 + 6) * (n-1) / f_2, with f_2 = (n-2)*(n-3)
3         Kurt_3 = m_4 / s**4 - 3 = (Kurt_1+3) * (1 - 1/n)**2 - 3

Where n is the number of instances in N, s is the standard deviation of each attribute in N, and m_i is the ith statistical momentum of each attribute in N.

Note that if the selected method cannot be calculated due to division by zero, then the first method is used instead.

bias : bool, optional

If False, then the calculations are corrected for statistical bias.

Returns
np.ndarray

Attribute kurtosis.

References

1

Donald Michie, David J. Spiegelhalter, Charles C. Taylor, and John Campbell. Machine Learning, Neural and Statistical Classification, volume 37. Ellis Horwood Upper Saddle River, 1994.
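The three options can be sketched as follows (an illustrative, biased-estimate sketch of the table above, not pymfe's code; kurtosis_sketch is a hypothetical name):

    import numpy as np

    def kurtosis_sketch(x, method=3):
        # Central moments of the attribute values.
        n = x.size
        d = x - x.mean()
        m2, m4 = np.mean(d ** 2), np.mean(d ** 4)
        kurt_1 = m4 / m2 ** 2 - 3.0              # option 1
        if method == 1:
            return kurt_1
        if method == 2:                          # option 2
            return ((n + 1) * kurt_1 + 6) * (n - 1) / ((n - 2) * (n - 3))
        # Option 3: m_4 / s**4 - 3, with s the ddof=1 standard deviation.
        return (kurt_1 + 3.0) * (1.0 - 1.0 / n) ** 2 - 3.0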

classmethod ft_lh_trace(N: ndarray, y: ndarray, can_cor_eigvals: Optional[ndarray] = None, can_cors: Optional[ndarray] = None) float[source]

Compute the Lawley-Hotelling trace.

The Lawley-Hotelling trace LH is given by:

LH = sum_{i} can_cor_i**2 / (1 - can_cor_i**2)

Where can_cor_i is the ith canonical correlation of N and the one-hot encoded version of y.

Equivalently, LH can be calculated from the eigenvalues related to each canonical correlation due to the relationship:

can_cor_eigval_i = can_cor_i**2 / (1 - can_cor_i**2)

Therefore, LH is given simply by:

LH = sum_{i} can_cor_eigval_i

Parameters
N : np.ndarray

Numerical fitted data.

y : np.ndarray

Target attribute.

can_cor_eigvals : np.ndarray, optional

Eigenvalues associated with the canonical correlations of N and one-hot encoded y. This argument is used to exploit precomputations. The relationship between the ith canonical correlation can_cor_i and its eigenvalue is:

can_cor_i = sqrt(can_cor_eigval_i / (1 + can_cor_eigval_i))

Or, equivalently:

can_cor_eigval_i = can_cor_i**2 / (1 - can_cor_i**2)

can_cors : np.ndarray, optional

Canonical correlations between N and the one-hot encoded version of y. Argument used to take advantage of precomputations. Used only if can_cor_eigvals is None.

Returns
float

Lawley-Hotelling trace value.

References

1

Lawley D. A Generalization of Fisher’s z Test. Biometrika. 1938;30(1):180-187.

2

Hotelling H. A generalized T test and measure of multivariate dispersion. In: Neyman J, ed. Proceedings of the Second Berkeley Symposium on Mathematical Statistics and Probability. Berkeley: University of California Press; 1951:23-41.
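Given precomputed canonical correlations, the trace reduces to a one-liner (a minimal sketch of the formula above; lh_trace_sketch is a hypothetical name):

    import numpy as np

    def lh_trace_sketch(can_cors):
        # LH = sum of can_cor_i**2 / (1 - can_cor_i**2) over all i.
        c = np.asarray(can_cors, dtype=float)
        return float(np.sum(c ** 2 / (1.0 - c ** 2)))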

classmethod ft_mad(N: ndarray, factor: float = 1.4826) ndarray[source]

Compute the Median Absolute Deviation (MAD) adjusted by a factor.

Parameters
N : np.ndarray

Fitted numerical data.

factor : float, optional

Multiplication factor for output correction. The default factor is 1.4826 since it approximates the MAD of normally distributed data (with any mean and a standard deviation of 1.0), which makes this method's result comparable with that sort of data.

Returns
np.ndarray

Attribute MAD (Median Absolute Deviation).

References

1

Shawkat Ali and Kate A. Smith. On learning algorithm selection for classification. Applied Soft Computing, 6(2):119 – 138, 2006.
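A minimal NumPy sketch of the factor-adjusted MAD (illustrative only; mad_sketch is a hypothetical name):

    import numpy as np

    def mad_sketch(N, factor=1.4826):
        # Median absolute deviation around each attribute's median,
        # scaled by the correction factor.
        med = np.median(N, axis=0)
        return factor * np.median(np.abs(N - med), axis=0)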

classmethod ft_max(N: ndarray) ndarray[source]

Compute the maximum value from each attribute.

Parameters
N : np.ndarray

Fitted numerical data.

Returns
np.ndarray

Attribute maximum values.

References

1

Robert Engels and Christiane Theusinger. Using a data metric for preprocessing advice for data mining applications. In 13th European Conference on Artificial Intelligence (ECAI), pages 430 – 434, 1998.

classmethod ft_mean(N: ndarray) ndarray[source]

Compute the mean value of each attribute.

Parameters
N : np.ndarray

Fitted numerical data.

Returns
np.ndarray

Attribute mean values.

References

1

Robert Engels and Christiane Theusinger. Using a data metric for preprocessing advice for data mining applications. In 13th European Conference on Artificial Intelligence (ECAI), pages 430 – 434, 1998.

classmethod ft_median(N: ndarray) ndarray[source]

Compute the median value from each attribute.

Parameters
N : np.ndarray

Fitted numerical data.

Returns
np.ndarray

Attribute median values.

References

1

Robert Engels and Christiane Theusinger. Using a data metric for preprocessing advice for data mining applications. In 13th European Conference on Artificial Intelligence (ECAI), pages 430 – 434, 1998.

classmethod ft_min(N: ndarray) ndarray[source]

Compute the minimum value from each attribute.

Parameters
N : np.ndarray

Fitted numerical data.

Returns
np.ndarray

Attribute minimum values.

References

1

Robert Engels and Christiane Theusinger. Using a data metric for preprocessing advice for data mining applications. In 13th European Conference on Artificial Intelligence (ECAI), pages 430 – 434, 1998.

classmethod ft_nr_cor_attr(N: ndarray, threshold: float = 0.5, normalize: bool = True, abs_corr_mat: Optional[ndarray] = None) Union[int, float][source]

Compute the number of distinct highly correlated pairs of attributes.

A pair of attributes is considered highly correlated if the absolute value of its correlation is equal to or larger than a given threshold.

Parameters
N : np.ndarray

Fitted numerical data.

threshold : float, optional

Threshold value; a correlation is assumed to be strong if its absolute value is equal to or greater than it.

normalize : bool, optional

If True, the result is normalized by a factor of 2/(d*(d-1)), where d is the number of attributes (columns) in N.

abs_corr_mat : np.ndarray, optional

Absolute correlation matrix of N. Argument used to exploit precomputations.

Returns
int | float

If normalize is False, this method returns the number of highly correlated pairs of distinct attributes. Otherwise, it returns the proportion of highly correlated pairs.

References

1

Mostafa A. Salama, Aboul Ella Hassanien, and Kenneth Revett. Employment of neural network and rough set in meta-learning. Memetic Computing, 5(3):165 – 177, 2013.
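A sketch of this count under the definition above (illustrative only; nr_cor_attr_sketch is a hypothetical name):

    import numpy as np

    def nr_cor_attr_sketch(N, threshold=0.5, normalize=True):
        # Absolute correlation matrix of the attributes (columns) of N.
        abs_corr = np.abs(np.corrcoef(N, rowvar=False))
        d = abs_corr.shape[0]
        # Count each distinct pair once: strict upper triangle.
        hits = int(np.sum(abs_corr[np.triu_indices(d, k=1)] >= threshold))
        return hits * 2.0 / (d * (d - 1.0)) if normalize else hits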

classmethod ft_nr_disc(N: ndarray, y: ndarray, can_cors: Optional[ndarray] = None) Union[int, float][source]

Compute the number of canonical correlations between each attribute and class.

This method's return value is effectively the size of the return value of the ft_can_cor method. Check its documentation for more in-depth details.

Parameters
N : np.ndarray

Fitted numerical data.

y : np.ndarray

Target attribute.

can_cors : np.ndarray, optional

Canonical correlations between N and the one-hot encoded version of y. Argument used to take advantage of precomputations.

Returns
int or float

Number of canonical correlations between each attribute and class, if ft_can_cor is executed successfully. Returns np.nan otherwise.

References

1

Guido Lindner and Rudi Studer. AST: Support for algorithm selection with a CBR approach. In European Conference on Principles of Data Mining and Knowledge Discovery (PKDD), pages 418 – 423, 1999.

classmethod ft_nr_norm(N: ndarray, method: str = 'shapiro-wilk', threshold: float = 0.05, failure: str = 'soft', max_samples: int = 5000) Union[float, int][source]

Compute the number of attributes normally distributed based on a given method.

Parameters
N : np.ndarray

Fitted numerical data.

method : str, optional

Selects the normality test to be executed. This argument must assume one of the options shown below:

  • shapiro-wilk: from the scipy.stats.shapiro documentation: the Shapiro-Wilk test tests the null hypothesis that the data was drawn from a normal distribution.

  • dagostino-pearson: from the scipy.stats.normaltest documentation: it is based on D’Agostino and Pearson’s test, which combines skew and kurtosis to produce an omnibus test of normality.

  • anderson-darling: from the scipy.stats.anderson documentation: the Anderson-Darling test tests the null hypothesis that a sample is drawn from a population that follows a particular distribution. In this method's context, that particular distribution is fixed as the normal/Gaussian distribution.

  • all: perform all tests cited above. To consider an attribute normally distributed, all test results are taken into account with equal weight. Check the failure argument for more information.

threshold : float, optional

Level of significance used to reject the null hypothesis of the normality tests.

failure : str, optional

Used only if the method argument value is all. This argument must assume one of the values soft or hard. If soft, then an attribute is considered normally distributed as soon as a single test fails to reject its null hypothesis (which, for every test, states that the data follows a Gaussian distribution). If hard, then every single normality test must fail to reject the null hypothesis for the attribute to be considered normally distributed.

max_samples : int, optional

Maximum number of samples used while performing the normality tests. The Shapiro-Wilk test p-value may not be accurate when the sample size is larger than 5000. Note that the instances are NOT shuffled before this cutoff; the very first max_samples instances of the dataset N are the ones considered in the statistical tests.

Returns
int

The number of normally distributed attributes based on the method. If max_samples is non-positive, np.nan is returned instead.

Raises
ValueError

If method or failure is not a valid option.

References

1

Christian Kopf, Charles Taylor, and Jorg Keller. Meta-Analysis: From data characterisation for meta-learning to meta-regression. In PKDD Workshop on Data Mining, Decision Support, Meta-Learning and Inductive Logic Programming, pages 15 – 26, 2000.
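The shapiro-wilk branch, for instance, can be sketched as follows (illustrative only; nr_norm_sketch is a hypothetical name):

    import numpy as np
    from scipy.stats import shapiro

    def nr_norm_sketch(N, threshold=0.05, max_samples=5000):
        # Cut off at max_samples WITHOUT shuffling, as noted above.
        sample = N[:max_samples]
        # An attribute counts as normal when the test fails to reject
        # the normality null hypothesis (p-value above the threshold).
        p_values = [shapiro(col)[1] for col in sample.T]
        return int(sum(p > threshold for p in p_values))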

classmethod ft_nr_outliers(N: ndarray, whis: float = 1.5) int[source]

Compute the number of attributes with at least one outlier value.

An attribute has an outlier if some of its values lie outside the closed interval [first_quartile - WHIS * IQR, third_quartile + WHIS * IQR], where IQR is the interquartile range (third_quartile - first_quartile), and the WHIS value is typically 1.5.

Parameters
N : np.ndarray

Fitted numerical data.

whis : float, optional

A factor that multiplies IQR to set up the non-outlier interval (as stated above). Higher values widen this interval, thus increasing the tolerance for outliers; lower values narrow it and therefore decrease the tolerance for possible outliers.

Returns
int

Number of attributes with at least one outlier.

References

1

Christian Kopf and Ioannis Iglezakis. Combination of task description strategies and case base properties for meta-learning. In 2nd ECML/PKDD International Workshop on Integration and Collaboration Aspects of Data Mining, Decision Support and Meta-Learning(IDDM), pages 65 – 76, 2002.

2

Peter J. Rousseeuw and Mia Hubert. Robust statistics for outlier detection. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 1(1):73 – 79, 2011.
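The whisker rule above in plain NumPy (illustrative only; nr_outliers_sketch is a hypothetical name):

    import numpy as np

    def nr_outliers_sketch(N, whis=1.5):
        # Per-attribute quartiles and non-outlier interval.
        q1, q3 = np.percentile(N, [25, 75], axis=0)
        iqr = q3 - q1
        lo, hi = q1 - whis * iqr, q3 + whis * iqr
        # Count attributes with at least one value outside the interval.
        return int(np.sum(np.any((N < lo) | (N > hi), axis=0)))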

classmethod ft_p_trace(N: ndarray, y: ndarray, can_cors: Optional[ndarray] = None) float[source]

Compute the Pillai’s trace.

The Pillai’s trace is the sum of the squared canonical correlations of N and the one-hot encoded version of y.

Parameters
N : np.ndarray

Numerical fitted data.

y : np.ndarray

Target attribute.

can_cors : np.ndarray, optional

Canonical correlations between N and the one-hot encoded version of y. Argument used to take advantage of precomputations.

Returns
float

Pillai’s trace value.

References

1

Pillai K.C.S (1955). Some New test criteria in multivariate analysis. Ann Math Stat: 26(1):117–21. Seber, G.A.F. (1984). Multivariate Observations. New York: John Wiley and Sons.
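Given the canonical correlations, Pillai's trace is a single reduction (a minimal sketch of the definition above; p_trace_sketch is a hypothetical name):

    import numpy as np

    def p_trace_sketch(can_cors):
        # Sum of the squared canonical correlations.
        return float(np.sum(np.asarray(can_cors, dtype=float) ** 2))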

classmethod ft_range(N: ndarray) ndarray[source]

Compute the range (max - min) of each attribute.

Parameters
Nnp.ndarray

Fitted numerical data.

Returns
np.ndarray

Attribute ranges.

References

1

Shawkat Ali and Kate A. Smith-Miles. A meta-learning approach to automatic kernel selection for support vector machines. Neurocomputing, 70(1):173 – 186, 2006.

classmethod ft_roy_root(N: ndarray, y: ndarray, criterion: str = 'eigval', can_cors: Optional[ndarray] = None, can_cor_eigvals: Optional[ndarray] = None) float[source]

Compute the Roy’s largest root.

The Roy’s largest root RLR can be computed using two distinct approaches (see the references for further explanation).

1. Based on Roy’s (ii) original hypothesis: formulated using the largest eigenvalue associated with the canonical correlations between N and the one-hot encoded version of y. That is, the Roy’s largest root RLR_a can be defined as:

RLR_a = max_{i} can_cor_eigval_i

It is in range [0, +inf).

2. Based on Roy’s (iii) original hypothesis: formulated using the largest squared canonical correlation between N and the one-hot encoded version of y. Therefore, the Roy’s largest root RLR_b can be defined as:

RLR_b = max_{i} can_cor_i**2

It is in range [0, 1].

Note that both statistics have different meanings and, therefore, will assume distinct values.

Which formulation is used can be controlled with the criterion argument (see below for more information).

Parameters
N : np.ndarray

Numerical fitted data.

y : np.ndarray

Target attribute.

criterion : str, optional

If eigval, calculate the Roy’s largest root as the largest eigenvalue associated with each canonical correlation. This is the first formulation described above. If cancor, calculate the Roy’s largest root as the largest squared canonical correlation. This is the second formulation above.

can_cors : np.ndarray, optional

Canonical correlations between N and the one-hot encoded version of y. Argument used to take advantage of precomputations. Used only if criterion is cancor or, otherwise, if the can_cor_eigvals argument is None.

can_cor_eigvals : np.ndarray, optional

Eigenvalues associated with the canonical correlations of N and one-hot encoded y. This argument is used to exploit precomputations. The relationship between the ith canonical correlation can_cor_i and its eigenvalue is:

can_cor_i = sqrt(can_cor_eigval_i / (1 + can_cor_eigval_i))

Or, equivalently:

can_cor_eigval_i = can_cor_i**2 / (1 - can_cor_i**2)

This argument is used only if the criterion argument is eigval.

Returns
float

Roy’s largest root calculated based on criterion defined by the criterion argument.

References

1

Roy SN. On a Heuristic Method of Test Construction and its use in Multivariate Analysis. Ann Math Stat. 1953;24(2):220-238.

2

A note on Roy’s largest root. Kuhfeld, W.F. Psychometrika (1986) 51: 479. https://doi.org/10.1007/BF02294069
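Both formulations can be sketched from the canonical correlations (illustrative only; roy_root_sketch is a hypothetical name):

    import numpy as np

    def roy_root_sketch(can_cors, criterion="eigval"):
        c = np.asarray(can_cors, dtype=float)
        if criterion == "cancor":
            # RLR_b: largest squared canonical correlation, in [0, 1].
            return float(np.max(c ** 2))
        # RLR_a: largest associated eigenvalue, in [0, +inf).
        return float(np.max(c ** 2 / (1.0 - c ** 2)))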

classmethod ft_sd(N: ndarray, ddof: int = 1) ndarray[source]

Compute the standard deviation of each attribute.

Parameters
N : np.ndarray

Fitted numerical data.

ddof : float, optional

Degrees of freedom for standard deviation.

Returns
np.ndarray

Attribute standard deviations.

References

1

Robert Engels and Christiane Theusinger. Using a data metric for preprocessing advice for data mining applications. In 13th European Conference on Artificial Intelligence (ECAI), pages 430 – 434, 1998.

classmethod ft_sd_ratio(N: ndarray, y: ndarray, ddof: int = 1, classes: Optional[ndarray] = None, class_freqs: Optional[ndarray] = None) float[source]

Compute a statistical test for homogeneity of covariances.

The test applied is the Box’s M Test for equivalence of covariances.

The null hypothesis of this test states that the covariance matrices of the instances of every class are equal.

Parameters
N : np.ndarray

Fitted numerical data.

y : np.ndarray

Target attribute.

ddof : int, optional

Degrees of freedom for the covariance matrix, calculated during this test.

classes : np.ndarray, optional

All distinct classes in target attribute y. Used to exploit precomputations.

class_freqs : np.ndarray, optional

Absolute frequencies of each distinct class in target attribute y or classes. If classes is given, then this argument must be paired with it by index.

Returns
float

Homogeneity of covariances test result.

Notes

For details about how this test is applied, check Rivolli et al. (page 32).

References

1

Donald Michie, David J. Spiegelhalter, Charles C. Taylor, and John Campbell. Machine Learning, Neural and Statistical Classification, volume 37. Ellis Horwood Upper Saddle River, 1994.

classmethod ft_skewness(N: ndarray, method: int = 3, bias: bool = True) ndarray[source]

Compute the skewness for each attribute.

Parameters
N : np.ndarray

Fitted numerical data.

method : int, optional

Defines the strategy used to estimate data skewness. This argument is used for compatibility with the R package e1071. The options must be one of the following:

Option    Formula
------    -------
1         Skew_1 = m_3 / m_2**(3/2) (default of scipy.stats)
2         Skew_2 = Skew_1 * sqrt(n * (n-1)) / (n-2)
3         Skew_3 = m_3 / s**3 = Skew_1 * ((n-1)/n)**(3/2)

Where n is the number of instances in N, s is the standard deviation of each attribute in N, and m_i is the ith statistical momentum of each attribute in N.

Note that if the selected method cannot be calculated due to division by zero, then the first method is used instead.

bias : bool, optional

If False, then the calculations are corrected for statistical bias.

Returns
np.ndarray

Attribute skewness.

References

1

Donald Michie, David J. Spiegelhalter, Charles C. Taylor, and John Campbell. Machine Learning, Neural and Statistical Classification, volume 37. Ellis Horwood Upper Saddle River, 1994.
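The three options can be sketched analogously to ft_kurtosis (an illustrative, biased-estimate sketch of the table above, not pymfe's code; skewness_sketch is a hypothetical name):

    import numpy as np

    def skewness_sketch(x, method=3):
        # Central moments of the attribute values.
        n = x.size
        d = x - x.mean()
        m2, m3 = np.mean(d ** 2), np.mean(d ** 3)
        skew_1 = m3 / m2 ** 1.5                      # option 1
        if method == 1:
            return skew_1
        if method == 2:                              # option 2
            return skew_1 * np.sqrt(n * (n - 1.0)) / (n - 2.0)
        return skew_1 * ((n - 1.0) / n) ** 1.5       # option 3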

classmethod ft_sparsity(X: ndarray, normalize: bool = True) ndarray[source]

Compute (possibly normalized) sparsity metric for each attribute.

Sparsity S of a vector v of numeric values is defined as

S(v) = (1.0 / (n - 1.0)) * ((n / phi(v)) - 1.0),

where
  • n is the number of instances in dataset X.

  • phi(v) is the number of distinct values in v.

Parameters
X : np.ndarray

Fitted numerical data.

normalize : bool, optional

If True, then the output is S(v) as shown above. Otherwise, the output is not multiplied by the (1.0 / (n - 1.0)) factor (i.e., the output is defined as S'(v) = (n / phi(v)) - 1.0).

Returns
np.ndarray

Attribute sparsities.

References

1

Mostafa A. Salama, Aboul Ella Hassanien, and Kenneth Revett. Employment of neural network and rough set in meta-learning. Memetic Computing, 5(3):165 – 177, 2013.
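The definition above maps directly to NumPy (illustrative only; sparsity_sketch is a hypothetical name):

    import numpy as np

    def sparsity_sketch(X, normalize=True):
        n = X.shape[0]
        # phi(v): number of distinct values in each attribute v.
        phi = np.array([np.unique(col).size for col in X.T])
        s = n / phi - 1.0
        return s / (n - 1.0) if normalize else s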

classmethod ft_t_mean(N: ndarray, pcut: float = 0.2) ndarray[source]

Compute the trimmed mean of each attribute.

Parameters
N : np.ndarray

Fitted numerical data.

pcut : float, optional

Percentage of cut from both the lower and higher values. This value should be in the interval [0.0, 0.5); if 0.0, the return value is the default mean calculation.

Returns
np.ndarray

Attribute trimmed means.

References

1

Robert Engels and Christiane Theusinger. Using a data metric for preprocessing advice for data mining applications. In 13th European Conference on Artificial Intelligence (ECAI), pages 430 – 434, 1998.

classmethod ft_var(N: ndarray, ddof: int = 1) ndarray[source]

Compute the variance of each attribute.

Parameters
N : np.ndarray

Fitted numerical data.

ddof : float, optional

Degrees of freedom for variance.

Returns
np.ndarray

Attribute variances.

References

1

Ciro Castiello, Giovanna Castellano, and Anna Maria Fanelli. Meta-data: Characterization of input features for meta-learning. In 2nd International Conference on Modeling Decisions for Artificial Intelligence (MDAI), pages 457–468, 2005.

classmethod ft_w_lambda(N: ndarray, y: ndarray, can_cor_eigvals: Optional[ndarray] = None, can_cors: Optional[ndarray] = None) float[source]

Compute the Wilks’ Lambda value.

The Wilks’ Lambda L is calculated as:

L = prod(1.0 / (1.0 + can_cor_eig_i))

Where can_cor_eig_i is the ith eigenvalue related to the ith canonical correlation can_cor_i between the attributes in N and the binarized (one-hot encoded) version of y.

The relationship between can_cor_eig_i and can_cor_i is given by:

can_cor_i = sqrt(can_cor_eig_i / (1 + can_cor_eig_i))

Or, equivalently:

can_cor_eig_i = can_cor_i**2 / (1 - can_cor_i**2)

Parameters
N : np.ndarray

Fitted numerical data.

y : np.ndarray

Target attribute.

can_cor_eigvals : np.ndarray, optional

Eigenvalues associated with the canonical correlations of N and one-hot encoded y. This argument is used to exploit precomputations. The relationship between the ith canonical correlation can_cor_i and its eigenvalue is:

can_cor_i = sqrt(can_cor_eigval_i / (1 + can_cor_eigval_i))

Or, equivalently:

can_cor_eigval_i = can_cor_i**2 / (1 - can_cor_i**2)

can_cors : np.ndarray, optional

Canonical correlations between N and the one-hot encoded version of y. Argument used to take advantage of precomputations. Used only if can_cor_eigvals is None.

Returns
float

Wilks’ Lambda value.

References

1

Guido Lindner and Rudi Studer. AST: Support for algorithm selection with a CBR approach. In European Conference on Principles of Data Mining and Knowledge Discovery (PKDD), pages 418 – 423, 1999.
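Given the eigenvalues, the value is a single product (a minimal sketch of the formula above; w_lambda_sketch is a hypothetical name):

    import numpy as np

    def w_lambda_sketch(can_cor_eigvals):
        # L = product over i of 1 / (1 + can_cor_eigval_i).
        ev = np.asarray(can_cor_eigvals, dtype=float)
        return float(np.prod(1.0 / (1.0 + ev)))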

classmethod precompute_can_cors(N: Optional[ndarray] = None, y: Optional[ndarray] = None, **kwargs) Dict[str, Any][source]

Precompute canonical correlations and their eigenvalues.

Parameters
N : np.ndarray, optional

Numerical fitted data.

y : np.ndarray, optional

Target attribute.

kwargs:

Additional arguments. May contain values previously precomputed by other precomputation methods, which can help speed up this precomputation.

Returns
dict
With the following precomputed items:
  • can_cors (np.ndarray): canonical correlations between N and the one-hot encoded version of y.

  • can_cor_eigvals (np.ndarray): eigenvalues related to the canonical correlations.

classmethod precompute_statistical_class(y: Optional[ndarray] = None, **kwargs) Dict[str, Any][source]

Precompute distinct classes and their absolute frequencies from y.

Parameters
y : np.ndarray, optional

The target attribute from fitted data.

kwargs:

Additional arguments. May contain values previously precomputed by other precomputation methods, which can help speed up this precomputation.

Returns
dict
With the following precomputed items:
  • classes (np.ndarray): distinct classes of y, if y is not NoneType.

  • class_freqs (np.ndarray): absolute class frequencies of y, if y is not NoneType.

classmethod precompute_statistical_cor_cov(N: Optional[ndarray] = None, ddof: int = 1, **kwargs) Dict[str, Any][source]

Precompute the correlation and covariance matrices of numerical data.

Be cautious about enabling this precomputation method on huge datasets, as it may be very memory hungry.

Parameters
N : np.ndarray, optional

Numerical fitted data.

ddof : int, optional

Degrees of freedom of covariance matrix.

kwargs:

Additional arguments. May contain values previously precomputed by other precomputation methods, which can help speed up this precomputation.

Returns
dict
With the following precomputed items:
  • cov_mat (np.ndarray): covariance matrix of N, if N is not NoneType.

  • abs_corr_mat (np.ndarray): absolute correlation matrix of N, if N is not NoneType.