pymfe.complexity.MFEComplexity

class pymfe.complexity.MFEComplexity[source]

Keep methods for metafeatures of Complexity group.

The convention adopted for metafeature extraction related methods is to always start with the ft_ prefix to allow automatic method detection. This prefix is predefined within the _internal module.

All method signatures follow the conventions and restrictions listed below:

  1. For independent attribute data, X means every type of attribute, N means numeric attributes only, and C stands for categorical attributes only. It is important to note that the categorical attribute sets between X and C, and the numerical attribute sets between X and N, may differ due to data transformations performed while fitting data into the MFE model, enabled by, respectively, the transform_num and transform_cat arguments of fit (MFE method).

  2. Only arguments in the MFE _custom_args_ft attribute (set up inside the fit method) are allowed to be required method arguments. All other arguments must be strictly optional (i.e., have a predefined default value).

  3. The initial assumption is that the user can change any optional argument, without any previous verification of the argument value or its type, via the kwargs argument of the extract method of the MFE class.

  4. The return value of all feature extraction methods should be a single value or a generic list type (preferably an np.ndarray) with numeric values.

There is another type of method adopted for automatic detection: methods with the precompute_ prefix. These methods run automatically while fitting data into an MFE model, and their objective is to precompute some common value shared between more than one feature extraction method. This strategy is a trade-off between higher system memory consumption and faster feature extraction. Their return value must always be a dictionary whose keys are possible extra arguments for both feature extraction methods and other precomputation methods. Note that precomputed values are shared between all valid feature-extraction modules (e.g., class_freqs computed in the statistical module can freely be used by any precomputation or feature extraction method of the landmarking module).
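The precomputation convention described above can be sketched with a minimal, hypothetical example (the function name and dictionary keys are illustrative, not pymfe's actual implementation): a precompute_ method checks kwargs so it never recomputes a value another precomputation already supplied, and returns a dict whose keys match optional arguments of feature-extraction methods.

```python
import numpy as np

def precompute_class_freqs(y=None, **kwargs):
    """Hypothetical precompute_* method following the convention above:
    it returns a dict whose keys ("classes", "class_freqs") match optional
    arguments of feature-extraction methods such as ft_c1 and ft_c2."""
    precomp = {}
    if y is not None and not {"classes", "class_freqs"}.issubset(kwargs):
        # Compute once; the MFE model shares the result with every
        # feature-extraction method that declares these arguments.
        classes, class_freqs = np.unique(y, return_counts=True)
        precomp["classes"] = classes
        precomp["class_freqs"] = class_freqs
    return precomp
```

If the keys are already in kwargs (supplied by an earlier precomputation), the method returns an empty dict and nothing is recomputed.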

__init__(*args, **kwargs)

Methods

__init__(*args, **kwargs)

ft_c1(y[, class_freqs])

Compute the entropy of class proportions.

ft_c2(y[, class_freqs])

Compute the imbalance ratio.

ft_cls_coef(N, y[, metric, p, radius_frac, ...])

Clustering coefficient.

ft_density(N, y[, metric, p, radius_frac, ...])

Average density of the network.

ft_f1(N, y[, cls_inds, class_freqs])

Maximum Fisher's discriminant ratio.

ft_f1v(N, y[, ovo_comb, cls_inds, class_freqs])

Directional-vector maximum Fisher's discriminant ratio.

ft_f2(N, y[, ovo_comb, cls_inds])

Volume of the overlapping region.

ft_f3(N, y[, ovo_comb, cls_inds, class_freqs])

Compute feature maximum individual efficiency.

ft_f4(N, y[, ovo_comb, cls_inds, class_freqs])

Compute the collective feature efficiency.

ft_hubs(N, y[, metric, p, radius_frac, ...])

Hub score.

ft_l1(N, y[, ovo_comb, cls_inds, ...])

Sum of error distance by linear programming.

ft_l2(N, y[, ovo_comb, cls_inds, ...])

Compute the OVO subsets error rate of linear classifier.

ft_l3(N, y[, ovo_comb, cls_inds, ...])

Non-Linearity of a linear classifier.

ft_lsc(N, y[, metric, p, cls_inds, ...])

Local set average cardinality.

ft_n1(N, y[, metric, p, N_scaled, norm_dist_mat])

Compute the fraction of borderline points.

ft_n2(N, y[, metric, p, class_freqs, ...])

Ratio of intra and extra class nearest neighbor distance.

ft_n3(N, y[, metric, p, N_scaled, norm_dist_mat])

Error rate of the nearest neighbor classifier.

ft_n4(N, y[, metric, p, n_neighbors, ...])

Compute the non-linearity of the k-NN Classifier.

ft_t1(N, y[, metric, p, cls_inds, N_scaled, ...])

Fraction of hyperspheres covering data.

ft_t2(N)

Compute the average number of features per dimension.

ft_t3(N[, num_attr_pca, random_state])

Compute the average number of PCA dimensions per point.

ft_t4(N[, num_attr_pca, random_state])

Compute the ratio of the PCA dimension to the original dimension.

precompute_adjacency_graph(N[, y, metric, ...])

Precompute values related to the dataset adjacency graph.

precompute_complexity([y])

Precompute some useful things to support feature-based measures.

precompute_complexity_svm([y, max_iter, ...])

Initialize a Support Vector Classifier pipeline (with data standardization).

precompute_nearest_enemy(N[, y, metric, p])

Precompute instances nearest enemy related values.

precompute_norm_dist_mat(N[, metric, p])

Precompute normalized N and pairwise distances among instances.

precompute_pca_tx(N[, tx_n_components, ...])

Precompute PCA to support dimensionality measures.

classmethod ft_c1(y: ndarray, class_freqs: Optional[ndarray] = None) → float [source]

Compute the entropy of class proportions.

This measure is in [0, 1] range.

Parameters
y : np.ndarray

Target attribute.

class_freqs : np.ndarray, optional

The number of examples in each class. The indices correspond to the classes.

Returns
float

Entropy of class proportions.

References

1

Ana C. Lorena, Luís P. F. Garcia, Jens Lehmann, Marcilio C. P. Souto, and Tin K. Ho. How Complex is your classification problem? A survey on measuring classification complexity (V2). (2019) (Cited on page 15). Published in ACM Computing Surveys (CSUR), Volume 52 Issue 5, October 2019, Article No. 107.
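As a minimal numpy sketch of the quantity behind this measure (the Shannon entropy of the class proportions, normalized by the log of the number of classes; the helper name is hypothetical, and the library may orient the final value differently, e.g., by complementing it):

```python
import numpy as np

def class_proportion_entropy(y):
    # Normalized Shannon entropy of the class proportions, in [0, 1]:
    # 1.0 for perfectly balanced classes, approaching 0.0 as one class
    # dominates.
    _, class_freqs = np.unique(y, return_counts=True)
    p = class_freqs / y.size
    return float(-np.sum(p * np.log(p)) / np.log(p.size))
```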

classmethod ft_c2(y: ndarray, class_freqs: Optional[ndarray] = None) → float [source]

Compute the imbalance ratio.

This measure is in [0, 1] range.

Parameters
y : np.ndarray

Target attribute.

class_freqs : np.ndarray, optional

The number of examples in each class. The indices correspond to the classes.

Returns
float

The imbalance ratio.

References

1

Ana C. Lorena, Luís P. F. Garcia, Jens Lehmann, Marcilio C. P. Souto, and Tin K. Ho. How Complex is your classification problem? A survey on measuring classification complexity (V2). (2019) (Cited on page 16). Published in ACM Computing Surveys (CSUR), Volume 52 Issue 5, October 2019, Article No. 107.
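A minimal numpy sketch of C2 as defined in the survey cited above (helper name hypothetical): the imbalance ratio IR = ((c - 1)/c) * Σ_i n_i/(n - n_i) is reported as 1 - 1/IR, which equals 0 for a perfectly balanced problem and approaches 1 as the imbalance grows.

```python
import numpy as np

def imbalance_ratio_c2(y):
    # C2 = 1 - 1/IR, with IR = ((c - 1) / c) * sum_i n_i / (n - n_i).
    _, n_i = np.unique(y, return_counts=True)
    c, n = n_i.size, y.size
    ir = (c - 1) / c * np.sum(n_i / (n - n_i))
    return float(1.0 - 1.0 / ir)
```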

classmethod ft_cls_coef(N: ndarray, y: ndarray, metric: str = 'gower', p: float = 2.0, radius_frac: Union[int, float] = 0.15, n_jobs: Optional[int] = None, cls_inds: Optional[ndarray] = None, norm_dist_mat: Optional[ndarray] = None, adj_graph: Optional[_construct_graph_from_weighted_adjacency] = None) → float [source]

Clustering coefficient.

The clustering coefficient of a vertex v_i is given by the ratio of the number of edges between its neighbors (in a Same-class Radius Neighbor Graph) and the maximum number of edges that could possibly exist between them.

This measure is in [0, 1] range.

Parameters
N : np.ndarray

Numerical fitted data.

y : np.ndarray

Target attribute.

radius_frac : float or int, optional

Maximum distance between each pair of instances of the same class to both be considered neighbors of each other. Note that each feature of N is first normalized into the [0, 1] range before the neighbor calculations.

metric : str, optional

Metric used to calculate the distances between the instances. Check the scipy.spatial.distance.cdist documentation to get a list of all available metrics. This argument is used only if norm_dist_mat is None.

p : int, optional

Power parameter for the Minkowski metric. When p = 1, this is equivalent to using Manhattan distance (l1), and Euclidean distance (l2) for p = 2. For arbitrary p, Minkowski distance (l_p) is used. Used only if norm_dist_mat is None.

cls_inds : np.ndarray, optional

Boolean array which indicates the examples of each class. The rows correspond to each distinct class, and the columns correspond to the instances.

N_scaled : np.ndarray, optional

Numerical data N with each feature normalized in [0, 1] range. Used only if norm_dist_mat is None. Used to take advantage of precomputations.

norm_dist_mat : np.ndarray, optional

Square matrix with the pairwise distances between each instance in N_scaled, i.e., between the normalized instances. Used to take advantage of precomputations.

Returns
float

Clustering coefficient of given data.

References

1

Ana C. Lorena, Luís P. F. Garcia, Jens Lehmann, Marcilio C. P. Souto, and Tin K. Ho. How Complex is your classification problem? A survey on measuring classification complexity (V2). (2019) (Cited on page 9). Published in ACM Computing Surveys (CSUR), Volume 52 Issue 5, October 2019, Article No. 107.

classmethod ft_density(N: ndarray, y: ndarray, metric: str = 'gower', p: float = 2.0, radius_frac: Union[int, float] = 0.15, n_jobs: Optional[int] = None, cls_inds: Optional[ndarray] = None, norm_dist_mat: Optional[ndarray] = None, adj_graph: Optional[_construct_graph_from_weighted_adjacency] = None) → float [source]

Average density of the network.

This measure considers the number of edges that are retained in the graph (Same-class Radius Nearest Neighbors) built from the dataset normalized by the maximum number of edges between y.size instances.

This measure is in [0, 1] range.

Parameters
N : np.ndarray

Numerical fitted data.

y : np.ndarray

Target attribute.

radius_frac : float or int, optional

Maximum distance between each pair of instances of the same class to both be considered neighbors of each other. Note that each feature of N is first normalized into the [0, 1] range before the neighbor calculations.

metric : str, optional

Metric used to calculate the distances between the instances. Check the scipy.spatial.distance.cdist documentation to get a list of all available metrics. This argument is used only if norm_dist_mat is None.

p : int, optional

Power parameter for the Minkowski metric. When p = 1, this is equivalent to using Manhattan distance (l1), and Euclidean distance (l2) for p = 2. For arbitrary p, Minkowski distance (l_p) is used. Used only if norm_dist_mat is None.

cls_inds : np.ndarray, optional

Boolean array which indicates the examples of each class. The rows correspond to each distinct class, and the columns correspond to the instances.

N_scaled : np.ndarray, optional

Numerical data N with each feature normalized in [0, 1] range. Used only if norm_dist_mat is None. Used to take advantage of precomputations.

norm_dist_mat : np.ndarray, optional

Square matrix with the pairwise distances between each instance in N_scaled, i.e., between the normalized instances. Used to take advantage of precomputations.

Returns
float

Complement of the ratio between the total number of edges in the Radius Nearest Neighbors graph and the total number of edges that could possibly exist in a graph with the given number of instances.

References

1

Ana C. Lorena, Luís P. F. Garcia, Jens Lehmann, Marcilio C. P. Souto, and Tin K. Ho. How Complex is your classification problem? A survey on measuring classification complexity (V2). (2019) (Cited on page 9). Published in ACM Computing Surveys (CSUR), Volume 52 Issue 5, October 2019, Article No. 107.

classmethod ft_f1(N: ndarray, y: ndarray, cls_inds: Optional[ndarray] = None, class_freqs: Optional[ndarray] = None) → ndarray [source]

Maximum Fisher’s discriminant ratio.

It measures the overlap between the values of the features in different classes.

The average value of this measure is in [0, 1] range.

Parameters
N : np.ndarray

Numerical fitted data.

y : np.ndarray

Target attribute.

cls_inds : np.ndarray, optional

Boolean array which indicates the examples of each class. The rows correspond to each distinct class, and the columns correspond to the instances.

class_freqs : np.ndarray, optional

The number of examples in each class. The indices correspond to the classes.

Returns
np.ndarray

Inverse of all Fisher’s discriminant ratios.

References

1

Ana C. Lorena, Luís P. F. Garcia, Jens Lehmann, Marcilio C. P. Souto, and Tin K. Ho. How Complex is your classification problem? A survey on measuring classification complexity (V2). (2019) (Cited on page 9). Published in ACM Computing Surveys (CSUR), Volume 52 Issue 5, October 2019, Article No. 107.

2

Ramón A Mollineda, José S Sánchez, and José M Sotoca. Data characterization for effective prototype selection. In 2nd Iberian Conference on Pattern Recognition and Image Analysis (IbPRIA), pages 27–34, 2005.
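A minimal numpy sketch of the per-feature multiclass Fisher discriminant ratio underlying F1 (helper name hypothetical; it follows the between-class/within-class formulation in Lorena et al., 2019, with each ratio reported as 1/(1 + r_f), so that heavier overlap maps to values near 1):

```python
import numpy as np

def fisher_ratio_inverse(N, y):
    # r_f = sum_c n_c (mu_cf - mu_f)^2 / sum_c sum_{x in c} (x_f - mu_cf)^2,
    # computed per feature f; returns 1 / (1 + r_f) for every feature.
    mu = N.mean(axis=0)
    between = np.zeros(N.shape[1])
    within = np.zeros(N.shape[1])
    for c in np.unique(y):
        N_c = N[y == c]
        mu_c = N_c.mean(axis=0)
        between += N_c.shape[0] * (mu_c - mu) ** 2
        within += ((N_c - mu_c) ** 2).sum(axis=0)
    return 1.0 / (1.0 + between / within)
```

A feature that separates the classes well yields a large r_f and hence a value close to 0.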

classmethod ft_f1v(N: ndarray, y: ndarray, ovo_comb: Optional[ndarray] = None, cls_inds: Optional[ndarray] = None, class_freqs: Optional[ndarray] = None) → ndarray [source]

Directional-vector maximum Fisher’s discriminant ratio.

This measure searches for a vector which can separate the two classes after the examples have been projected into it and considers a directional Fisher criterion. Check the references for more information.

The average value of this measure is in [0, 1] range.

Parameters
N : np.ndarray

Numerical fitted data.

y : np.ndarray

Target attribute.

ovo_comb : np.ndarray, optional

List of all class OVO combinations, i.e., all combinations of distinct class indices by pairs ([(0, 1), (0, 2), ...]).

cls_inds : np.ndarray, optional

Boolean array which indicates the examples of each class. The rows correspond to each distinct class, and the columns correspond to the instances.

class_freqs : np.ndarray, optional

The number of examples in each class. The indices correspond to the classes.

Returns
np.ndarray

Inverse of directional vector of Fisher’s discriminant ratio.

References

1

Ana C. Lorena, Luís P. F. Garcia, Jens Lehmann, Marcilio C. P. Souto, and Tin K. Ho. How Complex is your classification problem? A survey on measuring classification complexity (V2). (2019) (Cited on page 9). Published in ACM Computing Surveys (CSUR), Volume 52 Issue 5, October 2019, Article No. 107.

2

Witold Malina. Two-parameter fisher criterion. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 31(4):629–636, 2001.

classmethod ft_f2(N: ndarray, y: ndarray, ovo_comb: Optional[ndarray] = None, cls_inds: Optional[ndarray] = None) → ndarray [source]

Volume of the overlapping region.

This measure calculates the overlap of the distributions of the features values within the classes.

This measure is in [0, 1] range.

Parameters
N : np.ndarray

Numerical fitted data.

y : np.ndarray

Fitted target attribute.

ovo_comb : np.ndarray, optional

List of all class OVO combinations, i.e., all combinations of distinct class indices by pairs ([(0, 1), (0, 2), ...]).

cls_inds : np.ndarray, optional

Boolean array which indicates the examples of each class. The rows correspond to each distinct class, and the columns correspond to the instances.

Returns
np.ndarray

Volume of the overlapping region for each OVO combination.

References

1

Ana C. Lorena, Luís P. F. Garcia, Jens Lehmann, Marcilio C. P. Souto, and Tin K. Ho. How Complex is your classification problem? A survey on measuring classification complexity (V2). (2019) (Cited on page 9). Published in ACM Computing Surveys (CSUR), Volume 52 Issue 5, October 2019, Article No. 107.

2

Marcilio C P Souto, Ana C Lorena, Newton Spolaôr, and Ivan G Costa. Complexity measures of supervised classification tasks: a case study for cancer gene expression data. In International Joint Conference on Neural Networks (IJCNN), pages 1352–1358, 2010.

3

Lisa Cummins. Combining and Choosing Case Base Maintenance Algorithms. PhD thesis, National University of Ireland, Cork, 2013.

classmethod ft_f3(N: ndarray, y: ndarray, ovo_comb: Optional[ndarray] = None, cls_inds: Optional[ndarray] = None, class_freqs: Optional[ndarray] = None) → ndarray [source]

Compute feature maximum individual efficiency.

The average value of this measure is in [0, 1] range.

Parameters
N : np.ndarray

Numerical fitted data.

y : np.ndarray

Target attribute.

ovo_comb : np.ndarray, optional

List of all class OVO combinations, i.e., all combinations of distinct class indices by pairs ([(0, 1), (0, 2), ...]).

cls_inds : np.ndarray, optional

Boolean array which indicates the examples of each class. The rows correspond to each distinct class, and the columns correspond to the instances.

class_freqs : np.ndarray, optional

The number of examples in each class. The indices correspond to the classes.

Returns
np.ndarray

An array with the maximum individual feature efficiency measure for each feature.

References

1

Ana C. Lorena, Luís P. F. Garcia, Jens Lehmann, Marcilio C. P. Souto, and Tin K. Ho. How Complex is your classification problem? A survey on measuring classification complexity (V2). (2019) (Cited on page 6). Published in ACM Computing Surveys (CSUR), Volume 52 Issue 5, October 2019, Article No. 107.

classmethod ft_f4(N: ndarray, y: ndarray, ovo_comb: Optional[ndarray] = None, cls_inds: Optional[ndarray] = None, class_freqs: Optional[ndarray] = None) → ndarray [source]

Compute the collective feature efficiency.

The average value of this measure is in [0, 1] range.

Parameters
N : np.ndarray

Numerical fitted data.

y : np.ndarray

Target attribute.

ovo_comb : np.ndarray, optional

List of all class OVO combinations, i.e., all combinations of distinct class indices by pairs ([(0, 1), (0, 2), ...]).

cls_inds : np.ndarray, optional

Boolean array which indicates the examples of each class. The rows correspond to each distinct class, and the columns correspond to the instances.

class_freqs : np.ndarray, optional

The number of examples in each class. The indices correspond to the classes.

Returns
np.ndarray

An array with the collective feature efficiency measure for each feature.

References

1

Ana C. Lorena, Luís P. F. Garcia, Jens Lehmann, Marcilio C. P. Souto, and Tin K. Ho. How Complex is your classification problem? A survey on measuring classification complexity (V2). (2019) (Cited on page 7). Published in ACM Computing Surveys (CSUR), Volume 52 Issue 5, October 2019, Article No. 107.

classmethod ft_hubs(N: ndarray, y: ndarray, metric: str = 'gower', p: float = 2.0, radius_frac: Union[int, float] = 0.15, n_jobs: Optional[int] = None, cls_inds: Optional[ndarray] = None, norm_dist_mat: Optional[ndarray] = None, adj_graph: Optional[_construct_graph_from_weighted_adjacency] = None) → ndarray [source]

Hub score.

The hub score scores each node by the number of connections it has to other nodes, weighted by the number of connections these neighbors have.

The values of node hub score are given by the principal eigenvector of (A.t * A), where A is the adjacency matrix of the graph.

The average value of this measure is in [0, 1] range.

Parameters
N : np.ndarray

Numerical fitted data.

y : np.ndarray

Target attribute.

metric : str, optional

Metric used to calculate the distances between the instances. Check the scipy.spatial.distance.cdist documentation to get a list of all available metrics. Used only if adj_graph is None.

p : int, optional

Power parameter for the Minkowski metric. When p = 1, this is equivalent to using Manhattan distance (l1), and Euclidean distance (l2) for p = 2. For arbitrary p, Minkowski distance (l_p) is used. Used only if adj_graph is None.

radius_frac : float or int, optional

If int, maximum number of neighbors of the same class for each instance. If float, the maximum number of neighbors is computed as radius_frac * len(N). Used only if adj_graph is None.

n_jobs : int or None, optional

Number of parallel processes to compute nearest neighbors. Used only if adj_graph is None.

cls_inds : np.ndarray, optional

Boolean array which indicates the examples of each class. The rows correspond to each distinct class, and the columns correspond to the instances. Used only if adj_graph is None.

norm_dist_mat : np.ndarray, optional

Normalized distance matrix.

adj_graph : igraph.Graph.Weighted_Adjacency, optional

Undirected and weighted adjacency graph for the dataset. Only instances belonging to the same class must be connected. If not provided, it is computed using metric, p, and radius_frac.

Returns
np.ndarray

Complement of the hub score of every node.

References

1

Ana C. Lorena, Luís P. F. Garcia, Jens Lehmann, Marcilio C. P. Souto, and Tin K. Ho. How Complex is your classification problem? A survey on measuring classification complexity (V2). (2019) (Cited on page 9). Published in ACM Computing Surveys (CSUR), Volume 52 Issue 5, October 2019, Article No. 107.
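The eigenvector computation described above can be sketched with numpy (a hypothetical helper assuming a symmetric, undirected adjacency matrix; the library builds the same-class graph with igraph and then reports the complement of the scores):

```python
import numpy as np

def hub_scores(adj):
    # Principal eigenvector of (A^T A), scaled so the largest hub score
    # is 1. For a symmetric A, hub and authority scores coincide.
    vals, vecs = np.linalg.eigh(adj.T @ adj)
    principal = np.abs(vecs[:, np.argmax(vals)])
    return principal / principal.max()
```

On a fully symmetric graph (e.g., a triangle), every node gets the same hub score.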

classmethod ft_l1(N: ndarray, y: ndarray, ovo_comb: Optional[ndarray] = None, cls_inds: Optional[ndarray] = None, class_freqs: Optional[ndarray] = None, svc_pipeline: Optional[Pipeline] = None, max_iter: Union[int, float] = 100000.0, random_state: Optional[int] = None) → ndarray [source]

Sum of error distance by linear programming.

This measure assesses whether the data are linearly separable by computing, for a dataset, the sum of the distances of incorrectly classified examples to a linear boundary used in their classification. If the value of L1 is zero, then the problem is linearly separable and can be considered simpler than a problem for which a non-linear boundary is required.

The average value of this measure is in [0, 1] range.

Parameters
N : np.ndarray

Numerical fitted data.

y : np.ndarray

Target attribute.

ovo_comb : np.ndarray, optional

List of all class OVO combinations, i.e., all combinations of distinct class indices by pairs ([(0, 1), (0, 2), ...]).

cls_inds : np.ndarray, optional

Boolean array which indicates the examples of each class. The rows correspond to each distinct class, and the columns correspond to the instances.

class_freqs : np.ndarray, optional

The number of examples in each class. The indices correspond to the classes.

max_iter : float or int, optional

Maximum number of iterations allowed for the support vector machine model convergence. This parameter can receive float numbers to be compatible with the Python scientific notation data type. Used only if svc_pipeline is None.

svc_pipeline : sklearn.pipeline.Pipeline, optional

Support Vector Classifier learning pipeline. Traditionally, the pipeline used is a data standardization (mean = 0 and variance = 1) before the learning model, which is a Support Vector Classifier (linear kernel). However, any variation of this pipeline can also be used. Note that this metafeature is formulated using a linear classifier. If this argument is None, the described pipeline (standardization + SVC) is used by default.

random_state : int, optional

Random seed for dual coordinate descent while fitting the Support Vector Classifier model. Check the sklearn.svm.LinearSVC documentation (random_state parameter) for more information. Used only if svc_pipeline is None.

Returns
np.ndarray

Complement of the inverse of the sum of distances from a Support Vector Classifier (SVC) hyperplane of incorrectly classified instances.

References

1

Ana C. Lorena, Luís P. F. Garcia, Jens Lehmann, Marcilio C. P. Souto, and Tin K. Ho. How Complex is your classification problem? A survey on measuring classification complexity (V2). (2019) (Cited on page 9). Published in ACM Computing Surveys (CSUR), Volume 52 Issue 5, October 2019, Article No. 107.

classmethod ft_l2(N: ndarray, y: ndarray, ovo_comb: Optional[ndarray] = None, cls_inds: Optional[ndarray] = None, svc_pipeline: Optional[Pipeline] = None, max_iter: Union[int, float] = 100000.0, random_state: Optional[int] = None) → ndarray [source]

Compute the OVO subsets error rate of linear classifier.

The linear model used is induced by the Support Vector Machine algorithm.

The average value of this measure is in [0, 1] range.

Parameters
N : np.ndarray

Numerical fitted data.

y : np.ndarray

Target attribute.

ovo_comb : np.ndarray, optional

List of all class OVO combinations, i.e., all combinations of distinct class indices by pairs ([(0, 1), (0, 2), ...]).

cls_inds : np.ndarray, optional

Boolean array which indicates the examples of each class. The rows correspond to each distinct class, and the columns correspond to the instances.

svc_pipeline : sklearn.pipeline.Pipeline, optional

Support Vector Classifier learning pipeline. Traditionally, the pipeline used is a data standardization (mean = 0 and variance = 1) before the learning model, which is a Support Vector Classifier (linear kernel). However, any variation of this pipeline can also be used. Note that this metafeature is formulated using a linear classifier. If this argument is None, the described pipeline (standardization + SVC) is used by default.

max_iter : float or int, optional

Maximum number of iterations allowed for the support vector machine model convergence. This parameter can receive float numbers to be compatible with the Python scientific notation data type. Used only if svc_pipeline is None.

random_state : int, optional

Random seed for dual coordinate descent while fitting the Support Vector Classifier model. Check the sklearn.svm.LinearSVC documentation (random_state parameter) for more information. Used only if svc_pipeline is None.

Returns
np.ndarray

An array with the collective error rate of linear classifier measure for each OVO subset.

References

1

Ana C. Lorena, Luís P. F. Garcia, Jens Lehmann, Marcilio C. P. Souto, and Tin K. Ho. How Complex is your classification problem? A survey on measuring classification complexity (V2). (2019) (Cited on page 9). Published in ACM Computing Surveys (CSUR), Volume 52 Issue 5, October 2019, Article No. 107.

classmethod ft_l3(N: ndarray, y: ndarray, ovo_comb: Optional[ndarray] = None, cls_inds: Optional[ndarray] = None, svc_pipeline: Optional[Pipeline] = None, max_iter: Union[int, float] = 100000.0, random_state: Optional[int] = None) → ndarray [source]

Non-Linearity of a linear classifier.

This index is sensitive to how the data from a class are distributed in the border regions and also to how much the convex hulls which delimit the classes overlap. In particular, it detects the presence of concavities in the class boundaries. Higher values indicate a greater complexity.

The average value of this measure is in [0, 1] range.

Parameters
N : np.ndarray

Numerical fitted data.

y : np.ndarray

Target attribute.

ovo_comb : np.ndarray, optional

List of all class OVO combinations, i.e., all combinations of distinct class indices by pairs ([(0, 1), (0, 2), ...]).

cls_inds : np.ndarray, optional

Boolean array which indicates the examples of each class. The rows correspond to each distinct class, and the columns correspond to the instances.

svc_pipeline : sklearn.pipeline.Pipeline, optional

Support Vector Classifier learning pipeline. Traditionally, the pipeline used is a data standardization (mean = 0 and variance = 1) before the learning model, which is a Support Vector Classifier (linear kernel). However, any variation of this pipeline can also be used. Note that this metafeature is formulated using a linear classifier. If this argument is None, the described pipeline (standardization + SVC) is used by default.

max_iter : float or int, optional

Maximum number of iterations allowed for the support vector machine model convergence. This parameter can receive float numbers to be compatible with the Python scientific notation data type. Used only if svc_pipeline is None.

random_state : int, optional

Random seed for dual coordinate descent while fitting the Support Vector Classifier model. Check the sklearn.svm.LinearSVC documentation (random_state parameter) for more information. Used only if svc_pipeline is None.

Returns
np.ndarray

Zero-one losses of a Support Vector Classifier for a randomly interpolated dataset using the original instances. The classes are separated in an OVO (One-Versus-One) fashion.

References

1

Ana C. Lorena, Luís P. F. Garcia, Jens Lehmann, Marcilio C. P. Souto, and Tin K. Ho. How Complex is your classification problem? A survey on measuring classification complexity (V2). (2019) (Cited on page 9). Published in ACM Computing Surveys (CSUR), Volume 52 Issue 5, October 2019, Article No. 107.

classmethod ft_lsc(N: ndarray, y: ndarray, metric: str = 'gower', p: Union[int, float] = 2, cls_inds: Optional[ndarray] = None, N_scaled: Optional[ndarray] = None, norm_dist_mat: Optional[ndarray] = None, nearest_enemy_dist: Optional[ndarray] = None) → float [source]

Local set average cardinality.

The Local-Set (LS) of an example x_i in a dataset N is defined as the set of points from N whose distance to x_i is smaller than the distance between x_i and its nearest enemy (the nearest instance from a class distinct from that of x_i).

The cardinality of the LS of an example indicates its proximity to the decision boundary and also the narrowness of the gap between the classes.

This measure is in [0, 1 - 1/n] range, where n is the number of instances in N.

Parameters
N : np.ndarray

Numerical fitted data.

y : np.ndarray

Target attribute.

metric : str, optional

Metric used to calculate the distances between the instances. Check the scipy.spatial.distance.cdist documentation to get a list of all available metrics. This argument is used only if norm_dist_mat is None.

p : int, optional

Power parameter for the Minkowski metric. When p = 1, this is equivalent to using Manhattan distance (l1), and Euclidean distance (l2) for p = 2. For arbitrary p, Minkowski distance (l_p) is used. Used only if norm_dist_mat is None.

cls_inds : np.ndarray, optional

Boolean array which indicates the examples of each class. The rows correspond to each distinct class, and the columns correspond to the instances. Used only if the argument nearest_enemy_dist is None.

N_scaled : np.ndarray, optional

Numerical data N with each feature normalized in [0, 1] range. Used only if norm_dist_mat is None. Used to take advantage of precomputations.

norm_dist_mat : np.ndarray, optional

Square matrix with the pairwise distances between each instance in N_scaled, i.e., between the normalized instances. Used to take advantage of precomputations.

nearest_enemy_dist : np.ndarray, optional

Distance of each instance to its nearest enemy (instance of a distinct class).

Returns
float

Local set average cardinality.

References

1

Ana C. Lorena, Luís P. F. Garcia, Jens Lehmann, Marcilio C. P. Souto, and Tin K. Ho. How Complex is your classification problem? A survey on measuring classification complexity (V2). (2019) (Cited on page 15). Published in ACM Computing Surveys (CSUR), Volume 52 Issue 5, October 2019, Article No. 107.

2

Enrique Leyva, Antonio González, and Raúl Pérez. A set of complexity measures designed for applying meta-learning to instance selection. IEEE Transactions on Knowledge and Data Engineering, 27(2):354–367, 2014.
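A minimal numpy sketch of the measure (hypothetical helper assuming Euclidean distance on already-normalized data): count, for each instance, the points lying closer than its nearest enemy, then report 1 - (1/n²) * Σ_i |LS_i|.

```python
import numpy as np

def local_set_avg_cardinality(N, y):
    n = N.shape[0]
    # Pairwise Euclidean distances (n x n).
    dist = np.linalg.norm(N[:, None, :] - N[None, :, :], axis=-1)
    enemy = y[:, None] != y[None, :]
    # Distance from each instance to its nearest enemy.
    nearest_enemy_dist = np.where(enemy, dist, np.inf).min(axis=1)
    # |LS_i|: points strictly closer than the nearest enemy (self included).
    ls_card = (dist < nearest_enemy_dist[:, None]).sum(axis=1)
    return float(1.0 - ls_card.sum() / n ** 2)
```

For two well-separated clusters of two points each, every local set has cardinality 2, giving 1 - 8/16 = 0.5.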

classmethod ft_n1(N: ndarray, y: ndarray, metric: str = 'gower', p: Union[int, float] = 2, N_scaled: Optional[ndarray] = None, norm_dist_mat: Optional[ndarray] = None) → float [source]

Compute the fraction of borderline points.

This measure is in [0, 1] range.

Parameters
N : np.ndarray

Numerical fitted data.

y : np.ndarray

Target attribute.

metric : str, optional

Metric used to calculate the distances between the instances. Check the scipy.spatial.distance.cdist documentation to get a list of all available metrics. This argument is used only if norm_dist_mat is None.

p : int, optional

Power parameter for the Minkowski metric. When p = 1, this is equivalent to using Manhattan distance (l1), and Euclidean distance (l2) for p = 2. For arbitrary p, Minkowski distance (l_p) is used. Used only if norm_dist_mat is None.

N_scaled : np.ndarray, optional

Numerical data N with each feature normalized in [0, 1] range. Used only if norm_dist_mat is None. Used to take advantage of precomputations.

norm_dist_mat : np.ndarray, optional

Square matrix with the pairwise distances between each instance in N_scaled, i.e., between the normalized instances. Used to take advantage of precomputations.

Returns
float

Fraction of borderline points.

References

1

Ana C. Lorena, Luís P. F. Garcia, Jens Lehmann, Marcilio C. P. Souto, and Tin K. Ho. How Complex is your classification problem? A survey on measuring classification complexity (V2). (2019) (Cited on page 9-10). Published in ACM Computing Surveys (CSUR), Volume 52 Issue 5, October 2019, Article No. 107.
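The measure can be sketched as follows (hypothetical helper assuming Euclidean distance): build a minimum spanning tree over the instances, then report the fraction of vertices incident to at least one MST edge whose endpoints belong to different classes.

```python
import numpy as np

def fraction_borderline(N, y):
    n = N.shape[0]
    dist = np.linalg.norm(N[:, None, :] - N[None, :, :], axis=-1)
    # Prim's algorithm over the dense distance matrix.
    in_tree = np.zeros(n, dtype=bool)
    in_tree[0] = True
    best = dist[0].copy()          # cheapest known edge into the tree
    parent = np.zeros(n, dtype=int)
    borderline = np.zeros(n, dtype=bool)
    for _ in range(n - 1):
        j = int(np.argmin(np.where(in_tree, np.inf, best)))
        in_tree[j] = True
        if y[j] != y[parent[j]]:   # MST edge joins different classes
            borderline[j] = borderline[parent[j]] = True
        closer = dist[j] < best
        best = np.where(closer, dist[j], best)
        parent = np.where(closer, j, parent)
    return float(borderline.sum() / n)
```

With two clusters of two points each, only the single bridging MST edge crosses classes, marking 2 of the 4 points as borderline.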

classmethod ft_n2(N: ndarray, y: ndarray, metric: str = 'gower', p: Union[int, float] = 2, class_freqs: Optional[ndarray] = None, cls_inds: Optional[ndarray] = None, N_scaled: Optional[ndarray] = None, norm_dist_mat: Optional[ndarray] = None) → ndarray [source]

Ratio of intra and extra class nearest neighbor distance.

This measure computes the ratio of two sums:
  • The sum of the distances between each example and its closest neighbor from the same class (intra-class); and

  • The sum of the distances between each example and its closest neighbor from another class (extra-class).

The average value of this measure is in [0, 1] range.

Parameters
N : np.ndarray

Numerical fitted data.

y : np.ndarray

Target attribute.

metric : str, optional

Metric used to calculate the distances between the instances. Check the scipy.spatial.distance.cdist documentation to get a list of all available metrics. This argument is used only if norm_dist_mat is None.

p : int, optional

Power parameter for the Minkowski metric. When p = 1, this is equivalent to using Manhattan distance (l1), and Euclidean distance (l2) for p = 2. For arbitrary p, Minkowski distance (l_p) is used. Used only if norm_dist_mat is None.

class_freqs : np.ndarray, optional

The number of examples in each class. The indices correspond to the classes.

cls_inds : np.ndarray, optional

Boolean array which indicates the examples of each class. The rows correspond to each distinct class, and the columns correspond to the instances.

N_scaled : np.ndarray, optional

Numerical data N with each feature normalized in [0, 1] range. Used only if norm_dist_mat is None. Used to take advantage of precomputations.

norm_dist_mat : np.ndarray, optional

Square matrix with the pairwise distances between each instance in N_scaled, i.e., between the normalized instances. Used to take advantage of precomputations.

Returns
np.ndarray

Complement of the inverse of the intra and extra class variance.

References

1

Ana C. Lorena, Luís P. F. Garcia, Jens Lehmann, Marcilio C. P. Souto, and Tin K. Ho. How Complex is your classification problem? A survey on measuring classification complexity (V2). (2019) (Cited on page 9). Published in ACM Computing Surveys (CSUR), Volume 52 Issue 5, October 2019, Article No. 107.
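The two sums described above can be sketched as follows. This is a minimal sketch of one common aggregated form of N2 (the ratio r = intra/extra mapped into [0, 1) via r / (1 + r)) on hypothetical toy data with a Euclidean metric; pymfe returns per-instance values rather than this single aggregate:

```python
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)
N = np.vstack([rng.normal(0.0, 1.0, (20, 2)), rng.normal(4.0, 1.0, (20, 2))])
y = np.repeat([0, 1], 20)

N_scaled = (N - N.min(axis=0)) / np.ptp(N, axis=0)
dist = cdist(N_scaled, N_scaled, metric="euclidean")
np.fill_diagonal(dist, np.inf)  # an instance is not its own neighbor

same_class = y[:, None] == y[None, :]
intra = np.where(same_class, dist, np.inf).min(axis=1)   # closest same-class neighbor
extra = np.where(~same_class, dist, np.inf).min(axis=1)  # closest other-class neighbor

ratio = intra.sum() / extra.sum()
n2 = ratio / (1.0 + ratio)  # bounded in [0, 1)
```

Small intra-class distances relative to extra-class distances (i.e., compact, separated classes) drive n2 toward zero.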

classmethod ft_n3(N: ndarray, y: ndarray, metric: str = 'gower', p: Union[int, float] = 2, N_scaled: Optional[ndarray] = None, norm_dist_mat: Optional[ndarray] = None) → ndarray[source]

Error rate of the nearest neighbor classifier.

The N3 measure refers to the error rate of a 1-NN classifier that is estimated using a leave-one-out cross-validation procedure.

The average value of this measure is in [0, 1] range.

Parameters
N : np.ndarray

Numerical fitted data.

y : np.ndarray

Target attribute.

metric : str, optional

Metric used to calculate the distances between the instances. Check the scipy.spatial.distance.cdist documentation to get a list of all available metrics. This argument is used only if norm_dist_mat is None.

p : int, optional

Power parameter for the Minkowski metric. When p = 1, this is equivalent to using Manhattan distance (l1), and Euclidean distance (l2) for p = 2. For arbitrary p, Minkowski distance (l_p) is used.

N_scaled : np.ndarray, optional

Numerical data N with each feature normalized in [0, 1] range. Used only if norm_dist_mat is None. Used to take advantage of precomputations.

norm_dist_mat : np.ndarray, optional

Square matrix with the pairwise distances between each instance in N_scaled, i.e., between the normalized instances. Used to take advantage of precomputations.

Returns
np.ndarray

Binary array of misclassification of a 1-NN classifier.

References

1

Ana C. Lorena, Luís P. F. Garcia, Jens Lehmann, Marcilio C. P. Souto, and Tin K. Ho. How Complex is your classification problem? A survey on measuring classification complexity (V2). (2019) (Cited on page 9). Published in ACM Computing Surveys (CSUR), Volume 52 Issue 5, October 2019, Article No. 107.
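The leave-one-out 1-NN procedure described above can be sketched with a distance matrix whose diagonal is masked out. A minimal sketch on hypothetical toy data with a Euclidean metric, not pymfe's exact implementation:

```python
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)
N = np.vstack([rng.normal(0.0, 1.0, (20, 2)), rng.normal(4.0, 1.0, (20, 2))])
y = np.repeat([0, 1], 20)

N_scaled = (N - N.min(axis=0)) / np.ptp(N, axis=0)
dist = cdist(N_scaled, N_scaled, metric="euclidean")
np.fill_diagonal(dist, np.inf)  # leave-one-out: an instance cannot vote for itself

nearest = dist.argmin(axis=1)
misclassified = (y[nearest] != y).astype(int)  # binary array, as in the return value
n3 = misclassified.mean()  # aggregated 1-NN error rate
```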

classmethod ft_n4(N: ndarray, y: ndarray, metric: str = 'gower', p: Union[int, float] = 2, n_neighbors: int = 1, random_state: Optional[int] = None, cls_inds: Optional[ndarray] = None, N_scaled: Optional[ndarray] = None, norm_dist_mat: Optional[ndarray] = None, orig_dist_mat_min: Optional[float] = None, orig_dist_mat_ptp: Optional[float] = None) → ndarray[source]

Compute the non-linearity of the k-NN Classifier.

The average value of this measure is in [0, 1] range.

Parameters
N : np.ndarray

Numerical fitted data.

y : np.ndarray

Target attribute.

metric : str, optional

The distance metric used in the internal kNN classifier. See the scipy.spatial.distance.cdist documentation for a list of available metrics. Used only if norm_dist_mat is None.

p : int, optional

Power parameter for the Minkowski metric. When p = 1, this is equivalent to using Manhattan distance (l1), and Euclidean distance (l2) for p = 2. For arbitrary p, Minkowski distance (l_p) is used. Used only if norm_dist_mat is None.

n_neighbors : int, optional

Number of neighbors used for the Nearest Neighbors classifier.

random_state : int, optional

If given, set the random seed before computing the randomized data interpolation.

cls_inds : np.ndarray, optional

Boolean array which indicates the examples of each class. The rows correspond to each distinct class, and the columns correspond to the instances. Used to take advantage of precomputations.

N_scaled : np.ndarray, optional

Numerical data N with each feature normalized in [0, 1] range. Used only if norm_dist_mat is None. Used to take advantage of precomputations.

norm_dist_mat : np.ndarray, optional

Square matrix with the pairwise distances between each instance in N_scaled, i.e., between the normalized instances. Used to take advantage of precomputations. Used if and only if orig_dist_mat_min AND orig_dist_mat_ptp are also given (non None).

orig_dist_mat_min : float, optional

Minimal distance between the original instances in N.

orig_dist_mat_ptp : float, optional

Range (max - min) of distances between the original instances in N.

Returns
np.ndarray

Misclassifications of the k-NN classifier in the interpolated dataset.

References

1

Ana C. Lorena, Luís P. F. Garcia, Jens Lehmann, Marcilio C. P. Souto, and Tin K. Ho. How Complex is your classification problem? A survey on measuring classification complexity (V2). (2019) (Cited on page 9-11). Published in ACM Computing Surveys (CSUR), Volume 52 Issue 5, October 2019, Article No. 107.
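The idea behind N4 — label randomly interpolated points with their parents' class and measure how often a nearest-neighbor classifier trained on the original data misclassifies them — can be sketched as below. This is a simplified sketch on hypothetical toy data (Euclidean metric, no distance-matrix normalization), not pymfe's exact implementation:

```python
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)
N = np.vstack([rng.normal(0.0, 1.0, (20, 2)), rng.normal(4.0, 1.0, (20, 2))])
y = np.repeat([0, 1], 20)

# Build a synthetic test set by linearly interpolating random pairs of
# instances of the same class; each new point inherits that class label.
test_X, test_y = [], []
for cls in np.unique(y):
    idx = np.flatnonzero(y == cls)
    a, b = rng.choice(idx, size=30), rng.choice(idx, size=30)
    t = rng.random((30, 1))
    test_X.append(t * N[a] + (1.0 - t) * N[b])
    test_y.append(np.full(30, cls))
test_X, test_y = np.vstack(test_X), np.concatenate(test_y)

# 1-NN classifier fitted on the original data, evaluated on the new points.
pred = y[cdist(test_X, N, metric="euclidean").argmin(axis=1)]
n4 = (pred != test_y).mean()
```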

classmethod ft_t1(N: ndarray, y: ndarray, metric: str = 'gower', p: Union[int, float] = 2, cls_inds: Optional[ndarray] = None, N_scaled: Optional[ndarray] = None, norm_dist_mat: Optional[ndarray] = None, orig_dist_mat_min: Optional[float] = None, orig_dist_mat_ptp: Optional[float] = None) → ndarray[source]

Fraction of hyperspheres covering data.

This measure uses a process that builds hyperspheres centered at each one of the examples. In this implementation, we stop the growth of the hypersphere when the hyperspheres centered at two points of opposite classes just start to touch.

Once the radii of all hyperspheres are found, a post-processing step can be applied to verify which hyperspheres must be absorbed (i.e., every hypersphere completely contained within a larger hypersphere).

This measure is in [0, 1] range.

Parameters
N : np.ndarray

Numerical fitted data.

y : np.ndarray

Target attribute.

metric : str, optional

Metric used to calculate the distances between the instances. Check the scipy.spatial.distance.cdist documentation to get a list of all available metrics. This argument is used only if norm_dist_mat is None.

p : int, optional

Power parameter for the Minkowski metric. When p = 1, this is equivalent to using Manhattan distance (l1), and Euclidean distance (l2) for p = 2. For arbitrary p, Minkowski distance (l_p) is used. Used only if norm_dist_mat is None.

cls_inds : np.ndarray, optional

Boolean array which indicates the examples of each class. The rows correspond to each distinct class, and the columns correspond to the instances. Used only if the arguments nearest_enemy_dist or nearest_enemy_ind are None.

N_scaled : np.ndarray, optional

Numerical data N with each feature normalized in [0, 1] range. Used only if norm_dist_mat is None. Used to take advantage of precomputations.

norm_dist_mat : np.ndarray, optional

Square matrix with the pairwise distances between each instance in N_scaled, i.e., between the normalized instances. Used to take advantage of precomputations. Used if and only if orig_dist_mat_min and orig_dist_mat_ptp are also given (non None).

orig_dist_mat_min : float, optional

Minimal distance between the original instances in N.

orig_dist_mat_ptp : float, optional

Range (max - min) of distances between the original instances in N.

Returns
np.ndarray

Array with the fraction of instances inside each remaining hypersphere.

References

1

Ana C. Lorena, Luís P. F. Garcia, Jens Lehmann, Marcilio C. P. Souto, and Tin K. Ho. How Complex is your classification problem? A survey on measuring classification complexity (V2). (2019) (Cited on page 9). Published in ACM Computing Surveys (CSUR), Volume 52 Issue 5, October 2019, Article No. 107.

2

Tin K Ho and Mitra Basu. Complexity measures of supervised classification problems. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(3):289–300, 2002.
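The hypersphere-growing and absorption steps described above can be sketched as follows. This is a deliberately simplified sketch (each radius is fixed at half the distance to the nearest enemy, skipping the iterative radius refinement, and Euclidean distances replace the default 'gower'); it is not pymfe's exact implementation:

```python
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)
N = np.vstack([rng.normal(0.0, 1.0, (15, 2)), rng.normal(4.0, 1.0, (15, 2))])
y = np.repeat([0, 1], 15)

N_scaled = (N - N.min(axis=0)) / np.ptp(N, axis=0)
dist = cdist(N_scaled, N_scaled, metric="euclidean")

# Simplified radius: half of the distance to the nearest enemy, so that two
# hyperspheres centered on mutually nearest enemies just touch.
enemy = np.where(y[:, None] != y[None, :], dist, np.inf)
radius = 0.5 * enemy.min(axis=1)

# Absorption: drop hypersphere i if it lies completely inside hypersphere j.
n = y.size
absorbed = np.zeros(n, dtype=bool)
for i in range(n):
    for j in range(n):
        if i != j and not absorbed[j] and dist[i, j] + radius[i] <= radius[j]:
            absorbed[i] = True
            break

t1 = (~absorbed).sum() / n  # fraction of remaining hyperspheres
```

Fewer remaining hyperspheres (smaller t1) indicate that large same-class regions can each be covered by a single sphere, i.e., a simpler problem.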

classmethod ft_t2(N: ndarray) → float[source]

Compute the average number of features per dimension.

This measure is in (0, m] range, where m is the number of features in N.

Parameters
N : np.ndarray

Numeric attributes from fitted data.

Returns
float

Average number of features per dimension.

References

1

Ana C. Lorena, Luís P. F. Garcia, Jens Lehmann, Marcilio C. P. Souto, and Tin K. Ho. How Complex is your classification problem? A survey on measuring classification complexity (V2). (2019) (Cited on page 15). Published in ACM Computing Surveys (CSUR), Volume 52 Issue 5, October 2019, Article No. 107.
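T2 reduces to a simple ratio. A minimal sketch, assuming the measure is computed as the number of features m divided by the number of instances n (consistent with the (0, m] range stated above):

```python
import numpy as np

# Hypothetical fitted data: 100 instances, 4 numeric features.
N = np.random.default_rng(0).normal(size=(100, 4))

n_instances, n_features = N.shape
t2 = n_features / n_instances  # → 0.04 for this shape
```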

classmethod ft_t3(N: ndarray, num_attr_pca: Optional[int] = None, random_state: Optional[int] = None) → float[source]

Compute the average number of PCA dimensions per point.

This measure is in (0, m] range, where m is the number of features in N.

Parameters
N : np.ndarray

Numerical fitted data.

num_attr_pca : int, optional

Number of features after PCA where a fraction of at least 0.95 of the data variance is explained by the selected components.

random_state : int, optional

If the fitted data is huge and the number of principal components to be kept is low, then the PCA analysis is done using a randomized strategy for efficiency. This random seed keeps the results replicable. Check sklearn.decomposition.PCA documentation for more information.

Returns
float

Average number of PCA dimensions (explaining at least 95% of the data variance) per point.

References

1

Ana C. Lorena, Luís P. F. Garcia, Jens Lehmann, Marcilio C. P. Souto, and Tin K. Ho. How Complex is your classification problem? A survey on measuring classification complexity (V2). (2019) (Cited on page 15). Published in ACM Computing Surveys (CSUR), Volume 52 Issue 5, October 2019, Article No. 107.
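A sketch of the computation (assuming T3 is the number of principal components needed to explain at least 95% of the variance, divided by the number of instances; here the component count is derived from the singular values of the centered data rather than sklearn's PCA):

```python
import numpy as np

rng = np.random.default_rng(0)

# 100 instances, 5 features, but only 2 independent sources of variance.
base = rng.normal(size=(100, 2))
N = np.hstack([base, base @ rng.normal(size=(2, 3))])

# Explained-variance ratios from the singular values of the centered data.
X = N - N.mean(axis=0)
sing = np.linalg.svd(X, compute_uv=False)
var_ratio = sing ** 2 / np.sum(sing ** 2)

# Smallest number of components whose cumulative variance reaches 0.95.
num_attr_pca = int(np.searchsorted(np.cumsum(var_ratio), 0.95) + 1)

t3 = num_attr_pca / N.shape[0]  # PCA dimensions per instance
```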

classmethod ft_t4(N: ndarray, num_attr_pca: Optional[int] = None, random_state: Optional[int] = None) → float[source]

Compute the ratio of the PCA dimension to the original dimension.

The components kept in the PCA dimension explain at least 95% of the data variance.

This measure is in [0, 1] range.

Parameters
N : np.ndarray

Numerical fitted data.

num_attr_pca : int, optional

Number of features after PCA where a fraction of at least 0.95 of the data variance is explained by the selected components.

random_state : int, optional

If the fitted data is huge and the number of principal components to be kept is low, then the PCA analysis is done using a randomized strategy for efficiency. This random seed keeps the results replicable. Check sklearn.decomposition.PCA documentation for more information.

Returns
float

Ratio of the PCA dimension (explaining at least 95% of the data variance) to the original dimension.

References

1

Ana C. Lorena, Luís P. F. Garcia, Jens Lehmann, Marcilio C. P. Souto, and Tin K. Ho. How Complex is your classification problem? A survey on measuring classification complexity (V2). (2019) (Cited on page 15). Published in ACM Computing Surveys (CSUR), Volume 52 Issue 5, October 2019, Article No. 107.
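T4 differs from T3 only in the denominator: the number of retained components is divided by the original number of features instead of the number of instances. A sketch under the same assumptions (component count taken from the singular values of the centered data):

```python
import numpy as np

rng = np.random.default_rng(1)

# 50 instances, 6 features with an intrinsic dimensionality of 2.
base = rng.normal(size=(50, 2))
N = np.hstack([base, base @ rng.normal(size=(2, 4))])

X = N - N.mean(axis=0)
sing = np.linalg.svd(X, compute_uv=False)
var_ratio = sing ** 2 / np.sum(sing ** 2)
num_attr_pca = int(np.searchsorted(np.cumsum(var_ratio), 0.95) + 1)

t4 = num_attr_pca / N.shape[1]  # PCA dimension over original dimension
```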

classmethod precompute_adjacency_graph(N: ndarray, y: Optional[ndarray] = None, metric: str = 'gower', p: float = 2.0, n_jobs: Optional[int] = None, **kwargs) → Dict[str, ndarray][source]

Precompute instances nearest enemy related values.

The instance nearest enemy is the nearest instance from a different class.

Parameters
N : np.ndarray

Numerical fitted data.

y : np.ndarray

Target attribute.

**kwargs

Additional arguments. May contain values previously precomputed by other precomputation methods, which can be used to speed up this precomputation.

Returns
dict

With the following precomputed items:

classmethod precompute_complexity(y: Optional[ndarray] = None, **kwargs) → Dict[str, Any][source]

Precompute some useful things to support feature-based measures.

Parameters
y : np.ndarray, optional

Target attribute.

**kwargs

Additional arguments. May contain values previously precomputed by other precomputation methods, which can be used to speed up this precomputation.

Returns
dict
With the following precomputed items:
  • ovo_comb (list): List of all class OVO combinations,

    i.e., all combinations of distinct class indices by pairs ([(0, 1), (0, 2), …]).

  • cls_inds (np.ndarray): Boolean array which

    indicates whether each example belongs to each class. The rows correspond to the distinct classes, and the instances are represented by the columns.

  • classes (np.ndarray): distinct classes in the

    fitted target attribute.

  • class_freqs (np.ndarray): The number of examples

    in each class. The indices correspond to the classes.
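The precomputed items above can be reproduced with plain NumPy. A minimal sketch of what the precomputation yields on a hypothetical target attribute, not pymfe's internal code:

```python
import itertools
import numpy as np

y = np.array([0, 1, 2, 0, 1, 2, 0])  # hypothetical target attribute

classes, class_freqs = np.unique(y, return_counts=True)

# Boolean membership matrix: one row per class, one column per instance.
cls_inds = classes[:, None] == y[None, :]

# All one-vs-one (OVO) pairs of class indices.
ovo_comb = list(itertools.combinations(range(classes.size), 2))

# ovo_comb == [(0, 1), (0, 2), (1, 2)]
# class_freqs == array([3, 2, 2])
```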

classmethod precompute_complexity_svm(y: Optional[ndarray] = None, max_iter: Union[int, float] = 100000.0, random_state: Optional[int] = None, **kwargs) → Dict[str, Pipeline][source]

Init a Support Vector Classifier pipeline (with data standardization).

Parameters
max_iter : float or int, optional

Maximum number of iterations allowed for the support vector machine model to converge. This parameter also accepts float values, so it is compatible with numbers written in Python scientific notation (e.g., 1e5).

random_state : int, optional

Random seed for dual coordinate descent while fitting the Support Vector Classifier model. Check sklearn.svm.LinearSVC documentation (random_state parameter) for more information.

**kwargs

Additional arguments. May contain values previously precomputed by other precomputation methods, which can be used to speed up this precomputation.

Returns
dict
With the following precomputed items:
  • svc_pipeline (sklearn.pipeline.Pipeline): support

    vector classifier learning pipeline, with data standardization (mean = 0 and variance = 1) applied before the learning model.
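A pipeline with the shape described above can be built directly with scikit-learn. This is a sketch; the exact hyperparameters pymfe sets internally may differ:

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Standardize features (mean = 0, variance = 1) before the linear SVC.
svc_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("svc", LinearSVC(max_iter=100_000, random_state=0)),
])
```

The pipeline is then fitted as usual with `svc_pipeline.fit(N, y)`, and the SVM-based complexity measures query the fitted model.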

classmethod precompute_nearest_enemy(N: ndarray, y: Optional[ndarray] = None, metric: str = 'gower', p: Union[int, float] = 2, **kwargs) → Dict[str, ndarray][source]

Precompute instances nearest enemy related values.

The instance nearest enemy is the nearest instance from a different class.

Parameters
N : np.ndarray

Numerical fitted data.

y : np.ndarray

Target attribute.

metric : str, optional

Metric used to calculate the distances between the instances. Check the scipy.spatial.distance.cdist documentation to get a list of all available metrics.

p : int, optional

Power parameter for the Minkowski metric. When p = 1, this is equivalent to using Manhattan distance (l1), and Euclidean distance (l2) for p = 2. For arbitrary p, Minkowski distance (l_p) is used.

**kwargs

Additional arguments. May contain values previously precomputed by other precomputation methods, which can be used to speed up this precomputation.

Returns
dict
With the following precomputed items:
  • nearest_enemy_dist (np.ndarray): distance of each

    instance to its nearest enemy (the nearest instance of a distinct class).

  • nearest_enemy_ind (np.ndarray): index of the

    nearest enemy (the nearest instance of a distinct class) for each instance.

This precomputation method also depends on values produced by other precomputation methods, namely precompute_complexity and precompute_norm_dist_mat. Therefore, the return values of those methods may also be returned by this one in case they have not been called beforehand. Check the documentation of each of those methods for a precise description of the additional values that may be returned.
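The nearest-enemy quantities can be sketched by masking same-class entries of the pairwise distance matrix. A minimal sketch on hypothetical toy data with a Euclidean metric, not pymfe's internal code:

```python
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)
N = np.vstack([rng.normal(0.0, 1.0, (10, 2)), rng.normal(4.0, 1.0, (10, 2))])
y = np.repeat([0, 1], 10)

N_scaled = (N - N.min(axis=0)) / np.ptp(N, axis=0)
dist = cdist(N_scaled, N_scaled, metric="euclidean")

# Mask same-class entries so the row-wise minimum ranges over enemies only.
enemy_only = np.where(y[:, None] != y[None, :], dist, np.inf)
nearest_enemy_ind = enemy_only.argmin(axis=1)
nearest_enemy_dist = enemy_only.min(axis=1)
```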

classmethod precompute_norm_dist_mat(N: ndarray, metric: str = 'gower', p: Union[int, float] = 2, **kwargs) → Dict[str, ndarray][source]

Precompute normalized N and pairwise distance among instances.

Parameters
N : np.ndarray

Numerical fitted data.

metric : str, optional

Metric used to calculate the distances between the instances. Check the scipy.spatial.distance.cdist documentation to get a list of all available metrics.

p : int, optional

Power parameter for the Minkowski metric. When p = 1, this is equivalent to using Manhattan distance (l1), and Euclidean distance (l2) for p = 2. For arbitrary p, Minkowski distance (l_p) is used.

**kwargs

Additional arguments. May contain values previously precomputed by other precomputation methods, which can be used to speed up this precomputation.

Returns
dict
With the following precomputed items:
  • N_scaled (np.ndarray): numerical data N with

    each feature normalized in [0, 1] range.

  • norm_dist_mat (np.ndarray): square matrix with

    the normalized pairwise distances between each instance in N_scaled, i.e., between the normalized instances. (Note that this matrix holds the normalized pairwise distances between the normalized instances, i.e., there are two normalization processes involved.)

  • orig_dist_mat_min (float): minimal value from the

    original pairwise distance matrix. Can be used to preprocess test data before predictions.

  • orig_dist_mat_ptp (float): range (max - min) of values from

    the original pairwise distance matrix. Can be used to preprocess test data before predictions.
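The two normalization steps mentioned above can be sketched as follows. This is a minimal sketch on hypothetical data with a Euclidean metric, not pymfe's internal code:

```python
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)
N = rng.normal(size=(15, 3))  # hypothetical numerical fitted data

# First normalization: scale each feature of N into [0, 1].
N_scaled = (N - N.min(axis=0)) / np.ptp(N, axis=0)

# Pairwise distances between the normalized instances.
orig_dist_mat = cdist(N_scaled, N_scaled, metric="euclidean")
orig_dist_mat_min = float(orig_dist_mat.min())
orig_dist_mat_ptp = float(np.ptp(orig_dist_mat))

# Second normalization: rescale the distance matrix itself into [0, 1].
norm_dist_mat = (orig_dist_mat - orig_dist_mat_min) / orig_dist_mat_ptp
```

Keeping orig_dist_mat_min and orig_dist_mat_ptp allows distances involving new (test) instances to be rescaled consistently with the fitted data.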

classmethod precompute_pca_tx(N: ndarray, tx_n_components: float = 0.95, random_state: Optional[int] = None, **kwargs) → Dict[str, int][source]

Precompute PCA to support dimensionality measures.

Parameters
N : np.ndarray

Numerical fitted data.

tx_n_components : float, optional

Specifies the number of components such that the amount of variance that needs to be explained is greater than the fraction specified by tx_n_components. The PCA is computed using N.

random_state : int, optional

If the fitted data is huge and the number of principal components to be kept is low, then the PCA analysis is done using a randomized strategy for efficiency. This random seed keeps the results replicable. Check sklearn.decomposition.PCA documentation for more information.

**kwargs

Additional arguments. May contain values previously precomputed by other precomputation methods, which can be used to speed up this precomputation.

Returns
dict
With the following precomputed items:
  • num_attr_pca (int): Number of features after PCA

    analysis with at least tx_n_components fraction of data variance explained by the selected principal components.
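The num_attr_pca value can be sketched from the explained-variance ratios of the fitted data. A minimal sketch using the singular values of the centered data instead of sklearn's PCA, on hypothetical data:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: 50 instances, 6 features, intrinsic dimensionality 2.
base = rng.normal(size=(50, 2))
N = np.hstack([base, base @ rng.normal(size=(2, 4))])

tx_n_components = 0.95

# Explained-variance ratios from the singular values of the centered data.
X = N - N.mean(axis=0)
var_ratio = np.linalg.svd(X, compute_uv=False) ** 2
var_ratio = var_ratio / var_ratio.sum()

# Smallest number of components whose cumulative variance reaches the threshold.
num_attr_pca = int(np.searchsorted(np.cumsum(var_ratio), tx_n_components) + 1)
```

Precomputing num_attr_pca once lets ft_t3 and ft_t4 reuse it instead of each re-running the PCA.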