pymfe.complexity.MFEComplexity
- class pymfe.complexity.MFEComplexity[source]
Keep methods for metafeatures of the Complexity group.

The convention adopted for metafeature extraction related methods is to always start with the ft_ prefix to allow automatic method detection. This prefix is predefined within the _internal module.

All method signatures follow the conventions and restrictions listed below:

For independent attribute data, X means every type of attribute, N means Numeric attributes only, and C stands for Categorical attributes only. It is important to note that the categorical attribute sets between X and C, and the numerical attribute sets between X and N, may differ due to data transformations performed while fitting data into the MFE model, enabled respectively by the transform_num and transform_cat arguments of fit (an MFE method).

Only arguments in the MFE _custom_args_ft attribute (set up inside the fit method) are allowed to be required method arguments. All other arguments must be strictly optional (i.e., have a predefined default value).

The initial assumption is that the user can change any optional argument, without any previous verification of the argument value or its type, via the kwargs argument of the extract method of the MFE class.

The return value of all feature extraction methods should be a single value or a generic list type (preferably an np.ndarray) with numeric values.

There is another type of method adopted for automatic detection: methods with the precompute_ prefix. These methods run automatically while fitting data into an MFE model, and their objective is to precompute common values shared by more than one feature extraction method. This strategy is a trade-off between higher memory consumption and faster feature extraction. Their return value must always be a dictionary whose keys are possible extra arguments for both feature extraction methods and other precomputation methods. Note that precomputed values are shared among all valid feature-extraction modules (e.g., class_freqs computed in the statistical module can freely be used by any precomputation or feature extraction method of the landmarking module).

- __init__(*args, **kwargs)
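The ft_/precompute_ conventions above can be sketched with a toy class (ToyExtractor and ft_nr_class are hypothetical names used only to illustrate the prefix-based detection, not the real MFE internals):

```python
import numpy as np

class ToyExtractor:
    """Toy illustration of the ft_/precompute_ prefix convention."""

    @classmethod
    def precompute_class_freqs(cls, y, **kwargs):
        # Precomputed values are returned as a dict whose keys match
        # optional arguments of the feature-extraction methods.
        _, class_freqs = np.unique(y, return_counts=True)
        return {"class_freqs": class_freqs}

    @classmethod
    def ft_nr_class(cls, y, class_freqs=None):
        # class_freqs is strictly optional: recomputed when not given.
        if class_freqs is None:
            _, class_freqs = np.unique(y, return_counts=True)
        return class_freqs.size

# Automatic detection: gather all methods starting with the ft_ prefix.
ft_methods = [name for name in dir(ToyExtractor) if name.startswith("ft_")]
```

This mirrors how precomputed dictionary keys (here class_freqs) are later injected as optional arguments of matching feature-extraction methods.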
Methods
- __init__(*args, **kwargs)
- ft_c1(y[, class_freqs]): Compute the entropy of class proportions.
- ft_c2(y[, class_freqs]): Compute the imbalance ratio.
- ft_cls_coef(N, y[, metric, p, radius_frac, ...]): Clustering coefficient.
- ft_density(N, y[, metric, p, radius_frac, ...]): Average density of the network.
- ft_f1(N, y[, cls_inds, class_freqs]): Maximum Fisher's discriminant ratio.
- ft_f1v(N, y[, ovo_comb, cls_inds, class_freqs]): Directional-vector maximum Fisher's discriminant ratio.
- ft_f2(N, y[, ovo_comb, cls_inds]): Volume of the overlapping region.
- ft_f3(N, y[, ovo_comb, cls_inds, class_freqs]): Compute feature maximum individual efficiency.
- ft_f4(N, y[, ovo_comb, cls_inds, class_freqs]): Compute the collective feature efficiency.
- ft_hubs(N, y[, metric, p, radius_frac, ...]): Hub score.
- ft_l1(N, y[, ovo_comb, cls_inds, ...]): Sum of error distance by linear programming.
- ft_l2(N, y[, ovo_comb, cls_inds, ...]): Compute the OVO subsets error rate of linear classifier.
- ft_l3(N, y[, ovo_comb, cls_inds, ...]): Non-linearity of a linear classifier.
- ft_lsc(N, y[, metric, p, cls_inds, ...]): Local set average cardinality.
- ft_n1(N, y[, metric, p, N_scaled, norm_dist_mat]): Compute the fraction of borderline points.
- ft_n2(N, y[, metric, p, class_freqs, ...]): Ratio of intra and extra class nearest neighbor distance.
- ft_n3(N, y[, metric, p, N_scaled, norm_dist_mat]): Error rate of the nearest neighbor classifier.
- ft_n4(N, y[, metric, p, n_neighbors, ...]): Compute the non-linearity of the k-NN classifier.
- ft_t1(N, y[, metric, p, cls_inds, N_scaled, ...]): Fraction of hyperspheres covering data.
- ft_t2(N): Compute the average number of features per dimension.
- ft_t3(N[, num_attr_pca, random_state]): Compute the average number of PCA dimensions per points.
- ft_t4(N[, num_attr_pca, random_state]): Compute the ratio of the PCA dimension to the original dimension.
- precompute_adjacency_graph(N[, y, metric, ...]): Precompute instances nearest enemy related values.
- Precompute some useful things to support feature-based measures.
- precompute_complexity_svm([y, max_iter, ...]): Initialize a Support Vector Classifier pipeline (with data standardization).
- precompute_nearest_enemy(N[, y, metric, p]): Precompute instances nearest enemy related values.
- precompute_norm_dist_mat(N[, metric, p]): Precompute normalized N and pairwise distances among instances.
- precompute_pca_tx(N[, tx_n_components, ...]): Precompute PCA to support dimensionality measures.
- classmethod ft_c1(y: ndarray, class_freqs: Optional[ndarray] = None) float [source]
Compute the entropy of class proportions.
This measure is in [0, 1] range.
- Parameters
- y : np.ndarray
Target attribute.
- class_freqs : np.ndarray, optional
The number of examples in each class. The indices correspond to the classes.
- Returns
- float
Entropy of class proportions.
References
- 1
Ana C. Lorena, Luís P. F. Garcia, Jens Lehmann, Marcilio C. P. Souto, and Tin K. Ho. How Complex is your classification problem? A survey on measuring classification complexity (V2). (2019) (Cited on page 15). Published in ACM Computing Surveys (CSUR), Volume 52 Issue 5, October 2019, Article No. 107.
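A minimal sketch of this measure, assuming the formulation from Lorena et al. (2019): the entropy of the class proportions, normalized by log(nc) so that a perfectly balanced problem scores 1:

```python
import numpy as np

def class_proportion_entropy(y):
    """Normalized entropy of class proportions, in the [0, 1] range."""
    _, class_freqs = np.unique(y, return_counts=True)
    p = class_freqs / class_freqs.sum()
    # Normalize by log(num_classes) so a balanced problem scores 1.
    return float(-np.sum(p * np.log(p)) / np.log(p.size))
```

The sketch assumes at least two classes (a single-class target would divide by log(1) = 0).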
- classmethod ft_c2(y: ndarray, class_freqs: Optional[ndarray] = None) float [source]
Compute the imbalance ratio.
This measure is in [0, 1] range.
- Parameters
- y : np.ndarray
Target attribute.
- class_freqs : np.ndarray, optional
The number of examples in each class. The indices correspond to the classes.
- Returns
- float
The imbalance ratio.
References
- 1
Ana C. Lorena, Luís P. F. Garcia, Jens Lehmann, Marcilio C. P. Souto, and Tin K. Ho. How Complex is your classification problem? A survey on measuring classification complexity (V2). (2019) (Cited on page 16). Published in ACM Computing Surveys (CSUR), Volume 52 Issue 5, October 2019, Article No. 107.
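A sketch of this measure under the formulation in Lorena et al. (2019), where IR = ((nc - 1)/nc) * sum_i n_ci / (n - n_ci) and C2 = 1 - 1/IR, so a balanced problem scores 0:

```python
import numpy as np

def imbalance_ratio(y):
    """C2 measure: 0 for balanced data, approaching 1 as imbalance grows."""
    _, class_freqs = np.unique(y, return_counts=True)
    n, nc = class_freqs.sum(), class_freqs.size
    # IR sums each class frequency over the size of all remaining classes.
    ir = ((nc - 1) / nc) * np.sum(class_freqs / (n - class_freqs))
    return float(1.0 - 1.0 / ir)
```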
- classmethod ft_cls_coef(N: ndarray, y: ndarray, metric: str = 'gower', p: float = 2.0, radius_frac: Union[int, float] = 0.15, n_jobs: Optional[int] = None, cls_inds: Optional[ndarray] = None, norm_dist_mat: Optional[ndarray] = None, adj_graph: Optional[_construct_graph_from_weighted_adjacency] = None) float [source]
Clustering coefficient.
The clustering coefficient of a vertex v_i is given by the ratio of the number of edges between its neighbors (in a Same-class Radius Neighbor Graph) and the maximum number of edges that could possibly exist between them.
This measure is in [0, 1] range.
- Parameters
- N : np.ndarray
Numerical fitted data.
- y : np.ndarray
Target attribute.
- radius_frac : float or int, optional
Maximum distance between each pair of instances of the same class for both to be considered neighbors of each other. Note that each feature of N is first normalized into the [0, 1] range before the neighbor calculations.
- metric : str, optional
Metric used to calculate the distances between the instances. Check the scipy.spatial.distance.cdist documentation to get a list of all available metrics. This argument is used only if norm_dist_mat is None.
- p : int, optional
Power parameter for the Minkowski metric. When p = 1, this is equivalent to the Manhattan distance (l1); when p = 2, to the Euclidean distance (l2). For arbitrary p, the Minkowski distance (l_p) is used. Used only if norm_dist_mat is None.
- cls_inds : np.ndarray, optional
Boolean array which indicates the examples of each class. The rows correspond to each distinct class, and the columns correspond to the instances.
- N_scaled : np.ndarray, optional
Numerical data N with each feature normalized into the [0, 1] range. Used only if norm_dist_mat is None. Used to take advantage of precomputations.
- norm_dist_mat : np.ndarray, optional
Square matrix with the pairwise distances between each instance in N_scaled, i.e., between the normalized instances. Used to take advantage of precomputations.
- Returns
- float
Clustering coefficient of the given data.
References
- 1
Ana C. Lorena, Luís P. F. Garcia, Jens Lehmann, Marcilio C. P. Souto, and Tin K. Ho. How Complex is your classification problem? A survey on measuring classification complexity (V2). (2019) (Cited on page 9). Published in ACM Computing Surveys (CSUR), Volume 52 Issue 5, October 2019, Article No. 107.
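The ratio described above can be illustrated on a plain binary adjacency matrix; this sketch skips the Same-class Radius Neighbor Graph construction that pymfe performs first:

```python
import numpy as np

def clustering_coefficient(adj):
    """Mean local clustering coefficient of an undirected, unweighted graph."""
    n = adj.shape[0]
    coefs = []
    for v in range(n):
        neigh = np.flatnonzero(adj[v])
        k = neigh.size
        if k < 2:
            coefs.append(0.0)
            continue
        # Edges actually present among the neighbors of v ...
        links = adj[np.ix_(neigh, neigh)].sum() / 2
        # ... divided by the maximum possible number of such edges.
        coefs.append(links / (k * (k - 1) / 2))
    return float(np.mean(coefs))
```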
- classmethod ft_density(N: ndarray, y: ndarray, metric: str = 'gower', p: float = 2.0, radius_frac: Union[int, float] = 0.15, n_jobs: Optional[int] = None, cls_inds: Optional[ndarray] = None, norm_dist_mat: Optional[ndarray] = None, adj_graph: Optional[_construct_graph_from_weighted_adjacency] = None) float [source]
Average density of the network.
This measure considers the number of edges that are retained in the graph (Same-class Radius Nearest Neighbors) built from the dataset normalized by the maximum number of edges between y.size instances.
This measure is in [0, 1] range.
- Parameters
- N : np.ndarray
Numerical fitted data.
- y : np.ndarray
Target attribute.
- radius_frac : float or int, optional
Maximum distance between each pair of instances of the same class for both to be considered neighbors of each other. Note that each feature of N is first normalized into the [0, 1] range before the neighbor calculations.
- metric : str, optional
Metric used to calculate the distances between the instances. Check the scipy.spatial.distance.cdist documentation to get a list of all available metrics. This argument is used only if norm_dist_mat is None.
- p : int, optional
Power parameter for the Minkowski metric. When p = 1, this is equivalent to the Manhattan distance (l1); when p = 2, to the Euclidean distance (l2). For arbitrary p, the Minkowski distance (l_p) is used. Used only if norm_dist_mat is None.
- cls_inds : np.ndarray, optional
Boolean array which indicates the examples of each class. The rows correspond to each distinct class, and the columns correspond to the instances.
- N_scaled : np.ndarray, optional
Numerical data N with each feature normalized into the [0, 1] range. Used only if norm_dist_mat is None. Used to take advantage of precomputations.
- norm_dist_mat : np.ndarray, optional
Square matrix with the pairwise distances between each instance in N_scaled, i.e., between the normalized instances. Used to take advantage of precomputations.
- Returns
- float
Complement of the ratio between the number of edges in the Radius Nearest Neighbors graph and the total number of edges that could possibly exist in a graph with the given number of instances.
References
- 1
Ana C. Lorena, Luís P. F. Garcia, Jens Lehmann, Marcilio C. P. Souto, and Tin K. Ho. How Complex is your classification problem? A survey on measuring classification complexity (V2). (2019) (Cited on page 9). Published in ACM Computing Surveys (CSUR), Volume 52 Issue 5, October 2019, Article No. 107.
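Given an adjacency matrix for the Same-class Radius Nearest Neighbors graph, the measure reduces to the complement of the edge density; a sketch with the graph construction omitted:

```python
import numpy as np

def graph_density_complement(adj):
    """Complement of the fraction of retained edges, in the [0, 1] range."""
    n = adj.shape[0]
    n_edges = adj.sum() / 2          # undirected: each edge is counted twice
    max_edges = n * (n - 1) / 2      # edges in a complete graph on n nodes
    return float(1.0 - n_edges / max_edges)
```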
- classmethod ft_f1(N: ndarray, y: ndarray, cls_inds: Optional[ndarray] = None, class_freqs: Optional[ndarray] = None) ndarray [source]
Maximum Fisher’s discriminant ratio.
It measures the overlap between the values of the features in different classes.
The average value of this measure is in [0, 1] range.
- Parameters
- N : np.ndarray
Numerical fitted data.
- y : np.ndarray
Target attribute.
- cls_inds : np.ndarray, optional
Boolean array which indicates the examples of each class. The rows correspond to each distinct class, and the columns correspond to the instances.
- class_freqs : np.ndarray, optional
The number of examples in each class. The indices correspond to the classes.
- Returns
- np.ndarray
Inverse of all Fisher's discriminant ratios.
References
- 1
Ana C. Lorena, Luís P. F. Garcia, Jens Lehmann, Marcilio C. P. Souto, and Tin K. Ho. How Complex is your classification problem? A survey on measuring classification complexity (V2). (2019) (Cited on page 9). Published in ACM Computing Surveys (CSUR), Volume 52 Issue 5, October 2019, Article No. 107.
- 2
Ramón A Mollineda, José S Sánchez, and José M Sotoca. Data characterization for effective prototype selection. In 2nd Iberian Conference on Pattern Recognition and Image Analysis (IbPRIA), pages 27–34, 2005.
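A per-feature sketch of the multiclass Fisher's discriminant ratio and its inverse, assuming the between-class/within-class formulation from the references (it also assumes every feature has nonzero within-class variance):

```python
import numpy as np

def f1_per_feature(N, y):
    """Per-feature 1 / (1 + r_f), where r_f is Fisher's discriminant ratio."""
    classes, class_freqs = np.unique(y, return_counts=True)
    overall_mean = N.mean(axis=0)
    between = np.zeros(N.shape[1])
    within = np.zeros(N.shape[1])
    for cls, freq in zip(classes, class_freqs):
        N_c = N[y == cls]
        # Between-class scatter: class means around the overall mean.
        between += freq * (N_c.mean(axis=0) - overall_mean) ** 2
        # Within-class scatter: instances around their own class mean.
        within += ((N_c - N_c.mean(axis=0)) ** 2).sum(axis=0)
    r_f = between / within
    return 1.0 / (1.0 + r_f)
```

Lower values indicate at least one highly discriminative feature.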
- classmethod ft_f1v(N: ndarray, y: ndarray, ovo_comb: Optional[ndarray] = None, cls_inds: Optional[ndarray] = None, class_freqs: Optional[ndarray] = None) ndarray [source]
Directional-vector maximum Fisher’s discriminant ratio.
This measure searches for a vector which can separate the two classes after the examples have been projected into it and considers a directional Fisher criterion. Check the references for more information.
The average value of this measure is in [0, 1] range.
- Parameters
- N : np.ndarray
Numerical fitted data.
- y : np.ndarray
Target attribute.
- ovo_comb : np.ndarray, optional
List of all class OVO combinations, i.e., all combinations of distinct class indices by pairs ([(0, 1), (0, 2), ...]).
- cls_inds : np.ndarray, optional
Boolean array which indicates the examples of each class. The rows correspond to each distinct class, and the columns correspond to the instances.
- class_freqs : np.ndarray, optional
The number of examples in each class. The indices correspond to the classes.
- Returns
- np.ndarray
Inverse of the directional-vector Fisher's discriminant ratio.
References
- 1
Ana C. Lorena, Luís P. F. Garcia, Jens Lehmann, Marcilio C. P. Souto, and Tin K. Ho. How Complex is your classification problem? A survey on measuring classification complexity (V2). (2019) (Cited on page 9). Published in ACM Computing Surveys (CSUR), Volume 52 Issue 5, October 2019, Article No. 107.
- 2
Witold Malina. Two-parameter fisher criterion. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 31(4):629–636, 2001.
- classmethod ft_f2(N: ndarray, y: ndarray, ovo_comb: Optional[ndarray] = None, cls_inds: Optional[ndarray] = None) ndarray [source]
Volume of the overlapping region.
This measure calculates the overlap of the distributions of the features values within the classes.
This measure is in [0, 1] range.
- Parameters
- N : np.ndarray
Numerical fitted data.
- y : np.ndarray
Fitted target attribute.
- ovo_comb : np.ndarray, optional
List of all class OVO combinations, i.e., all combinations of distinct class indices by pairs ([(0, 1), (0, 2), ...]).
- cls_inds : np.ndarray, optional
Boolean array which indicates the examples of each class. The rows correspond to each distinct class, and the columns correspond to the instances.
- Returns
- np.ndarray
Volume of the overlapping region for each OVO combination.
References
- 1
Ana C. Lorena, Luís P. F. Garcia, Jens Lehmann, Marcilio C. P. Souto, and Tin K. Ho. How Complex is your classification problem? A survey on measuring classification complexity (V2). (2019) (Cited on page 9). Published in ACM Computing Surveys (CSUR), Volume 52 Issue 5, October 2019, Article No. 107.
- 2
Marcilio C P Souto, Ana C Lorena, Newton Spolaôr, and Ivan G Costa. Complexity measures of supervised classification tasks: a case study for cancer gene expression data. In International Joint Conference on Neural Networks (IJCNN), pages 1352–1358, 2010.
- 3
Lisa Cummins. Combining and Choosing Case Base Maintenance Algorithms. PhD thesis, National University of Ireland, Cork, 2013.
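For a single OVO pair, the overlap volume is the product, over features, of the normalized overlap of the per-class value ranges; a sketch under that formulation:

```python
import numpy as np

def f2_overlap(N_a, N_b):
    """Overlap volume between two classes (one OVO pair), in [0, 1]."""
    # Per-feature overlap of the two class value ranges (clipped at zero).
    overlap = np.maximum(
        0.0,
        np.minimum(N_a.max(axis=0), N_b.max(axis=0))
        - np.maximum(N_a.min(axis=0), N_b.min(axis=0)),
    )
    # Per-feature span of the union of both classes.
    span = np.maximum(N_a.max(axis=0), N_b.max(axis=0)) - np.minimum(
        N_a.min(axis=0), N_b.min(axis=0)
    )
    return float(np.prod(overlap / span))
```

A value of 0 means at least one feature fully separates the two classes.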
- classmethod ft_f3(N: ndarray, y: ndarray, ovo_comb: Optional[ndarray] = None, cls_inds: Optional[ndarray] = None, class_freqs: Optional[ndarray] = None) ndarray [source]
Compute feature maximum individual efficiency.
The average value of this measure is in [0, 1] range.
- Parameters
- N : np.ndarray
Numerical fitted data.
- y : np.ndarray
Target attribute.
- ovo_comb : np.ndarray, optional
List of all class OVO combinations, i.e., all combinations of distinct class indices by pairs ([(0, 1), (0, 2), ...]).
- cls_inds : np.ndarray, optional
Boolean array which indicates the examples of each class. The rows correspond to each distinct class, and the columns correspond to the instances.
- class_freqs : np.ndarray, optional
The number of examples in each class. The indices correspond to the classes.
- Returns
- np.ndarray
An array with the maximum individual feature efficiency measure for each feature.
References
- 1
Ana C. Lorena, Luís P. F. Garcia, Jens Lehmann, Marcilio C. P. Souto, and Tin K. Ho. How Complex is your classification problem? A survey on measuring classification complexity (V2). (2019) (Cited on page 6). Published in ACM Computing Surveys (CSUR), Volume 52 Issue 5, October 2019, Article No. 107.
- classmethod ft_f4(N: ndarray, y: ndarray, ovo_comb: Optional[ndarray] = None, cls_inds: Optional[ndarray] = None, class_freqs: Optional[ndarray] = None) ndarray [source]
Compute the collective feature efficiency.
The average value of this measure is in [0, 1] range.
- Parameters
- N : np.ndarray
Numerical fitted data.
- y : np.ndarray
Target attribute.
- ovo_comb : np.ndarray, optional
List of all class OVO combinations, i.e., all combinations of distinct class indices by pairs ([(0, 1), (0, 2), ...]).
- cls_inds : np.ndarray, optional
Boolean array which indicates the examples of each class. The rows correspond to each distinct class, and the columns correspond to the instances.
- class_freqs : np.ndarray, optional
The number of examples in each class. The indices correspond to the classes.
- Returns
- np.ndarray
An array with the collective feature efficiency measure for each feature.
References
- 1
Ana C. Lorena, Luís P. F. Garcia, Jens Lehmann, Marcilio C. P. Souto, and Tin K. Ho. How Complex is your classification problem? A survey on measuring classification complexity (V2). (2019) (Cited on page 7). Published in ACM Computing Surveys (CSUR), Volume 52 Issue 5, October 2019, Article No. 107.
- classmethod ft_hubs(N: ndarray, y: ndarray, metric: str = 'gower', p: float = 2.0, radius_frac: Union[int, float] = 0.15, n_jobs: Optional[int] = None, cls_inds: Optional[ndarray] = None, norm_dist_mat: Optional[ndarray] = None, adj_graph: Optional[_construct_graph_from_weighted_adjacency] = None) ndarray [source]
Hub score.
The hub score scores each node by the number of connections it has to other nodes, weighted by the number of connections these neighbors have.
The node hub scores are given by the principal eigenvector of A^T * A, where A is the adjacency matrix of the graph.
The average value of this measure is in [0, 1] range.
- Parameters
- N : np.ndarray
Numerical fitted data.
- y : np.ndarray
Target attribute.
- metric : str, optional
Metric used to calculate the distances between the instances. Check the scipy.spatial.distance.cdist documentation to get a list of all available metrics. Used only if adj_graph is None.
- p : int, optional
Power parameter for the Minkowski metric. When p = 1, this is equivalent to the Manhattan distance (l1); when p = 2, to the Euclidean distance (l2). For arbitrary p, the Minkowski distance (l_p) is used. Used only if adj_graph is None.
- radius_frac : float or int, optional
If int, maximum number of neighbors of the same class for each instance. If float, the maximum number of neighbors is computed as radius_frac * len(N). Used only if adj_graph is None.
- n_jobs : int or None, optional
Number of parallel processes used to compute nearest neighbors. Used only if adj_graph is None.
- cls_inds : np.ndarray, optional
Boolean array which indicates the examples of each class. The rows correspond to each distinct class, and the columns correspond to the instances. Used only if adj_graph is None.
- norm_dist_mat : np.ndarray, optional
Normalized distance matrix.
- adj_graph : igraph.Graph.Weighted_Adjacency, optional
Undirected and weighted adjacency graph for the dataset. Only instances belonging to the same class must be connected. If not provided, it will be computed using metric, p, and radius_frac.
- Returns
- np.ndarray
Complement of the hub score of every node.
References
- 1
Ana C. Lorena, Luís P. F. Garcia, Jens Lehmann, Marcilio C. P. Souto, and Tin K. Ho. How Complex is your classification problem? A survey on measuring classification complexity (V2). (2019) (Cited on page 9). Published in ACM Computing Surveys (CSUR), Volume 52 Issue 5, October 2019, Article No. 107.
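The eigenvector definition can be sketched directly with numpy (the feature itself reports the complement of these scores, and pymfe builds the same-class adjacency graph first; this sketch takes an arbitrary adjacency matrix):

```python
import numpy as np

def hub_scores(adj):
    """Hub scores: principal eigenvector of A.T @ A, rescaled to max 1."""
    eigvals, eigvecs = np.linalg.eigh(adj.T @ adj)
    # eigh returns eigenvalues in ascending order; take the principal one.
    principal = np.abs(eigvecs[:, np.argmax(eigvals)])
    return principal / principal.max()
```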
- classmethod ft_l1(N: ndarray, y: ndarray, ovo_comb: Optional[ndarray] = None, cls_inds: Optional[ndarray] = None, class_freqs: Optional[ndarray] = None, svc_pipeline: Optional[Pipeline] = None, max_iter: Union[int, float] = 100000.0, random_state: Optional[int] = None) ndarray [source]
Sum of error distance by linear programming.
This measure assesses whether the data are linearly separable by computing, for a dataset, the sum of the distances of incorrectly classified examples to a linear boundary used in their classification. If the value of L1 is zero, the problem is linearly separable and can be considered simpler than a problem for which a non-linear boundary is required.
The average value of this measure is in [0, 1] range.
- Parameters
- N : np.ndarray
Numerical fitted data.
- y : np.ndarray
Target attribute.
- ovo_comb : np.ndarray, optional
List of all class OVO combinations, i.e., all combinations of distinct class indices by pairs ([(0, 1), (0, 2), ...]).
- cls_inds : np.ndarray, optional
Boolean array which indicates the examples of each class. The rows correspond to each distinct class, and the columns correspond to the instances.
- class_freqs : np.ndarray, optional
The number of examples in each class. The indices correspond to the classes.
- max_iter : float or int, optional
Maximum number of iterations allowed for the Support Vector Machine model to converge. This parameter can receive float numbers to be compatible with Python scientific notation. Used only if svc_pipeline is None.
- svc_pipeline : sklearn.pipeline.Pipeline, optional
Support Vector Classifier learning pipeline. Traditionally, the pipeline used is data standardization (mean = 0 and variance = 1) followed by the learning model, a Support Vector Classifier with a linear kernel. However, any variation of this pipeline can also be used. Note that this metafeature is formulated using a linear classifier. If this argument is None, the described pipeline (standardization + SVC) is used by default.
- random_state : int, optional
Random seed for dual coordinate descent while fitting the Support Vector Classifier model. Check the sklearn.svm.LinearSVC documentation (random_state parameter) for more information. Used only if svc_pipeline is None.
- Returns
- np.ndarray
Complement of the inverse of the sum of the distances of incorrectly classified instances to a Support Vector Classifier (SVC) hyperplane.
References
- 1
Ana C. Lorena, Luís P. F. Garcia, Jens Lehmann, Marcilio C. P. Souto, and Tin K. Ho. How Complex is your classification problem? A survey on measuring classification complexity (V2). (2019) (Cited on page 9). Published in ACM Computing Surveys (CSUR), Volume 52 Issue 5, October 2019, Article No. 107.
- classmethod ft_l2(N: ndarray, y: ndarray, ovo_comb: Optional[ndarray] = None, cls_inds: Optional[ndarray] = None, svc_pipeline: Optional[Pipeline] = None, max_iter: Union[int, float] = 100000.0, random_state: Optional[int] = None) ndarray [source]
Compute the OVO subsets error rate of linear classifier.
The linear model used is induced by the Support Vector Machine algorithm.
The average value of this measure is in [0, 1] range.
- Parameters
- N : np.ndarray
Numerical fitted data.
- y : np.ndarray
Target attribute.
- ovo_comb : np.ndarray, optional
List of all class OVO combinations, i.e., all combinations of distinct class indices by pairs ([(0, 1), (0, 2), ...]).
- cls_inds : np.ndarray, optional
Boolean array which indicates the examples of each class. The rows correspond to each distinct class, and the columns correspond to the instances.
- svc_pipeline : sklearn.pipeline.Pipeline, optional
Support Vector Classifier learning pipeline. Traditionally, the pipeline used is data standardization (mean = 0 and variance = 1) followed by the learning model, a Support Vector Classifier with a linear kernel. However, any variation of this pipeline can also be used. Note that this metafeature is formulated using a linear classifier. If this argument is None, the described pipeline (standardization + SVC) is used by default.
- max_iter : float or int, optional
Maximum number of iterations allowed for the Support Vector Machine model to converge. This parameter can receive float numbers to be compatible with Python scientific notation. Used only if svc_pipeline is None.
- random_state : int, optional
Random seed for dual coordinate descent while fitting the Support Vector Classifier model. Check the sklearn.svm.LinearSVC documentation (random_state parameter) for more information. Used only if svc_pipeline is None.
- Returns
- np.ndarray
An array with the error rate of the linear classifier for each OVO subset.
References
- 1
Ana C. Lorena, Luís P. F. Garcia, Jens Lehmann, Marcilio C. P. Souto, and Tin K. Ho. How Complex is your classification problem? A survey on measuring classification complexity (V2). (2019) (Cited on page 9). Published in ACM Computing Surveys (CSUR), Volume 52 Issue 5, October 2019, Article No. 107.
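A sketch of the per-pair error-rate idea, with a plain least-squares linear classifier standing in for the standardization + linear SVC pipeline (linear_ovo_error is a hypothetical helper, not the pymfe implementation):

```python
import numpy as np

def linear_ovo_error(N, y, cls_a, cls_b):
    """Training error of a least-squares linear classifier on one OVO pair."""
    mask = (y == cls_a) | (y == cls_b)
    # Add a bias column and encode the two classes as +1 / -1 targets.
    X = np.column_stack([N[mask], np.ones(mask.sum())])
    t = np.where(y[mask] == cls_a, 1.0, -1.0)
    w, *_ = np.linalg.lstsq(X, t, rcond=None)
    pred = np.sign(X @ w)
    return float(np.mean(pred != t))
```

Averaging this error over all OVO pairs mirrors the structure of the measure; the actual feature fits a linear SVC instead of a least-squares model.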
- classmethod ft_l3(N: ndarray, y: ndarray, ovo_comb: Optional[ndarray] = None, cls_inds: Optional[ndarray] = None, svc_pipeline: Optional[Pipeline] = None, max_iter: Union[int, float] = 100000.0, random_state: Optional[int] = None) ndarray [source]
Non-Linearity of a linear classifier.
This index is sensitive to how the data from a class are distributed in the border regions and also to how much the convex hulls which delimit the classes overlap. In particular, it detects the presence of concavities in the class boundaries. Higher values indicate greater complexity.
The average value of this measure is in [0, 1] range.
- Parameters
- N : np.ndarray
Numerical fitted data.
- y : np.ndarray
Target attribute.
- ovo_comb : np.ndarray, optional
List of all class OVO combinations, i.e., all combinations of distinct class indices by pairs ([(0, 1), (0, 2), ...]).
- cls_inds : np.ndarray, optional
Boolean array which indicates the examples of each class. The rows correspond to each distinct class, and the columns correspond to the instances.
- svc_pipeline : sklearn.pipeline.Pipeline, optional
Support Vector Classifier learning pipeline. Traditionally, the pipeline used is data standardization (mean = 0 and variance = 1) followed by the learning model, a Support Vector Classifier with a linear kernel. However, any variation of this pipeline can also be used. Note that this metafeature is formulated using a linear classifier. If this argument is None, the described pipeline (standardization + SVC) is used by default.
- max_iter : float or int, optional
Maximum number of iterations allowed for the Support Vector Machine model to converge. This parameter can receive float numbers to be compatible with Python scientific notation. Used only if svc_pipeline is None.
- random_state : int, optional
Random seed for dual coordinate descent while fitting the Support Vector Classifier model. Check the sklearn.svm.LinearSVC documentation (random_state parameter) for more information. Used only if svc_pipeline is None.
- Returns
- np.ndarray
Zero-one losses of a Support Vector Classifier for a randomly interpolated dataset built from the original instances. The classes are separated in an OVO (One-Versus-One) fashion.
References
- 1
Ana C. Lorena, Luís P. F. Garcia, Jens Lehmann, Marcilio C. P. Souto, and Tin K. Ho. How Complex is your classification problem? A survey on measuring classification complexity (V2). (2019) (Cited on page 9). Published in ACM Computing Surveys (CSUR), Volume 52 Issue 5, October 2019, Article No. 107.
- classmethod ft_lsc(N: ndarray, y: ndarray, metric: str = 'gower', p: Union[int, float] = 2, cls_inds: Optional[ndarray] = None, N_scaled: Optional[ndarray] = None, norm_dist_mat: Optional[ndarray] = None, nearest_enemy_dist: Optional[ndarray] = None) float [source]
Local set average cardinality.
The Local-Set (LS) of an example x_i in a dataset N is defined as the set of points from N whose distance to x_i is smaller than the distance from x_i to its nearest enemy (the nearest instance belonging to a class distinct from that of x_i). The cardinality of the LS of an example indicates its proximity to the decision boundary and the narrowness of the gap between the classes.
This measure is in the [0, 1 - 1/n] range, where n is the number of instances in N.
- Parameters
- N : np.ndarray
Numerical fitted data.
- y : np.ndarray
Target attribute.
- metric : str, optional
Metric used to calculate the distances between the instances. Check the scipy.spatial.distance.cdist documentation to get a list of all available metrics. This argument is used only if norm_dist_mat is None.
- p : int, optional
Power parameter for the Minkowski metric. When p = 1, this is equivalent to the Manhattan distance (l1); when p = 2, to the Euclidean distance (l2). For arbitrary p, the Minkowski distance (l_p) is used. Used only if norm_dist_mat is None.
- cls_inds : np.ndarray, optional
Boolean array which indicates the examples of each class. The rows correspond to each distinct class, and the columns correspond to the instances. Used only if the argument nearest_enemy_dist is None.
- N_scaled : np.ndarray, optional
Numerical data N with each feature normalized into the [0, 1] range. Used only if norm_dist_mat is None. Used to take advantage of precomputations.
- norm_dist_mat : np.ndarray, optional
Square matrix with the pairwise distances between each instance in N_scaled, i.e., between the normalized instances. Used to take advantage of precomputations.
- nearest_enemy_dist : np.ndarray, optional
Distance of each instance to its nearest enemy (an instance of a distinct class).
- Returns
- float
Local set average cardinality.
References
- 1
Ana C. Lorena, Luís P. F. Garcia, Jens Lehmann, Marcilio C. P. Souto, and Tin K. Ho. How Complex is your classification problem? A survey on measuring classification complexity (V2). (2019) (Cited on page 15). Published in ACM Computing Surveys (CSUR), Volume 52 Issue 5, October 2019, Article No. 107.
- 2
Enrique Leyva, Antonio González, and Raúl Pérez. A set of complexity measures designed for applying meta-learning to instance selection. IEEE Transactions on Knowledge and Data Engineering, 27(2):354–367, 2014.
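A sketch of the local set cardinality under the definition above, assuming Euclidean distances on already-scaled data and LSC = 1 - (1/n^2) * sum_i |LS(x_i)|:

```python
import numpy as np

def local_set_avg_cardinality(N, y):
    """LSC measure: 1 - (1/n^2) * sum_i |LS(x_i)|, in [0, 1 - 1/n]."""
    # Pairwise Euclidean distances between (already scaled) instances.
    dist = np.linalg.norm(N[:, None, :] - N[None, :, :], axis=-1)
    n = y.size
    total = 0
    for i in range(n):
        # Distance from x_i to its nearest enemy (different class).
        enemy_dist = dist[i, y != y[i]].min()
        # Local set: instances strictly closer to x_i than its nearest
        # enemy (x_i itself is included, so each LS has cardinality >= 1).
        total += np.sum(dist[i] < enemy_dist)
    return float(1.0 - total / n ** 2)
```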
- classmethod ft_n1(N: ndarray, y: ndarray, metric: str = 'gower', p: Union[int, float] = 2, N_scaled: Optional[ndarray] = None, norm_dist_mat: Optional[ndarray] = None) float [source]
Compute the fraction of borderline points.
This measure is in [0, 1] range.
- Parameters
- N : np.ndarray
Numerical fitted data.
- y : np.ndarray
Target attribute.
- metric : str, optional
Metric used to calculate the distances between the instances. Check the scipy.spatial.distance.cdist documentation to get a list of all available metrics. This argument is used only if norm_dist_mat is None.
- p : int, optional
Power parameter for the Minkowski metric. When p = 1, this is equivalent to the Manhattan distance (l1); when p = 2, to the Euclidean distance (l2). For arbitrary p, the Minkowski distance (l_p) is used. Used only if norm_dist_mat is None.
- N_scaled : np.ndarray, optional
Numerical data N with each feature normalized into the [0, 1] range. Used only if norm_dist_mat is None. Used to take advantage of precomputations.
- norm_dist_mat : np.ndarray, optional
Square matrix with the pairwise distances between each instance in N_scaled, i.e., between the normalized instances. Used to take advantage of precomputations.
- Returns
- float
Fraction of borderline points.
References
- 1
Ana C. Lorena, Luís P. F. Garcia, Jens Lehmann, Marcilio C. P. Souto, and Tin K. Ho. How Complex is your classification problem? A survey on measuring classification complexity (V2). (2019) (Cited on page 9-10). Published in ACM Computing Surveys (CSUR), Volume 52 Issue 5, October 2019, Article No. 107.
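In the cited survey (Lorena et al., 2019), N1 is computed by building a Minimum Spanning Tree over the instances and counting the fraction of vertices incident to an edge that connects different classes. A self-contained sketch using Prim's algorithm and Euclidean distances (pymfe normalizes the data and supports other metrics):

```python
import numpy as np

def n1_borderline_fraction(N, y):
    """Fraction of instances sharing an MST edge with another class."""
    dist = np.linalg.norm(N[:, None, :] - N[None, :, :], axis=-1)
    n = y.size
    # Prim's algorithm over the complete distance graph.
    in_tree = np.zeros(n, dtype=bool)
    in_tree[0] = True
    borderline = np.zeros(n, dtype=bool)
    best = dist[0].copy()              # cheapest edge into the tree so far
    parent = np.zeros(n, dtype=int)    # tree endpoint of that cheapest edge
    for _ in range(n - 1):
        cand = np.where(in_tree, np.inf, best)
        v = int(np.argmin(cand))       # next vertex joining the tree
        in_tree[v] = True
        if y[parent[v]] != y[v]:       # MST edge connects two classes:
            borderline[parent[v]] = borderline[v] = True
        closer = dist[v] < best
        best = np.where(closer, dist[v], best)
        parent = np.where(closer, v, parent)
    return float(borderline.mean())
```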
- classmethod ft_n2(N: ndarray, y: ndarray, metric: str = 'gower', p: Union[int, float] = 2, class_freqs: Optional[ndarray] = None, cls_inds: Optional[ndarray] = None, N_scaled: Optional[ndarray] = None, norm_dist_mat: Optional[ndarray] = None) ndarray [source]
Ratio of intra and extra class nearest neighbor distance.
- This measure computes the ratio of two sums:
The sum of the distances between each example and its closest neighbor from the same class (intra-class); and
The sum of the distances between each example and its closest neighbor from another class (extra-class).
The average value of this measure is in [0, 1] range.
- Parameters
- N
np.ndarray
Numerical fitted data.
- y
np.ndarray
Target attribute.
- metric
str, optional
Metric used to calculate the distances between the instances. Check the scipy.spatial.distance.cdist documentation to get a list of all available metrics. This argument is used only if norm_dist_mat is None.
- p
int or float, optional
Power parameter for the Minkowski metric. When p = 1, this is equivalent to using the Manhattan distance (l1), and p = 2 to the Euclidean distance (l2). For arbitrary p, the Minkowski distance (l_p) is used. Used only if norm_dist_mat is None.
- class_freqs
np.ndarray, optional
The number of examples in each class. The indices correspond to the classes.
- cls_inds
np.ndarray, optional
Boolean array which indicates the examples of each class. The rows correspond to the distinct classes, and the columns correspond to the instances.
- N_scaled
np.ndarray, optional
Numerical data N with each feature normalized into the [0, 1] range. Used only if norm_dist_mat is None. Used to take advantage of precomputations.
- norm_dist_mat
np.ndarray, optional
Square matrix with the pairwise distances between each instance in N_scaled, i.e., between the normalized instances. Used to take advantage of precomputations.
- Returns
np.ndarray
Complement of the inverse of the intra and extra class variance.
References
- 1
Ana C. Lorena, Luís P. F. Garcia, Jens Lehmann, Marcilio C. P. Souto, and Tin K. Ho. How Complex is your classification problem? A survey on measuring classification complexity (V2). (2019) (Cited on page 9). Published in ACM Computing Surveys (CSUR), Volume 52 Issue 5, October 2019, Article No. 107.
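The two sums above can be sketched as follows. This is a hypothetical, aggregated variant (the name n2_sketch is ours) that returns the scalar r / (1 + r) from the survey, whereas pymfe returns per-instance values; it also uses the Euclidean distance instead of the default 'gower' metric.

```python
import numpy as np
from scipy.spatial.distance import cdist

def n2_sketch(N, y):
    """Illustrative aggregated N2 = r / (1 + r), where r is the sum of
    intra-class nearest-neighbor distances over the extra-class sum."""
    N = np.asarray(N, dtype=float)
    y = np.asarray(y)
    dist = cdist(N, N)               # Euclidean; pymfe defaults to 'gower'
    np.fill_diagonal(dist, np.inf)   # an instance is not its own neighbor
    same = y[:, None] == y[None, :]
    intra = np.where(same, dist, np.inf).min(axis=1)   # closest same-class
    extra = np.where(~same, dist, np.inf).min(axis=1)  # closest other-class
    r = intra.sum() / extra.sum()
    return r / (1.0 + r)             # in [0, 1); near 0 for separable data
```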
- classmethod ft_n3(N: ndarray, y: ndarray, metric: str = 'gower', p: Union[int, float] = 2, N_scaled: Optional[ndarray] = None, norm_dist_mat: Optional[ndarray] = None) ndarray [source]
Error rate of the nearest neighbor classifier.
The N3 measure refers to the error rate of a 1-NN classifier that is estimated using a leave-one-out cross-validation procedure.
The average value of this measure is in [0, 1] range.
- Parameters
- N
np.ndarray
Numerical fitted data.
- y
np.ndarray
Target attribute.
- metric
str, optional
Metric used to calculate the distances between the instances. Check the scipy.spatial.distance.cdist documentation to get a list of all available metrics. This argument is used only if norm_dist_mat is None.
- p
int or float, optional
Power parameter for the Minkowski metric. When p = 1, this is equivalent to using the Manhattan distance (l1), and p = 2 to the Euclidean distance (l2). For arbitrary p, the Minkowski distance (l_p) is used.
- N_scaled
np.ndarray, optional
Numerical data N with each feature normalized into the [0, 1] range. Used only if norm_dist_mat is None. Used to take advantage of precomputations.
- norm_dist_mat
np.ndarray, optional
Square matrix with the pairwise distances between each instance in N_scaled, i.e., between the normalized instances. Used to take advantage of precomputations.
- Returns
np.ndarray
Binary array of misclassification of a 1-NN classifier.
References
- 1
Ana C. Lorena, Luís P. F. Garcia, Jens Lehmann, Marcilio C. P. Souto, and Tin K. Ho. How Complex is your classification problem? A survey on measuring classification complexity (V2). (2019) (Cited on page 9). Published in ACM Computing Surveys (CSUR), Volume 52 Issue 5, October 2019, Article No. 107.
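The leave-one-out 1-NN procedure behind N3 can be sketched in a few lines. This is a hypothetical reimplementation (the name n3_sketch is ours) using the Euclidean distance instead of the default 'gower' metric; like pymfe, it returns the binary misclassification array rather than its mean.

```python
import numpy as np
from scipy.spatial.distance import cdist

def n3_sketch(N, y):
    """Illustrative N3: leave-one-out 1-NN misclassification array."""
    N = np.asarray(N, dtype=float)
    y = np.asarray(y)
    dist = cdist(N, N)               # Euclidean; pymfe defaults to 'gower'
    np.fill_diagonal(dist, np.inf)   # leave-one-out: exclude the point itself
    nearest = dist.argmin(axis=1)    # index of each instance's 1-NN
    return (y[nearest] != y).astype(int)
```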
- classmethod ft_n4(N: ndarray, y: ndarray, metric: str = 'gower', p: Union[int, float] = 2, n_neighbors: int = 1, random_state: Optional[int] = None, cls_inds: Optional[ndarray] = None, N_scaled: Optional[ndarray] = None, norm_dist_mat: Optional[ndarray] = None, orig_dist_mat_min: Optional[float] = None, orig_dist_mat_ptp: Optional[float] = None) ndarray [source]
Compute the non-linearity of the k-NN Classifier.
The average value of this measure is in [0, 1] range.
- Parameters
- N
np.ndarray
Numerical fitted data.
- y
np.ndarray
Target attribute.
- metric
str, optional
The distance metric used in the internal kNN classifier. Check the scipy.spatial.distance.cdist documentation to get a list of all available metrics. Used only if norm_dist_mat is None.
- p
int or float, optional
Power parameter for the Minkowski metric. When p = 1, this is equivalent to using the Manhattan distance (l1), and p = 2 to the Euclidean distance (l2). For arbitrary p, the Minkowski distance (l_p) is used. Used only if norm_dist_mat is None.
- n_neighbors
int, optional
Number of neighbors used by the Nearest Neighbors classifier.
- random_state
int, optional
If given, set the random seed before computing the randomized data interpolation.
- cls_inds
np.ndarray, optional
Boolean array which indicates the examples of each class. The rows correspond to the distinct classes, and the columns correspond to the instances. Used to take advantage of precomputations.
- N_scaled
np.ndarray, optional
Numerical data N with each feature normalized into the [0, 1] range. Used only if norm_dist_mat is None. Used to take advantage of precomputations.
- norm_dist_mat
np.ndarray, optional
Square matrix with the pairwise distances between each instance in N_scaled, i.e., between the normalized instances. Used to take advantage of precomputations. Used if and only if orig_dist_mat_min AND orig_dist_mat_ptp are also given (non None).
- orig_dist_mat_min
float, optional
Minimal distance between the original instances in N.
- orig_dist_mat_ptp
float, optional
Range (max - min) of the distances between the original instances in N.
- Returns
np.ndarray
Misclassifications of the k-NN classifier in the interpolated dataset.
References
- 1
Ana C. Lorena, Luís P. F. Garcia, Jens Lehmann, Marcilio C. P. Souto, and Tin K. Ho. How Complex is your classification problem? A survey on measuring classification complexity (V2). (2019) (Cited on page 9-11). Published in ACM Computing Surveys (CSUR), Volume 52 Issue 5, October 2019, Article No. 107.
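N4 works by interpolating random same-class pairs to build a synthetic test set and then measuring how often a nearest-neighbor classifier trained on the original data misclassifies the synthetic points. The sketch below is a hypothetical, simplified version (the name n4_sketch is ours): it fixes n_neighbors = 1, uses the Euclidean distance instead of the default 'gower' metric, and skips the precomputation arguments.

```python
import numpy as np
from scipy.spatial.distance import cdist

def n4_sketch(N, y, random_state=0):
    """Illustrative N4: misclassifications of a 1-NN (trained on the
    original data) on points interpolated between same-class pairs."""
    N = np.asarray(N, dtype=float)
    y = np.asarray(y)
    rng = np.random.default_rng(random_state)
    synth_X, synth_y = [], []
    for cls in np.unique(y):
        idx = np.flatnonzero(y == cls)
        a = rng.choice(idx, size=idx.size)      # random same-class pairs
        b = rng.choice(idx, size=idx.size)
        t = rng.random((idx.size, 1))           # random interpolation factor
        synth_X.append(N[a] + t * (N[b] - N[a]))
        synth_y.append(np.full(idx.size, cls))
    synth_X = np.vstack(synth_X)
    synth_y = np.concatenate(synth_y)
    nearest = cdist(synth_X, N).argmin(axis=1)  # 1-NN prediction
    return (y[nearest] != synth_y).astype(int)
```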
- classmethod ft_t1(N: ndarray, y: ndarray, metric: str = 'gower', p: Union[int, float] = 2, cls_inds: Optional[ndarray] = None, N_scaled: Optional[ndarray] = None, norm_dist_mat: Optional[ndarray] = None, orig_dist_mat_min: Optional[float] = None, orig_dist_mat_ptp: Optional[float] = None) ndarray [source]
Fraction of hyperspheres covering data.
This measure uses a process that builds hyperspheres centered at each one of the examples. In this implementation, we stop the growth of the hypersphere when the hyperspheres centered at two points of opposite classes just start to touch.
Once the radii of all hyperspheres are found, a post-processing step can be applied to verify which hyperspheres must be absorbed (all hyperspheres lying completely within a larger hypersphere are absorbed).
This measure is in [0, 1] range.
- Parameters
- N
np.ndarray
Numerical fitted data.
- y
np.ndarray
Target attribute.
- metric
str, optional
Metric used to calculate the distances between the instances. Check the scipy.spatial.distance.cdist documentation to get a list of all available metrics. This argument is used only if norm_dist_mat is None.
- p
int or float, optional
Power parameter for the Minkowski metric. When p = 1, this is equivalent to using the Manhattan distance (l1), and p = 2 to the Euclidean distance (l2). For arbitrary p, the Minkowski distance (l_p) is used. Used only if norm_dist_mat is None.
- cls_inds
np.ndarray, optional
Boolean array which indicates the examples of each class. The rows correspond to the distinct classes, and the columns correspond to the instances. Used only if the arguments nearest_enemy_dist or nearest_enemy_ind are None.
- N_scaled
np.ndarray, optional
Numerical data N with each feature normalized into the [0, 1] range. Used only if norm_dist_mat is None. Used to take advantage of precomputations.
- norm_dist_mat
np.ndarray, optional
Square matrix with the pairwise distances between each instance in N_scaled, i.e., between the normalized instances. Used to take advantage of precomputations. Used if and only if orig_dist_mat_min and orig_dist_mat_ptp are also given (non None).
- orig_dist_mat_min
float, optional
Minimal distance between the original instances in N.
- orig_dist_mat_ptp
float, optional
Range (max - min) of the distances between the original instances in N.
- Returns
np.ndarray
Array with the fraction of instances inside each remaining hypersphere.
References
- 1
Ana C. Lorena, Luís P. F. Garcia, Jens Lehmann, Marcilio C. P. Souto, and Tin K. Ho. How Complex is your classification problem? A survey on measuring classification complexity (V2). (2019) (Cited on page 9). Published in ACM Computing Surveys (CSUR), Volume 52 Issue 5, October 2019, Article No. 107.
- 2
Tin K Ho and Mitra Basu. Complexity measures of supervised classification problems. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(3):289–300, 2002.
- classmethod ft_t2(N: ndarray) float [source]
Compute the average number of features per dimension.
This measure is in the (0, m] range, where m is the number of features in N.
- Parameters
- N
np.ndarray
Numeric attributes from fitted data.
- Returns
- float
Average number of features per dimension.
References
- 1
Ana C. Lorena, Luís P. F. Garcia, Jens Lehmann, Marcilio C. P. Souto, and Tin K. Ho. How Complex is your classification problem? A survey on measuring classification complexity (V2). (2019) (Cited on page 15). Published in ACM Computing Surveys (CSUR), Volume 52 Issue 5, October 2019, Article No. 107.
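Per the survey, T2 divides the dimensionality m by the number of points n, which is why the range is (0, m]. A minimal sketch (the name t2_sketch is ours) assuming that formulation:

```python
import numpy as np

def t2_sketch(N):
    """Illustrative T2: number of features divided by number of instances."""
    N = np.asarray(N)
    return N.shape[1] / N.shape[0]
```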
- classmethod ft_t3(N: ndarray, num_attr_pca: Optional[int] = None, random_state: Optional[int] = None) float [source]
Compute the average number of PCA dimensions per points.
This measure is in the (0, m] range, where m is the number of features in N.
- Parameters
- N
np.ndarray
Numerical fitted data.
- num_attr_pca
int, optional
Number of features after PCA where a fraction of at least 0.95 of the data variance is explained by the selected components.
- random_state
int, optional
If the fitted data is huge and the number of principal components to be kept is low, then the PCA analysis is done using a randomized strategy for efficiency. This random seed keeps the results replicable. Check the sklearn.decomposition.PCA documentation for more information.
- Returns
- float
Average number of PCA dimensions (explaining at least 95% of the data variance) per points.
References
- 1
Ana C. Lorena, Luís P. F. Garcia, Jens Lehmann, Marcilio C. P. Souto, and Tin K. Ho. How Complex is your classification problem? A survey on measuring classification complexity (V2). (2019) (Cited on page 15). Published in ACM Computing Surveys (CSUR), Volume 52 Issue 5, October 2019, Article No. 107.
- classmethod ft_t4(N: ndarray, num_attr_pca: Optional[int] = None, random_state: Optional[int] = None) float [source]
Compute the ratio of the PCA dimension to the original dimension.
The components kept after the PCA explain at least 95% of the data variance.
This measure is in [0, 1] range.
- Parameters
- N
np.ndarray
Numerical fitted data.
- num_attr_pca
int, optional
Number of features after PCA where a fraction of at least 0.95 of the data variance is explained by the selected components.
- random_state
int, optional
If the fitted data is huge and the number of principal components to be kept is low, then the PCA analysis is done using a randomized strategy for efficiency. This random seed keeps the results replicable. Check the sklearn.decomposition.PCA documentation for more information.
- Returns
- float
Ratio of the PCA dimension (explaining at least 95% of the data variance) to the original dimension.
References
- 1
Ana C. Lorena, Luís P. F. Garcia, Jens Lehmann, Marcilio C. P. Souto, and Tin K. Ho. How Complex is your classification problem? A survey on measuring classification complexity (V2). (2019) (Cited on page 15). Published in ACM Computing Surveys (CSUR), Volume 52 Issue 5, October 2019, Article No. 107.
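T4 can be illustrated with a plain SVD-based PCA: find the smallest number of principal components whose cumulative explained variance reaches 95%, then divide by the original dimensionality. This is a hypothetical sketch (the name t4_sketch is ours), not pymfe's exact implementation, which relies on sklearn.decomposition.PCA.

```python
import numpy as np

def t4_sketch(N, var_threshold=0.95):
    """Illustrative T4: fraction of the original dimensions needed to
    retain at least `var_threshold` of the data variance (via PCA/SVD)."""
    N = np.asarray(N, dtype=float)
    Nc = N - N.mean(axis=0)                      # center the data
    s = np.linalg.svd(Nc, compute_uv=False)      # singular values
    var_ratio = (s ** 2) / np.sum(s ** 2)        # explained variance ratios
    # Smallest k whose cumulative explained variance reaches the threshold.
    k = int(np.searchsorted(np.cumsum(var_ratio), var_threshold)) + 1
    return k / N.shape[1]
```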
- classmethod precompute_adjacency_graph(N: ndarray, y: Optional[ndarray] = None, metric: str = 'gower', p: float = 2.0, n_jobs: Optional[int] = None, **kwargs) Dict[str, ndarray] [source]
Precompute values related to each instance's nearest enemy.
An instance's nearest enemy is its nearest instance belonging to a different class.
- Parameters
- N
np.ndarray
Numerical fitted data.
- y
np.ndarray
Target attribute.
- **kwargs
Additional arguments. May include values already precomputed by other precomputation methods, which can help speed up this precomputation.
- Returns
dict
With the following precomputed items:
- classmethod precompute_complexity(y: Optional[ndarray] = None, **kwargs) Dict[str, Any] [source]
Precompute some useful things to support feature-based measures.
- Parameters
- y
np.ndarray, optional
Target attribute.
- **kwargs
Additional arguments. May include values already precomputed by other precomputation methods, which can help speed up this precomputation.
- Returns
dict
- With the following precomputed items:
ovo_comb (list): list of all OVO class combinations, i.e., all pairs of distinct class indices ([(0, 1), (0, 2), ...]).
cls_inds (np.ndarray): boolean array which indicates whether each example belongs to each class. The rows correspond to the distinct classes, and the columns to the instances.
classes (np.ndarray): distinct classes in the fitted target attribute.
class_freqs (np.ndarray): the number of examples in each class. The indices correspond to the classes.
- classmethod precompute_complexity_svm(y: Optional[ndarray] = None, max_iter: Union[int, float] = 100000.0, random_state: Optional[int] = None, **kwargs) Dict[str, Pipeline] [source]
Initialize a Support Vector Classifier pipeline (with data standardization).
- Parameters
- max_iter
float or int, optional
Maximum number of iterations allowed for the support vector machine model to converge. This parameter accepts floats so that values written in Python scientific notation (e.g., 1e5) can be used.
- random_state
int, optional
Random seed for the dual coordinate descent used while fitting the Support Vector Classifier model. Check the sklearn.svm.LinearSVC documentation (random_state parameter) for more information.
- **kwargs
Additional arguments. May include values already precomputed by other precomputation methods, which can help speed up this precomputation.
- Returns
dict
- With the following precomputed items:
svc_pipeline (sklearn.pipeline.Pipeline): support vector classifier learning pipeline, with data standardization (mean = 0 and variance = 1) before the learning model.
- classmethod precompute_nearest_enemy(N: ndarray, y: Optional[ndarray] = None, metric: str = 'gower', p: Union[int, float] = 2, **kwargs) Dict[str, ndarray] [source]
Precompute values related to each instance's nearest enemy.
An instance's nearest enemy is its nearest instance belonging to a different class.
- Parameters
- N
np.ndarray
Numerical fitted data.
- y
np.ndarray
Target attribute.
- metric
str, optional
Metric used to calculate the distances between the instances. Check the scipy.spatial.distance.cdist documentation to get a list of all available metrics.
- p
int or float, optional
Power parameter for the Minkowski metric. When p = 1, this is equivalent to using the Manhattan distance (l1), and p = 2 to the Euclidean distance (l2). For arbitrary p, the Minkowski distance (l_p) is used.
- **kwargs
Additional arguments. May include values already precomputed by other precomputation methods, which can help speed up this precomputation.
- Returns
dict
- With the following precomputed items:
nearest_enemy_dist (np.ndarray): distance of each instance to its nearest enemy (instance of a distinct class).
nearest_enemy_ind (np.ndarray): index of the nearest enemy (instance of a distinct class) for each instance.
This precomputation method also depends on values precomputed by other precomputation methods, precompute_complexity and precompute_norm_dist_mat. Therefore, the return values of those methods may also be returned in case they have not been called before. Check the documentation of each method for a precise description of the additional values that may be returned by this method.
- classmethod precompute_norm_dist_mat(N: ndarray, metric: str = 'gower', p: Union[int, float] = 2, **kwargs) Dict[str, ndarray] [source]
Precompute the normalized N and the pairwise distances among instances.
- Parameters
- N
np.ndarray
Numerical fitted data.
- metric
str, optional
Metric used to calculate the distances between the instances. Check the scipy.spatial.distance.cdist documentation to get a list of all available metrics.
- p
int or float, optional
Power parameter for the Minkowski metric. When p = 1, this is equivalent to using the Manhattan distance (l1), and p = 2 to the Euclidean distance (l2). For arbitrary p, the Minkowski distance (l_p) is used.
- **kwargs
Additional arguments. May include values already precomputed by other precomputation methods, which can help speed up this precomputation.
- Returns
dict
- With the following precomputed items:
N_scaled (np.ndarray): numerical data N with each feature normalized into the [0, 1] range. Used only if norm_dist_mat is None.
norm_dist_mat (np.ndarray): square matrix with the normalized pairwise distances between each instance in N_scaled, i.e., between the normalized instances. (Note that this matrix holds normalized pairwise distances between already normalized instances, i.e., two normalization processes are involved.)
orig_dist_mat_min (float): minimal value of the original pairwise distance matrix. Can be used to preprocess test data before predictions.
orig_dist_mat_ptp (float): range (max - min) of the original pairwise distance matrix. Can be used to preprocess test data before predictions.
- classmethod precompute_pca_tx(N: ndarray, tx_n_components: float = 0.95, random_state: Optional[int] = None, **kwargs) Dict[str, int] [source]
Precompute PCA to support dimensionality measures.
- Parameters
- N
np.ndarray
Numerical fitted data.
- tx_n_components
float, optional
Specifies the number of components such that the amount of variance that needs to be explained is greater than the percentage specified by tx_n_components. The PCA is computed using N.
- random_state
int, optional
If the fitted data is huge and the number of principal components to be kept is low, then the PCA analysis is done using a randomized strategy for efficiency. This random seed keeps the results replicable. Check the sklearn.decomposition.PCA documentation for more information.
- **kwargs
Additional arguments. May include values already precomputed by other precomputation methods, which can help speed up this precomputation.
- Returns
dict
- With the following precomputed items:
num_attr_pca (int): number of features after PCA analysis with at least tx_n_components fraction of the data variance explained by the selected principal components.