pymfe.clustering.MFEClustering

class pymfe.clustering.MFEClustering

Keep methods for metafeatures of Clustering group.

The convention adopted for metafeature extraction methods is to start their names with the ft_ prefix, which allows automatic method detection. This prefix is predefined within the _internal module.

All method signatures follow the conventions and restrictions listed below:

  1. For independent attribute data, X means every type of attribute, N means numeric attributes only, and C means categorical attributes only. Note that the categorical attribute sets of X and C, and the numerical attribute sets of X and N, may differ because of data transformations performed while fitting data into the MFE model. These transformations are enabled by the transform_num and transform_cat arguments of fit (MFE method), respectively.

  2. Only arguments in the MFE _custom_args_ft attribute (set up inside the fit method) are allowed to be required method arguments. All other arguments must be strictly optional (i.e., have a predefined default value).

  3. The initial assumption is that the user can change any optional argument, without any prior verification of the argument's value or type, via the kwargs argument of the extract method of the MFE class.

  4. The return value of all feature extraction methods should be a single value or a generic List (preferably a np.ndarray) type with numeric values.

A second kind of method is also detected automatically: methods whose names start with the precompute_ prefix. These methods run automatically while fitting data into an MFE model, and their purpose is to precompute values shared by more than one feature extraction method. This strategy trades higher memory consumption for faster feature extraction. Their return value must always be a dictionary whose keys are possible extra arguments for both feature extraction methods and other precomputation methods. Note that precomputed values are shared among all valid feature extraction modules (e.g., class_freqs computed in module statistical can freely be used by any precomputation or feature extraction method of module landmarking).
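The two naming conventions can be illustrated with a small sketch. The class ToyGroup and the helper find_methods below are hypothetical and are not part of pymfe. The sketch only assumes that detection works by inspecting method names, which may differ from what the _internal module actually does.

```python
import inspect


class ToyGroup:
    """Hypothetical metafeature group used only for illustration."""

    @classmethod
    def precompute_class_freqs(cls, y=None, **kwargs):
        # Precomputation methods return a dict of possible extra arguments.
        if y is None or "class_freqs" in kwargs:
            return {}
        freqs = {}
        for label in y:
            freqs[label] = freqs.get(label, 0) + 1
        return {"class_freqs": freqs}

    @classmethod
    def ft_n_classes(cls, y, class_freqs=None):
        # Feature extraction methods return a single numeric value (or array).
        if class_freqs is None:
            class_freqs = cls.precompute_class_freqs(y)["class_freqs"]
        return len(class_freqs)


def find_methods(group, prefix):
    """Return the sorted names of methods in `group` that start with `prefix`."""
    members = inspect.getmembers(group, predicate=inspect.ismethod)
    return sorted(name for name, _ in members if name.startswith(prefix))


y = ["a", "b", "a", "c"]

# Run every precomputation first, sharing earlier results with later ones.
precomputed = {}
for name in find_methods(ToyGroup, "precompute_"):
    precomputed.update(getattr(ToyGroup, name)(y=y, **precomputed))

# Then run every feature extraction method with the precomputed values.
features = {
    name: getattr(ToyGroup, name)(y, **precomputed)
    for name in find_methods(ToyGroup, "ft_")
}
```

Here the precomputed class_freqs dictionary is passed to ft_n_classes as an optional argument, so the method does not need to recount the classes.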

__init__(*args, **kwargs)

Methods

ft_ch(N, y)

Compute the Calinski and Harabasz index.

ft_int(N, y[, dist_metric, cls_inds, ...])

Compute the INT index.

ft_nre(y[, class_freqs])

Compute the normalized relative entropy.

ft_pb(N, y[, dist_metric])

Compute the Pearson correlation between class matching and instance distances.

ft_sc(y[, size, normalize, class_freqs])

Compute the number of clusters with size smaller than a given size.

ft_sil(N, y[, dist_metric, sample_frac, ...])

Compute the mean silhouette value.

ft_vdb(N, y)

Compute the Davies and Bouldin Index.

ft_vdu(N, y[, dist_metric, cls_inds, ...])

Compute the Dunn Index.

precompute_class_representatives(N[, y, ...])

Precomputations related to cluster representative instances.

precompute_clustering_class([y])

Precompute distinct classes and their frequencies from y.

precompute_group_distances(N[, y, ...])

Precompute distance metrics between instances.

precompute_nearest_neighbors(N[, y, ...])

Precompute the n_neighbors Nearest Neighbors of every instance.

classmethod ft_ch(N: ndarray, y: ndarray) -> float

Compute the Calinski and Harabasz index.

Check sklearn.metrics.calinski_harabasz_score documentation for more information.

Parameters
N : np.ndarray

Attributes from fitted data.

y : np.ndarray

Instance cluster index (or target attribute).

Returns
float

Calinski-Harabasz index.

References

[1] T. Calinski, J. Harabasz, A dendrite method for cluster analysis, Commun. Stat. Theory Methods 3 (1) (1974) 1–27.
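As an illustration of what this index measures, the sketch below computes the ratio of between-cluster dispersion to within-cluster dispersion using NumPy only. It follows the usual textbook definition and is not the pymfe implementation, which defers to sklearn.metrics.calinski_harabasz_score. The function name calinski_harabasz_sketch is hypothetical.

```python
import numpy as np


def calinski_harabasz_sketch(N, y):
    """Ratio of between-cluster to within-cluster dispersion (textbook form)."""
    N = np.asarray(N, dtype=float)
    y = np.asarray(y)
    classes = np.unique(y)
    n_inst, n_cls = N.shape[0], classes.size
    overall_mean = N.mean(axis=0)

    between = 0.0
    within = 0.0
    for c in classes:
        members = N[y == c]
        centroid = members.mean(axis=0)
        # Weighted squared distance from the cluster centroid to the global mean.
        between += members.shape[0] * np.sum((centroid - overall_mean) ** 2)
        # Squared distances from the members to their own centroid.
        within += np.sum((members - centroid) ** 2)

    return float((between / (n_cls - 1)) / (within / (n_inst - n_cls)))


N = np.array([[0.0, 0.0], [0.0, 1.0], [4.0, 0.0], [4.0, 1.0]])
y = np.array([0, 0, 1, 1])
ch_value = calinski_harabasz_sketch(N, y)
```

Higher values indicate compact clusters that lie far apart from each other.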

classmethod ft_int(N: ndarray, y: ndarray, dist_metric: str = 'euclidean', cls_inds: Optional[ndarray] = None, classes: Optional[ndarray] = None, pairwise_norm_intercls_dist: Optional[List[ndarray]] = None) -> float

Compute the INT index.

Metric range is from 0 (inclusive) to infinity.

Parameters
N : np.ndarray

Attributes from fitted data.

y : np.ndarray

Instance cluster index (or target attribute).

dist_metric : str, optional

The distance metric used to calculate the distances between instances. Check scipy.spatial.distance documentation for a full list of valid distance metrics. If precomputation in clustering metafeatures is enabled, then this parameter has no effect.

cls_inds : np.ndarray, optional

Boolean array indicating the instances of each class. Each row represents a distinct class, and each column represents an instance. Used to take advantage of precomputations.

classes : np.ndarray, optional

Distinct classes in y. Used to exploit precomputations.

pairwise_norm_intercls_dist : np.ndarray, optional

Normalized pairwise distances between instances of different classes. Used to exploit precomputations.

Returns
float

INT index.

References

[1] SOUZA, Bruno Feres de. Meta-aprendizagem aplicada à classificação de dados de expressão gênica. 2010. Tese (Doutorado em Ciências de Computação e Matemática Computacional), Instituto de Ciências Matemáticas e de Computação, Universidade de São Paulo, São Carlos, 2010. doi:10.11606/T.55.2010.tde-04012011-142551.

[2] Bezdek, J. C.; Pal, N. R. (1998a). Some new indexes of cluster validity. IEEE Transactions on Systems, Man, and Cybernetics, Part B, v.28, n.3, p.301–315.

classmethod ft_nre(y: ndarray, class_freqs: Optional[ndarray] = None) -> float

Compute the normalized relative entropy.

An indicator of how uniformly instances are distributed among clusters.

Parameters
y : np.ndarray

Instance cluster index (or target attribute).

class_freqs : np.ndarray, optional

Absolute class frequencies. Used to exploit precomputations.

Returns
float

Entropy of relative class frequencies.

References

[1] Bruno Almeida Pimentel, André C.P.L.F. de Carvalho. A new data characterization for selecting clustering algorithms using meta-learning. Information Sciences, Volume 477, 2019, Pages 203-219.
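The return description ("entropy of relative class frequencies") can be sketched as follows. This is a minimal illustration, assuming Shannon entropy with the natural logarithm and no further normalization. The log base and any extra scaling used by pymfe are not stated above, and the function name class_entropy_sketch is hypothetical.

```python
import numpy as np


def class_entropy_sketch(y, class_freqs=None):
    """Shannon entropy (natural log, an assumption) of relative class frequencies."""
    y = np.asarray(y)
    if class_freqs is None:
        # Absolute frequencies are only counted when not precomputed.
        _, class_freqs = np.unique(y, return_counts=True)
    rel_freqs = np.asarray(class_freqs, dtype=float) / y.size
    rel_freqs = rel_freqs[rel_freqs > 0]  # avoid log(0) for empty classes
    return float(-np.sum(rel_freqs * np.log(rel_freqs)))


balanced = class_entropy_sketch([0, 0, 1, 1])
single_cluster = class_entropy_sketch([0, 0, 0, 0])
```

Under these assumptions, a perfectly balanced partition maximizes the value and a single cluster yields zero.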

classmethod ft_pb(N: ndarray, y: ndarray, dist_metric: str = 'euclidean') -> float

Compute the Pearson correlation between class matching and instance distances.

The measure ranges from -1 to +1 (both inclusive).

Parameters
N : np.ndarray

Attributes from fitted data.

y : np.ndarray

Instance cluster index (or target attribute).

dist_metric : str, optional

The distance metric used to calculate the distances between instances. Check scipy.spatial.distance for a full list of valid distance metrics.

Returns
float

Point Biserial coefficient.

References

[1] J. Lev, “The Point Biserial Coefficient of Correlation”, Ann. Math. Statist., Vol. 20, no.1, pp. 125-126, 1949.
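A rough sketch of the idea: for every pair of instances, record their distance and a binary flag for class matching, then correlate the two. The sketch assumes Euclidean distance and encodes the flag as 1 when the classes differ. That encoding is an assumption, and flipping it flips the sign of the result. The function name point_biserial_sketch is hypothetical.

```python
import itertools

import numpy as np


def point_biserial_sketch(N, y):
    """Pearson correlation between a 'classes differ' flag and pair distances."""
    N = np.asarray(N, dtype=float)
    y = np.asarray(y)
    pairs = list(itertools.combinations(range(N.shape[0]), 2))
    # Euclidean distance of every distinct pair of instances.
    dists = np.array([np.linalg.norm(N[i] - N[j]) for i, j in pairs])
    # Assumed encoding: 1.0 when the pair belongs to different classes.
    differ = np.array([float(y[i] != y[j]) for i, j in pairs])
    # The point-biserial coefficient equals Pearson's r with a binary variable.
    return float(np.corrcoef(differ, dists)[0, 1])


N = np.array([[0.0], [1.0], [10.0], [11.0]])
y = np.array([0, 0, 1, 1])
pb_value = point_biserial_sketch(N, y)
```

With this encoding, well-separated clusters give a value close to +1, because pairs from different classes are also the most distant ones.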

classmethod ft_sc(y: ndarray, size: int = 15, normalize: bool = False, class_freqs: Optional[ndarray] = None) -> int

Compute the number of clusters with size smaller than a given size.

Parameters
y : np.ndarray

Instance cluster index (or target attribute).

size : int, optional

Maximum (exclusive) size of classes to be considered.

normalize : bool, optional

If True, the result is the proportion of classes with fewer than size instances (i.e., the result is divided by the number of classes).

class_freqs : np.ndarray, optional

Class (absolute) frequencies. Used to exploit precomputations.

Returns
int or float

Number of classes with fewer than size instances if normalize is False; otherwise, the proportion of classes with fewer than size instances.

References

[1] Bruno Almeida Pimentel, André C.P.L.F. de Carvalho. A new data characterization for selecting clustering algorithms using meta-learning. Information Sciences, Volume 477, 2019, Pages 203-219.
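The behavior described by the parameters above can be sketched with NumPy. This is an illustration written from the parameter descriptions, not the pymfe source, and the function name small_clusters_sketch is hypothetical.

```python
import numpy as np


def small_clusters_sketch(y, size=15, normalize=False, class_freqs=None):
    """Count classes with fewer than `size` instances (exclusive bound)."""
    if class_freqs is None:
        _, class_freqs = np.unique(np.asarray(y), return_counts=True)
    class_freqs = np.asarray(class_freqs)
    n_small = int(np.sum(class_freqs < size))
    if normalize:
        # Proportion relative to the total number of classes.
        return n_small / class_freqs.size
    return n_small


# Three classes with 20, 5 and 14 instances, respectively.
y = np.array([0] * 20 + [1] * 5 + [2] * 14)
count = small_clusters_sketch(y)
proportion = small_clusters_sketch(y, normalize=True)
count_size_5 = small_clusters_sketch(y, size=5)
```

Because the bound is exclusive, a class with exactly size instances is not counted, which is why the last call does not count the class with 5 instances.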

classmethod ft_sil(N: ndarray, y: ndarray, dist_metric: str = 'euclidean', sample_frac: Optional[int] = None, random_state: Optional[int] = None) -> float

Compute the mean silhouette value.

Metric range is -1 to +1 (both inclusive).

Check sklearn.metrics.silhouette_score documentation for more information.

Parameters
N : np.ndarray

Attributes from fitted data.

y : np.ndarray

Instance cluster index (or target attribute).

dist_metric : str, optional

The distance metric used to calculate the distances between instances. Check sklearn.neighbors.DistanceMetric documentation for a full list of valid distance metrics.

sample_frac : int, optional

Sample fraction used to compute the silhouette coefficient. If None is given, then all data is used.

random_state : int, optional

Used if sample_frac is not None. Random seed used while sampling the data.

Returns
float

Mean Silhouette value.

References

[1] P.J. Rousseeuw, Silhouettes: a graphical aid to the interpretation and validation of cluster analysis, J. Comput. Appl. Math. 20 (1987) 53–65.
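For intuition, the silhouette of one instance compares its mean distance to its own cluster (a) with its mean distance to the nearest other cluster (b), as (b - a) / max(a, b). The sketch below averages this over all instances using Euclidean distance. It ignores sample_frac and random_state, assumes at least two clusters, and is not the pymfe implementation, which defers to sklearn.metrics.silhouette_score. The function name mean_silhouette_sketch is hypothetical.

```python
import numpy as np


def mean_silhouette_sketch(N, y):
    """Mean silhouette value with Euclidean distances and no sampling."""
    N = np.asarray(N, dtype=float)
    y = np.asarray(y)
    # Full pairwise Euclidean distance matrix.
    dists = np.linalg.norm(N[:, None, :] - N[None, :, :], axis=2)
    classes = np.unique(y)
    scores = np.zeros(N.shape[0])

    for i in range(N.shape[0]):
        own = y == y[i]
        n_own = int(own.sum())
        if n_own == 1:
            continue  # by convention, singleton clusters score 0
        # Mean distance to the other members of the same cluster.
        a = dists[i, own].sum() / (n_own - 1)
        # Smallest mean distance to the members of any other cluster.
        b = min(dists[i, y == c].mean() for c in classes if c != y[i])
        scores[i] = (b - a) / max(a, b)

    return float(scores.mean())


N = np.array([[0.0], [1.0], [10.0], [11.0]])
y = np.array([0, 0, 1, 1])
sil_value = mean_silhouette_sketch(N, y)
```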

classmethod ft_vdb(N: ndarray, y: ndarray) -> float

Compute the Davies and Bouldin Index.

Metric range is from 0 (inclusive) to infinity.

Check sklearn.metrics.davies_bouldin_score documentation for more information.

Parameters
N : np.ndarray

Attributes from fitted data.

y : np.ndarray

Instance cluster index (or target attribute).

References

[1] D.L. Davies, D.W. Bouldin, A cluster separation measure, IEEE Trans. Pattern Anal. Mach. Intell. 1 (2) (1979) 224–227.
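The sketch below shows one common formulation of this index: each cluster's scatter is the mean distance of its members to the centroid, and each cluster is scored by its worst (largest) ratio of summed scatters to centroid distance. It assumes Euclidean distance and at least two clusters, and it is not the pymfe implementation, which defers to sklearn.metrics.davies_bouldin_score. The function name davies_bouldin_sketch is hypothetical.

```python
import numpy as np


def davies_bouldin_sketch(N, y):
    """Mean over clusters of the worst-case scatter-to-separation ratio."""
    N = np.asarray(N, dtype=float)
    y = np.asarray(y)
    classes = np.unique(y)
    n_cls = classes.size
    centroids = np.array([N[y == c].mean(axis=0) for c in classes])
    # Scatter: mean distance from each member to its cluster centroid.
    scatter = np.array([
        np.linalg.norm(N[y == c] - centroids[idx], axis=1).mean()
        for idx, c in enumerate(classes)
    ])

    worst = np.zeros(n_cls)
    for i in range(n_cls):
        worst[i] = max(
            (scatter[i] + scatter[j]) / np.linalg.norm(centroids[i] - centroids[j])
            for j in range(n_cls)
            if j != i
        )
    return float(worst.mean())


N = np.array([[0.0, 0.0], [0.0, 1.0], [4.0, 0.0], [4.0, 1.0]])
y = np.array([0, 0, 1, 1])
db_value = davies_bouldin_sketch(N, y)
```

Unlike the silhouette or the Calinski-Harabasz index, lower values indicate better separated clusters here.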

classmethod ft_vdu(N: ndarray, y: ndarray, dist_metric: str = 'euclidean', cls_inds: Optional[ndarray] = None, classes: Optional[ndarray] = None, intracls_dists: Optional[ndarray] = None, pairwise_norm_intercls_dist: Optional[List[ndarray]] = None) -> float

Compute the Dunn Index.

Metric range is from 0 (inclusive) to infinity.

Parameters
N : np.ndarray

Attributes from fitted data.

y : np.ndarray

Instance cluster index (or target attribute).

dist_metric : str, optional

The distance metric used to calculate the distances between instances. Check scipy.spatial.distance documentation for a full list of valid distance metrics. If precomputation in clustering metafeatures is enabled, then this parameter has no effect.

cls_inds : np.ndarray, optional

Boolean array indicating the instances of each class. Each row represents a distinct class, and each column represents an instance. Used to take advantage of precomputations.

classes : np.ndarray, optional

Distinct classes in y. Used to exploit precomputations.

intracls_dists : np.ndarray, optional

Distance between the farthest pair of instances in the same class, for each class. Used to exploit precomputations.

pairwise_norm_intercls_dist : np.ndarray, optional

Normalized pairwise distances between instances of different classes.

Returns
float

Dunn index for given parameters.

References

[1] J.C. Dunn, Well-separated clusters and optimal fuzzy partitions, J. Cybern. 4 (1) (1974) 95–104.
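A minimal sketch of the classic Dunn index: the smallest distance between instances of different clusters divided by the largest distance between instances of the same cluster. It assumes unnormalized Euclidean distances and at least two clusters. The parameters above suggest pymfe works with normalized inter-class distances, so its values may differ. The function name dunn_sketch is hypothetical.

```python
import numpy as np


def dunn_sketch(N, y):
    """Minimum inter-cluster distance over maximum intra-cluster diameter."""
    N = np.asarray(N, dtype=float)
    y = np.asarray(y)
    dists = np.linalg.norm(N[:, None, :] - N[None, :, :], axis=2)
    # same[i, j] is True when instances i and j share a cluster.
    same = y[:, None] == y[None, :]
    max_diameter = dists[same].max()
    min_separation = dists[~same].min()
    return float(min_separation / max_diameter)


N = np.array([[0.0], [1.0], [10.0], [11.0]])
y = np.array([0, 0, 1, 1])
dunn_value = dunn_sketch(N, y)
```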

classmethod precompute_class_representatives(N: ndarray, y: Optional[ndarray] = None, representative: str = 'mean', classes: Optional[ndarray] = None, **kwargs) -> Dict[str, Any]

Precomputations related to cluster representative instances.

Parameters
N : np.ndarray

Numerical fitted data.

y : np.ndarray, optional

Instance cluster index (or target attribute).

dist_metric : str, optional

The distance metric used to calculate the distances between instances. Check sklearn.neighbors.DistanceMetric documentation for a full list of valid distance metrics.

representative : str or np.ndarray or List, optional

  • If representative is a string, it must be either median or mean, and the selected method is used to estimate the representative instance of each class (e.g., if mean is selected, then the mean of the attributes of all instances of the same class is used to represent that class).

  • If representative is a List or an np.ndarray, its length must be the number of distinct classes in y, and each of its elements must be a representative instance for one class. For example, the following 2-D array holds the representatives of the Iris dataset, calculated using the mean value of the instances of each class (effectively the same result as passing the string mean):

    [[ 5.006 3.428 1.462 0.246]   # 'Setosa' mean values
     [ 5.936 2.77  4.26  1.326]   # 'Versicolor' mean values
     [ 6.588 2.974 5.552 2.026]]  # 'Virginica' mean values

    The attribute order must, of course, be the same as in the original instances of the dataset.

classes : np.ndarray, optional

Distinct classes in y. Used to exploit precomputations.

**kwargs

Additional arguments. May contain values already precomputed by other precomputation methods, which can help speed up this precomputation.

Returns
dict
The following precomputed items are returned:
  • pairwise_intracls_dists (np.ndarray): distance between each distinct pair of instances of the same class.
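The string form of the representative argument can be sketched as a per-class mean or median. This is an illustration based on the parameter description only. The function name class_representatives_sketch is hypothetical, and how pymfe stores the result is not shown here.

```python
import numpy as np


def class_representatives_sketch(N, y, representative="mean"):
    """One representative row per class, using the per-class mean or median."""
    N = np.asarray(N, dtype=float)
    y = np.asarray(y)
    center = {"mean": np.mean, "median": np.median}[representative]
    classes = np.unique(y)
    # Rows follow the sorted order of the distinct classes in y.
    return np.array([center(N[y == c], axis=0) for c in classes])


N = np.array([[1, 2], [3, 4], [10, 20], [30, 40], [50, 90]])
y = np.array([0, 0, 1, 1, 1])
mean_reps = class_representatives_sketch(N, y, "mean")
median_reps = class_representatives_sketch(N, y, "median")
```

The resulting array has one row per class and one column per attribute, matching the layout of the Iris example above.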

classmethod precompute_clustering_class(y: Optional[ndarray] = None, **kwargs) -> Dict[str, Any]

Precompute distinct classes and their frequencies from y.

Parameters
y : np.ndarray, optional

Instance cluster index (or target attribute).

**kwargs

Additional arguments. May contain values already precomputed by other precomputation methods, which can help speed up this precomputation.

Returns
dict
The following precomputed items are returned:
  • classes (np.ndarray): distinct classes of y, if y is not None.

  • class_freqs (np.ndarray): class frequencies of y, if y is not None.

  • cls_inds (np.ndarray): Boolean array indicating whether each instance belongs to each class. Each row represents a distinct class, and each column represents an instance.
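The three returned items can be reproduced with a few NumPy calls. The sketch below follows the descriptions above and is not the pymfe source; the function name clustering_class_sketch is hypothetical.

```python
import numpy as np


def clustering_class_sketch(y):
    """Build classes, class_freqs and cls_inds from the label vector y."""
    y = np.asarray(y)
    classes, class_freqs = np.unique(y, return_counts=True)
    # Broadcasting gives one row per class and one column per instance.
    cls_inds = classes[:, None] == y[None, :]
    return {"classes": classes, "class_freqs": class_freqs, "cls_inds": cls_inds}


precomputed = clustering_class_sketch([2, 0, 2, 1])
```

Selecting the instances of the k-th class is then a matter of indexing the data with precomputed["cls_inds"][k].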

classmethod precompute_group_distances(N: ndarray, y: Optional[ndarray] = None, dist_metric: str = 'euclidean', classes: Optional[ndarray] = None, **kwargs) -> Dict[str, Any]

Precompute distance metrics between instances.

Parameters
N : np.ndarray

Numerical fitted data.

y : np.ndarray, optional

Instance cluster index (or target attribute).

dist_metric : str, optional

The distance metric used to calculate the distances between instances. Check sklearn.neighbors.DistanceMetric documentation for a full list of valid distance metrics.

classes : np.ndarray, optional

Distinct classes in y. Used to exploit precomputations.

**kwargs

Additional arguments. May contain values already precomputed by other precomputation methods, which can help speed up this precomputation.

Returns
dict
The following precomputed items are returned:
  • pairwise_norm_intercls_dist (np.ndarray): normalized distance between each distinct pair of instances of different classes.

  • pairwise_intracls_dists (np.ndarray): distance between each distinct pair of instances of the same class.

  • intracls_dists (np.ndarray): the distance between the farthest pair of instances of the same class.

The following precomputed items are required and are also returned if they have not been precomputed yet:
  • classes (np.ndarray): distinct classes of y, if y is not None.

  • class_freqs (np.ndarray): class frequencies of y, if y is not None.

  • cls_inds (np.ndarray): Boolean array indicating whether each instance belongs to each class. Each row represents a distinct class, and each column represents an instance.

classmethod precompute_nearest_neighbors(N: ndarray, y: Optional[ndarray] = None, n_neighbors: Optional[int] = None, dist_metric: str = 'euclidean', **kwargs) -> Dict[str, Any]

Precompute the n_neighbors Nearest Neighbors of every instance.

Parameters
N : np.ndarray

Numerical fitted data.

y : np.ndarray, optional

Instance cluster index (or target attribute).

n_neighbors : int, optional

Number of nearest neighbors returned for each instance.

dist_metric : str, optional

The distance metric used to calculate the distances between instances. Check sklearn.neighbors.DistanceMetric documentation for a full list of valid distance metrics.

**kwargs

Additional arguments. May contain values already precomputed by other precomputation methods, which can help speed up this precomputation.

Returns
dict
The following precomputed items are returned:
  • pairwise_intracls_dists (np.ndarray): distance between each distinct pair of instances of the same class.
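The neighbor lookup named in this method's summary can be sketched with a brute-force distance matrix. The sketch assumes Euclidean distance, excludes each instance from its own neighbor list, and returns neighbor indices. The output format, the dictionary key, and the default used by pymfe when n_neighbors is None are not stated above, and the function name nearest_neighbors_sketch is hypothetical.

```python
import numpy as np


def nearest_neighbors_sketch(N, n_neighbors):
    """Indices of the n_neighbors closest instances for every instance."""
    N = np.asarray(N, dtype=float)
    dists = np.linalg.norm(N[:, None, :] - N[None, :, :], axis=2)
    # Assumption: an instance is never counted as its own neighbor.
    np.fill_diagonal(dists, np.inf)
    # A stable sort keeps the lower index first when distances tie.
    return np.argsort(dists, axis=1, kind="stable")[:, :n_neighbors]


N = np.array([[0.0], [1.0], [3.0], [7.0]])
neighbors = nearest_neighbors_sketch(N, n_neighbors=2)
```

Row i of the result lists the neighbors of instance i, ordered from closest to farthest.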