pymfe.clustering.MFEClustering
- class pymfe.clustering.MFEClustering
Keep methods for metafeatures of the Clustering group.

The convention adopted for metafeature extraction related methods is to always start with the ft_ prefix to allow automatic method detection. This prefix is predefined within the _internal module.

All method signatures follow the conventions and restrictions listed below:

- For independent attribute data, X means every type of attribute, N means Numeric attributes only, and C stands for Categorical attributes only. Note that the categorical attribute sets between X and C and the numerical attribute sets between X and N may differ due to data transformations performed while fitting data into the MFE model, enabled by, respectively, the transform_num and transform_cat arguments of fit (MFE method).
- Only arguments in the MFE _custom_args_ft attribute (set up inside the fit method) are allowed to be required method arguments. All other arguments must be strictly optional (i.e., have a predefined default value).
- The initial assumption is that the user can change any optional argument, without any previous verification of the argument value or its type, via the kwargs argument of the extract method of the MFE class.
- The return value of all feature extraction methods should be a single value or a generic List (preferably an np.ndarray) type with numeric values.

There is another type of method adopted for automatic detection, marked by the precompute_ prefix. These methods run automatically while fitting data into an MFE model, and their objective is to precompute common values shared between more than one feature extraction method. This strategy trades higher memory consumption for faster feature extraction. Their return value must always be a dictionary whose keys are possible extra arguments for both feature extraction methods and other precomputation methods. Note that precomputed values are shared between all valid feature-extraction modules (e.g., class_freqs computed in module statistical can freely be used by any precomputation or feature extraction method of module landmarking).

- __init__(*args, **kwargs)
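The prefix conventions described above can be illustrated with a minimal sketch. ToyGroup, methods_with_prefix, and the way precomputed values are forwarded are all hypothetical and written only for illustration; they are not pymfe's internal implementation.

```python
import inspect
from typing import Any, Callable, Dict, List


class ToyGroup:
    """Hypothetical metafeature group; not part of pymfe."""

    @classmethod
    def precompute_class_freqs(cls, y: List[int], **kwargs) -> Dict[str, Any]:
        # Precompute a value that several feature extraction methods can share.
        return {"class_freqs": [y.count(c) for c in sorted(set(y))]}

    @classmethod
    def ft_n_classes(cls, y: List[int], class_freqs: Any = None) -> int:
        # class_freqs is optional: it is recomputed when no precomputation is given.
        if class_freqs is None:
            class_freqs = [y.count(c) for c in sorted(set(y))]
        return len(class_freqs)


def methods_with_prefix(group: type, prefix: str) -> Dict[str, Callable]:
    """Collect the bound methods of `group` whose names start with `prefix`."""
    return {
        name: member
        for name, member in inspect.getmembers(group, predicate=inspect.ismethod)
        if name.startswith(prefix)
    }


y = [0, 0, 1, 2, 2, 2]

# Run every precompute_ method and merge the returned dictionaries.
precomputed: Dict[str, Any] = {}
for method in methods_with_prefix(ToyGroup, "precompute_").values():
    precomputed.update(method(y=y))

# Forward to each ft_ method only the precomputed keys that it accepts.
results = {}
for name, method in methods_with_prefix(ToyGroup, "ft_").items():
    accepted = inspect.signature(method).parameters
    extra = {key: value for key, value in precomputed.items() if key in accepted}
    results[name] = method(y=y, **extra)
```

Because the dictionary keys returned by a precomputation method match optional argument names, a feature extraction method works both with and without the precomputed values.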
Methods
__init__(*args, **kwargs)
ft_ch(N, y) Compute the Calinski and Harabasz index.
ft_int(N, y[, dist_metric, cls_inds, ...]) Compute the INT index.
ft_nre(y[, class_freqs]) Compute the normalized relative entropy.
ft_pb(N, y[, dist_metric]) Compute the Pearson correlation between class matching and instance distances.
ft_sc(y[, size, normalize, class_freqs]) Compute the number of clusters with size smaller than a given size.
ft_sil(N, y[, dist_metric, sample_frac, ...]) Compute the mean silhouette value.
ft_vdb(N, y) Compute the Davies and Bouldin Index.
ft_vdu(N, y[, dist_metric, cls_inds, ...]) Compute the Dunn Index.
precompute_class_representatives(N[, y, ...]) Precomputations related to cluster representative instances.
precompute_clustering_class([y]) Precompute distinct classes and their frequencies from y.
precompute_group_distances(N[, y, ...]) Precompute distance metrics between instances.
precompute_nearest_neighbors(N[, y, ...]) Precompute the n_neighbors Nearest Neighbors of every instance.
- classmethod ft_ch(N: ndarray, y: ndarray) -> float
Compute the Calinski and Harabasz index.
Check the sklearn.metrics.calinski_harabasz_score documentation for more information.
- Parameters
- N : np.ndarray
Attributes from fitted data.
- y : np.ndarray
Instance cluster index (or target attribute).
- Returns
- float
Calinski-Harabasz index.
References
- 1
T. Calinski, J. Harabasz, A dendrite method for cluster analysis, Commun. Stat. Theory Methods 3 (1) (1974) 1–27.
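As a rough illustration of what this index measures, the sketch below implements the textbook Calinski-Harabasz ratio (between-cluster dispersion over within-cluster dispersion, each divided by its degrees of freedom) in plain Python. The function name calinski_harabasz is an assumption made for this example; it is not the pymfe or scikit-learn implementation, and it assumes at least two clusters and fewer clusters than instances.

```python
from typing import Sequence


def calinski_harabasz(N: Sequence[Sequence[float]], y: Sequence[int]) -> float:
    """Textbook Calinski-Harabasz index for numeric rows N and labels y."""
    n, classes = len(N), sorted(set(y))
    k, dim = len(classes), len(N[0])
    overall = [sum(row[d] for row in N) / n for d in range(dim)]

    between = 0.0  # dispersion of the cluster centroids around the overall mean
    within = 0.0   # dispersion of the instances around their own centroid
    for c in classes:
        members = [row for row, label in zip(N, y) if label == c]
        centroid = [sum(row[d] for row in members) / len(members) for d in range(dim)]
        between += len(members) * sum(
            (centroid[d] - overall[d]) ** 2 for d in range(dim)
        )
        within += sum(
            (row[d] - centroid[d]) ** 2 for row in members for d in range(dim)
        )

    return (between / (k - 1)) / (within / (n - k))


# Two tight, well-separated 1-D clusters.
score = calinski_harabasz([[0.0], [2.0], [10.0], [12.0]], [0, 0, 1, 1])
```

Higher values indicate clusters that are compact relative to the distance between their centroids.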
- classmethod ft_int(N: ndarray, y: ndarray, dist_metric: str = 'euclidean', cls_inds: Optional[ndarray] = None, classes: Optional[ndarray] = None, pairwise_norm_intercls_dist: Optional[List[ndarray]] = None) -> float
Compute the INT index.
The metric ranges from 0 (inclusive) to infinity.
- Parameters
- N : np.ndarray
Attributes from fitted data.
- y : np.ndarray
Instance cluster index (or target attribute).
- dist_metric : str, optional
The distance metric used to calculate the distances between instances. Check the scipy.spatial.distance documentation for a full list of valid distance metrics. If precomputation in clustering metafeatures is enabled, then this parameter has no effect.
- cls_inds : np.ndarray, optional
Boolean array which indicates the examples of each class. The rows represent the distinct classes, and the columns represent the instances. Used to take advantage of precomputations.
- classes : np.ndarray, optional
Distinct classes in y. Used to exploit precomputations.
- pairwise_norm_intercls_dist : np.ndarray, optional
Normalized pairwise distances between instances of different classes. Used to exploit precomputations.
- Returns
- float
INT index.
References
- 1
SOUZA, Bruno Feres de. Meta-aprendizagem aplicada à classificação de dados de expressão gênica. 2010. Tese (Doutorado em Ciências de Computação e Matemática Computacional), Instituto de Ciências Matemáticas e de Computação, Universidade de São Paulo, São Carlos, 2010. doi:10.11606/T.55.2010.tde-04012011-142551.
- 2
Bezdek, J. C.; Pal, N. R. (1998a). Some new indexes of cluster validity. IEEE Transactions on Systems, Man, and Cybernetics, Part B, v.28, n.3, p.301–315.
- classmethod ft_nre(y: ndarray, class_freqs: Optional[ndarray] = None) -> float
Compute the normalized relative entropy.
An indicator of how uniformly instances are distributed among clusters.
- Parameters
- y : np.ndarray
Instance cluster index (or target attribute).
- class_freqs : np.ndarray, optional
Absolute class frequencies. Used to exploit precomputations.
- Returns
- float
Entropy of relative class frequencies.
References
- 1
Bruno Almeida Pimentel, André C.P.L.F. de Carvalho. A new data characterization for selecting clustering algorithms using meta-learning. Information Sciences, Volume 477, 2019, Pages 203-219.
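Following the Returns description above (entropy of relative class frequencies), a minimal sketch could look like the code below. The function name relative_class_entropy and the use of the natural logarithm are assumptions; any additional normalization that pymfe applies is not reproduced here.

```python
import math
from collections import Counter
from typing import Sequence


def relative_class_entropy(y: Sequence[int]) -> float:
    """Shannon entropy (natural log) of the relative class frequencies of y."""
    total = len(y)
    rel_freqs = [count / total for count in Counter(y).values()]
    return -sum(p * math.log(p) for p in rel_freqs)


balanced = relative_class_entropy([0, 0, 1, 1])    # two equally sized clusters
skewed = relative_class_entropy([0, 0, 0, 0, 0, 1])  # one dominant cluster
```

Under these assumptions the value is largest when all clusters have the same size and drops toward zero as one cluster dominates.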
- classmethod ft_pb(N: ndarray, y: ndarray, dist_metric: str = 'euclidean') -> float
Compute the Pearson correlation between class matching and instance distances.
The measure ranges from -1 to +1 (both inclusive).
- Parameters
- N : np.ndarray
Attributes from fitted data.
- y : np.ndarray
Instance cluster index (or target attribute).
- dist_metric : str, optional
The distance metric used to calculate the distances between instances. Check the scipy.spatial.distance documentation for a full list of valid distance metrics.
- Returns
- float
Point Biserial coefficient.
References
- 1
J. Lev, “The Point Biserial Coefficient of Correlation”, Ann. Math. Statist., Vol. 20, no.1, pp. 125-126, 1949.
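The idea of correlating class matching with instance distances can be sketched as follows. For every pair of instances, a binary indicator is paired with the Euclidean distance between them, and the Pearson correlation of the two sequences is returned. Encoding the indicator as 1 when the two instances belong to different classes is an assumption made here (the opposite encoding only flips the sign), and point_biserial and pearson are hypothetical helper names, not pymfe functions.

```python
import math
from itertools import combinations
from typing import List, Sequence


def pearson(a: List[float], b: List[float]) -> float:
    """Plain Pearson correlation coefficient of two equally long sequences."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (z - mean_b) for x, z in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((z - mean_b) ** 2 for z in b)
    return cov / math.sqrt(var_a * var_b)


def point_biserial(N: Sequence[Sequence[float]], y: Sequence[int]) -> float:
    """Correlate 'pair is in different classes' with the pairwise distance."""
    pairs = list(combinations(range(len(y)), 2))
    dists = [math.dist(N[i], N[j]) for i, j in pairs]
    differ = [float(y[i] != y[j]) for i, j in pairs]  # assumed encoding
    return pearson(differ, dists)


points = [(0.0, 0.0), (0.0, 1.0), (3.0, 0.0), (3.0, 1.0)]
separated = point_biserial(points, [0, 0, 1, 1])  # labels follow the geometry
mixed = point_biserial(points, [0, 1, 0, 1])      # labels cut across the geometry
```

With this encoding, a value close to +1 means that instances in different clusters are consistently farther apart than instances in the same cluster.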
- classmethod ft_sc(y: ndarray, size: int = 15, normalize: bool = False, class_freqs: Optional[ndarray] = None) -> int
Compute the number of clusters with size smaller than a given size.
- Parameters
- y : np.ndarray
Instance cluster index (or target attribute).
- size : int, optional
Maximum (exclusive) size of classes to be considered.
- normalize : bool, optional
If True, then the result will be the proportion of classes with fewer than size instances out of the total number of classes (i.e., the result is divided by the number of classes).
- class_freqs : np.ndarray, optional
Class (absolute) frequencies. Used to exploit precomputations.
- Returns
- int or float
Number of classes with fewer than size instances if normalize is False, proportion of classes with fewer than size instances otherwise.
References
- 1
Bruno Almeida Pimentel, André C.P.L.F. de Carvalho. A new data characterization for selecting clustering algorithms using meta-learning. Information Sciences, Volume 477, 2019, Pages 203-219.
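The behaviour described by the size and normalize parameters can be sketched in a few lines of plain Python. The function name small_cluster_count is an assumption for this example; only the counting rule stated above is reproduced.

```python
from collections import Counter
from typing import Sequence, Union


def small_cluster_count(
    y: Sequence[int], size: int = 15, normalize: bool = False
) -> Union[int, float]:
    """Count the classes of y with strictly fewer than `size` instances."""
    class_freqs = list(Counter(y).values())
    n_small = sum(1 for freq in class_freqs if freq < size)
    # With normalize=True the count becomes a proportion of all classes.
    return n_small / len(class_freqs) if normalize else n_small


labels = [0] * 20 + [1] * 3 + [2] * 14
count = small_cluster_count(labels)
proportion = small_cluster_count(labels, normalize=True)
```

Note that the bound is exclusive: a class with exactly size instances is not counted.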
- classmethod ft_sil(N: ndarray, y: ndarray, dist_metric: str = 'euclidean', sample_frac: Optional[int] = None, random_state: Optional[int] = None) -> float
Compute the mean silhouette value.
The metric ranges from -1 to +1 (both inclusive).
Check the sklearn.metrics.silhouette_score documentation for more information.
- Parameters
- N : np.ndarray
Attributes from fitted data.
- y : np.ndarray
Instance cluster index (or target attribute).
- dist_metric : str, optional
The distance metric used to calculate the distances between instances. Check the sklearn.neighbors.DistanceMetric documentation for a full list of valid distance metrics.
- sample_frac : int, optional
Sample fraction used to compute the silhouette coefficient. If None is given, then all data is used.
- random_state : int, optional
Used if sample_frac is not None. Random seed used while sampling the data.
- Returns
- float
Mean Silhouette value.
References
- 1
P.J. Rousseeuw, Silhouettes: a graphical aid to the interpretation and validation of cluster analysis, J. Comput. Appl. Math. 20 (1987) 53–65.
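For intuition, the mean silhouette value can be sketched directly from Rousseeuw's definition: for each instance, a is the mean distance to the other members of its own cluster, b is the smallest mean distance to the members of any other cluster, and the silhouette is (b - a) / max(a, b). The code below is a plain-Python illustration with Euclidean distance and no sampling; mean_silhouette is a hypothetical name, and the sketch assumes at least two clusters and no cluster made up entirely of duplicated points.

```python
import math
from typing import Sequence


def mean_silhouette(N: Sequence[Sequence[float]], y: Sequence[int]) -> float:
    """Mean silhouette value over all instances, using Euclidean distance."""
    classes = sorted(set(y))
    scores = []
    for i, (point, label) in enumerate(zip(N, y)):
        # Mean distance from this instance to the members of every cluster,
        # leaving the instance itself out of its own cluster.
        mean_dist = {}
        for c in classes:
            dists = [
                math.dist(point, other)
                for j, (other, other_label) in enumerate(zip(N, y))
                if other_label == c and j != i
            ]
            mean_dist[c] = sum(dists) / len(dists) if dists else None

        a = mean_dist[label]
        if a is None:
            # Singleton cluster: the silhouette is conventionally set to 0.
            scores.append(0.0)
            continue
        b = min(dist for c, dist in mean_dist.items() if c != label)
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


points = [(0.0, 0.0), (0.0, 1.0), (3.0, 0.0), (3.0, 1.0)]
good = mean_silhouette(points, [0, 0, 1, 1])
bad = mean_silhouette(points, [0, 1, 0, 1])
```

Values near +1 indicate instances that sit much closer to their own cluster than to the nearest other cluster.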
- classmethod ft_vdb(N: ndarray, y: ndarray) -> float
Compute the Davies and Bouldin Index.
The metric ranges from 0 (inclusive) to infinity.
Check the sklearn.metrics.davies_bouldin_score documentation for more information.
- Parameters
- N : np.ndarray
Attributes from fitted data.
- y : np.ndarray
Instance cluster index (or target attribute).
References
- 1
D.L. Davies, D.W. Bouldin, A cluster separation measure, IEEE Trans. Pattern Anal. Mach. Intell. 1 (2) (1979) 224–227.
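The Davies-Bouldin index averages, over all clusters, the worst-case ratio between the combined scatter of two clusters and the distance between their centroids. The sketch below follows that textbook definition with Euclidean distance; davies_bouldin is a hypothetical name used for illustration, not the pymfe or scikit-learn implementation, and it assumes at least two clusters with distinct centroids.

```python
import math
from typing import Sequence


def davies_bouldin(N: Sequence[Sequence[float]], y: Sequence[int]) -> float:
    """Textbook Davies-Bouldin index for numeric rows N and labels y."""
    classes = sorted(set(y))
    centroids, scatter = {}, {}
    for c in classes:
        members = [row for row, label in zip(N, y) if label == c]
        centroid = [sum(column) / len(members) for column in zip(*members)]
        centroids[c] = centroid
        # Scatter: mean distance of the cluster members to their centroid.
        scatter[c] = sum(math.dist(row, centroid) for row in members) / len(members)

    # For each cluster, keep its most similar (worst separated) other cluster.
    worst_ratios = [
        max(
            (scatter[a] + scatter[b]) / math.dist(centroids[a], centroids[b])
            for b in classes
            if b != a
        )
        for a in classes
    ]
    return sum(worst_ratios) / len(worst_ratios)


points = [(0.0, 0.0), (0.0, 1.0), (3.0, 0.0), (3.0, 1.0)]
index = davies_bouldin(points, [0, 0, 1, 1])
```

Unlike the silhouette or Calinski-Harabasz values, lower is better here: compact clusters with distant centroids push the index toward 0.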
- classmethod ft_vdu(N: ndarray, y: ndarray, dist_metric: str = 'euclidean', cls_inds: Optional[ndarray] = None, classes: Optional[ndarray] = None, intracls_dists: Optional[ndarray] = None, pairwise_norm_intercls_dist: Optional[List[ndarray]] = None) -> float
Compute the Dunn Index.
The metric ranges from 0 (inclusive) to infinity.
- Parameters
- N : np.ndarray
Attributes from fitted data.
- y : np.ndarray
Instance cluster index (or target attribute).
- dist_metric : str, optional
The distance metric used to calculate the distances between instances. Check the scipy.spatial.distance documentation for a full list of valid distance metrics. If precomputation in clustering metafeatures is enabled, then this parameter has no effect.
- cls_inds : np.ndarray, optional
Boolean array which indicates the examples of each class. The rows represent the distinct classes, and the columns represent the instances. Used to take advantage of precomputations.
- classes : np.ndarray, optional
Distinct classes in y. Used to exploit precomputations.
- intracls_dists : np.ndarray, optional
Distance between the farthest pair of instances in the same class, for each class. Used to exploit precomputations.
- pairwise_norm_intercls_dist : np.ndarray, optional
Normalized pairwise distances between instances of different classes.
- Returns
- float
Dunn index for given parameters.
References
- 1
J.C. Dunn, Well-separated clusters and optimal fuzzy partitions, J. Cybern. 4 (1) (1974) 95–104.
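A minimal sketch of the classical Dunn index is given below: the smallest distance between two instances of different clusters divided by the largest distance between two instances of the same cluster. This is the textbook form with raw Euclidean distances; the parameter list above suggests that pymfe works with normalized inter-class distances, which this hypothetical dunn_index function does not reproduce. It assumes at least two clusters and at least one cluster with two distinct instances.

```python
import math
from itertools import combinations
from typing import Sequence


def dunn_index(N: Sequence[Sequence[float]], y: Sequence[int]) -> float:
    """Classical Dunn index: min inter-cluster distance / max cluster diameter."""
    pairs = list(combinations(range(len(y)), 2))
    inter = [math.dist(N[i], N[j]) for i, j in pairs if y[i] != y[j]]
    intra = [math.dist(N[i], N[j]) for i, j in pairs if y[i] == y[j]]
    return min(inter) / max(intra)


points = [(0.0, 0.0), (0.0, 1.0), (3.0, 0.0), (3.0, 1.0)]
compact = dunn_index(points, [0, 0, 1, 1])
overlapping = dunn_index(points, [0, 1, 0, 1])
```

Larger values indicate clusters that are far apart relative to their diameters.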
- classmethod precompute_class_representatives(N: ndarray, y: Optional[ndarray] = None, representative: str = 'mean', classes: Optional[ndarray] = None, **kwargs) -> Dict[str, Any]
Precomputations related to cluster representative instances.
- Parameters
- N : np.ndarray
Numerical fitted data.
- y : np.ndarray, optional
Instance cluster index (or target attribute).
- dist_metric : str, optional
The distance metric used to calculate the distances between instances. Check the sklearn.neighbors.DistanceMetric documentation for a full list of valid distance metrics.
- representative : str or np.ndarray or List, optional
  - If representative is a string, then it must be either median or mean, and the selected method is used to estimate the representative instance of each class (e.g., if mean is selected, then the mean of the attributes of all instances of the same class is used to represent that class).
  - If representative is a List or an np.ndarray, then its length must be the number of distinct classes in y, and each of its elements must be the representative instance of one class. For example, the following 2-D array holds the representatives of the Iris dataset, calculated using the mean value of instances of the same class (effectively the same result as passing the string mean):
    [[ 5.006  3.428  1.462  0.246]   # 'Setosa' mean values
     [ 5.936  2.77   4.26   1.326]   # 'Versicolor' mean values
     [ 6.588  2.974  5.552  2.026]]  # 'Virginica' mean values
    The attribute order must be, of course, the same as the original instances in the dataset.
- classes : np.ndarray, optional
Distinct classes in y. Used to exploit precomputations.
- **kwargs
Additional arguments. They may have been precomputed earlier by other precomputation methods, so they can help speed up this precomputation.
- Returns
- dict
The following precomputed items are returned:
- pairwise_intracls_dists (np.ndarray): distance between each distinct pair of instances of the same class.
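The string form of the representative argument described above (mean or median of the attributes of each class) can be sketched in plain Python as follows. The function name class_representatives is an assumption for this example and the code is not the pymfe implementation; representatives are returned in the sorted order of the class labels.

```python
from statistics import mean, median
from typing import List, Sequence


def class_representatives(
    N: Sequence[Sequence[float]], y: Sequence[int], representative: str = "mean"
) -> List[List[float]]:
    """Return one representative instance per class, attribute by attribute."""
    reducer = {"mean": mean, "median": median}[representative]
    return [
        # zip(*rows) iterates over attributes (columns) of the class members.
        [reducer(column) for column in zip(*(row for row, label in zip(N, y) if label == c))]
        for c in sorted(set(y))
    ]


data = [[1.0, 2.0], [3.0, 4.0], [8.0, 0.0], [10.0, 20.0]]
labels = [0, 0, 0, 1]
by_mean = class_representatives(data, labels, "mean")
by_median = class_representatives(data, labels, "median")
```

Each representative keeps the attribute order of the original instances, which matches the requirement stated for user-provided representatives.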
- classmethod precompute_clustering_class(y: Optional[ndarray] = None, **kwargs) -> Dict[str, Any]
Precompute distinct classes and their frequencies from y.
- Parameters
- y : np.ndarray, optional
Instance cluster index (or target attribute).
- **kwargs
Additional arguments. They may have been precomputed earlier by other precomputation methods, so they can help speed up this precomputation.
- Returns
- dict
The following precomputed items are returned:
- classes (np.ndarray): distinct classes of y, if y is not None.
- class_freqs (np.ndarray): class frequencies of y, if y is not None.
- cls_inds (np.ndarray): Boolean array which indicates whether each example belongs to each class. The rows represent the distinct classes, and the instances are represented by the columns.
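The three items listed above can be produced with a couple of NumPy calls. The sketch below is only an illustration of the described shapes (it is not taken from the pymfe source) and assumes y is a one-dimensional array of labels.

```python
import numpy as np

y = np.array([2, 0, 2, 1, 2, 0])

# Distinct classes (sorted) and their absolute frequencies.
classes, class_freqs = np.unique(y, return_counts=True)

# Boolean membership matrix: one row per class, one column per instance.
cls_inds = classes[:, np.newaxis] == y

precomputed = {
    "classes": classes,
    "class_freqs": class_freqs,
    "cls_inds": cls_inds,
}
```

Summing each row of cls_inds recovers class_freqs, and indexing the data with a row selects the instances of that class.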
- classmethod precompute_group_distances(N: ndarray, y: Optional[ndarray] = None, dist_metric: str = 'euclidean', classes: Optional[ndarray] = None, **kwargs) -> Dict[str, Any]
Precompute distance metrics between instances.
- Parameters
- N : np.ndarray
Numerical fitted data.
- y : np.ndarray, optional
Instance cluster index (or target attribute).
- dist_metric : str, optional
The distance metric used to calculate the distances between instances. Check the sklearn.neighbors.DistanceMetric documentation for a full list of valid distance metrics.
- classes : np.ndarray, optional
Distinct classes in y. Used to exploit precomputations.
- **kwargs
Additional arguments. They may have been precomputed earlier by other precomputation methods, so they can help speed up this precomputation.
- Returns
- dict
The following precomputed items are returned:
- pairwise_norm_intercls_dist (np.ndarray): normalized distance between each distinct pair of instances of different classes.
- pairwise_intracls_dists (np.ndarray): distance between each distinct pair of instances of the same class.
- intracls_dists (np.ndarray): the distance between the farthest pair of instances of the same class.
The following precomputed items are necessary and are also returned, if not previously precomputed:
- classes (np.ndarray): distinct classes of y, if y is not None.
- class_freqs (np.ndarray): class frequencies of y, if y is not None.
- cls_inds (np.ndarray): Boolean array which indicates whether each example belongs to each class. The rows represent the distinct classes, and the instances are represented by the columns.
- classmethod precompute_nearest_neighbors(N: ndarray, y: Optional[ndarray] = None, n_neighbors: Optional[int] = None, dist_metric: str = 'euclidean', **kwargs) -> Dict[str, Any]
Precompute the n_neighbors Nearest Neighbors of every instance.
- Parameters
- N : np.ndarray
Numerical fitted data.
- y : np.ndarray, optional
Instance cluster index (or target attribute).
- n_neighbors : int, optional
Number of nearest neighbors returned for each instance.
- dist_metric : str, optional
The distance metric used to calculate the distances between instances. Check the sklearn.neighbors.DistanceMetric documentation for a full list of valid distance metrics.
- **kwargs
Additional arguments. They may have been precomputed earlier by other precomputation methods, so they can help speed up this precomputation.
- Returns
- dict
The following precomputed items are returned:
- pairwise_intracls_dists (np.ndarray): distance between each distinct pair of instances of the same class.