pymfe.clustering.MFEClustering
- class pymfe.clustering.MFEClustering
Keep methods for metafeatures of the Clustering group.
The convention adopted for metafeature extraction related methods is to always start with the ft_ prefix to allow automatic method detection. This prefix is predefined within the _internal module.
All method signatures follow the conventions and restrictions listed below:
- For independent attribute data, X means every type of attribute, N means Numeric attributes only, and C stands for Categorical attributes only. It is important to note that the categorical attribute sets between X and C and the numerical attribute sets between X and N may differ due to data transformations performed while fitting data into the MFE model, enabled by, respectively, the transform_num and transform_cat arguments of fit (MFE method).
- Only arguments in the MFE _custom_args_ft attribute (set up inside the fit method) are allowed to be required method arguments. All other arguments must be strictly optional (i.e., have a predefined default value).
- The initial assumption is that the user can change any optional argument, without any previous verification of the argument value or its type, via the kwargs argument of the extract method of the MFE class.
- The return value of all feature extraction methods should be a single value or a generic List (preferably an np.ndarray) with numeric values.
There is another type of method adopted for automatic detection: methods with the precompute_ prefix. These methods run automatically while fitting data into an MFE model, and their objective is to precompute values shared by more than one feature extraction method. This strategy trades higher memory consumption for faster feature extraction. Their return value must always be a dictionary whose keys are possible extra arguments for both feature extraction methods and other precomputation methods. Note that precomputed values are shared between all valid feature-extraction modules (e.g., class_freqs computed in module statistical can freely be used by any precomputation or feature extraction method of module landmarking).
- __init__(*args, **kwargs)
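The prefix conventions described above can be illustrated with a small, self-contained sketch. It is not the pymfe detection code: `ToyGroup`, `methods_with_prefix`, and the `class_counts` key are hypothetical names introduced only for this example, and the real logic lives in the `_internal` module. The sketch assumes that detection amounts to filtering method names by prefix and that precomputed values are forwarded only to methods whose signature accepts them.

```python
import inspect
from typing import Callable, Dict, List, Optional


class ToyGroup:
    """Hypothetical metafeature group, used only for this illustration."""

    @classmethod
    def precompute_counts(cls, y: List[int], **kwargs) -> Dict[str, Dict[int, int]]:
        # The returned keys are valid extra arguments of the ft_* methods.
        counts: Dict[int, int] = {}
        for label in y:
            counts[label] = counts.get(label, 0) + 1
        return {"class_counts": counts}

    @classmethod
    def ft_num_classes(
        cls, y: List[int], class_counts: Optional[Dict[int, int]] = None
    ) -> int:
        # The optional argument has a default, so the method also works
        # when no precomputation was run.
        if class_counts is None:
            class_counts = cls.precompute_counts(y)["class_counts"]
        return len(class_counts)

    @classmethod
    def helper(cls) -> None:
        """Not detected: the name has neither prefix."""


def methods_with_prefix(group: type, prefix: str) -> Dict[str, Callable]:
    """Collect the methods of ``group`` whose names start with ``prefix``."""
    return {
        name: member
        for name, member in inspect.getmembers(group, predicate=inspect.ismethod)
        if name.startswith(prefix)
    }


y = [0, 0, 1, 2, 2, 2]

# "Fit" step: run every precompute_* method and merge the dictionaries.
precomputed: Dict[str, object] = {}
for method in methods_with_prefix(ToyGroup, "precompute_").values():
    precomputed.update(method(y=y))

# "Extract" step: run every ft_* method, passing only accepted arguments.
results = {}
for name, method in methods_with_prefix(ToyGroup, "ft_").items():
    accepted = inspect.signature(method).parameters
    extra = {key: val for key, val in precomputed.items() if key in accepted}
    results[name[len("ft_"):]] = method(y=y, **extra)

print(results)
```

Keeping every precomputed argument optional is what lets a feature extraction method run both with and without the precomputation step.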
Methods
- __init__(*args, **kwargs)
- ft_ch(N, y): Compute the Calinski and Harabasz index.
- ft_int(N, y[, dist_metric, cls_inds, ...]): Compute the INT index.
- ft_nre(y[, class_freqs]): Compute the normalized relative entropy.
- ft_pb(N, y[, dist_metric]): Compute the Pearson correlation between class matching and instance distances.
- ft_sc(y[, size, normalize, class_freqs]): Compute the number of clusters with size smaller than a given size.
- ft_sil(N, y[, dist_metric, sample_frac, ...]): Compute the mean silhouette value.
- ft_vdb(N, y): Compute the Davies and Bouldin Index.
- ft_vdu(N, y[, dist_metric, cls_inds, ...]): Compute the Dunn Index.
- precompute_class_representatives(N[, y, ...]): Precomputations related to cluster representative instances.
- precompute_clustering_class([y]): Precompute distinct classes and their frequencies from y.
- precompute_group_distances(N[, y, ...]): Precompute distance metrics between instances.
- precompute_nearest_neighbors(N[, y, ...]): Precompute the n_neighbors Nearest Neighbors of every instance.
- classmethod ft_ch(N: ndarray, y: ndarray) -> float
Compute the Calinski and Harabasz index.
Check sklearn.metrics.calinski_harabasz_score documentation for more information.
- Parameters
- N : np.ndarray
Attributes from fitted data.
- y : np.ndarray
Instance cluster index (or target attribute).
- Returns
- float
Calinski-Harabasz index.
References
- 1
T. Calinski, J. Harabasz, A dendrite method for cluster analysis, Commun. Stat. Theory Methods 3 (1) (1974) 1–27.
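To make the quantity concrete, here is a minimal NumPy sketch of the textbook Calinski-Harabasz definition: the between-cluster dispersion divided by the within-cluster dispersion, each scaled by its degrees of freedom. The function name `calinski_harabasz` is hypothetical; the documentation above points to `sklearn.metrics.calinski_harabasz_score` for the reference behaviour, and this sketch is not taken from the pymfe source.

```python
import numpy as np


def calinski_harabasz(N: np.ndarray, y: np.ndarray) -> float:
    """Textbook Calinski-Harabasz index (needs 2 <= clusters < instances)."""
    classes = np.unique(y)
    n_inst, n_cls = N.shape[0], classes.size
    overall_mean = N.mean(axis=0)

    between, within = 0.0, 0.0
    for cls in classes:
        members = N[y == cls]
        centroid = members.mean(axis=0)
        # Dispersion of the centroid around the global mean, weighted by size.
        between += members.shape[0] * np.sum((centroid - overall_mean) ** 2)
        # Dispersion of the members around their own centroid.
        within += np.sum((members - centroid) ** 2)

    return float((between / (n_cls - 1)) / (within / (n_inst - n_cls)))


N = np.array([[0.0, 0.0], [0.0, 1.0], [4.0, 0.0], [4.0, 1.0]])
y = np.array([0, 0, 1, 1])
ch_value = calinski_harabasz(N, y)
print(ch_value)  # 32.0 for this toy data
```

Higher values indicate compact clusters that lie far apart from each other.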
- classmethod ft_int(N: ndarray, y: ndarray, dist_metric: str = 'euclidean', cls_inds: Optional[ndarray] = None, classes: Optional[ndarray] = None, pairwise_norm_intercls_dist: Optional[List[ndarray]] = None) -> float
Compute the INT index.
Metric range is from 0 (inclusive) to infinity.
- Parameters
- N : np.ndarray
Attributes from fitted data.
- y : np.ndarray
Instance cluster index (or target attribute).
- dist_metric : str, optional
The distance metric used to calculate the distances between instances. Check scipy.spatial.distance documentation for a full list of valid distance metrics. If precomputation in clustering metafeatures is enabled, then this parameter has no effect.
- cls_inds : np.ndarray, optional
Boolean array which indicates the examples of each class. The rows represent each distinct class, and the columns represent the instances. Used to take advantage of precomputations.
- classes : np.ndarray, optional
Distinct classes in y. Used to exploit precomputations.
- pairwise_norm_intercls_dist : np.ndarray, optional
Normalized pairwise distances between instances of different classes. Used to exploit precomputations.
- Returns
- float
INT index.
References
- 1
SOUZA, Bruno Feres de. Meta-aprendizagem aplicada à classificação de dados de expressão gênica. 2010. Tese (Doutorado em Ciências de Computação e Matemática Computacional), Instituto de Ciências Matemáticas e de Computação, Universidade de São Paulo, São Carlos, 2010. doi:10.11606/T.55.2010.tde-04012011-142551.
- 2
Bezdek, J. C.; Pal, N. R. (1998a). Some new indexes of cluster validity. IEEE Transactions on Systems, Man, and Cybernetics, Part B, v.28, n.3, p.301–315.
- classmethod ft_nre(y: ndarray, class_freqs: Optional[ndarray] = None) -> float
Compute the normalized relative entropy.
An indicator of how uniformly instances are distributed among clusters.
- Parameters
- y : np.ndarray
Instance cluster index (or target attribute).
- class_freqs : np.ndarray, optional
Absolute class frequencies. Used to exploit precomputations.
- Returns
- float
Entropy of relative class frequencies.
References
- 1
Bruno Almeida Pimentel, André C.P.L.F. de Carvalho. A new data characterization for selecting clustering algorithms using meta-learning. Information Sciences, Volume 477, 2019, Pages 203-219.
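As an illustration of the returned quantity, the sketch below computes the Shannon entropy of the relative class frequencies with NumPy. The function name `class_frequency_entropy` is hypothetical, and the natural logarithm is an assumption; the pymfe implementation may use a different base or apply further normalization.

```python
from typing import Optional

import numpy as np


def class_frequency_entropy(
    y: np.ndarray, class_freqs: Optional[np.ndarray] = None
) -> float:
    """Shannon entropy (natural log, an assumption) of relative class sizes."""
    if class_freqs is None:
        # Same role as the optional precomputed argument in the signature above.
        _, class_freqs = np.unique(y, return_counts=True)
    probs = class_freqs / class_freqs.sum()
    probs = probs[probs > 0]  # avoid log(0) for empty classes
    return float(-np.sum(probs * np.log(probs)))


uniform = class_frequency_entropy(np.array([0, 1, 2, 3]))
single = class_frequency_entropy(np.array([7, 7, 7, 7]))
skewed = class_frequency_entropy(np.array([0, 0, 0, 0, 0, 0, 1, 2]))
print(uniform, single, skewed)
```

The value is largest when all clusters have the same size and drops to zero when every instance falls into a single cluster.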
- classmethod ft_pb(N: ndarray, y: ndarray, dist_metric: str = 'euclidean') -> float
Compute the Pearson correlation between class matching and instance distances.
The measure ranges from -1 to +1 (both inclusive).
- Parameters
- N : np.ndarray
Attributes from fitted data.
- y : np.ndarray
Instance cluster index (or target attribute).
- dist_metric : str, optional
The distance metric used to calculate the distances between instances. Check scipy.spatial.distance for a full list of valid distance metrics.
- Returns
- float
Point Biserial coefficient.
References
- 1
J. Lev, “The Point Biserial Coefficient of Correlation”, Ann. Math. Statist., Vol. 20, no.1, pp. 125-126, 1949.
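One way to read this measure is as the Pearson correlation, taken over all pairs of instances, between the pair distance and a binary flag describing whether the pair shares a cluster. The sketch below follows that reading with Euclidean distances. The function name `point_biserial` is hypothetical, and coding the flag as 1 for pairs in different clusters is an assumption (the opposite coding only flips the sign).

```python
import numpy as np


def point_biserial(N: np.ndarray, y: np.ndarray) -> float:
    """Correlation between pair distances and a "different cluster" flag."""
    # Indices of every distinct pair (i < j) of instances.
    rows, cols = np.triu_indices(N.shape[0], k=1)
    dists = np.linalg.norm(N[rows] - N[cols], axis=1)
    # 1.0 when the pair belongs to different clusters (assumed coding).
    different = (y[rows] != y[cols]).astype(float)
    return float(np.corrcoef(dists, different)[0, 1])


N = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 0.0], [5.0, 1.0]])
y = np.array([0, 0, 1, 1])
pb_value = point_biserial(N, y)
print(pb_value)  # close to +1: distant pairs are in different clusters
```

The correlation is undefined when all pairs share a cluster or all pairs lie in different clusters, because the flag is then constant.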
- classmethod ft_sc(y: ndarray, size: int = 15, normalize: bool = False, class_freqs: Optional[ndarray] = None) -> int
Compute the number of clusters with size smaller than a given size.
- Parameters
- y : np.ndarray
Instance cluster index (or target attribute).
- size : int, optional
Maximum (exclusive) size of classes to be considered.
- normalize : bool, optional
If True, then the result will be the proportion of classes with fewer than size instances out of the total number of classes (i.e., the result is divided by the number of classes).
- class_freqs : np.ndarray, optional
Class (absolute) frequencies. Used to exploit precomputations.
- Returns
- int or float
Number of classes with fewer than size instances if normalize is False, proportion of classes with fewer than size instances otherwise.
References
- 1
Bruno Almeida Pimentel, André C.P.L.F. de Carvalho. A new data characterization for selecting clustering algorithms using meta-learning. Information Sciences, Volume 477, 2019, Pages 203-219.
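The behaviour described by the parameters above can be sketched in a few lines of NumPy. The function name `small_cluster_count` is hypothetical; the sketch only mirrors the documented semantics (exclusive upper bound `size`, optional division by the number of classes) and is not copied from the pymfe source.

```python
from typing import Optional, Union

import numpy as np


def small_cluster_count(
    y: np.ndarray,
    size: int = 15,
    normalize: bool = False,
    class_freqs: Optional[np.ndarray] = None,
) -> Union[int, float]:
    """Count the classes that have strictly fewer than ``size`` instances."""
    if class_freqs is None:
        _, class_freqs = np.unique(y, return_counts=True)
    count = int(np.sum(class_freqs < size))
    if normalize:
        # Proportion of small classes among all classes.
        return count / class_freqs.size
    return count


# Three clusters with 20, 5, and 3 instances.
y = np.array([0] * 20 + [1] * 5 + [2] * 3)
n_small = small_cluster_count(y)
prop_small = small_cluster_count(y, normalize=True)
n_small_strict = small_cluster_count(y, size=5)
print(n_small, prop_small, n_small_strict)  # 2, 0.666..., 1
```

With `size=5`, the cluster with exactly five instances is not counted because the bound is exclusive.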
- classmethod ft_sil(N: ndarray, y: ndarray, dist_metric: str = 'euclidean', sample_frac: Optional[int] = None, random_state: Optional[int] = None) -> float
Compute the mean silhouette value.
Metric range is -1 to +1 (both inclusive).
Check sklearn.metrics.silhouette_score documentation for more information.
- Parameters
- N : np.ndarray
Attributes from fitted data.
- y : np.ndarray
Instance cluster index (or target attribute).
- dist_metric : str, optional
The distance metric used to calculate the distances between instances. Check sklearn.neighbors.DistanceMetric documentation for a full list of valid distance metrics.
- sample_frac : int, optional
Sample fraction used to compute the silhouette coefficient. If None is given, then all data is used.
- random_state : int, optional
Used if sample_frac is not None. Random seed used while sampling the data.
- Returns
- float
Mean Silhouette value.
References
- 1
P.J. Rousseeuw, Silhouettes: a graphical aid to the interpretation and validation of cluster analysis, J. Comput. Appl. Math. 20 (1987) 53–65.
- classmethod ft_vdb(N: ndarray, y: ndarray) -> float
Compute the Davies and Bouldin Index.
Metric range is from 0 (inclusive) to infinity.
Check sklearn.metrics.davies_bouldin_score documentation for more information.
- Parameters
- N : np.ndarray
Attributes from fitted data.
- y : np.ndarray
Instance cluster index (or target attribute).
References
- 1
D.L. Davies, D.W. Bouldin, A cluster separation measure, IEEE Trans. Pattern Anal. Mach. Intell. 1 (2) (1979) 224–227.
- classmethod ft_vdu(N: ndarray, y: ndarray, dist_metric: str = 'euclidean', cls_inds: Optional[ndarray] = None, classes: Optional[ndarray] = None, intracls_dists: Optional[ndarray] = None, pairwise_norm_intercls_dist: Optional[List[ndarray]] = None) -> float
Compute the Dunn Index.
Metric range is from 0 (inclusive) to infinity.
- Parameters
- N : np.ndarray
Attributes from fitted data.
- y : np.ndarray
Instance cluster index (or target attribute).
- dist_metric : str, optional
The distance metric used to calculate the distances between instances. Check scipy.spatial.distance documentation for a full list of valid distance metrics. If precomputation in clustering metafeatures is enabled, then this parameter has no effect.
- cls_inds : np.ndarray, optional
Boolean array which indicates the examples of each class. The rows represent each distinct class, and the columns represent the instances. Used to take advantage of precomputations.
- classes : np.ndarray, optional
Distinct classes in y. Used to exploit precomputations.
- intracls_dists : np.ndarray, optional
Distance between the farthest pair of instances in the same class, for each class. Used to exploit precomputations.
- pairwise_norm_intercls_dist : np.ndarray, optional
Normalized pairwise distances between instances of different classes.
- Returns
- float
Dunn index for given parameters.
References
- 1
J.C. Dunn, Well-separated clusters and optimal fuzzy partitions, J. Cybern. 4 (1) (1974) 95–104.
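The classical Dunn index is the smallest distance between instances of different clusters divided by the largest distance between instances of the same cluster. The NumPy sketch below implements that classical form with Euclidean distances. The function name `dunn_index` is hypothetical, and because the parameters above mention normalized inter-class distances, the value returned by pymfe may differ from this unnormalized version.

```python
import numpy as np


def dunn_index(N: np.ndarray, y: np.ndarray) -> float:
    """Classical Dunn index: min inter-cluster / max intra-cluster distance."""
    # Full pairwise Euclidean distance matrix via broadcasting.
    diffs = N[:, None, :] - N[None, :, :]
    dists = np.sqrt(np.sum(diffs ** 2, axis=-1))
    # same[i, j] is True when instances i and j share a cluster.
    same = y[:, None] == y[None, :]
    min_inter = dists[~same].min()  # closest pair across clusters
    max_intra = dists[same].max()   # largest cluster diameter
    return float(min_inter / max_intra)


N = np.array([[0.0, 0.0], [0.0, 1.0], [4.0, 0.0], [4.0, 1.0]])
y = np.array([0, 0, 1, 1])
dunn_value = dunn_index(N, y)
print(dunn_value)  # 4.0 for this toy data
```

The sketch assumes at least two clusters and at least one cluster with two distinct instances; otherwise the minimum is taken over an empty set or the denominator is zero.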
- classmethod precompute_class_representatives(N: ndarray, y: Optional[ndarray] = None, representative: str = 'mean', classes: Optional[ndarray] = None, **kwargs) -> Dict[str, Any]
Precomputations related to cluster representative instances.
- Parameters
- N : np.ndarray
Numerical fitted data.
- y : np.ndarray, optional
Instance cluster index (or target attribute).
- dist_metric : str, optional
The distance metric used to calculate the distances between instances. Check sklearn.neighbors.DistanceMetric documentation for a full list of valid distance metrics.
- representative : str or np.ndarray or List, optional
- If representative is string-type, then it must be either median or mean, and the selected method is used to estimate the representative instance of each class (e.g., if mean is selected, then the mean of attributes of all instances of the same class is used to represent that class).
- If representative is a List or has np.ndarray type, then its length must be the number of different classes in y and each of its elements must be a representative instance for each class. For example, the following 2-D array is the representative of the Iris dataset, calculated using the mean value of instances of the same class (effectively holding the same result as if the argument value was the character string mean):
[[ 5.006 3.428 1.462 0.246]   # 'Setosa' mean values
 [ 5.936 2.77  4.26  1.326]   # 'Versicolor' mean values
 [ 6.588 2.974 5.552 2.026]]  # 'Virginica' mean values
The attribute order must be, of course, the same as the original instances in the dataset.
- classes : np.ndarray, optional
Distinct classes in y. Used to exploit precomputations.
- **kwargs
Additional arguments. They may have been precomputed earlier by other precomputation methods, so they can help speed up this precomputation.
- Returns
- dict
The following precomputed items are returned:
- pairwise_intracls_dists (np.ndarray): distance between each distinct pair of instances of the same class.
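The two accepted forms of the representative argument can be sketched as follows. The function name `class_representatives` is hypothetical and the code only mirrors the parameter description above: a string selects a per-class mean or median, while an array-like value is taken as user-supplied representatives and checked for one row per class. The ordering of rows by sorted class label is an assumption.

```python
from typing import Sequence, Union

import numpy as np


def class_representatives(
    N: np.ndarray,
    y: np.ndarray,
    representative: Union[str, np.ndarray, Sequence] = "mean",
) -> np.ndarray:
    """Return one representative instance (row) per distinct class in ``y``."""
    classes = np.unique(y)  # sorted distinct labels (assumed row order)

    if isinstance(representative, str):
        reducers = {"mean": np.mean, "median": np.median}
        if representative not in reducers:
            raise ValueError("representative must be 'mean' or 'median'.")
        reduce_fn = reducers[representative]
        return np.array([reduce_fn(N[y == cls], axis=0) for cls in classes])

    # User-supplied representatives: one row per class is required.
    custom = np.asarray(representative, dtype=float)
    if custom.shape[0] != classes.size:
        raise ValueError("Expected one representative per distinct class.")
    return custom


N = np.array([[1.0, 2.0], [3.0, 4.0], [10.0, 10.0]])
y = np.array([0, 0, 1])
by_mean = class_representatives(N, y, "mean")
by_hand = class_representatives(N, y, [[2.0, 3.0], [10.0, 10.0]])
print(by_mean)
```

Passing the per-class means by hand gives the same array as the string mean, which matches the Iris remark in the parameter description.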
- classmethod precompute_clustering_class(y: Optional[ndarray] = None, **kwargs) -> Dict[str, Any]
Precompute distinct classes and their frequencies from y.
- Parameters
- y : np.ndarray, optional
Instance cluster index (or target attribute).
- **kwargs
Additional arguments. They may have been precomputed earlier by other precomputation methods, so they can help speed up this precomputation.
- Returns
- dict
The following precomputed items are returned:
- classes (np.ndarray): distinct classes of y, if y is not None.
- class_freqs (np.ndarray): class frequencies of y, if y is not None.
- cls_inds (np.ndarray): Boolean array which indicates whether each example belongs to each class. The rows represent the distinct classes, and the instances are represented by the columns.
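The three returned items can be reproduced with a short NumPy sketch. The function name `clustering_class_precompute` is hypothetical, and returning an empty dictionary when y is None is an assumption made for this illustration; only the dictionary keys and the row/column layout of cls_inds are taken from the description above.

```python
from typing import Any, Dict, Optional

import numpy as np


def clustering_class_precompute(y: Optional[np.ndarray] = None) -> Dict[str, Any]:
    """Build ``classes``, ``class_freqs`` and ``cls_inds`` from ``y``."""
    if y is None:
        return {}  # assumption: nothing to precompute without labels

    classes, class_freqs = np.unique(y, return_counts=True)
    # Row i marks, for every instance (column), membership in classes[i].
    cls_inds = classes[:, None] == y[None, :]
    return {"classes": classes, "class_freqs": class_freqs, "cls_inds": cls_inds}


y = np.array([2, 0, 2, 1, 2])
precomp = clustering_class_precompute(y)
print(precomp["classes"])      # [0 1 2]
print(precomp["class_freqs"])  # [1 1 3]
print(precomp["cls_inds"].shape)  # (3, 5)
```

Each row of cls_inds can be used directly as a boolean mask over the instances, and its row sums equal class_freqs.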
- classmethod precompute_group_distances(N: ndarray, y: Optional[ndarray] = None, dist_metric: str = 'euclidean', classes: Optional[ndarray] = None, **kwargs) -> Dict[str, Any]
Precompute distance metrics between instances.
- Parameters
- N : np.ndarray
Numerical fitted data.
- y : np.ndarray, optional
Instance cluster index (or target attribute).
- dist_metric : str, optional
The distance metric used to calculate the distances between instances. Check sklearn.neighbors.DistanceMetric documentation for a full list of valid distance metrics.
- classes : np.ndarray, optional
Distinct classes in y. Used to exploit precomputations.
- **kwargs
Additional arguments. They may have been precomputed earlier by other precomputation methods, so they can help speed up this precomputation.
- Returns
- dict
The following precomputed items are returned:
- pairwise_norm_intercls_dist (np.ndarray): normalized distance between each distinct pair of instances of different classes.
- pairwise_intracls_dists (np.ndarray): distance between each distinct pair of instances of the same class.
- intracls_dists (np.ndarray): the distance between the farthest pair of instances of the same class.
The following precomputed items are necessary and are also returned, if not previously precomputed:
- classes (np.ndarray): distinct classes of y, if y is not None.
- class_freqs (np.ndarray): class frequencies of y, if y is not None.
- cls_inds (np.ndarray): Boolean array which indicates whether each example belongs to each class. The rows represent the distinct classes, and the instances are represented by the columns.
- classmethod precompute_nearest_neighbors(N: ndarray, y: Optional[ndarray] = None, n_neighbors: Optional[int] = None, dist_metric: str = 'euclidean', **kwargs) -> Dict[str, Any]
Precompute the n_neighbors Nearest Neighbors of every instance.
- Parameters
- N : np.ndarray
Numerical fitted data.
- y : np.ndarray, optional
Instance cluster index (or target attribute).
- n_neighbors : int, optional
Number of nearest neighbors returned for each instance.
- dist_metric : str, optional
The distance metric used to calculate the distances between instances. Check sklearn.neighbors.DistanceMetric documentation for a full list of valid distance metrics.
- **kwargs
Additional arguments. They may have been precomputed earlier by other precomputation methods, so they can help speed up this precomputation.
- Returns
- dict
The following precomputed items are returned:
- pairwise_intracls_dists (np.ndarray): distance between each distinct pair of instances of the same class.