pymfe.landmarking.MFELandmarking
- class pymfe.landmarking.MFELandmarking[source]
Keeps methods for metafeatures of the landmarking group.
The convention adopted for metafeature-extraction-related methods is to always start with the ft_ prefix to allow automatic method detection. This prefix is predefined within the _internal module.
All method signatures follow the conventions and restrictions listed below:
- For independent attribute data, X means every type of attribute, N means numeric attributes only, and C stands for categorical attributes only. It is important to note that the categorical attribute sets between X and C, and the numerical attribute sets between X and N, may differ due to data transformations performed while fitting data into the MFE model, enabled respectively by the transform_num and transform_cat arguments of fit (an MFE method).
- Only arguments in the MFE _custom_args_ft attribute (set up inside the fit method) are allowed to be required method arguments. All other arguments must be strictly optional (i.e., have a predefined default value).
- The initial assumption is that the user can change any optional argument, without any previous verification of argument value or its type, via the kwargs argument of the extract method of the MFE class.
- The return value of all feature extraction methods should be a single value or a generic list (preferably an np.ndarray) with numeric values.
There is another type of method adopted for automatic detection: methods with the precompute_ prefix. These methods run automatically while fitting some data into an MFE model, and their objective is to precompute some common value shared between more than one feature extraction method. This strategy trades higher system memory consumption for faster feature extraction. Their return value must always be a dictionary whose keys are possible extra arguments for both feature extraction methods and other precomputation methods. Note that precomputed values are shared between all valid feature-extraction modules (e.g., class_freqs computed in the statistical module can freely be used by any precomputation or feature extraction method of the landmarking module).
- __init__(*args, **kwargs)
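The prefix-based detection described above can be illustrated with a minimal, hypothetical sketch: methods are discovered purely by inspecting names for the ft_ or precompute_ prefix. The MFEToy class and detect helper below are illustrative stand-ins, not part of pymfe's actual API.

```python
# Hypothetical sketch of prefix-based method detection. MFEToy stands in for
# a real pymfe extractor class; only the naming convention matters here.
class MFEToy:
    @classmethod
    def ft_one_nn(cls):
        return [1.0]

    @classmethod
    def precompute_landmarking_sample(cls):
        return {"sample_inds": [0, 1, 2]}

    def helper(self):  # no recognized prefix: ignored by detection
        pass


def detect(prefix, klass):
    """Return names of all callable attributes of `klass` starting with `prefix`."""
    return sorted(
        name for name in dir(klass)
        if name.startswith(prefix) and callable(getattr(klass, name))
    )


feature_methods = detect("ft_", MFEToy)
precomp_methods = detect("precompute_", MFEToy)
```

This is why required arguments must come from _custom_args_ft and everything else must have defaults: the framework calls detected methods generically, without knowing their individual signatures.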
Methods
- __init__(*args, **kwargs)
- ft_best_node(N, y, score[, skf, ...]): Performance of the best single decision tree node.
- ft_elite_nn(N, y, score[, skf, ...]): Performance of Elite Nearest Neighbor.
- ft_linear_discr(N, y, score[, skf, ...]): Performance of the Linear Discriminant classifier.
- ft_naive_bayes(N, y, score[, skf, ...]): Performance of the Naive Bayes classifier.
- ft_one_nn(N, y, score[, skf, num_cv_folds, ...]): Performance of the 1-Nearest Neighbor classifier.
- ft_random_node(N, y, score[, skf, ...]): Performance of the single decision tree node model induced by a random attribute.
- ft_worst_node(N, y, score[, skf, ...]): Performance of the single decision tree node model induced by the worst informative attribute.
- precompute_landmarking_kfolds(N[, y, ...]): Precompute k-fold cross-validation related values.
- precompute_landmarking_sample(N, lm_sample_frac): Precompute subsampling landmarking subsample indices.
- classmethod ft_best_node(N: ndarray, y: ndarray, score: Callable[[ndarray, ndarray], ndarray], skf: Optional[StratifiedKFold] = None, num_cv_folds: int = 10, shuffle_cv_folds: bool = False, lm_sample_frac: float = 1.0, sample_inds: Optional[ndarray] = None, random_state: Optional[int] = None) -> ndarray [source]
Performance of the best single decision tree node.
Construct a single decision tree node model induced by the most informative attribute to establish linear separability.
- Parameters
- N : np.ndarray
Numerical fitted data.
- y : np.ndarray
Target attribute.
- score : callable
Function to compute the score of the k-fold evaluations. Possible functions are described in the scoring.py module.
- skf : sklearn.model_selection.StratifiedKFold, optional
Stratified k-fold cross-validator. Provides train/test indices to split data into train/test sets.
- num_cv_folds : int, optional
Number of folds for k-fold cross-validation. Used only if skf is None.
- shuffle_cv_folds : bool, optional
If True, shuffle the data before splitting it into the k-fold cross-validation folds. The random seed used for this process is the random_state argument.
- lm_sample_frac : float, optional
Proportion of instances to be sampled before extracting the metafeature. Used only if sample_inds is None.
- sample_inds : np.ndarray, optional
Array of indices of instances to be effectively used while extracting this metafeature. If None, then lm_sample_frac is taken into account. Argument used to exploit precomputations.
- random_state : int, optional
If given, set the random seed before any pseudo-random calculations to keep the experiments reproducible.
- Returns
np.ndarray
The Decision Tree best-node model performance of each fold.
References
- 1
Hilan Bensusan and Christophe Giraud-Carrier. Discovering task neighbourhoods through landmark learning performances. In 4th European Conference on Principles of Data Mining and Knowledge Discovery (PKDD), pages 325 – 330, 2000.
- 2
Johannes Furnkranz and Johann Petrak. An evaluation of landmarking variants. In 1st ECML/PKDD International Workshop on Integration and Collaboration Aspects of Data Mining, Decision Support and Meta-Learning (IDDM), pages 57 – 68, 2001.
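A minimal, hypothetical sketch of what the best-node landmarker measures: fit a one-level decision stump on each attribute and keep the most accurate one. All names below are illustrative; the real method evaluates the stump inside the stratified k-fold splitter described above and returns one score per fold.

```python
# Toy best-node landmarker: exhaustively try every (attribute, threshold)
# split and score each side by majority vote. No CV loop, for brevity.
def stump_accuracy(column, y, threshold):
    """Accuracy of a single split at `threshold` with majority-vote leaves."""
    left = [lab for val, lab in zip(column, y) if val <= threshold]
    right = [lab for val, lab in zip(column, y) if val > threshold]
    correct = 0
    for side in (left, right):
        if side:  # each leaf predicts its majority class
            correct += max(side.count(c) for c in set(side))
    return correct / len(y)


def best_node_score(N, y):
    """Best stump accuracy over all attributes and candidate thresholds."""
    best = 0.0
    for j in range(len(N[0])):
        col = [row[j] for row in N]
        for t in sorted(set(col)):
            best = max(best, stump_accuracy(col, y, t))
    return best


# Attribute 0 separates the two classes perfectly; attribute 1 is noise.
N = [[0.1, 5.0], [0.2, 1.0], [0.9, 4.0], [0.8, 2.0]]
y = [0, 0, 1, 1]
```

ft_worst_node and ft_random_node follow the same idea but pick the least informative attribute or a random one, respectively.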
- classmethod ft_elite_nn(N: ndarray, y: ndarray, score: Callable[[ndarray, ndarray], ndarray], skf: Optional[StratifiedKFold] = None, num_cv_folds: int = 10, shuffle_cv_folds: bool = False, lm_sample_frac: float = 1.0, sample_inds: Optional[ndarray] = None, random_state: Optional[int] = None, cv_folds_imp_rank: Optional[ndarray] = None) -> ndarray [source]
Performance of Elite Nearest Neighbor.
Elite nearest neighbor uses the most informative attribute in the dataset to induce the 1-nearest neighbor.
With this subset of informative attributes, the model is expected to be noise tolerant.
- Parameters
- N : np.ndarray
Numerical fitted data.
- y : np.ndarray
Target attribute.
- score : callable
Function to compute the score of the k-fold evaluations. Possible functions are described in the scoring.py module.
- skf : sklearn.model_selection.StratifiedKFold, optional
Stratified k-fold cross-validator. Provides train/test indices to split data into train/test sets.
- num_cv_folds : int, optional
Number of folds for k-fold cross-validation. Used only if skf is None.
- shuffle_cv_folds : bool, optional
If True, shuffle the data before splitting it into the k-fold cross-validation folds. The random seed used for this process is the random_state argument.
- lm_sample_frac : float, optional
Proportion of instances to be sampled before extracting the metafeature. Used only if sample_inds is None.
- sample_inds : np.ndarray, optional
Array of indices of instances to be effectively used while extracting this metafeature. If None, then lm_sample_frac is taken into account. Argument used to exploit precomputations.
- random_state : int, optional
If given, set the random seed before any pseudo-random calculations to keep the experiments reproducible.
- cv_folds_imp_rank : np.ndarray, optional
Ranking based on the predictive attribute importance per cross-validation fold. The rows correspond to each fold, and the columns correspond to each predictive attribute. Argument used to take advantage of precomputations. Do not use it if the k-fold cross-validation splitter shuffles the data with no fixed random seed.
- Returns
np.ndarray
The Elite 1-NN model performance of each fold.
References
- 1
Hilan Bensusan and Christophe Giraud-Carrier. Discovering task neighbourhoods through landmark learning performances. In 4th European Conference on Principles of Data Mining and Knowledge Discovery (PKDD), pages 325 – 330, 2000.
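The Elite-NN idea can be sketched in a few lines: run 1-NN restricted to the single most informative attribute. In the toy below that attribute is simply assumed to be column 0 (the real method derives the ranking from attribute importance, cf. cv_folds_imp_rank), and the CV loop is replaced by leave-one-out scoring for brevity.

```python
# Toy Elite-NN: 1-nearest-neighbor on one attribute, leave-one-out accuracy.
def one_nn_loo_accuracy(values, y):
    """Leave-one-out 1-NN accuracy using a single numeric attribute."""
    correct = 0
    for i, v in enumerate(values):
        # nearest neighbour among all *other* instances
        j = min((k for k in range(len(values)) if k != i),
                key=lambda k: abs(values[k] - v))
        correct += int(y[j] == y[i])
    return correct / len(y)


# Column 0 is informative; column 1 actively pairs up opposite classes.
N = [[0.1, 9.0], [0.2, 1.0], [0.9, 8.5], [0.8, 1.5]]
y = [0, 0, 1, 1]
elite_col = [row[0] for row in N]  # assumed "most informative" attribute
```

Restricting 1-NN to the informative attribute filters out exactly the kind of noisy dimension that column 1 represents, which is the noise-tolerance argument made above.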
- classmethod ft_linear_discr(N: ndarray, y: ndarray, score: Callable[[ndarray, ndarray], ndarray], skf: Optional[StratifiedKFold] = None, num_cv_folds: int = 10, shuffle_cv_folds: bool = False, lm_sample_frac: float = 1.0, sample_inds: Optional[ndarray] = None, random_state: Optional[int] = None) -> ndarray [source]
Performance of the Linear Discriminant classifier.
The Linear Discriminant classifier is used to construct a linear split (not parallel to an axis) in the data to establish linear separability.
- Parameters
- N : np.ndarray
Numerical fitted data.
- y : np.ndarray
Target attribute.
- score : callable
Function to compute the score of the k-fold evaluations. Possible functions are described in the scoring.py module.
- skf : sklearn.model_selection.StratifiedKFold, optional
Stratified k-fold cross-validator. Provides train/test indices to split data into train/test sets.
- num_cv_folds : int, optional
Number of folds for k-fold cross-validation. Used only if skf is None.
- shuffle_cv_folds : bool, optional
If True, shuffle the data before splitting it into the k-fold cross-validation folds. The random seed used for this process is the random_state argument.
- lm_sample_frac : float, optional
Proportion of instances to be sampled before extracting the metafeature. Used only if sample_inds is None.
- sample_inds : np.ndarray, optional
Array of indices of instances to be effectively used while extracting this metafeature. If None, then lm_sample_frac is taken into account. Argument used to exploit precomputations.
- random_state : int, optional
If given, set the random seed before any pseudo-random calculations to keep the experiments reproducible.
- Returns
np.ndarray
The Linear Discriminant Analysis model performance of each fold.
References
- 1
Hilan Bensusan and Christophe Giraud-Carrier. Discovering task neighbourhoods through landmark learning performances. In 4th European Conference on Principles of Data Mining and Knowledge Discovery (PKDD), pages 325 – 330, 2000.
- 2
Johannes Furnkranz and Johann Petrak. An evaluation of landmarking variants. In 1st ECML/PKDD International Workshop on Integration and Collaboration Aspects of Data Mining, Decision Support and Meta-Learning (IDDM), pages 57 – 68, 2001.
- classmethod ft_naive_bayes(N: ndarray, y: ndarray, score: Callable[[ndarray, ndarray], ndarray], skf: Optional[StratifiedKFold] = None, num_cv_folds: int = 10, shuffle_cv_folds: bool = False, lm_sample_frac: float = 1.0, sample_inds: Optional[ndarray] = None, random_state: Optional[int] = None) -> ndarray [source]
Performance of the Naive Bayes classifier.
It assumes that the attributes are independent and assigns each example to a class using Bayes' theorem.
- Parameters
- N : np.ndarray
Numerical fitted data.
- y : np.ndarray
Target attribute.
- score : callable
Function to compute the score of the k-fold evaluations. Possible functions are described in the scoring.py module.
- skf : sklearn.model_selection.StratifiedKFold, optional
Stratified k-fold cross-validator. Provides train/test indices to split data into train/test sets.
- num_cv_folds : int, optional
Number of folds for k-fold cross-validation. Used only if skf is None.
- shuffle_cv_folds : bool, optional
If True, shuffle the data before splitting it into the k-fold cross-validation folds. The random seed used for this process is the random_state argument.
- lm_sample_frac : float, optional
Proportion of instances to be sampled before extracting the metafeature. Used only if sample_inds is None.
- sample_inds : np.ndarray, optional
Array of indices of instances to be effectively used while extracting this metafeature. If None, then lm_sample_frac is taken into account. Argument used to exploit precomputations.
- random_state : int, optional
If given, set the random seed before any pseudo-random calculations to keep the experiments reproducible.
- Returns
np.ndarray
The Naive Bayes model performance of each fold.
References
- 1
Hilan Bensusan and Christophe Giraud-Carrier. Discovering task neighbourhoods through landmark learning performances. In 4th European Conference on Principles of Data Mining and Knowledge Discovery (PKDD), pages 325 – 330, 2000.
- 2
Johannes Furnkranz and Johann Petrak. An evaluation of landmarking variants. In 1st ECML/PKDD International Workshop on Integration and Collaboration Aspects of Data Mining, Decision Support and Meta-Learning (IDDM), pages 57 – 68, 2001.
- classmethod ft_one_nn(N: ndarray, y: ndarray, score: Callable[[ndarray, ndarray], ndarray], skf: Optional[StratifiedKFold] = None, num_cv_folds: int = 10, shuffle_cv_folds: bool = False, lm_sample_frac: float = 1.0, sample_inds: Optional[ndarray] = None, random_state: Optional[int] = None) -> ndarray [source]
Performance of the 1-Nearest Neighbor classifier.
It uses the Euclidean distance to the nearest neighbor to determine how noisy the data is.
- Parameters
- N : np.ndarray
Numerical fitted data.
- y : np.ndarray
Target attribute.
- score : callable
Function to compute the score of the k-fold evaluations. Possible functions are described in the scoring.py module.
- skf : sklearn.model_selection.StratifiedKFold, optional
Stratified k-fold cross-validator. Provides train/test indices to split data into train/test sets.
- num_cv_folds : int, optional
Number of folds for k-fold cross-validation. Used only if skf is None.
- shuffle_cv_folds : bool, optional
If True, shuffle the data before splitting it into the k-fold cross-validation folds. The random seed used for this process is the random_state argument.
- lm_sample_frac : float, optional
Proportion of instances to be sampled before extracting the metafeature. Used only if sample_inds is None.
- sample_inds : np.ndarray, optional
Array of indices of instances to be effectively used while extracting this metafeature. If None, then lm_sample_frac is taken into account. Argument used to exploit precomputations.
- random_state : int, optional
If given, set the random seed before any pseudo-random calculations to keep the experiments reproducible.
- Returns
np.ndarray
The 1-NN model performance of each fold.
References
- 1
Hilan Bensusan and Christophe Giraud-Carrier. Discovering task neighbourhoods through landmark learning performances. In 4th European Conference on Principles of Data Mining and Knowledge Discovery (PKDD), pages 325 – 330, 2000.
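What the 1-NN landmarker measures can be sketched with a pure-Python leave-one-out evaluation under Euclidean distance (the real method instead scores a 1-nearest-neighbor classifier per fold of the stratified splitter described above; everything named here is illustrative).

```python
# Toy 1-NN landmarker: leave-one-out accuracy under Euclidean distance.
import math


def euclidean(a, b):
    return math.sqrt(sum((x - z) ** 2 for x, z in zip(a, b)))


def one_nn_loo(N, y):
    """Leave-one-out 1-NN accuracy over all attributes."""
    correct = 0
    for i, row in enumerate(N):
        # nearest neighbour among all other instances
        j = min((k for k in range(len(N)) if k != i),
                key=lambda k: euclidean(N[k], row))
        correct += int(y[j] == y[i])
    return correct / len(y)


# Two well-separated clusters: every instance's nearest neighbour shares
# its label, so the landmarker reports a clean (non-noisy) dataset.
N = [[0.0, 0.0], [0.1, 0.1], [1.0, 1.0], [0.9, 1.1]]
y = [0, 0, 1, 1]
```

A high 1-NN score signals locally consistent labels; a low score suggests label noise or heavily overlapping classes.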
- classmethod ft_random_node(N: ndarray, y: ndarray, score: Callable[[ndarray, ndarray], ndarray], skf: Optional[StratifiedKFold] = None, num_cv_folds: int = 10, shuffle_cv_folds: bool = False, lm_sample_frac: float = 1.0, sample_inds: Optional[ndarray] = None, random_state: Optional[int] = None) -> ndarray [source]
Performance of the single decision tree node model induced by a random attribute.
- Parameters
- N : np.ndarray
Numerical fitted data.
- y : np.ndarray
Target attribute.
- score : callable
Function to compute the score of the k-fold evaluations. Possible functions are described in the scoring.py module.
- skf : sklearn.model_selection.StratifiedKFold, optional
Stratified k-fold cross-validator. Provides train/test indices to split data into train/test sets.
- num_cv_folds : int, optional
Number of folds for k-fold cross-validation. Used only if skf is None.
- shuffle_cv_folds : bool, optional
If True, shuffle the data before splitting it into the k-fold cross-validation folds. The random seed used for this process is the random_state argument.
- lm_sample_frac : float, optional
Proportion of instances to be sampled before extracting the metafeature. Used only if sample_inds is None.
- sample_inds : np.ndarray, optional
Array of indices of instances to be effectively used while extracting this metafeature. If None, then lm_sample_frac is taken into account. Argument used to exploit precomputations.
- random_state : int, optional
If given, set the random seed before any pseudo-random calculations to keep the experiments reproducible.
- Returns
np.ndarray
The Decision Tree random-node model performance of each fold.
References
- 1
Hilan Bensusan and Christophe Giraud-Carrier. Discovering task neighbourhoods through landmark learning performances. In 4th European Conference on Principles of Data Mining and Knowledge Discovery (PKDD), pages 325 – 330, 2000.
- 2
Johannes Furnkranz and Johann Petrak. An evaluation of landmarking variants. In 1st ECML/PKDD International Workshop on Integration and Collaboration Aspects of Data Mining, Decision Support and Meta-Learning (IDDM), pages 57 – 68, 2001.
- classmethod ft_worst_node(N: ndarray, y: ndarray, score: Callable[[ndarray, ndarray], ndarray], skf: Optional[StratifiedKFold] = None, num_cv_folds: int = 10, shuffle_cv_folds: bool = False, lm_sample_frac: float = 1.0, sample_inds: Optional[ndarray] = None, random_state: Optional[int] = None, cv_folds_imp_rank: Optional[ndarray] = None) -> ndarray [source]
Performance of the single decision tree node model induced by the worst informative attribute.
- Parameters
- N : np.ndarray
Numerical fitted data.
- y : np.ndarray
Target attribute.
- score : callable
Function to compute the score of the k-fold evaluations. Possible functions are described in the scoring.py module.
- skf : sklearn.model_selection.StratifiedKFold, optional
Stratified k-fold cross-validator. Provides train/test indices to split data into train/test sets.
- num_cv_folds : int, optional
Number of folds for k-fold cross-validation. Used only if skf is None.
- shuffle_cv_folds : bool, optional
If True, shuffle the data before splitting it into the k-fold cross-validation folds. The random seed used for this process is the random_state argument.
- lm_sample_frac : float, optional
Proportion of instances to be sampled before extracting the metafeature. Used only if sample_inds is None.
- sample_inds : np.ndarray, optional
Array of indices of instances to be effectively used while extracting this metafeature. If None, then lm_sample_frac is taken into account. Argument used to exploit precomputations.
- random_state : int, optional
If given, set the random seed before any pseudo-random calculations to keep the experiments reproducible.
- cv_folds_imp_rank : np.ndarray, optional
Ranking based on the predictive attribute importance per cross-validation fold. The rows correspond to each fold, and the columns correspond to each predictive attribute. Argument used to take advantage of precomputations. Do not use it if the k-fold cross-validation splitter shuffles the data with no fixed random seed.
- Returns
np.ndarray
The Decision Tree worst-node model performance of each fold.
References
- 1
Hilan Bensusan and Christophe Giraud-Carrier. Discovering task neighbourhoods through landmark learning performances. In 4th European Conference on Principles of Data Mining and Knowledge Discovery (PKDD), pages 325 – 330, 2000.
- 2
Johannes Furnkranz and Johann Petrak. An evaluation of landmarking variants. In 1st ECML/PKDD International Workshop on Integration and Collaboration Aspects of Data Mining, Decision Support and Meta-Learning (IDDM), pages 57 – 68, 2001.
- classmethod precompute_landmarking_kfolds(N: ndarray, y: Optional[ndarray] = None, num_cv_folds: int = 10, shuffle_cv_folds: Optional[bool] = False, random_state: Optional[int] = None, lm_sample_frac: float = 1.0, **kwargs) -> Dict[str, Any] [source]
Precompute k-fold cross validation related values.
- Parameters
- N : np.ndarray
Numerical fitted data.
- y : np.ndarray, optional
Target attribute.
- num_cv_folds : int, optional
Number of folds for k-fold cross-validation.
- shuffle_cv_folds : bool, optional
If True, shuffle the samples before splitting them into the k-fold cross-validation folds.
- random_state : int, optional
If given, set the random seed before any pseudo-random calculations to keep the experiments reproducible.
- lm_sample_frac : float, optional
The percentage of examples subsampled. A value different from the default will generate the subsampling-based relative landmarking metafeatures. Used only if sample_inds is not precomputed and if shuffle_cv_folds is False or random_state is given.
- kwargs
Additional arguments. May have been precomputed before this method by other precomputation methods, so they can help speed up this precomputation.
- Returns
- dict
With the following precomputed items:
- skf (sklearn.model_selection.StratifiedKFold): Stratified k-fold cross-validator. Provides train/test indices to split data into train/test sets.
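The stratification this precomputed splitter provides can be illustrated with a small, hypothetical stand-in: each fold receives (roughly) the same class proportions as the full dataset. The real return value is an sklearn StratifiedKFold instance, not index lists; this sketch only shows the invariant.

```python
# Toy stratified fold assignment: round-robin indices within each class so
# every fold mirrors the overall class distribution.
from collections import defaultdict


def stratified_folds(y, num_cv_folds):
    """Assign instance indices to folds, round-robin within each class."""
    by_class = defaultdict(list)
    for idx, label in enumerate(y):
        by_class[label].append(idx)
    folds = [[] for _ in range(num_cv_folds)]
    for indices in by_class.values():
        for pos, idx in enumerate(indices):
            folds[pos % num_cv_folds].append(idx)
    return folds


# Balanced binary labels: each of the 2 folds should hold 2 of each class.
y = [0, 0, 0, 0, 1, 1, 1, 1]
folds = stratified_folds(y, 2)
```

Precomputing the splitter once lets every ft_* landmarker reuse identical folds, which is what makes their per-fold scores comparable to each other.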
- classmethod precompute_landmarking_sample(N: ndarray, lm_sample_frac: float, random_state: Optional[int] = None, **kwargs) -> Dict[str, Any] [source]
Precompute subsampling landmarking subsample indices.
- Parameters
- N : np.ndarray
Numerical fitted data.
- lm_sample_frac : float
The percentage of examples subsampled. A value different from the default will generate the subsampling-based relative landmarking metafeatures.
- random_state : int, optional
If given, set the random seed before any pseudo-random calculations to keep the experiments reproducible.
- Returns
- dict
With the following precomputed items:
- sample_inds (np.ndarray): indices related to the subsampling of the original dataset. Used only if the subsampling landmarking method is used; therefore, this value is only precomputed if lm_sample_frac is less than 1.0.
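The subsampling precomputation can be sketched as drawing a fixed-seed random subset of instance indices covering lm_sample_frac of the data. The function below is a hypothetical stand-in (pymfe returns an np.ndarray under the key sample_inds, and its exact sampling scheme may differ); it only illustrates the seeded, reproducible selection the docs describe.

```python
# Toy sample_inds precomputation: seeded random subset of instance indices.
import random


def sample_indices(num_instances, lm_sample_frac, random_state=None):
    """Indices of a random subsample; sorted for deterministic downstream use."""
    if lm_sample_frac >= 1.0:
        return list(range(num_instances))  # default: no subsampling
    rng = random.Random(random_state)
    size = int(num_instances * lm_sample_frac)
    return sorted(rng.sample(range(num_instances), size))


inds = sample_indices(10, 0.5, random_state=0)
```

Because the indices are precomputed once, every subsampling-based relative landmarker scores the exact same subset, which is what makes their values comparable.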