pymfe.mfe.MFE
- class pymfe.mfe.MFE(groups: Union[str, Iterable[str]] = 'default', features: Union[str, Iterable[str]] = 'all', summary: Union[str, Iterable[str]] = ('mean', 'sd'), measure_time: Optional[str] = None, wildcard: str = 'all', score: str = 'accuracy', num_cv_folds: int = 10, shuffle_cv_folds: bool = False, lm_sample_frac: float = 1.0, hypparam_model_dt: Optional[Dict[str, Any]] = None, suppress_warnings: bool = False, random_state: Optional[int] = None)[source]
Core class for metafeature extraction.
- Attributes
- X
List
Independent attributes of the dataset.
- y
List
Target attributes of the dataset.
- groups
tuple of str
Tuple object containing fitted meta-feature groups loaded in the model at instantiation.
- features
tuple of str
Contains loaded meta-feature extraction method names available for meta-feature extraction, from selected metafeature groups and features listed at instantiation.
- summary
tuple of str
Tuple object which contains summary function names for feature summarization.
- __init__(groups: Union[str, Iterable[str]] = 'default', features: Union[str, Iterable[str]] = 'all', summary: Union[str, Iterable[str]] = ('mean', 'sd'), measure_time: Optional[str] = None, wildcard: str = 'all', score: str = 'accuracy', num_cv_folds: int = 10, shuffle_cv_folds: bool = False, lm_sample_frac: float = 1.0, hypparam_model_dt: Optional[Dict[str, Any]] = None, suppress_warnings: bool = False, random_state: Optional[int] = None) None [source]
Provides easy access for metafeature extraction from datasets.
The user is expected to call the fit method first after instantiation, and then the extract method to effectively extract the selected metafeatures. Check reference [1] for more information.
- Parameters
- groups
Iterable of str or str
A collection or a single metafeature group name representing the desired group of metafeatures for extraction. Use the method valid_groups to get a list of all available groups.
Setting with 'all' enables all available groups.
Setting with 'default' enables general, info-theory, statistical, model-based and landmarking. It is the default value.
The value provided by the argument wildcard can be used to select all metafeature groups rapidly.
- features
Iterable of str or str, optional
A collection or a single metafeature name desired for extraction. Keep in mind that the extraction only gathers features that are also in the selected groups. Check the features attribute of this class to get a list of available metafeatures from selected groups, or use the method valid_metafeatures to get a list of all available metafeatures filtered by group. Alternatively, you can use the method metafeature_description to get or print a table with all metafeatures with their respective groups and descriptions.
The value provided by the argument wildcard can be used to select all features from all selected groups rapidly.
- summary
Iterable of str or str, optional
A collection or a single summary function to summarize a group of metafeature measures into a fixed-length group of values, typically a single value. The values must be one of the following:
mean: Average of the values.
sd: Standard deviation of the values.
count: Computes the cardinality of the measure. Suitable for variable cardinality.
histogram: Describes the distribution of the measured values. Suitable for high cardinality.
iq_range: Computes the interquartile range of the measured values.
kurtosis: Describes the shape of the distribution of the measured values.
max: Results in the maximum value of the measure.
median: Results in the central value of the measure.
min: Results in the minimum value of the measure.
quantiles: Results in the minimum, first quartile, median, third quartile and maximum of the measured values.
range: Computes the range of the measured values.
skewness: Describes the shape of the distribution of the measured values in terms of symmetry.
You can prefix the desired summary function name with nan to use an alternative version of the same summary which ignores nan values. For instance, nanmean is the mean summary function which ignores all nan values, while naniq_range is the interquartile range calculated only with valid (non-nan) values.
If more than one summary function is selected, then all multivalued extracted metafeatures are summarized with each summary function.
The value provided by the argument wildcard can be used to select all summary functions rapidly.
Use the method valid_summary to get a list of all available summary functions.
- measure_time
str, optional
Options for measuring the time elapsed during metafeature extraction. If this argument is None, no elapsed time is measured. Otherwise, this argument must be a str valued as one of the options below:
avg: average time for each metafeature (total time divided by the feature cardinality, i.e., the number of values extracted by a single feature-extraction related method), without summarization time.
avg_summ: average time for each metafeature (total time of extraction divided by feature cardinality) including the time required for summarization.
total: total time for each metafeature, without summarization time.
total_summ: total time for each metafeature including the time required for summarization.
The cardinality of the feature is the number of values extracted by a single calculation method.
For example, the mean feature has cardinality equal to the number of numeric features in the dataset, whereas cor (from correlation) has cardinality equal to N(N - 1)/2, where N is the number of numeric features in the dataset.
The cardinality is used to divide the total execution time of that method if an option starting with avg is selected.
If a summary method has cardinality higher than one (more than one value returned after summarization and, thus, more than one entry in the result lists), such as the histogram summary method, then the corresponding time of this summary is inserted only in the first corresponding element of the time list. The remaining entries are all filled with 0, to keep the sizes of all returned lists consistent and to preserve the index correspondence between them.
- wildcard
str, optional
Value used as select all for the groups, features and summary arguments.
- score
str, optional
Score metric used to extract landmarking metafeatures.
- num_cv_folds
int, optional
Number of folds used to create the Stratified K-Fold cross validation to extract the landmarking metafeatures.
- shuffle_cv_folds
bool, optional
If True, then the fitted data is shuffled before being split in the Stratified K-Fold cross validation of landmarking features. The shuffle random seed is the random_state argument.
- lm_sample_frac
float, optional
Sample proportion used to produce the landmarking metafeatures. This argument must be in the [0.5, 1.0] interval (both ends inclusive).
- hypparam_model_dt
dict, optional
Dictionary providing extra hyperparameters for the Decision Tree model used to extract the model-based metafeatures. The class used to fit the model is sklearn.tree.DecisionTreeClassifier (sklearn library). Using this argument, it is possible to provide extra arguments to the DecisionTreeClassifier class initialization (e.g., max_depth and min_samples_split). To use this argument, provide the DecisionTreeClassifier init argument names as the dictionary keys and the corresponding custom values as the dictionary values. Example: {"min_samples_split": 10, "criterion": "entropy"}
- suppress_warnings
bool, optional
If True, then ignore all warnings invoked at instantiation time.
- random_state
int, optional
Random seed used to control random events. Keeps the experiments reproducible.
Notes
[1] Rivolli et al. “Towards Reproducible Empirical Research in Meta-Learning.” URL: https://arxiv.org/abs/1808.10406
Examples
Load a dataset
>>> from sklearn.datasets import load_iris
>>> from pymfe.mfe import MFE
>>> data = load_iris()
>>> y = data.target
>>> X = data.data
Extract all measures from the default groups
>>> mfe = MFE()
>>> mfe.fit(X, y)
>>> ft = mfe.extract()
>>> print(ft)
Extract general, statistical and information-theoretic measures
>>> mfe = MFE(groups=["general", "statistical", "info-theory"])
>>> mfe.fit(X, y)
>>> ft = mfe.extract()
>>> print(ft)
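The relation between feature cardinality and the measure_time options can be illustrated with a few lines of plain Python. This is a sketch, not pymfe code: the helper names are hypothetical, the timing numbers are made up, and it assumes that a pairwise measure such as cor yields one value per distinct pair of numeric attributes.

```python
def pairwise_cardinality(num_numeric_attrs):
    """Distinct attribute pairs, N * (N - 1) / 2 (assumed for pairwise measures)."""
    return num_numeric_attrs * (num_numeric_attrs - 1) // 2


def reported_time(total_time, cardinality, option):
    """Time reported for one method under an 'avg*' or 'total*' option."""
    if option.startswith("avg"):
        return total_time / cardinality
    return total_time


def pad_summary_times(time_value, num_outputs):
    """Time goes in the first entry; the remaining entries are filled with 0."""
    return [time_value] + [0.0] * (num_outputs - 1)


n_pairs = pairwise_cardinality(4)                # four numeric attributes, as in iris
avg_time = reported_time(3.0, n_pairs, "avg")    # made-up total of 3.0 time units
tot_time = reported_time(3.0, n_pairs, "total")
hist_times = pad_summary_times(avg_time, 3)      # a summary returning three values
```

With four numeric attributes there are six pairs, so the avg option reports one sixth of the total time, and a three-valued summary carries that time only in its first entry.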
Methods
__init__([groups, features, summary, ...]): Provides easy access for metafeature extraction from datasets.
extract([verbose, enable_parallel, ...]): Extracts metafeatures from the previously fitted dataset.
extract_from_model(model[, arguments_fit, ...]): Extract model-based metafeatures from given model.
extract_metafeature_names([supervised]): Extract the pre-configured meta-feature names.
extract_with_confidence([sample_num, ...]): Extract metafeatures with confidence intervals.
fit(X[, y, transform_num, transform_cat, ...]): Fits dataset into an MFE model.
metafeature_description([groups, ...]): Print a table with groups, metafeatures and description.
parse_by_group(groups, extracted_results): Parse the result of extract for given metafeature groups.
valid_groups(): Return a tuple of valid metafeature groups.
valid_metafeatures([groups]): Return a tuple with all metafeatures related to given groups.
valid_summary(): Return a tuple of valid summary functions.
Attributes
groups_alias
- extract(verbose: int = 0, enable_parallel: bool = False, suppress_warnings: bool = False, out_type: ~typing.Any = <class 'tuple'>, **kwargs) Union[Tuple[List, ...], Dict[str, List], DataFrame] [source]
Extracts metafeatures from the previously fitted dataset.
- Parameters
- verbose
int, optional
Defines the verbosity level related to the metafeature extraction. If 1, show just the current progress, without line breaks. If 2 or higher, print all messages related to the metafeature extraction process.
Note that warning messages are not affected by this option (see the suppress_warnings argument below).
- enable_parallel
bool, optional
If True, then the meta-feature extraction is done with multiple processes. Currently, this argument has no effect (to be implemented).
- suppress_warnings
bool
, optional If True, do not show any warning while extracting meta-features.
- kwargs:
Used to pass custom arguments for both feature-extraction and summary methods. The expected format is the following:
{mtd_name: {arg_name: arg_value, …}, …}
In words, the keys of **kwargs should be the target methods which receive the custom arguments, and each method maps to another dictionary containing custom argument names as keys and their corresponding values as values. See the Examples subsection for a clearer explanation.
- out_type
Any, optional
If tuple, then the returned value is a tuple. If dict, then the returned value is a dictionary. If pd.DataFrame, then the returned value is a pandas.core.frame.DataFrame. Otherwise, a TypeError is raised.
- Returns
tuple (list, list)
A tuple containing two lists (if measure_time is None).
The first field is the identifiers of each summarized value in the form feature_name.summary_mtd_name (i.e., the feature extraction name concatenated with the summary method name, separated by a dot).
The second field is the summarized values.
Both lists have a 1-1 correspondence by the index of each element (i.e., the value at index i in the second list has its identifier at the same index in the first list and vice-versa).
dict (str, list)
A dictionary containing two fields (if measure_time is None). The fields are: mtf_names, mtf_vals (if measure_time is given, then there is also mtf_time).
The first field is the identifiers of each summarized value in the form feature_name.summary_mtd_name (i.e., the feature extraction name concatenated with the summary method name, separated by a dot).
The second field is the summarized values.
The lists of both fields have a 1-1 correspondence by the index of each element (i.e., the value at index i in the second list has its identifier at the same index in the first list and vice-versa).
pandas.core.frame.DataFrame
A pandas DataFrame instance.
Each column is a summarized value. The column is identified by the name of the meta-feature in the form feature_name.summary_mtd_name (i.e., the feature extraction name concatenated with the summary method name, separated by a dot).
The rows store the summarized values (if measure_time is given, there is also a row with the time taken to calculate each value).
If measure_time is given during the model instantiation, a third list will be returned with the time spent during the calculations for the corresponding (by index) metafeature.
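The index correspondence between the returned lists can be exploited with ordinary Python tools. The sketch below is not part of pymfe: the helpers to_records and split_identifier are hypothetical, and the names and numbers in extracted are made up to mimic the tuple output format.

```python
def to_records(extracted):
    """Turn (names, values[, times]) into {name: {'value': ..., 'time': ...}}."""
    names, values, *rest = extracted
    times = rest[0] if rest else [None] * len(names)
    return {n: {"value": v, "time": t} for n, v, t in zip(names, values, times)}


def split_identifier(identifier):
    """Split 'feature_name.summary_mtd_name' at the first dot."""
    feature, _, summary = identifier.partition(".")
    return feature, summary


# Made-up output in the tuple format, with a third list of elapsed times.
extracted = (
    ["feat_a.mean", "feat_a.sd", "feat_b.mean"],
    [1.5, 0.25, 3.0],
    [0.01, 0.02, 0.03],
)
records = to_records(extracted)
without_time = to_records(extracted[:2])
```

Because zip pairs elements by position, the value and time of each identifier stay aligned as long as the lists are not reordered independently.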
- Raises
- TypeError
If calling the extract method before the fit method.
- TypeError
If calling the extract method with an invalid out_type.
Examples
Using kwargs. Option 1 to pass custom arguments to the feature extraction methods:
>>> args = {
...     'sd': {'ddof': 2},
...     '1NN': {'metric': 'minkowski', 'p': 2},
...     'leaves': {'max_depth': 4},
... }
>>> model = MFE().fit(X=data, y=labels)
>>> result = model.extract(**args)
Option 2 (note: metafeatures whose names start with a digit, such as 1NN, cannot be passed this way):
>>> model = MFE().fit(X=data, y=labels)
>>> res = model.extract(sd={'ddof': 2}, leaves={'max_depth': 4})
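The difference between the two options comes from Python's keyword-argument rules rather than from pymfe itself. The following sketch uses a hypothetical stand-in function, fake_extract, that simply returns the keyword arguments it receives.

```python
def fake_extract(**kwargs):
    """Stand-in for a method that accepts {mtd_name: {arg_name: arg_value}}."""
    return kwargs


args = {
    "sd": {"ddof": 2},
    "1NN": {"metric": "minkowski", "p": 2},
    "leaves": {"max_depth": 4},
}

# Option 1: dictionary unpacking accepts any string key, including "1NN".
via_dict = fake_extract(**args)

# Option 2: keyword syntax only works for names that are valid identifiers.
via_keywords = fake_extract(sd={"ddof": 2}, leaves={"max_depth": 4})

# "1NN" starts with a digit, so it cannot be written as a keyword argument.
can_be_keyword = {name: name.isidentifier() for name in args}
```

Dictionary unpacking is therefore the more general form, and the keyword form is a convenience for names that happen to be valid identifiers.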
- extract_from_model(model: Any, arguments_fit: Optional[Dict[str, Any]] = None, arguments_extract: Optional[Dict[str, Any]] = None, verbose: int = 0) Union[Tuple[List, ...], Dict[str, List], DataFrame] [source]
Extract model-based metafeatures from given model.
The random seed used by the new internal model is the same random seed set in the current model (if any.)
The metafeatures extracted will be all metafeatures selected originally in the current model that are also in the ‘model-based’ group.
The extracted values will be summarized also with the summary functions selected originally in this model.
- Parameters
- model
any
Pre-fitted machine learning model.
- arguments_fit
dict, optional
Custom arguments to fit the extractor model. See the fit method documentation for more information.
- arguments_extract
dict, optional
Custom arguments to extract the metafeatures. See the extract method documentation for more information.
- verbose
int, optional
Select the level of verbosity of this method. Please note that the verbosity level of each step (fit and extract) needs to be given separately using, respectively, the arguments_fit and arguments_extract arguments.
- Returns
Notes
Internally, a new MFE model is created to perform the metafeature extractions. Therefore, the current model (if any) will not be affected by this method by any means.
- extract_metafeature_names(supervised: bool = True) Tuple[str, ...] [source]
Extract the pre-configured meta-feature names.
- Parameters
- supervised
bool, optional
If True, extract the meta-feature names assuming that y (data labels) is given alongside X (independent attributes).
If there is some data fit into the MFE model, this method checks whether y was fitted or not. Therefore, setting supervised=True while fitting only X has no effect, and only unsupervised meta-feature names will be returned.
- Returns
- tuple
Tuple with meta-feature names to be extracted as values.
- extract_with_confidence(sample_num: int = 128, confidence: Union[float, List[float]] = 0.95, arguments_fit: Optional[Dict[str, Any]] = None, arguments_extract: Optional[Dict[str, Any]] = None, verbose: int = 0) Union[Tuple[List, ...], Dict[str, List], DataFrame] [source]
Extract metafeatures with confidence intervals.
To build the confidence intervals, the empirical bootstrap algorithm is used, which is as follows:
All selected metafeatures are extracted from the fitted data, M.
Then, each metafeature is extracted sample_num times from datasets resampled with bootstrap from the fitted data, M_i.
Then, the differences delta_i = M_i - M are calculated.
From the differences delta_i, the quantiles related to the given confidence levels (confidence = 1 - Type I error rate) are calculated.
The confidence intervals are centered in M and the width of each interval is given by the quantiles of the differences previously calculated.
All configurations used by this method come from the instantiation of the current model.
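The steps above can be sketched for a single scalar statistic using only the standard library. This is a textbook empirical bootstrap written under stated assumptions, not pymfe's code: the function names, the linear-interpolation quantile, the seed handling and the exact way the quantiles are subtracted from M are all choices made for this illustration.

```python
import random
import statistics


def quantile(sorted_vals, q):
    """Linear-interpolation quantile of an already sorted list, 0 <= q <= 1."""
    pos = q * (len(sorted_vals) - 1)
    low = int(pos)
    high = min(low + 1, len(sorted_vals) - 1)
    frac = pos - low
    return sorted_vals[low] * (1 - frac) + sorted_vals[high] * frac


def bootstrap_intervals(data, statistic, sample_num=128, confidence=(0.95,), seed=0):
    """Return (M, [low_1, ..., low_C, high_1, ..., high_C])."""
    rng = random.Random(seed)
    m = statistic(data)  # value M on the original data
    # Differences delta_i = M_i - M over sample_num bootstrap resamples.
    deltas = sorted(
        statistic(rng.choices(data, k=len(data))) - m
        for _ in range(sample_num)
    )
    lows, highs = [], []
    for conf in confidence:
        alpha = 1.0 - conf
        # Assumed textbook form: subtract the delta quantiles from M.
        lows.append(m - quantile(deltas, 1.0 - alpha / 2))
        highs.append(m - quantile(deltas, alpha / 2))
    return m, lows + highs  # all lower limits first, then all upper limits


data = [float(x) for x in range(1, 41)]
m, limits = bootstrap_intervals(data, statistics.fmean, confidence=(0.80, 0.95))
```

The returned limits follow the layout (lower_0.80, lower_0.95, upper_0.80, upper_0.95), so the interval for a higher confidence level encloses the one for a lower level.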
- Parameters
- sample_num
int, optional
Number of samples from the fitted data using bootstrap. Each metafeature will be extracted sample_num times.
- confidence
float or sequence of floats, optional
Confidence level of the interval. Must be in the (0.0, 1.0) range. If a sequence of confidence levels is given, a confidence interval will be extracted for each value.
- arguments_fit
dict, optional
Extra arguments for the fit method for each sampled dataset. See the .fit method documentation for more information.
- arguments_extract
dict, optional
Extra arguments for each metafeature extraction procedure. See the .extract method documentation for more information.
- verbose
int, optional
Verbosity level for this method. Please note that the verbosity level for both .fit and .extract methods performed within this method must be controlled separately using, respectively, the arguments_fit and arguments_extract parameters.
- Returns
- tuple of np.ndarray
The same return value format of the extract method, appended with the confidence intervals as a new sequence of values in the form (interval_low_1, interval_low_2, …, interval_high_(n-1), interval_high_n) for each corresponding metafeature, and with shape (metafeature_num, 2 * C), where C is the number of confidence levels given in confidence (i.e., the rows represent each metafeature and the columns each interval limit). This means that all interval lower limits are given first, and all the interval upper limits are grouped together afterwards. The order of the interval limits follows the order of the confidence levels given in confidence. For instance, if confidence=[0.80, 0.90, 0.99], then the confidence intervals will be returned in the following order (for all metafeatures): (lower_0.80, lower_0.90, lower_0.99, upper_0.80, upper_0.90, upper_0.99).
- Raises
- ValueError
If confidence is not in the (0.0, 1.0) range.
- TypeError
If no data was fit into the model previously.
Notes
The model used to fit and extract metafeatures for each sampled dataset is instantiated within this method and, therefore, this method does not affect the current model (if any) by any means.
- fit(X: Union[ndarray, List], y: Optional[Union[ndarray, List]] = None, transform_num: bool = True, transform_cat: str = 'gray', rescale: Optional[str] = None, rescale_args: Optional[Dict[str, Any]] = None, cat_cols: Optional[Union[str, Iterable[int]]] = 'auto', check_bool: bool = False, precomp_groups: Optional[str] = 'all', wildcard: str = 'all', suppress_warnings: bool = False, verbose: int = 0, **kwargs) MFE [source]
Fits dataset into an MFE model.
- Parameters
- X
List
Predictive attributes of the dataset.
- y
List, optional
Target attributes of the dataset, assuming that it is a supervised task.
- transform_num
bool, optional
If True, numeric attributes are discretized using an equal-frequency histogram technique to use alongside categorical data when extracting categoric-only metafeatures. Note that numeric-only features still use the original numeric values, not the discretized ones. If False, then numeric attributes are ignored for categorical-only meta-features.
- transform_cat
str, optional
Transform categorical data to use alongside numerical data while extracting numeric-only metafeatures. Note that categoric-only features still use the original categoric values, and not the binarized ones.
If one-hot, categorical attributes are binarized using one-hot encoding with k-1 features for a categorical attribute with k distinct values. This algorithm works as follows:
- For each categorical attribute C:
Encode C with traditional one-hot encoding.
Arbitrarily drop the first column of the encoding result.
The unique value previously represented by the k-length vector [1, 0, …, 0] will now be represented by the (k-1)-length vector [0, 0, …, 0]. Note that all other unique values will also now be represented by (k-1)-length vectors (the first 0 is dropped out).
This algorithm avoids the dummy variable trap, which may raise multicollinearity problems due to the unnecessary extra feature. Note that the decision of dropping the very first encoded feature is arbitrary, as any other encoded feature could have been dropped instead.
If gray, categorical attributes are binarized using a model matrix. The formula used for this transformation is just the union (+) of all categoric attributes using the formula language from the patsy package API, removing the intercept terms: ~ 0 + A_1 + … + A_n, where n is the number of features and A_i is the ith categoric attribute, 1 <= i <= n.
If one-hot-full, categorical attributes are binarized using one-hot encoding with k features for a categorical attribute with k distinct values. This option is not recommended due to the dummy variable trap, which may cause multicollinearity problems due to an extra unnecessary variable (a label can be encoded using the null vector [0, …, 0]^T).
If None, then categorical attributes are not transformed.
- rescale
str, optional
If None, the model keeps all numeric data with its original values. Otherwise, this argument can assume one of the string options below to rescale all numeric values:
standard: set numeric data to zero mean, unit variance. Also known as z-score normalization. Check the documentation of sklearn.preprocessing.StandardScaler for in-depth information.
min-max: set numeric data to interval [a, b], a < b. It is possible to define values for a and b using the argument rescale_args. The default values are a = 0.0 and b = 1.0. Check the sklearn.preprocessing.MinMaxScaler documentation for more information.
robust: rescale data using statistics robust to the presence of outliers. For in-depth information, check the documentation of sklearn.preprocessing.RobustScaler.
- rescale_args
dict, optional
Dictionary containing parameters for rescaling data. Used only if the rescale argument is not None. The dictionary keys are the parameter names as strings, and the values are the corresponding parameter values.
- cat_cols
List of int or str, optional
Categorical columns of the dataset. If given None or an empty sequence, assume all columns are numeric. If given the value auto, then an attempt of automatic detection is performed while fitting the dataset.
- check_bool
bool, optional
If cat_cols is auto and this flag is True, assume that every column with precisely two different values is also a categorical (boolean) column, independently of its data type. Otherwise, these columns may be considered numeric depending on their data type.
- missing_data
str, optional
Defines the strategy to handle missing values in data. Not implemented yet.
- precomp_groups
str, optional
Defines for which metafeature groups common values should be cached to share among various meta-feature extraction related methods (e.g. classes, or covariance). This argument may speed up meta-feature extraction but also consumes more memory, so it may not be suitable for huge datasets.
- wildcard
str, optional
Value used as select all for precomp_groups.
- suppress_warnings
bool, optional
If True, ignore all warnings invoked while fitting the dataset.
- verbose
int, optional
Defines the level of verbosity for the fit method. If 1, then print a progress bar related to the precomputations. If 2 or higher, then log every step of the fitted data transformations and the precomputation steps.
- **kwargs:
Extra custom arguments to the precomputation methods. Keep in mind that those values may even replace internal custom parameters, if the name matches. Use this resource carefully.
Hint: you can check which are the internal custom arguments by verifying the values in the ._custom_args_ft attribute after the model is fitted.
This argument format is {'parameter_name': parameter_value}.
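The one-hot option of transform_cat, which keeps k-1 binary features per categorical attribute, can be sketched in plain Python. This is an illustration only, not pymfe's implementation: the function name is hypothetical, and sorting the categories to decide which column counts as the first one is an assumption made here.

```python
def one_hot_drop_first(column):
    """Encode one categorical column with k - 1 binary features."""
    categories = sorted(set(column))  # k distinct values in a fixed order
    kept = categories[1:]             # drop the first encoded column
    return [[1 if value == cat else 0 for cat in kept] for value in column]


# Three distinct values, so each row is encoded with two binary features.
encoded = one_hot_drop_first(["red", "green", "blue", "green"])
```

In this example the dropped category (blue, the first in sorted order) is represented by the all-zero vector, which is what removes the redundant column.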
- Returns
- self
- Raises
- ValueError
If the number of rows of X and the length of y do not match.
- TypeError
If X or y (or both) is neither a list nor a np.ndarray object.
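For intuition about the standard and min-max options of the rescale argument, the sketch below reproduces both transformations for a single numeric column using only the standard library. The documentation above points to the scikit-learn scalers for the actual behavior; the function names here are hypothetical, and constant columns (which would cause a division by zero) are not handled.

```python
import statistics


def min_max(values, a=0.0, b=1.0):
    """Map values linearly onto the interval [a, b], with a < b."""
    lo, hi = min(values), max(values)
    return [a + (v - lo) * (b - a) / (hi - lo) for v in values]


def standard(values):
    """Z-score normalization: zero mean and unit (population) variance."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return [(v - mean) / sd for v in values]


column = [2.0, 4.0, 6.0]
unit_range = min_max(column)                  # default interval [0, 1]
custom_range = min_max(column, a=-1.0, b=1.0)
z_scores = standard(column)
```

The a and b arguments of min_max play the role that the documentation assigns to rescale_args for the min-max option.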
- classmethod metafeature_description(groups: Optional[Union[Iterable[str], str]] = None, sort_by_group: bool = False, sort_by_mtf: bool = False, print_table: bool = True, include_references: bool = False) Optional[Tuple[List[List[str]], str]] [source]
Print a table with groups, metafeatures and description.
- Parameters
- groups
sequence of str or str, optional
Can be a string naming a specific metafeature group (see the valid_groups method for more information) or a sequence of metafeature group names. It can also be None, in which case all available metafeature names will be returned.
- sort_by_group
bool
Sort table by meta-feature group name.
- sort_by_mtf
bool
Sort table by meta-feature name.
- print_table
bool
If True, a table will be printed with the description; otherwise, the table will be returned.
- include_references
bool
If True, include a column with the article reference.
- Returns
- list of list
A table with the metafeature descriptions or None.
Notes
The returned metafeatures are not related to the groups or to the metafeatures fitted in the model instantiation. All the returned metafeatures are available in the Pymfe package. Check the MFE documentation for deeper information.
- classmethod parse_by_group(groups: Union[List[str], str], extracted_results: Tuple[List, ...]) Tuple[List, ...] [source]
Parse the result of extract for given metafeature groups.
Can be used to easily separate the results of each metafeature group.
- Parameters
- groups
List of str or str
Metafeature group names which the results should be parsed relative to. Use the valid_groups method to check the available metafeature groups.
- extracted_results
tuple of List
Output of the extract method. Should contain all output lists (metafeature names, values and elapsed time for extraction, if present).
- Returns
Notes
The given groups are not related to the groups fitted in the model instantiation. Check the valid_groups method to get a list of all available groups from the Pymfe package. Check the MFE documentation for deeper information about all these groups.
- classmethod valid_groups() Tuple[str, ...] [source]
Return a tuple of valid metafeature groups.
Notes
The returned groups are not related to the groups fitted in the model instantiation. The returned groups are all available metafeature groups in the Pymfe package. Check the MFE documentation for deeper information.
- classmethod valid_metafeatures(groups: Optional[Union[Iterable[str], str]] = None) Tuple[str, ...] [source]
Return a tuple with all metafeatures related to given groups.
- Parameters
- Returns
Notes
The returned metafeatures are not related to the groups or to the metafeatures fitted in the model instantiation. All the returned metafeatures are available in the Pymfe package. Check the MFE documentation for deeper information.
- classmethod valid_summary() Tuple[str, ...] [source]
Return a tuple of valid summary functions.
Notes
The returned summaries are not related to the summaries fitted in the model instantiation. The returned summaries are all available in the Pymfe package. Check the documentation of MFE for deeper information.