pymfe.mfe.MFE

class pymfe.mfe.MFE(groups: Union[str, Iterable[str]] = 'default', features: Union[str, Iterable[str]] = 'all', summary: Union[str, Iterable[str]] = ('mean', 'sd'), measure_time: Optional[str] = None, wildcard: str = 'all', score: str = 'accuracy', num_cv_folds: int = 10, shuffle_cv_folds: bool = False, lm_sample_frac: float = 1.0, hypparam_model_dt: Optional[Dict[str, Any]] = None, suppress_warnings: bool = False, random_state: Optional[int] = None)[source]

Core class for metafeature extraction.

Attributes
X : List

Independent attributes of the dataset.

y : List

Target attributes of the dataset.

groups : tuple of str

Tuple object containing fitted meta-feature groups loaded in the model at instantiation.

features : tuple of str

Contains the names of the meta-feature extraction methods available, based on the meta-feature groups and features selected at instantiation.

summary : tuple of str

Tuple with the summary function names used to summarize the extracted features.

__init__(groups: Union[str, Iterable[str]] = 'default', features: Union[str, Iterable[str]] = 'all', summary: Union[str, Iterable[str]] = ('mean', 'sd'), measure_time: Optional[str] = None, wildcard: str = 'all', score: str = 'accuracy', num_cv_folds: int = 10, shuffle_cv_folds: bool = False, lm_sample_frac: float = 1.0, hypparam_model_dt: Optional[Dict[str, Any]] = None, suppress_warnings: bool = False, random_state: Optional[int] = None) None[source]

Provides easy access for metafeature extraction from datasets.

It is expected that the user first calls the fit method after instantiation and then extract to effectively extract the selected metafeatures. Check reference [1] for more information.

Parameters
groups : Iterable of str or str

A collection or a single metafeature group name representing the desired group of metafeatures for extraction. Use the method valid_groups to get a list of all available groups.

Setting with all enables all available groups.

Setting with default enables general, info-theory, statistical, model-based and landmarking. It is the default value.

The value provided by the argument wildcard can be used to select all metafeature groups rapidly.

features : Iterable of str or str, optional

A collection or a single metafeature name desired for extraction. Keep in mind that the extraction only gathers features also in the selected groups. Check this class features attribute to get a list of available metafeatures from the selected groups, or use the method valid_metafeatures to get a list of all available metafeatures filtered by group. Alternatively, you can use the method metafeature_description to get or print a table with all metafeatures and their respective groups and descriptions.

The value provided by the argument wildcard can be used to select all features from all selected groups rapidly.

summary : Iterable of str or str, optional

A collection or a single summary function to summarize a group of metafeature measures into a fixed-length set of values, typically a single value. The values must be one of the following:

  1. mean: Average of the values.

  2. sd: Standard deviation of the values.

  3. count: Computes the cardinality of the measure. Suitable for variable cardinality.

  4. histogram: Describes the distribution of the measured values. Suitable for high cardinality.

  5. iq_range: Computes the interquartile range of the measured values.

  6. kurtosis: Describes the shape of the measure values distribution.

  7. max: Results in the maximum value of the measure.

  8. median: Results in the central value of the measure.

  9. min: Results in the minimum value of the measure.

  10. quantiles: Results in the minimum, first quartile, median, third quartile and maximum of the measured values.

  11. range: Computes the range of the measured values.

  12. skewness: Describes the shape of the measure values distribution in terms of symmetry.

You can concatenate nan with the desired summary function name to use an alternative version of the same summary which ignores nan values. For instance, nanmean is the mean summary function which ignores all nan values, while naniq_range is the interquartile range calculated only with valid (non-nan) values.

If more than one summary function is selected, then all multivalued extracted metafeatures are summarized with each summary function.

The particular value provided by the argument wildcard can be used to select all summary functions rapidly.

Use the method valid_summary to get a list of all available summary functions.
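
For instance, a hedged sketch selecting nan-aware and multivalued summaries at instantiation (the summary names are taken from the list above; X and y as in the Examples section below):

>>> mfe = MFE(summary=["nanmean", "histogram"])
>>> mfe.fit(X, y)
>>> ft = mfe.extract()
>>> print(ft)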

measure_time : str, optional

Options for measuring the time elapsed during metafeature extraction. If this argument is None, elapsed time is not measured. Otherwise, this argument must be a str valued as one of the options below:

  1. avg: average time for each metafeature (total time divided by the feature cardinality, i.e., number of features extracted by a single feature-extraction related method), without summarization time.

  2. avg_summ: average time for each metafeature (total time of extraction divided by feature cardinality) including required time for summarization.

  3. total: total time for each metafeature, without summarization time.

  4. total_summ: total time for each metafeature including the required time for summarization.

The cardinality of the feature is the number of values extracted by a single calculation method.

For example, the mean feature has cardinality equal to the number of numeric features in the dataset, while cor (from correlation) has cardinality equal to N * (N - 1) / 2, where N is the number of numeric features in the dataset.

The cardinality is used to divide the total execution time of that method if an option starting with avg is selected.

If a summary method has cardinality higher than one (i.e., it returns more than one value after summarization and thus creates more than one entry in the result lists), such as the histogram summary method, then the corresponding time of this summary is inserted only in the first corresponding element of the time list. The remaining entries are filled with 0 to keep the returned lists the same size and their indices in correspondence.
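
A minimal sketch of enabling time measurement (X and y as in the Examples section below); with measure_time set, extract returns a third list with the elapsed times:

>>> mfe = MFE(measure_time="avg")
>>> mfe.fit(X, y)
>>> names, vals, times = mfe.extract()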

wildcard : str, optional

Value used as select all for groups, features and summary arguments.

score : str, optional

Score metric used to extract landmarking metafeatures.

num_cv_folds : int, optional

Number of folds of the Stratified K-Fold cross-validation used to extract the landmarking metafeatures.

shuffle_cv_folds : bool, optional

If True, the fitted data will be shuffled before being split in the Stratified K-Fold cross-validation of the landmarking features. The shuffle random seed is the random_state argument.

lm_sample_frac : float, optional

Sample proportion used to produce the landmarking metafeatures. This argument must be in the interval [0.5, 1.0].

hypparam_model_dt : dict, optional

Dictionary providing extra hyperparameters for the Decision Tree algorithm used to extract the model-based metafeatures. The class used to fit the model is sklearn.tree.DecisionTreeClassifier (from the sklearn library). Using this argument, it is possible to provide extra arguments for the DecisionTreeClassifier initialization (e.g., max_depth and min_samples_split). To use this argument, provide the DecisionTreeClassifier init argument names as the dictionary keys and the corresponding custom values as the dictionary values. Example: {"min_samples_split": 10, "criterion": "entropy"}
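
For instance, a minimal sketch passing the example dictionary above at instantiation (X and y as in the Examples section below):

>>> dt_params = {"min_samples_split": 10, "criterion": "entropy"}
>>> mfe = MFE(groups=["model-based"], hypparam_model_dt=dt_params)
>>> mfe.fit(X, y)
>>> ft = mfe.extract()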

suppress_warnings : bool, optional

If True, then ignore all warnings invoked at the instantiation time.

random_state : int, optional

Random seed used to control random events. Keeps the experiments reproducible.

Notes

[1] Rivolli et al. "Towards Reproducible Empirical Research in Meta-Learning". URL: https://arxiv.org/abs/1808.10406

Examples

Load a dataset

>>> from sklearn.datasets import load_iris
>>> from pymfe.mfe import MFE
>>> data = load_iris()
>>> y = data.target
>>> X = data.data

Extract all measures

>>> mfe = MFE()
>>> mfe.fit(X, y)
>>> ft = mfe.extract()
>>> print(ft)

Extract general, statistical and information-theoretic measures

>>> mfe = MFE(groups=["general", "statistical", "info-theory"])
>>> mfe.fit(X, y)
>>> ft = mfe.extract()
>>> print(ft)

Methods

__init__([groups, features, summary, ...])

Provides easy access for metafeature extraction from datasets.

extract([verbose, enable_parallel, ...])

Extracts metafeatures from the previously fitted dataset.

extract_from_model(model[, arguments_fit, ...])

Extract model-based metafeatures from given model.

extract_metafeature_names([supervised])

Extract the pre-configured meta-feature names.

extract_with_confidence([sample_num, ...])

Extract metafeatures with confidence intervals.

fit(X[, y, transform_num, transform_cat, ...])

Fits dataset into an MFE model.

metafeature_description([groups, ...])

Print a table with groups, metafeatures and description.

parse_by_group(groups, extracted_results)

Parse the result of extract for given metafeature groups.

valid_groups()

Return a tuple of valid metafeature groups.

valid_metafeatures([groups])

Return a tuple with all metafeatures related to given groups.

valid_summary()

Return a tuple of valid summary functions.

Attributes

groups_alias

extract(verbose: int = 0, enable_parallel: bool = False, suppress_warnings: bool = False, out_type: ~typing.Any = <class 'tuple'>, **kwargs) Union[Tuple[List, ...], Dict[str, List], DataFrame][source]

Extracts metafeatures from the previously fitted dataset.

Parameters
verbose : int, optional

Defines the verbosity level related to the metafeature extraction. If == 1, show just the current progress, without line breaks. If >= 2, print all messages related to the metafeature extraction process.

Note that warning messages are not affected by this option (see suppress_warnings argument below).

enable_parallel : bool, optional

If True, the meta-feature extraction is done with multiple processes. Currently, this argument has no effect (to be implemented).

suppress_warnings : bool, optional

If True, do not show any warning while extracting meta-features.

kwargs:

Used to pass custom arguments for both feature-extraction and summary methods. The expected format is the following:

{mtd_name: {arg_name: arg_value, …}, …}

In other words, the keys of **kwargs should be the names of the target methods which receive the custom arguments, and each value is another dictionary containing that method's argument names as keys and the corresponding custom values as values. See the Examples subsection for a clearer explanation.

out_type : Any, optional

If tuple, then the returned value is a tuple. If dict, then the returned value is a dictionary. If pd.DataFrame, then the returned value is a pandas.core.frame.DataFrame. Otherwise, a TypeError is raised.
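
For instance, a minimal sketch requesting a DataFrame output (X and y as in the class-level Examples; pandas is assumed to be installed):

>>> import pandas as pd
>>> model = MFE().fit(X, y)
>>> df = model.extract(out_type=pd.DataFrame)
>>> print(df.shape)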

Returns
tuple (list, list)

A tuple containing two lists (if measure_time is None).

The first field is the identifiers of each summarized value in the form feature_name.summary_mtd_name (i.e., the feature extraction name concatenated by the summary method name, separated by a dot).

The second field is the summarized values.

Both lists have a 1-1 correspondence by the index of each element (i.e., the value at index i in the second list has its identifier at the same index in the first list and vice-versa).

dict (str, list)

A dictionary containing two fields (if measure_time is None). The fields are mtf_names and mtf_vals (if measure_time is set, there is also mtf_time).

The first field is the identifiers of each summarized value in the form feature_name.summary_mtd_name (i.e., the feature extraction name concatenated by the summary method name, separated by a dot).

The second field is the summarized values.

Both lists of each field have a 1-1 correspondence by the index of each element (i.e., the value at index i in the second list has its identifier at the same index in the first list and vice-versa).

pandas.core.frame.DataFrame

A pandas DataFrame instance.

Each column is a summarized value. The column is identified by the name of the meta-feature in the form feature_name.summary_mtd_name (i.e., the feature extraction name concatenated with the summary method name, separated by a dot).

The rows store the summarized values (if measure_time, there is a row with the time taken to calculate each value).

If measure_time is given during the model instantiation, a third list will be returned with the time spent during the calculations for the corresponding (by index) metafeature.

Raises
TypeError

If calling extract method before fit method.

TypeError

If calling extract method with invalid out_type.

Examples

Using kwargs. Option 1 to pass feature-extraction custom arguments:

>>> args = {
...     'sd': {'ddof': 2},
...     '1NN': {'metric': 'minkowski', 'p': 2},
...     'leaves': {'max_depth': 4},
... }
>>> model = MFE().fit(X=data, y=labels)
>>> result = model.extract(**args)

Option 2 (note: metafeatures with names starting with numbers are not allowed!):

>>> model = MFE().fit(X=data, y=labels)
>>> res = model.extract(sd={'ddof': 2}, leaves={'max_depth': 4})
extract_from_model(model: Any, arguments_fit: Optional[Dict[str, Any]] = None, arguments_extract: Optional[Dict[str, Any]] = None, verbose: int = 0) Union[Tuple[List, ...], Dict[str, List], DataFrame][source]

Extract model-based metafeatures from given model.

The random seed used by the new internal model is the same random seed set in the current model (if any).

The metafeatures extracted will be all metafeatures selected originally in the current model that are also in the ‘model-based’ group.

The extracted values will be summarized also with the summary functions selected originally in this model.

Parameters
model : any

Pre-fitted machine learning model.

arguments_fit : dict, optional

Custom arguments to fit the extractor model. See .fit method documentation for more information.

arguments_extract : dict, optional

Custom arguments to extract the metafeatures. See .extract method documentation for more information.

verbose : int, optional

Select the level of verbosity of this method. Please note that the verbosity level of each step (fit and extract) needs to be given separately using, respectively, the arguments_fit and arguments_extract arguments.

Returns
tuple (list, list), dict (str, any) or pandas.core.frame.DataFrame

See .extract method return value for more information.

Notes

Internally, a new MFE model is created to perform the metafeature extractions. Therefore, the current model (if any) will not be affected by this method by any means.
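
Examples

A hedged sketch (assuming a pre-fitted sklearn.tree.DecisionTreeClassifier, the model class used for the model-based group, and X and y as in the class-level Examples):

>>> from sklearn.tree import DecisionTreeClassifier
>>> dt_model = DecisionTreeClassifier(random_state=0).fit(X, y)
>>> extractor = MFE()
>>> ft = extractor.extract_from_model(dt_model)
>>> print(ft)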

extract_metafeature_names(supervised: bool = True) Tuple[str, ...][source]

Extract the pre-configured meta-feature names.

Parameters
supervised : bool, optional

If True, extract the meta-feature names assuming that y (data labels) is given alongside X (independent attributes).

If there is some data fitted into the MFE model, this method checks whether y was fitted or not. Therefore, setting supervised=True while fitting only X has no effect, and only unsupervised meta-feature names will be returned.

Returns
tuple

Tuple with the meta-feature names to be extracted.
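
Examples

A minimal sketch with the default configuration (the returned names depend on the selected groups, features and summary functions):

>>> mfe = MFE()
>>> names = mfe.extract_metafeature_names(supervised=True)
>>> print(len(names))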

extract_with_confidence(sample_num: int = 128, confidence: Union[float, List[float]] = 0.95, arguments_fit: Optional[Dict[str, Any]] = None, arguments_extract: Optional[Dict[str, Any]] = None, verbose: int = 0) Union[Tuple[List, ...], Dict[str, List], DataFrame][source]

Extract metafeatures with confidence intervals.

To build the confidence intervals, the empirical bootstrap algorithm is used, which is as follows:

  1. All selected metafeatures are extracted from the fitted data, M.

  2. Then, each metafeature is extracted sample_num times from a resampled dataset using bootstrap from the fitted data, M_i.

  3. Then, the differences delta_i = M_i - M are calculated.

  4. From the differences delta_i, the quantiles related to the given confidence levels (confidence = 1 - Type I error rate) are calculated.

  5. The confidence intervals are centered in M and the width of interval is given by the quantiles of the differences previously calculated.

All configurations used by this method come from the configuration set while instantiating the current model.

Parameters
sample_num : int, optional

Number of samples from the fitted data using bootstrap. Each metafeature will be extracted sample_num times.

confidence : float or sequence of floats, optional

Confidence level of the interval. Must be in (0.0, 1.0) range. If a sequence of confidence levels is given, a confidence interval will be extracted for each value.

arguments_fit : dict, optional

Extra arguments for the fit method for each sampled dataset. See .fit method documentation for more information.

arguments_extract : dict, optional

Extra arguments for each metafeature extraction procedure. See .extract method documentation for more information.

verbose : int, optional

Verbosity level for this method. Please note that the verbosity level for both .fit and .extract methods performed within this method must be controlled separately using, respectively, arguments_fit and arguments_extract parameters.

Returns
tuple of np.ndarray

The same return value format of the extract method, appended with the confidence intervals as a new sequence of values in the form (interval_low_1, interval_low_2, …, interval_high_(n-1), interval_high_n) for each corresponding metafeature, with shape (metafeature_num, 2 * C), where C is the number of confidence levels given in confidence (i.e., the rows represent each metafeature and the columns each interval limit). This means that all interval lower limits are given first, and all interval upper limits are grouped together afterwards. The sequence order of the interval limits follows the sequence order of the confidence levels given in confidence. For instance, if confidence=[0.80, 0.90, 0.99], then the confidence intervals will be returned in the following order (for all metafeatures): (lower_0.80, lower_0.90, lower_0.99, upper_0.80, upper_0.90, upper_0.99).

Raises
ValueError

If confidence is not in (0.0, 1.0) range.

TypeError

If no data was fit into the model previously.

Notes

The model used to fit and extract metafeatures for each sampled dataset is instantiated within this method and, therefore, this method does not affect the current model (if any) by any means.
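
Examples

A minimal sketch (X and y as in the class-level Examples; the small sample_num is an assumption made here only to reduce runtime):

>>> mfe = MFE(groups=["general"])
>>> mfe.fit(X, y)
>>> res = mfe.extract_with_confidence(sample_num=32, confidence=0.90)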

fit(X: Union[ndarray, List], y: Optional[Union[ndarray, List]] = None, transform_num: bool = True, transform_cat: str = 'gray', rescale: Optional[str] = None, rescale_args: Optional[Dict[str, Any]] = None, cat_cols: Optional[Union[str, Iterable[int]]] = 'auto', check_bool: bool = False, precomp_groups: Optional[str] = 'all', wildcard: str = 'all', suppress_warnings: bool = False, verbose: int = 0, **kwargs) MFE[source]

Fits dataset into an MFE model.

Parameters
X : List

Predictive attributes of the dataset.

y : List, optional

Target attributes of the dataset, assuming that it is a supervised task.

transform_num : bool, optional

If True, numeric attributes are discretized using the equal-frequency histogram technique to be used alongside categorical data when extracting categoric-only metafeatures. Note that numeric-only features still use the original numeric values, not the discretized ones. If False, numeric attributes are ignored for categorical-only meta-features.

transform_cat : str, optional

Transform categorical data to use alongside numerical data while extracting numeric-only metafeatures. Note that categoric-only features still use the original categoric values, not the binarized ones.

If one-hot, categorical attributes are binarized using one-hot encoding with k-1 features for a categorical attribute with k distinct values. This algorithm works as follows:

For each categorical attribute C:
  1. Encode C with traditional one-hot encoding.

  2. Arbitrarily drop the first column of the encoding result.

The unique value previously represented by the k-length vector [1, 0, …, 0] will now be represented by the (k-1)-length vector [0, 0, …, 0]. Note that all other unique values will also now be represented by (k-1)-length vectors (the first 0 is dropped out).

This algorithm avoids the dummy variable trap, which may raise multicollinearity problems due to the unnecessary extra feature. Note that the decision of dropping the very first encoded feature is arbitrary, as any other encoded feature could have been dropped instead.
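
As an illustration of this k-1 encoding idea (a hedged sketch using sklearn's one-hot encoder, not pymfe's internal code path):

>>> from sklearn.preprocessing import OneHotEncoder
>>> enc = OneHotEncoder(drop="first")  # k-1 columns per categorical attribute
>>> enc.fit_transform([["a"], ["b"], ["c"], ["a"]]).toarray()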

If gray, categorical attributes are binarized using a model matrix. The formula used for this transformation is just the union (+) of all categoric attributes using formula language from patsy package API, removing the intercept terms: ~ 0 + A_1 + … + A_n, where n is the number of features and A_i is the ith categoric attribute, 1 <= i <= n.

If one-hot-full, categorical attributes are binarized using one-hot encoding with k features for a categorical attribute with k distinct values. This option is not recommended due to the dummy variable trap, which may cause multicollinearity problems due to an extra unnecessary variable (a label can be encoded using the null vector [0, …, 0]^T).

If None, then categorical attributes are not transformed.

rescale : str, optional

If NoneType, the model keeps all numeric data with its original values. Otherwise, this argument can assume one of the string options below to rescale all numeric values:

  1. standard: set numeric data to zero mean, unit variance. Also known as z-score normalization. Check the documentation of sklearn.preprocessing.StandardScaler for in-depth information.

  2. min-max: set numeric data to the interval [a, b], a < b. It is possible to define the values of a and b using the argument rescale_args. The default values are a = 0.0 and b = 1.0. Check the sklearn.preprocessing.MinMaxScaler documentation for more information.

  3. robust: rescale data using statistics robust to the presence of outliers. For in-depth information, check documentation of sklearn.preprocessing.RobustScaler.

rescale_args : dict, optional

Dictionary containing parameters for rescaling data. Used only if the rescale argument is not None. The dictionary keys are the parameter names (as strings) and the dictionary values are the corresponding parameter values.
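
For instance, a minimal sketch rescaling the numeric data before extraction (X and y as in the class-level Examples; the standard option is taken from the list above):

>>> mfe = MFE()
>>> mfe.fit(X, y, rescale="standard")
>>> ft = mfe.extract()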

cat_cols : List of int or str, optional

Categorical columns of dataset. If given NoneType or an empty sequence, assume all columns as numeric. If given value auto, then an attempt of automatic detection is performed while fitting the dataset.

check_bool : bool, optional

If cat_cols is auto and this flag is True, assume that all columns with precisely two distinct values are also categorical (boolean) columns, independently of their data type. Otherwise, these columns may be considered numeric depending on their data type.

missing_data : str, optional

Defines the strategy to handle missing values in data. Still not implemented.

precomp_groups : str, optional

Defines which metafeature groups common values should be cached to share among various meta-feature extraction related methods (e.g. classes, or covariance). This argument may speed up meta-feature extraction but also consumes more memory, so it may not be suitable for huge datasets.

wildcard : str, optional

Value used as select all for precomp_groups.

suppress_warnings : bool, optional

If True, ignore all warnings invoked while fitting the dataset.

verbose : int, optional

Defines the level of verbosity for the fit method. If 1, then print a progress bar related to the precomputations. If 2 or higher, then log every step of the fitted data transformations and the precomputation steps.

**kwargs:

Extra custom arguments to the precomputation methods. Keep in mind that those values may even replace internal custom parameters, if the name matches. Use this resource carefully.

Hint: you can check which are the internal custom arguments by verifying the values in the '._custom_args_ft' attribute after the model is fitted.

This argument format is {'parameter_name': parameter_value}.

Returns
self
Raises
ValueError

If the number of rows of X and the length of y do not match.

TypeError

If X or y (or both) is neither a list nor a np.ndarray object.

classmethod metafeature_description(groups: Optional[Union[Iterable[str], str]] = None, sort_by_group: bool = False, sort_by_mtf: bool = False, print_table: bool = True, include_references: bool = False) Optional[Tuple[List[List[str]], str]][source]

Print a table with groups, metafeatures and description.

Parameters
groups : sequence of str or str, optional

Can be a string, in which case its value is the name of a specific metafeature group (see the valid_groups method for more information), or a sequence of metafeature group names. It can also be None, in which case all available metafeature names will be returned.

sort_by_group : bool

Sort table by meta-feature group name.

sort_by_mtf : bool

Sort table by meta-feature name.

print_table : bool

If True, a table with the descriptions will be printed; otherwise the table will be returned.

include_references : bool

If True, include a column with the article references.

Returns
list of list

A table with the metafeature descriptions, or None if print_table is True.

Notes

The returned metafeatures are not related to the groups or to the metafeatures fitted in the model instantiation. All the returned metafeatures are available in the Pymfe package. Check the MFE documentation for deeper information.
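
Examples

A minimal sketch retrieving the description table instead of printing it (the group name general is taken from the documented groups):

>>> descriptions = MFE.metafeature_description(groups="general", print_table=False)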

classmethod parse_by_group(groups: Union[List[str], str], extracted_results: Tuple[List, ...]) Tuple[List, ...][source]

Parse the result of extract for given metafeature groups.

Can be used to easily separate the results of each metafeature group.

Parameters
groups : List of str or str

Metafeature group names which the results should be parsed relative to. Use valid_groups method to check the available metafeature groups.

extracted_results : tuple of List

Output of the extract method. Should contain all output lists (metafeature names, values and elapsed extraction time, if present).

Returns
tuple of list

Slices of the lists in extracted_results, selected based on the given groups.

Notes

The given groups are not related to the groups fitted in the model instantiation. Check the valid_groups method to get a list of all available groups in the Pymfe package. Check the MFE documentation for deeper information about all these groups.
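
Examples

A minimal sketch separating the results of a single group after extraction (X and y as in the class-level Examples; without measure_time, extract returns two lists):

>>> mfe = MFE(groups=["general", "statistical"])
>>> mfe.fit(X, y)
>>> res = mfe.extract()
>>> general_names, general_vals = MFE.parse_by_group("general", res)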

classmethod valid_groups() Tuple[str, ...][source]

Return a tuple of valid metafeature groups.

Notes

The returned groups are not related to the groups fitted in the model instantiation. The returned groups are all available metafeature groups in the Pymfe package. Check the MFE documentation for deeper information.

classmethod valid_metafeatures(groups: Optional[Union[Iterable[str], str]] = None) Tuple[str, ...][source]

Return a tuple with all metafeatures related to given groups.

Parameters
groups : List of str or str, optional

Can be a string, in which case its value is the name of a specific metafeature group (see the valid_groups method for more information), or a sequence of metafeature group names. It can also be None, in which case all available metafeature names will be returned.

Returns
tuple of str

Tuple with all available metafeature names of the given groups.

Notes

The returned metafeatures are not related to the groups or to the metafeatures fitted in the model instantiation. All the returned metafeatures are available in the Pymfe package. Check the MFE documentation for deeper information.

classmethod valid_summary() Tuple[str, ...][source]

Return a tuple of valid summary functions.

Notes

The returned summaries are not related to the summaries fitted in the model instantiation. The returned summaries are all available in the Pymfe package. Check the documentation of MFE for deeper information.
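
Examples

A minimal sketch listing the available groups, metafeatures and summary functions (all three are classmethods, so no fitted data is needed):

>>> print(MFE.valid_groups())
>>> print(MFE.valid_metafeatures(groups="general"))
>>> print(MFE.valid_summary())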