Using Pandas, CSV and ARFF files

This example shows how to pass data held in pandas objects, CSV files, and ARFF files to PyMFE.

# Necessary imports
import arff  # provided by the liac-arff package
import numpy as np
import pandas as pd
from pymfe.mfe import MFE

Pandas

Generating a synthetic dataset

np.random.seed(42)

sample_size = 150
numeric = pd.DataFrame({
    'num1': np.random.randint(0, 100, size=sample_size),
    'num2': np.random.randint(0, 100, size=sample_size)
})
categoric = pd.DataFrame({
    # np.repeat needs integer repeat counts, hence floor division
    'cat1': np.repeat(('cat1-1', 'cat1-2'), sample_size // 2),
    'cat2': np.repeat(('cat2-1', 'cat2-2', 'cat2-3'), sample_size // 3)
})
X = numeric.join(categoric)
y = pd.Series(np.repeat(['C1', 'C2'], sample_size // 2))

Exploring characteristics of the data

print("X shape --> ", X.shape)
print("y shape --> ", y.shape)
print("classes --> ", np.unique(y.values))
print("X dtypes --> \n", X.dtypes)
print("y dtypes --> ", y.dtypes)

X shape -->  (150, 4)
y shape -->  (150,)
classes -->  ['C1' 'C2']
X dtypes -->
 num1     int64
num2     int64
cat1    object
cat2    object
dtype: object
y dtypes -->  object

To extract meta-features, pass X and y to MFE as sequences, such as NumPy arrays or Python lists. With pandas, the values attribute of a DataFrame or Series gives the underlying NumPy array:

mfe = MFE(
    groups=["general", "statistical", "info-theory"],
    random_state=42
)
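# cat_cols is an argument of fit; 'auto' detects categorical columns
mfe.fit(X.values, y.values, cat_cols='auto')
ft = mfe.extract(suppress_warnings=True)
# ft[0] holds the meta-feature names, ft[1] the corresponding values
print("\n".join("{:50} {:30}".format(name, value)
                for name, value in zip(ft[0], ft[1])))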

attr_conc.mean                                                0.09803782687811123
attr_conc.sd                                                  0.20092780162502177
attr_ent.mean                                                   1.806723506674862
attr_ent.sd                                                    0.6400186923915645
attr_to_inst                                                  0.02666666666666667
can_cor.mean                                                   0.9999999999999939
can_cor.sd                                                                    nan
cat_to_num                                                                    1.0
class_conc.mean                                                0.3367786962694272
class_conc.sd                                                 0.46816436703211783
class_ent                                                                     1.0
cor.mean                                                      0.21555756110154345
cor.sd                                                         0.3346883517496457
cov.mean                                                       1.1208620432513037
cov.sd                                                          2.935012658229703
eigenvalues.mean                                                302.5088590604026
eigenvalues.sd                                                 468.34378838676076
eq_num_attr                                                     2.342391810715466
freq_class.mean                                                               0.5
freq_class.sd                                                                 0.0
g_mean.mean                                                                   nan
g_mean.sd                                                                     nan
gravity                                                         2.526235671244205
h_mean.mean                                                                   0.0
h_mean.sd                                                                     0.0
inst_to_attr                                                                 37.5
iq_range.mean                                                              17.875
iq_range.sd                                                    26.169519483551852
joint_ent.mean                                                 2.3798094458309773
joint_ent.sd                                                   1.1272619013187726
kurtosis.mean                                                  -1.590742398569767
kurtosis.sd                                                    0.3510329462751763
lh_trace                                                        81883629588553.47
mad.mean                                                                  13.0963
mad.sd                                                         19.739563046227747
max.mean                                                                     33.5
max.sd                                                          50.34977656355587
mean.mean                                                      16.575555555555553
mean.sd                                                         25.03349601959269
median.mean                                                    17.083333333333332
median.sd                                                      26.079525813685088
min.mean                                                                      0.0
min.sd                                                                        0.0
mut_inf.mean                                                  0.42691406084388483
mut_inf.sd                                                     0.4886487667308632
nr_attr                                                                         4
nr_bin                                                                          1
nr_cat                                                                          2
nr_class                                                                        2
nr_cor_attr                                                   0.26666666666666666
nr_disc                                                                         1
nr_inst                                                                       150
nr_norm                                                                       0.0
nr_num                                                                          2
nr_outliers                                                                     0
ns_ratio                                                        3.232054346262326
num_to_cat                                                                    1.0
p_trace                                                        0.9999999999999878
range.mean                                                                   33.5
range.sd                                                        50.34977656355587
roy_root                                                        81883629588553.47
sd.mean                                                        10.363955372064863
sd.sd                                                          15.300874018418051
sd_ratio                                                                      nan
skewness.mean                                                 0.21813480959596565
skewness.sd                                                   0.37469793054651285
sparsity.mean                                                 0.20977689098494468
sparsity.sd                                                   0.24417958923140248
t_mean.mean                                                     16.63888888888889
t_mean.sd                                                      25.218526160907878
var.mean                                                       302.50885906040264
var.sd                                                         468.28423639285654
w_lambda                                                   1.2212453270876724e-14
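
As used above, the result of extract is a pair of parallel sequences: meta-feature names in ft[0] and their values in ft[1]. A minimal sketch of pairing them into a dictionary, optionally skipping nan values, could look like the following. The ft tuple below is a hand-copied subset of the output above that stands in for a real extraction, and the helper name to_feature_dict is hypothetical, not part of PyMFE.

```python
import math

# Hand-copied subset of the output above, standing in for the
# (names, values) pair returned by mfe.extract(); not a PyMFE call.
ft = (
    ["attr_to_inst", "can_cor.sd", "class_ent", "nr_attr", "nr_inst"],
    [0.02666666666666667, float("nan"), 1.0, 4, 150],
)


def to_feature_dict(names, values, drop_nan=False):
    """Pair meta-feature names with their values (hypothetical helper)."""
    pairs = zip(names, values)
    if drop_nan:
        # Keep only entries whose value is not nan
        pairs = ((n, v) for n, v in pairs
                 if not (isinstance(v, float) and math.isnan(v)))
    return dict(pairs)


features = to_feature_dict(*ft)
defined = to_feature_dict(*ft, drop_nan=True)

print(features["nr_inst"])
print(sorted(defined))
```

A dictionary makes it easy to look up a single meta-feature by name instead of scanning the printed table.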

Pandas CSV

Getting data from CSV format

df = pd.read_csv('../data/data.csv')
X, y = df.drop('class', axis=1), df['class']

Exploring characteristics of the data

print("X shape --> ", X.shape)
print("y shape --> ", y.shape)
print("classes --> ", np.unique(y))
print("X dtypes --> \n", X.dtypes)
print("y dtypes --> ", y.dtypes)

X shape -->  (150, 4)
y shape -->  (150,)
classes -->  ['C1' 'C2']
X dtypes -->
 num1     int64
num2     int64
cat1    object
cat2    object
dtype: object
y dtypes -->  object

To extract meta-features, pass X and y to MFE as sequences, such as NumPy arrays or Python lists. With pandas, the values attribute of a DataFrame or Series gives the underlying NumPy array:

mfe = MFE(
    groups=["general", "statistical", "info-theory"],
    random_state=42
)
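# cat_cols is an argument of fit; 'auto' detects categorical columns
mfe.fit(X.values, y.values, cat_cols='auto')
ft = mfe.extract(suppress_warnings=True)
# ft[0] holds the meta-feature names, ft[1] the corresponding values
print("\n".join("{:50} {:30}".format(name, value)
                for name, value in zip(ft[0], ft[1])))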

attr_conc.mean                                                0.09803782687811123
attr_conc.sd                                                  0.20092780162502177
attr_ent.mean                                                   1.806723506674862
attr_ent.sd                                                    0.6400186923915645
attr_to_inst                                                  0.02666666666666667
can_cor.mean                                                   0.9999999999999939
can_cor.sd                                                                    nan
cat_to_num                                                                    1.0
class_conc.mean                                                0.3367786962694272
class_conc.sd                                                 0.46816436703211783
class_ent                                                                     1.0
cor.mean                                                      0.21555756110154345
cor.sd                                                         0.3346883517496457
cov.mean                                                       1.1208620432513037
cov.sd                                                          2.935012658229703
eigenvalues.mean                                                302.5088590604026
eigenvalues.sd                                                 468.34378838676076
eq_num_attr                                                     2.342391810715466
freq_class.mean                                                               0.5
freq_class.sd                                                                 0.0
g_mean.mean                                                                   nan
g_mean.sd                                                                     nan
gravity                                                         2.526235671244205
h_mean.mean                                                                   0.0
h_mean.sd                                                                     0.0
inst_to_attr                                                                 37.5
iq_range.mean                                                              17.875
iq_range.sd                                                    26.169519483551852
joint_ent.mean                                                 2.3798094458309773
joint_ent.sd                                                   1.1272619013187726
kurtosis.mean                                                  -1.590742398569767
kurtosis.sd                                                    0.3510329462751763
lh_trace                                                        81883629588553.47
mad.mean                                                                  13.0963
mad.sd                                                         19.739563046227747
max.mean                                                                     33.5
max.sd                                                          50.34977656355587
mean.mean                                                      16.575555555555553
mean.sd                                                         25.03349601959269
median.mean                                                    17.083333333333332
median.sd                                                      26.079525813685088
min.mean                                                                      0.0
min.sd                                                                        0.0
mut_inf.mean                                                  0.42691406084388483
mut_inf.sd                                                     0.4886487667308632
nr_attr                                                                         4
nr_bin                                                                          1
nr_cat                                                                          2
nr_class                                                                        2
nr_cor_attr                                                   0.26666666666666666
nr_disc                                                                         1
nr_inst                                                                       150
nr_norm                                                                       0.0
nr_num                                                                          2
nr_outliers                                                                     0
ns_ratio                                                        3.232054346262326
num_to_cat                                                                    1.0
p_trace                                                        0.9999999999999878
range.mean                                                                   33.5
range.sd                                                        50.34977656355587
roy_root                                                        81883629588553.47
sd.mean                                                        10.363955372064863
sd.sd                                                          15.300874018418051
sd_ratio                                                                      nan
skewness.mean                                                 0.21813480959596565
skewness.sd                                                   0.37469793054651285
sparsity.mean                                                 0.20977689098494468
sparsity.sd                                                   0.24417958923140248
t_mean.mean                                                     16.63888888888889
t_mean.sd                                                      25.218526160907878
var.mean                                                       302.50885906040264
var.sd                                                         468.28423639285654
w_lambda                                                   1.2212453270876724e-14
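
If pandas is not available, a file of this kind could also be read with the standard library csv module. The sketch below is only illustrative: it parses a small inline sample that stands in for ../data/data.csv (the rows are made up), and it assumes the column layout listed above, with num1 and num2 numeric and the label in a column named class. Because csv yields every field as a string, the numeric columns are converted explicitly.

```python
import csv
import io

# Inline stand-in for the CSV file; the column names follow the
# dtypes listing above, the rows are made up for illustration.
sample = io.StringIO(
    "num1,num2,cat1,cat2,class\n"
    "51,92,cat1-1,cat2-1,C1\n"
    "14,71,cat1-1,cat2-2,C1\n"
    "60,20,cat1-2,cat2-3,C2\n"
)

numeric_cols = {"num1", "num2"}  # assumed numeric columns
target_col = "class"             # assumed label column

X, y = [], []
for row in csv.DictReader(sample):
    y.append(row.pop(target_col))
    # csv returns strings, so convert the numeric fields by hand
    X.append([int(v) if k in numeric_cols else v for k, v in row.items()])

print(X[0])
print(y)
```

The resulting Python lists are sequences, so they can be handed to fit in the same way as the arrays above.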

ARFF

Getting data from ARFF format:

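# A with block closes the file once it has been parsed
with open('../data/data.arff', 'r') as arff_file:
    data = arff.load(arff_file)['data']
X = [row[:4] for row in data]  # first four columns are the attributes
y = [row[-1] for row in data]  # last column is the class label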

Exploring characteristics of the data

print("X shape --> ", len(X))
print("y shape --> ", len(y))
print("classes --> ", np.unique(y))
print("X dtypes --> ", type(X))
print("y dtypes --> ", type(y))

X shape -->  150
y shape -->  150
classes -->  ['C1' 'C2']
X dtypes -->  <class 'list'>
y dtypes -->  <class 'list'>

To extract meta-features, pass X and y to MFE as sequences, such as NumPy arrays or Python lists. The lists built above can be passed directly:

mfe = MFE(
    groups=["general", "statistical", "info-theory"],
    random_state=42
)
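# cat_cols is an argument of fit; 'auto' detects categorical columns
mfe.fit(X, y, cat_cols='auto')
ft = mfe.extract(suppress_warnings=True)
# ft[0] holds the meta-feature names, ft[1] the corresponding values
print("\n".join("{:50} {:30}".format(name, value)
                for name, value in zip(ft[0], ft[1])))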

attr_conc.mean                                                0.09803782687811123
attr_conc.sd                                                  0.20092780162502177
attr_ent.mean                                                   1.806723506674862
attr_ent.sd                                                    0.6400186923915645
attr_to_inst                                                  0.02666666666666667
can_cor.mean                                                   0.9999999999999939
can_cor.sd                                                                    nan
cat_to_num                                                                    1.0
class_conc.mean                                                0.3367786962694272
class_conc.sd                                                 0.46816436703211783
class_ent                                                                     1.0
cor.mean                                                      0.21555756110154345
cor.sd                                                         0.3346883517496457
cov.mean                                                       1.1208620432513037
cov.sd                                                          2.935012658229703
eigenvalues.mean                                                302.5088590604026
eigenvalues.sd                                                 468.34378838676076
eq_num_attr                                                     2.342391810715466
freq_class.mean                                                               0.5
freq_class.sd                                                                 0.0
g_mean.mean                                                                   nan
g_mean.sd                                                                     nan
gravity                                                         2.526235671244205
h_mean.mean                                                                   0.0
h_mean.sd                                                                     0.0
inst_to_attr                                                                 37.5
iq_range.mean                                                              17.875
iq_range.sd                                                    26.169519483551852
joint_ent.mean                                                 2.3798094458309773
joint_ent.sd                                                   1.1272619013187726
kurtosis.mean                                                  -1.590742398569767
kurtosis.sd                                                    0.3510329462751763
lh_trace                                                        81883629588553.47
mad.mean                                                                  13.0963
mad.sd                                                         19.739563046227747
max.mean                                                                     33.5
max.sd                                                          50.34977656355587
mean.mean                                                      16.575555555555553
mean.sd                                                         25.03349601959269
median.mean                                                    17.083333333333332
median.sd                                                      26.079525813685088
min.mean                                                                      0.0
min.sd                                                                        0.0
mut_inf.mean                                                  0.42691406084388483
mut_inf.sd                                                     0.4886487667308632
nr_attr                                                                         4
nr_bin                                                                          1
nr_cat                                                                          2
nr_class                                                                        2
nr_cor_attr                                                   0.26666666666666666
nr_disc                                                                         1
nr_inst                                                                       150
nr_norm                                                                       0.0
nr_num                                                                          2
nr_outliers                                                                     0
ns_ratio                                                        3.232054346262326
num_to_cat                                                                    1.0
p_trace                                                        0.9999999999999878
range.mean                                                                   33.5
range.sd                                                        50.34977656355587
roy_root                                                        81883629588553.47
sd.mean                                                        10.363955372064863
sd.sd                                                          15.300874018418051
sd_ratio                                                                      nan
skewness.mean                                                 0.21813480959596565
skewness.sd                                                   0.37469793054651285
sparsity.mean                                                 0.20977689098494468
sparsity.sd                                                   0.24417958923140248
t_mean.mean                                                     16.63888888888889
t_mean.sd                                                      25.218526160907878
var.mean                                                       302.50885906040264
var.sd                                                         468.28423639285654
w_lambda                                                   1.2212453270876724e-14

As a final example, automatic detection of feature types is not used. Instead, the indices of the categorical columns are derived from the attribute metadata provided by the liac-arff package and passed to MFE explicitly.

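classid = 4  # index of the class column
with open('../data/data.arff', 'r') as arff_file:
    data = arff.load(arff_file, encode_nominal=True)
# liac-arff describes a nominal attribute by the list of its values
cat_cols = [n for n, i in enumerate(data['attributes'][:classid])
            if isinstance(i[1], list)]
data = np.array(data['data'])
X = data[:, :classid]
y = data[:, classid]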
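
To make the selection of categorical columns concrete, the sketch below applies the same isinstance test to a hand-written dictionary that imitates the structure returned by liac-arff: each entry of attributes is a (name, type) pair, where the type is a string for numeric attributes and a list of allowed values for nominal ones. The attribute names are taken from the dataset above; the dictionary itself is an assumption made for illustration and is not loaded from data.arff.

```python
# Hand-written stand-in for the dictionary returned by arff.load();
# the real file is not read here.
dataset = {
    "attributes": [
        ("num1", "NUMERIC"),
        ("num2", "NUMERIC"),
        ("cat1", ["cat1-1", "cat1-2"]),
        ("cat2", ["cat2-1", "cat2-2", "cat2-3"]),
        ("class", ["C1", "C2"]),
    ],
}

class_index = 4  # position of the class attribute

# A nominal attribute carries a list of values as its type, so an
# isinstance check is enough to pick out the categorical columns.
cat_cols = [
    index
    for index, (name, attr_type) in enumerate(dataset["attributes"][:class_index])
    if isinstance(attr_type, list)
]

print(cat_cols)  # [2, 3]
```

Slicing the attribute list up to the class index keeps the class column itself out of cat_cols, even though it is nominal too.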

Exploring characteristics of the data

print("X shape --> ", len(X))
print("y shape --> ", len(y))
print("classes --> ", np.unique(y))
print("X dtypes --> ", type(X))
print("y dtypes --> ", type(y))

X shape -->  150
y shape -->  150
classes -->  [0. 1.]
X dtypes -->  <class 'numpy.ndarray'>
y dtypes -->  <class 'numpy.ndarray'>
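
The classes now appear as 0 and 1 rather than C1 and C2 because the file was loaded with encode_nominal=True, which replaces nominal values with integer codes. The sketch below imitates that kind of encoding in plain Python, assuming each value is mapped to its position in the declared list of values. It illustrates the idea only and is not liac-arff's actual implementation.

```python
# Declared nominal values, as they might appear in the ARFF header
# (order assumed for illustration).
class_values = ["C1", "C2"]

# Map each nominal value to its position in the declared list.
encoding = {value: code for code, value in enumerate(class_values)}

labels = ["C1", "C1", "C2", "C2"]
encoded = [encoding[label] for label in labels]

print(encoding)
print(encoded)
```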

As before, X and y are passed as sequences (here NumPy arrays), this time together with the explicit categorical column indices:

mfe = MFE(
    groups=["general", "statistical", "info-theory"],
    random_state=42
)
mfe.fit(X, y, cat_cols=cat_cols)
ft = mfe.extract(suppress_warnings=True)
print("\n".join("{:50} {:30}".format(name, value)
                for name, value in zip(ft[0], ft[1])))

attr_conc.mean                                                0.09803782687811123
attr_conc.sd                                                  0.20092780162502177
attr_ent.mean                                                   1.806723506674862
attr_ent.sd                                                    0.6400186923915645
attr_to_inst                                                  0.02666666666666667
can_cor.mean                                                   0.9999999999999939
can_cor.sd                                                                    nan
cat_to_num                                                                    1.0
class_conc.mean                                                0.3367786962694272
class_conc.sd                                                 0.46816436703211783
class_ent                                                                     1.0
cor.mean                                                      0.21555756110154345
cor.sd                                                         0.3346883517496457
cov.mean                                                       1.1208620432513037
cov.sd                                                          2.935012658229703
eigenvalues.mean                                                302.5088590604026
eigenvalues.sd                                                 468.34378838676076
eq_num_attr                                                     2.342391810715466
freq_class.mean                                                               0.5
freq_class.sd                                                                 0.0
g_mean.mean                                                                   nan
g_mean.sd                                                                     nan
gravity                                                         2.526235671244205
h_mean.mean                                                                   0.0
h_mean.sd                                                                     0.0
inst_to_attr                                                                 37.5
iq_range.mean                                                              17.875
iq_range.sd                                                    26.169519483551852
joint_ent.mean                                                 2.3798094458309773
joint_ent.sd                                                   1.1272619013187726
kurtosis.mean                                                  -1.590742398569767
kurtosis.sd                                                    0.3510329462751763
lh_trace                                                        81883629588553.47
mad.mean                                                                  13.0963
mad.sd                                                         19.739563046227747
max.mean                                                                     33.5
max.sd                                                          50.34977656355587
mean.mean                                                      16.575555555555553
mean.sd                                                         25.03349601959269
median.mean                                                    17.083333333333332
median.sd                                                      26.079525813685088
min.mean                                                                      0.0
min.sd                                                                        0.0
mut_inf.mean                                                  0.42691406084388483
mut_inf.sd                                                     0.4886487667308632
nr_attr                                                                         4
nr_bin                                                                          1
nr_cat                                                                          2
nr_class                                                                        2
nr_cor_attr                                                   0.26666666666666666
nr_disc                                                                         1
nr_inst                                                                       150
nr_norm                                                                       0.0
nr_num                                                                          2
nr_outliers                                                                     0
ns_ratio                                                        3.232054346262326
num_to_cat                                                                    1.0
p_trace                                                        0.9999999999999878
range.mean                                                                   33.5
range.sd                                                        50.34977656355587
roy_root                                                        81883629588553.47
sd.mean                                                        10.363955372064863
sd.sd                                                          15.300874018418051
sd_ratio                                                                      nan
skewness.mean                                                 0.21813480959596565
skewness.sd                                                   0.37469793054651285
sparsity.mean                                                 0.20977689098494468
sparsity.sd                                                   0.24417958923140248
t_mean.mean                                                     16.63888888888889
t_mean.sd                                                      25.218526160907878
var.mean                                                       302.50885906040264
var.sd                                                         468.28423639285654
w_lambda                                                   1.2212453270876724e-14
