Using Pandas, CSV and ARFF files
This example shows how to pass data held in pandas objects, CSV files and ARFF files to PyMFE.
# Necessary imports
import pandas as pd
import numpy as np
from numpy import genfromtxt
from pymfe.mfe import MFE
import csv
import arff
Pandas
Generating synthetic dataset
np.random.seed(42)
sample_size = 150
numeric = pd.DataFrame({
    'num1': np.random.randint(0, 100, size=sample_size),
    'num2': np.random.randint(0, 100, size=sample_size)
})
# np.repeat needs integer repeat counts, so use floor division
categoric = pd.DataFrame({
    'cat1': np.repeat(('cat1-1', 'cat1-2'), sample_size // 2),
    'cat2': np.repeat(('cat2-1', 'cat2-2', 'cat2-3'), sample_size // 3)
})
# Combine the numeric and categorical columns into one feature frame
X = numeric.join(categoric)
y = pd.Series(np.repeat(['C1', 'C2'], sample_size // 2))
Exploring characteristics of the data
X shape --> (150, 4)
y shape --> (150,)
classes --> ['C1' 'C2']
X dtypes -->
num1 int64
num2 int64
cat1 object
cat2 object
dtype: object
y dtypes --> object
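A summary like the one above can be printed with a handful of formatted strings. The following is a minimal sketch rather than the original script: the `describe` helper is a hypothetical name, and the tiny frame is a stand-in for the synthetic `X` and `y`.

```python
import numpy as np
import pandas as pd


def describe(X, y):
    """Return summary lines for a feature frame X and a target series y."""
    return [
        f"X shape --> {X.shape}",
        f"y shape --> {y.shape}",
        f"classes --> {np.unique(y)}",
        f"X dtypes -->\n{X.dtypes}",
        f"y dtypes --> {y.dtype}",
    ]


# Small stand-in for the synthetic data (hypothetical values).
X_demo = pd.DataFrame({"num1": [1, 2, 3, 4], "cat1": ["a", "a", "b", "b"]})
y_demo = pd.Series(["C1", "C1", "C2", "C2"])

summary = describe(X_demo, y_demo)
print("\n".join(summary))
```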
To extract meta-features, pass X and y as sequences, such as NumPy arrays or Python lists. With pandas, this conversion is straightforward:
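One way to do the conversion and the extraction is sketched below. It is an illustration under assumptions, not the original cell: the helper names are hypothetical, and the three meta-feature groups are inferred from the names in the printed output. The PyMFE import is deferred into the function so the conversion step can be tried without PyMFE installed.

```python
import numpy as np
import pandas as pd


def to_sequences(X, y):
    """Convert a pandas DataFrame/Series pair to plain NumPy arrays."""
    return X.to_numpy(), y.to_numpy()


def extract_metafeatures(X, y):
    """Fit PyMFE on array-like X and y and return {feature name: value}.

    Assumption: the groups below are inferred from the feature names in the
    printed output; the text does not state them.
    """
    # Imported here so to_sequences() works without PyMFE installed.
    from pymfe.mfe import MFE

    mfe = MFE(groups=["general", "statistical", "info-theory"],
              random_state=42)
    mfe.fit(X, y)
    names, values = mfe.extract(suppress_warnings=True)
    return dict(zip(names, values))


# Stand-in data (hypothetical values): one numeric and one categorical column.
X_df = pd.DataFrame({"num1": [3, 1, 4, 1], "cat1": ["a", "b", "a", "b"]})
y_ser = pd.Series(["C1", "C1", "C2", "C2"])

X_arr, y_arr = to_sequences(X_df, y_ser)
```

With the real data, `extract_metafeatures(*to_sequences(X, y))` would return a mapping from names such as `attr_ent.mean` to their values, which can be printed one pair per line.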
attr_conc.mean 0.09803782687811123
attr_conc.sd 0.20092780162502177
attr_ent.mean 1.806723506674862
attr_ent.sd 0.6400186923915645
attr_to_inst 0.02666666666666667
can_cor.mean 0.9999999999999939
can_cor.sd nan
cat_to_num 1.0
class_conc.mean 0.3367786962694272
class_conc.sd 0.46816436703211783
class_ent 1.0
cor.mean 0.21555756110154345
cor.sd 0.3346883517496457
cov.mean 1.1208620432513037
cov.sd 2.935012658229703
eigenvalues.mean 302.5088590604026
eigenvalues.sd 468.34378838676076
eq_num_attr 2.342391810715466
freq_class.mean 0.5
freq_class.sd 0.0
g_mean.mean nan
g_mean.sd nan
gravity 2.526235671244205
h_mean.mean 0.0
h_mean.sd 0.0
inst_to_attr 37.5
iq_range.mean 17.875
iq_range.sd 26.169519483551852
joint_ent.mean 2.3798094458309773
joint_ent.sd 1.1272619013187726
kurtosis.mean -1.590742398569767
kurtosis.sd 0.3510329462751763
lh_trace 81883629588553.47
mad.mean 13.0963
mad.sd 19.739563046227747
max.mean 33.5
max.sd 50.34977656355587
mean.mean 16.575555555555553
mean.sd 25.03349601959269
median.mean 17.083333333333332
median.sd 26.079525813685088
min.mean 0.0
min.sd 0.0
mut_inf.mean 0.42691406084388483
mut_inf.sd 0.4886487667308632
nr_attr 4
nr_bin 1
nr_cat 2
nr_class 2
nr_cor_attr 0.26666666666666666
nr_disc 1
nr_inst 150
nr_norm 0.0
nr_num 2
nr_outliers 0
ns_ratio 3.232054346262326
num_to_cat 1.0
p_trace 0.9999999999999878
range.mean 33.5
range.sd 50.34977656355587
roy_root 81883629588553.47
sd.mean 10.363955372064863
sd.sd 15.300874018418051
sd_ratio nan
skewness.mean 0.21813480959596565
skewness.sd 0.37469793054651285
sparsity.mean 0.20977689098494468
sparsity.sd 0.24417958923140248
t_mean.mean 16.63888888888889
t_mean.sd 25.218526160907878
var.mean 302.50885906040264
var.sd 468.28423639285654
w_lambda 1.2212453270876724e-14
Pandas CSV
Getting data from CSV format
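Loading such a file and splitting it into features and target could look like the sketch below. The CSV content is a hypothetical stand-in for the example's data file, and `class` is an assumed name for the target column.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for the example's data file.
# Column names mirror the synthetic dataset; "class" is an assumed target name.
CSV_TEXT = """num1,num2,cat1,cat2,class
51,7,cat1-1,cat2-1,C1
92,21,cat1-1,cat2-2,C1
14,88,cat1-2,cat2-3,C2
71,60,cat1-2,cat2-3,C2
"""

# pd.read_csv accepts a file path or any file-like object.
df = pd.read_csv(io.StringIO(CSV_TEXT))

# Split the frame into the feature columns and the target column.
X = df.drop(columns="class")
y = df["class"]

print(X.shape, y.shape)
```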
Exploring characteristics of the data
X shape --> (150, 4)
y shape --> (150,)
classes --> ['C1' 'C2']
X dtypes -->
num1 int64
num2 int64
cat1 object
cat2 object
dtype: object
y dtypes --> object
To extract meta-features, pass X and y as sequences, such as NumPy arrays or Python lists. With pandas, this conversion is straightforward:
attr_conc.mean 0.09803782687811123
attr_conc.sd 0.20092780162502177
attr_ent.mean 1.806723506674862
attr_ent.sd 0.6400186923915645
attr_to_inst 0.02666666666666667
can_cor.mean 0.9999999999999939
can_cor.sd nan
cat_to_num 1.0
class_conc.mean 0.3367786962694272
class_conc.sd 0.46816436703211783
class_ent 1.0
cor.mean 0.21555756110154345
cor.sd 0.3346883517496457
cov.mean 1.1208620432513037
cov.sd 2.935012658229703
eigenvalues.mean 302.5088590604026
eigenvalues.sd 468.34378838676076
eq_num_attr 2.342391810715466
freq_class.mean 0.5
freq_class.sd 0.0
g_mean.mean nan
g_mean.sd nan
gravity 2.526235671244205
h_mean.mean 0.0
h_mean.sd 0.0
inst_to_attr 37.5
iq_range.mean 17.875
iq_range.sd 26.169519483551852
joint_ent.mean 2.3798094458309773
joint_ent.sd 1.1272619013187726
kurtosis.mean -1.590742398569767
kurtosis.sd 0.3510329462751763
lh_trace 81883629588553.47
mad.mean 13.0963
mad.sd 19.739563046227747
max.mean 33.5
max.sd 50.34977656355587
mean.mean 16.575555555555553
mean.sd 25.03349601959269
median.mean 17.083333333333332
median.sd 26.079525813685088
min.mean 0.0
min.sd 0.0
mut_inf.mean 0.42691406084388483
mut_inf.sd 0.4886487667308632
nr_attr 4
nr_bin 1
nr_cat 2
nr_class 2
nr_cor_attr 0.26666666666666666
nr_disc 1
nr_inst 150
nr_norm 0.0
nr_num 2
nr_outliers 0
ns_ratio 3.232054346262326
num_to_cat 1.0
p_trace 0.9999999999999878
range.mean 33.5
range.sd 50.34977656355587
roy_root 81883629588553.47
sd.mean 10.363955372064863
sd.sd 15.300874018418051
sd_ratio nan
skewness.mean 0.21813480959596565
skewness.sd 0.37469793054651285
sparsity.mean 0.20977689098494468
sparsity.sd 0.24417958923140248
t_mean.mean 16.63888888888889
t_mean.sd 25.218526160907878
var.mean 302.50885906040264
var.sd 468.28423639285654
w_lambda 1.2212453270876724e-14
ARFF
Getting data from ARFF format:
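The liac-arff package decodes an ARFF document into a dictionary whose `data` entry is a list of rows. The sketch below shows one way to split those rows into features and target. The helper names and the inline ARFF text are hypothetical, and the target is assumed to be the last attribute. The `arff` import is deferred so the splitting step can be tried without the package.

```python
# Hypothetical ARFF content standing in for the example's data file.
ARFF_TEXT = """@RELATION demo
@ATTRIBUTE num1 NUMERIC
@ATTRIBUTE cat1 {a,b}
@ATTRIBUTE class {C1,C2}
@DATA
1,a,C1
2,b,C2
"""


def load_arff_text(text):
    """Decode ARFF text with liac-arff (requires the liac-arff package)."""
    import arff  # provided by liac-arff

    return arff.loads(text)


def split_features_target(dataset):
    """Split decoded rows into X (all but the last column) and y (last column)."""
    rows = dataset["data"]
    X = [row[:-1] for row in rows]
    y = [row[-1] for row in rows]
    return X, y


# A hand-written dictionary in the shape liac-arff returns for ARFF_TEXT,
# so this part runs even without liac-arff installed.
decoded = {
    "relation": "demo",
    "attributes": [("num1", "NUMERIC"), ("cat1", ["a", "b"]),
                   ("class", ["C1", "C2"])],
    "data": [[1.0, "a", "C1"], [2.0, "b", "C2"]],
}

X, y = split_features_target(decoded)
print(len(X), len(y), type(X), type(y))
```

With liac-arff available, `split_features_target(load_arff_text(ARFF_TEXT))` should give the same plain Python lists.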
Exploring characteristics of the data
X shape --> 150
y shape --> 150
classes --> ['C1' 'C2']
X dtypes --> <class 'list'>
y dtypes --> <class 'list'>
To extract meta-features, pass X and y as sequences, such as NumPy arrays or Python lists. Here the data is already held in lists, so you can pass it directly:
attr_conc.mean 0.09803782687811123
attr_conc.sd 0.20092780162502177
attr_ent.mean 1.806723506674862
attr_ent.sd 0.6400186923915645
attr_to_inst 0.02666666666666667
can_cor.mean 0.9999999999999939
can_cor.sd nan
cat_to_num 1.0
class_conc.mean 0.3367786962694272
class_conc.sd 0.46816436703211783
class_ent 1.0
cor.mean 0.21555756110154345
cor.sd 0.3346883517496457
cov.mean 1.1208620432513037
cov.sd 2.935012658229703
eigenvalues.mean 302.5088590604026
eigenvalues.sd 468.34378838676076
eq_num_attr 2.342391810715466
freq_class.mean 0.5
freq_class.sd 0.0
g_mean.mean nan
g_mean.sd nan
gravity 2.526235671244205
h_mean.mean 0.0
h_mean.sd 0.0
inst_to_attr 37.5
iq_range.mean 17.875
iq_range.sd 26.169519483551852
joint_ent.mean 2.3798094458309773
joint_ent.sd 1.1272619013187726
kurtosis.mean -1.590742398569767
kurtosis.sd 0.3510329462751763
lh_trace 81883629588553.47
mad.mean 13.0963
mad.sd 19.739563046227747
max.mean 33.5
max.sd 50.34977656355587
mean.mean 16.575555555555553
mean.sd 25.03349601959269
median.mean 17.083333333333332
median.sd 26.079525813685088
min.mean 0.0
min.sd 0.0
mut_inf.mean 0.42691406084388483
mut_inf.sd 0.4886487667308632
nr_attr 4
nr_bin 1
nr_cat 2
nr_class 2
nr_cor_attr 0.26666666666666666
nr_disc 1
nr_inst 150
nr_norm 0.0
nr_num 2
nr_outliers 0
ns_ratio 3.232054346262326
num_to_cat 1.0
p_trace 0.9999999999999878
range.mean 33.5
range.sd 50.34977656355587
roy_root 81883629588553.47
sd.mean 10.363955372064863
sd.sd 15.300874018418051
sd_ratio nan
skewness.mean 0.21813480959596565
skewness.sd 0.37469793054651285
sparsity.mean 0.20977689098494468
sparsity.sd 0.24417958923140248
t_mean.mean 16.63888888888889
t_mean.sd 25.218526160907878
var.mean 302.50885906040264
var.sd 468.28423639285654
w_lambda 1.2212453270876724e-14
In this final example, we do not rely on the automatic detection of feature types. Instead, we use the attribute ids provided by the liac-arff package.
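A possible way to derive those ids is sketched below. It assumes the dictionary layout liac-arff returns, where nominal attributes carry a list of allowed values, and rows as they look when the file is loaded with `encode_nominal=True`. The helper name and the sample rows are hypothetical.

```python
import numpy as np


def nominal_column_ids(attributes, class_index):
    """Return the positions of nominal attributes before the class column.

    liac-arff describes each attribute as a (name, type) pair, where the
    type of a nominal attribute is the list of its allowed values.
    """
    return [
        idx
        for idx, (_, kind) in enumerate(attributes[:class_index])
        if isinstance(kind, list)
    ]


# Hand-written stand-in for a decoded ARFF file (hypothetical values).
attributes = [
    ("num1", "NUMERIC"),
    ("num2", "NUMERIC"),
    ("cat1", ["cat1-1", "cat1-2"]),
    ("cat2", ["cat2-1", "cat2-2", "cat2-3"]),
    ("class", ["C1", "C2"]),
]
# With encode_nominal=True, nominal values arrive as integer codes.
rows = [[51.0, 7.0, 0, 0, 0], [14.0, 88.0, 1, 2, 1]]

class_index = 4
cat_cols = nominal_column_ids(attributes, class_index)

data = np.array(rows)
X = data[:, :class_index]
y = data[:, class_index]

print(cat_cols, X.shape, y.shape)
```

The resulting list could then be handed to PyMFE, for example as `mfe.fit(X, y, cat_cols=cat_cols)`. Treat that parameter name as something to confirm against the documentation of the installed PyMFE version.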
Exploring characteristics of the data
X shape --> 150
y shape --> 150
classes --> [0. 1.]
X dtypes --> <class 'numpy.ndarray'>
y dtypes --> <class 'numpy.ndarray'>
To extract meta-features, pass X and y as sequences, such as NumPy arrays or Python lists.
attr_conc.mean 0.09803782687811123
attr_conc.sd 0.20092780162502177
attr_ent.mean 1.806723506674862
attr_ent.sd 0.6400186923915645
attr_to_inst 0.02666666666666667
can_cor.mean 0.9999999999999939
can_cor.sd nan
cat_to_num 1.0
class_conc.mean 0.3367786962694272
class_conc.sd 0.46816436703211783
class_ent 1.0
cor.mean 0.21555756110154345
cor.sd 0.3346883517496457
cov.mean 1.1208620432513037
cov.sd 2.935012658229703
eigenvalues.mean 302.5088590604026
eigenvalues.sd 468.34378838676076
eq_num_attr 2.342391810715466
freq_class.mean 0.5
freq_class.sd 0.0
g_mean.mean nan
g_mean.sd nan
gravity 2.526235671244205
h_mean.mean 0.0
h_mean.sd 0.0
inst_to_attr 37.5
iq_range.mean 17.875
iq_range.sd 26.169519483551852
joint_ent.mean 2.3798094458309773
joint_ent.sd 1.1272619013187726
kurtosis.mean -1.590742398569767
kurtosis.sd 0.3510329462751763
lh_trace 81883629588553.47
mad.mean 13.0963
mad.sd 19.739563046227747
max.mean 33.5
max.sd 50.34977656355587
mean.mean 16.575555555555553
mean.sd 25.03349601959269
median.mean 17.083333333333332
median.sd 26.079525813685088
min.mean 0.0
min.sd 0.0
mut_inf.mean 0.42691406084388483
mut_inf.sd 0.4886487667308632
nr_attr 4
nr_bin 1
nr_cat 2
nr_class 2
nr_cor_attr 0.26666666666666666
nr_disc 1
nr_inst 150
nr_norm 0.0
nr_num 2
nr_outliers 0
ns_ratio 3.232054346262326
num_to_cat 1.0
p_trace 0.9999999999999878
range.mean 33.5
range.sd 50.34977656355587
roy_root 81883629588553.47
sd.mean 10.363955372064863
sd.sd 15.300874018418051
sd_ratio nan
skewness.mean 0.21813480959596565
skewness.sd 0.37469793054651285
sparsity.mean 0.20977689098494468
sparsity.sd 0.24417958923140248
t_mean.mean 16.63888888888889
t_mean.sd 25.218526160907878
var.mean 302.50885906040264
var.sd 468.28423639285654
w_lambda 1.2212453270876724e-14
Total running time of the script: 0 minutes 0.881 seconds