
My goal is to analyze two ndarrays created by the function sklearn.model_selection.train_test_split. Both report the same dtype:

N_sample.dtype.name

which returns 'object', and the same goes for

R_sample.dtype.name

Let me explain how I obtained those data. I copied all the text of the Iris flower data set (https://en.wikipedia.org/wiki/Iris_flower_data_set) and pasted it into an empty file named iris.csv, which I saved in the same folder as my project. Then I wrote my Python script:

import pandas
from sklearn import model_selection

names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
dataset = pandas.read_csv('iris.csv', names=names)
array = dataset.values   # the whole frame as a single ndarray
N = array[:,0:4]         # the four measurement columns
R = array[:,4]           # the class labels
N_sample, N_test, R_sample, R_test = model_selection.train_test_split(N, R, test_size=0.2, random_state=7)

Therefore, I have:

N_sample.shape

that returns (120,4) and

R_sample.shape

that returns (120,)

So, to rebuild a dataset (a DataFrame) from the split arrays, I used this:

import numpy

new_arr = numpy.column_stack((N_sample, R_sample))
dateN = pandas.DataFrame(data=new_arr, columns=names)
# names was created above with the matching column labels
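For context, the object dtype seems to come from dataset.values: because the frame mixes float measurement columns with a string class column, NumPy falls back to a common object dtype, and both the slicing and numpy.column_stack preserve it. A minimal illustration with a hypothetical two-row frame:

import numpy
import pandas

# hypothetical miniature of the iris frame: four float columns plus one string column
mini = pandas.DataFrame({'sepal-length': [5.1, 4.9], 'sepal-width': [3.5, 3.0],
                         'petal-length': [1.4, 1.4], 'petal-width': [0.2, 0.2],
                         'class': ['Iris-setosa', 'Iris-setosa']})

arr = mini.values            # mixed columns -> a single object-dtype ndarray
print(arr.dtype.name)        # 'object'
print(arr[:, 0:4].dtype)     # still object after slicing, just like N_sample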

The problem is that if I ask:

dateN.describe()

It returns count, unique, top, etc., but I want mean, std, and so on. I tried different approaches, such as casting the data of N_sample, but they don't work. For example:

pandas.to_numeric(dateN, downcast='float', errors='ignore')

is not possible, because to_numeric only accepts a scalar, list, tuple, 1-d array, or Series, not a whole DataFrame. I also tried this method:

N_sample.astype(float, casting='unsafe')

but in the end it doesn't change the result.

Moreover, if I do:

dateN.iloc[:,0:4] = dateN.iloc[:,0:4].apply(pandas.to_numeric, errors='coerce')
dateN.dtypes

it returns:

sepal-length    object
sepal-width     object
petal-length    object
petal-width     object
class           object
dtype: object

So nothing has changed. How can I solve this? How do I make the dataset numeric so that I can get mean, std, etc.?

SPS

2 Answers


Consider the following demo:

We will start with a DataFrame consisting of all numeric columns:

In [282]: df = pd.DataFrame(np.random.rand(3, 3), columns=list('abc'))

In [283]: df
Out[283]:
          a         b         c
0  0.357395  0.641735  0.959405
1  0.941251  0.966066  0.626380
2  0.966839  0.388960  0.411612

In [284]: df.dtypes
Out[284]:
a    float64
b    float64
c    float64
dtype: object

In [285]: df.describe()
Out[285]:
              a         b         c
count  3.000000  3.000000  3.000000
mean   0.755162  0.665587  0.665799
std    0.344714  0.289292  0.276016
min    0.357395  0.388960  0.411612
25%    0.649323  0.515347  0.518996
50%    0.941251  0.641735  0.626380
75%    0.954045  0.803901  0.792893
max    0.966839  0.966066  0.959405

Now let's change a single cell with a string value:

In [286]: df.loc[0, 'b'] = 'XXXXXXXXX'

In [287]: df
Out[287]:
          a          b         c
0  0.357395  XXXXXXXXX  0.959405
1  0.941251   0.966066  0.626380
2  0.966839    0.38896  0.411612

As a result, the whole column has become an object (string) column:

In [288]: df.dtypes
Out[288]:
a    float64
b     object   # <--- NOTE !!!
c    float64
dtype: object

and it has disappeared from the df.describe() output:

In [289]: df.describe()
Out[289]:
              a         c
count  3.000000  3.000000
mean   0.755162  0.665799
std    0.344714  0.276016
min    0.357395  0.411612
25%    0.649323  0.518996
50%    0.941251  0.626380
75%    0.954045  0.792893
max    0.966839  0.959405

If none of our columns are numeric, df.describe() gives us different, non-numeric statistics:

In [290]: df.astype(str).describe()
Out[290]:
                     a          b               c
count                3          3               3
unique               3          3               3
top     0.357394893221  XXXXXXXXX  0.411612214836
freq                 1          1               1

And when you use pd.to_numeric(..., errors='ignore'), the column dtype will NOT be changed:

In [291]: df['b'] = pd.to_numeric(df['b'], errors='ignore')

In [292]: df.dtypes
Out[292]:
a    float64
b     object   # <--- NOTE !!!
c    float64
dtype: object

As a solution, you can use errors='coerce'. This will replace all values that can't be converted to a numeric dtype with NaN:

In [293]: df = df.apply(pd.to_numeric, errors='coerce')

In [294]: df.dtypes
Out[294]:
a    float64
b    float64
c    float64
dtype: object

In [295]: df
Out[295]:
          a         b         c
0  0.357395       NaN  0.959405
1  0.941251  0.966066  0.626380
2  0.966839  0.388960  0.411612

Now df.describe() will report numeric statistics again:

In [296]: df.describe()
Out[296]:
              a         b         c
count  3.000000  2.000000  3.000000
mean   0.755162  0.677513  0.665799
std    0.344714  0.408076  0.276016
min    0.357395  0.388960  0.411612
25%    0.649323  0.533236  0.518996
50%    0.941251  0.677513  0.626380
75%    0.954045  0.821789  0.792893
max    0.966839  0.966066  0.959405
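Applied to the dateN frame from the question, a minimal sketch (assuming the variable and column names from the question) is to convert each measurement column and assign it back by name, which sidesteps the iloc-based assignment that left the dtypes unchanged in the question:

import numpy as np
import pandas as pd

# hypothetical stand-ins for the question's train_test_split outputs (object dtype)
N_sample = np.array([[5.1, 3.5, 1.4, 0.2],
                     [4.9, 3.0, 1.4, 0.2]], dtype=object)
R_sample = np.array(['Iris-setosa', 'Iris-setosa'], dtype=object)
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']

dateN = pd.DataFrame(np.column_stack((N_sample, R_sample)), columns=names)

# convert the four measurement columns one by one and assign them back by name
for col in names[:4]:
    dateN[col] = pd.to_numeric(dateN[col], errors='coerce')

print(dateN.dtypes)      # the four measurement columns are now float64
print(dateN.describe())  # numeric statistics: count, mean, std, quartiles, ...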
MaxU - stand with Ukraine
  • My starting dataset is [this](https://en.wikipedia.org/wiki/Iris_flower_data_set), which I processed with: `array = dataset.values; N = array[:,0:4]; F = array[:,4]; N_sample, N_validation, R_sample, R_validation = model_selection.train_test_split(N, F, test_size=0.2, random_state=7)`. But if I call dataset.describe() I get the numeric information, while if I analyze the new dataset built from N_sample + R_sample I don't. That's the issue that I don't understand and can't solve. How could I manage my new dataset to get what I want? Thanks – SPS Aug 17 '17 at 21:41
  • Can no one else help me? – SPS Aug 18 '17 at 10:00
  • @SPS, I think I've shown you how to deal with that, no? – MaxU - stand with Ukraine Aug 18 '17 at 10:02
  • Your answer was useful for explaining what happens, and it's great, but `dateN[:,0:4] = dateN[:,0:4].apply(pandas.to_numeric, errors='coerce')` or similar doesn't work in my case. As you can test yourself, the first four columns in the dataset are just numbers, so the conversion should be possible. But how? That's the problem! Thanks – SPS Aug 18 '17 at 10:19
  • `dateN.iloc[:,0:4] = dateN.iloc[:,0:4].apply(pandas.to_numeric, errors='coerce')` ? – MaxU - stand with Ukraine Aug 18 '17 at 10:27
  • It doesn't work! In fact I have this: `datiN.iloc[:,0:4] = dateN.iloc[:,0:4].apply(pandas.to_numeric, errors='coerce'); dateN.dtypes`, which returns `sepal-length object, sepal-width object, petal-length object, petal-width object, class object, dtype: object`, so it seems nothing has changed – SPS Aug 18 '17 at 14:12
  • @SPS, I'd recommend you open a new question and post a small __reproducible__ data set there along with your desired result. Please read [how to make good reproducible pandas examples](http://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples). – MaxU - stand with Ukraine Aug 18 '17 at 14:33

Most scikit-learn methods accept pandas DataFrames, so there is no need to convert your data into NumPy arrays and back.

Demo:

Read the CSV:

In [93]: dataset = pd.read_csv(url, names=names)

In [94]: dataset
Out[94]:
     sepal-length  sepal-width  petal-length  petal-width           class
0             5.1          3.5           1.4          0.2     Iris-setosa
1             4.9          3.0           1.4          0.2     Iris-setosa
2             4.7          3.2           1.3          0.2     Iris-setosa
3             4.6          3.1           1.5          0.2     Iris-setosa
4             5.0          3.6           1.4          0.2     Iris-setosa
..            ...          ...           ...          ...             ...
145           6.7          3.0           5.2          2.3  Iris-virginica
146           6.3          2.5           5.0          1.9  Iris-virginica
147           6.5          3.0           5.2          2.0  Iris-virginica
148           6.2          3.4           5.4          2.3  Iris-virginica
149           5.9          3.0           5.1          1.8  Iris-virginica

[150 rows x 5 columns]

Categorize and factorize the class column:

In [95]: dataset['class'] = dataset['class'].astype('category')

In [96]: dataset['class_num'] = dataset['class'].cat.codes

In [97]: dataset
Out[97]:
     sepal-length  sepal-width  petal-length  petal-width           class  class_num
0             5.1          3.5           1.4          0.2     Iris-setosa          0
1             4.9          3.0           1.4          0.2     Iris-setosa          0
2             4.7          3.2           1.3          0.2     Iris-setosa          0
3             4.6          3.1           1.5          0.2     Iris-setosa          0
4             5.0          3.6           1.4          0.2     Iris-setosa          0
..            ...          ...           ...          ...             ...        ...
145           6.7          3.0           5.2          2.3  Iris-virginica          2
146           6.3          2.5           5.0          1.9  Iris-virginica          2
147           6.5          3.0           5.2          2.0  Iris-virginica          2
148           6.2          3.4           5.4          2.3  Iris-virginica          2
149           5.9          3.0           5.1          1.8  Iris-virginica          2

[150 rows x 6 columns]

Split the dataset into train and test sets:

In [98]: N_sample, N_test, R_sample, R_test = \
    ...:     train_test_split(dataset.iloc[:, :4], dataset.iloc[:, -1],
    ...:                      test_size=0.2, random_state=7)
    ...:

In [99]: N_sample
Out[99]:
     sepal-length  sepal-width  petal-length  petal-width
126           6.2          2.8           4.8          1.8
79            5.7          2.6           3.5          1.0
22            4.6          3.6           1.0          0.2
139           6.9          3.1           5.4          2.1
74            6.4          2.9           4.3          1.3
..            ...          ...           ...          ...
142           5.8          2.7           5.1          1.9
92            5.8          2.6           4.0          1.2
103           6.3          2.9           5.6          1.8
67            5.8          2.7           4.1          1.0
25            5.0          3.0           1.6          0.2

[120 rows x 4 columns]

In [100]: N_sample.dtypes
Out[100]:
sepal-length    float64
sepal-width     float64
petal-length    float64
petal-width     float64
dtype: object

In [101]: R_sample
Out[101]:
126    2
79     1
22     0
139    2
74     1
      ..
142    2
92     1
103    2
67     1
25     0
Name: class_num, Length: 120, dtype: int8

In [102]: R_sample.dtype
Out[102]: dtype('int8')

As you can see, all columns are of numeric dtypes...
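A compact, self-contained sketch of the same approach (assuming the iris data is available locally as iris.csv, since the url variable used in the session above is not shown):

import pandas as pd
from sklearn.model_selection import train_test_split

names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
dataset = pd.read_csv('iris.csv', names=names)   # assumes a local copy of the iris CSV

# encode the string labels as integer codes, keeping everything in pandas
dataset['class_num'] = dataset['class'].astype('category').cat.codes

N_sample, N_test, R_sample, R_test = train_test_split(
    dataset[names[:4]], dataset['class_num'], test_size=0.2, random_state=7)

print(N_sample.dtypes)      # all float64
print(N_sample.describe())  # numeric summary: count, mean, std, quartiles, ...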

MaxU - stand with Ukraine
  • So, practically, you added a new column to the dataset representing the names with numbers, and then you used a dataset of just 'numbers' to obtain what I want... but why can we not transform just the four columns of interest for my describe output? – SPS Aug 18 '17 at 16:44
  • @SPS, transform to what? – MaxU - stand with Ukraine Aug 18 '17 at 19:38
  • What I mean is: don't touch the last column, and change the data type of the other columns – SPS Aug 19 '17 at 10:49
  • I don't understand what you want to do... scikit-learn methods don't like string (object) dtypes, so you will need to convert them anyway. I've provided a solution that does exactly that. – MaxU - stand with Ukraine Aug 19 '17 at 11:06