
I was working with a Pandas dataframe, using the UCI repository credit screening file at http://archive.ics.uci.edu/ml/machine-learning-databases/credit-screening/crx.data

The data contains some missing values, and I want to apply a different imputation strategy depending on the data type of each column. For example, if the column is numeric, impute the median; if it is categorical, replace missing values with a category such as "No Value".

I run this code to identify the numeric columns:

#Import data
import pandas as pd
data = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/credit-screening/crx.data', header=None)

#Imputation
import numpy as np
data = data.replace('?', np.nan)
numeric_columns = data.select_dtypes(include=[np.number]).columns

And it returns:

Out[67]: Int64Index([2, 7, 10, 14], dtype='int64')

For some reason it does not identify column 1 (which is clearly numeric) as numeric. I believe the cause is that some NaN values in the column make it look non-numeric. Does anyone know what is going on, and what I can do to have column 1 identified as numeric?

Thanks!

dr_otter
  • What do you see when you try `data[0].dtype`? If not numeric, try: `data[0] = pd.to_numeric(data[0], errors='coerce')`. – jpp May 29 '18 at 16:07
  • I get `dtype('O')`, what does it mean? I could manually do the to_numeric casting, but I'd like the algorithm to do it programmatically. – dr_otter May 29 '18 at 16:11
  • `dtype('O')` means object: the column may hold strings or any arbitrary type. You will need to convert. – jpp May 29 '18 at 16:14
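
To make the point in the comments concrete, here is a minimal sketch on a made-up column (the values and the names `raw`, `replaced`, and `converted` are illustrative, not taken from crx.data). It assumes the column was read as strings because of the '?' placeholders, and shows that replacing '?' with NaN does not change the dtype by itself:

```python
import numpy as np
import pandas as pd

# Hypothetical toy column: numbers stored as strings, with '?' marking a missing value.
raw = pd.Series(["30.83", "58.67", "?", "24.50"], dtype=object)

# Replacing the placeholder leaves the remaining values as strings,
# so the column keeps its object dtype.
replaced = raw.replace("?", np.nan)

# An explicit conversion is what turns the strings into floats.
converted = pd.to_numeric(replaced, errors="coerce")

print(raw.dtype, replaced.dtype, converted.dtype)
```

Only `converted` would be picked up by `select_dtypes(include=[np.number])`.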

2 Answers


Use pd.to_numeric with errors='ignore':

Before, df.info():

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 690 entries, 0 to 689
Data columns (total 16 columns):
0     678 non-null object
1     678 non-null object
2     690 non-null float64
3     684 non-null object
4     684 non-null object
5     681 non-null object
6     681 non-null object
7     690 non-null float64
8     690 non-null object
9     690 non-null object
10    690 non-null int64
11    690 non-null object
12    690 non-null object
13    677 non-null object
14    690 non-null int64
15    690 non-null object
dtypes: float64(2), int64(2), object(12)
memory usage: 86.3+ KB

Replace the '?' placeholders, then apply pd.to_numeric to every column:

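# Mark the '?' placeholders as missing values
df = df.replace('?', np.nan)
# Convert each column where possible; columns that fail to parse are returned unchanged
df = df.apply(lambda x: pd.to_numeric(x, errors='ignore'))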

After, df.info():

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 690 entries, 0 to 689
Data columns (total 16 columns):
0     678 non-null object
1     678 non-null float64
2     690 non-null float64
3     684 non-null object
4     684 non-null object
5     681 non-null object
6     681 non-null object
7     690 non-null float64
8     690 non-null object
9     690 non-null object
10    690 non-null int64
11    690 non-null object
12    690 non-null object
13    677 non-null float64
14    690 non-null int64
15    690 non-null object
dtypes: float64(4), int64(2), object(10)
memory usage: 86.3+ KB
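
A note for newer pandas versions: as far as I know, errors='ignore' is deprecated in pd.to_numeric from pandas 2.2 onward. A rough equivalent is to attempt the conversion per column and keep the original column when parsing fails. The sketch below uses a hypothetical helper name (`to_numeric_where_possible`) and a made-up toy frame rather than crx.data:

```python
import numpy as np
import pandas as pd

def to_numeric_where_possible(frame):
    """Return a copy in which every fully parseable column is numeric."""
    out = frame.copy()
    for col in out.columns:
        try:
            out[col] = pd.to_numeric(out[col])
        except (ValueError, TypeError):
            # At least one value is not a number: leave the column as it is.
            pass
    return out

# Hypothetical toy data: column 0 is categorical, column 1 is numeric but
# stored as strings, column 2 is already float.
toy = pd.DataFrame({
    0: pd.Series(["b", "a", np.nan], dtype=object),
    1: pd.Series(["30.83", np.nan, "24.50"], dtype=object),
    2: [0.0, 4.46, 0.5],
})

result = to_numeric_where_possible(toy)
numeric_columns = result.select_dtypes(include=[np.number]).columns
print(list(numeric_columns))
```

As with errors='ignore', a column is converted only if every non-missing value parses, so categorical columns are left untouched.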
Scott Boston

The issue is that data[1] is still of dtype object after you replace '?' with NaN, because the remaining values are still strings. However, you can cast it to float in either of two ways:

The first is to use pd.to_numeric with errors='coerce', which casts un-parseable strings to NaN:

data[1] = pd.to_numeric(data[1], errors='coerce')

The second is to use your replace strategy, and then use astype(float):

data = data.replace('?', np.nan)
data[1] = data[1].astype(float)
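
The two approaches differ when a column contains other non-numeric strings besides '?'. The sketch below, on a made-up column (the value "abc" and the variable names are illustrative only), shows that errors='coerce' turns such values into NaN, while astype(float) raises an error:

```python
import numpy as np
import pandas as pd

# Hypothetical column with a '?' placeholder and one stray non-numeric string.
col = pd.Series(["30.83", "?", "abc"], dtype=object)

# Approach 1: every unparseable value ('?' and 'abc') becomes NaN.
coerced = pd.to_numeric(col, errors="coerce")

# Approach 2: only '?' is replaced, so 'abc' makes the cast fail.
replaced = col.replace("?", np.nan)
try:
    replaced.astype(float)
    astype_failed = False
except ValueError:
    astype_failed = True

print(coerced.tolist(), astype_failed)
```

So the first approach is more forgiving, and the second is stricter about unexpected values.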

Both methods will result in column 1 being included as a numeric column:

>>> numeric_columns = data.select_dtypes(include=[np.number]).columns
>>> numeric_columns
Int64Index([1, 2, 7, 10, 14], dtype='int64')
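
Once the numeric columns are detected, the imputation described in the question could look roughly like this. It is only a sketch: the helper name `impute_by_dtype`, the `fill_label` parameter, and the toy frame are assumptions for illustration, not part of the original code.

```python
import numpy as np
import pandas as pd

def impute_by_dtype(frame, fill_label="No Value"):
    """Fill numeric columns with their median and all others with a label."""
    out = frame.copy()
    numeric_cols = set(out.select_dtypes(include=[np.number]).columns)
    for col in out.columns:
        if col in numeric_cols:
            # Median of the non-missing values in this column
            out[col] = out[col].fillna(out[col].median())
        else:
            out[col] = out[col].fillna(fill_label)
    return out

# Hypothetical toy data with one categorical and one numeric column.
toy = pd.DataFrame({
    0: pd.Series(["b", np.nan, "a"], dtype=object),
    1: [30.0, np.nan, 20.0],
})

filled = impute_by_dtype(toy)
print(filled)
```

The function works on a copy, so the original frame keeps its missing values for comparison.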
sacuL