
I have a dataframe that has int and categorical features. The categorical features are of two types: numbers and strings.

I was able to one-hot encode the int columns and the categorical columns that were numbers. I get an error when I try to one-hot encode the categorical columns that are strings.

ValueError: could not convert string to float: '13367cc6'

Since the dataframe is huge and has high cardinality, I only want to convert it to a sparse form. I would prefer a solution that uses from sklearn.preprocessing import OneHotEncoder since I am familiar with it.

I checked other questions too but none of them addresses what I am asking.

data = [[623, 'dog', 4], [123, 'cat', 2],[623, 'cat', 1], [111, 'lion', 6]]

The above dataframe contains 4 rows and 3 columns

Column names - ['animal_id', 'animal_name', 'number']

Assume that animal_id and animal_name are stored in pandas as category and number as int64 dtype.
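
For reference, a minimal sketch of building this example dataframe with the dtypes described above (nothing here beyond what the question states):

import pandas as pd

data = [[623, 'dog', 4], [123, 'cat', 2], [623, 'cat', 1], [111, 'lion', 6]]
df = pd.DataFrame(data, columns=['animal_id', 'animal_name', 'number'])

# animal_id and animal_name as category, number stays int64
df['animal_id'] = df['animal_id'].astype('category')
df['animal_name'] = df['animal_name'].astype('category')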


2 Answers


Assuming you have the following DF:

In [124]: df
Out[124]:
   animal_id animal_name  number
0        623         dog       4
1        123         cat       2
2        623         cat       1
3        111        lion       6

In [125]: df.dtypes
Out[125]:
animal_id         int64
animal_name    category
number            int64
dtype: object

First, save the animal_name column (if you need it in the future):

In [126]: animal_name = df['animal_name']

Then convert the animal_name column to a categorical (memory-saving) numeric column:

In [127]: df['animal_name'] = df['animal_name'].cat.codes.astype('category')

In [128]: df
Out[128]:
   animal_id animal_name  number
0        623           1       4
1        123           0       2
2        623           0       1
3        111           2       6

In [129]: df.dtypes
Out[129]:
animal_id         int64
animal_name    category
number            int64
dtype: object

Now OneHotEncoder should work:

In [130]: enc = OneHotEncoder()

In [131]: enc.fit(df)
Out[131]:
OneHotEncoder(categorical_features='all', dtype=<class 'numpy.float64'>,
       handle_unknown='error', n_values='auto', sparse=True)

In [132]: X = enc.fit(df)

In [134]: X.n_values_
Out[134]: array([624,   3,   7])

In [135]: enc.feature_indices_
Out[135]: array([  0, 624, 627, 634], dtype=int32)
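
To actually obtain the sparse matrix the question asks for, a minimal sketch assuming the fitted enc above and the older OneHotEncoder API shown in this answer (where sparse=True is the default, so transform returns a SciPy sparse matrix):

# enc has already been fitted on df above
X_sparse = enc.transform(df)

# X_sparse is a scipy.sparse matrix with one column per distinct value of each
# feature (634 columns in total, per enc.feature_indices_ above);
# densify only if you really need to:
# X_sparse.toarray()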
MaxU - stand with Ukraine
  • How do I select multiple columns which need to be converted to categorical (memory-saving) numeric columns? I have hundreds of columns in a dataframe that this operation will be performed on. – Aman Mar 28 '17 at 16:54
  • I can think of using a for loop. But is there any other way? – Aman Mar 28 '17 at 16:55
  • @Aman, glad I could help :) What are the `dtypes` of those hundreds of columns? – MaxU - stand with Ukraine Mar 28 '17 at 17:29
  • Initially they were either int64 or object, but they should be categorical, so after your suggestion I wrote a for loop: for col in columns: train[col] = train[col].astype('category').cat.codes.astype('category'). In the above code, columns denotes the desired columns that need to be converted to categorical. – Aman Mar 28 '17 at 17:34
  • @Aman, are your string columns already of `category` or `object` dtype? I guess it would make sense to open a new question, provide there a sample data set and print an output of `print(train.dtypes)` for your real DF... – MaxU - stand with Ukraine Mar 28 '17 at 17:47
  • Thanks for all the help. The 'for' loop that I used was not very slow. I can make the code better, but I guess I am fine with it right now since I am running on a deadline. – Aman Mar 28 '17 at 17:50
  • Hi! I am trying to use this to convert a column with 22,000 different cities into 22,000 dimensional space. When I run the same code you have above I get: ValueError: could not convert string to float: 'Valdosta'. What should I do? – bernando_vialli Jun 13 '18 at 20:29
  • @mkheifetz, I would recommend you open a new question and provide a small reproducible sample data set and your desired data set there. Solutions might vary depending on data types, # of columns that must be converted/vectorized, etc. – MaxU - stand with Ukraine Jun 13 '18 at 20:37
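
Regarding the comment thread above about converting hundreds of columns at once, a minimal sketch of a vectorized alternative to the per-column for loop, assuming the columns to encode are those of object or category dtype (that selection criterion is an assumption; substitute your own column list if needed):

# pick the string/categorical columns to encode
cat_cols = df.select_dtypes(include=['object', 'category']).columns

# replace each selected column by its integer category codes,
# kept as category dtype for memory savings
df[cat_cols] = df[cat_cols].apply(
    lambda s: s.astype('category').cat.codes.astype('category'))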

FYI, there are other powerful encoding schemes that do not add a large number of columns the way one-hot encoding does (in fact, they do not add any columns at all). Some of them are count encoding and target encoding. For more details, see my answer here and my ipynb here.
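
For illustration, a minimal sketch of count encoding with plain pandas, assuming df still holds the original animal_name values (target encoding would instead map each category to the mean of the target column):

# count encoding: each animal_name is replaced by how often it occurs
df['animal_name_count'] = df.groupby('animal_name')['animal_name'].transform('count')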

Victor Luu