
I have a dataset with 41 features (columns 0 to 40), of which 7 are categorical. This categorical set is divided into two subsets:

  • A subset of string type (the column-features 1, 2, 3)
  • A subset of int type, in binary form 0 or 1 (the column-features 6, 11, 20, 21)

Furthermore, the column-features 1, 2 and 3 (of string type) have cardinality 3, 66 and 11 respectively. In this context I have to encode them in order to use a support vector machine algorithm. This is the code that I have:

import numpy as np
import pandas as pd
from sklearn import preprocessing
from sklearn import feature_extraction

df = pd.read_csv("train.csv")
datanumpy = df.values
X = datanumpy[:, 0:41]  # select columns 0 through 40 (the 41 features)
y = datanumpy[:, 41]  # select column 41 (the labels)

I don't know whether it is better to use DictVectorizer() or OneHotEncoder() [for the reasons I explained above], and above all how to use them [in terms of code] with the X matrix that I have. Or should I simply assign a number to each category in the string-type subset (since those features have high cardinality, so my feature space would grow considerably)?

EDIT: With respect to the int-type subset, I guess the best choice is to keep those column-features as they are (and not pass them to any encoder). The problem persists for the string-type subset with high cardinality.

Gil
3 Answers


This is by far the easiest:

 df = pd.get_dummies(df, drop_first=True)

If you get a memory overflow or it is too slow then reduce the cardinality:

# keep only the 10 most frequent values of `col`, lump everything else into "other"
top = df[col].isin(df[col].value_counts().index[:10])
df.loc[~top, col] = "other"
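
Putting the two snippets together for the three string columns from the question, a minimal sketch (the file name comes from the question's own code; the column positions and the top-10 threshold are just illustrative assumptions):

import pandas as pd

df = pd.read_csv("train.csv")

# positions of the string-type categorical columns mentioned in the question
string_cols = df.columns[[1, 2, 3]]

for col in string_cols:
    # keep the 10 most frequent categories of each column, lump the rest into "other"
    top = df[col].isin(df[col].value_counts().index[:10])
    df.loc[~top, col] = "other"

# one-hot encode only the (now low-cardinality) string columns
df = pd.get_dummies(df, columns=list(string_cols), drop_first=True)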
simon
  • In my understanding, this is not an acceptable answer because it does not guarantee consistency between different `DataFrame` objects (e.g. train and test). – ldavid Oct 25 '17 at 13:53
  • In the case of train/test you could just apply it before the split, though of course the same problem may occur with a new dataset you are trying to predict on. In that case one solution is to specify the categories using pandas categorical data and then apply the same specification to each dataset; that way get_dummies will use the same encoding each time (see the sketch below). – simon Oct 26 '17 at 15:14
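
A minimal sketch of that suggestion, with toy frames and a hypothetical column name feat1 (not from the original data):

import pandas as pd

# toy train/test frames; "c" never occurs in the test frame
train = pd.DataFrame({"feat1": ["a", "b", "c"], "x": [1, 2, 3]})
test = pd.DataFrame({"feat1": ["b", "a", "a"], "x": [4, 5, 6]})

# fix the category set once and apply the same specification to both frames
categories = ["a", "b", "c"]
train["feat1"] = pd.Categorical(train["feat1"], categories=categories)
test["feat1"] = pd.Categorical(test["feat1"], categories=categories)

# get_dummies now produces identical dummy columns for both frames,
# even though "c" is missing from the test frame
X_train = pd.get_dummies(train, columns=["feat1"], drop_first=True)
X_test = pd.get_dummies(test, columns=["feat1"], drop_first=True)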

As per the official documentation of OneHotEncoder, it should be fitted over the combined dataset (train and test). Otherwise it may not produce a consistent encoding.

And performance-wise I think One Hot Encoder is much better than DictVectorizer.
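
For illustration, a minimal sketch of what fitting over the combined dataset could look like with a recent scikit-learn (0.20+, which accepts string columns directly); the frames and the column name are toy examples, not the question's data:

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# toy train/test with one categorical column; "c" appears only in the test set
train = pd.DataFrame({"color": ["a", "b", "a"]})
test = pd.DataFrame({"color": ["b", "c", "a"]})

# fit on the union of both sets so the encoder knows every category
enc = OneHotEncoder()
enc.fit(pd.concat([train, test], axis=0))

X_train = enc.transform(train)  # sparse matrices, fine for an SVM
X_test = enc.transform(test)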

JKC
  • Why is One-Hot Encoder much better than DictVectorizer? Is there any data to support this? – WY Hsu Nov 26 '17 at 05:50

You can use the pandas method .get_dummies() as suggested by @simon above, or you can use the sklearn equivalent given by OneHotEncoder.

I prefer OneHotEncoder because you can pass it parameters such as which categorical features you want to encode and how many values to keep for each feature (if not indicated, it will determine them automatically from the data).

If, for some features, the cardinality is too big, impose a low n_values. If you have enough memory, don't worry: encode all the values of your features.

I guess that for a cardinality of 66, even on a basic computer, encoding all 66 values won't lead to a memory issue. Memory overflow usually happens when a feature has about as many distinct values as there are samples in your dataset (the case for IDs that are unique per sample). The bigger the dataset, the more likely you'll get a memory issue.
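
Note that the categorical_features and n_values arguments come from older scikit-learn releases and have since been removed; with a current scikit-learn the column selection is usually done with a ColumnTransformer instead. A minimal sketch along the lines of the question's setup (the file name and column positions are taken from the question, everything else is illustrative, and it assumes the remaining columns are numeric):

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.svm import SVC

df = pd.read_csv("train.csv")
X = df.iloc[:, 0:41]   # the 41 feature columns (0..40)
y = df.iloc[:, 41]     # the label column

# one-hot encode only the string-type columns 1, 2 and 3; pass the rest through
pre = ColumnTransformer(
    [("onehot", OneHotEncoder(handle_unknown="ignore"), [1, 2, 3])],
    remainder="passthrough",
)

model = make_pipeline(pre, SVC())
model.fit(X, y)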

MMF
  • I agree with you, but since my dataset is really huge I'm also worried about the time the support vector machine will take to run. Anyway, do you have a base script that uses OneHotEncoder (with LabelEncoder applied first, I guess)? – Gil Nov 16 '16 at 11:31