How to one hot encode a large dataframe when multiple columns contain the same values?

Question

The title essentially captures my problem.

I have a dataframe in which multiple columns contain the same values, such as [0, 1]. If I were to one hot encode the df, I'd end up with multiple columns with the same name.

The tedious solution would be to manually create unique columns, but I have 58 categorical columns, so that doesn't seem very efficient.

I'm not sure if this will be helpful, but here is the head() of my dataframe.

   x2  x3  x4           x5        x6          x7          x8        x9        x10        x11  ...       z217       z218          z219         z220        z221         z222  subject  phase  state  output
0   0   0   1  -300.361218  0.886360   -2.590886  225.001899  0.006204   0.000037  -0.000013  ...   0.005242   0.024971  -1017.620978  -382.850838  -48.275711    -2.040336        A      3      B       0
1   0   0   1  -297.126090  0.622211   -3.960940  220.179017  0.006167  -0.000014  -0.000003  ...   0.001722   0.023595     91.229094    24.802230    1.783950     0.022620        A      3      C       0
2   0   0   1  -236.460253  0.423640  -12.656341  139.453445  0.006276  -0.000028   0.000022  ...  -0.010894  -0.036318   -188.232347   -17.474861   -1.005571    -0.021628        A      3      B       0
3   0   0   1    33.411458  2.854415   -1.962432    3.208911  0.009752  -0.000273  -0.000024  ...  -0.034184  -0.047734    185.122907  -549.282067  542.193381  -178.049926        A      3      A       0
4   0   0   1  -118.125214  2.009809   -3.291637   34.874176  0.007598   0.000001  -0.000022  ...   0.001963   0.004084     35.207794   -78.143166   57.084208   -13.700212        A      4      C       0

[Minimal, complete, verifiable example](http://stackoverflow.com/help/mcve) applies here. Please provide an example of the problem input (none of your columns here has values 0, 1, 2), and the resulting DF you'd like to see. Get rid of the extraneous information (or just keep a couple of columns). — Prune, Oct 09 '17 at 21:58

score 1 · Accepted Answer · answered Oct 09 '17 at 22:04

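You are probably already using pandas.get_dummies. If not, this function converts categorical columns into multiple indicator columns (one hot encoding).

There is a 'prefix' argument to this function which exists specifically for your case. It can be a list of strings (its length must equal the number of columns being encoded). In your case, though, you can make it a dictionary that maps column names to prefixes. Note that by default get_dummies only encodes string and categorical columns, so integer-coded columns such as x3 and x4 have to be listed in the 'columns' argument. So, something like:

pd.get_dummies(df, columns=['x3', 'x4'], prefix={'x3': 'x3', 'x4': 'x4'})

This will add columns like x3_0, x3_1, ..., x4_0, x4_1, ... When no prefix is given, get_dummies uses the source column name as the prefix, so the generated names stay unique either way.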
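As a minimal sketch of the above, here is a toy frame (made-up values, only loosely modelled on the question's columns) in which x3 and x4 share the same 0/1 codes. The prefixes 'a' and 'b' in the second call are arbitrary and only illustrate the dictionary form.

```python
import pandas as pd

# Toy stand-in for the asker's data: x3 and x4 both contain the values 0 and 1.
df = pd.DataFrame({
    "x3": [0, 1, 0],
    "x4": [1, 1, 0],
    "x5": [-300.36, -297.13, -236.46],
    "state": ["B", "C", "B"],
})

# columns= names what to encode; integer columns are left alone otherwise.
# With no prefix given, each source column name becomes the prefix.
encoded = pd.get_dummies(df, columns=["x3", "x4", "state"])
print(list(encoded.columns))

# A dict prefix maps each encoded column to a custom prefix (arbitrary here).
custom = pd.get_dummies(df, columns=["x3", "x4"], prefix={"x3": "a", "x4": "b"})
print(list(custom.columns))
```

With 58 categorical columns, the list passed to columns= can be built once (for example from a list of column names you already maintain) rather than typed out by hand.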

@madsthaks would appreciate if you can accept my answer — shikhanshu, Oct 09 '17 at 23:09

Yashu Seth · Answer 2 · 2017-12-14T22:54:54.397

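You can read the data and first get a list of all the unique values of your categorical variables. Then you can fit a one hot encoder object on your list of unique values (for example sklearn.preprocessing.OneHotEncoder; the CategoricalEncoder from scikit-learn's development builds was folded into OneHotEncoder and OrdinalEncoder before release).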

This method can also help in a train/test framework or when you are reading your data in chunks. I have created a Python module that does all this on its own. You can find it in the GitHub repository dummyPy.

A short tutorial on this: How to One Hot Encode Categorical Variables in Python?
