one-hot encoding on multi-dimension arrays, using pandas or scikit-learn

Question

I am trying to encode one-hot for my data frame. It is a multi dimension array and I am not sure how to do this. The data frame may look like this:

df = pd.DataFrame({'menu': [['Italian', 'Greek'], ['Japanese'], ['Italian','Greek', 'Japanese']], 'price': ['$$', '$$', '$'], 'location': [['NY', 'CA','MI'], 'CA', ['NY', 'CA','MA']]})

The output I want is something like this:

df2 = pd.DataFrame({'menu': [[1,1,0], [0,0,1], [1,1,1]], 'price': [[1,0], [1,0], [0,1]], 'location': [[1,1,1,0], [0,1,0,0], [1,1,0,1]]})

I am not sure how this can be done using pd.get_dummies or scikit-learn. Can someone help me?

Dealing with lists in a dataframe is the last thing you want to do when using pandas. This is bad design - consider dropping it. — cs95, Sep 14 '17 at 07:37
You can take a look at [MultiLabelBinarizer](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MultiLabelBinarizer.html) and my [answer here](https://stackoverflow.com/questions/42391165/how-to-one-hot-encode-variant-length-features/42392689#42392689) on how to use it. But the con is that you have to process each column using separate MultiLabelBinarizer object. Also you may need to change non list items in the column to list like `CA` to `[CA]` because thats what is required in it. — Vivek Kumar, Sep 14 '17 at 07:44
@EspoirMurhabazi For the purpose of this case, I will assume, all the locations are: NY, CA, MI, MA for now. — 2D_, Sep 14 '17 at 09:51
@cᴏʟᴅsᴘᴇᴇᴅ Thank you for your feedback. Could you tell me if there is a better way to look at this problem? If pandas is a bad way to start, I have no problem transforming my dataset into other format. I just don't know which is the best way to do it. — 2D_, Sep 14 '17 at 09:54
@2D_ If your goal is to perform machine learning, you should work with lists and sklearn directly. Pandas is used for data mining and number crunching. — cs95, Sep 14 '17 at 09:56
@EspoirMurhabazi Thank you for the post, I read it and it helped me understand how to solve this kind of problem. — 2D_, Sep 15 '17 at 03:19
2D_ say thanks to @jezrael he is the one who gave you the answer — Espoir Murhabazi, Sep 15 '17 at 05:10

score 4 · Accepted Answer · answered Sep 14 '17 at 07:47

4

You can use:

#create list with one item values
df = df.applymap(lambda x: x if isinstance(x, list) else [x])
print (df)
       location                        menu price
0  [NY, CA, MI]            [Italian, Greek]  [$$]
1          [CA]                  [Japanese]  [$$]
2  [NY, CA, MA]  [Italian, Greek, Japanese]   [$]

from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()
#create Series for each column by list comprehension
vals = [pd.Series(mlb.fit_transform(df[x]).tolist()) for x in df.columns]
#concat to df
df2 = pd.concat(vals, keys=df.columns, axis=1)
print (df2)

       location       menu   price
0  [1, 0, 1, 1]  [1, 1, 0]  [0, 1]
1  [1, 0, 0, 0]  [0, 0, 1]  [0, 1]
2  [1, 1, 0, 1]  [1, 1, 1]  [1, 0]

answered Sep 14 '17 at 07:47

jezrael

822,522
95
1,334
1,252

Thank you. I used your suggestion and it worked! I had error " TypeError: unorderable types: str() > float()" and I think this is caused by my nan value. I will figure out how to handle this case. – 2D_ Sep 15 '17 at 03:18
1

Yes, I test it - replace CA to `np.nan`. So what you need do if NaN? replace NaN to some string like `missing`? Or remove all rows where NaNs? – jezrael Sep 15 '17 at 04:11
1

For replace scalars NaNs to `missing`use `df = df.fillna('missing').applymap(lambda x: x if isinstance(x, list) else [x])` and for remove all rows with NaNs use `df = df.dropna().applymap(lambda x: x if isinstance(x, list) else [x])` – jezrael Sep 15 '17 at 04:15
1

If `NaN`s in list or in scalars - `df = df.fillna('missing').applymap(lambda x: [i if pd.notnull(i) else 'missing' for i in x] if isinstance(x, list) else [x])` – jezrael Sep 15 '17 at 04:30

one-hot encoding on multi-dimension arrays, using pandas or scikit-learn

1 Answers1