Label encode variable with multiple values

Asked Feb 05 '20 at 09:36

Active Feb 05 '20 at 09:36

Viewed 45 times

My variable holds a list of ingredients for each row, separated by commas. I one-hot encoded the multiple values with MultiLabelBinarizer(), but that greatly increased the dimensionality of my dataset.

Is there a more appropriate method for this situation?

My variable looks like this:

df['ingredients_str'].head()

0    romaine lettuce, black olives, grape tomatoes
1    plain flour,ground pepper,salt,tomatoes
2    eggs,pepper,salt,mayonaise,cooking oil
3    water,vegetable oil,wheat,salt
4    black pepper,shallots,cornflour,cayenne
Name: ingredients_str, dtype: object


DataCat


After splitting the ingredients in your series, count how often each ingredient occurs. You can then map the ingredients with a low count to a new category such as "other". This will help with the high dimensionality. – Himanshu Feb 05 '20 at 09:40
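
A minimal sketch of the idea in this comment, using only the standard library. The sample rows are copied from the question; the threshold `MIN_COUNT` and the helper names `split_ingredients` and `group_rare` are hypothetical and would need tuning on the real data.

```python
from collections import Counter

# Sample rows copied from the question; with the real data this would be
# something like df['ingredients_str'].tolist().
rows = [
    "romaine lettuce, black olives, grape tomatoes",
    "plain flour,ground pepper,salt,tomatoes",
    "eggs,pepper,salt,mayonaise,cooking oil",
    "water,vegetable oil,wheat,salt",
    "black pepper,shallots,cornflour,cayenne",
]

MIN_COUNT = 2  # hypothetical threshold, not taken from the question


def split_ingredients(text):
    """Split on commas and strip the stray whitespace around each name."""
    return [part.strip() for part in text.split(",") if part.strip()]


def group_rare(token_lists, min_count=MIN_COUNT, other_label="other"):
    """Replace ingredients found in fewer than min_count rows by other_label."""
    # Count each ingredient at most once per row.
    counts = Counter(tok for tokens in token_lists for tok in set(tokens))
    grouped = []
    for tokens in token_lists:
        mapped = [tok if counts[tok] >= min_count else other_label
                  for tok in tokens]
        # Several rare ingredients collapse into one "other"; keep the order.
        grouped.append(list(dict.fromkeys(mapped)))
    return grouped


tokens = [split_ingredients(row) for row in rows]
grouped = group_rare(tokens)
```

On the five sample rows only "salt" reaches the threshold, so every other ingredient ends up in "other". The grouped lists can then be encoded as before, with far fewer columns.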

Keep the most frequent categories and skip the rest. For more, refer to http://proceedings.mlr.press/v7/niculescu09/niculescu09.pdf – Sachin Yadav Feb 15 '20 at 09:51
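
One way this suggestion might look in plain Python, as a rough sketch: keep a fixed-size vocabulary of the most frequent ingredients and build one 0/1 column per kept ingredient. `TOP_K` and the helper names are hypothetical, and the rows are the sample from the question. The same vocabulary could probably also be handed to MultiLabelBinarizer through its `classes` parameter instead of building the matrix by hand.

```python
from collections import Counter

# Sample rows copied from the question.
rows = [
    "romaine lettuce, black olives, grape tomatoes",
    "plain flour,ground pepper,salt,tomatoes",
    "eggs,pepper,salt,mayonaise,cooking oil",
    "water,vegetable oil,wheat,salt",
    "black pepper,shallots,cornflour,cayenne",
]

TOP_K = 3  # hypothetical vocabulary size, not taken from the question


def split_ingredients(text):
    """Split on commas and strip the stray whitespace around each name."""
    return [part.strip() for part in text.split(",") if part.strip()]


def top_k_vocabulary(token_lists, k):
    """Return the k ingredients that occur in the most rows."""
    counts = Counter(tok for tokens in token_lists for tok in set(tokens))
    # Sort by descending count, then by name so that ties are deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [name for name, _ in ranked[:k]]


def multi_hot(token_lists, vocabulary):
    """Build one 0/1 row per input; ingredients outside vocabulary are dropped."""
    matrix = []
    for tokens in token_lists:
        present = set(tokens)
        matrix.append([1 if name in present else 0 for name in vocabulary])
    return matrix


tokens = [split_ingredients(row) for row in rows]
vocabulary = top_k_vocabulary(tokens, TOP_K)
matrix = multi_hot(tokens, vocabulary)
```

The number of columns is now capped at `TOP_K` regardless of how many distinct ingredients the data contains. The trade-off is that rows made up only of dropped ingredients become all zeros.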


0 Answers