I have a column in which each row holds a list of ingredients separated by commas. I encoded it with a multi-label one-hot encoding (MultiLabelBinarizer()), but this greatly increased the dimensionality of the dataset.

Is there a more appropriate encoding method for this situation?

My variable looks like this:

df['ingredients_str'].head()

0    romaine lettuce, black olives, grape tomatoes
1    plain flour,ground pepper,salt,tomatoes
2    eggs,pepper,salt,mayonaise,cooking oil
3    water,vegetable oil,wheat,salt
4    black pepper,shallots,cornflour,cayenne
Name: ingredients_str, dtype: object
DataCat
    After splitting the ingredients in your series, count how often each ingredient occurs; you can then map ingredients with a low count to a new category such as "other". This will help with the high dimensionality. – Himanshu Feb 05 '20 at 09:40
    Keep the most frequent categories and skip the rest; for more, see http://proceedings.mlr.press/v7/niculescu09/niculescu09.pdf – Sachin Yadav Feb 15 '20 at 09:51
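
The two suggestions in the comments could be sketched roughly as follows. This is only a minimal illustration using the five sample rows from the question; the threshold `MIN_COUNT` and the variable names are assumptions made for the example, not something given in the original post, and a sensible cutoff would have to be tuned on the real data.

```python
from collections import Counter

import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Sample rows copied from the question.
df = pd.DataFrame({
    "ingredients_str": [
        "romaine lettuce, black olives, grape tomatoes",
        "plain flour,ground pepper,salt,tomatoes",
        "eggs,pepper,salt,mayonaise,cooking oil",
        "water,vegetable oil,wheat,salt",
        "black pepper,shallots,cornflour,cayenne",
    ]
})

MIN_COUNT = 2  # hypothetical threshold (assumption); tune on your own data

# Split on commas and strip stray whitespace around each ingredient.
tokens = df["ingredients_str"].apply(
    lambda s: [t.strip() for t in s.split(",") if t.strip()]
)

# Count in how many rows each ingredient occurs.
counts = Counter(t for row in tokens for t in set(row))

# Option 1: map rare ingredients to a single "other" label.
grouped = tokens.apply(
    lambda row: sorted({t if counts[t] >= MIN_COUNT else "other" for t in row})
)
mlb_other = MultiLabelBinarizer()
encoded_other = pd.DataFrame(
    mlb_other.fit_transform(grouped), columns=mlb_other.classes_, index=df.index
)

# Option 2: keep only the frequent ingredients and drop the rest.
kept = tokens.apply(lambda row: [t for t in row if counts[t] >= MIN_COUNT])
mlb_top = MultiLabelBinarizer()
encoded_top = pd.DataFrame(
    mlb_top.fit_transform(kept), columns=mlb_top.classes_, index=df.index
)

print(encoded_other)
print(encoded_top)
```

With either option the number of columns is bounded by the number of ingredients that pass the threshold (plus one for "other" in the first variant), rather than by the full ingredient vocabulary. On the tiny sample above only "salt" occurs in more than one row, so almost everything collapses; on a real dataset the threshold would normally leave many more columns.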
