I'm stuck on a problem I can't work out. I'm fairly new to machine learning and its community. I'm trying to build a classification model, and here is my problem:

Let's say I have 2 X columns (features; text or integers) and 1 Y column (the target I'm trying to predict).

One of these X columns comes from a dataset that has duplicate rows, but some of the information in the duplicates differs and is important for my work.

Let me give an example:

Product No    Variable 1       Y
1             apple            result1
2             orange           result2
3             banana, apple    result1
4             blueberry        result3
5             banana           result5

As you can see, row 3 contains two values that both matter to me. How can I handle this situation in a classification model? Sorry if it's obvious. I'm new to ML :)

Edit note: the Variable 1 column holds a lot of data, with approximately a thousand distinct values. Of course, my real model doesn't have just one variable; it is already very high-dimensional.

  • This is a multi-label classification situation where it's possible to have one observation with several output classes. Try encoding variable 1 as columns of unique values (apple, orange, banana, blueberry), and product no.3 would be (1,0,1,0) in this case. – Xiaoyu Lu Mar 01 '19 at 19:30
  • Yeah, I had the same idea, but I have thousands of distinct values in that Variable 1 column. So I would have to add an additional thousand columns, which isn't really a solution for my already too high-dimensional model :) Thanks though. – Oğuzhan Alptekin Mar 01 '19 at 19:33
  • You need to apply a dimensionality reduction technique. – singhV Mar 01 '19 at 22:50
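
The two suggestions in the comments can be sketched roughly as follows. This is only a minimal illustration on the five example rows, not a tested recipe for the real dataset. The helper names (split_values, build_vocabulary, multi_hot, hashed_multi_hot) and the bucket count of 8 are made up for the example. The second encoder uses the hashing trick, which is one possible way to cap the number of columns and is swapped in here as an assumption; the comment above does not name a specific reduction technique.

```python
import hashlib


def split_values(cell):
    """Split a comma-separated cell such as 'banana, apple' into clean tokens."""
    return [token.strip().lower() for token in cell.split(",") if token.strip()]


def build_vocabulary(cells):
    """Map each distinct token to a column index, in order of first appearance."""
    vocabulary = {}
    for cell in cells:
        for token in split_values(cell):
            vocabulary.setdefault(token, len(vocabulary))
    return vocabulary


def multi_hot(cell, vocabulary):
    """Return a 0/1 vector with a 1 for every known token present in the cell.

    Raises KeyError for a token that is not in the vocabulary.
    """
    row = [0] * len(vocabulary)
    for token in split_values(cell):
        row[vocabulary[token]] = 1
    return row


def hashed_multi_hot(cell, n_buckets):
    """Hashing trick (an assumption, see lead-in): fixed-width 0/1 vector.

    Each token is hashed into one of n_buckets columns, so the width no
    longer grows with the number of distinct values. Different tokens may
    collide in the same bucket; that is the price of the fixed width.
    """
    row = [0] * n_buckets
    for token in split_values(cell):
        digest = hashlib.md5(token.encode("utf-8")).hexdigest()
        row[int(digest, 16) % n_buckets] = 1
    return row


# The "Variable 1" column from the example table in the question.
variable_1 = ["apple", "orange", "banana, apple", "blueberry", "banana"]

vocabulary = build_vocabulary(variable_1)
encoded = [multi_hot(cell, vocabulary) for cell in variable_1]
print(encoded[2])  # product 3 -> [1, 0, 1, 0]

# Fixed width of 8 columns, chosen arbitrarily for the illustration.
hashed = [hashed_multi_hot(cell, n_buckets=8) for cell in variable_1]
print(len(hashed[2]))  # 8
```

With about a thousand distinct values the dense multi-hot matrix is mostly zeros, so a sparse representation is probably the more practical form. If scikit-learn is an option, MultiLabelBinarizer(sparse_output=True), FeatureHasher and TruncatedSVD cover similar ground; whether any of them suits the real data is something only an experiment can tell.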
