-2

For Example in one classification problem's dataset we have 50 categories so it will be difficult for model to predict these many classes. So to avoid this i want to combine target variable's rows which are having similar kind of feature values.

x1 x2 x3 Y New Y
1 0 1 val1 val_u
1 1 0 val2 val_u
0 0 2 val3 val_a

Here in above example row1 and row2 are similar so their target variable value is replaced with some other name(val_u).

I want to find the similarity between multiple row of a dataset so that classes can be combined(reduced in number) and their Probability distribution should remain the almost same.

One Approach i can think of is apply clustering but not sure about the probability distribution after clustring..

Bhaskar
  • 1
  • 4

1 Answers1

-1

Something like finding the euclidian distance between all rows, and grouping the closest ones might help.