For example, suppose the dataset for a classification problem has 50 target categories, which makes it difficult for a model to predict that many classes. To avoid this, I want to merge the target classes whose rows have similar feature values.
x1 | x2 | x3 | Y | New Y |
---|---|---|---|---|
1 | 0 | 1 | val1 | val_u |
1 | 1 | 0 | val2 | val_u |
0 | 0 | 2 | val3 | val_a |
In the example above, row 1 and row 2 are similar, so their target values are replaced with a shared new label (val_u).
I want to measure the similarity between rows of a dataset so that classes can be combined (reduced in number) while their probability distribution remains almost the same.
One approach I can think of is clustering, but I am not sure what happens to the probability distribution after clustering.
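
To make the idea concrete, here is a minimal sketch of one possible way to do it, not a definitive method. It assumes numeric features, Euclidean distance between per-class mean feature vectors, and a hand-picked `threshold`. The toy data and the helper names (`class_centroids`, `merge_similar_classes`, `label_distribution`) are hypothetical and only for illustration. It merges classes whose centroids are close, then prints the label distribution before and after, so the effect of merging can be inspected directly.

```python
from collections import Counter, defaultdict
from math import dist


def class_centroids(rows, labels):
    """Return the mean feature vector of each original class."""
    grouped = defaultdict(list)
    for features, label in zip(rows, labels):
        grouped[label].append(features)
    return {
        label: tuple(sum(col) / len(members) for col in zip(*members))
        for label, members in grouped.items()
    }


def merge_similar_classes(rows, labels, threshold):
    """Map each original class to a merged class name.

    Two classes end up in the same group if their centroids are within
    `threshold` of each other, directly or through a chain of other
    classes (single-linkage grouping via union-find).
    """
    centroids = class_centroids(rows, labels)
    names = sorted(centroids)
    parent = {name: name for name in names}

    def find(name):
        # Follow parent links to the group representative.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if dist(centroids[a], centroids[b]) <= threshold:
                parent[find(b)] = find(a)

    return {name: "merged_" + find(name) for name in names}


def label_distribution(labels):
    """Relative frequency of each label."""
    counts = Counter(labels)
    return {label: count / len(labels) for label, count in counts.items()}


# Hypothetical toy data: val1 and val2 have nearly identical features.
rows = [(1.0, 0.0, 1.0), (1.0, 0.1, 0.9), (0.9, 0.1, 1.0), (0.0, 0.0, 5.0)]
labels = ["val1", "val1", "val2", "val3"]

mapping = merge_similar_classes(rows, labels, threshold=0.5)
new_labels = [mapping[y] for y in labels]

print(mapping)
print(label_distribution(labels))
print(label_distribution(new_labels))
```

Note that the frequency of a merged class is the sum of the frequencies of the classes it absorbs, so the label distribution necessarily changes when classes are combined. What can be checked is whether the merged classes still behave consistently with respect to the features.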