Find similarity between rows of a dataframe in Python

Question

For example, suppose the dataset for a classification problem has 50 target categories, which makes it difficult for a model to predict that many classes. To avoid this, I want to combine the target values of rows that have similar feature values.

x1	x2	x3	Y	New Y
1	0	1	val1	val_u
1	1	0	val2	val_u
0	0	2	val3	val_a

In the example above, row 1 and row 2 are similar, so their target value is replaced with a new name (val_u).

I want to find the similarity between the rows of a dataset so that classes can be combined (reduced in number) while their probability distribution stays almost the same.

One approach I can think of is to apply clustering, but I am not sure what happens to the probability distribution after clustering.

score -1 · Answer 1 · answered Aug 09 '22 at 12:10


Something like computing the Euclidean distance between all rows and grouping the closest ones might help.

answered Aug 09 '22 at 12:10

Julian Matthews

