I am removing duplicates from a large string input, and I created the cosine similarity matrix shown below.
          0         1         2         3         4
0  1.000000  0.515303  0.741283  0.035133  0.076743
1  0.920776  1.000000  0.153878  0.024261  0.845839
2  0.273931  0.842390  1.000000  0.502877  0.962273
3  0.407020  0.409827  0.096752  1.000000  0.886368
4  0.315340  0.618172  0.335455  0.170406  1.000000
Could someone please help me remove the duplicate rows using a cutoff? For example, if index 0 and index 2 have 74% similarity, I want to keep only 0 (the first one).
For now I have created another data frame using data[data <= 0.6] to limit the similarity to 60%. The output is a data frame in which every value greater than 0.6, including the diagonal, is replaced with NaN.
          0         1         2         3         4
0       NaN  0.515303       NaN  0.035133  0.076743
1       NaN       NaN  0.153878  0.024261       NaN
2  0.273931       NaN       NaN  0.502877       NaN
3  0.407020  0.409827  0.096752       NaN       NaN
4  0.315340       NaN  0.335455  0.170406       NaN
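For reference, the masking step above can be reproduced with the short sketch below. It assumes the matrix is held in a pandas DataFrame named data with default integer labels; the names masked and masked_where are only illustrative.

```python
import pandas as pd

# Similarity matrix copied from the question.
data = pd.DataFrame([
    [1.000000, 0.515303, 0.741283, 0.035133, 0.076743],
    [0.920776, 1.000000, 0.153878, 0.024261, 0.845839],
    [0.273931, 0.842390, 1.000000, 0.502877, 0.962273],
    [0.407020, 0.409827, 0.096752, 1.000000, 0.886368],
    [0.315340, 0.618172, 0.335455, 0.170406, 1.000000],
])

# Boolean indexing keeps values <= 0.6 and puts NaN everywhere else,
# so the diagonal (always 1.0) is masked as well.
masked = data[data <= 0.6]

# Equivalent, more explicit spelling of the same operation.
masked_where = data.where(data <= 0.6)

print(masked)
```

Note that this only hides the high-similarity cells; it does not remove any rows by itself.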
The expected output is the data frame with the duplicate rows dropped:
          0         1         2         3         4
0       NaN  0.515303       NaN  0.035133  0.076743
3  0.407020  0.409827  0.096752       NaN       NaN
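One possible way to get from the similarity matrix to this output is a greedy pass that keeps a row only if it is not too similar to any row already kept. This is a minimal sketch, not a definitive solution: the helper name drop_similar_rows is hypothetical, and because the matrix shown is not symmetric, the sketch assumes that a value above the cutoff in either direction counts as a duplicate.

```python
import pandas as pd

# Similarity matrix copied from the question.
data = pd.DataFrame([
    [1.000000, 0.515303, 0.741283, 0.035133, 0.076743],
    [0.920776, 1.000000, 0.153878, 0.024261, 0.845839],
    [0.273931, 0.842390, 1.000000, 0.502877, 0.962273],
    [0.407020, 0.409827, 0.096752, 1.000000, 0.886368],
    [0.315340, 0.618172, 0.335455, 0.170406, 1.000000],
])


def drop_similar_rows(sim, cutoff=0.6):
    """Return the labels of rows kept by a greedy first-wins pass (hypothetical helper)."""
    kept = []
    for idx in sim.index:
        # Assumption: the matrix is not symmetric, so check both directions
        # against every row that has already been kept.
        too_close = any(
            sim.loc[idx, k] > cutoff or sim.loc[k, idx] > cutoff
            for k in kept
        )
        if not too_close:
            kept.append(idx)
    return kept


kept = drop_similar_rows(data, cutoff=0.6)

# Apply the same <= 0.6 mask as in the question, then keep only the surviving rows.
result = data[data <= 0.6].loc[kept]

print(kept)
print(result)
```

With the matrix above, row 1 is dropped because of its 0.92 similarity to row 0, row 2 because of the 0.74 value in row 0, and row 4 because of the 0.89 value in row 3. The pairwise loop is quadratic in the number of rows, so for a very large input a vectorised or clustering-based approach may be worth considering.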