I am removing duplicates from a large set of input strings, and I created the cosine similarity matrix shown below.

          0         1         2         3         4
0  1.000000  0.515303  0.741283  0.035133  0.076743
1  0.920776  1.000000  0.153878  0.024261  0.845839
2  0.273931  0.842390  1.000000  0.502877  0.962273
3  0.407020  0.409827  0.096752  1.000000  0.886368
4  0.315340  0.618172  0.335455  0.170406  1.000000

Could someone please help me remove the duplicate rows using a cutoff? For example, if indexes 0 and 2 have 74% similarity, I want to keep just 0 (the first one).

For now I have created another data frame using data[data <= 0.6] to limit the similarity to 60%. The output is a data frame where values greater than 0.6, including the diagonal, are replaced with NaN.

          0         1         2         3         4
0       NaN  0.515303       NaN  0.035133  0.076743
1       NaN       NaN  0.153878  0.024261       NaN
2  0.273931       NaN       NaN  0.502877       NaN
3  0.407020  0.409827  0.096752       NaN       NaN
4  0.315340       NaN  0.335455  0.170406       NaN

The expected output is the data frame with the duplicate rows removed, keeping only the first row of each group of similar rows:

          0         1         2         3         4
0       NaN  0.515303       NaN  0.035133  0.076743
3  0.407020  0.409827  0.096752       NaN       NaN
ayhan
Noufal_S

1 Answer

Got it, thanks all for the quick response.

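cutoff = 0.6
l = []  # labels of rows to drop as duplicates
for idx in data.index:
    if idx in l:
        continue  # already a duplicate of an earlier kept row
    # Similar in either direction (the sample matrix is not symmetric).
    similar = (data.loc[idx] > cutoff) | (data.loc[:, idx] > cutoff)
    for value in data.index[similar.values].tolist():
        if value != idx and value not in l:  # skip the diagonal
            l.append(value)
data[data <= cutoff].drop(index=l)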
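
For anyone who wants to try this end to end, here is a minimal, self-contained sketch that rebuilds the sample matrix from the question and keeps the first row of each group of near-duplicates. The helper name `keep_first_of_duplicates` is made up for illustration, and treating a pair as similar when either direction exceeds the cutoff is an assumption on my part (a real cosine similarity matrix would be symmetric, unlike the sample).

```python
import numpy as np
import pandas as pd

# Sample matrix copied from the question (note that it is not symmetric).
data = pd.DataFrame([
    [1.000000, 0.515303, 0.741283, 0.035133, 0.076743],
    [0.920776, 1.000000, 0.153878, 0.024261, 0.845839],
    [0.273931, 0.842390, 1.000000, 0.502877, 0.962273],
    [0.407020, 0.409827, 0.096752, 1.000000, 0.886368],
    [0.315340, 0.618172, 0.335455, 0.170406, 1.000000],
])


def keep_first_of_duplicates(sim, cutoff=0.6):
    """Hypothetical helper: return the labels of the rows to keep.

    Rows are visited in order; a row is kept only if it is not similar
    to any row that was already kept, so the first occurrence wins.
    """
    values = sim.to_numpy()
    # Assumption: a pair counts as similar if either direction exceeds the cutoff.
    similar = np.maximum(values, values.T) > cutoff
    kept = []  # positions of the rows kept so far
    for i in range(len(sim)):
        if not any(similar[i, k] for k in kept):
            kept.append(i)
    return [sim.index[i] for i in kept]


keep = keep_first_of_duplicates(data, cutoff=0.6)
# Mask values above the cutoff with NaN, then keep only the surviving rows.
result = data[data <= 0.6].loc[keep]
print(result)
```

On the sample matrix this keeps rows 0 and 3, which matches the expected output in the question. The loop is quadratic in the number of rows in the worst case, so for a very large input it may need a different approach.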
Noufal_S