Removing duplicates from a cosine similarity matrix in a pandas data frame

Question

I am removing duplicates from a large set of input strings, and I created the cosine similarity matrix shown below.

          0         1         2         3         4
0  1.000000  0.515303  0.741283  0.035133  0.076743
1  0.920776  1.000000  0.153878  0.024261  0.845839
2  0.273931  0.842390  1.000000  0.502877  0.962273
3  0.407020  0.409827  0.096752  1.000000  0.886368
4  0.315340  0.618172  0.335455  0.170406  1.000000

Could someone please help me remove the duplicate rows using a cutoff? For example, if rows 0 and 2 have 74% similarity, I want to keep only row 0 (the first one).

For now I have created another data frame using data[data <= 0.6] to limit the similarity to 60%. The output is a data frame in which every value greater than 0.6, including the diagonal, is replaced by NaN.

          0         1         2         3         4
0       NaN  0.515303       NaN  0.035133  0.076743
1       NaN       NaN  0.153878  0.024261       NaN
2  0.273931       NaN       NaN  0.502877       NaN
3  0.407020  0.409827  0.096752       NaN       NaN
4  0.315340       NaN  0.335455  0.170406       NaN
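
For reference, here is a minimal sketch of what that masking step does, with the matrix above rebuilt by hand. Boolean indexing with `data[data <= 0.6]` keeps the shape of the frame and puts NaN wherever the condition is false, which should be equivalent to `data.where(data <= 0.6)`. The variable names `cutoff`, `masked` and `same` are only illustrative.

```python
import pandas as pd

# Sample similarity matrix copied from the question.
data = pd.DataFrame([
    [1.000000, 0.515303, 0.741283, 0.035133, 0.076743],
    [0.920776, 1.000000, 0.153878, 0.024261, 0.845839],
    [0.273931, 0.842390, 1.000000, 0.502877, 0.962273],
    [0.407020, 0.409827, 0.096752, 1.000000, 0.886368],
    [0.315340, 0.618172, 0.335455, 0.170406, 1.000000],
])

cutoff = 0.6

# Values above the cutoff (including the 1.0 diagonal) become NaN.
masked = data[data <= cutoff]

# The same mask written with DataFrame.where.
same = masked.equals(data.where(data <= cutoff))

# Number of masked (NaN) cells in each row.
print(masked.isna().sum(axis=1).tolist())  # prints [2, 3, 3, 2, 2]
```

Because the diagonal is masked as well, every row contains at least one NaN, so any later step that reads the NaN positions has to skip the diagonal.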

The expected output is the data frame with the duplicate rows dropped, so that only rows 0 and 3 remain:

          0         1         2         3         4
0       NaN  0.515303       NaN  0.035133  0.076743
3  0.407020  0.409827  0.096752       NaN       NaN

Since the value at [0, 2] is 0.74, I want to drop row 2 from further processing — Noufal_S, Dec 12 '18 at 06:33
Check my answer. I understand replacing values with NaN using the `cutoff`, but I am not sure how you get the final two-row DataFrame — jezrael, Dec 12 '18 at 06:36
I don't understand, sorry. Can you explain more? E.g. why was row 3 not removed? — jezrael, Dec 12 '18 at 08:52
When iterating through row 0, column 2 is NaN, hence remove row 2 — Noufal_S, Dec 12 '18 at 09:00
And iterating through row 1 will remove row 0 and row 4, because both of those values are NaN — Noufal_S, Dec 12 '18 at 09:01
Finally, iterate through row 3 (row 2 is already removed) to remove row 4 (which is actually no longer available) — Noufal_S, Dec 12 '18 at 09:02
Ignore the diagonal values. I can replace them with zero, because I don't want an iteration to remove the row itself — Noufal_S, Dec 12 '18 at 09:04

score 0 · Answer 1 · answered Dec 12 '18 at 12:33

Got it, thanks all for the quick response.

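cutoff = 0.6
masked = data[data <= cutoff]  # similarities above the cutoff become NaN
dropped = []  # labels of rows flagged as duplicates
for idx, row in masked.iterrows():
    if idx in dropped:
        continue  # a row that was already removed must not remove others
    for col in data.columns[row.isnull()]:
        if col != idx and col not in dropped:  # skip the diagonal
            dropped.append(col)
data = data.drop(index=dropped)  # drop() returns a new frame, so keep the result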
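
The loop above looks only along each row, and the sample matrix in the question is not symmetric ([0, 1] is 0.515 but [1, 0] is 0.921), so a row-wise pass may not land on the expected rows 0 and 3. Below is a minimal sketch of one variant that does, assuming a pair counts as similar when either direction exceeds the cutoff and that the first row of each similar group is kept. `drop_similar_rows` is a hypothetical helper name, not a pandas function.

```python
import pandas as pd

# Sample similarity matrix copied from the question (not symmetric).
data = pd.DataFrame([
    [1.000000, 0.515303, 0.741283, 0.035133, 0.076743],
    [0.920776, 1.000000, 0.153878, 0.024261, 0.845839],
    [0.273931, 0.842390, 1.000000, 0.502877, 0.962273],
    [0.407020, 0.409827, 0.096752, 1.000000, 0.886368],
    [0.315340, 0.618172, 0.335455, 0.170406, 1.000000],
])


def drop_similar_rows(sim, cutoff=0.6):
    """Keep the first row of each similar group (hypothetical helper).

    Assumes `sim` is square with identical row and column labels. A pair
    counts as similar if either sim[i, j] or sim[j, i] exceeds `cutoff`.
    """
    too_close = (sim > cutoff) | (sim.T > cutoff)
    dropped = set()
    for label in sim.index:
        if label in dropped:
            continue  # already removed, so it cannot remove other rows
        for other in sim.columns[too_close.loc[label].to_numpy()]:
            if other != label:  # ignore the diagonal
                dropped.add(other)
    return sim.drop(index=list(dropped))


deduped = drop_similar_rows(data)
print(deduped.index.tolist())  # prints [0, 3]
```

The result holds the original similarity values. Applying `deduped[deduped <= 0.6]` afterwards gives the NaN-masked view shown in the expected output.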

Noufal_S
