Drop similar rows in a dataframe that match by more than 80%

Question

I have the following problem.

I have the following dataframe structure:

Headline	Text
First	text 1
Second	text

"Headline" contains the titles of news stories; "Text" contains the body of each article.

Some of the texts are similar, but not 100% identical. For every pair of texts in the dataframe whose similarity is over 80%, I want to drop one of the two articles; which one does not matter.

I searched the web for a library but did not find what I was looking for. Does anyone in the community have an idea or know of a suitable library?

https://stackoverflow.com/questions/59624798/drop-similar-text-rows-of-one-column-in-python — Danila Musaev, Apr 27 '22 at 14:38

score 0 · Accepted Answer · answered Apr 28 '22 at 04:03

You can try thefuzz, a fuzzy string matching library.

# import
from thefuzz import fuzz
# use fuzz.ratio() to check for similarity
fuzz.ratio("this is a test", "this is a test!")
# this outputs 97
# you can set a threshold within a range of 0 to 100, where 100 is an exact match
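
To apply a threshold like this across every pair of texts, one option is a greedy pass that keeps a text only if it is not too similar to any text already kept. The sketch below is only an illustration under stated assumptions: it uses `difflib.SequenceMatcher` from the standard library as a stand-in scorer so that it runs without extra installs (swapping in `fuzz.ratio` should work the same way, though the scores may differ slightly), and the names `similarity` and `indices_to_keep` as well as the sample texts are made up for this example.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a similarity score on a 0-100 scale (stand-in for fuzz.ratio)."""
    return 100 * SequenceMatcher(None, a, b).ratio()


def indices_to_keep(texts, threshold=80):
    """Greedy pass: keep a text only if its similarity to every
    already-kept text is at or below the threshold."""
    kept = []
    for i, text in enumerate(texts):
        if all(similarity(text, texts[j]) <= threshold for j in kept):
            kept.append(i)
    return kept


# Hypothetical sample data: the first two texts are near-duplicates.
texts = [
    "Stocks rise as markets rally on strong earnings",
    "Stocks rise as markets rally on strong earnings!",
    "Quiet village opens new public library",
]
keep = indices_to_keep(texts)
print(keep)
```

With a dataframe `df` that has a "Text" column, the surviving rows could then be selected with something like `df.iloc[indices_to_keep(df["Text"].tolist())]`. Note that this compares each text against every kept text, so the cost grows roughly quadratically with the number of rows, and the first article of each similar pair is the one that survives.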

greco
