Following problem

I have the following dataframe structure:

Headline Text
First text 1
Second text

"Headline" contains the titles of news stories; "Text" contains the body of each article.

Some of the articles in "Text" are similar, but not 100% identical. For every pair of texts in the dataframe, I want to drop one of the two articles whenever their texts are more than 80% similar. Which of the two gets dropped does not matter.

I searched the web for a library but did not find what I was looking for. Does anyone in the community have an idea or know of a suitable library?

dgho

1 Answer

You can try thefuzz.

# import
from thefuzz import fuzz
# use fuzz.ratio() to check for similarity
fuzz.ratio("this is a test", "this is a test!")
# this outputs 97
# you can set a threshold within a range of 0 to 100, where 100 is an exact match
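
To apply such a threshold to every pair of texts, one possible approach is a greedy pass that keeps a text only if it is not too similar to any text already kept. The sketch below is an assumption about how this could be done, not a tested recipe for your data: `similarity` and `keep_indices` are hypothetical helper names, and the score comes from the standard library's `difflib.SequenceMatcher` so the example runs without extra installs. Its values are comparable to `fuzz.ratio()` but may not be identical.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a 0-100 similarity score (stdlib stand-in for fuzz.ratio)."""
    return 100 * SequenceMatcher(None, a, b).ratio()


def keep_indices(texts, threshold=80, score=similarity):
    """Return the positions of the texts to keep.

    A text is dropped when its score against an already kept text
    exceeds the threshold, so the first of two similar texts survives.
    """
    kept = []
    for i, text in enumerate(texts):
        # Compare only against texts that have been kept so far.
        if all(score(text, texts[j]) <= threshold for j in kept):
            kept.append(i)
    return kept


# Made-up example texts, for illustration only.
texts = [
    "The council approved the new budget on Monday.",
    "The council approved the new budget on Monday!",
    "A storm is expected to reach the coast tonight.",
]
kept = keep_indices(texts)  # the second text is a near-duplicate of the first
```

With a dataframe `df` shaped like the one in the question, something like `df.iloc[keep_indices(df["Text"].tolist())]` should give the surviving rows, and passing `score=fuzz.ratio` would switch the comparison to thefuzz. The number of comparisons grows quadratically with the number of rows, so this may be slow on large dataframes.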
greco