I have the following problem. My dataframe has this structure:
Headline | Text |
---|---|
First | text 1 |
Second | text |
"Headline" contains the titles of news stories, and "Text" contains the body of each article.
Some of the articles have similar, but not identical, text. For every pair of texts in the dataframe, if their similarity is above 80%, I want to drop one of the two articles. It does not matter which one.
I searched the web for a library that does this, but I did not find what I was looking for. Does anyone in the community have an idea or know of a suitable library?
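
To make the requirement concrete, here is a minimal sketch of the pairwise filter described above. It is one possible approach, not an established solution: it assumes "similarity" means the character-level ratio from the standard library's `difflib.SequenceMatcher`, and the helper name `keep_mask`, the `threshold` parameter, and the sample sentences are made up for illustration. It keeps the first article of each similar group and drops the later ones.

```python
from difflib import SequenceMatcher


def keep_mask(texts, threshold=0.8):
    """Return one boolean per text: True to keep, False to drop as a near-duplicate.

    Hypothetical helper: a text is dropped when its similarity ratio to any
    previously kept text is above `threshold`.
    """
    kept = []  # texts that survived so far
    mask = []
    for text in texts:
        is_duplicate = any(
            # autojunk=False disables difflib's heuristic for long sequences
            SequenceMatcher(None, text, other, autojunk=False).ratio() > threshold
            for other in kept
        )
        mask.append(not is_duplicate)
        if not is_duplicate:
            kept.append(text)
    return mask


# Made-up sample texts standing in for the "Text" column
articles = [
    "The council approved the new budget on Monday evening.",
    "The council approved the new budget on Monday night.",
    "Heavy rain is expected across the region this weekend.",
]
mask = keep_mask(articles)
```

With a dataframe shaped like the one above, the mask could be applied as `df = df[keep_mask(df["Text"])]`. This compares every text against every kept text, so the cost grows quadratically with the number of rows. For a large dataframe, a vectorized measure such as TF-IDF with cosine similarity may be a better fit, though the 80% threshold would then mean something different.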