Print Rows that are "Near Duplicates" in Pandas DataFrame

Question

I'm working on a personal project that performs Web Scraping on multiple databases of research articles (thus far I have done PubMed and Scopus) and extracts the titles of the articles. I've actually managed to pull this off on my own without problem. Next, I've combined the list of articles into a Pandas DataFrame with two columns: Article and Database. I wanted to remove duplicates across the two article databases and used df = df.drop_duplicates(subset='Article') to remove exact matches.

BUT, what if the articles are "near matches", that is, perhaps a word in the title was misspelled, or there is an extra blank space somewhere in the title (not at the end, I've proofed using lstrip() and rstrip()).

I have explored string matching in the past using SequenceMatcher from difflib, but never in a DataFrame. So, my question is, how would I code the following conditional so that I can review the near similar values:

"if row in df['Article'] is 95% similar to another row in df['Article'], print both rows."

I started doing some testing using separate columns like such:

letters1 = ['a','b','c','a','b']
letters2 = ['c','b','a','a','c']
numbers = [1,2,3,4,5]

data = {'Letters1':letters,
        'Letters2':letters2,
        'Numbers':numbers}

test = pd.DataFrame(data)
test['result'] = ''

for i in test['Letters1'].index:
    if SequenceMatcher(None, test['Letters1'], test['Letters2']).ratio() > 0:
        test['result'] = 'True'
    else:
        test['result'] = 'False'

test.head()

However, I'm already not getting the desired results and thought to seek help here first. Any suggestions? To reiterate, I don't want to use two columns ultimately, I am just using the example code block above to start testing how to do this.

Timus · Accepted Answer · 2023-03-07T14:58:42.393

The unexpected result in your code is due to using whole columns instead of items. You can fix that for example by using the .at accessor

for i in test.index:
    if SequenceMatcher(None, test.at[i, 'Letters1'], test.at[i, 'Letters2']).ratio() > 0:
        test.at[i, 'result'] = True
    else:
        test.at[i, 'result'] = False

or more compact by

test["result"] = test.apply(
    lambda r: SequenceMatcher(None, r.at['Letters1'], r.at['Letters2']).ratio() > 0,
    axis=1
)

Result for the sample:

  Letters1 Letters2  Numbers result
0        a        c        1  False
1        b        b        2   True
2        c        a        3  False
3        a        a        4   True
4        b        c        5  False

As an alternative you could do something like:

from itertools import combinations

# Sample dataframe
df = pd.DataFrame({'Letters': ['a', 'b', 'c', 'a', 'b']})

for i, j in combinations(df.index, r=2):
    txt1, txt2 = df.at[i, "Letters"], df.at[j, "Letters"]
    if SequenceMatcher(None, txt1, txt2).ratio() > 0:
        print((i, txt1), (j, txt2))

Output:

(0, 'a') (3, 'a')
(1, 'b') (4, 'b')

Thank you so much, this is exactly what I wanted! I really appreciate how you showed the error in my test example as well so I could understand the issue before answering the final question. Much appreciated! — TrevorM, Mar 08 '23 at 20:11

Print Rows that are "Near Duplicates" in Pandas DataFrame

1 Answers1