
I have a pandas DataFrame (`df` below, with pandas imported as `pd`). I am attempting to use a nested for-loop to iterate through each row of the DataFrame and, at each iteration, compare that row with every other row in the frame. During the comparison step I use Python's difflib.SequenceMatcher().ratio() and drop rows that have a high similarity (ratio > 0.8).

Problem: Unfortunately, I am getting a KeyError after the first outer-loop iteration.

I suspect that, by dropping rows, I am invalidating the outer loop's indexer, or that the inner loop's indexer is invalidated when it tries to access an element that no longer exists (because it was dropped).
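
A minimal sketch of what I suspect is happening (toy data standing in for the tweets): once a row label has been dropped, looking it up again raises the same KeyError.

import pandas as pd

# Toy frame standing in for the tweet data (hypothetical values).
df = pd.DataFrame({'text': ['a', 'b', 'c']})

df.drop(1, inplace=True)   # row label 1 no longer exists in the frame
try:
    df['text'][1]          # same access pattern as the inner loop
except KeyError as e:
    print('KeyError:', e)  # → KeyError: 1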

Here is the code:

import json
import pandas as pd
import pyreadline
import pprint
from difflib import SequenceMatcher

# Note, this file, 'tweetsR.json', was originally csv, but has been translated to json.

with open("twitter data/tweetsR.json", "r") as read_file:
    data = json.load(read_file)  # Load the source data set, esport tweets.

df = pd.DataFrame(data)  # Load data into a pandas DataFrame for pandas utilities.
df = df.drop_duplicates(['text'], keep='first')  # Drop tweets with identical text content.
# Note: these tweets are likely reposts/retweets, etc.
df = df.reset_index(drop=True)  # Adjust the index to reflect dropping of duplicates.

def similar(a, b):
    # Similarity measure described above.
    return SequenceMatcher(None, a, b).ratio()

def duplicates(df):
    for ind in df.index:
        a = df['text'][ind]
        for indd in df.index:
            if indd != 26747: # Trying to prevent an overstep KeyError here
                b = df['text'][indd+1]
                if similar(a, b) >= 0.80:
                    df.drop((indd+1), inplace=True)
        print(str(ind) + " completed") # Debugging statement, tells us which iterations have completed

duplicates(df)

Error output: a KeyError traceback (posted as a screenshot, not reproduced here).

Can anyone help me understand this and/or fix it?

Matthew E. Miller

1 Answer


One solution, mentioned by @KazuyaHatta, is itertools.combinations(). As I've used it (there may be a better way), it's O(n^2), so with roughly 27,000 rows that's 357,714,378 combinations to iterate over — too long.
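
For reference, the pair count works out as n*(n-1)/2 (here assuming 26,748 rows survive the deduplication, i.e. the index runs 0..26747, matching the hard-coded bound in the question):

import itertools
import math

n = 26748
print(math.comb(n, 2))  # → 357714378 unordered pairs to compare

# On a small index, the pairs combinations() generates look like:
print(list(itertools.combinations(range(4), 2)))
# → [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]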

Here is the code:

import itertools

# Create a set of the dropped tuples and run this code on bizon overnight.
def duplicates(df):
    # Find out how to improve the speed of this
    excludes = set()
    combos = itertools.combinations(df.index, 2)
    for combo in combos:
        if combo not in excludes:  # Store tuples so this membership test actually matches.
            if similar(df['text'][combo[0]], df['text'][combo[1]]) > 0.8:
                excludes.add((combo[0], combo[1]))
                excludes.add((combo[1], combo[0]))
                print("Dropped: " + str(combo))
                print(len(excludes))

duplicates(df)

My next step, which @KazuyaHatta described, is to attempt the dropping-by-mask method.

Note: I unfortunately won't be able to post a sample of the dataset.
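
For what it's worth, here is a sketch of the dropping-by-mask idea on toy data (the function name and example texts are illustrative, not from my dataset): mark labels to drop as you go, never mutating the index mid-loop, then select the survivors with a boolean mask in one shot.

import itertools
from difflib import SequenceMatcher

import pandas as pd

def similar(a, b):
    return SequenceMatcher(None, a, b).ratio()

def drop_near_duplicates(df, threshold=0.8):
    # Record labels to drop instead of calling drop() inside the loop,
    # so the index the loop iterates over is never invalidated.
    dropped = set()
    for i, j in itertools.combinations(df.index, 2):
        if i in dropped or j in dropped:
            continue
        if similar(df['text'][i], df['text'][j]) > threshold:
            dropped.add(j)
    mask = ~df.index.isin(list(dropped))  # boolean mask of rows to keep
    return df[mask].reset_index(drop=True)

# Example on hypothetical tweet texts:
toy = pd.DataFrame({'text': ['good game today', 'good game todayy', 'totally different']})
print(drop_near_duplicates(toy)['text'].tolist())
# → ['good game today', 'totally different']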

Matthew E. Miller