
I have a pandas DataFrame (`df` below, with pandas imported as `pd`). I am attempting to use a nested for-loop to iterate through each row of the DataFrame and, at each iteration, compare that row with every other row in the frame. During the comparison step I use Python's difflib.SequenceMatcher().ratio() and drop rows that have a high similarity (ratio > 0.8).

Problem: Unfortunately, I am getting a KeyError after the first outer-loop iteration.

I suspect that, by dropping rows, I am invalidating the outer loop's indexer, or that the inner loop's indexer is invalidated when it tries to access an element that no longer exists (because it was dropped).
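
A minimal sketch of what I suspect is happening (toy data standing in for the tweets): once a row label has been dropped, looking it up again raises the same KeyError.

import pandas as pd

# Toy frame standing in for the tweet data (hypothetical values).
df = pd.DataFrame({'text': ['a', 'b', 'c']})

df.drop(1, inplace=True)   # row label 1 no longer exists in the frame
try:
    df['text'][1]          # same access pattern as the inner loop
except KeyError as e:
    print('KeyError:', e)  # → KeyError: 1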

Here is the code:

import json
import pandas as pd
import pyreadline
import pprint
from difflib import SequenceMatcher

# Note, this file, 'tweetsR.json', was originally csv, but has been translated to json.

with open("twitter data/tweetsR.json", "r") as read_file:
    data = json.load(read_file)  # Load the source data set, esport tweets.

df = pd.DataFrame(data)  # Load data into a pandas DataFrame for pandas utilities.
df = df.drop_duplicates(['text'], keep='first')  # Drop tweets with identical text content.
# Note: these tweets are likely reposts/retweets, etc.
df = df.reset_index(drop=True)  # Adjust the index to reflect dropping of duplicates.

def similar(a, b):
    # Similarity measure described above.
    return SequenceMatcher(None, a, b).ratio()

def duplicates(df):
    for ind in df.index:
        a = df['text'][ind]
        for indd in df.index:
            if indd != 26747: # Trying to prevent an overstep KeyError here
                b = df['text'][indd+1]
                if similar(a, b) >= 0.80:
                    df.drop((indd+1), inplace=True)
        print(str(ind) + " completed") # Debugging statement, tells us which iterations have completed

duplicates(df)

Error output: a KeyError traceback (posted as a screenshot, not reproduced here).

Can anyone help me understand this and/or fix it?

Matthew E. Miller

1 Answer


One solution, mentioned by @KazuyaHatta, is itertools.combinations(). As I've used it (there may be a better way), it's O(n^2), so with roughly 27,000 rows that's 357,714,378 combinations to iterate over — too long.
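
For reference, the pair count works out as n*(n-1)/2 (here assuming 26,748 rows survive the deduplication, i.e. the index runs 0..26747, matching the hard-coded bound in the question):

import itertools
import math

n = 26748
print(math.comb(n, 2))  # → 357714378 unordered pairs to compare

# On a small index, the pairs combinations() generates look like:
print(list(itertools.combinations(range(4), 2)))
# → [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]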

Here is the code:

import itertools

# Create a set of the dropped tuples and run this code on bizon overnight.
def duplicates(df):
    # Find out how to improve the speed of this
    excludes = set()
    combos = itertools.combinations(df.index, 2)
    for combo in combos:
        if combo not in excludes:  # Store tuples so this membership test actually matches.
            if similar(df['text'][combo[0]], df['text'][combo[1]]) > 0.8:
                excludes.add((combo[0], combo[1]))
                excludes.add((combo[1], combo[0]))
                print("Dropped: " + str(combo))
                print(len(excludes))

duplicates(df)

My next step, which @KazuyaHatta described, is to attempt the dropping-by-mask method.

Note: I unfortunately won't be able to post a sample of the dataset.
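
For what it's worth, here is a sketch of the dropping-by-mask idea on toy data (the function name and example texts are illustrative, not from my dataset): mark labels to drop as you go, never mutating the index mid-loop, then select the survivors with a boolean mask in one shot.

import itertools
from difflib import SequenceMatcher

import pandas as pd

def similar(a, b):
    return SequenceMatcher(None, a, b).ratio()

def drop_near_duplicates(df, threshold=0.8):
    # Record labels to drop instead of calling drop() inside the loop,
    # so the index the loop iterates over is never invalidated.
    dropped = set()
    for i, j in itertools.combinations(df.index, 2):
        if i in dropped or j in dropped:
            continue
        if similar(df['text'][i], df['text'][j]) > threshold:
            dropped.add(j)
    mask = ~df.index.isin(list(dropped))  # boolean mask of rows to keep
    return df[mask].reset_index(drop=True)

# Example on hypothetical tweet texts:
toy = pd.DataFrame({'text': ['good game today', 'good game todayy', 'totally different']})
print(drop_near_duplicates(toy)['text'].tolist())
# → ['good game today', 'totally different']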

Matthew E. Miller