
I have a very long list of tweets (more than 50k) stored in a Python list. I'm comparing every item against the rest with difflib to find similar tweets, so that I can remove those that are more than 75% similar and keep just one tweet from each similar group. I used itertools.combinations to loop over all pairs, but it takes a very long time (i.e. days). Here is my code:

import pandas as pd
from difflib import SequenceMatcher
import itertools
import re
import time


def similar(a, b):
    # Similarity ratio between two strings, from 0.0 to 1.0.
    return SequenceMatcher(None, a, b).ratio()

df1 = pd.read_csv("50k_TweetSheet.csv")
data = df1['text'].tolist()

originalData = data[:]  # copy before cleaning; plain assignment would only alias the list
outList = []

data[:] = [re.sub(r"http\S+", "", s) for s in data]   # strip URLs
data[:] = [re.sub(r"@\S+", "", s) for s in data]      # strip @mentions
data[:] = [re.sub(r"RT|rt\S+", "", s) for s in data]  # strip retweet markers
data[:] = [s.replace('\r', ' ') for s in data]        # str.replace is literal, so no '+'
data[:] = [s.replace('\n', ' ') for s in data]
data[:] = [re.sub(r' +', ' ', s) for s in data]       # collapsing space runs needs a regex


numOfRows = len(data)

start_time = time.time()
# Compare every pair; mark the shorter tweet of any pair that is more
# than 75% similar for removal.
for a, b in itertools.combinations(range(numOfRows), 2):
    if len(data[a].split()) < 4: continue  # ignore very short tweets
    if a in outList: continue              # already marked for removal
    similarity = similar(data[a], data[b])
    if similarity > 0.75:
        if len(data[a].split()) > len(data[b].split()):
            outList.append(b)
            print(data[a])
        else:
            outList.append(a)
            print(data[b])

Is there a faster way to do so?

  • It seems you only want to take into consideration tweets with at least four words. You should remove non-qualifying entries before you create the combinations. And make `outList` a set: lookup is O(1) for a set instead of O(n) for a list (see the first sketch below). – Mr. T Feb 24 '18 at 11:39
  • Iterate through `data` once, carrying out all the `sub` and `replace` actions in a single pass. Also have a look at `re.compile` and use the compiled patterns in `sub` for a potential performance boost (see the second sketch below). – match Feb 24 '18 at 12:37
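
A minimal sketch of the first comment's suggestions, keeping the question's 0.75 threshold and four-word minimum (the `dedupe` helper name is hypothetical): filter out the short tweets before forming the combinations, and track discarded indices in a set so membership tests are O(1). It also skips pairs whose second index is already discarded, a check the original loop omits.

from difflib import SequenceMatcher
import itertools

def similar(a, b):
    return SequenceMatcher(None, a, b).ratio()

def dedupe(tweets, threshold=0.75, min_words=4):
    # Remove non-qualifying entries *before* creating the combinations.
    candidates = [t for t in tweets if len(t.split()) >= min_words]
    discarded = set()  # O(1) membership tests, unlike a list's O(n)
    for a, b in itertools.combinations(range(len(candidates)), 2):
        if a in discarded or b in discarded:
            continue
        if similar(candidates[a], candidates[b]) > threshold:
            # Keep the longer tweet, as in the question's loop.
            if len(candidates[a].split()) > len(candidates[b].split()):
                discarded.add(b)
            else:
                discarded.add(a)
    return [t for i, t in enumerate(candidates) if i not in discarded]

Note this still makes O(n²) ratio calls; the pre-filter and the set only cut the number of candidates and the per-pair overhead.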

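And a minimal sketch of the second comment's idea: compile each pattern once with `re.compile` and clean every tweet in a single pass, instead of rebuilding the list six times. The pattern names are illustrative; the regexes mirror the question's, with the whitespace fixes folded into one pattern.

import re

URL_RE        = re.compile(r"http\S+")
MENTION_RE    = re.compile(r"@\S+")
RETWEET_RE    = re.compile(r"RT|rt\S+")
WHITESPACE_RE = re.compile(r"[\r\n ]+")  # newlines and runs of spaces

def clean(tweet):
    # One pass per tweet, reusing the precompiled patterns.
    tweet = URL_RE.sub("", tweet)
    tweet = MENTION_RE.sub("", tweet)
    tweet = RETWEET_RE.sub("", tweet)
    return WHITESPACE_RE.sub(" ", tweet).strip()

data = [clean(s) for s in data]
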
0 Answers