I have a very long list of tweets (more than 50k) stored in a Python list. I am comparing every item against all the others to find near-duplicates with difflib, so that I can drop tweets that are more than 75% similar and keep just one from each group. I used itertools.combinations to loop over all the pairs, but it takes a very long time (i.e. days); with 50k tweets that is 50000 * 49999 / 2 ≈ 1.25 billion comparisons.
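To make the threshold concrete: similar() below returns SequenceMatcher's ratio, which is 1.0 for identical strings and falls toward 0.0 as they diverge. A quick check (the two strings here are just made-up samples):

from difflib import SequenceMatcher

# two strings sharing most of their text score above the 0.75 cutoff
SequenceMatcher(None, "the quick brown fox", "the quick brown cat").ratio()  # ≈ 0.84

Here is my code: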
import pandas as pd
from difflib import SequenceMatcher
import itertools
import re
import time

def similar(a, b):
    return SequenceMatcher(None, a, b).ratio()

df1 = pd.read_csv("50k_TweetSheet.csv")
data = df1['text'].tolist()
originalData = list(data)  # take a real copy; a plain assignment would change along with data below
outList = []

# strip URLs, @mentions, and retweet markers
data[:] = [re.sub(r"http\S+", "", s) for s in data]
data[:] = [re.sub(r"@\S+", "", s) for s in data]
data[:] = [re.sub(r"RT|rt\S+", "", s) for s in data]
# collapse line breaks and repeated spaces (str.replace treats '\r+' etc.
# as literal text, so these have to be re.sub calls to actually match)
data[:] = [re.sub(r"\r+", " ", s) for s in data]
data[:] = [re.sub(r"\n+", " ", s) for s in data]
data[:] = [re.sub(r" +", " ", s) for s in data]

numOfRows = len(data)
start_time = time.time()

for a, b in itertools.combinations(range(numOfRows), 2):
    if len(data[a].split()) < 4:  # ignore very short tweets
        continue
    if a in outList:  # already marked as a duplicate
        continue
    similarity = similar(data[a], data[b])
    if similarity > 0.75:
        # keep the longer of the two tweets, mark the other for removal
        if len(data[a].split()) > len(data[b].split()):
            outList.append(b)
            print(data[a])
        else:
            outList.append(a)
            print(data[b])
Is there a faster way to do so?