I have a CSV file with roughly 50K rows of search engine queries. Some of the queries are the same, just in a different word order, for example "query A this is" and "this is query A".
I've tested fuzzywuzzy's token_sort_ratio function to find queries that match apart from word order, and it works well. However, I'm struggling with the runtime of the nested loop and am looking for optimisation tips.
Currently the nested for loops take around 60 hours to run on my machine. Does anyone know how I might speed this up?
Code below:
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
import pandas as pd
from tqdm import tqdm

filePath = '/content/queries.csv'
df = pd.read_csv(filePath)
table1 = df['keyword'].to_list()
table2 = df['keyword'].to_list()
data = []
for kw_t1 in tqdm(table1):
    for kw_t2 in table2:
        score = fuzz.token_sort_ratio(kw_t1, kw_t2)
        if score == 100 and kw_t1 != kw_t2:
            data += [[kw_t1, kw_t2, score]]
data_df = pd.DataFrame(data, columns=['query', 'queryComparison', 'score'])
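One direction that might avoid the nested loop entirely: since only a score of exactly 100 is kept, the check should be roughly equivalent to asking whether two queries have the same tokens once they are lowercased, stripped of punctuation and sorted. The sketch below groups queries by such a key in a single pass. The helper names `sorted_token_key` and `find_reordered_pairs` are made up for illustration, and the normalisation only approximates what fuzzywuzzy does internally, so results may differ on edge cases such as non-ASCII characters or very long strings.

```python
import re
from collections import defaultdict
from itertools import permutations


def sorted_token_key(query):
    """Lowercase the query, treat non-alphanumeric runs as spaces, sort the tokens.

    This is an approximation of fuzzywuzzy's preprocessing, not a copy of it.
    """
    cleaned = re.sub(r"[^a-z0-9]+", " ", query.lower())
    return " ".join(sorted(cleaned.split()))


def find_reordered_pairs(queries):
    """Return ordered pairs (a, b) of distinct queries sharing the same sorted-token key."""
    groups = defaultdict(list)
    # dict.fromkeys drops exact duplicates while preserving the original order
    for query in dict.fromkeys(queries):
        key = sorted_token_key(query)
        if key:  # skip queries with no alphanumeric tokens at all
            groups[key].append(query)

    pairs = []
    for variants in groups.values():
        if len(variants) > 1:
            # both (a, b) and (b, a) are emitted, as in the nested loop
            pairs.extend(permutations(variants, 2))
    return pairs


if __name__ == "__main__":
    sample = ["query A this is ", "this is query A", "something else"]
    for query, comparison in find_reordered_pairs(sample):
        print(repr(query), "<->", repr(comparison))
```

With the data above, `find_reordered_pairs(df['keyword'].to_list())` would give the pairs to load into a DataFrame. Unlike the nested loop, exact duplicate rows are collapsed first, so repeated rows in the CSV do not produce repeated pairs.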
Any advice would be appreciated.
Thanks!