I have this code that functions properly and produces the result I am looking for:
```python
from thefuzz import fuzz
import pandas as pd

df = pd.read_csv('/folder/folder/2011_05-rc.csv', dtype=str, lineterminator='\n')

# Pairwise matrix of partial_ratio scores: cell [i, j] compares text j to text i
df_compare = pd.DataFrame(
    df['text'].apply(lambda row: [fuzz.partial_ratio(x, row) for x in df['text']]).to_list())

# Zero out the diagonal and upper triangle so each pair is counted only once
for i in df_compare.index:
    for j in df_compare.columns[i:]:
        df_compare.iloc[i, j] = 0

# Keep only rows whose best match against any earlier row is below 75
df[df_compare.max(axis=1) < 75].to_csv('/folder/folder/2011_05-ready.csv', index=False)
print('Done did')
```
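As an aside, I realise the double loop that zeroes out the matrix could itself be vectorised; here is a small numpy sketch that I believe is equivalent (though the dominant cost is still the fuzzy matching itself, not this loop):

```python
import numpy as np

# Assumed equivalent to the double loop above: build a boolean mask over
# the diagonal and upper triangle, then zero those cells in one step
mask = np.triu(np.ones(df_compare.shape, dtype=bool))
df_compare = df_compare.mask(mask, 0)
```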
However, since this compares every row against every other row, the number of string comparisons grows quadratically, so the script is very slow and only works on relatively small CSV files with 5,000-7,000 rows. Anything larger (over 12 MB) runs for days before failing with a memory-related error. I attempted running it with modin on 32 cores and 32 GB of memory, but it did not change anything and I ended up with the same result:
```python
import glob

from thefuzz import fuzz
import modin.pandas as pd

files = glob.glob('/folder/folder/2013/*.csv')
for file in files:
    df = pd.read_csv(file, dtype=str, lineterminator='\n')
    df_compare = pd.DataFrame(
        df['text'].apply(lambda row: [fuzz.partial_ratio(x, row) for x in df['text']]).to_list())
    for i in df_compare.index:
        for j in df_compare.columns[i:]:
            df_compare.iloc[i, j] = 0
    df[df_compare.max(axis=1) < 75].to_csv(f'{file[:-4]}-done.csv', index=False)
    print(f'{file} has been done')
```
It works on smaller files when run as a separate job, but there are too many files to process them all separately. Is there a way to optimise this code, or some other possible solution?
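One direction I have been looking at, but have not yet tested at scale, is rapidfuzz (which, as I understand it, thefuzz wraps internally): its process.cdist computes the whole similarity matrix in C++ with multithreading. A minimal sketch of what I mean, assuming the same 75 cutoff:

```python
import numpy as np
import pandas as pd
from rapidfuzz import fuzz, process

df = pd.read_csv('/folder/folder/2011_05-rc.csv', dtype=str, lineterminator='\n')
texts = df['text'].tolist()

# Whole pairwise matrix in one multithreaded call; uint8 keeps it at n*n bytes
scores = process.cdist(texts, texts, scorer=fuzz.partial_ratio,
                       dtype=np.uint8, workers=-1)

# Same effect as my double loop: ignore the diagonal and upper triangle
scores[np.triu_indices_from(scores)] = 0
df[scores.max(axis=1) < 75].to_csv('/folder/folder/2011_05-ready.csv', index=False)
```

This still materialises the full n x n matrix, though, so I am not sure it would avoid the memory error on the larger files.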
The data is a collection of tweets, and only one column is being compared (out of around 30 columns). It looks like this:
| ID     | Text                              |
|--------|-----------------------------------|
| 11213  | I am going to the cinema          |
| 23213  | Black is my favourite colour      |
| 35455  | I am going to the cinema with you |
| 421323 | My friends think I am a good guy. |
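In case it helps, I also sketched a chunked variant of the rapidfuzz idea above, so the full n x n matrix is never held in memory at once; max_prior_similarity is my own helper name, and I have only reasoned it through on paper:

```python
import numpy as np
from rapidfuzz import fuzz, process

def max_prior_similarity(texts, block=1000):
    """Highest partial_ratio of each text against any *earlier* text,
    computed one block of rows at a time (block x n memory, not n x n)."""
    n = len(texts)
    best = np.zeros(n, dtype=np.uint8)
    for start in range(0, n, block):
        stop = min(start + block, n)
        # Rows start..stop compared against every text, multithreaded
        scores = process.cdist(texts[start:stop], texts,
                               scorer=fuzz.partial_ratio,
                               dtype=np.uint8, workers=-1)
        for i in range(start, stop):
            if i:  # only texts before position i count, as in my loop above
                best[i] = scores[i - start, :i].max()
    return best

# df[max_prior_similarity(df['text'].tolist()) < 75] would then mirror my filter
```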