I'm new to NLP-related tasks and I'm doing this with Pandas (Python). Each row of the dataframe holds a piece of text (sentence length may vary) that I'm trying to run a spell corrector on. The dataframe currently has slightly over ~1 million records and is likely to grow in the future.
Initially, I tried using SymSpell's lookup_compound directly via the Pandas apply function, but it ran for a long time (>12 hours) without producing any results.
def symspell_compound(input_term, max_edit_distance=2):
    # sym_spell is a preloaded symspellpy SymSpell instance (setup sketch below)
    suggestions = sym_spell.lookup_compound(input_term, max_edit_distance)
    # lookup_compound returns suggestions ordered by distance/frequency; keep the top one
    return suggestions[0].term if suggestions else input_term

df['text_data'].apply(symspell_compound)
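For completeness, sym_spell is an ordinary symspellpy SymSpell object. My setup is roughly the following sketch, assuming the English frequency dictionaries that ship with symspellpy (paths/parameters may differ in your install):

import pkg_resources
from symspellpy import SymSpell

sym_spell = SymSpell(max_dictionary_edit_distance=2, prefix_length=7)
# unigram dictionary: term, count
sym_spell.load_dictionary(
    pkg_resources.resource_filename("symspellpy", "frequency_dictionary_en_82_765.txt"),
    term_index=0, count_index=1)
# bigram dictionary used by lookup_compound: term pair, count
sym_spell.load_bigram_dictionary(
    pkg_resources.resource_filename("symspellpy", "frequency_bigramdictionary_en_243_342.txt"),
    term_index=0, count_index=2)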
Then I came across joblib's Parallel function. I couldn't find many examples of it, but it seems to work on lists, so I extracted text_data into a list and ran Parallel() with the symspell_compound function. Processing was still slow (see the verbose printout below).
from joblib import Parallel, delayed

text_list = df['text_data'].to_list()
test_parallel = Parallel(n_jobs=4, verbose=10)(delayed(symspell_compound)(i) for i in text_list[:1000])
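(Once this is fast enough, the plan is simply to assign the results back to the dataframe; the column name text_corrected below is just an example:)

corrected = Parallel(n_jobs=4)(delayed(symspell_compound)(t) for t in text_list)
df['text_corrected'] = corrected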
This is the verbose printout when I tried it on a sample of 1,000 records.
[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done 5 tasks | elapsed: 52.2s
[Parallel(n_jobs=4)]: Done 10 tasks | elapsed: 1.7min
[Parallel(n_jobs=4)]: Done 17 tasks | elapsed: 2.8min
[Parallel(n_jobs=4)]: Done 24 tasks | elapsed: 3.9min
[Parallel(n_jobs=4)]: Done 33 tasks | elapsed: 5.2min
[Parallel(n_jobs=4)]: Done 42 tasks | elapsed: 6.6min
[Parallel(n_jobs=4)]: Done 53 tasks | elapsed: 8.2min
[Parallel(n_jobs=4)]: Done 64 tasks | elapsed: 9.9min
[Parallel(n_jobs=4)]: Done 77 tasks | elapsed: 11.9min
Extrapolating from this output, that is roughly 9 seconds per record even with 4 workers, which would put the full ~1 million rows at months of runtime. Any ideas on what has gone wrong (e.g. a function parameter, the way I'm calling Parallel, etc.), or how I can do this more efficiently? Thanks in advance.
Side note: I'm doing this in a CDSW workbench session with 4 CPUs and 8 GB of memory (the maximum allowed so far).