I'm new to NLP-related tasks and I'm doing this with Pandas (Python). Each row of the dataframe holds a piece of text (sentence length may vary) that I'm trying to run a spell corrector on. The dataframe currently has slightly over ~1 million records and is likely to grow in the future.
Initially, I tried using SymSpell's lookup_compound directly via the Pandas apply function, but it ran for a long time (>12 hours) without producing any results.
def symspell_compound(input_term, max_edit_distance=2):
    # sym_spell is a preloaded symspellpy SymSpell instance (setup sketch below)
    suggestions = sym_spell.lookup_compound(input_term, max_edit_distance)
    # lookup_compound returns suggestions ordered by distance/frequency; keep the top one
    return suggestions[0].term if suggestions else input_term

df['text_data'].apply(symspell_compound)
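For completeness, sym_spell is an ordinary symspellpy SymSpell object. My setup is roughly the following sketch, assuming the English frequency dictionaries that ship with symspellpy (paths/parameters may differ in your install):

import pkg_resources
from symspellpy import SymSpell

sym_spell = SymSpell(max_dictionary_edit_distance=2, prefix_length=7)
# unigram dictionary: term, count
sym_spell.load_dictionary(
    pkg_resources.resource_filename("symspellpy", "frequency_dictionary_en_82_765.txt"),
    term_index=0, count_index=1)
# bigram dictionary used by lookup_compound: term pair, count
sym_spell.load_bigram_dictionary(
    pkg_resources.resource_filename("symspellpy", "frequency_bigramdictionary_en_243_342.txt"),
    term_index=0, count_index=2)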
Then I came across joblib's Parallel function. I couldn't find many examples of it, but it seems to work on lists, so I extracted text_data into a list and ran Parallel() with the symspell_compound function. Processing was still slow (see the verbose printout below).
from joblib import Parallel, delayed

text_list = df['text_data'].to_list()
test_parallel = Parallel(n_jobs=4, verbose=10)(delayed(symspell_compound)(i) for i in text_list[:1000])
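(Once this is fast enough, the plan is simply to assign the results back to the dataframe; the column name text_corrected below is just an example:)

corrected = Parallel(n_jobs=4)(delayed(symspell_compound)(t) for t in text_list)
df['text_corrected'] = corrected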
This is the verbose printout when I tried it on a sample of 1,000 records.
[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done 5 tasks | elapsed: 52.2s
[Parallel(n_jobs=4)]: Done 10 tasks | elapsed: 1.7min
[Parallel(n_jobs=4)]: Done 17 tasks | elapsed: 2.8min
[Parallel(n_jobs=4)]: Done 24 tasks | elapsed: 3.9min
[Parallel(n_jobs=4)]: Done 33 tasks | elapsed: 5.2min
[Parallel(n_jobs=4)]: Done 42 tasks | elapsed: 6.6min
[Parallel(n_jobs=4)]: Done 53 tasks | elapsed: 8.2min
[Parallel(n_jobs=4)]: Done 64 tasks | elapsed: 9.9min
[Parallel(n_jobs=4)]: Done 77 tasks | elapsed: 11.9min
Extrapolating from this output, that is roughly 9 seconds per record even with 4 workers, which would put the full ~1 million rows at months of runtime. Any ideas on what has gone wrong (e.g. a function parameter, the way I'm calling Parallel, etc.), or how I can do this more efficiently? Thanks in advance.
Side note: I'm doing this in a CDSW workbench session with 4 CPUs and 8 GB of memory (the maximum allowed so far).