
With the TextBlob library, you can correct the spelling of a string by wrapping it in a TextBlob object and calling its correct() method.

Example:

from textblob import TextBlob
data = TextBlob('Two raods diverrged in a yullow waod and surry I culd not travl bouth')
print(data.correct())
Two roads diverged in a yellow wood and sorry I could not travel both

Is it possible to do the same for the strings in a column of a pandas DataFrame such as this one:

import pandas as pd

data = [{'one': '3', 'two': 'Two raods'},
        {'one': '7', 'two': 'diverrged in a yullow'},
        {'one': '8', 'two': 'waod and surry I'},
        {'one': '9', 'two': 'culd not travl bouth'}]
df = pd.DataFrame(data)
df

    one   two
0   3     Two raods
1   7     diverrged in a yullow
2   8     waod and surry I
3   9     culd not travl bouth

To return this:

    one   two
0   3     Two roads
1   7     diverged in a yellow
2   8     wood and sorry I
3   9     could not travel both

A solution using TextBlob or any other method would be fine.

RDJ

2 Answers


You could apply the correction to each value with pandas.Series.apply (after import textblob):

df['two'] = df['two'].apply(lambda txt: str(textblob.TextBlob(txt).correct()))
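
A slightly fuller sketch of the same idea, writing the corrected text back to the column and leaving missing values alone. The helper name correct_series and the toy_corrector function are made up for illustration so that the example runs without extra dependencies; with TextBlob installed, the corrector passed in would presumably be something like lambda s: str(TextBlob(s).correct()).

```python
import pandas as pd

def correct_series(series, corrector):
    """Apply `corrector` to every string in `series`; leave other values (e.g. NaN) untouched."""
    return series.apply(lambda v: corrector(v) if isinstance(v, str) else v)

# Stand-in corrector (an assumption for this sketch, NOT a real spell checker):
# it only knows the handful of misspellings from the question.
TOY_FIXES = {"raods": "roads", "diverrged": "diverged", "yullow": "yellow",
             "waod": "wood", "surry": "sorry", "culd": "could",
             "travl": "travel", "bouth": "both"}

def toy_corrector(text):
    # Replace each whitespace-separated word if it is a known misspelling.
    return " ".join(TOY_FIXES.get(word, word) for word in text.split())

df = pd.DataFrame({
    "one": ["3", "7", "8", "9"],
    "two": ["Two raods", "diverrged in a yullow",
            "waod and surry I", "culd not travl bouth"],
})

# Overwrite the column with the corrected strings.
df["two"] = correct_series(df["two"], toy_corrector)
print(df)
```

Passing the corrector in as an argument keeps the pandas part independent of whichever spelling library ends up being used.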

Ami Tavory

I am still searching for a faster method. However, there is another Python library, autocorrect, that also does spelling correction. I timed both libraries (autocorrect and textblob) on some demo data, and these are the results I got.

%%timeit
spell_correct_tb(['haave', 'naame'])
The slowest run took 4.36 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 505 µs per loop

%%timeit
spell_correct_autocorrect(['haave', 'naame'])
The slowest run took 4.80 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 303 µs per loop

This suggests that autocorrect is faster (or is that assumption wrong?). However, I am not sure how the two libraries compare in accuracy.
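
%%timeit is an IPython/Jupyter cell magic; outside a notebook, a comparable measurement can be taken with the standard-library timeit module. The harness below is only a sketch: time_corrector and the two stand-in functions are invented for illustration, and the real spell_correct_tb and spell_correct_autocorrect (whose bodies are not shown here) would be passed in their place.

```python
import timeit

def time_corrector(func, words, number=1000, repeat=3):
    """Return the best per-call time in seconds over `repeat` runs of `number` calls each."""
    totals = timeit.repeat(lambda: func(words), number=number, repeat=repeat)
    return min(totals) / number

# Hypothetical stand-ins for the two functions being compared;
# swap in the real library-backed functions to reproduce the comparison.
def stand_in_a(words):
    return [w.replace("aa", "a") for w in words]

def stand_in_b(words):
    return [w.lower() for w in words]

for name, func in [("stand_in_a", stand_in_a), ("stand_in_b", stand_in_b)]:
    best = time_corrector(func, ["haave", "naame"])
    print(f"{name}: {best * 1e6:.2f} µs per call")
```

A two-word input is a very small sample, so timing on a larger, more realistic batch of text would likely give a more trustworthy comparison.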

NB: You can install autocorrect with pip by running pip install autocorrect.
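
On the search for a faster method: whichever library is used, real text tends to repeat the same words many times, so caching per-word results can avoid redundant work. A minimal sketch, assuming a word-level corrector is acceptable; slow_word_corrector is a toy stand-in rather than either library's API, and correcting word by word would discard any context a library might otherwise use.

```python
from functools import lru_cache

def slow_word_corrector(word):
    """Toy stand-in for an expensive per-word spell checker (hypothetical)."""
    return {"haave": "have", "naame": "name"}.get(word, word)

@lru_cache(maxsize=None)
def cached_correct(word):
    # Each distinct word reaches the underlying corrector only once.
    return slow_word_corrector(word)

def correct_text(text):
    # Split on whitespace, correct each word through the cache, and rejoin.
    return " ".join(cached_correct(w) for w in text.split())

docs = ["I haave a naame", "you haave a naame too"]
corrected = [correct_text(d) for d in docs]
print(corrected)
# cache_info() reports how many lookups were served from the cache.
print(cached_correct.cache_info())
```

The same cached function could be used inside a pandas apply call, so that repeated words across rows are corrected only once.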

labeebee
    I used TextBlob's correct() method (https://textblob.readthedocs.io/en/dev/quickstart.html#spelling-correction) and it took me around 31 minutes to correct ~6500 documents. It was not 100% accurate, but I agree that it is a computationally expensive task. – Amitrajit Bose Jan 12 '19 at 10:32