I exported a PDF file as a .txt file and noticed that many words were broken into two parts by line breaks. In this program, I want to join the words that were split while leaving the correctly spelled words in the sentence untouched. In the end, I want a final .txt file (or at least a list of tokens) with all words properly spelt. Can anyone help me?

My current text looks like this:

I need your help be cause I am not a good progra mmer.

The result I need:

I need your help because I am not a good programmer.

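import re
import enchant

# Read the exported text and lowercase it.
with open('test-list.txt', 'r') as document_text:
    text_string = document_text.read().lower()

# enchant tags British English as 'en_GB' (the original 'en_UK' is not a locale tag).
dic = enchant.Dict('en_GB')
lst = re.findall(r'\b[a-zA-Z0-9_]{1,15}\b', text_string)

fixed = []   # tokens after re-joining split words
errors = []  # tokens that fail the spellcheck even after joining

i = 0
while i < len(lst):
    word = lst[i]
    if dic.check(word):
        fixed.append(word)
        i += 1
        continue
    # The token is not a word: try joining it with the next token.
    joined = word + lst[i + 1] if i + 1 < len(lst) else word
    if joined != word and dic.check(joined):
        fixed.append(joined)
        i += 2
    else:
        errors.append(word)
        fixed.append(word)
        i += 1

print(fixed)
print(errors)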
Josh Friedlander
Natalia Resende

1 Answer

You have a bigger problem - how will your program know that:

be
cause

... should be treated as one word?

If you really wanted to, you could remove the newline characters by replacing them with an empty string:

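import re

# Removing the newlines glues the last word of each line to the first word of the next.
document_text = """
i need your help be
cause i am not a good programmer
""".lower().replace("\n", '')

# Tokenize into words of 1 to 15 characters.
print(re.findall(r'\b[a-zA-Z0-9_]{1,15}\b', document_text))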

This turns be and cause into because, which passes the spellcheck, but it fails in cases like:

Hello! My name is 
Foo.

... because isFoo is not a word.
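
One possible way to soften this failure is to glue the two halves together only when the joined string is itself a known word, and to put a space between the lines otherwise. The following is a minimal sketch under stated assumptions, not a definitive fix: `KNOWN_WORDS`, `in_vocabulary` and `join_across_line_breaks` are hypothetical names introduced here, and the tiny word set stands in for a real dictionary lookup such as `enchant.Dict('en_GB').check`.

```python
import re

# Hypothetical stand-in vocabulary; a real dictionary lookup
# (for example enchant.Dict('en_GB').check) could be passed in instead.
KNOWN_WORDS = {
    "i", "need", "your", "help", "be", "cause", "because", "am", "not",
    "a", "good", "programmer", "hello", "my", "name", "is",
}


def in_vocabulary(token):
    """Return True if the lowercased token is in the stand-in vocabulary."""
    return token.lower() in KNOWN_WORDS


def join_across_line_breaks(text, is_word=in_vocabulary):
    """Merge lines, gluing a line's last token to the next line's first
    token only when their concatenation is a known word."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if not lines:
        return ""
    result = lines[0]
    for line in lines[1:]:
        tail = re.search(r'[A-Za-z]+$', result)  # trailing letters so far
        head = re.match(r'[A-Za-z]+', line)      # leading letters of next line
        if tail and head and is_word(tail.group() + head.group()):
            result += line        # "be" + "cause ..." becomes "because ..."
        else:
            result += " " + line  # keep the two tokens as separate words
    return result


print(join_across_line_breaks("i need your help be\ncause i am not a good programmer"))
print(join_across_line_breaks("Hello! My name is \nFoo."))
```

This still guesses wrong whenever two legitimate words at a line boundary happen to concatenate into another valid word, so the ambiguity described above does not go away; it only becomes rarer.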

alex