
I have extracted text from a PDF (using pdfplumber) to a .txt file, but there are some spaces between words that are not in the PDF file.

I have tried using NLTK to detect words via a "previous_word" + "current_word" combination, checking whether the merged form exists in nltk.corpus.words, to find where there is an extra space between words, but it is not working well.
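A minimal sketch of the approach described above: merge each adjacent pair of tokens and test whether the merged form is a known English word. The small hard-coded vocabulary here is only a stand-in for illustration; with NLTK downloaded you would use `vocab = set(nltk.corpus.words.words())` instead.

```python
# Stand-in vocabulary; replace with set(nltk.corpus.words.words()) if NLTK is available
vocab = {"profitably", "definitive", "words", "and"}

def merge_split_words(tokens, vocab):
    """Rejoin adjacent tokens whose concatenation is a known word."""
    out = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i] + tokens[i + 1]).lower() in vocab:
            out.append(tokens[i] + tokens[i + 1])  # spurious space: merge the pair
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

print(merge_split_words("profi tably and defi nitive words".split(), vocab))
# -> ['profitably', 'and', 'definitive', 'words']
```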

I am looking for some suggestions. Thanks

Joy
  • That looks like a few spaces, is it a "\t"? Can you include a sample of the text? – DDaly Mar 15 '21 at 13:14
  • With `sed` you could: `sed -i 's/\ */ /g'`, but I don't know if this is a good answer; maybe you could do something better directly in Python – Ivan Mar 15 '21 at 13:21

2 Answers


Example logic that collects word pairs separated by two spaces into a list; you can then process them however you like:

text = """
asdasd  asd asdd d
uuurr ii ii  rrr
"""

words = text.split(" ")  # <- split on single spaces; a double space yields an empty string
dictionary = []          # <- word list to compare against (fill with known words)
words_wrapper = []       # <- word pairs that had two spaces between them

for idx in range(1, len(words) - 1):  # skip the ends to avoid index errors
    if words[idx] == '':
        word = f"{words[idx - 1]} {words[idx + 1]}"
        words_wrapper.append(word)
        if word in dictionary:
            pass  # <- do something with the match

# Print the word pairs that were separated by two spaces
print(words_wrapper)

Alternatively, you can use .join to merge words that have two spaces between them:

text = """
asdasd  asd asdd d
uuurr ii ii  rrr
"""

print("".join(text.split("  ")))
Coconutcake
    If you read further down in the extracted text, you'll see that there are also unwanted single spaces ("profi tably", "defi nitive"); the double space merely leaps to the eye. Thus, removing double spaces doesn't solve the OP's problem. – mkl Mar 15 '21 at 15:39

I suggest looking for occurrences of two subsequent words which are not in your corpus; this should reveal all cases where such a split does not result in another English word.

Daweo