
I have extracted text from a PDF (using pdfplumber) to a .txt file, but there are some spaces between words that are not in the PDF file.

I have tried using NLTK to detect words via a "previous_word" + "current_word" combination, checking whether the merged form exists in nltk.corpus.words, to find where there is an extra space between words, but it is not working well.
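A minimal sketch of the approach described above: merge each adjacent pair of tokens and test whether the merged form is a known English word. The small hard-coded vocabulary here is only a stand-in for illustration; with NLTK downloaded you would use `vocab = set(nltk.corpus.words.words())` instead.

```python
# Stand-in vocabulary; replace with set(nltk.corpus.words.words()) if NLTK is available
vocab = {"profitably", "definitive", "words", "and"}

def merge_split_words(tokens, vocab):
    """Rejoin adjacent tokens whose concatenation is a known word."""
    out = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i] + tokens[i + 1]).lower() in vocab:
            out.append(tokens[i] + tokens[i + 1])  # spurious space: merge the pair
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

print(merge_split_words("profi tably and defi nitive words".split(), vocab))
# -> ['profitably', 'and', 'definitive', 'words']
```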

I am looking for some suggestions. Thanks

Joy
  • That looks like a few spaces, is it a "\t"? Can you include a sample of the text? – DDaly Mar 15 '21 at 13:14
  • With `sed` you could: `sed -i 's/\ */ /g'`, but I don't know if this is a good answer; maybe you could do something better directly in Python – Ivan Mar 15 '21 at 13:21

2 Answers


Example logic that collects word pairs separated by two spaces into a list; you can then process them however you like:

text = """
asdasd  asd asdd d
uuurr ii ii  rrr
"""

words = text.split(" ")  # <- split on single spaces; a double space yields an empty string
dictionary = []          # <- word list to compare against (fill with known words)
words_wrapper = []       # <- word pairs that had two spaces between them

for idx in range(1, len(words) - 1):  # skip the ends to avoid index errors
    if words[idx] == '':
        word = f"{words[idx - 1]} {words[idx + 1]}"
        words_wrapper.append(word)
        if word in dictionary:
            pass  # <- do something with the match

# Print the word pairs that were separated by two spaces
print(words_wrapper)

Alternatively, you can use .join to merge words that have two spaces between them:

text = """
asdasd  asd asdd d
uuurr ii ii  rrr
"""

print("".join(text.split("  ")))
Coconutcake
    If you read further down in the extracted text, you'll see that there are also unwanted single spaces ("profi tably", "defi nitive"); the double space merely leaps to the eye. Thus, removing double spaces doesn't solve the OP's problem. – mkl Mar 15 '21 at 15:39

I suggest looking for occurrences of two subsequent words which are not in your corpus; this should reveal all cases where such a split does not result in another English word.

Daweo