
s = "Over 20 years, this investment is cost neutral as it is covered by a modest ‘comfort ch arge’ – less than the equivalent energy bills would have been – based on the well -proven EnergieSprong model. Capital Budget Rather than speculatively invest ing in commercial property, for which the business case is unclear, we propose that the Council j oin the growing ranks of local authorities developing new solar farms. This meets our policy objectives and provides a modest, but secure, return (net of borrowing). The £51m we propose to invest (similar to the amount originally intended for commercial pr operty)"

This text was scraped from a PDF on the web using basic Python and the PyPDF library.

I want to remove the unwanted spaces in the bold words.

Note: I have manually made them bold just to explain my problem. I would appreciate it if someone could help. Thanks a lot in advance!

zackakshay

5 Answers


See my answer and the others in this thread.

Assuming you sourced the text from either this DOCX or this PDF: if you have the DOCX, use that rather than the PDF, since DOCX is an XML-based format from which text can be extracted without errors.
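Since a .docx file is just a ZIP archive of XML, one standard-library-only way to pull the text out (a sketch, no third-party parser assumed) looks like this:

```python
# A DOCX file is a ZIP archive; the body text lives in word/document.xml
# inside <w:t> elements, so the standard library alone can extract it.
import io
import xml.etree.ElementTree as ET
import zipfile

W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"

def docx_text(data: bytes) -> str:
    """Extract plain text from the raw bytes of a .docx file."""
    with zipfile.ZipFile(io.BytesIO(data)) as z:
        xml_body = z.read("word/document.xml")
    root = ET.fromstring(xml_body)
    # Join the text runs of each paragraph; paragraphs become separate lines.
    paragraphs = []
    for p in root.iter(W_NS + "p"):
        runs = [t.text or "" for t in p.iter(W_NS + "t")]
        paragraphs.append("".join(runs))
    return "\n".join(paragraphs)
```

Because the whitespace is stored explicitly in the XML text nodes, this route never has to guess where a space belongs.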

You will also notice that if you copy and paste the PDF content into any other text document, you won't get these erroneous whitespaces. The problem results from the way the PDF parser works: it gets confused by the horizontal spacing of the characters and makes false assumptions about where whitespace belongs based on character positions.

You could try a different parser, or copy and paste the text (only possible if it is not an image PDF, of course) into an easily parsable format first to avoid these problems.

Generally you can probably reduce the error rate by trying to fix the resulting text (if you really want to, check out Optical Character Recognition Post Correction / OCR Post Correction), but spending that time improving the parsing instead is likely to be much more effective.

ewz93

This function removes the first space from the first occurrence of a word:

def remove_space_in_word(text, word):
    # Remove the first space from the first occurrence of `word` in `text`.
    index = text.find(word)
    if index == -1:
        return text  # word not found: nothing to fix
    parts = word.split(" ")
    part1_len = len(parts[0])
    return text[:index + part1_len] + text[index + part1_len + 1:]

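For completeness, a small usage sketch (the function is repeated so the snippet runs on its own):

```python
def remove_space_in_word(text, word):
    # Remove the first space from the first occurrence of `word` in `text`.
    index = text.find(word)
    if index == -1:
        return text  # word not found: nothing to fix
    parts = word.split(" ")
    part1_len = len(parts[0])
    return text[:index + part1_len] + text[index + part1_len + 1:]

print(remove_space_in_word("commercial pr operty", "pr operty"))
# commercial property
```

Note that this fixes only the first occurrence and only the first space; you would need to call it repeatedly for multiple occurrences or multi-space words.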

Don CorliJoni

The simple manual method

If you have already identified that 'pr operty' tends to be written with an extra space, here is a simple function that will remove whitespace from all occurrences of pr operty:

def remove_whitespace_in_word(text, word):
    return text.replace(word, ''.join(word.split()))

s = "The pr operty. Over 20 years of pr operty, this investment is cost neutral as it is covered by a modest ‘comfort ch arge’ – less than the equivalent energy bills would have been – based on the well -proven EnergieSprong model. Capital Budget Rather than speculatively invest ing in commercial property, for which the business case is unclear, we propose that the Council j oin the growing ranks of local authorities developing new solar farms. This meets our pr operty policy objectives and provides a modest, but secure, return (net of borrowing). The £51m we propose to invest in pr operty (similar to the amount originally intended for commercial pr operty)"

new_text = remove_whitespace_in_word(s, 'pr operty')

print(new_text)
# 'The property. Over 20 years of property, this investment is cost neutral as it is covered by a modest ‘comfort ch arge’ – less than the equivalent energy bills would have been – based on the well -proven EnergieSprong model. Capital Budget Rather than speculatively invest ing in commercial property, for which the business case is unclear, we propose that the Council j oin the growing ranks of local authorities developing new solar farms. This meets our property policy objectives and provides a modest, but secure, return (net of borrowing). The £51m we propose to invest in property (similar to the amount originally intended for commercial property)'

You only need to call it once to fix all occurrences of pr operty; but you need to call it again for every other offending word, such as ch arge.
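Applying it to a hand-collected list of offending words is a short loop (a sketch; the list itself has to come from proofreading the text):

```python
def remove_whitespace_in_word(text, word):
    # Replace every occurrence of `word` with the same word minus whitespace.
    return text.replace(word, ''.join(word.split()))

# Offending words collected manually from the scraped text.
offending = ['pr operty', 'ch arge', 'invest ing', 'j oin']

text = "The comfort ch arge on the pr operty after invest ing."
for word in offending:
    text = remove_whitespace_in_word(text, word)

print(text)
# The comfort charge on the property after investing.
```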

The complicated automated method

Here is a proposed algorithm. It's not perfect, but should deal with many errors:

  • Load a data structure holding all known English words, for instance the dictionary of Scrabble words.
  • Look for words in your text that are not in the dictionary.
  • Try to fix each offending word by merging it with the adjacent word that comes before or the adjacent word that comes after.
  • When attempting to merge, there are several possibilities. If the word after is also offending and merging them results in a non-offending word, it's likely a good fit. If the word after is not offending but merging them results in a non-offending word, it's maybe still a good fit. If the word after is not offending and merging them doesn't result in a non-offending word, it's probably not a good fit.
  • Generate a log of all the fixes that were performed, so that a user can read the log and make sure that the fixes look legit. Generating a log is really important; you don't want your algorithm to edit the text without keeping a trace of what was edited.
  • You could even do an interactive step, where the computer proposes a fix but waits for the user to validate it. When the user validates a fix, memorise it so that if another fix is identical, the user doesn't need to be asked again. For instance if there are several occurrences of "pr operty" in the text, you only need to ask confirmation once.
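The steps above can be sketched roughly like this (the tiny hard-coded dictionary stands in for a real word list such as the Scrabble dictionary, and only forward merges are attempted):

```python
def fix_text(tokens, dictionary):
    """Merge adjacent tokens when the merge produces a dictionary word.

    Returns the fixed tokens plus a log of every merge performed, so a
    user can review what was changed.
    """
    fixed, log = [], []
    i = 0
    while i < len(tokens):
        word = tokens[i]
        nxt = tokens[i + 1] if i + 1 < len(tokens) else None
        # Only try to fix tokens that are not dictionary words themselves.
        if word.lower() not in dictionary and nxt is not None \
                and (word + nxt).lower() in dictionary:
            log.append(f"merged {word!r} + {nxt!r} -> {word + nxt!r}")
            fixed.append(word + nxt)
            i += 2  # the next token was consumed by the merge
        else:
            fixed.append(word)
            i += 1
    return fixed, log

dictionary = {"the", "property", "charge", "is", "a", "modest", "comfort"}
fixed, log = fix_text("the pr operty ch arge is modest".split(), dictionary)
print(" ".join(fixed))
# the property charge is modest
for entry in log:
    print(entry)
```

An interactive version would replace the log append with a prompt to the user, caching each confirmed merge so identical fixes are applied automatically.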
Stef

You could split your malformed sentence on spaces and check each pair of words / tokens in the split list to see if they are valid words by themselves or if their combination is a valid word.

For valid words, depending on the OS you are using, you may find a built-in word list. On Linux, it is usually at /usr/share/dict/words. Alternatively, you can download a word list from the internet.

from itertools import pairwise  # Python 3.10+

with open('/usr/share/dict/words') as f:
    word_file = {line.strip() for line in f}

def fix_spaces(iterable):
    it = iter(pairwise(iterable))
    word2 = None
    while True:
        try:
            word1, word2 = next(it)
            if (word1 not in word_file or word2 not in word_file) \
                    and word1 + word2 in word_file:
                # The pair merges into a valid word: emit the merge and
                # skip the overlapping pair that starts with word2.
                yield word1 + word2
                try:
                    word1, word2 = next(it)
                except StopIteration:
                    break  # the merged pair ended the sequence
            else:
                yield word1  # keep word1 even when it cannot be fixed
        except StopIteration:
            if word2 is not None:
                yield word2  # the last token never appears as a word1
            break

sentence = "A sent ence w ith wei rd spaces"
' '.join(fix_spaces(sentence.split()))
# 'A sentence with weird spaces'

Do note that this still has edge cases, depending on your word list, and also cases where spaces can be removed in more than one way (e.g. s = "tube light speed" could become either tubelight speed or tube lightspeed).
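One way to surface such ambiguous cases is to enumerate every candidate merge instead of committing to one (the toy vocabulary here is an assumption for illustration):

```python
def candidate_fixes(tokens, vocabulary):
    """Yield every sentence obtainable by merging one adjacent pair
    into a vocabulary word, exposing ambiguous repairs to the user."""
    for i in range(len(tokens) - 1):
        merged = tokens[i] + tokens[i + 1]
        if merged in vocabulary:
            yield " ".join(tokens[:i] + [merged] + tokens[i + 2:])

vocabulary = {"tube", "light", "speed", "tubelight", "lightspeed"}
print(list(candidate_fixes("tube light speed".split(), vocabulary)))
# ['tubelight speed', 'tube lightspeed']
```

When more than one candidate comes back, the safest option is to ask a human rather than pick automatically.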

Mortz

Yes, this can be done using the rich vocabulary of NLP libraries like NLTK and spaCy.

Make sure both libraries (NLTK and spaCy) are installed before running the code below.

To download the large spaCy model, run: python -m spacy download en_core_web_lg

Below is an example:

# Fix unwanted spaces within words. In the first iteration, 1530 words were fixed.

import spacy
from nltk.corpus import stopwords, words as nltk_words

nlp = spacy.load('en_core_web_lg', disable=['parser', 'ner'])
spacy_words = set(nlp.vocab.strings)


def cleaning_fix_unwanted_space_v2(_inp_str: str) -> str:
    # _vocab = nltk_words.words()  # alternative: use the NLTK word list
    _vocab = spacy_words

    _inp_str_splitted = _inp_str.split()
    out_words = []

    i = 0
    while i < len(_inp_str_splitted):
        word = _inp_str_splitted[i]

        if word not in _vocab and i + 1 < len(_inp_str_splitted):
            # Try merging with the following token first.
            next_word = _inp_str_splitted[i + 1]
            joined = word + next_word

            if joined.strip() in _vocab:
                word = joined
                i += 1  # the next token was consumed by the merge
            elif i - 1 >= 0:  # was `(i - 1) > 0`, which skipped index 1
                # Fall back to merging with the previous token.
                prev_word = _inp_str_splitted[i - 1]
                joined = prev_word + word

                if joined.strip() in _vocab:
                    word = joined
                    del out_words[-1]  # replace the previously emitted token

        out_words.append(word)
        i += 1

    return " ".join(out_words)


As you can see from the example, this still has some limitations. It failed to fix "gen eral" because both "gen" and "eral" are valid words on their own. But as a starting point, I guess this is good enough.