Remove extra whitespaces from extracted PDF text

Question

I have extracted the text from a PDF file, and some of it has extra whitespace between the letters of words:

Your water a n d wastewater s t a t e m e n t

I wrote a function to remove the extra spaces from the text above.

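function removeExtraWhitespace($val) {
    $nval = "";
    $len = strlen($val);

    for ($i = 0; $i < $len; $i++) {
        // Non-space characters are always kept. A space is kept only if the
        // character two positions before or after it is not a space, i.e. the
        // space borders a multi-letter word.
        if ($val[$i] != " ") {
            $nval .= $val[$i];
        }
        // The $i >= 2 check avoids negative offsets, which PHP 7.1+ reads from the end of the string.
        elseif (($i >= 2 && $val[$i-2] != " ") || (isset($val[$i+2]) && $val[$i+2] != " ")) {
            $nval .= $val[$i];
        }
    }
    return $nval;
}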

Which will output:

Your water and wastewater statement

I know that this function will not work in all circumstances, though. It fails if the text contains a valid one-letter word, like 'a', or if only part of a word has extra spaces.

I n e e d to remove whitespaces f r o m a string

When I pass the above text to my function, it outputs:

Ineed to remove whitespaces froma string

Is there a way to make a function that will work on all possible text?

There is no unique marker in this string that identifies which spaces to remove. If the first letter of each word were capitalized, it would be possible. — Bilal Ahmed, Oct 27 '17 at 11:52
I would think of passing the text, sentence by sentence, to an autocorrection API service. Maybe there is a Google Assistant API or something like that. — iquellis, Oct 27 '17 at 11:55
Maybe it's worth putting in some more effort at the extraction stage as well: I guess the PDF itself looks okay for your text examples? So maybe your parsing library, or whatever you use, is just not good enough or must be used another way? — iquellis, Oct 27 '17 at 11:56
@iquellis I have tried several ways of extracting the text from the PDFs. The example text came from using ebook-convert, which so far has produced the best results for me to parse. — Gary, Oct 27 '17 at 12:06
@Gary: Good, just wanted to be sure that you tried more than one way... PDFs simply suck big time... — iquellis, Oct 27 '17 at 12:15
Mission impossible. "In general - `a n a l p h a b e t` is a good thing". Should it be translated to "In general - `an alphabet` is a good thing", or to "In general - `analphabet` is a good thing"? — Agnius Vasiliauskas, Oct 27 '17 at 16:05
I suggest trying to use "pdftotext" which comes with XPDF (opensource). Maybe like this? https://stackoverflow.com/questions/9286036/how-to-extract-texts-from-pdfs-using-xpdf — Andreas Hauser, Mar 20 '18 at 13:08
The correct answer to this question is "no", unless you have very specific parameters in terms of what text you expect to be inside the PDF, or you are Google. You should instead focus on extracting it correctly. — Andrew, Apr 26 '18 at 15:55
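
The extraction-side suggestions in the comments above can be tried from PHP by shelling out to `pdftotext`. This is only a minimal sketch, assuming the `pdftotext` binary (from Xpdf or Poppler) is installed and on the PATH. The function names `buildPdftotextCommand` and `extractPdfText` are hypothetical, and whether `-layout` avoids the spaced-out letters depends on how the PDF was produced.

```php
<?php
// Build the shell command. "-layout" asks pdftotext to keep the physical
// page layout, "-enc UTF-8" sets the output encoding, and the trailing "-"
// sends the text to stdout instead of a file.
function buildPdftotextCommand(string $pdfPath): string
{
    return 'pdftotext -layout -enc UTF-8 ' . escapeshellarg($pdfPath) . ' -';
}

// Run the command and return the extracted text, or null on failure.
// Assumption: pdftotext is installed; this sketch does not check for it.
function extractPdfText(string $pdfPath): ?string
{
    $output = shell_exec(buildPdftotextCommand($pdfPath));
    return is_string($output) ? $output : null;
}
```

It may be worth comparing the result with and without `-layout` on the same file before deciding whether any post-processing is still needed.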

score 1 · Answer 1 · answered Jun 20 '18 at 00:05

Spelling correction is hard work. I think you should use an online spelling correction service. You can do something like this:

// Posts $post to the Reverso spell-check URL below and returns the raw response body.
function curl($post)
{
    $user_agent = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; tr; rv:1.9.0.6) Gecko/2009011913 Firefox/3.0.6';
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, 'http://orthographe.reverso.net/RISpellerWS/RestSpeller.svc/v1/CheckSpellingAsXml/language=eng?outputFormat=json&doReplacements=false&interfLang=en&dictionary=both&spellOrigin=interactive&includeSpellCheckUnits=true&includeExtraInfo=true&isStandaloneSpeller=true');
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_POST, true);
    curl_setopt($ch, CURLOPT_POSTFIELDS, $post);
    curl_setopt($ch, CURLOPT_USERAGENT, $user_agent);
    curl_setopt($ch, CURLOPT_HTTPHEADER, array(
        'Created: 01/01/0001 00:00:00',
        'Referer: http://www.reverso.net/spell-checker/english-spelling-grammar/',
        'Username: OnlineSpellerWS'
    ));
    $icerik = curl_exec($ch);
    curl_close($ch);
    return $icerik;
}


$response   = json_decode(curl('Ineed to remove whitespaces froma string'));

var_dump($response->AutoCorrectedText);

This is just to illustrate the idea. I am sure there are spelling correction services that provide an API.
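
If the damage is limited to runs of single letters, as in the question's examples, a local word list may be enough without a remote service. The following is a rough sketch under that assumption, not a general solution: `segmentWords` and `rejoinSpacedLetters` are hypothetical names, the tiny `$dictionary` stands in for a real word list, and words that are only partly spaced out are not handled. Each run of one-letter tokens is joined and then split into the fewest dictionary words.

```php
<?php
// Split $letters into the fewest dictionary words, or return null if no
// split exists. $dictionary is a set of the form [lowercase word => true].
function segmentWords(string $letters, array $dictionary): ?array
{
    $n = strlen($letters);
    // $best[$i] is the shortest word list covering the first $i characters.
    $best = [0 => []];

    for ($end = 1; $end <= $n; $end++) {
        for ($start = 0; $start < $end; $start++) {
            if (!isset($best[$start])) {
                continue; // the prefix up to $start cannot be segmented
            }
            $word = substr($letters, $start, $end - $start);
            if (!isset($dictionary[strtolower($word)])) {
                continue;
            }
            $candidate = array_merge($best[$start], [$word]);
            if (!isset($best[$end]) || count($candidate) < count($best[$end])) {
                $best[$end] = $candidate;
            }
        }
    }
    return $best[$n] ?? null;
}

// Rejoin runs of single-letter tokens using the dictionary.
function rejoinSpacedLetters(string $text, array $dictionary): string
{
    $tokens = preg_split('/\s+/', trim($text));
    $out = [];
    $run = [];

    // Resolve the pending run of single letters and append it to $out.
    $flush = function () use (&$run, &$out, $dictionary) {
        if (count($run) > 1) {
            $words = segmentWords(implode('', $run), $dictionary);
            // Leave the letters untouched when no segmentation exists.
            $out = array_merge($out, $words ?? $run);
        } else {
            $out = array_merge($out, $run);
        }
        $run = [];
    };

    foreach ($tokens as $token) {
        if (preg_match('/^[A-Za-z]$/', $token) === 1) {
            $run[] = $token;
        } else {
            $flush();
            $out[] = $token;
        }
    }
    $flush();

    return implode(' ', $out);
}

// Illustrative word list only; a real one would be far larger.
$dictionary = array_fill_keys(['i', 'a', 'need', 'and', 'from', 'statement'], true);

echo rejoinSpacedLetters('I n e e d to remove whitespaces f r o m a string', $dictionary), "\n";
// prints: I need to remove whitespaces from a string
```

The fewest-words rule is a heuristic. With both `an`, `alphabet` and `analphabet` in the word list it would pick `analphabet`, so the ambiguity raised in the comments remains.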
