3

TL;DR

My workflow:

  1. Download PDF
  2. Split it into pages using pdftk
  3. Extract text of each page using pdftotext
  4. Classify text and add metadata
  5. Send it to client in a structured format

I need to extract consistent text to move from step 3 to step 4. If the text of a page is garbled, I have to OCR that page, but OCRing every page is out of the question. How can I identify beforehand which pages should be OCRed? I've tried running pdffonts and pdftohtml on each page, but isn't it expensive to call subprocess.run twice per page?

What do I mean by a broken page?

A PDF page whose text cannot be extracted correctly from its source, for example because of a missing or broken ToUnicode mapping.

Description

I'm building an application that relies on extracting text from about a thousand PDF files every day. The text layout in each PDF is somewhat structured, so calling pdftotext from Python works well in most cases. However, some PDF files from one or two sources contain pages with problematic fonts, which results in garbled text. Running OCR only on those problematic pages seems like a reasonable way around the issue. So my problem is how to identify, before extracting the text, which pages are likely to produce gibberish.
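For reference, the split-and-extract step currently looks roughly like this (a simplified sketch; file names, page counts and the output directory are illustrative):

import os
import subprocess

def extract_pages(pdf_path, num_pages, workdir="pages"):
    """Sketch of steps 2-3: burst with pdftk, then pdftotext on every page."""
    os.makedirs(workdir, exist_ok=True)
    # step 2: split the PDF into single-page files (pg_0001.pdf, pg_0002.pdf, ...)
    subprocess.run(["pdftk", pdf_path, "burst", "output", f"{workdir}/pg_%04d.pdf"],
                   check=True)
    texts = {}
    for page in range(1, num_pages + 1):
        # step 3: extract the text of one page
        result = subprocess.run(
            ["pdftotext", "-layout", f"{workdir}/pg_{page:04d}.pdf", "-"],
            capture_output=True, text=True, check=True)
        texts[page] = result.stdout
        # <-- this is where I need to decide whether the page must be OCRed instead
    return texts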

First, I tried to identify garbled text after extracting it, using a regex (\p{Cc} or unlikely characters outside the Latin alphabet), but that did not work because I also found corrupted text made of perfectly valid characters and digits, e.g. AAAAABS12 54c] $( JJJJ Pk.
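The check I tried looked roughly like this (a sketch using the third-party regex module; the character classes and the threshold are arbitrary):

import regex  # third-party "regex" module, needed for \p{...} character classes

# control characters, private-use characters, or anything outside Latin script,
# digits, punctuation, symbols and whitespace
SUSPECT = regex.compile(r"[\p{Cc}\p{Co}]|[^\p{Latin}\p{N}\p{P}\p{S}\p{Z}\s]")

def looks_garbled(page_text, threshold=0.05):
    """Flag a page when the ratio of suspicious characters is above the threshold."""
    if not page_text.strip():
        return False
    hits = len(SUSPECT.findall(page_text))
    return hits / len(page_text) > threshold

As said above, this misses the pages whose corrupted text maps to perfectly ordinary letters and digits.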

Second, I tried to identify garbled text by calling pdffonts on each page and parsing its output to get each font's name, encoding, embeddedness and whether it has a ToUnicode map. In my tests this works reasonably well, but I also found it necessary to count how many characters use the likely problematic fonts; pdftohtml, which outputs each text block in a p tag along with its font name, saved the day here. @LMC helped me figure out how to do it, take a look at the answer. The bad part is that I ended up calling subprocess.run twice for each PDF page, which is super expensive. It would be cheaper if I could bind those tools directly.

I'd like to know whether it's possible and feasible to look at the PDF source and validate each font's CMap (ToUnicode present and not a custom encoding), or to apply other heuristics, in order to find problematic fonts before deciding whether to extract the text or OCR the page.
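By "looking at the PDF source" I mean something along these lines; it is only a sketch, using pdfminer.six (a library I have not actually benchmarked), that reads each page's font dictionaries directly and applies a font-level heuristic (no ToUnicode map and a non-standard encoding) without spawning any subprocess:

from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfparser import PDFParser
from pdfminer.pdftypes import resolve1
from pdfminer.psparser import PSLiteral

STANDARD_ENCODINGS = ("WinAnsiEncoding", "MacRomanEncoding", "StandardEncoding")

def pages_with_suspect_fonts(path):
    """Yield 1-based page numbers that use a font with no ToUnicode map
    and an encoding that is not one of the standard simple encodings."""
    with open(path, "rb") as fh:
        doc = PDFDocument(PDFParser(fh))
        for pageno, page in enumerate(PDFPage.create_pages(doc), start=1):
            resources = resolve1(page.resources) or {}
            fonts = resolve1(resources.get("Font")) or {}
            for ref in fonts.values():
                font = resolve1(ref) or {}
                encoding = resolve1(font.get("Encoding"))
                name = encoding.name if isinstance(encoding, PSLiteral) else None
                if "ToUnicode" not in font and name not in STANDARD_ENCODINGS:
                    yield pageno
                    break

This only replaces the subprocess calls, though, not the heuristic itself, so it would still miss the pages that have a ToUnicode table and produce garbage anyway.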

Example of garbled text in one of my PDF files:

0\n1\n2\n3\n4\n2\n0\n3\n0\n5 6\n6\nÿ\n89 ÿ\n4\n\x0e\n3\nÿ\n\x0f\x10\n\x11\n\x12\nÿ\n5\nÿ\n6\n6\n\x13\n\x11\n\x11\n\x146\n2\n2\n\x15\n\x11\n\x16\n\x12\n\x15\n\x10\n\x11\n\x0e\n\x11\n\x17\n\x12\n\x18\n\x0e\n\x17\n\x19\x0e\n\x1a\n\x16\n2 \x11\n\x10\n\x1b\x12\n\x1c\n\x10\n\x10\n\x15\n\x1d29 2\n\x18\n\x10\n\x16\n89 \x0e\n\x14\n\x13\n\x14\n\x1e\n\x14\n\x1f\n5 \x11\x1f\n\x15\n\x10\n! \x1c\n89 \x1f\n5\n3\n4\n"\n1\n1\n5 \x1c\n89\n#\x15\n\x1d\x1f\n5\n5\n1\n3\n5\n$\n5\n1 5\n2\n5\n%8&&#\'#(8&)\n*+\n\'#&*,\nÿ\n(*ÿ\n-\n./0)\n1\n*\n*//#//8&)\n*ÿ\n#/2#%)\n*,\nÿ\n(*/ÿ\n/#&3#40)\n*/ÿ\n#50&*-\n.()\n%)\n*)\n/ÿ\n+\nÿ\n*#/#\n&\x19\n\x12\nÿ\n\x1cÿ\n,\x1d\n\x12\n\x1b\x10\n\x15\n\x116\nÿ\n\x15\n7\nÿ\n8\n9\n4\n6\nÿ\n%\x10\n\x15\n\x11\n\x166\nÿ\n:\x12\x10;\n2\n*,\n%#26\nÿ\n<\n$\n3\n0\n3\n+\n3\n8\n3\nÿ\n+\nÿ\n=\x15\n\x10\n6\nÿ\n>\n9\n0\n?\nÿ\n4\n3\n3\n1\n+\n8\n9\n3\n<\n@A\nB\nC\nD\nEÿ\nGH\nI\nÿ\nJ\nJ\nK\nL\nJ\nM\nJ\nN\nO\nP\nO\nQ\nI\n#\x1bÿ\n0\n1\nÿ\n\x1c\n\x10\nÿ\n*\x1a\n\x16\n\x18\nÿ\n\x1c\n\x10\nÿ\n0\n3\n0\n5\n\x0e\n/\x10\n\x15\n\x13\x16\n\x12\nÿ\n/\x10\n\x16\n\x1d\x1c\x16\n\x12\n6\nÿ\n* \x19\n\x15\n\x116\nÿ\n\x12\n\x19\n\x11\n\x19\n\x12\n\x16\nÿ\n\x15ÿ\n/*-\n\x0e\nÿ\nÿ\nÿ\nÿ\nÿ\nÿ\nÿ\nÿ\nÿ\nÿ\nÿ\nÿ\nÿ\nÿ\nÿ\nÿ\nÿ\nÿ\nÿ\nÿ\nÿ\n(\x10\nÿ\x16\n\x1c\n\x10\n\x1bÿ\n\x1c\n\x12\nÿ\n%\x13\n\x10\n9\n\x10\nÿ\n\x1c\n\x10\nÿ\n\'\x12\n\x1a\x15\n\x10\n\x11\n\x10\nÿ\n\x1c\n\x12\nÿ\n%\x16\n\x16\n\x10\nR\n\x10\n\x1c\x16\n\x12\nÿ\n\'\x10\n\x16\n\x12\n\x18\nÿ\n\x1c\n\x12\nÿ\n-\n\x19\x11\n1\n\x12\nÿ\n\x1cÿ\n#\x11\n\x12\n\x1cÿ\n\x1c\n\x10\nÿ\n*\x18\n\x12\nR\x126\nÿ\n/\x16\n\x12\n\x0e\n& \x10\n\x12\n\x15\n\x12\nÿ\n%\x10\n\x18\x11\n\x16\n\x10\nÿ\n:\x12\x13\n\x12\n\x1c\x0e\nÿ\n*\x19\n\x11\n\x19\n\x10\n+\x10\nÿ\n\x10\nÿ\n&\x10\nR\x11\n\x16\n\x10\n+\x10\nÿ\n\x15ÿ\n/*-\n2\n2\'<\nÿ\n+\nÿ\n#S\n\x11\n\x16\n\x12\n\x17\n\x19\n\x1c \x12\n\x18\nÿ\n*\x1c\n\x1b\x15\x11\n\x16\n\x12\n\x11\n\x1d\x0e\nÿ\nÿ\nÿ\nÿ\nÿ\nÿ\nÿ\nÿ\nÿ\nÿ\nÿ\nÿ\nÿ\nÿ\nÿ\nÿ\nÿ\nÿ\nÿ\nÿ\n*\x11\n\x10\n\x15 \x12\n\x1b\x10\n\x15\n\x11\n\x10\n6\nTU\nV\nWU\nXÿ\nYXÿ\nTU\nV\nW\nX\nXYZU\n[U\nT\\]X\\U\nW\nX\nVD\n^\n_\n`\nÿ\nab\nÿ\nXGb\nc\nE^\nd\nO\nP\nO\nQ\nP\ne\nO\nf\nP\nf\nJ\nf\nP\ne\ng\nGb\nh_\nEGI\niaA\nYjTk\nXlm@ YjTk\nXlmX] ]jTk@[Yj] U\nZk]U\nZU\n] X]noU\nW\nX] W@V\n\\\nX]\nÿ\n89\nÿ\n89\np ÿ\nq\n(\x10\x14\n\x12\x13\n8r\nIOV\x11\x03\x14\n(VWH\x03GRFXPHQWR\x03p\x03FySLD\x03GR\x03RULJLQDO\x03DVVLQDGR\x03GLJLWDOPHQWH\x03SRU\x03(00$18(/$\x030$5,$\x03&$/$\'2\x03\'(\x03)$5,$6\x036,/9$\x11\x033DUD\x03FRQIHULU\x03R\x03RULJLQDO\x0f\x03DFHVVH\x03R\x03VLWH\x03\x0f\x03LQIRUPH\x03R\x03SURFHVVR\x03\x13\x13\x13\x13\x16\x17\x18\x10\x1a\x18\x11\x15\x13\x15\x14\x11\x1b\x11\x13\x15\x11\x13\x13\x1a\x16\x03H\x03R\x03\nFyGLJR\x03\x17(\x14\x14\x16\x14\x13\x11\x03

The text above was extracted from page 25 of this document using pdftotext.

For that page, pdffonts outputs:

name                                 type              encoding         emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------
[none]                               Type 3            Custom           yes no  no      13  0
DIIDPF+ArialMT                       CID TrueType      Identity-H       yes yes yes    131  0
DIIEDH+Arial                         CID TrueType      Identity-H       yes yes no     137  0
DIIEBG+TimesNewRomanPSMT             CID TrueType      Identity-H       yes yes yes    142  0
DIIEDG+Arial                         CID TrueType      Identity-H       yes yes no     148  0
Arial                                TrueType          WinAnsi          yes no  no     159  0

It's easy to identify the font named [none] as problematic. My take so far, given the data I've analysed, is to mark fonts with a Custom or Identity-H encoding, no ToUnicode map, or a [none] name as likely problematic. But, as I said, I also found problematic fonts that have a ToUnicode table and an encoding that is not Custom. As far as I know, it's also possible that only a single character uses a broken font without affecting the overall readability of the page, so OCRing that page might not be necessary. In other words, if a font on a given page has no ToUnicode mapping, that does not mean the page's text is entirely affected.
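Ignoring the cost issue for a moment, that heuristic in code reads roughly like this (a sketch that parses the fixed-width pdffonts output, using the dashed ruler line to find the column boundaries):

import subprocess

def problematic_fonts(pdf_path, page):
    """Run pdffonts on one page and flag fonts that look problematic:
    [none] name, Custom/Identity-H encoding, or no ToUnicode map."""
    out = subprocess.run(["pdffonts", "-f", str(page), "-l", str(page), pdf_path],
                         capture_output=True, text=True, check=True).stdout
    lines = out.splitlines()
    if len(lines) < 3:                 # no fonts listed on this page
        return []
    ruler = lines[1]                   # "---- ---- ..." gives the column widths
    widths = [len(chunk) for chunk in ruler.split()]
    flagged = []
    for line in lines[2:]:
        fields, pos = [], 0
        for width in widths:
            fields.append(line[pos:pos + width].strip())
            pos += width + 1
        name, encoding, uni = fields[0], fields[2], fields[5]
        if name == "[none]" or encoding in ("Custom", "Identity-H") or uni == "no":
            flagged.append(name)
    return flagged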

I'm looking for a solution that is better than running a regex over the extracted text to spot garbling.

Examples of PDF pages that I had to OCR

All the pages below contain text in Portuguese, but if you try to copy the text and paste it somewhere, you will see universal gibberish.

What I've done so far

I've avoided calling subprocess twice per page by creating a bash script that iterates over the pages and merges the pdftohtml and pdffonts output for each one into a single HTML document:

#!/bin/sh

# Usage: ./font_report.sh -a 1 -b 100 -c foo.pdf


while getopts "a:b:c:" arg; do
    case $arg in
        a) FIRST_PAGE=$OPTARG;;
        b) LAST_PAGE=$OPTARG;;
        c) FILENAME=$OPTARG;;
        *)
            echo 'Error: invalid options' >&2
            exit 1
    esac
done

: ${FIRST_PAGE:?Missing -a}
: ${LAST_PAGE:?Missing -b}
: ${FILENAME:?Missing -c}

if ! [ -f "$FILENAME" ]; then
    echo "Error: $FILENAME does not exist" >&2
    exit 1
fi

echo "<html xmlns='http://www.w3.org/1999/xhtml' lang='' xml:lang=''>" ;

for page in $(seq "$FIRST_PAGE" "$LAST_PAGE")
do
   {
       echo "<page number=$page>" ;
       echo "<pdffonts>" ;
       pdffonts -f "$page" -l "$page" "$FILENAME" ;
       echo "</pdffonts>" ;
       (
           pdftohtml -f "$page" -l "$page" -s -i -fontfullname -hidden "$FILENAME" -stdout |
           tail -n +35 |  # skips head tag and its content
           head -n -1  # skips html ending tag
        ) ;
       echo "</page>"
    }
done

echo "</html>"

The code above lets me call subprocess only once per file and then parse the HTML with lxml, page by page (using the <page> tag). But I still need to look at the text content to get an idea of whether the text is broken.
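On the Python side, the merged report is then parsed once per file, roughly like this (a sketch; lxml's forgiving HTML parser is assumed, since the concatenated output is not strictly well formed):

import subprocess
from lxml import html

def per_page_report(pdf_path, first, last):
    """Call font_report.sh once and return {page: (pdffonts output, text blocks)}."""
    out = subprocess.run(
        ["./font_report.sh", "-a", str(first), "-b", str(last), "-c", pdf_path],
        capture_output=True, text=True, check=True).stdout
    tree = html.fromstring(out)             # recovers from the loose markup
    report = {}
    for page in tree.xpath("//page"):
        number = int(page.get("number"))
        fonts_table = (page.findtext("pdffonts") or "").strip()
        blocks = [p.text_content() for p in page.xpath(".//p")]
        report[number] = (fonts_table, blocks)
    return report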

Kfcaio
  • The problem is that there are a lot of ways to make text extraction fail, and many of them aren't that easy to recognize. Thus, either you identify lots and lots of indicators for broken fonts or you ignore the fonts and instead analyze the text output using dictionaries. – mkl Jul 31 '21 at 21:03
  • Could you please elaborate on analyzing text output using dictionaries? – Kfcaio Jul 31 '21 at 21:47
  • 1
    Essentially I'd propose you collect all the words you have in the extracted text and check whether a high enough part of that collection can be found in a dictionary (well, a list of words of the language in question). Maybe one can try and reduce that to checking whether there are enough indicator letter groups to check for (as @KJ has tried with a single such group in his answer), but I'd start with the full word dictionaries. This test may be flanked by additional tests (e.g. page filling images might indicate scanned pages). – mkl Aug 03 '21 at 08:22

4 Answers

3

A quick function based on pdftotext

Here is a full (rewritten) function that scans for bad pages:

#!/bin/bash

findBadPages() {
    local line opts progress=true usage="Usage: ${FUNCNAME[0]} [-f first page]"
    usage+=' [-l last page] [-m min bad/page] [-q (quiet)]'
    local -a pdftxt=(pdftotext -layout - -)   # reads the PDF on stdin, writes text to stdout
    local -ia badpages=()                     # strange-character count, indexed by page number
    local -i page=1 limit=10 OPTIND
    while getopts "ql:f:m:" opt;do
        case $opt in
            f ) pdftxt+=(-f $OPTARG); page=$OPTARG ;;
            l ) pdftxt+=(-l $OPTARG) ;;
            m ) limit=$OPTARG ;;
            q ) progress=false ;;
            * ) printf >&2 '%s ERROR: Unknown option!\n%s\n' \
                           "${FUNCNAME[0]}" "$usage";return 1 ;;
        esac
    done
    shift $((OPTIND-1))
    while IFS= read -r line; do
        # pdftotext separates pages with a form feed: bump the page counter
        [ "$line" = $'\f' ] && page+=1 && $progress && printf %d\\r $page
        # whatever survives the tr below is counted as strange for this page
        ((${#line} > 1 )) && badpages[page]+=${#line}
    done < <(
        # delete every character expected in the (Portuguese) text, keep the rest
        tr -d '0-9a-zA-Z\047"()[]{}<>,-./+?!$&@#:;%$=_ºÁÃÇÔàáâãçéêíóôõú– ' < <(
            "${pdftxt[@]}" <"$1"
    ))
    for page in ${!badpages[@]} ;do
        (( ${badpages[page]} > limit )) && {
            $progress && printf "There are %d strange characters in page %d\n" \
               ${badpages[page]} $page || echo $page ;}
    done
}

Some sample runs:

findBadPages DJE_3254_I_18062021\(1\).pdf
There are 2237 strange characters in page 23
There are 258 strange characters in page 24
There are 20 strange characters in page 32

findBadPages -m 100 -f 40 -l 100 DJE_3254_I_18062021.pdf 
There are 623 strange characters in page 80
There are 1068 strange characters in page 81
There are 1258 strange characters in page 82
There are 1269 strange characters in page 83
There are 1245 strange characters in page 84
There are 256 strange characters in page 85

findBadPages DJE_3254_III_18062021.pdf
There are 11 strange characters in page 125
There are 635 strange characters in page 145

findBadPages -qm100 DJE_3254_III_18062021.pdf 
145

findBadPages -h
/bin/bash: illegal option -- h
findBadPages ERROR: Unknown option!
Usage: findBadPages [-f first page] [-l last page] [-m min bad/page] [-q (quiet)]

Usage:

findBadPages [-f INTEGER] [-l INTEGER] [-m INTEGER] [-q] <pdf file>

Where

  • -f lets you specify the first page.
  • -l the last page.
  • -m the minimum number of strange characters per page required to report that page.
  • -q suppresses the page-number progress display and prints only the bad page numbers.

Note:

The string used by tr -d: 0-9a-zA-Z\047"()[]{}<>,-./:;%$=_ºÁÃÇÔàáâãçéêíóôõú– was built by sorting the characters used in your PDF files! It may not match another language! Adding some more accented characters or other missed printable characters may become necessary for future use.

F. Hauri - Give Up GitHub
1

@mkl may be onto something with his suggestion to use a word-dictionary search.

I have tried different methods looking for a simple way to detect the two bad pages in your smallest (3rd) example, and I suspect it can easily be defeated by a page that mixes good and bad text, so this is not a complete answer and almost certainly needs more passes to refine.

We have to accept that, since you are asking the question, each PDF is of unknown quality but still needs to be processed. So, hoping that the majority of pages are good, we run through the burst stage blindly.

A very common word construct in Portuguese contains the 3-letter group "est", so if we search the burst files we see it is missing from page 23 and page 24, which makes them good candidates for corruption.


Likewise, for the 855-page file you say page 146 is a problem (confirmed by my previous search method for corrupted-only pages, the ones containing ���; just that one is corrupt), but now we can easily see more in the first 40 pages.

OCR is certainly also needed for some pages (including those that are image only):

pages 4, 5, 8, 9, 10 and 35 (35 is a very odd page, a background image?);

but we get false positives for 2 pages, 19 and 33 (they do have text, just no est or EST),

and pages 20, 32 and 38 have EST, so the search needs to be case insensitive.

So, using a case-insensitive search without further modification, we should get roughly 95% confidence (2 wrong in 40) about which pages need OCR. I did not dig into why lowercase est was only found on 275 of the total 855 pages; perhaps a very high percentage of the later pages are images needing OCR.

I had previously suggested searching the third, 6054-page file much faster by looking for ???????, which gives a more erratic result to work with, but it does show that the corruption covers ALL text pages from 25 to 85.

So where does that lead us?

In reality it would be very rare to find a legitimate use of ???.

Corrupt pages usually contain ??? or ���.

In Portuguese, a corrupt page is much less likely to contain est (matched case-insensitively).

Partly corrupt pages may contain a large image that needs OCR, and may or may not contain est.

A few corrupt pages will match none of the above.
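As a very rough sketch of that indicator-group idea (the groups, the threshold and the ??? / � test are only illustrative and would need tuning against your real files):

import re

# a few letter groups that are very common in readable Portuguese text
INDICATORS = ("est", "que", "çã", " de ")

def probably_corrupt(page_text, min_hits=3):
    """Crude test: a readable Portuguese page should contain several of the
    indicator groups; a corrupted page usually contains almost none of them,
    or is full of ? / replacement characters instead."""
    text = page_text.lower()
    if "\ufffd" in text or "???" in text:
        return True
    hits = sum(len(re.findall(group, text)) for group in INDICATORS)
    return hits < min_hits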

K J
  • 1
    If reducing the dictionary approach to indicator letter groups, I'd propose using multiple check groups and then requiring multiple hits. Some statistical analysis of representative example documents might be apropos. – mkl Aug 03 '21 at 08:27
  • It's definitively a start. How usable it is, depends on the pdf files the op actually has to deal with. – mkl Aug 03 '21 at 12:53
  • @Kfcaio Have you tried K J's approach? Unless you check and tell us how well you fare with proposed solutions applied to some other of your files, you can hardly expect improved answers. – mkl Aug 03 '21 at 12:57
  • @mkl Have a look at [my answer](https://stackoverflow.com/a/68684371/1765658) – F. Hauri - Give Up GitHub Aug 07 '21 at 14:52
-1

Since this is also (or mainly) a performance problem, I would suggest modifying your code into a more multi-threaded solution, or simply using GNU Parallel.

Pretty nice article about it -> link
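For example, something along these lines could fan the per-page pdftotext calls out over a pool of workers (just a sketch; the external processes do the heavy lifting, so threads are sufficient):

import subprocess
from concurrent.futures import ThreadPoolExecutor

def extract_page(args):
    pdf_path, page = args
    out = subprocess.run(
        ["pdftotext", "-layout", "-f", str(page), "-l", str(page), pdf_path, "-"],
        capture_output=True, text=True)
    return page, out.stdout

def extract_all(pdf_path, num_pages, workers=8):
    pages = ((pdf_path, p) for p in range(1, num_pages + 1))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(extract_page, pages))

The same idea applies to whatever per-page check or OCR step you end up running.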

Michal
  • @KJ Well, I'm not so sure about that, because even the "font_report.sh" posted above is not coded as a multi-threaded solution. The OP is focused on "perfect detection", which is more complicated than simply speeding up the current solution + OCRing via tesseract or something (also in parallel). – Michal Aug 04 '21 at 15:51
-4

Try to use another module in order to extract the text correctly; I suggest PyPDF2.

Here is a function which should fix the issue:

import PyPDF2

def extract_text(filename, page_number):
    # Returns the text content of the given page (1-based page number)
    with open(filename, 'rb') as pdf_file_object:
        pdf_reader = PyPDF2.PdfFileReader(pdf_file_object)
        # page_number - 1 because PyPDF2 counts pages from 0
        page_object = pdf_reader.getPage(page_number - 1)
        return page_object.extractText()

By the way, PyPDF2 isn't preinstalled with Python. To install it, make sure pip is available (it very likely already is) and run 'pip install PyPDF2' from the command line.

Dharman
Lvn Rtg