pdftotext cannot read certain documents

Question

I am currently using pdftotext to read PDF files into Python with the following code:

import pdftotext
bill_full = []

with open('sample.pdf', "rb") as f:
    pdf = pdftotext.PDF(f)
    bill = ''
    for page in pdf:
        bill = bill + page
    bill_full.append(bill)

The code above works for most of my dataset, but I run into seemingly random errors. Applied to the following PDF, https://legiscan.com/WI/text/AB649/id/456434/Wisconsin-2009-AB649-Introduced.pdf, it results in:

2011 − 2012 LEGISLATURE LRB−1478/1 2011 SENATE BILL 27\r\n\r\n\r\n\r\n\r\n    March 1, 2011 − Introduced by JOINT COMMITTEE             ON   FINANCE. Referred to Joint\r\n        Committee on Finance.\r\n\r\n\r\n\r\n\r\n1   AN ACT         relating to: state finances and appropriations, constituting the\r\n\r\n2        executive budget act of the 2011 legislature.\r\n\r\n\r\n                      Analysis by the Legislative Reference Bureau\r\n                                        INTRODUCTION\r\n

However, when applied to others (e.g. https://legiscan.com/WI/text/AB408/id/423828/Wisconsin-2009-AB408-Introduced.pdf), I get the following sequence of characters:

 \x08\x08\x11 \x06 \x08 \x08 \x1c\x18\x1a\x1b"\x1c\x14#$!\x18

What is different in these two PDFs? Ideally I would like to detect "unreadable" PDFs and drop them from my analysis.

I actually provided two examples of PDFs that cannot be read, here's one that does work: https://legiscan.com/WI/text/AJR53/id/364543/Wisconsin-2011-AJR53-Introduced.pdf — ZMV, Oct 18 '21 at 16:11

K J · Accepted Answer · 2021-10-19T00:58:44.260

To answer the direct question: what differs is the CID data, so let's look at just one object on page 1 of each file. Here I pick the subject of your question, the first text, which includes the digits 1 2 9 0, the letters L E G I S A T U R, and the other characters in the title.

Here we see that, good or bad, they are all stored with the same font type, ??????+PSOwstnewcspsb. The name is unclear to me, but it seems to follow the lines of PSO WeSTern NEW Courier ??? Bold.

So why would some then work, being mapped correctly by, say, OCR, and some not? That is unknown to me, and there is often no clear rhyme or reason. We can, however, see a difference in the font definitions: the good file starts at the printable space character (/FirstChar 32/LastChar 116), whilst both of the non-working ones start at zero (/FirstChar 0/LastChar ##, approximately 66), i.e. they include a non-standard printing range. That is not in itself an indicator of a bad font, although in other bad examples I have seen /FirstChar 2 give a hint of a poorly defined font. The problem with searching for /FirstChar is that it may be encrypted or encoded, and so cannot be looked for in many PDFs until they are disassembled.
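As a rough illustration of the /FirstChar comparison above, the sketch below scans the raw bytes of a PDF for /FirstChar entries. It is a heuristic under stated assumptions only: the helper names first_char_values and has_low_first_char are made up for this example, and the search sees only entries stored uncompressed and unencrypted, so an empty result says nothing about the fonts.

```python
import re

# Matches "/FirstChar 32" or "/FirstChar32" in uncompressed PDF syntax.
FIRST_CHAR_RE = re.compile(rb"/FirstChar\s*(\d+)")


def first_char_values(pdf_bytes):
    """Return every /FirstChar value visible in the raw bytes (hypothetical helper)."""
    return [int(value) for value in FIRST_CHAR_RE.findall(pdf_bytes)]


def has_low_first_char(pdf_bytes, printable_start=32):
    """True if any visible font starts below the printable range (a hint, not proof)."""
    return any(v < printable_start for v in first_char_values(pdf_bytes))


# Synthetic font dictionaries modelled on the two cases described above.
good_font = b"<</Type/Font/FirstChar 32/LastChar 116>>"
bad_font = b"<</Type/Font/FirstChar 0/LastChar 66>>"
print(has_low_first_char(good_font), has_low_first_char(bad_font))  # prints: False True
```

For a real file, the bytes would come from open('sample.pdf', 'rb').read(). As noted above, a low /FirstChar is a hint rather than proof of a bad font.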

The only good indication of bad characters is that an otherwise good plain-text extraction contains non-printable characters.
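That check can be sketched in a few lines of Python. This is a minimal sketch, not a definitive test: the helper names and the 5% threshold are assumptions chosen for illustration, and text stands for whatever string pdftotext returned for a page.

```python
# Whitespace control characters that legitimately appear in extracted text.
ALLOWED_CONTROLS = {"\n", "\r", "\t", "\f"}


def control_char_ratio(text):
    """Fraction of characters that are control codes below 0x20, whitespace excluded."""
    if not text:
        return 0.0
    bad = sum(1 for ch in text if ord(ch) < 32 and ch not in ALLOWED_CONTROLS)
    return bad / len(text)


def looks_unreadable(text, threshold=0.05):
    """Flag text whose share of control characters exceeds the (assumed) threshold."""
    return control_char_ratio(text) > threshold


# Shortened versions of the two outputs shown in the question.
good = "2011 - 2012 LEGISLATURE\r\n\r\nMarch 1, 2011"
bad = " \x08\x08\x11 \x06 \x08 \x08 \x1c\x18\x1a\x1b\"\x1c\x14#$!\x18"
print(looks_unreadable(good), looks_unreadable(bad))  # prints: False True
```

Applied per page rather than per file, for example [i for i, page in enumerate(pdf) if looks_unreadable(page)] with pdf being the pdftotext.PDF object from the question, this also covers documents where only some pages are bad.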

You say you wish to avoid files with a bad construct, but many files may have only parts of pages that are bad. For a wider example of this issue, see How to identify likely broken pdf pages before extracting its text?

The link you provided was useful. For newbies in Python like me: I used a shortcut without bash functions. Essentially, I check the first page of each document using ''.join(sorted(pdf[0]))[0] == '\n' and ignore any document where that is false, i.e. whose first page contains "\x" control characters (which sort before the line break, \n). — ZMV, Oct 19 '21 at 07:42
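The shortcut in this comment can be written with min(), which returns the same character as ''.join(sorted(text))[0] without building the sorted string. The sketch below uses a hypothetical helper name. One caveat it makes visible: only code points below \n (0x00 to 0x09) are caught, because characters such as \x1c sort after the line break.

```python
def first_page_passes(text):
    """The comment's check: the smallest character on the page is a line break."""
    return bool(text) and min(text) == "\n"


readable = "AN ACT\r\nrelating to: state finances"
garbled = " \x08\x08\x11 \x06 \x08\n"
missed = "abc\n\x1c\x18"  # control codes above \n slip through this check

print(first_page_passes(readable))  # prints: True
print(first_page_passes(garbled))   # prints: False
print(first_page_passes(missed))    # prints: True
```

A page that contains a tab, or no line break at all, also fails this check, so it suits a quick first pass better than a strict filter.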
