I am currently using pdftotext to read PDF files into Python with the following code:
import pdftotext

bill_full = []
with open('sample.pdf', "rb") as f:
    pdf = pdftotext.PDF(f)
    bill = ''
    for page in pdf:
        bill = bill + page
    bill_full.append(bill)
This code works for most of my dataset, but I run into seemingly random errors. For example, applying it to the PDF at https://legiscan.com/WI/text/AB649/id/456434/Wisconsin-2009-AB649-Introduced.pdf results in:
2011 − 2012 LEGISLATURE LRB−1478/1 2011 SENATE BILL 27\r\n\r\n\r\n\r\n\r\n March 1, 2011 − Introduced by JOINT COMMITTEE ON FINANCE. Referred to Joint\r\n Committee on Finance.\r\n\r\n\r\n\r\n\r\n1 AN ACT relating to: state finances and appropriations, constituting the\r\n\r\n2 executive budget act of the 2011 legislature.\r\n\r\n\r\n Analysis by the Legislative Reference Bureau\r\n INTRODUCTION\r\n
However, when applied to others (e.g. https://legiscan.com/WI/text/AB408/id/423828/Wisconsin-2009-AB408-Introduced.pdf), I get the following sequence of characters:
\x08\x08\x11 \x06 \x08 \x08 \x1c\x18\x1a\x1b"\x1c\x14#$!\x18
What is different in these two PDFs? Ideally I would like to detect "unreadable" PDFs and drop them from my analysis.
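One possible way to detect such output, sketched here as a heuristic rather than a definitive test, is to measure how much of the extracted text consists of control characters. The helper names `control_char_ratio` and `looks_unreadable` and the 0.3 threshold are my own assumptions, not part of pdftotext, and the threshold would need tuning on the real dataset.

```python
import unicodedata

# Whitespace that legitimately appears in extracted text (assumption:
# spaces, newlines, tabs and form feeds are layout, not garbage).
IGNORED = set(" \n\r\t\f")


def control_char_ratio(text):
    """Return the share of non-layout characters that are control characters."""
    considered = [ch for ch in text if ch not in IGNORED]
    if not considered:
        return 0.0
    # Unicode category "Cc" covers control characters such as \x08 or \x1c.
    bad = sum(1 for ch in considered if unicodedata.category(ch) == "Cc")
    return bad / len(considered)


def looks_unreadable(text, threshold=0.3):
    """Heuristic: flag text that is blank or dominated by control characters.

    The threshold is a hypothetical default, not a value taken from any library.
    """
    if not text.strip():
        return True
    return control_char_ratio(text) > threshold


# Samples based on the two outputs shown above.
readable = "2011 \u2212 2012 LEGISLATURE LRB\u22121478/1 2011 SENATE BILL 27\r\n"
garbled = '\x08\x08\x11 \x06 \x08 \x08 \x1c\x18\x1a\x1b"\x1c\x14#$!\x18'

print(looks_unreadable(readable))
print(looks_unreadable(garbled))
```

With something like this, the loop above could skip a file by calling `bill_full.append(bill)` only when `looks_unreadable(bill)` is false. It would not catch PDFs that extract to plausible-looking but wrong characters.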