I'm using the Mechanize Ruby gem to click a button on a web page, download the resulting PDF file, and save it to the local file system.

URL = "http://www.my-site.com" # placeholder; Mechanize needs an absolute URL including the scheme
agent = Mechanize.new
agent.pluggable_parser.pdf = Mechanize::File # FYI I have also tried Mechanize::FileSaver and Mechanize::Download here

page = agent.get(URL)
form = page.forms.first
button = form.button_with(:value => "Some Button Text")

local_file = "path/to/file.pdf"
response = agent.submit(form, button)
response.save_as(local_file)

But when I try to read this PDF file using the PDF::Reader gem, I get an error "PDF does not contain EOF marker".

reader = PDF::Reader.new(local_file) # this also happens if I try to use PDF::Reader.new(response.body) and PDF::Reader.new(response.body_io) depending on the different pluggable_parser configurations mentioned above
#> PDF::Reader::MalformedPDFError: PDF does not contain EOF marker

I'm able to save the PDF locally and view it and it looks fine, but the PDF::Reader gem is complaining about it missing an EOF marker.

So my question is: is there a way I could add an EOF marker into the PDF or something to get around this error so I can parse the PDF?

Thanks.

Related (unanswered) question: PDF does not contain EOF marker (PDF::Reader::MalformedPDFError) with pdf-reader

EDIT:

I found the EOF marker somewhere in the middle of the downloaded file contents, followed by some HTML-looking content that I can't figure out how to get rid of. I want to isolate the PDF content and then parse that, but I'm still running into issues. Here is the full script I am using: https://gist.github.com/s2t2/c6766846d024edd696586b2bc7fee0bf

s2t2
  • Have you tried just appending an EOF to the end of the document? – Jörg W Mittag Apr 01 '17 at 07:34
  • How can one do that? – s2t2 Apr 01 '17 at 13:42
  • The best way would be to reconsider the content being saved (`response.save(local_file)`)... but you might try opening the PDF with a different Ruby reader (e.g., CombinePDF or a pdftk-based reader) and see if it can overcome the error... though it's better to not have an error than to dynamically fix it every time. – Myst Sep 02 '17 at 11:20

1 Answer

The issue seems to be with the website you're accessing: http://employmentsummary.abaquestionnaire.org

They add HTML data at the end of the response.

However, you could truncate the response by searching for the first occurrence of the substring %%EOF (the PDF end-of-file marker) and removing all the data after it.

For example:

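pdf_data = response.body
eof_index = pdf_data.index("%%EOF") # nil if the marker is missing
if eof_index.nil?
   # handle error
else
   # keep everything up to and including the marker, dropping the trailing HTML
   pdf_data = pdf_data[0, eof_index + "%%EOF".length]
   # save/send pdf_data
end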
Myst
  • I'm getting `ArgumentError: string contains null byte`. Can you share a working script based on my gist? – s2t2 Sep 06 '17 at 16:37
  • @s2t2 I'm away from my computer at the moment, but this issue sounds like a String encoding issue. Perhaps try changing the string to binary encoding before you manipulate its content? – Myst Sep 06 '17 at 17:50
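
Following up on the binary-encoding suggestion, here is a minimal sketch of how the truncation might look using only the Ruby standard library. It has not been run against the gist or the site in question. The helper name `truncate_after_pdf_eof` and the sample bytes are hypothetical, and the sample string stands in for `response.body`. It searches for the last `%%EOF` rather than the first, because a PDF saved with incremental updates can contain several markers; that only works if the trailing HTML does not itself contain `%%EOF`. One possible cause of `string contains null byte` is that `PDF::Reader.new` treats a plain String as a file name, so the sketch wraps the bytes in a `StringIO` instead; that explanation is an assumption, not something confirmed from the gist.

```ruby
require "stringio"

PDF_EOF_MARKER = "%%EOF".b

# Hypothetical helper (not part of Mechanize or PDF::Reader): returns the bytes
# up to and including the last "%%EOF" marker, or nil if no marker is found.
def truncate_after_pdf_eof(raw)
  data = raw.b # binary (ASCII-8BIT) copy; the input string is not mutated
  eof_index = data.rindex(PDF_EOF_MARKER)
  return nil if eof_index.nil?

  data[0, eof_index + PDF_EOF_MARKER.bytesize]
end

# Stand-in for response.body: fake PDF bytes followed by trailing HTML.
raw = "%PDF-1.4\n\x00\xFFbinary\n%%EOF\n<html><body>extra</body></html>".b
pdf_bytes = truncate_after_pdf_eof(raw)

# Wrap the bytes in an IO-like object so they are not mistaken for a file path.
pdf_io = StringIO.new(pdf_bytes)
# reader = PDF::Reader.new(pdf_io) # assumes the pdf-reader gem is installed
```

If you would rather go through the file system, writing `pdf_bytes` with `File.binwrite(local_file, pdf_bytes)` and then passing `local_file` to the reader should behave the same way.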