Saving PDF files with WWW::Mechanize corrupts them

Question

I'm trying to write a script that logs into Bank of America and downloads PDF statements. I've managed all the difficult tricks, but I'm hung up on saving the PDF files. I've tried both the ':content_file' => "some file path" method and $mech->save_content("same file path"). Usually either of these works fine (even for PDFs). A typical BoA PDF statement is 4 pages long and about 400k in size.

If I use the former method, it truncates the file to 33k, and it's unopenable by Preview on the Mac (but I can see the PDF header and EPS binary gibberish in Sublime). If I use the latter method, it saves the file with 95 extra bytes (compared to downloading it in Chrome) which somehow screws up the second page (of 4). The only visually obvious difference is that the Mechanize-downloaded file has an extra line containing the character '0' and a few newlines at the end. diff reports "Binary files 2014-06-19 Statement.pdf and eStmt_2014-06-19.pdf differ". I have no idea how to determine the remaining 92 bytes of difference.

Oooh, found something: using save_content(), every few hundred lines in the PDF, I get a newline, the string "8000", and another trailing newline... then the binary picks up again. Not sure what that is. Looks like there are 10 instances of this (so that accounts for another 50 of the extra bytes).
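Since diff only reports that the binary files differ, one way to see where they diverge is to compare them byte by byte. The following is a minimal sketch using only core Perl; the helper names (slurp_raw, first_difference, hex_context) are made up for illustration and are not part of WWW::Mechanize.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Read a whole file as raw bytes (no newline or encoding translation).
sub slurp_raw {
    my ($path) = @_;
    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    local $/;    # slurp mode: read everything in one go
    my $bytes = <$fh>;
    close $fh;
    return $bytes;
}

# Return the offset of the first differing byte, or -1 if identical.
sub first_difference {
    my ($left, $right) = @_;
    return -1 if $left eq $right;

    # XOR of two byte strings yields NUL bytes wherever they agree.
    my $xor = $left ^ $right;
    $xor =~ /^\0*/;
    my $offset = $+[0];

    # If one string is a prefix of the other, the difference starts
    # where the shorter one ends.
    my $shorter = length($left) < length($right) ? length($left) : length($right);
    return $offset < $shorter ? $offset : $shorter;
}

# Render $count bytes starting at $offset as space-separated hex.
sub hex_context {
    my ($bytes, $offset, $count) = @_;
    return '' if $offset >= length $bytes;
    return join ' ', map { sprintf '%02x', ord } split //, substr($bytes, $offset, $count);
}

# Usage: perl bytediff.pl "2014-06-19 Statement.pdf" eStmt_2014-06-19.pdf
if (@ARGV == 2) {
    my ($left, $right) = map { slurp_raw($_) } @ARGV;
    my $offset = first_difference($left, $right);
    if ($offset < 0) {
        print "Files are identical\n";
    }
    else {
        print "First difference at byte $offset\n";
        print "  file 1: ", hex_context($left,  $offset, 16), "\n";
        print "  file 2: ", hex_context($right, $offset, 16), "\n";
    }
}
```

If the extra bytes turn out to be 0d 0a pairs around short hexadecimal text such as "8000", with a lone "0" at the very end, that would look like HTTP chunk-size lines from chunked transfer encoding left in the body. Nothing in this thread confirms that, so treat it as a hypothesis to check rather than a diagnosis.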

Does anyone have any idea what could be going on here?

I have the following code:

#!/usr/bin/perl
use strict;
use warnings;

use WWW::Mechanize;
use Date::Parse;
use DateTime;
use File::Path;

########################################################################################################################
#                Change only the configuration settings in this section, nothing above or below it.                    #
########################################################################################################################

# Credentials
my $username = "someusername";
my $password = "somepassword";

# Enclose value in double quotes, folders with spaces in the name are ok.
my $root_folder = "/Users/john/Documents/Important/Credit Card Statements";

########################################################################################################################
########################################################################################################################

# Suddenly web robot.
my $mech = WWW::Mechanize->new();
$mech->agent_alias('Mac Safari');

# First we have to log in.
$mech->get("https://www.bankofamerica.com/");

# Login, blah.
$mech->submit_form(
  form_name => 'frmSignIn',
  fields  => { Access_ID => $username },
);

# Dumb thing uses a meta refresh...
$mech->follow_link(url_regex => qr/signOn\.go/);

# This is what they call two factor authentication. Heh.
$mech->submit_form(
  form_name => 'ConfirmSitekeyForm',
  fields  => { password => $password },
);

# Just the single account for now... maybe make this a loop later?
#for my $link ($mech->find_all_links(url_regex => qr/redirect\.go.+?target=acctDetails/)) {
$mech->follow_link(url_regex => qr/redirect\.go.+?target=acctDetails/);

# We need the last four digits, easiest here.
my ($fourdigits) = $mech->content() =~ /<span class="bold TL_NPI_AcctName">.+? - (\d{4})</;

# Go to the account details page... 
$mech->follow_link(url_regex => qr/redirect\.go.+?target=statements/);

# Now we need to select which documents we want...
# I'm assuming that you're running this daily in cron. Therefore, we're only going to search the last 60 days.
my $mech2 = $mech->clone();

$mech2->submit_form(
  form_name => 'statementsAndDocTab',
  fields  => { docItemSelected   => 'All',
               dateRangeSelected => '60D',
               selectedDocCode   => 'All',
               selectedDateRange => '60D',
             },
);

# These are nasty javascripty links. I think I have to post to this damn thing, to get a pdf response back. Need to
# regex-loop.
my $page = $mech2->content();
while ($page =~ /id="hidden-documentId\d+" value="(\d+)" name="statement-name".+?onclick="docInboxModuleAccountSkin.downloadLayerSubmit\(this,'downloadPdf','(.+?)', '(.+?)','([0-9\/]+)','(.+?)'/gs) {
    my $documentId = $1;
    my $actionurl = "https://secure.bankofamerica.com" . $2 . "&nocache=" . sprintf("%05d", int(rand(100000)));
    my $docName = $3;
    my $boadate = $4;
    my $documentTypeId = $5;
    # Parse the statement date once and derive both values from it.
    my $dt = DateTime->from_epoch(epoch => str2time($boadate));
    my $year = $dt->year;
    my $date = $dt->ymd;

    # There are more than just statements here. What do we name the files?
    my $filename;
    if    ($docName =~ m/Change in Terms/i) { $filename = "$date Change in Terms.pdf"; }
    elsif ($docName =~ m/Statement/i)       { $filename = "$date Statement.pdf"; }
    else                                    { $filename = "$date Unknown.pdf"; }

    # We may need to create a folder for the year...
    File::Path::make_path("$root_folder/Bank of America - $fourdigits/$year");

    # Get the file.
    unless (-f "$root_folder/Bank of America - $fourdigits/$year/$filename") {
        my $pdf = $mech2->clone();
        # Normally we'd just do $pdf->get(), but we need to do a submit_form. Unfortunately, the form doesn't exist,
        # javascript creates it in place. Ugh.
        $pdf->post( $actionurl,
         #           ':content_file' => "$root_folder/Bank of America - $fourdigits/$year/$filename",
                    [ documentId     => $documentId,
                      menu           => 'downloadPdf',
                      viewDownload   => 'downloadPdf',
                      date           => $boadate,
                      docName        => $docName,
                      documentTypeId => $documentTypeId,
                      version        => '',
                    ],
        );

        $pdf->save_content("$root_folder/Bank of America - $fourdigits/$year/$filename");

        # Let's do a notification...
        #system("/usr/local/bin/terminal-notifier -message \"Bank of America document dated $date has been downloaded.\" -title \"Statement Retrieved\" ");
    }
}

Does it work if you also dump the pdf content 'directly' - e.g. open the file handle manually? Another thing that this sounds a little like, is that it's saving ascii not binary - the manpage refers to setting 'binmode' when doing save_content — Sobrique, Jul 19 '14 at 20:59
The content-saving code is correct, either format. And since the script expects a response of application/pdf, Mechanize is more than smart enough to set the mode to binary for you... I was making a bad request. $actionurl has a query parameter that javascript attaches (nocache=random-5digit-number). You attach that, and it runs perfectly. I have no idea why. Thank you to those who responded. I'll edit in the fix. — John O, Jul 20 '14 at 04:28
What version of Net::HTTP ? There was a bug [80670](https://rt.cpan.org/Public/Bug/Display.html?id=80670) with similar symptoms using https if you had IO::Socket::SSL installed. — runrig, Jul 31 '14 at 17:17
@runrig It was server-side. I have no idea why, but if you make a malformed request, the PDF was garbled. If you make a correct request, it comes down fine. I need to quit running to Stackoverflow every time I don't get it perfect on the first try. — John O, Jul 31 '14 at 17:36
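Sobrique's suggestion in the comments above, writing the response body through a manually opened file handle, can be sketched as follows. This is a minimal illustration using core Perl only. The write_raw helper is a made-up name, and the commented usage lines assume the $pdf object from the question's script. As the later comments note, the saving code was not the cause here, so this is mainly a way to rule the write path out.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Write a byte string to disk with no newline or encoding translation.
# Returns the number of bytes handed to the file.
sub write_raw {
    my ($path, $bytes) = @_;
    open my $fh, '>:raw', $path or die "Cannot open $path for writing: $!";
    print {$fh} $bytes or die "Write to $path failed: $!";
    close $fh or die "Cannot close $path: $!";
    return length $bytes;
}

# Assumed usage with the question's $pdf object (not run here):
#   my $res   = $pdf->response;                            # HTTP::Response
#   my $bytes = $res->decoded_content(charset => 'none');  # undo Content-Encoding only
#   write_raw("$root_folder/Bank of America - $fourdigits/$year/$filename", $bytes);
```

Comparing the returned byte count with the size of the same statement downloaded in a browser shows whether the body was already wrong before it reached the disk.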

score 1 · Answer 1 · answered Jul 19 '14 at 21:03

From a quick look at the save_content method in the WWW::Mechanize documentation, one thing that might be worth trying is:

$mech->save_content( $filename, binary => 1 );

The problem you describe is similar to what you get when saving binary data in ASCII mode.

answered Jul 19 '14 at 21:03

Sobrique

I remembered that myself, just tried it... I end up getting the 33k file again. Not sure how to check, but it looks like a connection reset or something. – John O Jul 19 '14 at 21:06
