
I have a list of URLs of PDF files that I want to download from different sites.

In Firefox, I have chosen the option to save PDF files directly to a particular folder.

My plan was to use WWW::Mechanize::Firefox in Perl to download each file in the list, one by one, using Firefox, and then rename the file after download.

I used the following code to do it:

    use WWW::Mechanize::Firefox;
    use File::Copy;

    # @list contains the list of links to pdf files
    foreach $x (@list) {
        my $mech = WWW::Mechanize::Firefox->new(autoclose => 1);
        $mech->get($x);  #This downloads the file using firefox in desired folder

        opendir(DIR, "output/download");
        @FILES= readdir(DIR);
        my $old = "output/download/$FILES[2]";
        move ($old, $new);  # $new holds the new filename
    }

When I run the script, it opens the first link in Firefox, and Firefox downloads the file to the desired directory. But after that, the new tab is not closed, the file does not get renamed, and the code keeps running (as if it has hit an endless loop); no further file gets downloaded.

What is going on here? Why isn't the code working? How do I close the tab and make the code process all the files in the list? Is there an alternate way to download?

Pawan Samdani
  • That's not the code you are using. That code has several obvious errors that mean it will never run. (1) You loop with `$x` but try to get an undefined `$link`. (2) You assume `$FILES[2]` contains the file you just downloaded. (3) There is no built-in `move` sub in Perl - it's called `rename`. – Richard Huxton Mar 11 '14 at 09:54
  • The `$link` was a typo while editing the question; it is `$x`, which points to the links. I have tested that `$FILES[2]` points to the file, as my directory has only one file and the first two elements of the array are `.` and `..`. And `move` is a function from File::Copy. I have made the changes to the question. – Pawan Samdani Mar 11 '14 at 11:31
  • Assuming that you actually tested the second half, and that you verified that the command that hangs is the `get` (easiest by inserting a print after it), I would assume that whatever the `get` uses to determine whether everything has loaded does not work well with your automated downloading. You could try `get_local` and/or use the `content_file` option to download the file instead of some Firefox automated behavior that your script does not know about. Or you drop Firefox and just use WWW::Mechanize without the fancy candy of it actually remote-controlling some browser you can watch. – DeVadder Mar 11 '14 at 11:37
  • I cannot use WWW::Mechanize, as I need to open the links in a browser: the PDF files are accessed using an off-campus proxy (EZProxy), which needs authentication. That does not work with Mechanize alone. The `get_local` method is for loading local files, and `content_file` is in WWW::Mechanize, not WWW::Mechanize::Firefox. – Pawan Samdani Mar 11 '14 at 12:08
  • You are right about `get_local`. But while probably much more complicated, WWW::Mechanize can use proxies. However, I can clearly see `:content_file` as an option for `get` in the Mechanize::Firefox documentation. And I still would argue that it is a better idea to do everything from within the script instead of having Firefox automatically download PDF links. Also, Mechanize::Firefox returns faked HTTP::Response objects, so `->content` might work as well. Lastly, if you're not in a hurry, you could add some large enough timeout to the `get` call, ignoring that it doesn't know when it is done. – DeVadder Mar 11 '14 at 12:53
  • Thanks @DeVadder. I realized that `get` might be waiting for a page-load response from Firefox before proceeding. As Firefox was downloading the files, there was no page being loaded. Thus, I set `get` not to wait for a response and added a timeout of 60 seconds, and it worked. – Pawan Samdani Mar 12 '14 at 06:46

2 Answers


Solved the problem.

The function

    $mech->get()

waits for the 'DOMContentLoaded' event, which Firefox fires when a page finishes loading. As I had set Firefox to download the files automatically, no page was ever loaded, so the 'DOMContentLoaded' event was never fired. That is what made my code hang.

I set the function not to wait for the page to load by using the following option:

    $mech->get($x, synchronize => 0);

After this, I added a 60-second delay to allow Firefox to download the file before the code progresses:

    sleep 60;
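A fixed 60-second sleep either wastes time on small files or falls short on large ones. As a rough alternative sketch (the helper name and the `.part` filter are my own assumptions, not part of the original answer), one could poll the download directory until a single finished file appears and its size stops changing; Firefox writes in-progress downloads to `*.part` files, which the filter skips:

```perl
use strict;
use warnings;
use File::Spec;

# Sketch only: poll $dir until exactly one finished file exists and its
# size has stopped growing, or give up after $max_wait seconds.
# Returns the file's path on success, undef on timeout.
sub wait_for_download {
    my ($dir, $max_wait) = @_;
    my $last_size = -1;
    for (1 .. $max_wait) {
        opendir(my $dh, $dir) or die "Cannot open $dir: $!";
        # Skip '.', '..' and Firefox's in-progress *.part files.
        my @files = grep { !/^\.\.?$/ && !/\.part$/ } readdir($dh);
        closedir($dh);
        if (@files == 1) {
            my $path = File::Spec->catfile($dir, $files[0]);
            my $size = (-s $path) || 0;
            # Done once the size is non-zero and unchanged since last poll.
            return $path if $size > 0 && $size == $last_size;
            $last_size = $size;
        }
        sleep 1;
    }
    return;  # timed out
}
```

The loop could then call `my $old = wait_for_download("output/download", 120);` in place of the fixed `sleep 60`.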

Thus, my final code looks like this:

    use WWW::Mechanize::Firefox;
    use File::Copy;

    # @list contains the list of links to the PDF files
    foreach $x (@list) {
        my $mech = WWW::Mechanize::Firefox->new(autoclose => 1);

        $mech->get($x, synchronize => 0);  # do not wait for the page-load event
        sleep 60;                          # give Firefox time to finish the download

        opendir(DIR, "output/download") or die "Cannot open download directory: $!";
        @FILES = readdir(DIR);
        closedir(DIR);
        my $old = "output/download/$FILES[2]";
        move($old, $new);  # $new holds the new filename
    }
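As a side note, relying on `$FILES[2]` being the downloaded file is fragile: `readdir` returns entries in no guaranteed order, so `.` and `..` need not be the first two elements. A small sketch of a safer approach (the helper name is my own, not from the answer) picks the most recently modified regular file instead:

```perl
use strict;
use warnings;
use File::Spec;

# Return the most recently modified regular file in $dir, or undef when
# the directory holds none. This avoids assuming that $FILES[2] is the
# freshly downloaded file.
sub newest_file {
    my ($dir) = @_;
    opendir(my $dh, $dir) or die "Cannot open $dir: $!";
    my @paths = grep { -f }
                map  { File::Spec->catfile($dir, $_) }
                grep { !/^\.\.?$/ } readdir($dh);
    closedir($dh);
    # Sort by mtime, newest first.
    my ($newest) = sort { (stat($b))[9] <=> (stat($a))[9] } @paths;
    return $newest;
}
```

The loop body could then use `my $old = newest_file("output/download");` before the `move`.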
Pawan Samdani

If I understood you correctly, you have links to the actual PDF files. In that case, WWW::Mechanize is most likely easier than WWW::Mechanize::Firefox; in fact, I think that is almost always the case. Then again, watching the browser work is certainly cooler.

    use strict;
    use warnings;

    use WWW::Mechanize;

    # your code here
    # loop

        my $mech = WWW::Mechanize->new();    # could (should?) be created outside of the loop
        $mech->agent_alias("Linux Mozilla"); # optionally pretend to be whatever browser you want

        $mech->get($link);                   # $link is the current PDF URL
        $mech->save_content($new);           # $new is the target filename

    # end of the loop

If that is absolutely not what you wanted, my cover story will be that I did not want to break my 666 rep!

DeVadder
  • I tried setting the proxy in WWW::Mechanize, but it doesn't work, as it requires a page to be always open to authenticate the proxy. Even if I keep the authentication page open in Firefox and use WWW::Mechanize, it doesn't work. Thus, I need WWW::Mechanize::Firefox. – Pawan Samdani Mar 12 '14 at 06:50