
I'm trying to write my own Crawljax 3.6 plugin in Java. It should tell Crawljax (a well-known web crawler) to also download the files it finds on web pages (PDFs, images, and so on). I don't want only the HTML or the actual DOM tree; I would like access to the files (PDF, JPG) it finds.

How can I tell Crawljax to download PDF files, images, and so on?

Thanks for any help!

This is what I have so far: a new class using the default plugin (CrawlOverview):

import java.io.IOException;
import java.util.concurrent.TimeUnit;

import com.crawljax.browser.EmbeddedBrowser.BrowserType;
import com.crawljax.core.CrawljaxRunner;
import com.crawljax.core.configuration.BrowserConfiguration;
import com.crawljax.core.configuration.CrawljaxConfiguration;
import com.crawljax.core.configuration.CrawljaxConfiguration.CrawljaxConfigurationBuilder;
import com.crawljax.plugins.crawloverview.CrawlOverview;

/**
 * Example of running Crawljax with the CrawlOverview plugin on a single-page
 * web app. The crawl will produce output using the {@link CrawlOverview}
 * plugin.
 */
public final class Main {
    private static final long WAIT_TIME_AFTER_EVENT = 200;
    private static final long WAIT_TIME_AFTER_RELOAD = 20;
    private static final String URL = "http://demo.crawljax.com";

    /**
     * Run this method to start the crawl.
     *
     * @throws IOException
     *             when the output folder cannot be created or emptied.
     */
    public static void main(String[] args) throws IOException {
        CrawljaxConfigurationBuilder builder = CrawljaxConfiguration
                .builderFor(URL);
        builder.addPlugin(new CrawlOverview());


        builder.crawlRules().insertRandomDataInInputForms(false);
        // click these elements
        builder.crawlRules().clickDefaultElements();
        builder.crawlRules().click("div");
        builder.crawlRules().click("a");
        builder.setMaximumStates(10);
        builder.setMaximumDepth(3);
        // Set timeouts
        builder.crawlRules().waitAfterReloadUrl(WAIT_TIME_AFTER_RELOAD,
                TimeUnit.MILLISECONDS);
        builder.crawlRules().waitAfterEvent(WAIT_TIME_AFTER_EVENT,
                TimeUnit.MILLISECONDS);


        // Use a single Firefox browser.
        builder.setBrowserConfig(new BrowserConfiguration(BrowserType.FIREFOX,
                1));
        CrawljaxRunner crawljax = new CrawljaxRunner(builder.build());
        crawljax.call();

    }
}
John Smith
  • How to get and display all the data? http://stackoverflow.com/questions/27936719/how-to-get-crawl-content-in-crawljax – BasK Jan 14 '15 at 06:03

2 Answers


As far as images are concerned, I don't see any problem; Crawljax loads these just fine for me.

On the PDF topic: unfortunately, Crawljax is hardcoded to skip links to PDF files.

See com.crawljax.core.CandidateElementExtractor:342:

/**
 * @param href
 *            the string to check
 * @return true if href has the pdf or ps pattern.
 */
private boolean isFileForDownloading(String href) {
    final Pattern p = Pattern.compile(".+.pdf|.+.ps|.+.zip|.+.mp3");
    Matcher m = p.matcher(href);

    if (m.matches()) {
        return true;
    }

    return false;
}

This could be solved by modifying the Crawljax source and introducing a configuration option for the pattern above.
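For illustration, a minimal sketch of such a patch, assuming you build Crawljax from source; the field and setter names below are hypothetical, not part of the actual Crawljax API:

// Hypothetical patch inside CandidateElementExtractor: replace the
// hardcoded pattern with a configurable one (names are illustrative).
private Pattern fileDownloadPattern =
        Pattern.compile(".+.pdf|.+.ps|.+.zip|.+.mp3");

public void setFileDownloadPattern(Pattern pattern) {
    this.fileDownloadPattern = pattern;
}

private boolean isFileForDownloading(String href) {
    return fileDownloadPattern.matcher(href).matches();
}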

After that, the usual Selenium limitations regarding non-HTML files apply: a PDF is either rendered in Firefox's JavaScript PDF viewer, a download pop-up appears, or the file is downloaded directly. It is somewhat possible to interact with the JavaScript viewer; it is not possible to interact with the download pop-up, but if automatic downloading is enabled, the file is saved to disk.

If you would like to set Firefox to automatically download files without popping up a download dialog:

import javax.inject.Provider;

import org.openqa.selenium.firefox.FirefoxDriver;
import org.openqa.selenium.firefox.FirefoxProfile;

import com.crawljax.browser.EmbeddedBrowser;
import com.crawljax.browser.WebDriverBackedEmbeddedBrowser;

static class MyFirefoxProvider implements Provider<EmbeddedBrowser> {

    @Override
    public EmbeddedBrowser get() {
        FirefoxProfile profile = new FirefoxProfile();
        // 2 = save downloads to the custom directory set below
        profile.setPreference("browser.download.folderList", 2);
        profile.setPreference("browser.download.dir", "/tmp");
        // MIME types to save to disk without asking
        profile.setPreference("browser.helperApps.neverAsk.saveToDisk",
            "application/octet-stream,application/pdf,application/x-gzip");

        // disable Firefox's built-in PDF viewer
        profile.setPreference("pdfjs.disabled", true);
        // disable Adobe Acrobat PDF preview plugin
        profile.setPreference("plugin.scan.plid.all", false);
        profile.setPreference("plugin.scan.Acrobat", "99.0");

        FirefoxDriver driver = new FirefoxDriver(profile);

        return WebDriverBackedEmbeddedBrowser.withDriver(driver);
    }
}

And use the newly created MyFirefoxProvider:

BrowserConfiguration bc = new BrowserConfiguration(
        BrowserType.FIREFOX, 1, new MyFirefoxProvider());
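
This configuration is then passed to the builder, just as in the Main class from the question:

builder.setBrowserConfig(bc);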

Obtain the links manually: parse the result of getStrippedDom() with Jsoup, select the elements matching the CSS selector a[href], iterate through them, and download each target with an HttpURLConnection / HttpsURLConnection.
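A minimal sketch of that approach, assuming Jsoup is on the classpath; the output directory, the file-naming scheme, and the downloadLinks entry point are illustrative, and you would call it with the stripped DOM and the URL of the state being visited (for example from a plugin callback):

import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class LinkDownloader {

    // Where downloaded files are stored (assumption, adjust as needed).
    private static final Path OUT_DIR = Paths.get("/tmp/crawljax-downloads");

    /**
     * Parses the stripped DOM with Jsoup, selects all anchors with an
     * href attribute, and downloads each absolute link.
     */
    public static void downloadLinks(String strippedDom, String baseUrl)
            throws IOException {
        Files.createDirectories(OUT_DIR);
        Document doc = Jsoup.parse(strippedDom, baseUrl);
        for (Element link : doc.select("a[href]")) {
            String url = link.absUrl("href"); // resolves relative links
            if (url.startsWith("http")) {
                download(url);
            }
        }
    }

    private static void download(String url) throws IOException {
        // HttpsURLConnection extends HttpURLConnection, so this cast
        // also covers https links.
        HttpURLConnection conn =
                (HttpURLConnection) new URL(url).openConnection();
        try (InputStream in = conn.getInputStream()) {
            // Naive file name: last path segment of the URL.
            String name = url.substring(url.lastIndexOf('/') + 1);
            if (name.isEmpty()) {
                name = "index.html";
            }
            Files.copy(in, OUT_DIR.resolve(name),
                    StandardCopyOption.REPLACE_EXISTING);
        } finally {
            conn.disconnect();
        }
    }
}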

James Taylor