I'm trying to write my own crawljax 3.6 plugin in Java. It should tell crawljax which is a very famous web-crawler to also download files, which he finds on webpages. (PDF, Images, and so on). I don't want only the HTML or actual DOM-Tree. I would like to get access to the files (PDF, jpg) he finds.
How can I tell crawljax to download PDF files, images and so on?
Thanks for any help!
This is what I have so far -a new Class using the default plugin (CrawlOverview):
import java.io.File;
import java.io.IOException;
import java.util.concurrent.TimeUnit;
import org.apache.commons.io.FileUtils;
import com.crawljax.browser.EmbeddedBrowser.BrowserType;
import com.crawljax.condition.NotXPathCondition;
import com.crawljax.core.CrawlSession;
import com.crawljax.core.CrawljaxRunner;
import com.crawljax.core.configuration.BrowserConfiguration;
import com.crawljax.core.configuration.CrawljaxConfiguration;
import com.crawljax.core.configuration.CrawljaxConfiguration.CrawljaxConfigurationBuilder;
import com.crawljax.core.configuration.Form;
import com.crawljax.core.configuration.InputSpecification;
import com.crawljax.plugins.crawloverview.CrawlOverview;
/**
* Example of running Crawljax with the CrawlOverview plugin on a single-page
* web app. The crawl will produce output using the {@link CrawlOverview}
* plugin.
*/
public final class Main {
private static final long WAIT_TIME_AFTER_EVENT = 200;
private static final long WAIT_TIME_AFTER_RELOAD = 20;
private static final String URL = "http://demo.crawljax.com";
/**
* Run this method to start the crawl.
*
* @throws IOException
* when the output folder cannot be created or emptied.
*/
public static void main(String[] args) throws IOException {
CrawljaxConfigurationBuilder builder = CrawljaxConfiguration
.builderFor(URL);
builder.addPlugin(new CrawlOverview());
builder.crawlRules().insertRandomDataInInputForms(false);
// click these elements
builder.crawlRules().clickDefaultElements();
builder.crawlRules().click("div");
builder.crawlRules().click("a");
builder.setMaximumStates(10);
builder.setMaximumDepth(3);
// Set timeouts
builder.crawlRules().waitAfterReloadUrl(WAIT_TIME_AFTER_RELOAD,
TimeUnit.MILLISECONDS);
builder.crawlRules().waitAfterEvent(WAIT_TIME_AFTER_EVENT,
TimeUnit.MILLISECONDS);
// We want to use two browsers simultaneously.
builder.setBrowserConfig(new BrowserConfiguration(BrowserType.FIREFOX,
1));
CrawljaxRunner crawljax = new CrawljaxRunner(builder.build());
crawljax.call();
}
}