
I've been going at this for 4 hours now, and I simply can't see what I'm doing wrong. I have two files:

  1. MyCrawler.java
  2. Controller.java

MyCrawler.java

import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.parser.HtmlParseData;
import edu.uci.ics.crawler4j.url.WebURL;
import java.util.List;
import java.util.regex.Pattern;
import org.apache.http.Header;

public class MyCrawler extends WebCrawler {

    private final static Pattern FILTERS = Pattern.compile(".*(\\.(css|js|bmp|gif|jpe?g" + "|png|tiff?|mid|mp2|mp3|mp4"
                    + "|wav|avi|mov|mpeg|ram|m4v|pdf" + "|rm|smil|wmv|swf|wma|zip|rar|gz))$");

    /**
     * You should implement this function to specify whether the given url
     * should be crawled or not (based on your crawling logic).
     */
    @Override
    public boolean shouldVisit(WebURL url) {
            String href = url.getURL().toLowerCase();
            return !FILTERS.matcher(href).matches() && href.startsWith("http://www.ics.uci.edu/");
    }

    /**
     * This function is called when a page is fetched and ready to be processed
     * by your program.
     */
    @Override
    public void visit(Page page) {
            int docid = page.getWebURL().getDocid();
            String url = page.getWebURL().getURL();
            String domain = page.getWebURL().getDomain();
            String path = page.getWebURL().getPath();
            String subDomain = page.getWebURL().getSubDomain();
            String parentUrl = page.getWebURL().getParentUrl();
            String anchor = page.getWebURL().getAnchor();

            System.out.println("Docid: " + docid);
            System.out.println("URL: " + url);
            System.out.println("Domain: '" + domain + "'");
            System.out.println("Sub-domain: '" + subDomain + "'");
            System.out.println("Path: '" + path + "'");
            System.out.println("Parent page: " + parentUrl);
            System.out.println("Anchor text: " + anchor);

            if (page.getParseData() instanceof HtmlParseData) {
                    HtmlParseData htmlParseData = (HtmlParseData) page.getParseData();
                    String text = htmlParseData.getText();
                    String html = htmlParseData.getHtml();
                    List<WebURL> links = htmlParseData.getOutgoingUrls();

                    System.out.println("Text length: " + text.length());
                    System.out.println("Html length: " + html.length());
                    System.out.println("Number of outgoing links: " + links.size());
            }

            Header[] responseHeaders = page.getFetchResponseHeaders();
            if (responseHeaders != null) {
                    System.out.println("Response headers:");
                    for (Header header : responseHeaders) {
                            System.out.println("\t" + header.getName() + ": " + header.getValue());
                    }
            }

            System.out.println("=============");
    }
}

Controller.java

package edu.crawler;

import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.parser.HtmlParseData;
import edu.uci.ics.crawler4j.url.WebURL;
import java.util.List;
import java.util.regex.Pattern;

import org.apache.http.Header;

import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;

public class Controller 
{

    public static void main(String[] args) throws Exception 
    {
            String crawlStorageFolder = "../data/";
            int numberOfCrawlers = 7;

            CrawlConfig config = new CrawlConfig();
            config.setCrawlStorageFolder(crawlStorageFolder);

            /*
             * Instantiate the controller for this crawl.
             */
            PageFetcher pageFetcher = new PageFetcher(config);
            RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
            RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);
            CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);

            /*
             * For each crawl, you need to add some seed urls. These are the first
             * URLs that are fetched and then the crawler starts following links
             * which are found in these pages
             */
            controller.addSeed("http://www.ics.uci.edu/~welling/");
            controller.addSeed("http://www.ics.uci.edu/~lopes/");
            controller.addSeed("http://www.ics.uci.edu/");

            /*
             * Start the crawl. This is a blocking operation, meaning that your code
             * will reach the line after this only when crawling is finished.
             */
            controller.start(MyCrawler, numberOfCrawlers);
    }
}

The Structure is as follows:

java/MyCrawler.java
java/Controller.java
jars/... --> all the jars crawler4j

I try to compile this on a WINDOWS machine using:

javac -cp "C:\xampp\htdocs\crawlcrowd\www\java\jars\*;C:\xampp\htdocs\crawlcrowd\www\java\*" MyCrawler.java

This works perfectly, and I end up with:

java/MyCrawler.class

However, when I type:

javac -cp "C:\xampp\htdocs\crawlcrowd\www\java\jars\*;C:\xampp\htdocs\crawlcrowd\www\java\*" Controller.java

it bombs out with:

Controller.java:50: error: cannot find symbol
            controller.start(MyCrawler, numberOfCrawlers);
                             ^
  symbol:   variable MyCrawler
  location: class Controller
1 error

So I think I'm missing something that would make this new executable class "aware" of MyCrawler.class. I have tried fiddling with the classpath in the javac command line, and I've also tried setting it in my environment variables... no luck.

Any idea how I can get this to work?

UPDATE

I got most of this code from the Google Code page itself. But I just can't figure out what must go there. Even if I try this:

MyCrawler mc = new MyCrawler();

No luck. Somehow Controller.class does not know about MyCrawler.class.

UPDATE 2

I don't think it matters, since the problem is clearly that it can't find the class, but either way, here is the signature of CrawlController's start() method. Taken from here.

   /**
     * Start the crawling session and wait for it to finish.
     * 
     * @param _c
     *            the class that implements the logic for crawler threads
     * @param numberOfCrawlers
     *            the number of concurrent threads that will be contributing in
     *            this crawling session.
     */
    public <T extends WebCrawler> void start(final Class<T> _c, final int numberOfCrawlers) {
            this.start(_c, numberOfCrawlers, true);
    }

I am in fact passing a "crawler", since I'm passing in "MyCrawler". The problem is that the application doesn't know what MyCrawler is.

rockstardev

3 Answers


A couple of things come to mind:

  1. Is your MyCrawler extending edu.uci.ics.crawler4j.crawler.WebCrawler?

    public class MyCrawler extends WebCrawler
    
  2. Are you passing in MyCrawler.class (i.e., as a class) into controller.start?

    controller.start(MyCrawler.class, numberOfCrawlers);
    

Both of these need to be satisfied in order for the controller to compile and run. Also, Crawler4j has some great examples here:

https://code.google.com/p/crawler4j/source/browse/src/test/java/edu/uci/ics/crawler4j/examples/basic/BasicCrawler.java

https://code.google.com/p/crawler4j/source/browse/src/test/java/edu/uci/ics/crawler4j/examples/basic/BasicCrawlController.java

These two classes will compile and run right away (run BasicCrawlController), so they're a good starting place if you are running into any issues.
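As for why start() wants a class token rather than an instance: the controller needs to create one fresh crawler object per worker thread, so it takes the Class and constructs the instances itself. Here is a stripped-down sketch of that pattern using only the JDK (MiniController and its nested classes are made up for illustration; they are not crawler4j's actual implementation):

```java
import java.util.ArrayList;
import java.util.List;

public class MiniController {

    // Stand-in for crawler4j's WebCrawler base class.
    static class WebCrawler {
        public void visit(String url) {
            System.out.println(getClass().getSimpleName() + " visiting " + url);
        }
    }

    static class MyCrawler extends WebCrawler { }

    // The controller receives the class token and builds one crawler
    // per thread reflectively -- this is why an instance won't do.
    static <T extends WebCrawler> List<T> start(Class<T> crawlerClass, int numberOfCrawlers)
            throws Exception {
        List<T> crawlers = new ArrayList<>();
        for (int i = 0; i < numberOfCrawlers; i++) {
            crawlers.add(crawlerClass.getDeclaredConstructor().newInstance());
        }
        return crawlers;
    }

    public static void main(String[] args) throws Exception {
        // MyCrawler.class is a class literal (a Class<MyCrawler> object);
        // a bare "MyCrawler" would be parsed as a variable name, which is
        // exactly the "cannot find symbol" error in the question.
        List<MyCrawler> crawlers = start(MyCrawler.class, 3);
        System.out.println("Created " + crawlers.size() + " crawlers");
    }
}
```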

Jordan

The parameters for start() should be a crawler class and the number of crawlers. It's throwing an error because you are passing in a bare identifier rather than the crawler class itself. Call the start method as shown below and it should work:

controller.start(MyCrawler.class, numberOfCrawlers);
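To see the syntax distinction with plain JDK classes: MyCrawler.class is a class literal, an expression that evaluates to a java.lang.Class object, whereas a bare MyCrawler in expression position is resolved as a variable name, which is exactly the "cannot find symbol: variable MyCrawler" error in the question. A minimal, JDK-only illustration:

```java
public class ClassLiteralDemo {
    public static void main(String[] args) {
        // A class literal evaluates to a java.lang.Class object.
        Class<String> token = String.class;
        System.out.println(token.getName()); // prints "java.lang.String"

        // Passing a bare "String" as an argument here would not compile:
        // the compiler would look for a *variable* named String.
    }
}
```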

Here you are passing the class name MyCrawler as a parameter:

controller.start(MyCrawler, numberOfCrawlers);

I think a bare class name should not be a parameter.

I am also working a little bit on crawling!

Vishwajit R. Shinde