
I'm trying the QuickStart from https://github.com/yasserg/crawler4j

I did the following steps to test the example:

0) Add crawler4j.jar to the Java libraries

1) Create a Java package called mycrawler

2) Paste the QuickStart code into the class MyCrawler

3) Run

package mycrawler;

import java.util.Set;
import java.util.regex.Pattern;

import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.parser.HtmlParseData;
import edu.uci.ics.crawler4j.url.WebURL;

public class MyCrawler extends WebCrawler {

    private final static Pattern FILTERS = Pattern.compile(".*(\\.(css|js|gif|jpg"
                                                           + "|png|mp3|mp4|zip|gz))$");

    /**
     * This method receives two parameters. The first parameter is the page
     * in which we have discovered this new url and the second parameter is
     * the new url. You should implement this function to specify whether
     * the given url should be crawled or not (based on your crawling logic).
     * In this example, we are instructing the crawler to ignore urls that
     * have css, js, gif, ... extensions and to only accept urls that start
     * with "http://www.ics.uci.edu/". In this case, we didn't need the
     * referringPage parameter to make the decision.
     */
     @Override
     public boolean shouldVisit(Page referringPage, WebURL url) {
         String href = url.getURL().toLowerCase();
         return !FILTERS.matcher(href).matches()
                && href.startsWith("http://www.ics.uci.edu/");
     }

     /**
      * This function is called when a page is fetched and ready
      * to be processed by your program.
      */
     @Override
     public void visit(Page page) {
         String url = page.getWebURL().getURL();
         System.out.println("URL: " + url);

         if (page.getParseData() instanceof HtmlParseData) {
             HtmlParseData htmlParseData = (HtmlParseData) page.getParseData();
             String text = htmlParseData.getText();
             String html = htmlParseData.getHtml();
             Set<WebURL> links = htmlParseData.getOutgoingUrls();

             System.out.println("Text length: " + text.length());
             System.out.println("Html length: " + html.length());
             System.out.println("Number of outgoing links: " + links.size());
         }
    }
}

Result: Error: mycrawler.mycrawler class wasn't found in mycrawler project.

No main classes found

***How to solve this? I'm new to Java.***

    You seem to be missing `import` statements in your code. All those underlines in red are probably trying to tell you the problems. – khelwood Sep 08 '16 at 12:17
  • @khelwood I added them. Still not working; missing main class/method – evabb Sep 08 '16 at 12:18
    Keep reading the crawler4j readme, you still didn't finish. You need to create the `Controller` class, that contains the `main` method – Jose Rui Santos Sep 08 '16 at 12:20

3 Answers


Your class extends `WebCrawler`, but there is no indication of how Java could resolve that class.

You need to add an import statement to locate it.

Moreover, if you want to run your class, you need a `public static void main(String[] args)` method.
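The entry-point requirement can be sketched with plain Java (no crawler4j involved; the class name and message here are illustrative only):

```java
// Minimal sketch: the JVM starts execution at a method with exactly
// this signature, so a runnable project needs at least one class
// that declares it.
public class Main {

    public static void main(String[] args) {
        System.out.println(greeting());
    }

    // Helper only so the behavior is easy to check; not required by the JVM.
    static String greeting() {
        return "started from main";
    }
}
```

In this project, that `main` method belongs in the `Controller` class from the crawler4j QuickStart, not in `MyCrawler` itself.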

– YMomb

You seem to be using NetBeans. I suggest pressing Ctrl-Shift-I (Fix all class imports). Once there are no errors in the class, it will compile.

Then you need to define an entry point for your program, which in Java is a `public static void main(String[] args)` method. Code in that method is executed when you run the file as a main class.

I suggest you get someone to give you an introduction to Java, as you will probably not be able to complete your task just by following the QuickStart of the library you want to use.


I think you have forgotten to implement the controller, as per the documentation:

> You should also implement a controller class which specifies the seeds of the crawl, the folder in which intermediate crawl data should be stored, and the number of concurrent threads.

import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;

public class Controller {
    public static void main(String[] args) throws Exception {
        String crawlStorageFolder = "/data/crawl/root";
        int numberOfCrawlers = 7;

        CrawlConfig config = new CrawlConfig();
        config.setCrawlStorageFolder(crawlStorageFolder);

        /*
         * Instantiate the controller for this crawl.
         */
        PageFetcher pageFetcher = new PageFetcher(config);
        RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
        RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);
        CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);

        /*
         * For each crawl, you need to add some seed urls. These are the first
         * URLs that are fetched and then the crawler starts following links
         * which are found in these pages
         */
        controller.addSeed("http://www.ics.uci.edu/~lopes/");
        controller.addSeed("http://www.ics.uci.edu/~welling/");
        controller.addSeed("http://www.ics.uci.edu/");

        /*
         * Start the crawl. This is a blocking operation, meaning that your code
         * will reach the line after this only when crawling is finished.
         */
        controller.start(MyCrawler.class, numberOfCrawlers);
    }
}
– Gal Nitzan
  • Do you mean the class Controller should be added to the code? However, I have added the class Controller below class MyCrawler and it is still not working – evabb Sep 08 '16 at 19:26
  • Well, I do not understand what you mean by "added below"; however, from what you shared at the top, there is an indication that you are missing `main`, which is the entry point to a Java program! – Gal Nitzan Sep 08 '16 at 19:45