How to adapt the URL that I want to crawl in crawler4j

Question

I tried modifying the code from the crawler4j Quickstart example.

I want to crawl the following link

https://www.google.com/search?biw=1366&bih=645&tbm=nws&q=%22obama%22&oq=%22obama%22&gs_l=serp.3..0l5.825041.826084.0.826833.5.5.0.0.0.0.187.572.2j3.5.0....0...1c.1.64.serp..0.3.333...0i13k1.Tmd9nARKIrU

which is a Google News search link for the keyword "obama".

I tried modifying MyCrawler.java:

 @Override
 public boolean shouldVisit(Page referringPage, WebURL url) {
     String href = url.getURL().toLowerCase();
     return !FILTERS.matcher(href).matches()
            && href.startsWith("https://www.google.com/search?biw=1366&bih=645&tbm=nws&q=%22obama%22&oq=%22obama%22&gs_l=serp.3..0l5.825041.826084.0.826833.5.5.0.0.0.0.187.572.2j3.5.0....0...1c.1.64.serp..0.3.333...0i13k1.Tmd9nARKIrU/");
 }

I also modified Controller.java:

 /*
  * For each crawl, you need to add some seed urls. These are the first
  * URLs that are fetched and then the crawler starts following links
  * which are found in these pages
  */
  // controller.addSeed("http://www.ics.uci.edu/~lopes/");
  // controller.addSeed("http://www.ics.uci.edu/~welling/");
  controller.addSeed("https://www.google.com/search?biw=1366&bih=645&tbm=nws&q=%22obama%22&oq=%22obama%22&gs_l=serp.3..0l5.825041.826084.0.826833.5.5.0.0.0.0.187.572.2j3.5.0....0...1c.1.64.serp..0.3.333...0i13k1.Tmd9nARKIrU");

 /*
  * Start the crawl. This is a blocking operation, meaning that your code
  * will reach the line after this only when crawling is finished.
  */
  controller.start(MyCrawler.class, numberOfCrawlers);

When I run it, it shows the following error:

SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
BUILD SUCCESSFUL (total time: 43 seconds)

Is my code modification wrong?

Update

I tried a URL other than the Google search link, and it works. I'm guessing crawler4j cannot crawl the Google search link. Any idea how to tackle this?

The error states that `org.slf4j.impl.StaticLoggerBinder` was not loaded. Do you have all the dependencies imported properly? – Samuel Kok, Sep 13 '16 at 03:23
There is no SLF4J binding available. Usually it's caused by a missing jar or a misconfiguration. http://www.slf4j.org/manual.html – Samuel Kok, Sep 13 '16 at 03:46

Samuel Kok · Answer 1 · 2016-09-13T07:49:31.850


The error you're receiving has nothing to do with your code modification. Instead, it is related to incorrect configuration and missing jars.

SLF4J needs a binding on the classpath in order to perform logging; otherwise it falls back to the no-operation (NOP) logger implementation, as the error message shows.

To resolve this issue, add an SLF4J binding jar file to your project, such as slf4j-simple-<version>.jar.
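If you want to confirm whether a binding is visible at runtime, here is a minimal sketch that probes the classpath for the class named in the warning. It assumes the SLF4J version in use is one that looks up `org.slf4j.impl.StaticLoggerBinder` (as the message above indicates); the class name `Slf4jBindingCheck` and the helper `isClassPresent` are hypothetical names used only for illustration.

```java
public class Slf4jBindingCheck {
    // Class name taken from the SLF4J warning in the question.
    static final String BINDER_CLASS = "org.slf4j.impl.StaticLoggerBinder";

    // Hypothetical helper: returns true if the named class can be found
    // on the classpath, without initializing it.
    static boolean isClassPresent(String className) {
        try {
            Class.forName(className, false, Slf4jBindingCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        if (isClassPresent(BINDER_CLASS)) {
            System.out.println("An SLF4J binding is on the classpath.");
        } else {
            System.out.println("No SLF4J binding found; add a binding jar such as slf4j-simple.");
        }
    }
}
```

Run it with the same classpath as the crawler; if it reports no binding, the jar is probably not being picked up by the project.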

You may refer to the SLF4J Manual for a more detailed explanation.

Update

I don't think you're allowed to crawl Google search results: Google's robots.txt disallows crawling of paths that begin with /search, and their TOS restricts it as well:

Don’t misuse our Services. For example, don’t interfere with our Services or try to access them using a method other than the interface and the instructions that we provide. You may use our Services only as permitted by law, including applicable export and re-export control laws and regulations. We may suspend or stop providing our Services to you if you do not comply with our terms or policies or if we are investigating suspected misconduct.
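To illustrate the robots.txt point, here is a simplified sketch of how a `Disallow` prefix rule excludes the search URL. This is not crawler4j's actual implementation; it ignores `Allow` rules, wildcards, and user-agent groups, and the class `RobotsPrefixCheck`, the method `isDisallowed`, and the one-entry rule list are hypothetical, based only on the `/search` rule mentioned above.

```java
import java.net.URI;
import java.util.Arrays;
import java.util.List;

public class RobotsPrefixCheck {
    // Simplified check: a URL is treated as disallowed when its path
    // starts with any of the given Disallow prefixes.
    static boolean isDisallowed(String url, List<String> disallowedPrefixes) {
        String path = URI.create(url).getPath();
        if (path == null || path.isEmpty()) {
            path = "/";
        }
        for (String prefix : disallowedPrefixes) {
            if (!prefix.isEmpty() && path.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Assumed rule list, reduced to the single rule discussed here.
        List<String> rules = Arrays.asList("/search");
        System.out.println(isDisallowed("https://www.google.com/search?tbm=nws&q=obama", rules));
        System.out.println(isDisallowed("http://www.ics.uci.edu/~lopes/", rules));
    }
}
```

A crawler that honors robots.txt would skip the first URL even when it is added as a seed, which would match the behavior you are seeing.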

You may consider using Google's Custom Search API instead, which keeps you in conformance with their TOS.
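As a rough sketch of what a Custom Search request could look like, the snippet below only builds the request URL. The endpoint and the `key`, `cx`, and `q` parameter names are given as I recall them from the Custom Search JSON API, so please verify them against the current documentation; the class `CustomSearchUrl`, the method `buildUrl`, and the `YOUR_API_KEY` / `YOUR_ENGINE_ID` values are hypothetical placeholders.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class CustomSearchUrl {
    // Endpoint as recalled from the Custom Search JSON API docs (assumption; verify).
    static final String ENDPOINT = "https://www.googleapis.com/customsearch/v1";

    // Builds a request URL from an API key, a search engine ID, and a query.
    static String buildUrl(String apiKey, String engineId, String query) {
        return ENDPOINT
                + "?key=" + encode(apiKey)
                + "&cx=" + encode(engineId)
                + "&q=" + encode(query);
    }

    // Percent-encodes a query parameter value as UTF-8.
    static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    public static void main(String[] args) {
        // Placeholder credentials; replace with your own values.
        System.out.println(buildUrl("YOUR_API_KEY", "YOUR_ENGINE_ID", "\"obama\""));
    }
}
```

The response is returned by the API directly, so there is no need to run crawler4j against the search results page at all.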

edited Sep 13 '16 at 07:49

answered Sep 13 '16 at 03:59


I added the jar file again and tested with the example URL. It works. I also tested with another URL, and that works too. I guess the problem is related to the URL (the Google link is not working). – evabb Sep 13 '16 at 06:55
As for crawling Google search results, I don't think you're allowed to do so, judging by [Google's robots.txt](https://www.google.com/robots.txt). Do look up their TOS and policies to check if it's legal to do so. – Samuel Kok Sep 13 '16 at 07:32
Yes, I read Google's robots.txt. It says that it blocks crawlers. I tried Yahoo search and that is not working either. I also tried some news websites whose search is provided by Google or Yahoo, and they are not working either. For my project, I want to crawl a specific set of items, such as 100 "Obama" news articles. I tried crawling news search result pages as well as the Google and Yahoo news search engines, all without success. Any ideas? – evabb Sep 13 '16 at 08:16
I'd suggest you either look for news sources where crawling is permitted or look into crawling feeds or tweets instead. Hope this helps. – Samuel Kok Sep 13 '16 at 10:43
`crawler4j` respects crawler ethics, so Samuel Kok is right. – rzo1 Sep 19 '16 at 11:13
