Questions tagged [crawler4j]

Crawler4j is an open source Java crawler which provides a simple interface for crawling the Web.

Reference: https://github.com/yasserg/crawler4j

174 questions
1 vote · 1 answer

Crawling from a static IP recognized as a robot

I have a problem. My web crawler runs correctly from home and from the university, even though the pages I need are under /pgol/ and the robots.txt says this: # File controlled by PUPPET: do not modify!!! # /robots.txt file for…
Baldo
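When a site serves a stricter robots.txt to a datacenter IP, crawler4j's robots.txt handling is the first thing to inspect. Below is a minimal sketch of how that handling is wired up, assuming crawler4j 4.x; the storage path, seed URL, and `MyCrawler` class are placeholders, not from the question:

```java
import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;

public class RobotsAwareController {
    public static void main(String[] args) throws Exception {
        CrawlConfig config = new CrawlConfig();
        config.setCrawlStorageFolder("/tmp/crawl");      // hypothetical storage path

        PageFetcher pageFetcher = new PageFetcher(config);

        // crawler4j checks robots.txt per URL; the agent name below is what
        // gets matched against the robots.txt rules the server returns.
        RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
        robotstxtConfig.setUserAgentName("crawler4j");
        // Only with the site owner's consent, robots.txt handling can be disabled:
        // robotstxtConfig.setEnabled(false);
        RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);

        CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);
        controller.addSeed("http://www.example.com/pgol/");  // hypothetical seed
        controller.start(MyCrawler.class, 1);                // MyCrawler: an assumed WebCrawler subclass
    }
}
```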
1 vote · 0 answers

Using Crawler4j to print an ArrayList to an HTML file?

Basics of this program: it runs a web crawler based on the PerentUrl and Keyword specified by the user in Controller (main). If the Keyword is found in the page text, the URL is saved to an array list: ArrayList UrlHits = new ArrayList(); Once the…
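The HTML-report half of this question is plain Java, independent of crawler4j. A sketch of one way to render the collected list as an HTML page after the crawl finishes (class and file names are illustrative):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

public class UrlReportWriter {

    // Render a list of matched URLs as a minimal HTML page, one link per entry.
    public static String toHtml(List<String> urls) {
        StringBuilder sb = new StringBuilder("<html><body><ul>\n");
        for (String url : urls) {
            sb.append("<li><a href=\"").append(url).append("\">")
              .append(url).append("</a></li>\n");
        }
        return sb.append("</ul></body></html>\n").toString();
    }

    // Write the report to disk; call this once after controller.start() returns.
    public static void writeReport(List<String> urlHits, Path target) throws IOException {
        Files.writeString(target, toHtml(urlHits));
    }

    public static void main(String[] args) throws IOException {
        writeReport(List.of("http://example.com/a", "http://example.com/b"),
                    Path.of("urlhits.html"));
    }
}
```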
1 vote · 1 answer

How do I reduce/change the delay after crawling?

Does anybody have experience with Crawler4j? I followed the example from the project page to build my own crawler. The crawler works fine and crawls very fast. The only thing is that I always have a delay of 20–30 seconds. Is there a way to…
user3411187
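The pause at the end of a crawl usually comes from crawler4j's thread-monitoring and shutdown loops rather than from fetching. A hedged sketch of the relevant knobs, assuming a recent crawler4j (4.3+) that exposes these setters; the values are illustrative:

```java
CrawlConfig config = new CrawlConfig();
// Time crawler4j waits between requests to the same host (politeness), in ms.
config.setPolitenessDelay(200);
// The 20-30 s wait after the last page is mostly the controller's monitoring
// loop; in newer versions its intervals are configurable (in seconds):
config.setThreadMonitoringDelaySeconds(1);
config.setThreadShutdownDelaySeconds(1);
config.setCleanupDelaySeconds(1);
```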
1 vote · 1 answer

Crawler4j shows different URL names in shouldVisit() and visit() method

I am using crawler4j to crawl a website. The website has certain parameters at the end of a few URLs, e.g. http://www.abcd.com/xyz/?pqrs. When the shouldVisit() method is called for such a URL I get the WebURL as http://www.abcd.com/xyz/?pqrs but…
working
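When shouldVisit() and visit() report different spellings of the same address, comparing raw strings is fragile. One plain-Java workaround is to reduce each URL to a canonical key before comparing; a sketch (the class and method names are made up for illustration):

```java
public class UrlKeys {

    // Reduce a URL to a comparable key by dropping the query string and
    // fragment, so the address seen in shouldVisit() and the one seen in
    // visit() line up even if one carries trailing ?parameters.
    public static String canonicalKey(String url) {
        int cut = url.length();
        int q = url.indexOf('?');
        if (q >= 0) cut = q;
        int h = url.indexOf('#');
        if (h >= 0 && h < cut) cut = h;
        return url.substring(0, cut);
    }
}
```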
1 vote · 2 answers

Crawler4j missing outgoing links?

I'm trying to crawl the Apache mailing lists to get all the archived messages using Crawler4j. I provided a seed URL and am trying to get links to the other messages. However, it does not seem to extract all the links. Following is the HTML of my…
Pradeep Gollakota
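Silently missing links on link-heavy archive pages is often a limits question rather than a parsing bug. A hedged config sketch of the crawler4j limits worth checking (values illustrative):

```java
CrawlConfig config = new CrawlConfig();
// crawler4j stops extracting links from a page after this many; the default
// (5000) can silently drop links on very link-heavy archive index pages.
config.setMaxOutgoingLinksToFollow(20000);
// A low depth limit can also hide links that are only reachable further down.
config.setMaxDepthOfCrawling(-1);  // -1 = unlimited
```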
1 vote · 0 answers

Does a different User-Agent string in a request to a webshop change the content of the webshop's answer?

We want to create a Java crawler (crawler4j) that uses many product EANs to collect information such as price, picture, and description of products from some defined webshops, in cooperation with the hosts of the webshops. This information should be…
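Servers can and do vary their response by User-Agent, so pinning the header to an agreed value is the usual approach. In crawler4j the header is set once on the config; the string below is a hypothetical example, not a recommendation:

```java
CrawlConfig config = new CrawlConfig();
// The User-Agent header sent with every request. Shops may serve different
// markup (or block the request) depending on this string, so agree on a
// fixed value with the shop operators before crawling.
config.setUserAgentString("myshop-crawler/1.0 (+http://example.com/bot)");  // hypothetical value
```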
1 vote · 1 answer

crawler4j prints an enormous stack of system output

I started using Crawler4j and played around with the BasicCrawler example for a while. I deleted all output from the BasicCrawler.visit() method. Then I added some URL processing I already had. When I start the program now, it suddenly prints an…
user2509422
1 vote · 2 answers

Restricting URLs to seed URL domains only in crawler4j

I want crawler4j to visit pages only if they belong to one of the seed domains. There are multiple domains in the seed list. How can I do it? Suppose I am adding the seed URLs: www.google.com www.yahoo.com www.wikipedia.com Now I am starting the crawling…
akshayb
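The multi-seed restriction boils down to a host check that a shouldVisit() override can delegate to. A plain-Java sketch of that check (class name and structure are illustrative, not crawler4j API):

```java
import java.net.URI;
import java.util.Set;

public class SeedDomainFilter {

    private final Set<String> seedHosts;

    // seedHosts: the host part of every seed URL, collected before the crawl.
    public SeedDomainFilter(Set<String> seedHosts) {
        this.seedHosts = seedHosts;
    }

    // Extract the host of a URL, or null if it cannot be parsed.
    static String hostOf(String url) {
        try {
            return URI.create(url).getHost();
        } catch (IllegalArgumentException e) {
            return null;
        }
    }

    // True when the URL's host matches one of the seed hosts; this is the
    // check a shouldVisit() override would delegate to.
    public boolean allowed(String url) {
        String host = hostOf(url);
        return host != null && seedHosts.contains(host);
    }
}
```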
1 vote · 1 answer

crawler4j recrawling a website not working

I am using the crawler4j library to crawl some websites, but I have a problem when I run the process twice. It only works the first time. The second time it doesn't give any error, but it does nothing. I think the library is saving the URLs…
Hibernator
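The "second run does nothing" symptom fits crawler4j remembering already-seen URLs in its storage folder. A hedged sketch of the two usual fixes; the storage path is a made-up example, and a fresh CrawlController must be created for each run either way:

```java
CrawlConfig config = new CrawlConfig();
// crawler4j keeps its frontier (already-seen URLs) in the storage folder.
// With resumable crawling enabled, a second run treats every URL as visited.
config.setResumableCrawling(false);
// Alternatively, point each run at a fresh storage folder (or delete the old
// one before re-running):
config.setCrawlStorageFolder("/tmp/crawl-" + System.currentTimeMillis());  // hypothetical path
```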
1 vote · 0 answers

Dynamically adding seeds from a database in Crawler4J

I am trying to read a list of seed URLs from a CSV file and load them into the crawl controller using the code below: public class BasicCrawlController { public static void main(String[] args) throws Exception { ArrayList
thotheolh
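The CSV-reading half of this is independent of crawler4j: parse the file into a flat list of URLs first, then hand each one to the controller. A plain-Java sketch of the parsing step (class and method names are illustrative):

```java
import java.util.ArrayList;
import java.util.List;

public class SeedCsvParser {

    // Take raw CSV lines (one or more URLs per line, comma-separated) and
    // return a flat, trimmed list of seed URLs, skipping blank fields.
    // Each entry would then be handed to controller.addSeed(url) before start().
    public static List<String> parseSeeds(List<String> csvLines) {
        List<String> seeds = new ArrayList<>();
        for (String line : csvLines) {
            for (String field : line.split(",")) {
                String url = field.trim();
                if (!url.isEmpty()) {
                    seeds.add(url);
                }
            }
        }
        return seeds;
    }
}
```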
1 vote · 2 answers

crawler4j compile error with class CrawlConfig - VariableDeclaratorId Expected

The code will not compile. I changed the JRE to 1.7. The compiler does not highlight the class in Eclipse, and CrawlConfig appears to fail in the compiler. The class should be run from the command line in Linux. Any ideas? Compiler error…
Trevor Oakley
1 vote · 1 answer

Crawler4j visits only seed URLs

I'm using crawler4j to crawl the rottentomatoes website to extract structured data. I have set everything up, and with the default URLs given in the example on the project home page everything works, but when I put in my own seeds, the application only visits the URLs that I…
Vuk Stanković
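When only seeds get visited, the usual culprit is a shouldVisit() that rejects everything beyond them. A hedged sketch of a permissive override, assuming the crawler4j 4.x signature shouldVisit(Page, WebURL); the class name, filter pattern, and site prefix are illustrative:

```java
import java.util.regex.Pattern;
import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.url.WebURL;

public class MyCrawler extends WebCrawler {

    private static final Pattern BINARY =
            Pattern.compile(".*\\.(css|js|gif|jpe?g|png|pdf|zip)$");

    @Override
    public boolean shouldVisit(Page referringPage, WebURL url) {
        String href = url.getURL().toLowerCase();
        // Reject only binary resources; everything under the site prefix is
        // allowed. An overly specific prefix here is the usual reason only
        // the seeds themselves get visited.
        return !BINARY.matcher(href).matches()
                && href.startsWith("http://www.rottentomatoes.com/");
    }

    @Override
    public void visit(Page page) {
        logger.info("Visited: {}", page.getWebURL().getURL());
    }
}
```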
1 vote · 1 answer

What does StatisticsDB do in the Crawler4j open source code?

I am trying to understand the Crawler4j open source web crawler. Meanwhile I have some questions, which are as follows. Questions: What does StatisticsDB do in the Counters class? And please explain the following code part: public…
devsda
1 vote · 1 answer

How to tell if a URL is 404 or 301 in crawler4j

Is it possible to tell whether a URL returned 404 or 301 in crawler4j? @Override public void visit(Page page) { String url = page.getWebURL().getURL(); System.out.println("URL: " + url); if (page.getParseData() instanceof…
Kathick
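404 and 301 responses never reach visit(), which only sees successfully fetched pages. A hedged sketch of the hook that does see them, assuming crawler4j 4.x where WebCrawler exposes handlePageStatusCode(); the class name is illustrative:

```java
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.url.WebURL;

public class StatusAwareCrawler extends WebCrawler {

    // Called for every fetched URL before visit(); 404s and 301s pass
    // through here with their status code even though visit() skips them.
    @Override
    protected void handlePageStatusCode(WebURL webUrl, int statusCode,
                                        String statusDescription) {
        if (statusCode == 404 || statusCode == 301) {
            logger.info("{} returned {} ({})",
                        webUrl.getURL(), statusCode, statusDescription);
        }
    }
}
```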
1 vote · 0 answers

Why is my programmatically pulled webpage different from what I see in the browser?

I am using crawler4j to pull some data from the Google Play store (HTTPS pages). However, I checked my downloaded HTML content and found that it is slightly different from the page source I see in the browser. Why? Is it because Google detected that I…
andrew