Questions tagged [crawler4j]

Crawler4j is an open source Java crawler which provides a simple interface for crawling the Web.

Reference: https://github.com/yasserg/crawler4j

174 questions
1 vote · 1 answer

Crawling from a static IP recognized as a robot

I have a problem. My web crawler runs correctly from home and from the university, even though the pages I need are under /pgol/ and the robots.txt says this: # File controlled by PUPPET: do not modify!!! # /robots.txt file for…
Baldo
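When a site serves a stricter robots.txt to a datacenter IP, crawler4j's robots.txt handling is the first thing to inspect. Below is a minimal sketch of how that handling is wired up, assuming crawler4j 4.x; the storage path, seed URL, and `MyCrawler` class are placeholders, not from the question:

```java
import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;

public class RobotsAwareController {
    public static void main(String[] args) throws Exception {
        CrawlConfig config = new CrawlConfig();
        config.setCrawlStorageFolder("/tmp/crawl");      // hypothetical storage path

        PageFetcher pageFetcher = new PageFetcher(config);

        // crawler4j checks robots.txt per URL; the agent name below is what
        // gets matched against the robots.txt rules the server returns.
        RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
        robotstxtConfig.setUserAgentName("crawler4j");
        // Only with the site owner's consent, robots.txt handling can be disabled:
        // robotstxtConfig.setEnabled(false);
        RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);

        CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);
        controller.addSeed("http://www.example.com/pgol/");  // hypothetical seed
        controller.start(MyCrawler.class, 1);                // MyCrawler: an assumed WebCrawler subclass
    }
}
```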
1 vote · 0 answers

Using Crawler4j to print an ArrayList to an HTML file?

Basics of this program: it runs a web crawler based on the PerentUrl and Keyword specified by the user in Controller (main). If the Keyword is found in the page text, the URL is saved to an array list: ArrayList UrlHits = new ArrayList(); Once the…
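The HTML-report half of this question is plain Java, independent of crawler4j. A sketch of one way to render the collected list as an HTML page after the crawl finishes (class and file names are illustrative):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

public class UrlReportWriter {

    // Render a list of matched URLs as a minimal HTML page, one link per entry.
    public static String toHtml(List<String> urls) {
        StringBuilder sb = new StringBuilder("<html><body><ul>\n");
        for (String url : urls) {
            sb.append("<li><a href=\"").append(url).append("\">")
              .append(url).append("</a></li>\n");
        }
        return sb.append("</ul></body></html>\n").toString();
    }

    // Write the report to disk; call this once after controller.start() returns.
    public static void writeReport(List<String> urlHits, Path target) throws IOException {
        Files.writeString(target, toHtml(urlHits));
    }

    public static void main(String[] args) throws IOException {
        writeReport(List.of("http://example.com/a", "http://example.com/b"),
                    Path.of("urlhits.html"));
    }
}
```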
1 vote · 1 answer

How do I reduce/change the delay after crawling?

Does anybody have experience with Crawler4j? I followed the example from the project page to build my own crawler. The crawler works fine and crawls very fast. The only thing is that I always have a delay of 20–30 seconds. Is there a way to…
user3411187
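The pause at the end of a crawl usually comes from crawler4j's thread-monitoring and shutdown loops rather than from fetching. A hedged sketch of the relevant knobs, assuming a recent crawler4j (4.3+) that exposes these setters; the values are illustrative:

```java
CrawlConfig config = new CrawlConfig();
// Time crawler4j waits between requests to the same host (politeness), in ms.
config.setPolitenessDelay(200);
// The 20-30 s wait after the last page is mostly the controller's monitoring
// loop; in newer versions its intervals are configurable (in seconds):
config.setThreadMonitoringDelaySeconds(1);
config.setThreadShutdownDelaySeconds(1);
config.setCleanupDelaySeconds(1);
```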
1 vote · 1 answer

Crawler4j shows different URL names in shouldVisit() and visit() method

I am using crawler4j to crawl a website. The website has certain parameters at the end of a few URLs, e.g. http://www.abcd.com/xyz/?pqrs. When the shouldVisit() method is called for such a URL I get the WebURL as http://www.abcd.com/xyz/?pqrs but…
working
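When shouldVisit() and visit() report different spellings of the same address, comparing raw strings is fragile. One plain-Java workaround is to reduce each URL to a canonical key before comparing; a sketch (the class and method names are made up for illustration):

```java
public class UrlKeys {

    // Reduce a URL to a comparable key by dropping the query string and
    // fragment, so the address seen in shouldVisit() and the one seen in
    // visit() line up even if one carries trailing ?parameters.
    public static String canonicalKey(String url) {
        int cut = url.length();
        int q = url.indexOf('?');
        if (q >= 0) cut = q;
        int h = url.indexOf('#');
        if (h >= 0 && h < cut) cut = h;
        return url.substring(0, cut);
    }
}
```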
1 vote · 2 answers

Crawler4j missing outgoing links?

I'm trying to crawl the Apache mailing lists to get all the archived messages using Crawler4j. I provided a seed URL and am trying to get links to the other messages. However, it does not seem to extract all the links. Following is the HTML of my…
Pradeep Gollakota
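Silently missing links on link-heavy archive pages is often a limits question rather than a parsing bug. A hedged config sketch of the crawler4j limits worth checking (values illustrative):

```java
CrawlConfig config = new CrawlConfig();
// crawler4j stops extracting links from a page after this many; the default
// (5000) can silently drop links on very link-heavy archive index pages.
config.setMaxOutgoingLinksToFollow(20000);
// A low depth limit can also hide links that are only reachable further down.
config.setMaxDepthOfCrawling(-1);  // -1 = unlimited
```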
1 vote · 0 answers

Does a different User-Agent string in a request to a webshop change the content of the webshop's answer?

We want to create a Java crawler (crawler4j) that uses many product EANs to collect information such as price, picture, and description of products from some defined webshops, in cooperation with the hosts of the webshops. This information should be…
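Servers can and do vary their response by User-Agent, so pinning the header to an agreed value is the usual approach. In crawler4j the header is set once on the config; the string below is a hypothetical example, not a recommendation:

```java
CrawlConfig config = new CrawlConfig();
// The User-Agent header sent with every request. Shops may serve different
// markup (or block the request) depending on this string, so agree on a
// fixed value with the shop operators before crawling.
config.setUserAgentString("myshop-crawler/1.0 (+http://example.com/bot)");  // hypothetical value
```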
1 vote · 1 answer

crawler4j prints an enormous stack of system output

I started using Crawler4j and played around with the BasicCrawler example for a while. I deleted all output from the BasicCrawler.visit() method. Then I added some URL processing I already had. When I start the program now, it suddenly prints an…
user2509422
1 vote · 2 answers

Restricting URLs to seed URL domains only in crawler4j

I want crawler4j to visit pages only if they belong to one of the seed domains. There are multiple domains in the seed list. How can I do it? Suppose I am adding the seed URLs: www.google.com www.yahoo.com www.wikipedia.com Now I am starting the crawling…
akshayb
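The multi-seed restriction boils down to a host check that a shouldVisit() override can delegate to. A plain-Java sketch of that check (class name and structure are illustrative, not crawler4j API):

```java
import java.net.URI;
import java.util.Set;

public class SeedDomainFilter {

    private final Set<String> seedHosts;

    // seedHosts: the host part of every seed URL, collected before the crawl.
    public SeedDomainFilter(Set<String> seedHosts) {
        this.seedHosts = seedHosts;
    }

    // Extract the host of a URL, or null if it cannot be parsed.
    static String hostOf(String url) {
        try {
            return URI.create(url).getHost();
        } catch (IllegalArgumentException e) {
            return null;
        }
    }

    // True when the URL's host matches one of the seed hosts; this is the
    // check a shouldVisit() override would delegate to.
    public boolean allowed(String url) {
        String host = hostOf(url);
        return host != null && seedHosts.contains(host);
    }
}
```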
1 vote · 1 answer

crawler4j recrawling a website not working

I am using the crawler4j library to crawl some websites, but I have a problem when I run the process twice. It only works the first time. The second time it doesn't give any error, but it does nothing. I think the library is saving the URLs…
Hibernator
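The "second run does nothing" symptom fits crawler4j remembering already-seen URLs in its storage folder. A hedged sketch of the two usual fixes; the storage path is a made-up example, and a fresh CrawlController must be created for each run either way:

```java
CrawlConfig config = new CrawlConfig();
// crawler4j keeps its frontier (already-seen URLs) in the storage folder.
// With resumable crawling enabled, a second run treats every URL as visited.
config.setResumableCrawling(false);
// Alternatively, point each run at a fresh storage folder (or delete the old
// one before re-running):
config.setCrawlStorageFolder("/tmp/crawl-" + System.currentTimeMillis());  // hypothetical path
```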
1 vote · 0 answers

Dynamically adding seeds from a database in Crawler4J

I am trying to read a list of seed URLs from a CSV file and load them into the crawl controller using the code below: public class BasicCrawlController { public static void main(String[] args) throws Exception { ArrayList
thotheolh
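The CSV-reading half of this is independent of crawler4j: parse the file into a flat list of URLs first, then hand each one to the controller. A plain-Java sketch of the parsing step (class and method names are illustrative):

```java
import java.util.ArrayList;
import java.util.List;

public class SeedCsvParser {

    // Take raw CSV lines (one or more URLs per line, comma-separated) and
    // return a flat, trimmed list of seed URLs, skipping blank fields.
    // Each entry would then be handed to controller.addSeed(url) before start().
    public static List<String> parseSeeds(List<String> csvLines) {
        List<String> seeds = new ArrayList<>();
        for (String line : csvLines) {
            for (String field : line.split(",")) {
                String url = field.trim();
                if (!url.isEmpty()) {
                    seeds.add(url);
                }
            }
        }
        return seeds;
    }
}
```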
1 vote · 2 answers

crawler4j compile error with class CrawlConfig - VariableDeclaratorId Expected

The code will not compile. I changed the JRE to 1.7. The compiler does not highlight the class in Eclipse, and CrawlConfig appears to fail in the compiler. The class should be run from the command line in Linux. Any ideas? Compiler error…
Trevor Oakley
1 vote · 1 answer

Crawler4j visits only seed URLs

I'm using crawler4j to crawl the rottentomatoes website to extract structured data. I have set everything up, and with the default URLs given in the example on the project home page everything works, but when I put in my own seeds, the application only visits the URLs that I…
Vuk Stanković
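When only seeds get visited, the usual culprit is a shouldVisit() that rejects everything beyond them. A hedged sketch of a permissive override, assuming the crawler4j 4.x signature shouldVisit(Page, WebURL); the class name, filter pattern, and site prefix are illustrative:

```java
import java.util.regex.Pattern;
import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.url.WebURL;

public class MyCrawler extends WebCrawler {

    private static final Pattern BINARY =
            Pattern.compile(".*\\.(css|js|gif|jpe?g|png|pdf|zip)$");

    @Override
    public boolean shouldVisit(Page referringPage, WebURL url) {
        String href = url.getURL().toLowerCase();
        // Reject only binary resources; everything under the site prefix is
        // allowed. An overly specific prefix here is the usual reason only
        // the seeds themselves get visited.
        return !BINARY.matcher(href).matches()
                && href.startsWith("http://www.rottentomatoes.com/");
    }

    @Override
    public void visit(Page page) {
        logger.info("Visited: {}", page.getWebURL().getURL());
    }
}
```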
1 vote · 1 answer

What does StatisticsDB do in the Crawler4j open source code?

I am trying to understand the Crawler4j open source web crawler. Meanwhile I have some questions, which are as follows. Questions: What does StatisticsDB do in the Counters class? And please explain the following code part: public…
devsda
1 vote · 1 answer

How to tell if a URL is 404 or 301 in crawler4j

Is it possible to tell whether a URL returned 404 or 301 in crawler4j? @Override public void visit(Page page) { String url = page.getWebURL().getURL(); System.out.println("URL: " + url); if (page.getParseData() instanceof…
Kathick
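404 and 301 responses never reach visit(), which only sees successfully fetched pages. A hedged sketch of the hook that does see them, assuming crawler4j 4.x where WebCrawler exposes handlePageStatusCode(); the class name is illustrative:

```java
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.url.WebURL;

public class StatusAwareCrawler extends WebCrawler {

    // Called for every fetched URL before visit(); 404s and 301s pass
    // through here with their status code even though visit() skips them.
    @Override
    protected void handlePageStatusCode(WebURL webUrl, int statusCode,
                                        String statusDescription) {
        if (statusCode == 404 || statusCode == 301) {
            logger.info("{} returned {} ({})",
                        webUrl.getURL(), statusCode, statusDescription);
        }
    }
}
```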
1 vote · 0 answers

Why is my programmatically pulled webpage different from what I see in the browser?

I am using crawler4j to pull some data from the Google Play store (HTTPS pages). However, I checked my downloaded HTML content and found that it is slightly different from the page source I see in the browser. Why? Is it because Google detected that I…
andrew