Questions tagged [crawler4j]

Crawler4j is an open source Java crawler which provides a simple interface for crawling the Web.

Reference: https://github.com/yasserg/crawler4j

174 questions
1 vote • 1 answer

start vs startNonBlocking in crawler4j

I am using the crawler4j library and its dependencies to crawl pages. What is the difference between controller.start(BasicCrawler.class, numberOfCrawlers); and controller.startNonBlocking(BasicCrawler.class, numberOfCrawlers);?
Selva • 546 • 5 • 12 • 34
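The short answer, per the crawler4j API: start() blocks the calling thread until the whole crawl finishes, while startNonBlocking() returns immediately so the caller can do other work and later call waitUntilFinish(). A wiring sketch under those assumptions (class names are crawler4j's own; BasicCrawler and numberOfCrawlers come from the question; doOtherWorkWhileCrawling() is a hypothetical placeholder for your code):

```java
// Standard crawler4j setup (storage folder path is only an example)
CrawlConfig config = new CrawlConfig();
config.setCrawlStorageFolder("/tmp/crawl");
PageFetcher fetcher = new PageFetcher(config);
RobotstxtServer robots = new RobotstxtServer(new RobotstxtConfig(), fetcher);
CrawlController controller = new CrawlController(config, fetcher, robots);
controller.addSeed("http://example.com/");

// Variant 1: start() does not return until every crawler thread is done
controller.start(BasicCrawler.class, numberOfCrawlers);

// Variant 2 (pick one variant, not both): startNonBlocking() returns at once
controller.startNonBlocking(BasicCrawler.class, numberOfCrawlers);
doOtherWorkWhileCrawling();   // the calling thread keeps running meanwhile
controller.waitUntilFinish(); // block here only when you are ready to
```

In other words, Variant 1 is Variant 2 with the waiting done for you immediately.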
1 vote • 1 answer

crawler4j does not recognize all links on a page

Basically I am facing a problem where crawler4j does not recognize all the links on a page. Say, for example, there are 5 links on the page; only 3 of them get recognized and fetched, while the remaining 2 are not even recognized. What is the expected…
1 vote • 1 answer

Can I add an https URL as my seed with Crawler4j?

I have to crawl an SSL website using crawler4j-4.1.jar and its dependencies. Can I add an https URL as my first seed?
Rache • 217 • 1 • 7 • 20
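crawler4j accepts https seeds the same way as http ones; the detail that usually trips people up is a shouldVisit() override that only whitelists the http:// scheme. A sketch, assuming the crawler4j 4.x shouldVisit(Page, WebURL) signature and a hypothetical example.com target:

```java
// https seeds are added exactly like http ones
controller.addSeed("https://example.com/");

// make sure the filter does not silently drop the https scheme
@Override
public boolean shouldVisit(Page referringPage, WebURL url) {
    String href = url.getURL().toLowerCase();
    return href.startsWith("https://example.com/");
}
```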
1 vote • 1 answer

Crawler4j keeps blocking after crawl

I am using Crawler4j to simply get the HTML from the crawled pages. It successfully stores the retrieved HTML for my test site of about 50 pages. It uses the shouldVisit method I implemented, and it uses the visit method I implemented. These both…
Indigenuity • 9,332 • 6 • 39 • 68
1 vote • 1 answer

Can Crawler4j interpret wildcards using asterisks (*) in robots.txt?

I want to be able to block web crawlers from accessing pages other than page1. The following should block all directories/file names containing the word "page", so something like /localhost/myApp/page2.xhtml should be blocked. #Disallow:…
Andy T • 136 • 11
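Whether a given crawler4j version expands * in robots.txt depends on its robots parser, so it is worth testing the pattern logic independently of the library. Under the common convention (as in Google's robots.txt handling), * matches any character sequence, $ anchors the end of the path, and every rule is otherwise a prefix match. A self-contained matcher built on those assumptions:

```java
public class RobotsWildcard {
    /** Turn a robots.txt path pattern (with * and $) into a regex and test a path. */
    static boolean matches(String pattern, String path) {
        boolean anchored = pattern.endsWith("$");
        String body = anchored ? pattern.substring(0, pattern.length() - 1) : pattern;
        StringBuilder re = new StringBuilder();
        for (char c : body.toCharArray()) {
            if (c == '*') re.append(".*");                               // wildcard
            else re.append(java.util.regex.Pattern.quote(String.valueOf(c)));
        }
        if (!anchored) re.append(".*");  // robots rules are prefix matches by default
        return path.matches(re.toString());
    }

    public static void main(String[] args) {
        System.out.println(matches("/myApp/page*", "/myApp/page2.xhtml")); // true
        System.out.println(matches("/myApp/page*", "/myApp/index.html"));  // false
    }
}
```

If crawler4j's own parser turns out to ignore wildcards, a check like this can be applied manually inside shouldVisit().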
1 vote • 1 answer

Getting all iframes and base64 codes present in HTML pages using crawler4j

I am using crawler4j to crawl some websites and it is working fine. I am able to download all the files present on a website, and now I have a new task ahead of me. I need to extract iframe, base64 and other embedded codes as well, if possible! Till…
Sudhir kumar • 549 • 2 • 8 • 31
1 vote • 1 answer

Permission for an external jar to create files under Tomcat

I have a problem in my application. It obtains data from websites through Crawler4j and needs to create some directories and files to manipulate the data, but Tomcat doesn't grant the permissions. The error is like this: Couldn't create this folder:…
Marcelo • 63 • 6
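Under Tomcat the JVM usually runs as a low-privilege user whose working directory is not writable, so relative paths such as a crawl-storage folder fail to be created. A small self-contained check that distinguishes "parent not writable" from other mkdirs() failures (the path under java.io.tmpdir is only an example; in production you would point at a directory the Tomcat user owns):

```java
import java.io.File;

public class DirCheck {
    /** Try to create a directory tree and report why it failed, if it did. */
    static String ensureDir(File dir) {
        if (dir.isDirectory()) return "exists";
        if (dir.mkdirs()) return "created";
        // mkdirs() returns false for both permission problems and races;
        // inspecting the nearest existing parent narrows the cause down
        File parent = dir.getParentFile();
        while (parent != null && !parent.exists()) parent = parent.getParentFile();
        if (parent != null && !parent.canWrite()) {
            return "no write permission on " + parent;
        }
        return "could not create " + dir;
    }

    public static void main(String[] args) {
        File dir = new File(System.getProperty("java.io.tmpdir"), "crawler-data/frontier");
        System.out.println(ensureDir(dir));
    }
}
```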
1 vote • 0 answers

crawler4j not working when used with TimerTask

We have been trying to use the crawler so that we can crawl a particular website at a certain interval. For this we have been trying to run the crawler from a timer. But after the first successful crawl using the timer, it always says in the…
redjohn • 81 • 1 • 5
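A pattern that often resolves this symptom is to rebuild the entire crawler4j object graph inside each timer tick, on the assumption (worth verifying against your version) that a CrawlController which has completed a crawl is not designed to be restarted. The skeleton below is runnable as-is; the crawl itself is left as comments so the scheduling logic stays self-contained:

```java
import java.util.Timer;
import java.util.TimerTask;

public class ScheduledCrawl {
    /** One crawl pass; everything crawler4j-related is rebuilt from scratch here. */
    static String runOnePass() {
        // Construct a NEW CrawlConfig / PageFetcher / RobotstxtServer /
        // CrawlController on every call, add the seeds again, then run
        // controller.start(MyCrawler.class, n) -- blocking is fine inside the task.
        return "crawl pass at " + System.currentTimeMillis();
    }

    public static void main(String[] args) throws InterruptedException {
        Timer timer = new Timer("crawl-timer");
        timer.schedule(new TimerTask() {
            @Override public void run() {
                System.out.println(runOnePass());
            }
        }, 0, 60_000);      // first pass immediately, then every 60 s

        Thread.sleep(200);  // keep the demo alive long enough for the first pass
        timer.cancel();
    }
}
```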
1 vote • 1 answer

crawler4j: website bans my IP address for a few minutes after 20-30 seconds of crawling

I've been trying to crawl a website at mystore411.com using the open source crawler4j. The crawler works fine for a limited period of time (say 20-30 seconds) and then the website bans my address for a few minutes before I can crawl again. I couldn't figure out a…
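A ban after a burst of requests usually means the crawl rate triggered the site's throttling. crawler4j exposes a politeness delay on CrawlConfig; a configuration fragment that slows the crawler down (the folder path and the exact numbers are examples to tune, not recommendations):

```java
// Politeness settings on crawler4j's CrawlConfig
CrawlConfig config = new CrawlConfig();
config.setCrawlStorageFolder("/tmp/crawl");  // any writable path
config.setPolitenessDelay(2000);             // wait >= 2 s between requests
config.setMaxPagesToFetch(500);              // cap a single crawl pass
// Also consider passing 1 as the crawler-thread count to controller.start(...):
// a single thread makes the request rate to one host predictable.
```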
1 vote • 1 answer

Improving Crawler4j crawler efficiency and scalability

I am using the Crawler4j crawler to crawl some domains. Now I want to improve the efficiency of the crawler: I want it to use my full bandwidth and crawl as many URLs as possible in a given time period. For that I am taking the following…
Sudhir kumar • 549 • 2 • 8 • 31
1 vote • 1 answer

Web crawler with incremental crawling support for Windows

I need an open source web crawler developed in Java with incremental crawling support. The web crawler should be easy to customize and integrate with Solr or Elasticsearch. It should be an active project that is being developed further with more…
Kumar • 3,782 • 4 • 39 • 87
1 vote • 2 answers

Crawl URLs with a certain prefix

I would like crawler4j to crawl only certain URLs, ones that have a certain prefix. So, for example, if a URL starts with http://url1.com/timer/image it is valid, e.g. http://url1.com/timer/image/text.php. This URL is not valid:…
Carol.Kar • 4,581 • 36 • 131 • 264
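The usual place for this filter is the crawler's shouldVisit() override. The check itself is plain string logic, so it can be written and tested independently; a self-contained sketch (the prefix is taken from the question, and in crawler4j you would return this from shouldVisit(Page, WebURL) using url.getURL()):

```java
public class PrefixFilter {
    // prefixes taken from the question; extend the array as needed
    private static final String[] ALLOWED_PREFIXES = {
        "http://url1.com/timer/image"
    };

    /** Mirror of the check a WebCrawler.shouldVisit override would perform. */
    static boolean shouldVisit(String url) {
        String lower = url.toLowerCase();
        for (String prefix : ALLOWED_PREFIXES) {
            if (lower.startsWith(prefix)) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(shouldVisit("http://url1.com/timer/image/text.php")); // true
        System.out.println(shouldVisit("http://url1.com/other/page.html"));      // false
    }
}
```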
1 vote • 2 answers

Crawl a list of sites using Crawler4j

I have a problem loading a list of links; these links should be passed to controller.addSeed in a loop. Here is the code: SelectorString selector = new SelectorString(); List lista = new ArrayList<>(); lista = selector.leggiFile(); String…
Justin • 1,149 • 2 • 19 • 35
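Loading the seed list and feeding it to the controller are separable steps, which makes the file-reading half easy to get right on its own. A self-contained sketch (the file format — one URL per line, blanks and # comments ignored — is an assumption; the addSeed call is shown as a comment because it needs a live CrawlController):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

public class SeedLoader {
    /** Read one seed URL per line, skipping blank lines and # comments. */
    static List<String> readSeeds(Path file) throws IOException {
        List<String> seeds = new ArrayList<>();
        for (String line : Files.readAllLines(file)) {
            String s = line.trim();
            if (!s.isEmpty() && !s.startsWith("#")) seeds.add(s);
        }
        return seeds;
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("seeds", ".txt");
        Files.write(file, List.of("http://example.com/", "", "# comment", "http://example.org/"));
        for (String seed : readSeeds(file)) {
            // with a CrawlController in scope: controller.addSeed(seed);
            System.out.println(seed);
        }
    }
}
```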
1 vote • 0 answers

Pass values from visit() to handlePageStatusCode()

I am working on a Groovy and Grails project. I have a requirement where I need to pass some values from visit() to handlePageStatusCode() in crawler4j. The two methods are inside the class src/groovy/BasicCrawler.groovy. I cannot change the…
clever_bassi • 2,392 • 2 • 24 • 43
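Since both methods live on the same crawler class but are invoked by the framework, the usual way to share data between them is thread-safe state on that class rather than parameter passing. One caveat to verify: crawler4j generally calls handlePageStatusCode() when a fetch completes, before visit() runs for that same page, so state written in visit() is typically only observable for pages processed later. A pure-Java sketch of the shared-state pattern (onVisit/onStatus are stand-ins for the real callbacks):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Sketch: per-URL scratch space shared by the callbacks of one crawler class. */
public class SharedCrawlState {
    // keyed by URL so concurrent crawler threads don't clobber each other
    static final Map<String, String> notes = new ConcurrentHashMap<>();

    // stands in for work done inside visit(Page page)
    static void onVisit(String url) {
        notes.put(url, "seen-in-visit");
    }

    // stands in for handlePageStatusCode(WebURL url, int statusCode, String desc)
    static String onStatus(String url) {
        return notes.getOrDefault(url, "no-data");
    }

    public static void main(String[] args) {
        onVisit("http://example.com/a");
        System.out.println(onStatus("http://example.com/a")); // seen-in-visit
        System.out.println(onStatus("http://example.com/b")); // no-data
    }
}
```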
1 vote • 1 answer

Get mp3/pdf files using JSoup in Groovy

I am developing an application for crawling the web using crawler4j and Jsoup. I need to parse a webpage using Jsoup and check if it has zip files, pdf/doc and mp3/mov files available as resources for download. For zip files I did the following and…
clever_bassi • 2,392 • 2 • 24 • 43
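Once Jsoup has collected the candidate links (e.g. iterate doc.select("a[href]") and take absUrl("href") for each element), the per-type check reduces to inspecting the URL's extension. A self-contained helper for that last step (the extension set matches the question's file types; query strings and fragments are stripped first, which is an assumption about how the links are formed):

```java
import java.util.Locale;
import java.util.Set;

public class DownloadableCheck {
    private static final Set<String> EXTS = Set.of("zip", "pdf", "doc", "mp3", "mov");

    /** True if the URL path ends in one of the target extensions. */
    static boolean isDownloadable(String url) {
        // strip query string and fragment before looking at the extension
        String path = url.split("[?#]", 2)[0].toLowerCase(Locale.ROOT);
        int dot = path.lastIndexOf('.');
        return dot >= 0 && EXTS.contains(path.substring(dot + 1));
    }

    public static void main(String[] args) {
        System.out.println(isDownloadable("http://host/a/song.mp3"));  // true
        System.out.println(isDownloadable("http://host/page.html"));   // false
    }
}
```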