Questions tagged [crawler4j]

Crawler4j is an open source Java crawler which provides a simple interface for crawling the Web.

Reference: https://github.com/yasserg/crawler4j

174 questions
1 vote • 1 answer

start vs startNonBlocking in crawler4j

I am using the crawler4j library and its dependencies to crawl pages. What is the difference between controller.start(BasicCrawler.class, numberOfCrawlers); and controller.startNonBlocking(BasicCrawler.class, numberOfCrawlers);?
Selva • 546 • 5 • 12 • 34
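The short answer, per the crawler4j API: start() blocks the calling thread until the whole crawl finishes, while startNonBlocking() returns immediately so the caller can do other work and later call waitUntilFinish(). A wiring sketch under those assumptions (class names are crawler4j's own; BasicCrawler and numberOfCrawlers come from the question; doOtherWorkWhileCrawling() is a hypothetical placeholder for your code):

```java
// Standard crawler4j setup (storage folder path is only an example)
CrawlConfig config = new CrawlConfig();
config.setCrawlStorageFolder("/tmp/crawl");
PageFetcher fetcher = new PageFetcher(config);
RobotstxtServer robots = new RobotstxtServer(new RobotstxtConfig(), fetcher);
CrawlController controller = new CrawlController(config, fetcher, robots);
controller.addSeed("http://example.com/");

// Variant 1: start() does not return until every crawler thread is done
controller.start(BasicCrawler.class, numberOfCrawlers);

// Variant 2 (pick one variant, not both): startNonBlocking() returns at once
controller.startNonBlocking(BasicCrawler.class, numberOfCrawlers);
doOtherWorkWhileCrawling();   // the calling thread keeps running meanwhile
controller.waitUntilFinish(); // block here only when you are ready to
```

In other words, Variant 1 is Variant 2 with the waiting done for you immediately.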
1 vote • 1 answer

crawler4j does not recognize all links on a page

Basically I am facing a problem where crawler4j does not recognize all the links on a page. Say, for example, there are 5 links on the page; only 3 of them get recognized and fetched, while the remaining 2 are not even recognized. What is the expected…
1 vote • 1 answer

Can I add an https URL as my seed with Crawler4j?

I have to crawl an SSL website using crawler4j-4.1.jar and its dependencies. Can I add an https URL as my first seed?
Rache • 217 • 1 • 7 • 20
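crawler4j accepts https seeds the same way as http ones; the detail that usually trips people up is a shouldVisit() override that only whitelists the http:// scheme. A sketch, assuming the crawler4j 4.x shouldVisit(Page, WebURL) signature and a hypothetical example.com target:

```java
// https seeds are added exactly like http ones
controller.addSeed("https://example.com/");

// make sure the filter does not silently drop the https scheme
@Override
public boolean shouldVisit(Page referringPage, WebURL url) {
    String href = url.getURL().toLowerCase();
    return href.startsWith("https://example.com/");
}
```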
1 vote • 1 answer

Crawler4j keeps blocking after crawl

I am using Crawler4j to simply get the HTML from the crawled pages. It successfully stores the retrieved HTML for my test site of about 50 pages. It uses the shouldVisit method I implemented, and it uses the visit method I implemented. These both…
Indigenuity • 9,332 • 6 • 39 • 68
1 vote • 1 answer

Can Crawler4j interpret wildcards using asterisks (*) in robots.txt?

I want to be able to block web crawlers from accessing pages other than page1. The following should block all directories/file names containing the word "page", so something like /localhost/myApp/page2.xhtml should be blocked. #Disallow:…
Andy T • 136 • 11
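Whether a given crawler4j version expands * in robots.txt depends on its robots parser, so it is worth testing the pattern logic independently of the library. Under the common convention (as in Google's robots.txt handling), * matches any character sequence, $ anchors the end of the path, and every rule is otherwise a prefix match. A self-contained matcher built on those assumptions:

```java
public class RobotsWildcard {
    /** Turn a robots.txt path pattern (with * and $) into a regex and test a path. */
    static boolean matches(String pattern, String path) {
        boolean anchored = pattern.endsWith("$");
        String body = anchored ? pattern.substring(0, pattern.length() - 1) : pattern;
        StringBuilder re = new StringBuilder();
        for (char c : body.toCharArray()) {
            if (c == '*') re.append(".*");                               // wildcard
            else re.append(java.util.regex.Pattern.quote(String.valueOf(c)));
        }
        if (!anchored) re.append(".*");  // robots rules are prefix matches by default
        return path.matches(re.toString());
    }

    public static void main(String[] args) {
        System.out.println(matches("/myApp/page*", "/myApp/page2.xhtml")); // true
        System.out.println(matches("/myApp/page*", "/myApp/index.html"));  // false
    }
}
```

If crawler4j's own parser turns out to ignore wildcards, a check like this can be applied manually inside shouldVisit().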
1 vote • 1 answer

Getting all iframes and base64 codes present in HTML pages using crawler4j

I am using crawler4j to crawl some websites and it is working fine. I am able to download all the files present on a website, and now I have a new task ahead of me. I need to extract iframe, base64 and other embedded codes as well, if possible! Till…
Sudhir kumar • 549 • 2 • 8 • 31
1 vote • 1 answer

Permission for an external jar to create files under Tomcat

I have a problem in my application. It obtains data from websites through Crawler4j and needs to create some directories and files to manipulate the data, but Tomcat doesn't grant the permissions. The error is like this: Couldn't create this folder:…
Marcelo • 63 • 6
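Under Tomcat the JVM usually runs as a low-privilege user whose working directory is not writable, so relative paths such as a crawl-storage folder fail to be created. A small self-contained check that distinguishes "parent not writable" from other mkdirs() failures (the path under java.io.tmpdir is only an example; in production you would point at a directory the Tomcat user owns):

```java
import java.io.File;

public class DirCheck {
    /** Try to create a directory tree and report why it failed, if it did. */
    static String ensureDir(File dir) {
        if (dir.isDirectory()) return "exists";
        if (dir.mkdirs()) return "created";
        // mkdirs() returns false for both permission problems and races;
        // inspecting the nearest existing parent narrows the cause down
        File parent = dir.getParentFile();
        while (parent != null && !parent.exists()) parent = parent.getParentFile();
        if (parent != null && !parent.canWrite()) {
            return "no write permission on " + parent;
        }
        return "could not create " + dir;
    }

    public static void main(String[] args) {
        File dir = new File(System.getProperty("java.io.tmpdir"), "crawler-data/frontier");
        System.out.println(ensureDir(dir));
    }
}
```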
1 vote • 0 answers

crawler4j not working when used with TimerTask

We have been trying to use the crawler so that we can crawl a particular website at a certain interval. For this we have been trying to run the crawler from a timer. But after the first successful crawl using the timer, it always says in the…
redjohn • 81 • 1 • 5
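A pattern that often resolves this symptom is to rebuild the entire crawler4j object graph inside each timer tick, on the assumption (worth verifying against your version) that a CrawlController which has completed a crawl is not designed to be restarted. The skeleton below is runnable as-is; the crawl itself is left as comments so the scheduling logic stays self-contained:

```java
import java.util.Timer;
import java.util.TimerTask;

public class ScheduledCrawl {
    /** One crawl pass; everything crawler4j-related is rebuilt from scratch here. */
    static String runOnePass() {
        // Construct a NEW CrawlConfig / PageFetcher / RobotstxtServer /
        // CrawlController on every call, add the seeds again, then run
        // controller.start(MyCrawler.class, n) -- blocking is fine inside the task.
        return "crawl pass at " + System.currentTimeMillis();
    }

    public static void main(String[] args) throws InterruptedException {
        Timer timer = new Timer("crawl-timer");
        timer.schedule(new TimerTask() {
            @Override public void run() {
                System.out.println(runOnePass());
            }
        }, 0, 60_000);      // first pass immediately, then every 60 s

        Thread.sleep(200);  // keep the demo alive long enough for the first pass
        timer.cancel();
    }
}
```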
1 vote • 1 answer

crawler4j: website bans my IP address for a few minutes after 20-30 seconds of crawling

I've been trying to crawl a website at mystore411.com using the open source crawler4j. The crawler works fine for a limited period of time (say 20-30 seconds) and then the website bans my address for a few minutes before I can crawl again. I couldn't figure out a…
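A ban after a burst of requests usually means the crawl rate triggered the site's throttling. crawler4j exposes a politeness delay on CrawlConfig; a configuration fragment that slows the crawler down (the folder path and the exact numbers are examples to tune, not recommendations):

```java
// Politeness settings on crawler4j's CrawlConfig
CrawlConfig config = new CrawlConfig();
config.setCrawlStorageFolder("/tmp/crawl");  // any writable path
config.setPolitenessDelay(2000);             // wait >= 2 s between requests
config.setMaxPagesToFetch(500);              // cap a single crawl pass
// Also consider passing 1 as the crawler-thread count to controller.start(...):
// a single thread makes the request rate to one host predictable.
```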
1 vote • 1 answer

Improving Crawler4j crawler efficiency and scalability

I am using the Crawler4j crawler to crawl some domains. Now I want to improve the efficiency of the crawler: I want it to use my full bandwidth and crawl as many URLs as possible in a given time period. For that I am taking the following…
Sudhir kumar • 549 • 2 • 8 • 31
1 vote • 1 answer

Web crawler with incremental crawling support for Windows

I need an open source web crawler developed in Java with incremental crawling support. The web crawler should be easy to customize and integrate with Solr or Elasticsearch. It should be an active project that is being developed further with more…
Kumar • 3,782 • 4 • 39 • 87
1 vote • 2 answers

Crawl URLs with a certain prefix

I would like crawler4j to crawl only certain URLs, ones that have a certain prefix. So, for example, if a URL starts with http://url1.com/timer/image it is valid, e.g. http://url1.com/timer/image/text.php. This URL is not valid:…
Carol.Kar • 4,581 • 36 • 131 • 264
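The usual place for this filter is the crawler's shouldVisit() override. The check itself is plain string logic, so it can be written and tested independently; a self-contained sketch (the prefix is taken from the question, and in crawler4j you would return this from shouldVisit(Page, WebURL) using url.getURL()):

```java
public class PrefixFilter {
    // prefixes taken from the question; extend the array as needed
    private static final String[] ALLOWED_PREFIXES = {
        "http://url1.com/timer/image"
    };

    /** Mirror of the check a WebCrawler.shouldVisit override would perform. */
    static boolean shouldVisit(String url) {
        String lower = url.toLowerCase();
        for (String prefix : ALLOWED_PREFIXES) {
            if (lower.startsWith(prefix)) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(shouldVisit("http://url1.com/timer/image/text.php")); // true
        System.out.println(shouldVisit("http://url1.com/other/page.html"));      // false
    }
}
```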
1 vote • 2 answers

Crawl a list of sites using Crawler4j

I have a problem loading a list of links; these links should be passed to controller.addSeed in a loop. Here is the code: SelectorString selector = new SelectorString(); List lista = new ArrayList<>(); lista = selector.leggiFile(); String…
Justin • 1,149 • 2 • 19 • 35
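Loading the seed list and feeding it to the controller are separable steps, which makes the file-reading half easy to get right on its own. A self-contained sketch (the file format — one URL per line, blanks and # comments ignored — is an assumption; the addSeed call is shown as a comment because it needs a live CrawlController):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

public class SeedLoader {
    /** Read one seed URL per line, skipping blank lines and # comments. */
    static List<String> readSeeds(Path file) throws IOException {
        List<String> seeds = new ArrayList<>();
        for (String line : Files.readAllLines(file)) {
            String s = line.trim();
            if (!s.isEmpty() && !s.startsWith("#")) seeds.add(s);
        }
        return seeds;
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("seeds", ".txt");
        Files.write(file, List.of("http://example.com/", "", "# comment", "http://example.org/"));
        for (String seed : readSeeds(file)) {
            // with a CrawlController in scope: controller.addSeed(seed);
            System.out.println(seed);
        }
    }
}
```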
1 vote • 0 answers

Pass values from visit() to handlePageStatusCode()

I am working on a Groovy and Grails project. I have a requirement where I need to pass some values from visit() to handlePageStatusCode() in crawler4j. The two methods are inside the class src/groovy/BasicCrawler.groovy. I cannot change the…
clever_bassi • 2,392 • 2 • 24 • 43
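Since both methods live on the same crawler class but are invoked by the framework, the usual way to share data between them is thread-safe state on that class rather than parameter passing. One caveat to verify: crawler4j generally calls handlePageStatusCode() when a fetch completes, before visit() runs for that same page, so state written in visit() is typically only observable for pages processed later. A pure-Java sketch of the shared-state pattern (onVisit/onStatus are stand-ins for the real callbacks):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Sketch: per-URL scratch space shared by the callbacks of one crawler class. */
public class SharedCrawlState {
    // keyed by URL so concurrent crawler threads don't clobber each other
    static final Map<String, String> notes = new ConcurrentHashMap<>();

    // stands in for work done inside visit(Page page)
    static void onVisit(String url) {
        notes.put(url, "seen-in-visit");
    }

    // stands in for handlePageStatusCode(WebURL url, int statusCode, String desc)
    static String onStatus(String url) {
        return notes.getOrDefault(url, "no-data");
    }

    public static void main(String[] args) {
        onVisit("http://example.com/a");
        System.out.println(onStatus("http://example.com/a")); // seen-in-visit
        System.out.println(onStatus("http://example.com/b")); // no-data
    }
}
```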
1 vote • 1 answer

Get mp3/pdf files using JSoup in Groovy

I am developing an application for crawling the web using crawler4j and Jsoup. I need to parse a webpage using Jsoup and check if it has zip files, pdf/doc and mp3/mov files available as resources for download. For zip files I did the following and…
clever_bassi • 2,392 • 2 • 24 • 43
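Once Jsoup has collected the candidate links (e.g. iterate doc.select("a[href]") and take absUrl("href") for each element), the per-type check reduces to inspecting the URL's extension. A self-contained helper for that last step (the extension set matches the question's file types; query strings and fragments are stripped first, which is an assumption about how the links are formed):

```java
import java.util.Locale;
import java.util.Set;

public class DownloadableCheck {
    private static final Set<String> EXTS = Set.of("zip", "pdf", "doc", "mp3", "mov");

    /** True if the URL path ends in one of the target extensions. */
    static boolean isDownloadable(String url) {
        // strip query string and fragment before looking at the extension
        String path = url.split("[?#]", 2)[0].toLowerCase(Locale.ROOT);
        int dot = path.lastIndexOf('.');
        return dot >= 0 && EXTS.contains(path.substring(dot + 1));
    }

    public static void main(String[] args) {
        System.out.println(isDownloadable("http://host/a/song.mp3"));  // true
        System.out.println(isDownloadable("http://host/page.html"));   // false
    }
}
```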