Questions tagged [crawler4j]

Crawler4j is an open source Java crawler that provides a simple interface for crawling the Web.

Reference: https://github.com/yasserg/crawler4j

174 questions
1
vote
1 answer

Crawler4j - Getting exception java.lang.NoSuchMethodError

I am trying to set up crawler4j via Eclipse (Juno). When I run it, I get the exception below (even though the program keeps running without logging anything): "Exception in thread "main" java.lang.NoSuchMethodError: …
neha
  • 11
  • 2
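A NoSuchMethodError at startup almost always means mismatched jars on the classpath (for example, two different crawler4j versions, or a stale dependency jar) rather than a bug in the code. A minimal bootstrap along the lines of the project's README example is a useful sanity check that the jars agree; the storage folder and seed URL below are placeholders:

```java
import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;

public class CrawlerBootstrap {
    public static class MyCrawler extends WebCrawler {
        @Override
        public void visit(Page page) {
            System.out.println("Visited: " + page.getWebURL().getURL());
        }
    }

    public static void main(String[] args) throws Exception {
        CrawlConfig config = new CrawlConfig();
        config.setCrawlStorageFolder("/tmp/crawl-root");  // placeholder path

        PageFetcher pageFetcher = new PageFetcher(config);
        RobotstxtServer robotstxtServer =
                new RobotstxtServer(new RobotstxtConfig(), pageFetcher);
        CrawlController controller =
                new CrawlController(config, pageFetcher, robotstxtServer);

        controller.addSeed("https://www.ics.uci.edu/");  // placeholder seed
        controller.start(MyCrawler.class, 5);            // 5 crawler threads
    }
}
```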
1
vote
2 answers

What does !FILTERS mean?

I have recently implemented Crawler4j and I am trying to teach myself the code by breaking it down line by line. I am having trouble understanding what the !FILTERS object on the line of code below means. @Override public boolean…
Octavius
  • 583
  • 5
  • 19
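In the stock crawler4j example this question refers to, FILTERS is a compiled regular expression listing file extensions the crawler should skip, and the ! simply negates the match. A sketch based on that standard example (the domain is a placeholder; older 3.x releases declare shouldVisit with only the WebURL parameter):

```java
import java.util.regex.Pattern;

import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.url.WebURL;

public class FilteringCrawler extends WebCrawler {
    // Compiled regex of file extensions the crawler should skip.
    private static final Pattern FILTERS =
            Pattern.compile(".*(\\.(css|js|gif|jpe?g|png|mp3|mp4|zip|gz))$");

    @Override
    public boolean shouldVisit(Page referringPage, WebURL url) {
        String href = url.getURL().toLowerCase();
        // "!" negates the match: visit the URL only if it does NOT end
        // in one of the filtered extensions (and stays on our domain).
        return !FILTERS.matcher(href).matches()
                && href.startsWith("https://www.example.com/");
    }
}
```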
1
vote
2 answers

NoSuchMethodError in crawler4j CrawlController class

I am using the example given here and included the necessary files (crawler4j-3.3.zip & crawler4j-3.x-dependencies.zip) from [here](http://code.google.com/p/crawler4j/downloads/list) in my build path and run path. I am getting this error: Exception in thread…
user801154
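A quick way to confirm a version mismatch is to ask the JVM which jar it actually loaded the class from. This is plain Java, independent of crawler4j:

```java
import edu.uci.ics.crawler4j.crawler.CrawlController;

public class WhichJar {
    public static void main(String[] args) {
        // Prints the jar CrawlController was actually loaded from.
        // If this is not the crawler4j-3.3 jar you added, the
        // NoSuchMethodError comes from a version mismatch.
        System.out.println(CrawlController.class
                .getProtectionDomain().getCodeSource().getLocation());
    }
}
```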
1
vote
1 answer

crawler4j to crawl a list of URLs without crawling the entire website

I have a list of web URLs that need to be crawled. Is it possible to crawl only the listed pages without crawling deeper? If I add a URL as a seed, it crawls the full website at full depth.
Ramesh
  • 2,295
  • 5
  • 35
  • 64
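crawler4j supports this directly: seeds start at depth 0, so limiting the maximum crawl depth to 0 makes the crawler fetch only the seed URLs and follow no links. A minimal sketch with placeholder paths and URLs:

```java
import java.util.Arrays;
import java.util.List;

import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;

public class SeedOnlyCrawl {
    public static void main(String[] args) throws Exception {
        CrawlConfig config = new CrawlConfig();
        config.setCrawlStorageFolder("/tmp/crawl-root");  // placeholder path
        // Depth 0 = fetch only the seeds themselves; no links are followed.
        config.setMaxDepthOfCrawling(0);

        PageFetcher fetcher = new PageFetcher(config);
        RobotstxtServer robots = new RobotstxtServer(new RobotstxtConfig(), fetcher);
        CrawlController controller = new CrawlController(config, fetcher, robots);

        List<String> urls = Arrays.asList(
                "https://example.com/page1",   // placeholder URLs
                "https://example.com/page2");
        for (String url : urls) {
            controller.addSeed(url);
        }
        // Substitute your own WebCrawler subclass for real page handling.
        controller.start(WebCrawler.class, 2);
    }
}
```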
1
vote
1 answer

How to extract all links on a page using crawler4j?

I am implementing a web crawler using the Crawler4j library, but I am not getting all the links on a web site. I tried to extract all the links on one page using Crawler4j and missed some links. Crawler4j version: crawler4j-3.3. URL I used…
user801154
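In crawler4j, the links extracted from a fetched page are exposed through its parse data. Note that the parser only sees the raw HTML, so links generated by JavaScript will always be missed. A sketch listing every outgoing URL of a visited page:

```java
import java.util.Set;

import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.parser.HtmlParseData;
import edu.uci.ics.crawler4j.url.WebURL;

public class LinkListingCrawler extends WebCrawler {
    @Override
    public void visit(Page page) {
        if (page.getParseData() instanceof HtmlParseData) {
            HtmlParseData html = (HtmlParseData) page.getParseData();
            // Every link the parser extracted from the page's HTML.
            Set<WebURL> links = html.getOutgoingUrls();
            for (WebURL link : links) {
                System.out.println(link.getURL());
            }
        }
    }
}
```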
1
vote
1 answer

Crawler4j gives null as parentURL and zero as parentDocID in url redirection

I am using the latest version of Crawler4j to crawl some feed URLs. I've passed some seed URLs along with the doc ID and I have also set the depth to zero as I only want the content of that page. The problem is that I am not able to get the…
Pratik
  • 51
  • 3
  • 10
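For reference, seeds can be registered with explicit document IDs, and the parent fields are read from the page's WebURL inside visit(). A sketch of the setup the question describes (the feed URL is a placeholder):

```java
import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.url.WebURL;

public class ParentInfoCrawler extends WebCrawler {
    @Override
    public void visit(Page page) {
        WebURL url = page.getWebURL();
        // For a seed that was redirected, parentUrl/parentDocid can come
        // back null/0 -- the behaviour the question describes.
        System.out.println("docid=" + url.getDocid()
                + " parentDocid=" + url.getParentDocid()
                + " parentUrl=" + url.getParentUrl());
    }
}
// Seeds registered with explicit doc IDs, depth limited to the seeds:
//   config.setMaxDepthOfCrawling(0);
//   controller.addSeed("https://example.com/feed", 1);
```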
1
vote
1 answer

Why would using hdfs:// prefix for a path to a file allow a file to be opened?

I'm writing a hadoop job that crawls pages. The library I am using uses the file system to store crawl data while it crawls. I was sure that the library would have to be modified to use the HDFS since a completely different set of classes need to be…
Raj
  • 3,051
  • 6
  • 39
  • 57
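The general answer is that Hadoop resolves the FileSystem implementation from the URI scheme: an hdfs:// prefix selects HDFS, while a schemeless path falls back to whatever fs.defaultFS names, which is the local filesystem in a default configuration. A sketch illustrating the resolution (the namenode host and paths are placeholders):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SchemeResolution {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // The URI scheme picks the FileSystem implementation:
        // "hdfs://" resolves to HDFS; a schemeless path falls back
        // to whatever fs.defaultFS names (the local FS by default).
        Path withScheme = new Path("hdfs://namenode:8020/crawl/frontier");
        Path withoutScheme = new Path("/crawl/frontier");

        System.out.println(withScheme.getFileSystem(conf).getClass().getName());
        System.out.println(withoutScheme.getFileSystem(conf).getClass().getName());
    }
}
```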
0
votes
1 answer

Scrape a Dynamic Website using Java with Selenium?

I'm trying to scrape https://www.rspca.org.uk/findapet#onSubmitSetHere to get a list of all pets for adoption. I've built web scrapers before using crawler4j but the websites were static. Since https://www.rspca.org.uk/findapet#onSubmitSetHere is…
breaktop
  • 1,899
  • 4
  • 37
  • 58
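Because crawler4j only fetches and parses the raw HTML, it never sees content that JavaScript renders afterwards; Selenium drives a real browser, so the rendered DOM is available. A minimal sketch, assuming ChromeDriver is installed:

```java
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;

public class DynamicScrape {
    public static void main(String[] args) {
        WebDriver driver = new ChromeDriver();  // needs chromedriver on PATH
        try {
            driver.get("https://www.rspca.org.uk/findapet");
            // Rendered DOM, including results that JavaScript filled in
            // after the initial HTML load.
            String renderedHtml = driver.getPageSource();
            System.out.println(renderedHtml.length() + " chars of rendered HTML");
        } finally {
            driver.quit();
        }
    }
}
```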
0
votes
1 answer

Feign client always throws a null pointer exception in a Spring Boot/Crawler4j app

I am running a Crawler4j instance in a Spring Boot application and my OpenFeign client is always null. public class MyCrawler extends WebCrawler { @Autowired HubClient hubClient; @Override public void visit(Page page) { // Lots of…
Nikolai Manek
  • 980
  • 6
  • 16
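The usual cause is that crawler4j instantiates WebCrawler subclasses itself, outside the Spring context, so field injection never runs. The common workaround is to construct the crawlers yourself and hand crawler4j a factory (recent crawler4j releases accept a WebCrawlerFactory in CrawlController.start). HubClient below is a stub standing in for the asker's Feign interface:

```java
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;

// Stub standing in for the asker's Feign interface (hypothetical).
interface HubClient {
    void send(String payload);
}

public class MyCrawler extends WebCrawler {
    private final HubClient hubClient;

    public MyCrawler(HubClient hubClient) {
        this.hubClient = hubClient;
    }

    @Override
    public void visit(Page page) {
        // hubClient is a real Spring-built proxy here, not null.
        hubClient.send(page.getWebURL().getURL());
    }
}

// From a Spring bean that has HubClient constructor-injected:
//   CrawlController.WebCrawlerFactory<MyCrawler> factory =
//           () -> new MyCrawler(hubClient);
//   controller.start(factory, numberOfCrawlers);
```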
0
votes
1 answer

Directing the search depths in Crawler4j Solr

I am trying to make the crawler "abort" searching a certain subdomain every time it doesn't find a relevant page after 3 consecutive tries. After extracting the title and the text of the page I start looking for the correct pages to submit to my…
ge0rgi0
  • 59
  • 1
  • 9
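crawler4j has no built-in "give up after N misses" switch, so this has to be custom logic. One possible approach: track consecutive irrelevant pages per subdomain in a shared map, reset the count on a hit, and refuse further URLs for that host in shouldVisit. In this sketch, isRelevant is a placeholder for the asker's relevance check before Solr submission:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.parser.HtmlParseData;
import edu.uci.ics.crawler4j.url.WebURL;

public class GiveUpCrawler extends WebCrawler {
    private static final int MAX_MISSES = 3;
    // Consecutive irrelevant pages seen per subdomain (shared across threads).
    private static final Map<String, Integer> misses = new ConcurrentHashMap<>();

    @Override
    public boolean shouldVisit(Page referringPage, WebURL url) {
        // Stop scheduling URLs for hosts that have missed too often.
        return misses.getOrDefault(url.getSubDomain(), 0) < MAX_MISSES;
    }

    @Override
    public void visit(Page page) {
        String host = page.getWebURL().getSubDomain();
        if (page.getParseData() instanceof HtmlParseData
                && isRelevant((HtmlParseData) page.getParseData())) {
            misses.put(host, 0);             // reset the streak on a hit
        } else {
            misses.merge(host, 1, Integer::sum);
        }
    }

    private boolean isRelevant(HtmlParseData html) {
        // Placeholder for the asker's relevance check before Solr submission.
        return html.getText() != null && html.getText().contains("keyword");
    }
}
```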
0
votes
1 answer

crawler4j detects lines between tags as text
