Questions tagged [crawler4j]

Crawler4j is an open source Java web crawler that provides a simple interface for crawling the Web.

Reference: https://github.com/yasserg/crawler4j

174 questions
3
votes
2 answers

How to disable Crawler4J logger?

I am crawling using Crawler4J and I don't want it to print log messages, but Crawler4J has a logger built in. How can I disable the logger inside the Crawler4J library? (A sketch of one approach follows this entry.)
이현규
  • 97
  • 1
  • 7
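
Crawler4j logs through SLF4J, so the cleanest way to silence it is to configure the logging backend rather than crawler4j itself. A minimal sketch, assuming Logback is the SLF4J binding on the classpath (other bindings need their own configuration):

```java
import ch.qos.logback.classic.Level;
import ch.qos.logback.classic.Logger;
import org.slf4j.LoggerFactory;

public class QuietCrawler4j {
    public static void main(String[] args) {
        // Assumes the SLF4J binding is Logback; casting the SLF4J logger to the
        // Logback implementation exposes setLevel(). OFF silences the whole package.
        Logger crawler4jLogger = (Logger) LoggerFactory.getLogger("edu.uci.ics.crawler4j");
        crawler4jLogger.setLevel(Level.OFF);
        // ...start the CrawlController as usual after this point.
    }
}
```

The same effect can be achieved declaratively with a `<logger name="edu.uci.ics.crawler4j" level="OFF"/>` entry in logback.xml.
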
3
votes
2 answers

Crawler4j: some URLs are crawled without issue while others are not crawled at all

I have been playing around with Crawler4j and have successfully had it crawl some pages, but have had no success crawling others. For example, I have gotten it to successfully crawl Reddit with this code: public class Controller { public static void…
theGuy05
  • 417
  • 1
  • 7
  • 22
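
For reference, a minimal crawler4j controller using the 4.x-style API looks roughly like the sketch below; the storage folder and seed URL are placeholders. When some sites are never crawled, the usual suspects are an overly strict shouldVisit() filter, redirects to a domain the filter rejects, or a robots.txt disallow.

```java
import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;

public class Controller {
    public static void main(String[] args) throws Exception {
        CrawlConfig config = new CrawlConfig();
        config.setCrawlStorageFolder("/tmp/crawler4j");  // placeholder storage folder
        config.setPolitenessDelay(1000);                 // one request per second per host

        PageFetcher pageFetcher = new PageFetcher(config);
        RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
        RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);

        CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);
        controller.addSeed("https://www.example.com/");  // placeholder seed URL
        controller.start(MyCrawler.class, 4);            // MyCrawler extends WebCrawler
    }
}
```
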
3
votes
3 answers

How to scrape using crawler4j?

I've been going at this for 4 hours now, and I simply can't see what I'm doing wrong. I have two files: MyCrawler.java and Controller.java. MyCrawler.java: import edu.uci.ics.crawler4j.crawler.Page; import…
rockstardev
  • 13,479
  • 39
  • 164
  • 296
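
The companion crawler class is typically a small WebCrawler subclass like the sketch below (crawler4j 4.x signatures, where shouldVisit() also receives the referring page); the domain filter and output are placeholders.

```java
import java.util.regex.Pattern;

import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.parser.HtmlParseData;
import edu.uci.ics.crawler4j.url.WebURL;

public class MyCrawler extends WebCrawler {

    // Skip common binary/static resources; the pattern is only an example.
    private static final Pattern FILTERS =
            Pattern.compile(".*(\\.(css|js|gif|jpe?g|png|pdf|zip))$");

    @Override
    public boolean shouldVisit(Page referringPage, WebURL url) {
        String href = url.getURL().toLowerCase();
        // Stay on one (placeholder) site and skip filtered file types.
        return !FILTERS.matcher(href).matches()
                && href.startsWith("https://www.example.com/");
    }

    @Override
    public void visit(Page page) {
        if (page.getParseData() instanceof HtmlParseData) {
            HtmlParseData html = (HtmlParseData) page.getParseData();
            // Title and plain text are available without any extra parsing.
            System.out.println(page.getWebURL().getURL() + " -> " + html.getTitle());
        }
    }
}
```
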
3
votes
1 answer

Grails: Pass value from controller to thread

In my project, the action of my Grails controller creates a new thread and calls a class from the src/groovy folder each time this action is executed. I need to pass a value from this action to the new thread being created. How can I achieve…
clever_bassi
  • 2,392
  • 2
  • 24
  • 43
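
This is really a plain Java/Groovy concurrency question rather than a crawler4j one. The usual approach, sketched here in Java with hypothetical names, is to hand the value to the worker through its constructor before the thread is started:

```java
public class CrawlJob implements Runnable {

    private final String seedUrl;  // the value handed over from the controller action

    public CrawlJob(String seedUrl) {
        this.seedUrl = seedUrl;
    }

    @Override
    public void run() {
        // The worker owns its own copy of the value; no shared mutable state is needed.
        System.out.println("Crawling seed: " + seedUrl);
    }

    public static void main(String[] args) {
        // In a Grails action this would look like: new Thread(new CrawlJob(params.url)).start()
        new Thread(new CrawlJob("https://www.example.com/")).start();
    }
}
```
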
3
votes
1 answer

Params for WebCrawler in crawler4j

Is it possible to pass params to a WebCrawler? For example, I want to pass a new rule to the WebCrawler.shouldVisit(WebURL url) method at runtime, or set some field in my WebCrawler. Is it possible? (A sketch of one approach follows this entry.)
chinchilla
  • 93
  • 5
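
Recent crawler4j releases (roughly 4.2 and later; check your version) accept a WebCrawlerFactory in CrawlController.start(), which lets each crawler instance be constructed with runtime parameters. A rough sketch with a hypothetical ParamCrawler:

```java
import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.url.WebURL;

// Hypothetical crawler that takes its visit rule as a constructor argument.
public class ParamCrawler extends WebCrawler {

    private final String allowedPrefix;

    public ParamCrawler(String allowedPrefix) {
        this.allowedPrefix = allowedPrefix;
    }

    @Override
    public boolean shouldVisit(Page referringPage, WebURL url) {
        // The rule is no longer hard-coded; it came in at construction time.
        return url.getURL().startsWith(allowedPrefix);
    }
}
```

In the controller, the factory overload then builds each instance with the runtime value, e.g. controller.start(() -> new ParamCrawler("https://www.example.com/news/"), 4). Older versions instead exposed controller.setCustomData(...), which the crawler could read back via getMyController().getCustomData().
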
3
votes
1 answer

Set values from src/groovy classes to domain class properties

I'm working on crawler4j using Groovy and Grails. I have a BasicCrawler.groovy class in src/groovy, the domain class Crawler.groovy, and a controller called CrawlerController.groovy. I have a few properties in the BasicCrawler.groovy class, like url,…
clever_bassi
  • 2,392
  • 2
  • 24
  • 43
3
votes
1 answer

How to parse the HTML when using crawler4j

Recently, I had to crawl some websites with the open source project crawler4j. However, crawler4j didn't offer any API for this. Now I have the problem of how to parse HTML with the functions and classes provided by crawler4j and find elements like we…
mly
  • 31
  • 2
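
crawler4j itself only hands back the page source and extracted text through HtmlParseData; for element-level queries it is common to pair it with a standalone parser such as jsoup (a separate dependency, assumed in this sketch):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.parser.HtmlParseData;

public class ParsingCrawler extends WebCrawler {
    @Override
    public void visit(Page page) {
        if (page.getParseData() instanceof HtmlParseData) {
            String html = ((HtmlParseData) page.getParseData()).getHtml();
            // jsoup turns the raw HTML into a queryable DOM; the base URI helps resolve links.
            Document doc = Jsoup.parse(html, page.getWebURL().getURL());
            for (Element heading : doc.select("h1")) {   // CSS-style selector
                System.out.println(heading.text());
            }
        }
    }
}
```
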
3
votes
2 answers

Replace all URLs in an HTML file

I'm crawling some HTML files with crawler4j and I want to replace all links in those pages with custom links. Currently I can get the source HTML and a list of all outgoing links with this code: HtmlParseData htmlParseData = (HtmlParseData)…
Alireza Noori
  • 14,961
  • 30
  • 95
  • 179
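
Given the HtmlParseData the question already obtains, one blunt but simple approach is to iterate over getOutgoingUrls() and substitute each link in the source string; rewriting href attributes with an HTML parser would be more robust. A sketch with a placeholder rewrite scheme:

```java
import edu.uci.ics.crawler4j.parser.HtmlParseData;
import edu.uci.ics.crawler4j.url.WebURL;

public class LinkRewriter {

    /**
     * Replaces every outgoing link in the page source with a custom link
     * (here a placeholder proxy-style URL). Note that plain string replacement
     * touches every occurrence of the URL, not only href attributes.
     */
    public static String rewrite(HtmlParseData htmlParseData) {
        String html = htmlParseData.getHtml();
        for (WebURL link : htmlParseData.getOutgoingUrls()) {
            String original = link.getURL();
            String custom = "https://my-proxy.example.com/?u=" + original;  // placeholder scheme
            html = html.replace(original, custom);
        }
        return html;
    }
}
```
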
3
votes
1 answer

Browse .jdb output?

I am running crawler4j and the output goes to the directory /frontier/. The files in this directory are 00000000.jdb, je.info.0, je.info.lck, and je.lck; the .jdb file is the only one with data, and the other three have zero bytes. I am not sure what to do…
KDEx
  • 3,505
  • 4
  • 31
  • 39
2
votes
2 answers

Efficient design of crawler4J to get data

I am trying to get data from various websites. After searching on Stack Overflow, I am using crawler4j, as many suggested it. Below is my understanding/design: 1. Get sitemap.xml from robots.txt. 2. If sitemap.xml is not available in robots.txt,…
topblog
  • 93
  • 2
  • 7
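
For step 1 of this design: the sitemap locations in robots.txt are plain lines of the form "Sitemap: <url>", and crawler4j does not (to my knowledge) expose a helper for reading them, so a small JDK-only fetch-and-scan is enough. A sketch:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class SitemapLocator {
    /** Returns the sitemap URLs advertised in a site's robots.txt, if any. */
    public static List<String> sitemapsFromRobots(String site) throws Exception {
        List<String> sitemaps = new ArrayList<>();
        URL robots = new URL(site + "/robots.txt");
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(robots.openStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                if (line.toLowerCase().startsWith("sitemap:")) {
                    sitemaps.add(line.substring("sitemap:".length()).trim());
                }
            }
        }
        return sitemaps;   // empty list means: fall back to crawling from the home page
    }
}
```
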
2
votes
1 answer

What sequence of steps does crawler4j follow to fetch data?

I'd like to learn how crawler4j works. Does it fetch a web page, then download its content and extract it? What about the .db and .csv files and their structures? Generally, what sequence does it follow? Please, I want a descriptive answer. Thanks
Ahmed Sakr
  • 129
  • 1
  • 9
2
votes
2 answers

Web Crawler vs Html Parser

What is the difference between a web crawler and a parser? In Java there are names for the various fetching libraries; for example, Nutch is called a crawler and jsoup a parser. Do they serve the same purpose? Are they fully equivalent for the job? Thanks. (A short illustration follows this entry.)
Ahmed Sakr
  • 129
  • 1
  • 9
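
Roughly: a crawler (Nutch, crawler4j) decides which URLs to fetch, downloads them, and follows the links it finds, while a parser (jsoup) only turns one already-fetched document into a structure you can query. The contrast below is just an illustration, using jsoup's own convenience fetcher for the single-page fetch:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class CrawlerVsParser {
    public static void main(String[] args) throws Exception {
        // Fetching (what a crawler does, here via jsoup's convenience method for one page):
        Document doc = Jsoup.connect("https://www.example.com/").get();

        // Parsing (what a parser does): query the structure of the fetched document.
        System.out.println("Title: " + doc.title());
        System.out.println("Links found: " + doc.select("a[href]").size());
        // A crawler would now queue those links and repeat; a parser stops here.
    }
}
```
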
2
votes
2 answers

Is it possible to retrieve website content with Crawler4j?

I am very new to Java. Now, I want to retrieve the news article contents using a Google News search with the keyword "toy", from page 1 to page…
evabb
  • 405
  • 3
  • 21
2
votes
1 answer

crawler4j seems to be ignoring the robots.txt file... how do I fix it?

I am working on a project to crawl a small web directory and have implemented a crawler using crawler4j. I know that the RobotstxtServer should be checking whether a file is allowed/disallowed by the robots.txt file, but mine is still showing a…
drewfiss90
  • 53
  • 5
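
Two settings commonly matter here: robots.txt handling must be enabled on RobotstxtConfig, and the user-agent name it checks against should match the agent the crawler sends, otherwise the wrong robots.txt record may be consulted. A sketch under those assumptions, with a placeholder agent name:

```java
import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;

public class RobotsAwareSetup {
    public static RobotstxtServer build(CrawlConfig config) {
        config.setUserAgentString("my-study-crawler");        // placeholder agent name

        RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
        robotstxtConfig.setEnabled(true);                      // obey robots.txt (explicit here)
        robotstxtConfig.setUserAgentName("my-study-crawler");  // match the crawl user agent

        PageFetcher pageFetcher = new PageFetcher(config);
        return new RobotstxtServer(robotstxtConfig, pageFetcher);
    }
}
```
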
2
votes
1 answer

crawler4j asynchronously saving results to file

I'm evaluating crawler4j for ~1M crawls per day. My scenario is this: I'm fetching the URL and parsing its description, keywords, and title; now I would like to save each URL and its words into a single file. I've seen how it's possible to save crawled…
Gideon
  • 2,211
  • 5
  • 29
  • 47
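
A common pattern at this volume, sketched below with hypothetical names, is to have every crawler thread drop a line onto a shared BlockingQueue and let a single dedicated writer thread append to the file, so crawl threads never block on disk I/O:

```java
import java.io.BufferedWriter;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class ResultWriter implements Runnable {
    // Crawler threads offer "url<TAB>words" lines here from visit().
    public static final BlockingQueue<String> QUEUE = new LinkedBlockingQueue<>(100_000);

    @Override
    public void run() {
        try (BufferedWriter out = Files.newBufferedWriter(
                Paths.get("crawl-results.tsv"), StandardCharsets.UTF_8)) {
            while (!Thread.currentThread().isInterrupted()) {
                String line = QUEUE.take();   // blocks until a crawler produces a result
                out.write(line);
                out.newLine();
            }
        } catch (Exception e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

Inside visit(), each crawler then just calls ResultWriter.QUEUE.offer(url + "\t" + words); the writer thread is started once before controller.start().
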