Questions tagged [crawler4j]

Crawler4j is an open source Java crawler which provides a simple interface for crawling the Web.

Reference: https://github.com/yasserg/crawler4j

174 questions
0 votes, 1 answer

Why does the Crawler4j non-blocking method not wait for links in the queue?

Given this simple code: CrawlConfig config = new…
0 votes, 1 answer

Update a Java Swing component from a different class

I am working on a crawler project using crawler4j, and on top of it I have a Swing interface. I have two different classes, namely controller.java (which also contains the Swing components) and crawler.java. I am attempting to append output processed by…
kenAu89
0 votes, 1 answer

Why does this env object keep growing in size?

I have been working on a web crawler for some time now. The idea is simple: I have a SQL table containing a list of websites, and many threads that each fetch the first website from the table, delete it, and then crawl it (in a heap like…
0 votes, 0 answers

Multi-thread web crawling with Crawler4j: Missing pages

I am using the multi-threaded crawler Crawler4j to crawl some websites. It lets the user define the number of crawler threads to run on a website. I decided to run the crawler up to depth/layer = 10 and crawl up to 501 pages per…
Rushdi Shams
0 votes, 1 answer

How to download text contained in JavaScript files via crawler4j?

I'm trying to use crawler4j to extract text from some websites. However, although I have changed the filters to allow the js extension in the following manner: private final static Pattern FILTERS = Pattern.compile(".*(\\.(css|gif|jpg" +…
aardwolf
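The filter idea in this question can be sketched on its own, without the crawler. The class below is a hypothetical helper, not part of crawler4j, and its extension list is illustrative only; it does not reproduce the truncated pattern from the question. It shows one way to skip static assets while leaving `.js` URLs eligible, assuming the URL ends with the file extension.

```java
import java.util.Locale;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of crawler4j.
class ExtensionFilter {
    // Extensions to skip. "js" is deliberately absent so script URLs pass the filter.
    // The list is an assumption, not the pattern from the question.
    private static final Pattern SKIPPED = Pattern.compile(".*\\.(css|gif|jpe?g|png)$");

    // Returns true when the URL ends in one of the skipped extensions (case-insensitive).
    static boolean shouldSkip(String url) {
        return SKIPPED.matcher(url.toLowerCase(Locale.ROOT)).matches();
    }

    public static void main(String[] args) {
        System.out.println(shouldSkip("http://example.com/style.css")); // true
        System.out.println(shouldSkip("http://example.com/app.js"));    // false
    }
}
```

This covers only the URL-matching side. How the crawler treats the fetched script content is a separate question that the sketch does not address.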
0 votes, 1 answer

Crawler4j downloading articles

I'm trying to download articles from news portals using Crawler4j. I would like to store them in folders by category, such as 'sport', 'science', 'health', or any other category defined by the portal. URL parsing isn't enough, since some portals don't use categories…
Chris Lup
0 votes, 1 answer

Need clarification on the shouldVisit and visit methods of Crawler4j

I need to download PDFs from websites using Crawler4j. I am following this documentation to create two classes: PDFCrawler and PDFCrawlController. Now, in my PDFCrawler class, I have a shouldVisit(Page page, WebURL url) method as follows: public…
Rushdi Shams
0 votes, 1 answer

How to parse a document using crawler4j

I want to parse all the documents containing some text that I enter as a "query", using crawler4j in Eclipse. Any ideas?
Bruno Fernandes
0 votes, 1 answer

How to collect contact information from websites?

Does anyone know of a web crawler tool for collecting contact details from a website? Say I have a www.website/contact.. and I want to pull out the address, phone number, etc. There are two tools I've been looking at: the crawler4j open-source JAR for Java and…
0 votes, 2 answers

JavaDoc for Crawler4j

I recently came across the crawler4j API for web crawling in Java, but while developing my custom crawler I found that no Javadoc is available for it. Does anybody know whether this API has Javadoc and, if so, where it is?
Neeraj Jain
0 votes, 1 answer

How to schedule crawler4j crawl control to run periodically?

I'm using crawler4j to build a simple web crawler. What I want to do is to invoke the crawl control every 10 minutes. I created a servlet that starts when my Tomcat server starts, and in the servlet I am using ScheduledExecutorService for the…
rawPotato
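The scheduling approach this question describes can be sketched with the standard `ScheduledExecutorService` alone. `PeriodicCrawl` and its `start` method are hypothetical names, and the job is a placeholder `Runnable` rather than a real crawl controller. The demo uses a 100 ms delay so it finishes quickly; the question's interval would be `10, TimeUnit.MINUTES`.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical helper for illustration; not part of crawler4j.
class PeriodicCrawl {
    // Runs crawlJob immediately, then again `delay` after each run finishes,
    // so two runs never overlap on the single scheduler thread.
    static ScheduledExecutorService start(Runnable crawlJob, long delay, TimeUnit unit) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleWithFixedDelay(() -> {
            try {
                crawlJob.run();
            } catch (RuntimeException e) {
                // An exception escaping the task would suppress all later runs,
                // so report it and let the schedule continue.
                System.err.println("Crawl run failed: " + e);
            }
        }, 0, delay, unit);
        return scheduler;
    }

    public static void main(String[] args) throws InterruptedException {
        AtomicInteger runs = new AtomicInteger();
        // Placeholder job; a real one would configure and start a crawl here.
        ScheduledExecutorService scheduler = start(
                () -> System.out.println("crawl run " + runs.incrementAndGet()),
                100, TimeUnit.MILLISECONDS);
        Thread.sleep(350);       // let a few runs happen
        scheduler.shutdownNow(); // in a servlet, do this when the application stops
    }
}
```

A fixed delay is used instead of a fixed rate on the assumption that a crawl may take longer than the interval; with a fixed delay the next run waits for the previous one to finish.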
0 votes, 1 answer

Cannot Deploy Project involving Crawler4j

After adding the crawler4j JAR file and its dependencies to the classpath library (I am not using Maven), I try deploying and running the project, but my GlassFish 4.1 shows the following error: Severe: Exception during lifecycle…
0 votes, 2 answers

Can Crawler4j be run from another class?

I need to call Crawler4j from a different class. Instead of the main method in the Controller class, I used a simple method called setup: class Controller { public void setup(String seed) { try { String rootFolder = "data/crawler"; …
Mallik Kumar
0 votes, 1 answer

How to retrieve all the user comments from a site?

I want all the user comments from this site: http://www.consumercomplaints.in/?search=chevrolet The problem is that the comments are only partially displayed; to see a complete comment I have to click on the title above it, and this process has…
0 votes, 1 answer

Blocking task in a Java web application and request timeout on a Heroku server

I am new to Java web programming and am trying to make a web crawler using the Crawler4j sample code. My problem is that when I submit the repost request, the crawling task (which is a blocking task) takes some time to complete. Heroku hosting has a…