Questions tagged [crawler4j]

Crawler4j is an open source Java crawler which provides a simple interface for crawling the Web.

Reference: https://github.com/yasserg/crawler4j

174 questions
2 votes, 0 answers

Crawler4j with authentication

I'm trying to run crawler4j against a personal Redmine instance for testing purposes. I want to authenticate and crawl several levels of depth in the application. I followed this tutorial from the crawler4j FAQ and created the next snippet: import…
Antonio J. • 1,750 • 3 • 21 • 33
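For form-based logins like Redmine's, the crawler4j FAQ approach registers the credentials on the crawl configuration before starting the controller. A configuration sketch, assuming a crawler4j 4.x-style API and Redmine's default login form field names (the URL, folder, and field names here are illustrative):

```java
import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.authentication.AuthInfo;
import edu.uci.ics.crawler4j.crawler.authentication.FormAuthInfo;

CrawlConfig config = new CrawlConfig();
config.setCrawlStorageFolder("/tmp/crawl");   // illustrative storage folder
config.setMaxDepthOfCrawling(3);              // crawl several levels deep

// Form login: "username"/"password" are assumed to match the login form's field names
AuthInfo authInfo = new FormAuthInfo(
        "myUser", "myPass",
        "https://redmine.example.com/login",
        "username", "password");
config.addAuthInfo(authInfo);
```

The controller built from this config should then perform the login before fetching seed pages; the exact class names can differ across crawler4j versions.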
2 votes, 1 answer

Crawler4j - NoSuchMethod getOutgoingUrls()

I am trying to set up crawler4j. I am building it from source in NetBeans. I am using version 3.5 of crawler4j, and the calling classes are the same as the ones given on the site, reproduced below for ease: public class MyCrawler extends…
Prakash • 75 • 6
2 votes, 1 answer

Is it possible to ignore Http Content-Length?

I am using Crawler4J to collect information about a website, but sometimes I get the following error: INFORMATION: Exception while fetching content for: {someurl} [Premature end of Content-Length delimited message body (expected: X; received:…
Hisushi • 67 • 1 • 11
2 votes, 1 answer

Crawler4j with mongoDB

I was researching crawler4j. I found that it uses BerkeleyDB as its database. I am developing a Grails app using mongoDB and was wondering how flexibly crawler4j would work within my application. I basically want to store the crawled…
clever_bassi • 2,392 • 2 • 24 • 43
2 votes, 1 answer

Crawler4j with Grails App throws error

This might be a very basic and silly question for experienced people, but please help. I am trying to use Crawler4j within my Grails app by following this tutorial. I know it's Java code, but I am using it in a controller class called…
clever_bassi • 2,392 • 2 • 24 • 43
2 votes, 2 answers

How to get the resource types from a webpage using JSoup?

I am trying to make a web crawler in Groovy. I am looking to extract the resource types from a webpage. I need to check if a particular webpage has the following resource types: PDFs, JMP files, SWF files, ZIP files, MP3 files, images, movie files, JSL…
clever_bassi • 2,392 • 2 • 24 • 43
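Whatever parser extracts the links, the heart of such a check is mapping link extensions to resource types. A minimal pure-Java sketch (the type names and extension table are illustrative, not from the question):

```java
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

public class ResourceTypes {
    // Illustrative extension -> resource-type table for the kinds the question lists
    private static final Map<String, String> TYPES = Map.of(
            "pdf", "PDF", "jmp", "JMP", "swf", "SWF", "zip", "ZIP",
            "mp3", "MP3", "jpg", "Image", "png", "Image", "mp4", "Movie");

    // Classify a URL by its file extension; query string and fragment are stripped first
    static Optional<String> typeOf(String url) {
        String path = url.split("[?#]", 2)[0];
        int dot = path.lastIndexOf('.');
        if (dot < 0) return Optional.empty();
        return Optional.ofNullable(TYPES.get(path.substring(dot + 1).toLowerCase(Locale.ROOT)));
    }
}
```

Running each outgoing link through `typeOf` gives the set of resource types the page references, without fetching the resources themselves.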
2 votes, 1 answer

Crawler4j - Many URLs are discarded / not processed (missing in output)

I am running crawler4j to find the status (HTTP response) code for one million URLs. I have not set any filters to filter out URLs to be processed. I get a proper response for 90% of the URLs, but 10% are missing from the output. They don't even appear in…
user1746666 • 161 • 1 • 2 • 9
2 votes, 1 answer

crawler4j get full parent list

I'm new to crawler4j. I crawled a website to a certain depth and found what I searched for. What I am trying to do now is to trace back my steps and find out how I got to this page. I need a list of the links that led me to the page where the content…
IDontKnow • 159 • 1 • 12
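Crawler4j only hands you the immediate referrer of a URL (via the parent URL on each discovered link), so the full chain has to be reconstructed by recording child-to-parent pairs during the crawl and walking them back afterwards. A pure-Java sketch of that walk, with the map contents being whatever your `visit()` recorded (names here are illustrative):

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ParentTrace {
    // child URL -> parent URL, filled in during the crawl
    static final Map<String, String> PARENT = new HashMap<>();

    // Walk parent pointers back to the seed; returns the seed-first path to the hit.
    // Assumes the recorded parent links form no cycles.
    static List<String> pathTo(String url) {
        Deque<String> path = new ArrayDeque<>();
        for (String u = url; u != null; u = PARENT.get(u)) {
            path.addFirst(u);
        }
        return List.copyOf(path);
    }
}
```

The seed has no entry in the map, so the loop stops there and the returned list reads seed → … → target page.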
2 votes, 2 answers

Crawling only Dynamic Data

I am trying to crawl the archives of a local newspaper and am getting the desired result. Is there any way for me to program the crawler so that static elements such as the Home button and the footers, which are the same on every page, are not…
2 votes, 3 answers

Kill threads created by an object

I have created a custom crawler using crawler4j. In my app, I create a lot of controllers, and after a while the number of threads in the system hits the maximum value and the JVM throws an exception. Even though I call ShutDown() on the…
Alireza Noori • 14,961 • 30 • 95 • 179
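Crawler4j's controller has its own shutdown call (plus `waitUntilFinish()` in the versions these questions use), but the general pattern behind "kill the threads an object created" is the same either way: the object owns its thread pool and tears it down deterministically. A library-free sketch of that pattern (class and pool size are illustrative):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class OwnedPool implements AutoCloseable {
    private final ExecutorService pool = Executors.newFixedThreadPool(4);

    public void submit(Runnable task) {
        pool.submit(task);
    }

    // Stop accepting work, wait briefly for in-flight tasks, then interrupt stragglers
    @Override
    public void close() throws InterruptedException {
        pool.shutdown();
        if (!pool.awaitTermination(5, TimeUnit.SECONDS)) {
            pool.shutdownNow();
        }
    }

    public boolean isTerminated() {
        return pool.isTerminated();
    }
}
```

Creating many controllers without closing each one leaks its worker threads, which matches the symptom in the question; closing each pool before creating the next keeps the thread count bounded.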
2 votes, 2 answers

Use crawler4j to download js files

I'm trying to use crawler4j to download some websites. The only problem I have is that even though I return true for all .js files in the shouldVisit function, they never get downloaded. @Override public boolean shouldVisit(WebURL url) { return…
Alireza Noori • 14,961 • 30 • 95 • 179
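Returning true from `shouldVisit` only controls which URLs get scheduled; in crawler4j, non-HTML content such as scripts is typically skipped unless binary-content crawling is enabled on the `CrawlConfig` as well (an assumption about the version in use, worth checking). The URL test itself is easy to isolate and verify in plain Java:

```java
import java.util.regex.Pattern;

public class JsFilter {
    // Matches .js URLs, with or without a query string
    private static final Pattern JS =
            Pattern.compile(".*\\.js(\\?.*)?$", Pattern.CASE_INSENSITIVE);

    // The predicate a shouldVisit(WebURL) override would delegate to
    static boolean isJs(String url) {
        return JS.matcher(url).matches();
    }
}
```

Keeping the predicate separate from the crawler subclass makes it trivial to unit-test without spinning up a crawl.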
2 votes, 2 answers

What html parser should I use?

I am working on a product where I need to parse an HTML document. I looked at Jericho, TagSoup, Jsoup and Crawl4J. Which parser should I use to parse HTML, given that I need to run this process in a multi-threaded environment using Quartz? If 10 threads at a time…
vaibought • 461 • 2 • 6 • 20
2 votes, 2 answers

Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/http/conn/scheme/SchemeSocketFactory while Using Crawler4j

I am using the Crawler4j example code, but I got an exception. Here is my exception: Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/http/conn/scheme/SchemeSocketFactory at…
Raghavender Reddy • 180 • 2 • 5 • 18
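`org.apache.http.conn.scheme.SchemeSocketFactory` lives in Apache HttpClient 4.x, which crawler4j 3.x depends on, so this `NoClassDefFoundError` usually means the HttpClient jars are missing from the runtime classpath. A hedged classpath sketch (jar names and versions are illustrative; use whatever ships with your crawler4j distribution):

```shell
# HttpClient and HttpCore must sit on the classpath alongside crawler4j itself
java -cp crawler4j-3.5.jar:httpclient-4.2.jar:httpcore-4.2.jar:. MyController
```

On Windows the separator is `;` instead of `:`; with Maven or Gradle the dependency resolution handles this automatically.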
1 vote, 1 answer

Get link text of links when crawling a website using crawler4j

I am using crawler4j to crawl a website. When I visit a page, I would like to get the link text of all the links, not only the full URLs. Is this possible? Thanks in advance.
rustybeanstalk • 2,722 • 9 • 37 • 57
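Crawler4j does carry the anchor text of discovered links on its `WebURL` objects in recent versions (worth verifying for the version in use). As a library-free illustration of the underlying idea, a naive regex pass over the fetched HTML can pair each href with its link text (a stand-in for a real HTML parser, fine for a sketch but not production-grade):

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AnchorText {
    // Naive <a href="...">text</a> matcher; real pages need a proper parser
    private static final Pattern ANCHOR = Pattern.compile(
            "<a\\s[^>]*href=\"([^\"]*)\"[^>]*>(.*?)</a>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Map each href to its link text, with any nested tags stripped
    static Map<String, String> anchors(String html) {
        Map<String, String> out = new LinkedHashMap<>();
        Matcher m = ANCHOR.matcher(html);
        while (m.find()) {
            out.put(m.group(1), m.group(2).replaceAll("<[^>]*>", "").trim());
        }
        return out;
    }
}
```

Inside a crawler's `visit()`, the page's HTML is already available, so the same mapping can be built there and stored alongside the URLs.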
1 vote, 1 answer

How to send crawler4j data to CrawlerManager?

I'm working on a project where users can search some websites and look for pictures which have a unique identifier. public class ImageCrawler extends WebCrawler { private static final Pattern filters = Pattern.compile( …
Przemek • 27 • 4