Questions tagged [crawler4j]

Crawler4j is an open source Java crawler which provides a simple interface for crawling the Web.

Reference: https://github.com/yasserg/crawler4j

174 questions
2 votes, 0 answers

Crawler4j with authentication

I'm trying to run crawler4j against a personal Redmine instance for testing purposes. I want to authenticate and crawl several levels of depth in the application. I followed this tutorial from the crawler4j FAQ and created the next snippet: import…
Antonio J. • 1,750 • 3 • 21 • 33
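For form-based logins like Redmine's, the crawler4j FAQ approach registers the credentials on the crawl configuration before starting the controller. A configuration sketch, assuming a crawler4j 4.x-style API and Redmine's default login form field names (the URL, folder, and field names here are illustrative):

```java
import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.authentication.AuthInfo;
import edu.uci.ics.crawler4j.crawler.authentication.FormAuthInfo;

CrawlConfig config = new CrawlConfig();
config.setCrawlStorageFolder("/tmp/crawl");   // illustrative storage folder
config.setMaxDepthOfCrawling(3);              // crawl several levels deep

// Form login: "username"/"password" are assumed to match the login form's field names
AuthInfo authInfo = new FormAuthInfo(
        "myUser", "myPass",
        "https://redmine.example.com/login",
        "username", "password");
config.addAuthInfo(authInfo);
```

The controller built from this config should then perform the login before fetching seed pages; the exact class names can differ across crawler4j versions.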
2 votes, 1 answer

Crawler4j - NoSuchMethod getOutgoingUrls()

I am trying to set up crawler4j. I am building it from source in NetBeans. I am using version 3.5 of crawler4j, and the calling classes are the same as the ones given on the site, reproduced below for ease: public class MyCrawler extends…
Prakash • 75 • 6
2 votes, 1 answer

Is it possible to ignore Http Content-Length?

I am using Crawler4J to collect information about a website, but sometimes I get the following error: INFORMATION: Exception while fetching content for: {someurl} [Premature end of Content-Length delimited message body (expected: X; received:…
Hisushi • 67 • 1 • 11
2 votes, 1 answer

Crawler4j with mongoDB

I was researching crawler4j. I found that it uses BerkeleyDB as its database. I am developing a Grails app using mongoDB and was wondering how flexibly crawler4j would work within my application. I basically want to store the crawled…
clever_bassi • 2,392 • 2 • 24 • 43
2 votes, 1 answer

Crawler4j with Grails App throws error

This might be a very basic and silly question for experienced people, but please help. I am trying to use Crawler4j within my Grails app by following this tutorial. I know it's Java code, but I am using it in a controller class called…
clever_bassi • 2,392 • 2 • 24 • 43
2 votes, 2 answers

How to get the resource types from a webpage using JSoup?

I am trying to make a web crawler in Groovy. I am looking to extract the resource types from a webpage. I need to check if a particular webpage has the following resource types: PDFs, JMP files, SWF files, ZIP files, MP3 files, images, movie files, JSL…
clever_bassi • 2,392 • 2 • 24 • 43
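Whatever parser extracts the links, the heart of such a check is mapping link extensions to resource types. A minimal pure-Java sketch (the type names and extension table are illustrative, not from the question):

```java
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

public class ResourceTypes {
    // Illustrative extension -> resource-type table for the kinds the question lists
    private static final Map<String, String> TYPES = Map.of(
            "pdf", "PDF", "jmp", "JMP", "swf", "SWF", "zip", "ZIP",
            "mp3", "MP3", "jpg", "Image", "png", "Image", "mp4", "Movie");

    // Classify a URL by its file extension; query string and fragment are stripped first
    static Optional<String> typeOf(String url) {
        String path = url.split("[?#]", 2)[0];
        int dot = path.lastIndexOf('.');
        if (dot < 0) return Optional.empty();
        return Optional.ofNullable(TYPES.get(path.substring(dot + 1).toLowerCase(Locale.ROOT)));
    }
}
```

Running each outgoing link through `typeOf` gives the set of resource types the page references, without fetching the resources themselves.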
2 votes, 1 answer

Crawler4j - Many URLs are discarded / not processed (missing in output)

I am running crawler4j to find the status (HTTP response) code for one million URLs. I have not set any filters to filter out URLs to be processed. I get a proper response for 90% of the URLs, but 10% are missing from the output. They don't even appear in…
user1746666 • 161 • 1 • 2 • 9
2 votes, 1 answer

crawler4j get full parent list

I'm new to crawler4j. I crawled a website to a certain depth and found what I searched for. What I am trying to do now is to trace back my steps and find out how I got to this page. I need a list of the links that led me to the page where the content…
IDontKnow • 159 • 1 • 12
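Crawler4j only hands you the immediate referrer of a URL (via the parent URL on each discovered link), so the full chain has to be reconstructed by recording child-to-parent pairs during the crawl and walking them back afterwards. A pure-Java sketch of that walk, with the map contents being whatever your `visit()` recorded (names here are illustrative):

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ParentTrace {
    // child URL -> parent URL, filled in during the crawl
    static final Map<String, String> PARENT = new HashMap<>();

    // Walk parent pointers back to the seed; returns the seed-first path to the hit.
    // Assumes the recorded parent links form no cycles.
    static List<String> pathTo(String url) {
        Deque<String> path = new ArrayDeque<>();
        for (String u = url; u != null; u = PARENT.get(u)) {
            path.addFirst(u);
        }
        return List.copyOf(path);
    }
}
```

The seed has no entry in the map, so the loop stops there and the returned list reads seed → … → target page.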
2 votes, 2 answers

Crawling only Dynamic Data

I am trying to crawl the archives of a local newspaper and am getting the desired result. Is there any way for me to program the crawler so that static elements such as the Home button and the footers, which are the same on every page, are not…
2 votes, 3 answers

Kill threads created by an object

I have created a custom crawler using crawler4j. In my app, I create a lot of controllers, and after a while the number of threads in the system hits the maximum value and the JVM throws an exception. Even though I call ShutDown() on the…
Alireza Noori • 14,961 • 30 • 95 • 179
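Crawler4j's controller has its own shutdown call (plus `waitUntilFinish()` in the versions these questions use), but the general pattern behind "kill the threads an object created" is the same either way: the object owns its thread pool and tears it down deterministically. A library-free sketch of that pattern (class and pool size are illustrative):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class OwnedPool implements AutoCloseable {
    private final ExecutorService pool = Executors.newFixedThreadPool(4);

    public void submit(Runnable task) {
        pool.submit(task);
    }

    // Stop accepting work, wait briefly for in-flight tasks, then interrupt stragglers
    @Override
    public void close() throws InterruptedException {
        pool.shutdown();
        if (!pool.awaitTermination(5, TimeUnit.SECONDS)) {
            pool.shutdownNow();
        }
    }

    public boolean isTerminated() {
        return pool.isTerminated();
    }
}
```

Creating many controllers without closing each one leaks its worker threads, which matches the symptom in the question; closing each pool before creating the next keeps the thread count bounded.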
2 votes, 2 answers

Use crawler4j to download js files

I'm trying to use crawler4j to download some websites. The only problem I have is that even though I return true for all .js files in the shouldVisit function, they never get downloaded. @Override public boolean shouldVisit(WebURL url) { return…
Alireza Noori • 14,961 • 30 • 95 • 179
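Returning true from `shouldVisit` only controls which URLs get scheduled; in crawler4j, non-HTML content such as scripts is typically skipped unless binary-content crawling is enabled on the `CrawlConfig` as well (an assumption about the version in use, worth checking). The URL test itself is easy to isolate and verify in plain Java:

```java
import java.util.regex.Pattern;

public class JsFilter {
    // Matches .js URLs, with or without a query string
    private static final Pattern JS =
            Pattern.compile(".*\\.js(\\?.*)?$", Pattern.CASE_INSENSITIVE);

    // The predicate a shouldVisit(WebURL) override would delegate to
    static boolean isJs(String url) {
        return JS.matcher(url).matches();
    }
}
```

Keeping the predicate separate from the crawler subclass makes it trivial to unit-test without spinning up a crawl.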
2 votes, 2 answers

What html parser should I use?

I am working on a product where I need to parse an HTML document. I looked at Jericho, TagSoup, Jsoup and Crawl4J. Which parser should I use to parse HTML, given that I need to run this process in a multi-threaded environment using Quartz? If 10 threads at a time…
vaibought • 461 • 2 • 6 • 20
2 votes, 2 answers

Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/http/conn/scheme/SchemeSocketFactory while Using Crawler4j

I am using the Crawler4j example code, but I got an exception. Here is my exception: Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/http/conn/scheme/SchemeSocketFactory at…
Raghavender Reddy • 180 • 2 • 5 • 18
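`org.apache.http.conn.scheme.SchemeSocketFactory` lives in Apache HttpClient 4.x, which crawler4j 3.x depends on, so this `NoClassDefFoundError` usually means the HttpClient jars are missing from the runtime classpath. A hedged classpath sketch (jar names and versions are illustrative; use whatever ships with your crawler4j distribution):

```shell
# HttpClient and HttpCore must sit on the classpath alongside crawler4j itself
java -cp crawler4j-3.5.jar:httpclient-4.2.jar:httpcore-4.2.jar:. MyController
```

On Windows the separator is `;` instead of `:`; with Maven or Gradle the dependency resolution handles this automatically.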
1 vote, 1 answer

Get link text of links when crawling a website using crawler4j

I am using crawler4j to crawl a website. When I visit a page, I would like to get the link text of all the links, not only the full URLs. Is this possible? Thanks in advance.
rustybeanstalk • 2,722 • 9 • 37 • 57
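Crawler4j does carry the anchor text of discovered links on its `WebURL` objects in recent versions (worth verifying for the version in use). As a library-free illustration of the underlying idea, a naive regex pass over the fetched HTML can pair each href with its link text (a stand-in for a real HTML parser, fine for a sketch but not production-grade):

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AnchorText {
    // Naive <a href="...">text</a> matcher; real pages need a proper parser
    private static final Pattern ANCHOR = Pattern.compile(
            "<a\\s[^>]*href=\"([^\"]*)\"[^>]*>(.*?)</a>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Map each href to its link text, with any nested tags stripped
    static Map<String, String> anchors(String html) {
        Map<String, String> out = new LinkedHashMap<>();
        Matcher m = ANCHOR.matcher(html);
        while (m.find()) {
            out.put(m.group(1), m.group(2).replaceAll("<[^>]*>", "").trim());
        }
        return out;
    }
}
```

Inside a crawler's `visit()`, the page's HTML is already available, so the same mapping can be built there and stored alongside the URLs.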
1 vote, 1 answer

How to send crawler4j data to CrawlerManager?

I'm working on a project where users can search some websites and look for pictures which have a unique identifier. public class ImageCrawler extends WebCrawler { private static final Pattern filters = Pattern.compile( …
Przemek • 27 • 4