Questions tagged [crawler4j]

Crawler4j is an open source Java crawler which provides a simple interface for crawling the Web.

Reference: https://github.com/yasserg/crawler4j

174 questions
1 vote · 1 answer

How to resume crawling from the last depth I reached when I restart my crawler?

Hello everyone, I am making a web application that crawls lots of pages from a specific website. I started my crawler4j software with unlimited depth and pages, but it suddenly stopped because of an internet connection failure. Now I want to continue crawling…
Ahmed Sakr
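The usual fix, assuming crawler4j 4.x, is to enable resumable crawling (it must be on before the first run) so the frontier database on disk survives a restart; the storage folder and crawler class below are placeholders:

```java
import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;

public class ResumableController {
    public static void main(String[] args) throws Exception {
        CrawlConfig config = new CrawlConfig();
        config.setCrawlStorageFolder("/data/crawl"); // must be the SAME folder as the interrupted run
        config.setResumableCrawling(true);           // keep the frontier on disk so a restart continues it
        config.setMaxDepthOfCrawling(-1);            // -1 = unlimited depth
        config.setMaxPagesToFetch(-1);               // -1 = unlimited pages

        PageFetcher fetcher = new PageFetcher(config);
        RobotstxtServer robots = new RobotstxtServer(new RobotstxtConfig(), fetcher);
        CrawlController controller = new CrawlController(config, fetcher, robots);

        controller.addSeed("https://www.example.com/"); // placeholder seed
        controller.start(MyCrawler.class, 7);           // MyCrawler = your WebCrawler subclass
    }
}
```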
1 vote · 1 answer

Crawler4J seed URL gets encoded and error page is crawled instead of actual page

I am using Crawler4J to crawl user profiles on GitHub. For instance, I want to crawl the URL https://github.com/search?q=java+location:India&p=1. For now I am adding this hard-coded URL in my crawler controller like: String url =…
ravi katiyar
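One common workaround is to percent-encode the query value yourself before handing the seed to crawler4j, so later re-encoding cannot mangle it. A minimal sketch using only the JDK's `java.net.URLEncoder` (the helper name is made up for illustration):

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class SeedEncoding {
    // Build the GitHub search seed with the query value encoded up front.
    static String buildSeed(String query, int page) throws Exception {
        String q = URLEncoder.encode(query, StandardCharsets.UTF_8.name());
        return "https://github.com/search?q=" + q + "&p=" + page;
    }

    public static void main(String[] args) throws Exception {
        // URLEncoder turns the space into '+' and the ':' into %3A
        System.out.println(buildSeed("java location:India", 1));
        // → https://github.com/search?q=java+location%3AIndia&p=1
    }
}
```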
1 vote · 0 answers

Use Java to crawl and download an entire website, overriding HttpsURLConnection

I am looking to crawl an entire website and save it locally, offline. It should have two parts: Authentication: this needs to be implemented in Java, and I need to override the HttpsURLConnection logic to add a couple of lines of authentication (Hadoop) in…
Spartan
1 vote · 1 answer

Crawling and extracting info using crawler4j

I need help figuring out how to crawl this page: http://www.marinetraffic.com/en/ais/index/ports/all, go through each port, extract the name and coordinates, and write them to a file. The main class looks as follows: import…
Almanz
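The usual place to pull page text out is the crawler's `visit` callback; a minimal sketch, assuming crawler4j 4.x and that the page parsed as HTML (the extraction logic itself is left as a comment):

```java
import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.parser.HtmlParseData;

public class PortCrawler extends WebCrawler {
    @Override
    public void visit(Page page) {
        if (page.getParseData() instanceof HtmlParseData) {
            HtmlParseData html = (HtmlParseData) page.getParseData();
            String text = html.getText();  // plain text of the page
            // ... pull the port name and coordinates out of `text`
            //     (or html.getHtml()) and append them to a file here
        }
    }
}
```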
1 vote · 1 answer

How to adapt the URL that I want to crawl in crawler4j

I tried modifying the code of the crawler4j Quickstart example. I want to crawl the following…
evabb
1 vote · 1 answer

Crawler4j authentication not working

I'm trying to use the FormAuthInfo authentication from Crawler4J to crawl a specific LinkedIn page. This page can only be rendered when I am correctly logged in. This is my controller with the access URLs: public class Controller { public…
andreybleme
1 vote · 1 answer

crawler4j crawls only seed URLs

Why does the following code, built upon crawler4j, only crawl the given seed URLs and not start to crawl other links? public static void main( String[] args ) { String crawlStorageFolder = "F:\\crawl"; int numberOfCrawlers = 7; …
user1025852
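The usual culprit is a `shouldVisit()` that rejects everything except the seeds (or a `setMaxDepthOfCrawling(0)` in the config). A sketch of a `shouldVisit` that stays on one domain but still follows discovered links; the domain below is a placeholder, and the signature assumes crawler4j 4.x:

```java
import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.url.WebURL;
import java.util.regex.Pattern;

public class MyCrawler extends WebCrawler {
    private static final Pattern BINARY =
            Pattern.compile(".*\\.(css|js|gif|jpe?g|png|pdf|zip)$");

    @Override
    public boolean shouldVisit(Page referringPage, WebURL url) {
        String href = url.getURL().toLowerCase();
        // Follow any link on the seed's domain, skipping binary resources.
        return !BINARY.matcher(href).matches()
                && href.startsWith("https://www.example.com/");
    }
}
```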
1 vote · 1 answer

Transferring one object between classes using crawler4j

I have a simple web crawler that is built using the building blocks of crawler4j. I am trying to build a dictionary as my crawler crawls, and then pass it to my main (controller) as it builds and parses text. How can I do this, since my MyCrawler object…
drewfiss90
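One pattern crawler4j supports for this is `getMyLocalData()` / `getCrawlersLocalData()`: each crawler instance exposes its data, and the controller collects it after a blocking `start(...)` returns. A sketch (the dictionary type is an assumption):

```java
import java.util.HashMap;
import java.util.Map;

import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;

public class DictionaryCrawler extends WebCrawler {
    private final Map<String, Integer> dictionary = new HashMap<>();

    @Override
    public void visit(Page page) {
        // ... parse the page text and update `dictionary` ...
    }

    @Override
    public Object getMyLocalData() {
        return dictionary;  // handed back to the controller when crawling ends
    }
}

// In the controller, after a blocking start(...):
//   controller.start(DictionaryCrawler.class, numberOfCrawlers);
//   List<Object> perCrawler = controller.getCrawlersLocalData();
//   ... merge the per-crawler maps into one dictionary ...
```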
1 vote · 0 answers

Text extraction using Jsoup and word count

I am crawling websites using crawler4j. I am using jsoup to extract content and save it in a text file. Then I use OmegaT to find the number of words in those text files. The problem I am having is with text extraction. I am using the…
user3558596
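The word count itself does not need OmegaT: a whitespace split over the text that jsoup extracted is enough. A self-contained sketch (the helper name is made up for illustration):

```java
public class WordCount {
    // Count whitespace-separated tokens in extracted text.
    static long countWords(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return 0;
        }
        return trimmed.split("\\s+").length;
    }

    public static void main(String[] args) {
        System.out.println(countWords("  Crawler4j is an open   source Java crawler. "));
        // → 7
    }
}
```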
1 vote · 1 answer

Authentication with crawler4j

My goal is to log in to a site and then get my account information. I'm using crawler4j 4.2: AuthInfo authJavaForum = new FormAuthInfo("myuser", "mypwd", "http://www.java-forum.org", "login",…
divadpoc
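Form authentication is wired in through `CrawlConfig.addAuthInfo()`, and the last two constructor arguments must match the `name` attributes of the username and password `<input>` fields in the site's login form. A sketch assuming crawler4j 4.x; the `"password"` field name below is an assumption, not taken from the site:

```java
import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.authentication.AuthInfo;
import edu.uci.ics.crawler4j.crawler.authentication.FormAuthInfo;

public class AuthSetup {
    static CrawlConfig configureAuth() throws Exception {
        CrawlConfig config = new CrawlConfig();
        AuthInfo auth = new FormAuthInfo(
                "myuser", "mypwd",
                "http://www.java-forum.org", // page holding the login form
                "login",                      // username field's name attribute
                "password");                  // password field's name attribute (assumed)
        config.addAuthInfo(auth);
        return config;
    }
}
```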
1 vote · 1 answer

Crawler4j warning "invalid cookie header" is causing the crawler not to fetch that page

I am using crawler4j in a very amateur setting to crawl articles from a site (and boilerpipe for content scraping). On some sites the crawler works very neatly, but in other cases it just fails to fetch the website (though I can still…
d1xlord
1 vote · 1 answer

Web spider that is able to crawl Ajax-based websites

Right now I'm using Crawler4j and I'm pretty happy with it, but it cannot crawl Ajax-based websites. I used Selenium once for another approach, and it works fine combined with PhantomJS. So is there a way to plug Selenium into crawler4j? If…
Fabian Lurz
1 vote · 1 answer

crawler4j - I can't get the title

In short: I can’t get this URL’s title: http://www.namlihipermarketleri.com.tr/default.asp?git=9&urun=10277 (which is broken now (18-11-2015)). In my WebCrawler implementation: @Override public void visit(Page page) { …
Ismail Yavuz
1 vote · 1 answer

How do I write my own exception handling for Crawler4J?

I want my crawler to wait for 5 minutes if it gets a SocketConnectException (i.e. if the internet connection is down), then resume, and maybe also send a mail to an admin about this. I have looked at the source code, and the methods that throw this…
CoralReef
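As far as I know crawler4j exposes no hook for this, but the whole crawl (e.g. the controller's blocking `start(...)`) can be wrapped in a generic retry loop that sleeps and restarts on failure. A self-contained sketch (class and method names are made up for illustration):

```java
import java.util.concurrent.Callable;

public class RetryRunner {
    // Run `task`; on failure wait `waitMillis`, then retry, up to `maxAttempts` times.
    static <T> T runWithRetry(Callable<T> task, int maxAttempts, long waitMillis)
            throws Exception {
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return task.call();
            } catch (Exception e) {
                last = e;                 // e.g. a socket exception when the link is down
                // ... notify an admin by mail here if desired ...
                Thread.sleep(waitMillis); // e.g. 5 * 60 * 1000 for five minutes
            }
        }
        throw last; // all attempts failed
    }
}
```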
1 vote · 1 answer

Calling the controller (crawler4j 3.5) inside a loop

Hi, I am calling the controller inside a for-loop because I have more than 100 URLs. I have them all in a list, and I iterate over it to crawl each page. I also set that URL with setCustomData, because the crawl should not leave the domain. for…
Selva
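A single CrawlController accepts any number of seeds, so the loop can add every URL to one controller instead of building a controller per URL. A fragment sketch (the `loadUrls` helper is a placeholder):

```java
// `controller`, `MyCrawler` and `numberOfCrawlers` built once,
// as in the crawler4j quickstart.
List<String> urls = loadUrls();  // the 100+ URLs (placeholder helper)
for (String url : urls) {
    controller.addSeed(url);     // queue every seed on the same controller
}
controller.setCustomData(urls);  // e.g. a domain whitelist for shouldVisit
controller.start(MyCrawler.class, numberOfCrawlers);
```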