Questions tagged [crawler4j]

Crawler4j is an open source Java crawler which provides a simple interface for crawling the Web.

Reference: https://github.com/yasserg/crawler4j

174 questions
1 vote · 1 answer

How to resume crawling from the last depth I reached when I restart my crawler?

Hello everyone, I am making a web application that crawls lots of pages from a specific website. I started my crawler4j software with unlimited depth and pages, but it suddenly stopped because of an internet connection failure. Now I want to continue crawling…
Ahmed Sakr
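The usual fix, assuming crawler4j 4.x, is to enable resumable crawling (it must be on before the first run) so the frontier database on disk survives a restart; the storage folder and crawler class below are placeholders:

```java
import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;

public class ResumableController {
    public static void main(String[] args) throws Exception {
        CrawlConfig config = new CrawlConfig();
        config.setCrawlStorageFolder("/data/crawl"); // must be the SAME folder as the interrupted run
        config.setResumableCrawling(true);           // keep the frontier on disk so a restart continues it
        config.setMaxDepthOfCrawling(-1);            // -1 = unlimited depth
        config.setMaxPagesToFetch(-1);               // -1 = unlimited pages

        PageFetcher fetcher = new PageFetcher(config);
        RobotstxtServer robots = new RobotstxtServer(new RobotstxtConfig(), fetcher);
        CrawlController controller = new CrawlController(config, fetcher, robots);

        controller.addSeed("https://www.example.com/"); // placeholder seed
        controller.start(MyCrawler.class, 7);           // MyCrawler = your WebCrawler subclass
    }
}
```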
1 vote · 1 answer

Crawler4J seed URL gets encoded and error page is crawled instead of actual page

I am using Crawler4J to crawl user profiles on GitHub. For instance, I want to crawl the URL https://github.com/search?q=java+location:India&p=1. For now I am adding this hard-coded URL in my crawler controller like: String url =…
ravi katiyar
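One common workaround is to percent-encode the query value yourself before handing the seed to crawler4j, so later re-encoding cannot mangle it. A minimal sketch using only the JDK's `java.net.URLEncoder` (the helper name is made up for illustration):

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class SeedEncoding {
    // Build the GitHub search seed with the query value encoded up front.
    static String buildSeed(String query, int page) throws Exception {
        String q = URLEncoder.encode(query, StandardCharsets.UTF_8.name());
        return "https://github.com/search?q=" + q + "&p=" + page;
    }

    public static void main(String[] args) throws Exception {
        // URLEncoder turns the space into '+' and the ':' into %3A
        System.out.println(buildSeed("java location:India", 1));
        // → https://github.com/search?q=java+location%3AIndia&p=1
    }
}
```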
1 vote · 0 answers

Use Java to crawl and download an entire website, overriding HttpsURLConnection

I am looking to crawl an entire website and save it locally, offline. It should have two parts: Authentication: this needs to be implemented in Java, and I need to override the HttpsURLConnection logic to add a couple of lines of authentication (Hadoop) in…
Spartan
1 vote · 1 answer

Crawling and extracting info using crawler4j

I need help figuring out how to crawl this page: http://www.marinetraffic.com/en/ais/index/ports/all, go through each port, extract the name and coordinates, and write them to a file. The main class looks as follows: import…
Almanz
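The usual place to pull page text out is the crawler's `visit` callback; a minimal sketch, assuming crawler4j 4.x and that the page parsed as HTML (the extraction logic itself is left as a comment):

```java
import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.parser.HtmlParseData;

public class PortCrawler extends WebCrawler {
    @Override
    public void visit(Page page) {
        if (page.getParseData() instanceof HtmlParseData) {
            HtmlParseData html = (HtmlParseData) page.getParseData();
            String text = html.getText();  // plain text of the page
            // ... pull the port name and coordinates out of `text`
            //     (or html.getHtml()) and append them to a file here
        }
    }
}
```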
1 vote · 1 answer

How to adapt the URL that I want to crawl in crawler4j

I tried modifying the code of the crawler4j Quickstart example. I want to crawl the following…
evabb
1 vote · 1 answer

Crawler4j authentication not working

I'm trying to use the FormAuthInfo authentication from Crawler4J to crawl a specific LinkedIn page. This page can only be rendered when I am correctly logged in. This is my controller with the access URLs: public class Controller { public…
andreybleme
1 vote · 1 answer

crawler4j crawls only seed URLs

Why does the following code, built upon crawler4j, only crawl the given seed URLs and not start to crawl other links? public static void main( String[] args ) { String crawlStorageFolder = "F:\\crawl"; int numberOfCrawlers = 7; …
user1025852
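The usual culprit is a `shouldVisit()` that rejects everything except the seeds (or a `setMaxDepthOfCrawling(0)` in the config). A sketch of a `shouldVisit` that stays on one domain but still follows discovered links; the domain below is a placeholder, and the signature assumes crawler4j 4.x:

```java
import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.url.WebURL;
import java.util.regex.Pattern;

public class MyCrawler extends WebCrawler {
    private static final Pattern BINARY =
            Pattern.compile(".*\\.(css|js|gif|jpe?g|png|pdf|zip)$");

    @Override
    public boolean shouldVisit(Page referringPage, WebURL url) {
        String href = url.getURL().toLowerCase();
        // Follow any link on the seed's domain, skipping binary resources.
        return !BINARY.matcher(href).matches()
                && href.startsWith("https://www.example.com/");
    }
}
```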
1 vote · 1 answer

Transferring one object between classes using crawler4j

I have a simple web crawler that is built using the building blocks of crawler4j. I am trying to build a dictionary as my crawler crawls, and then pass it to my main (controller) as it builds and parses text. How can I do this, since my MyCrawler object…
drewfiss90
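One pattern crawler4j supports for this is `getMyLocalData()` / `getCrawlersLocalData()`: each crawler instance exposes its data, and the controller collects it after a blocking `start(...)` returns. A sketch (the dictionary type is an assumption):

```java
import java.util.HashMap;
import java.util.Map;

import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;

public class DictionaryCrawler extends WebCrawler {
    private final Map<String, Integer> dictionary = new HashMap<>();

    @Override
    public void visit(Page page) {
        // ... parse the page text and update `dictionary` ...
    }

    @Override
    public Object getMyLocalData() {
        return dictionary;  // handed back to the controller when crawling ends
    }
}

// In the controller, after a blocking start(...):
//   controller.start(DictionaryCrawler.class, numberOfCrawlers);
//   List<Object> perCrawler = controller.getCrawlersLocalData();
//   ... merge the per-crawler maps into one dictionary ...
```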
1 vote · 0 answers

Text extraction using Jsoup and word count

I am crawling websites using crawler4j. I am using jsoup to extract content and save it in a text file. Then I use OmegaT to find the number of words in those text files. The problem I am having is with text extraction. I am using the…
user3558596
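The word count itself does not need OmegaT: a whitespace split over the text that jsoup extracted is enough. A self-contained sketch (the helper name is made up for illustration):

```java
public class WordCount {
    // Count whitespace-separated tokens in extracted text.
    static long countWords(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return 0;
        }
        return trimmed.split("\\s+").length;
    }

    public static void main(String[] args) {
        System.out.println(countWords("  Crawler4j is an open   source Java crawler. "));
        // → 7
    }
}
```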
1 vote · 1 answer

Authentication with crawler4j

My goal is to log in to a site and then get my account information. I'm using crawler4j 4.2: AuthInfo authJavaForum = new FormAuthInfo("myuser", "mypwd", "http://www.java-forum.org", "login",…
divadpoc
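Form authentication is wired in through `CrawlConfig.addAuthInfo()`, and the last two constructor arguments must match the `name` attributes of the username and password `<input>` fields in the site's login form. A sketch assuming crawler4j 4.x; the `"password"` field name below is an assumption, not taken from the site:

```java
import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.authentication.AuthInfo;
import edu.uci.ics.crawler4j.crawler.authentication.FormAuthInfo;

public class AuthSetup {
    static CrawlConfig configureAuth() throws Exception {
        CrawlConfig config = new CrawlConfig();
        AuthInfo auth = new FormAuthInfo(
                "myuser", "mypwd",
                "http://www.java-forum.org", // page holding the login form
                "login",                      // username field's name attribute
                "password");                  // password field's name attribute (assumed)
        config.addAuthInfo(auth);
        return config;
    }
}
```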
1 vote · 1 answer

Crawler4j warning "invalid cookie header" is causing the crawler not to fetch that page

I am using crawler4j in a very amateur setting to crawl articles from a site (and boilerpipe for content scraping). On some sites the crawler works very neatly, but in other cases it just fails to fetch the website (though I can still…
d1xlord
1 vote · 1 answer

Web spider that is able to crawl Ajax-based websites

Right now I'm using Crawler4j and I'm pretty happy with it, but it cannot crawl Ajax-based websites. I used Selenium once for another approach, and it works fine combined with PhantomJS. So is there a way to plug Selenium into crawler4j? If…
Fabian Lurz
1 vote · 1 answer

crawler4j - I can't get the title

In short: I can’t get this URL’s title: http://www.namlihipermarketleri.com.tr/default.asp?git=9&urun=10277 (which is broken now (18-11-2015)). In my WebCrawler implementation: @Override public void visit(Page page) { …
Ismail Yavuz
1 vote · 1 answer

How do I write my own exception handling for Crawler4J?

I want my crawler to wait for 5 minutes if it gets a SocketConnectException (i.e. if the internet connection is down), then resume, and maybe also send a mail to an admin about this. I have looked at the source code, and the methods that throw this…
CoralReef
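As far as I know crawler4j exposes no hook for this, but the whole crawl (e.g. the controller's blocking `start(...)`) can be wrapped in a generic retry loop that sleeps and restarts on failure. A self-contained sketch (class and method names are made up for illustration):

```java
import java.util.concurrent.Callable;

public class RetryRunner {
    // Run `task`; on failure wait `waitMillis`, then retry, up to `maxAttempts` times.
    static <T> T runWithRetry(Callable<T> task, int maxAttempts, long waitMillis)
            throws Exception {
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return task.call();
            } catch (Exception e) {
                last = e;                 // e.g. a socket exception when the link is down
                // ... notify an admin by mail here if desired ...
                Thread.sleep(waitMillis); // e.g. 5 * 60 * 1000 for five minutes
            }
        }
        throw last; // all attempts failed
    }
}
```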
1 vote · 1 answer

Calling the controller (crawler4j 3.5) inside a loop

Hi, I am calling the controller inside a for-loop because I have more than 100 URLs. I have them all in a list, and I iterate over it to crawl each page. I also set that URL with setCustomData, because the crawl should not leave the domain. for…
Selva
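A single CrawlController accepts any number of seeds, so the loop can add every URL to one controller instead of building a controller per URL. A fragment sketch (the `loadUrls` helper is a placeholder):

```java
// `controller`, `MyCrawler` and `numberOfCrawlers` built once,
// as in the crawler4j quickstart.
List<String> urls = loadUrls();  // the 100+ URLs (placeholder helper)
for (String url : urls) {
    controller.addSeed(url);     // queue every seed on the same controller
}
controller.setCustomData(urls);  // e.g. a domain whitelist for shouldVisit
controller.start(MyCrawler.class, numberOfCrawlers);
```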