  • I need an open source web crawler written in Java with incremental crawling support.

  • The crawler should be easy to customize and to integrate with Solr or Elasticsearch.

  • It should be an actively developed project that keeps gaining new features.

  • Aperture is a good crawler with all the features I mentioned, but it is no longer actively developed, and I ruled it out because of the license of one of its dependencies (a problem if I use it for commercial purposes).

  • Nutch is a web crawler with more features, including Hadoop support. But after going through many websites and tutorials, I found no proper documentation or API for customizing it programmatically on Windows. I can edit the code in Eclipse, but running the MapReduce jobs causes many errors. Unlike Aperture, Nutch has no Java API that I can build on.

  • crawler4j is a good web crawler, but it has no incremental crawling feature, and I haven't checked it for license problems.

Is there another crawler that has all the features I mentioned, or is there a way to make one of the crawlers above meet my requirements?

Helpful answers will be greatly appreciated.

Kumar

1 Answer


Looks like a perfect match for Norconex HTTP Collector:

  • It is written 100% in Java.
  • It runs fully on Windows (without the need for Cygwin or a Linux/Unix VM).
  • It is well documented, with examples and a forum for asking questions and raising issues (GitHub).
  • It supports incremental crawling, detecting modified documents as well as deleted ones.
  • It supports both Solr and Elasticsearch, and more (through its "Committers").
  • It is extensively configurable and flexible. It is easy to integrate with and to extend with custom features, without having to learn a complex plugin mechanism (implement an interface, put it on the classpath, and voilà).
  • Its development is very active.
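To make the incremental crawling point more concrete, here is a minimal sketch of the general idea: keep a checksum per URL from the previous crawl, compare it on the next crawl, and treat URLs that were not seen again as deleted. This is not Norconex code and not its API; the class `IncrementalCrawlSketch`, its methods, and the use of SHA-256 are all assumptions made up for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical helper for illustration only; not part of any crawler's API. */
public class IncrementalCrawlSketch {

    public enum Status { NEW, MODIFIED, UNCHANGED }

    // URL -> checksum of the content seen during the previous crawl.
    private Map<String, String> previous = new HashMap<>();
    // URL -> checksum of the content seen during the current crawl.
    private Map<String, String> current = new HashMap<>();

    /** Records a fetched page and reports how it compares to the previous crawl. */
    public Status visit(String url, String content) {
        String checksum = sha256(content);
        current.put(url, checksum);
        String old = previous.get(url);
        if (old == null) {
            return Status.NEW;
        }
        return old.equals(checksum) ? Status.UNCHANGED : Status.MODIFIED;
    }

    /** Ends the current crawl and returns URLs seen last time but not this time. */
    public Set<String> finishCrawl() {
        Set<String> deleted = new HashSet<>(previous.keySet());
        deleted.removeAll(current.keySet());
        // The current crawl becomes the baseline for the next one.
        previous = current;
        current = new HashMap<>();
        return deleted;
    }

    // Hex-encoded SHA-256 of the page content (assumed checksum choice).
    private static String sha256(String text) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] hash = md.digest(text.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : hash) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 not available", e);
        }
    }
}
```

Only pages reported as NEW or MODIFIED would need to be re-sent to Solr or Elasticsearch, and the URLs returned by `finishCrawl()` would be removed from the index. A real crawler would also persist the checksums between runs and could use HTTP headers to skip unchanged pages earlier.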

It is maintained by Norconex, a company of enterprise search professionals. Issues are addressed quickly. Version 2.0.0 is under heavy development and will soon bring many new features (language detection, document splitting, etc.).

It is licensed under the GPL, but Norconex offers a commercial license if the GPL is a problem for you.

It also has many other features you did not list, like the ability to manipulate the document content before sending it to your search engine. It also supports sitemaps, robots rules, etc. I invite you to give it a try: http://www.norconex.com/product/collector-http/