0

I am working on a project where I need to crawl more than 10 TB of data and index it. I also need incremental crawling, so that later runs only process what has changed and take less time.

My question is: which tool do large organizations typically use for this alongside Java?

I have been trying Solr with Apache ManifoldCF, but there is very little ManifoldCF documentation available online.

Malte Hartwig
Shashank Raj

2 Answers

1

For any crawling work in Java, it is best to go with the open-source jsoup library and the SolrJ API. Both have clear, easy-to-understand documentation.

jsoup is a Java library for working with real-world HTML. It provides a very convenient API for extracting and manipulating data, using the best of DOM, CSS, and jQuery-like methods.

SolrJ is an API that makes it easy for Java applications to talk to Solr. SolrJ hides a lot of the details of connecting to Solr and allows your application to interact with Solr with simple high-level methods.

For more options, you can also try Elasticsearch with its Java API.

Harisudhan. A
  • I am not going to parse HTML. I need to crawl an NTFS-based file system on either Windows or Linux. I have a working solution, but I feel I should follow industry standards, and I also need effective incremental crawling on a cluster, since there is a lot of data to crawl. This is where ManifoldCF comes into the picture, but it does not seem efficient enough. – Shashank Raj Dec 01 '17 at 10:00
  • That answer is not even close to what I asked. – Shashank Raj Dec 01 '17 at 10:37
0

We ended up using SolrJ (Java) and Apache ManifoldCF. Although there was little to no documentation for ManifoldCF, we subscribed to the newsletter and asked the developers questions, and they responded quickly. However, I would not recommend this setup to anyone, as Apache ManifoldCF is outdated and poorly built. It is better to look for alternatives. Hope this helps somebody.
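
For anyone looking at alternatives, the incremental part of a file system crawl can be sketched in plain Java without any framework. This is only a rough illustration and not what we ran: the class name `IncrementalCrawler` and its in-memory map are hypothetical, and it assumes that a changed last-modified time is a good enough signal that a file needs re-indexing.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Stream;

// Hypothetical helper for illustration; not part of Solr or ManifoldCF.
class IncrementalCrawler {
    // Path -> last-modified time (epoch millis) recorded by the previous crawl.
    // Assumption: kept in memory here; a real crawler would persist it between runs.
    private final Map<Path, Long> seen = new HashMap<>();

    /** Walks the tree under root and returns files that are new or modified since the last call. */
    List<Path> crawl(Path root) throws IOException {
        List<Path> changed = new ArrayList<>();
        try (Stream<Path> files = Files.walk(root)) {
            files.filter(Files::isRegularFile).forEach(p -> {
                try {
                    long mtime = Files.getLastModifiedTime(p).toMillis();
                    // put() returns the previously stored value, or null for a new file.
                    Long previous = seen.put(p, mtime);
                    if (previous == null || previous != mtime) {
                        changed.add(p); // only these files need to be sent to the indexer
                    }
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
        }
        return changed;
    }
}
```

The first call to `crawl` returns every file, and later calls return only files whose timestamp differs from the recorded one. The sketch does not detect deleted files and runs on a single machine, so the clustering side would still need a separate solution.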

Shashank Raj