I know how to build a simple inverted index on a single machine. In short, it is a standard hash table kept in memory where:

  - key: a word
  - value: a list of the word's locations

As an example, the code is here: http://rosettacode.org/wiki/Inverted_Index#Java
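For reference, the single-machine structure described above can be sketched roughly as follows. This is a minimal illustration, not the Rosetta Code version: the names `SimpleInvertedIndex`, `Location`, `add` and `lookup` are hypothetical, and tokenization is assumed to be a plain split on non-word characters.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical single-machine sketch: word -> list of locations.
final class SimpleInvertedIndex {

    /** One occurrence of a word: the document and the word's position in it. */
    static final class Location {
        final String docId;
        final int position;

        Location(String docId, int position) {
            this.docId = docId;
            this.position = position;
        }
    }

    private final Map<String, List<Location>> index = new HashMap<>();

    /** Tokenizes the text (assumption: split on non-word characters) and records each word. */
    void add(String docId, String text) {
        int position = 0;
        for (String word : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (word.isEmpty()) {
                continue; // split() yields an empty first token when the text starts with a separator
            }
            index.computeIfAbsent(word, k -> new ArrayList<>()).add(new Location(docId, position));
            position++;
        }
    }

    /** Returns every recorded location of the word, or an empty list if it was never seen. */
    List<Location> lookup(String word) {
        return index.getOrDefault(word.toLowerCase(Locale.ROOT), Collections.<Location>emptyList());
    }
}
```

Calling `add("doc1", "It is what it is")` and then `lookup("what")` would return a single location in `doc1`.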

Question:

Now I'm trying to distribute it across n nodes, and in turn:

  1. Make this index horizontally scalable
  2. Apply automatic sharding to this index.

I'm especially interested in automatic sharding. Any ideas or links are welcome!

Thanks.

Ivan Voroshilin

1 Answer

Sharding by itself is quite a complex task that is not completely solved in modern databases. Typical problems in distributed databases are the trade-offs described by the CAP theorem, along with other low-level and quite challenging tasks, such as rebalancing cluster data after adding a new blank node or after an imbalance naturally occurs in the data.

The best data distribution I've seen implemented in a database is Cassandra's. However, full-text search is not yet implemented in Cassandra, so you might consider building your distributed index on top of it.

Some other, already implemented options are Elasticsearch and SolrCloud. One important detail is missing from the example given: word stemming. With stemming, a search matches any form of a word, such as "sing", "sings", and "singer". Lucene and the two solutions above implement it for the majority of languages.
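To make the stemming idea concrete, here is a deliberately naive sketch that reduces a few English word forms to a common key before they go into the index. It is a toy suffix stripper written for illustration, not the algorithm Lucene uses; the class name `ToyStemmer`, the suffix list and the minimum stem length are all assumptions.

```java
import java.util.Locale;

// Toy suffix stripper (hypothetical): NOT a real stemming algorithm such as Porter's.
final class ToyStemmer {
    // Assumed suffix list, checked in this order.
    private static final String[] SUFFIXES = {"ing", "er", "s"};
    // Assumed lower bound so that short words like "sing" are not cut down to "s".
    private static final int MIN_STEM_LENGTH = 3;

    /** Lowercases the word and strips the first matching suffix, if enough of the word remains. */
    static String stem(String word) {
        String w = word.toLowerCase(Locale.ROOT);
        for (String suffix : SUFFIXES) {
            if (w.endsWith(suffix) && w.length() - suffix.length() >= MIN_STEM_LENGTH) {
                return w.substring(0, w.length() - suffix.length());
            }
        }
        return w;
    }
}
```

With this sketch, "sing", "sings" and "singer" all map to the key "sing", so a query for any of them would hit the same index entry. A real deployment would rely on the analyzers that ship with Lucene, Elasticsearch or Solr rather than on rules like these.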

Andrey Chaschev
  • Thanks for the answer. I found a comparison of Cassandra with some other DBs and came across the consistent hashing algorithm again at: http://www.quora.com/How-would-you-compare-and-contrast-MySQL-sharding-vs-Cassandra-vs-MongoDB I had actually researched this algorithm earlier and described it in my blog: http://ivoroshilin.com/2013/07/15/distributed-caching-under-consistent-hashing/ I think it can be applied to an inverted index too. I guess the keys of the inverted index can be passed into a consistent hash data structure. – Ivan Voroshilin Sep 25 '13 at 09:00
  • Correction: I'd also like to hear about other approaches, and about the pitfalls of combining an inverted index with consistent hashing – Ivan Voroshilin Sep 25 '13 at 09:24
  • In its simplest form, a NoSQL DB with hash-based keys can be considered a huge HashMap. So yes, you can simply pass your inverted index keys to the DB. – Andrey Chaschev Sep 25 '13 at 09:54
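
The idea from the comments, passing inverted index keys through a consistent hash, could look roughly like the sketch below. It is a minimal illustration under stated assumptions, not how Cassandra or any other system implements it: `TermRing`, `addNode`, `removeNode` and `nodeFor` are hypothetical names, MD5 is an arbitrary choice of hash, and each physical node is assumed to be placed on the ring at several virtual points.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical consistent-hash ring that decides which node stores a term's entry.
final class TermRing {
    private final TreeMap<Long, String> ring = new TreeMap<>();
    private final int virtualNodes; // assumed number of ring points per physical node

    TermRing(int virtualNodes) {
        this.virtualNodes = virtualNodes;
    }

    /** Places the node on the ring at virtualNodes pseudo-random points. */
    void addNode(String node) {
        for (int i = 0; i < virtualNodes; i++) {
            ring.put(hash(node + "#" + i), node);
        }
    }

    /** Removes the node's points; terms it owned fall to the next point clockwise. */
    void removeNode(String node) {
        for (int i = 0; i < virtualNodes; i++) {
            ring.remove(hash(node + "#" + i), node); // only removes points still mapped to this node
        }
    }

    /** Returns the node owning the term: the first ring point at or after the term's hash. */
    String nodeFor(String term) {
        if (ring.isEmpty()) {
            throw new IllegalStateException("no nodes on the ring");
        }
        Map.Entry<Long, String> entry = ring.ceilingEntry(hash(term));
        return (entry != null ? entry : ring.firstEntry()).getValue(); // wrap around the ring
    }

    /** Uses the first 8 bytes of an MD5 digest as a 64-bit ring position (assumed hash choice). */
    static long hash(String s) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5").digest(s.getBytes(StandardCharsets.UTF_8));
            long h = 0;
            for (int i = 0; i < 8; i++) {
                h = (h << 8) | (digest[i] & 0xFF);
            }
            return h;
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // MD5 is required on every Java platform
        }
    }
}
```

Adding a node to this ring only moves the terms that now hash to the new node's points, which is the property that makes automatic rebalancing cheaper than with `hash(term) % n`. One likely pitfall of partitioning by term is that a very frequent word puts its whole location list, and all queries for it, on a single node, and a multi-word query has to contact several nodes.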