
We run Solr on an Amazon Web Services EC2 instance with a 1TB EBS volume to store the index so that we can easily launch additional servers with the same (read-only) index. However, our index is soon going to exceed 1TB, and I don't really want to deal with striping multiple EBS volumes to hold the index. Also, regenerating the index is very slow. I would like to move the index generation--and maybe hosting--to Hadoop, and preferably to Amazon's Elastic MapReduce, although I can set up separate Hadoop servers if need be. We use RightScale, so their library of ServerTemplates is available to us.

What would be the best place to get started using Lucene/Solr on Hadoop?

Joe Emison
  • Have you taken a look at Katta (http://katta.sourceforge.net/)? It provides the means to shard and distribute Lucene indices. – Brent Worden Jun 02 '11 at 13:41
  • I would really like my index creation to be sped up, not just delivery. It looks like Katta would help with delivery, but not with creation? – Joe Emison Jun 03 '11 at 01:58

2 Answers


Take a look at ElasticSearch. You can index into ElasticSearch from Hadoop for bulk loading. Infochimps has open-sourced an ElasticSearch bulk indexer called Wonderdog that you can study as a proof of concept.

https://github.com/infochimps/wonderdog
http://www.elasticsearch.com

It's cloud-friendly (see the cloud-aws plugin for node discovery), and it can scale up or down by adding or removing nodes that hold the index.
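Whatever tool drives the bulk load (Wonderdog or your own Hadoop job), the payload it ultimately sends is ElasticSearch's `_bulk` wire format: an action line followed by a document-source line, one pair per document, newline-delimited. A minimal sketch of building that body in Python (the index name, type, and documents here are hypothetical, not from the answer):

```python
import json

def build_bulk_body(index, doc_type, docs):
    """Build the newline-delimited JSON body expected by
    ElasticSearch's _bulk endpoint: for each document, an
    action line naming the index/type/id, then the source."""
    lines = []
    for doc_id, source in docs:
        lines.append(json.dumps({"index": {"_index": index, "_type": doc_type, "_id": doc_id}}))
        lines.append(json.dumps(source))
    return "\n".join(lines) + "\n"  # the bulk body must end with a newline

docs = [("1", {"title": "hello"}), ("2", {"title": "world"})]
body = build_bulk_body("articles", "doc", docs)
# POST this body to http://<es-host>:9200/_bulk with any HTTP client
```

Batching many such bodies from Hadoop reducers is essentially what a bulk indexer does; the format above is what every client ends up emitting on the wire.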

Jeremy Carroll

Is your index sharded? If not, you could shard it and distribute the shards across several instances.
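With manual sharding, Solr's distributed search works by adding a `shards` request parameter listing every shard host; the node you query fans the request out and merges the results. A small sketch of building such a query URL (the hostnames and query are hypothetical, only the `shards` parameter itself is Solr's real mechanism):

```python
from urllib.parse import urlencode

def sharded_query_url(coordinator, shards, params):
    """Build a distributed-search URL for a manually sharded Solr
    setup: 'shards' is a comma-separated list of host:port/core
    entries (no scheme) that the coordinator queries and merges."""
    query = dict(params)
    query["shards"] = ",".join(shards)
    return "http://%s/select?%s" % (coordinator, urlencode(query))

url = sharded_query_url(
    "solr1.example.com:8983/solr",  # hypothetical hosts
    ["solr1.example.com:8983/solr", "solr2.example.com:8983/solr"],
    {"q": "title:hadoop", "rows": "10"},
)
```

Each shard then only has to fit on one EBS volume, which sidesteps the 1TB striping problem from the question.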

D_K