
We run Solr on an Amazon Web Services EC2 instance with a 1TB EBS volume to store the index so that we can easily launch additional servers with the same (read-only) index. However, our index is soon going to exceed 1TB, and I don't really want to deal with striping multiple EBS volumes to hold the index. Also, regenerating the index is very slow. I would like to move the index generation--and maybe hosting--to Hadoop, and preferably to Amazon's Elastic MapReduce, although I can set up separate Hadoop servers if need be. We use RightScale, so their library of ServerTemplates is available to us.

What would be the best place to get started using Lucene/Solr on Hadoop?

Joe Emison
  • Have you taken a look at Katta (http://katta.sourceforge.net/)? It provides the means to shard and distribute Lucene indices. – Brent Worden Jun 02 '11 at 13:41
  • I would really like my index creation to be sped up, not just delivery. It looks like Katta would help with delivery, but not with creation? – Joe Emison Jun 03 '11 at 01:58

2 Answers


Take a look at ElasticSearch. You can index into ElasticSearch from Hadoop for bulk loading. Infochimps has open-sourced an ElasticSearch bulk indexer called Wonderdog that you can study as a proof of concept.

https://github.com/infochimps/wonderdog
http://www.elasticsearch.com

It's cloud-friendly (see the cloud-aws plugin for node discovery), and it can scale up or down by adding or removing nodes that hold the index.
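Whatever tool drives the bulk load (Wonderdog or your own Hadoop job), the payload it ultimately sends is ElasticSearch's `_bulk` wire format: an action line followed by a document-source line, one pair per document, newline-delimited. A minimal sketch of building that body in Python (the index name, type, and documents here are hypothetical, not from the answer):

```python
import json

def build_bulk_body(index, doc_type, docs):
    """Build the newline-delimited JSON body expected by
    ElasticSearch's _bulk endpoint: for each document, an
    action line naming the index/type/id, then the source."""
    lines = []
    for doc_id, source in docs:
        lines.append(json.dumps({"index": {"_index": index, "_type": doc_type, "_id": doc_id}}))
        lines.append(json.dumps(source))
    return "\n".join(lines) + "\n"  # the bulk body must end with a newline

docs = [("1", {"title": "hello"}), ("2", {"title": "world"})]
body = build_bulk_body("articles", "doc", docs)
# POST this body to http://<es-host>:9200/_bulk with any HTTP client
```

Batching many such bodies from Hadoop reducers is essentially what a bulk indexer does; the format above is what every client ends up emitting on the wire.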

Jeremy Carroll

Is your index sharded? If not, you could shard it and distribute the shards across several instances.
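With manual sharding, Solr's distributed search works by adding a `shards` request parameter listing every shard host; the node you query fans the request out and merges the results. A small sketch of building such a query URL (the hostnames and query are hypothetical, only the `shards` parameter itself is Solr's real mechanism):

```python
from urllib.parse import urlencode

def sharded_query_url(coordinator, shards, params):
    """Build a distributed-search URL for a manually sharded Solr
    setup: 'shards' is a comma-separated list of host:port/core
    entries (no scheme) that the coordinator queries and merges."""
    query = dict(params)
    query["shards"] = ",".join(shards)
    return "http://%s/select?%s" % (coordinator, urlencode(query))

url = sharded_query_url(
    "solr1.example.com:8983/solr",  # hypothetical hosts
    ["solr1.example.com:8983/solr", "solr2.example.com:8983/solr"],
    {"q": "title:hadoop", "rows": "10"},
)
```

Each shard then only has to fit on one EBS volume, which sidesteps the 1TB striping problem from the question.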

D_K