
Spoiler: this is just another Lucene vs. Sphinx vs. whatever question. I saw that all the other threads were almost two years old, so I decided to start a fresh one.

Here are the requirements:

data size: max 10 GB
rows: on the order of a billion
indexing should be fast
searching should be under 0 ms [ OK, joke... laugh... but keep it as low as possible ]

In today's world, which tool should I pick, and how do I go about it?

edit: I did some timing with Lucene: indexing 1.8 GB of data took 5 minutes. Searching is pretty fast, except for wildcard queries: a* takes 400-500 ms. My biggest worry is indexing, which takes a long time and a lot of resources!
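
For reference, this is roughly how I'm timing the a* search. It's only a sketch against the Lucene 3.x API of that era; the index path and field name are made up:

    import java.io.File;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.PrefixQuery;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.store.FSDirectory;

    public class SearchTiming {
        public static void main(String[] args) throws Exception {
            // Open the existing on-disk index ("/tmp/index" is a placeholder path)
            IndexSearcher searcher = new IndexSearcher(
                    IndexReader.open(FSDirectory.open(new File("/tmp/index"))));

            // "a*" is a prefix query; it expands to every term starting with "a"
            PrefixQuery query = new PrefixQuery(new Term("body", "a"));

            long start = System.currentTimeMillis();
            TopDocs hits = searcher.search(query, 10);
            System.out.println(hits.totalHits + " hits in "
                    + (System.currentTimeMillis() - start) + " ms");
            searcher.close();
        }
    }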

Shrinath
  • You only have to index new, updated, and deleted data, not the whole collection every time. – ajreal Feb 23 '11 at 14:20
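
A sketch of the incremental approach ajreal describes, assuming the Lucene 3.x API and a unique "id" field (field names, values, and the index path are all made up):

    import java.io.File;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;

    public class IncrementalIndex {
        public static void main(String[] args) throws Exception {
            IndexWriter writer = new IndexWriter(
                    FSDirectory.open(new File("/tmp/index")), // placeholder path
                    new IndexWriterConfig(Version.LUCENE_36,
                            new StandardAnalyzer(Version.LUCENE_36)));

            // Updated row: replace the old version, keyed by the unique "id" field
            Document doc = new Document();
            doc.add(new Field("id", "42", Field.Store.YES, Field.Index.NOT_ANALYZED));
            doc.add(new Field("body", "new text", Field.Store.NO, Field.Index.ANALYZED));
            writer.updateDocument(new Term("id", "42"), doc); // atomic delete-then-add

            // Deleted row: drop it from the index
            writer.deleteDocuments(new Term("id", "43"));

            writer.commit();
            writer.close();
        }
    }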

3 Answers


I have no experience with anything other than Lucene - it's pretty much the default indexing solution, so I don't think you can go too wrong.

10 GB is not a lot of data. You'll be able to re-index it pretty rapidly, or keep it on SSDs for extra speed. And of course you can keep your whole index in RAM (which Lucene supports) for super-fast lookups.
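
A minimal sketch of the in-RAM setup with the Lucene 3.x API (the index path is made up): RAMDirectory copies the on-disk index into memory, so every read afterwards is memory-speed.

    import java.io.File;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.store.RAMDirectory;

    public class RamSearch {
        public static void main(String[] args) throws Exception {
            // Load the whole on-disk index into RAM ("/tmp/index" is a placeholder)
            RAMDirectory ram = new RAMDirectory(FSDirectory.open(new File("/tmp/index")));
            IndexSearcher searcher = new IndexSearcher(IndexReader.open(ram));
            // ... run queries as usual against "searcher" ...
            searcher.close();
        }
    }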

Richard H
  • I'm going to keep everything in the cloud, so I don't see anyone offering SSD-like speeds there :( And btw, I can't keep the whole data in RAM for the app I'm working on... it'd be like 1000 GB of unique data per machine, so it can't all be brought into memory... – Shrinath Feb 23 '11 at 14:05
  • OK - well, the SSDs will only make a difference when building the index. But I'm confused - you said max data size 10 GB, not 1000? – Richard H Feb 23 '11 at 14:10
  • Lol :D True, not 1000 GB :) It's only 10 GB... check the edits now :) – Shrinath Feb 23 '11 at 14:14
  • Well, it's not that simple, for reasons I didn't specify in the post... there are going to be multiple indexes of 10 GB each, and multiple searchers running against each different index... how does this work then? That was my point... sorry for the confusion; if it were only 10 GB, you'd be 100% right... – Shrinath Feb 23 '11 at 17:09

Please check the Lucene wiki for tips on improving Lucene indexing speed. It is quite succinct. In general, Lucene is quite fast (it is used for real-time search). The tips will help you figure out whether you are missing something "obvious."
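
A sketch of two of the most common tips from that wiki page, against the Lucene 3.x API: give the writer a bigger RAM buffer so it flushes less often, and reuse Document/Field instances instead of allocating them per row. The buffer size, path, and documents here are illustrative, not recommendations.

    import java.io.File;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;

    public class FastIndexing {
        public static void main(String[] args) throws Exception {
            IndexWriterConfig config = new IndexWriterConfig(
                    Version.LUCENE_36, new StandardAnalyzer(Version.LUCENE_36));
            config.setRAMBufferSizeMB(256.0); // flush far less often than the 16 MB default
            IndexWriter writer = new IndexWriter(
                    FSDirectory.open(new File("/tmp/index")), config); // placeholder path

            // Reuse one Document and Field across all rows instead of allocating per row
            Document doc = new Document();
            Field body = new Field("body", "", Field.Store.NO, Field.Index.ANALYZED);
            doc.add(body);
            for (String text : new String[] {"first doc", "second doc"}) {
                body.setValue(text);
                writer.addDocument(doc);
            }
            writer.close();
        }
    }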

Shashikant Kore
  • I've done everything "obvious" by now :) I just wanted to know if "this" IS the way to go :) And btw, is the indexing time all right? Is 5 minutes for 1.8 GB reasonable? – Shrinath Feb 23 '11 at 17:30
  • Size is a somewhat inaccurate metric. Indexing 1.8 GB of plain text is different from indexing 1.8 GB of HTML (where you parse it and index the extracted text). You need to see whether that is "fast enough" for your needs. If the current indexing speed falls short of your expectations, you may wish to explore how to use Lucene in a real-time environment. That is non-trivial. – Shashikant Kore Feb 23 '11 at 17:59
  • @Shrinath - your indexing speed is limited by how fast you can read off disk, and by how much the data needs to be processed before index insertion. – Richard H Feb 24 '11 at 11:28
  • @Richard: Agreed... there is some string manipulation done before inserting, which adds to the time too... I will try to reduce the manipulation, but I just wanted to be sure whether there is a way to speed Lucene up further... – Shrinath Feb 24 '11 at 12:44

My biggest worry is indexing, which takes a long time and a lot of resources!

Take a look at LuSql; we used it once. FWIW, indexing 100 GB of data from MySQL on a decent machine, with the index on an ordinary filesystem (NTFS), took a little more than an hour.

If you add an SSD or some other ultra-fast disk technology, you can bring that down considerably.
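
I don't remember the exact LuSql invocation, but conceptually it automates a JDBC-to-Lucene loop like the one below (Lucene 3.x API; the connection string, table, and column names are made up):

    import java.io.File;
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;

    public class SqlToLucene {
        public static void main(String[] args) throws Exception {
            Connection conn = DriverManager.getConnection(
                    "jdbc:mysql://localhost/mydb", "user", "pass"); // placeholder DSN
            IndexWriter writer = new IndexWriter(
                    FSDirectory.open(new File("/tmp/index")), // placeholder path
                    new IndexWriterConfig(Version.LUCENE_36,
                            new StandardAnalyzer(Version.LUCENE_36)));

            // Stream rows out of MySQL and turn each one into a Lucene document
            Statement stmt = conn.createStatement();
            ResultSet rs = stmt.executeQuery("SELECT id, body FROM docs"); // placeholder table
            while (rs.next()) {
                Document doc = new Document();
                doc.add(new Field("id", rs.getString("id"),
                        Field.Store.YES, Field.Index.NOT_ANALYZED));
                doc.add(new Field("body", rs.getString("body"),
                        Field.Store.NO, Field.Index.ANALYZED));
                writer.addDocument(doc);
            }
            writer.close();
            conn.close();
        }
    }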

Narayan