I am trying to create a web interface for searching through a large number of huge configuration files (approximately 60,000 files, each between 20 KB and 50 MB in size). These files are also updated frequently (~3 times/day).
Requirements:
- Concurrency
- Must identify the line numbers for each matching line
- Good update performance
What I have looked into:
- Lucene: To identify a line number, each line must be stored as a separate Lucene document containing two fields (the line number and the line text). This makes updates hard/slow.
- Solr (built on Lucene) and Sphinx: Both have the same problem and do not allow for identifying the line number.
- SQL table with a fulltext index: Again, no way to show the line number.
- SQL table with each line in a separate row: Tested this with SQLite and MySQL; update performance was the worst of all options. Updating a 50 MB document took more than an hour.
- eXist-db: We converted each text file to XML like this: `<xml><line number="1">test</line>...</xml>` (see the conversion sketch after this list). Updates take ~5 minutes, which somewhat works, but we are still not happy with it.
- Whoosh for Python: Pretty much like Lucene. I have implemented a prototype that sort of works by dropping and re-importing all lines of a given file (sketched after this list); updating a 50 MB document takes about 2-3 minutes with this method.
- GNU id utils: Suggested by sarnold, this is blazingly fast (a 50 MB document is updated in less than 10 seconds on my test machine) and would be perfect if it had pagination and an API (a rough wrapper idea is sketched after this list).
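
For concreteness, the eXist-db conversion amounts to something like this (a minimal sketch in Python; the function and file handling are illustrative, and the line content has to be XML-escaped):

```python
from xml.sax.saxutils import escape

def text_to_xml(src_path, dst_path):
    """Wrap every line of a plain-text file in a numbered <line> element."""
    with open(src_path, errors="replace") as src, open(dst_path, "w") as dst:
        dst.write("<xml>")
        for n, line in enumerate(src, start=1):
            dst.write('<line number="%d">%s</line>' % (n, escape(line.rstrip("\n"))))
        dst.write("</xml>")
```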
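The Whoosh prototype is essentially the following (a simplified sketch; the index directory and field names are just what I picked). Every line becomes its own document carrying the file path, line number, and text, and updating a file means deleting all of its line documents and re-adding them, which is why it is slow:

```python
import os
from whoosh import index
from whoosh.fields import Schema, ID, NUMERIC, TEXT
from whoosh.qparser import QueryParser

# One document per line: which file it came from, its line number, and the text.
schema = Schema(
    path=ID(stored=True),          # file the line belongs to
    line_no=NUMERIC(stored=True),  # line number within that file
    content=TEXT(stored=True),     # the line itself
)

if index.exists_in("indexdir"):
    ix = index.open_dir("indexdir")
else:
    os.makedirs("indexdir", exist_ok=True)
    ix = index.create_in("indexdir", schema)

def reindex_file(path):
    """Drop all line documents of a file and re-import them (the slow part)."""
    writer = ix.writer()
    writer.delete_by_term("path", path)
    with open(path, errors="replace") as f:
        for n, line in enumerate(f, start=1):
            writer.add_document(path=path, line_no=n, content=line.rstrip("\n"))
    writer.commit()

def search(text, limit=20):
    """Return (path, line number, line) tuples for lines matching the query."""
    with ix.searcher() as searcher:
        query = QueryParser("content", ix.schema).parse(text)
        return [(hit["path"], hit["line_no"], hit["content"])
                for hit in searcher.search(query, limit=limit)]
```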
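For GNU id utils, the only way I see to get an API is to shell out to it. A rough sketch of what a wrapper could look like (this assumes gid prints grep-style `file:line:text` output; the exact invocation and the database rebuild step would need checking):

```python
import subprocess

def gid_search(pattern, database_dir):
    """Run GNU id-utils' gid in the directory holding the ID database and
    parse its output, assumed here to be grep-style "file:line:text"."""
    result = subprocess.run(
        ["gid", pattern],
        cwd=database_dir,
        capture_output=True,
        text=True,
    )
    matches = []
    for row in result.stdout.splitlines():
        path, line_no, text = row.split(":", 2)
        matches.append((path, int(line_no), text))
    return matches

# The database itself is rebuilt by re-running mkid over the files;
# the exact invocation depends on the setup.
```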
How would you implement an alternative?