I am trying to create a web interface for searching through a large number of huge configuration files (approximately 60,000 files, each between 20 KB and 50 MB in size). These files are also updated frequently (~3 times/day).
Requirements:
- Concurrency
- Must identify the line numbers for each matching line
- Good update performance
What I have looked into:
- Lucene: To identify a line number, each line must be stored as a separate Lucene document containing two fields (the line number and the line text). This makes updates hard/slow.
- Solr (built on Lucene) and Sphinx: Both have the same problem and do not allow for identifying the line number.
- SQL table with a fulltext index: Again, no way to show the line number.
- SQL table with each line in a separate row: Tested this with SQLite and MySQL; update performance was the worst of all options. Updating a 50 MB document took more than an hour.
- eXist-db: We converted each text file to XML like this: `<xml><line number="1">test</line>...</xml>` (see the conversion sketch after this list). Updates take ~5 minutes, which somewhat works, but we are still not happy with it.
- Whoosh for Python: Pretty much like Lucene. I have implemented a prototype that sort of works by dropping and re-importing all lines of a given file (sketched after this list); updating a 50 MB document takes about 2-3 minutes with this method.
- GNU id utils: Suggested by sarnold, this is blazingly fast (a 50 MB document is updated in less than 10 seconds on my test machine) and would be perfect if it had pagination and an API (a rough wrapper idea is sketched after this list).
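
For concreteness, the eXist-db conversion amounts to something like this (a minimal sketch in Python; the function and file handling are illustrative, and the line content has to be XML-escaped):

```python
from xml.sax.saxutils import escape

def text_to_xml(src_path, dst_path):
    """Wrap every line of a plain-text file in a numbered <line> element."""
    with open(src_path, errors="replace") as src, open(dst_path, "w") as dst:
        dst.write("<xml>")
        for n, line in enumerate(src, start=1):
            dst.write('<line number="%d">%s</line>' % (n, escape(line.rstrip("\n"))))
        dst.write("</xml>")
```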
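The Whoosh prototype is essentially the following (a simplified sketch; the index directory and field names are just what I picked). Every line becomes its own document carrying the file path, line number, and text, and updating a file means deleting all of its line documents and re-adding them, which is why it is slow:

```python
import os
from whoosh import index
from whoosh.fields import Schema, ID, NUMERIC, TEXT
from whoosh.qparser import QueryParser

# One document per line: which file it came from, its line number, and the text.
schema = Schema(
    path=ID(stored=True),          # file the line belongs to
    line_no=NUMERIC(stored=True),  # line number within that file
    content=TEXT(stored=True),     # the line itself
)

if index.exists_in("indexdir"):
    ix = index.open_dir("indexdir")
else:
    os.makedirs("indexdir", exist_ok=True)
    ix = index.create_in("indexdir", schema)

def reindex_file(path):
    """Drop all line documents of a file and re-import them (the slow part)."""
    writer = ix.writer()
    writer.delete_by_term("path", path)
    with open(path, errors="replace") as f:
        for n, line in enumerate(f, start=1):
            writer.add_document(path=path, line_no=n, content=line.rstrip("\n"))
    writer.commit()

def search(text, limit=20):
    """Return (path, line number, line) tuples for lines matching the query."""
    with ix.searcher() as searcher:
        query = QueryParser("content", ix.schema).parse(text)
        return [(hit["path"], hit["line_no"], hit["content"])
                for hit in searcher.search(query, limit=limit)]
```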
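For GNU id utils, the only way I see to get an API is to shell out to it. A rough sketch of what a wrapper could look like (this assumes gid prints grep-style `file:line:text` output; the exact invocation and the database rebuild step would need checking):

```python
import subprocess

def gid_search(pattern, database_dir):
    """Run GNU id-utils' gid in the directory holding the ID database and
    parse its output, assumed here to be grep-style "file:line:text"."""
    result = subprocess.run(
        ["gid", pattern],
        cwd=database_dir,
        capture_output=True,
        text=True,
    )
    matches = []
    for row in result.stdout.splitlines():
        path, line_no, text = row.split(":", 2)
        matches.append((path, int(line_no), text))
    return matches

# The database itself is rebuilt by re-running mkid over the files;
# the exact invocation depends on the setup.
```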
How would you implement an alternative?