How to index text files to improve grep time

Question

I have a large number of text files I need to grep through on a regular basis.

There are ~230,000 files amounting to around 15GB of data.

I've read the following threads:

The machine I'll be grepping on is an Intel Core i3 (i.e. dual-core), so I can't parallelize to any great extent. The machine is running Ubuntu and I'd prefer to do everything via the command line.

Instead of running a bog-standard grep each time, is there any way I can either index or tag the contents of the text files to improve searching?

Do the files have data that you can meaningfully index on? If you can do that then you can write anything you want to operate on that index. I don't know of off-the-shelf tools for this though. — Etan Reisner, Jul 08 '15 at 20:11
Thanks for the follow-up. The data in the files is human-readable text. I've edited my explanation to read "...is there a way I can either index or tag the contents...". I think I should have used the word "tag" rather than "index" originally. — Richard Horrocks, Jul 08 '15 at 20:51
@glennjackman How would I go about putting this into a database? What tools would I use? MySQL, or a type of database specific to this text-based problem? — Richard Horrocks, Jul 11 '15 at 15:30

score 2 · Answer 1 · answered Apr 19 '20 at 01:03

To search a large number of files for text patterns, qgrep uses indexing. See the article on why and how: https://zeux.io/2019/04/20/qgrep-internals
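To illustrate the general idea of indexing before searching, here is a minimal sketch of a word-level inverted index in Python. It is not how qgrep works internally (the linked article covers that); it only shows how an index can narrow 230,000 files down to a few candidates before any regex runs. The function names `build_index`, `candidates`, and `search` are made up for this example, and it assumes the files are UTF-8 text and that the index fits in memory.

```python
import re
from collections import defaultdict
from pathlib import Path

TOKEN = re.compile(r"\w+")  # a "word" is any run of word characters


def build_index(root):
    """Map each lowercased word to the set of files under root that contain it."""
    index = defaultdict(set)
    for path in Path(root).rglob("*"):
        if path.is_file():
            # Assumption: files are UTF-8; undecodable bytes are skipped.
            text = path.read_text(encoding="utf-8", errors="ignore")
            for word in set(TOKEN.findall(text.lower())):
                index[word].add(str(path))
    return index


def candidates(index, words):
    """Return the files that contain every one of the given words."""
    sets = [index.get(w.lower(), set()) for w in words]
    return set.intersection(*sets) if sets else set()


def search(index, words, pattern):
    """Run the regex only on candidate files; yield (path, line number, line)."""
    regex = re.compile(pattern)
    for path in sorted(candidates(index, words)):
        with open(path, encoding="utf-8", errors="ignore") as fh:
            for lineno, line in enumerate(fh, 1):
                if regex.search(line):
                    yield path, lineno, line.rstrip("\n")
```

You would build the index once, then call something like `search(index, ["timeout"], r"timeout after \d+ s")` for each query. The word list is a cheap filter that decides which files get opened, and the regex does the exact matching. For 15GB of text you would probably want to store the index on disk rather than rebuild it on every run.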

Alternatively, try a modern multi-threaded grep tool such as ugrep or ag (The Silver Searcher). Note that the ag bug list on GitHub shows that the most recent release, ag 2.2.0, may run slower with multiple threads, which I assume will be fixed in a future update.

score 0 · Answer 2 · answered Jul 09 '15 at 04:53

Have you tried ag as a replacement for grep? It should be in the Ubuntu repositories (the package is named silversearcher-ag). I had a problem similar to yours, and ag is much faster than grep for most regex searches. There are some differences in syntax and features, but those only matter if you depend on grep-specific behavior.
