
I am performing log analysis, which I want to automate so that it runs daily and reports its findings. The analysis runs on standard workstations with 8 cores and up to 32 GB of free RAM. The prototype is based on GNU grep (with --mmap), SQLite (on a RAM disk) and Bash (for parameter handling).

One problem with this is that I need to go through the files multiple times: whenever I find a pattern match, I search upwards for related entries. This can become recursive, and each pass re-reads gigabytes of data.

Is there a fast way, or a C library, for memory-backed, segment-wise, multi-threaded file reading and writing?

When I look at the "in-memory" search (moving up and down within a loaded segment, or loading more data when necessary), I get the feeling that this is a very general requirement.
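
To make the requirement concrete, here is a minimal sketch of the kind of segment-wise, multi-threaded scan over an mmap'ed file I have in mind (assuming POSIX mmap(2) and pthreads; the fixed "ERROR" pattern and the thread count are placeholders for the real analysis):

```c
/* Minimal sketch: mmap a log file once, scan it in parallel segments.
 * The pattern match is a plain memmem() stand-in for the real analysis. */
#define _GNU_SOURCE             /* for memmem() */
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

#define NTHREADS 8
static const char *pattern = "ERROR";   /* hypothetical pattern */

struct seg { const char *base; size_t off, len; long hits; };

static void *scan(void *arg)
{
    struct seg *s = arg;
    const char *p = s->base + s->off, *end = p + s->len;
    size_t plen = strlen(pattern);
    while ((p = memmem(p, (size_t)(end - p), pattern, plen))) {
        s->hits++;
        p += plen;
        /* Because the whole file stays mapped, we could also search
         * backwards from p here for related context lines. */
    }
    return NULL;
}

int main(int argc, char **argv)
{
    if (argc < 2) { fprintf(stderr, "usage: %s logfile\n", argv[0]); return 1; }
    int fd = open(argv[1], O_RDONLY);
    struct stat st;
    fstat(fd, &st);
    const char *base = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (base == MAP_FAILED) { perror("mmap"); return 1; }

    struct seg segs[NTHREADS];
    pthread_t tid[NTHREADS];
    size_t chunk = st.st_size / NTHREADS, start = 0;
    for (int i = 0; i < NTHREADS; i++) {
        size_t end = (i == NTHREADS - 1) ? (size_t)st.st_size : start + chunk;
        /* Extend each segment to the next newline so no line is split
         * across two threads. */
        while (end < (size_t)st.st_size && base[end] != '\n') end++;
        segs[i] = (struct seg){ base, start, end - start, 0 };
        start = end;
        pthread_create(&tid[i], NULL, scan, &segs[i]);
    }
    long total = 0;
    for (int i = 0; i < NTHREADS; i++) {
        pthread_join(tid[i], NULL);
        total += segs[i].hits;
    }
    printf("%ld matches\n", total);
    munmap((void *)base, st.st_size);
    close(fd);
    return 0;
}
```

Since the whole file stays mapped, each thread could walk backwards from a hit without triggering another full read of the data.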

wishi
    "*One problem with this is that I need to go through the files multiple times*" Feed it to a database, and search the database's content. – alk Mar 20 '15 at 14:15
  • @alk: And now he has two problems. Sorry, you don't know what you are talking about. Do you have any practical experience with this sort of task? – Hynek -Pichi- Vychodil Mar 20 '15 at 14:19
  • We did (depending on the context) full-text search on 10+ TB of document data, yes. But you are correct, SM-Access won't do. @Hynek-Pichi-Vychodil – alk Mar 20 '15 at 14:24
  • a DB like FastDB, or an indexed DB, could probably do this... yes. :) – wishi Mar 20 '15 at 14:27
  • @alk: I bet the Perl WF solution by Sean O'Rourke will beat the DB approach by a huge margin. It will finish the job way before you even finish an import. – Hynek -Pichi- Vychodil Mar 20 '15 at 14:33

1 Answer


Look at Tim Bray's Wide Finder project. It has a surprisingly simple and versatile solution in Perl by Sean O'Rourke. It mmaps the log into memory and then forks subprocesses to do the searching. The fact that the whole log file is accessible in each child process, so you can move flexibly forward and backward across the initial partitions, is what makes it so versatile. You can do it in C in the same manner, but I recommend using Perl first to test the concept and then rewriting it in C if you are not satisfied. Personally, I would go from the Perl POC to Erlang plus a C NIF, simply because of my personal preferences. (The Erlang solutions in the WF project don't use NIFs.)
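
For illustration, a rough C rendition of that mmap-then-fork pattern could look like the sketch below. This is my own sketch, not O'Rourke's code; the "ERROR" pattern and the process count of 8 are placeholders:

```c
/* Sketch of the Wide Finder idea in C: mmap the log once, fork workers.
 * Each child inherits the full read-only mapping, so it can scan its
 * own partition and still backtrack into earlier parts of the file. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/wait.h>
#include <unistd.h>

#define NPROC 8

int main(int argc, char **argv)
{
    if (argc < 2) { fprintf(stderr, "usage: %s logfile\n", argv[0]); return 1; }
    int fd = open(argv[1], O_RDONLY);
    struct stat st;
    fstat(fd, &st);
    /* MAP_PRIVATE read-only is fine: forked children share the pages. */
    const char *base = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (base == MAP_FAILED) { perror("mmap"); return 1; }

    size_t chunk = st.st_size / NPROC;
    for (int i = 0; i < NPROC; i++) {
        if (fork() == 0) {              /* child: scan one partition */
            size_t lo = i * chunk;
            size_t hi = (i == NPROC - 1) ? (size_t)st.st_size : lo + chunk;
            long hits = 0;
            for (const char *p = base + lo; p < base + hi; p++)
                if (p + 5 <= base + st.st_size && !memcmp(p, "ERROR", 5))
                    hits++;
            /* Even though we only scan [lo, hi), base[0 .. lo) is still
             * visible here, so backtracking needs no extra I/O. */
            printf("child %d: %ld hits\n", i, hits);
            _exit(0);
        }
    }
    while (wait(NULL) > 0)              /* parent: reap all children */
        ;
    return 0;
}
```

The key property is the same as in the Perl solution: every child sees the whole mapped file, so scanning one partition does not prevent you from moving backward across partition boundaries.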

Or, if you have a lot of money and can afford Splunk, that's the way to go.

Hynek -Pichi- Vychodil
  • I am aware that I can throw the files into Splunk / Elasticsearch / Lucene or even Hadoop. That doesn't really strike me as a simple enough solution in this case. – wishi Mar 20 '15 at 14:24
  • @wishi: That is exactly why I mention the surprisingly simple solution from the Wide Finder project. – Hynek -Pichi- Vychodil Mar 20 '15 at 14:26