
I've got about 100k gzipped JSON files totaling about 100GB. All the files are in the same directory, and I'm running this locally on macOS.

I have several different patterns to match, and have tried running the command a variety of ways, but they all take hours to complete.

I started with this command, pulling from here and here:

find . -name "*.gz" | xargs zgrep pattern >> results.json

This works fine, but takes about 4 hours to complete.

I tried to parallelize it, first with a single pattern and then with multiple patterns:

find . -name "*.gz" | parallel zgrep pattern >> results/melanys.json

find . -name "*.gz" | parallel zgrep -e pattern1 -e pattern2 -e pattern3 -e pattern4 >> results/melanys.json

These do indeed spawn multiple zgrep processes, but most of them sit idle most of the time. The single-pattern version didn't run any noticeably faster, and the multiple-pattern version had been running for 8 hours before I decided to shut it down.

I hadn't thought that zgrep would really take this long -- my hope was to zgrep the relevant lines out of this data set and then plug those into a structure more suitable for analysis, maybe a local database.

Is there a way to speed up zgrep?

tchaymore
2 Answers


It is not surprising that zgrepping 100GB of files takes hours to complete. The majority of that time is consumed simply by decompressing the files. If you like, you can estimate how much by timing the decompression alone:

time find . -name "*.gz" | xargs zcat > /dev/null

With that being the case, there's pretty much nothing you can do cheaply with this collection of files. If your zgrep is not I/O bound then you might hope to gain something from parallelising, but the best possible outcome in that case is a speedup proportional to the number of CPU cores in your machine. You will not see that much speedup in practice, and you will see none at all if zgrep is I/O bound.
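If you do experiment with parallelising, a minimal sketch looks like the following (assuming GNU Parallel is installed, e.g. via Homebrew; the -j value only makes the default one-job-per-core behaviour explicit, and -H asks zgrep to prefix each match with its filename, which is easy to lose when many files are searched at once):

# one zgrep per file, one job per CPU core reported by macOS
find . -name "*.gz" | parallel -j "$(sysctl -n hw.ncpu)" zgrep -H -e pattern {} >> results.json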

You might also consider putting the data on faster media, such as a solid-state drive, or a RAID array composed of such. Even so, you are unlikely to go from requiring hours to requiring only minutes.

By all means, though, do make every effort to extract all the data you want in one pass.
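For example, if you collect all your patterns into a file, one per line, then a single decompression pass can serve every pattern at once. This is a sketch, with patterns.txt being a hypothetical file name; most zgrep implementations pass -f through to grep, and you can add -F if the patterns are fixed strings rather than regexps:

# search every file for all patterns in one pass
find . -name "*.gz" | xargs zgrep -f patterns.txt >> results.json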

John Bollinger

GNU Parallel's manual has a section dedicated to grepping multiple lines for multiple regexps: http://www.gnu.org/software/parallel/man.html#EXAMPLE:-Grepping-n-lines-for-m-regular-expressions
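Adapted to compressed files, the idea from that section looks roughly like this (a sketch, assuming a hypothetical regexps.txt with one pattern per line; --tag prefixes each output line with the source filename, which a bare zcat | grep would otherwise drop):

# decompress each file once, grep it for all patterns, one job per core
find . -name "*.gz" | parallel --tag 'zcat {} | grep -f regexps.txt' >> results.json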

Ole Tange