I've got about 100k gzipped JSON files that together total about 100 GB. All the files are in the same directory, and I'm running this locally on my Mac (OS X).
I have several different patterns to match, and have tried running the command a variety of ways, but they all take hours to complete.
I started with this command, pulling from here and here:
find . -name "*.gz" | xargs zgrep pattern >> results.json
This works fine, but takes about 4 hours to complete.
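One variation I've been wondering about is letting xargs do the parallelizing itself instead of GNU parallel -- this is purely a sketch on my part, with the -P job count picked arbitrarily and -print0/-0 added just to be safe with unusual filenames:

find . -name "*.gz" -print0 | xargs -0 -P 8 zgrep -H pattern >> results.json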
I then tried to parallelize it, both with a single pattern and with multiple patterns:
find . -name "*.gz" | parallel zgrep pattern >> results/melanys.json
find . -name "*.gz" | parallel zgrep -e pattern1 -e pattern2 -e pattern3 -e pattern4 >> results/melanys.json
These do indeed spawn multiple zgrep processes, but most of them sit idle most of the time. The single-pattern version doesn't seem to run any faster, and the multiple-pattern version had been running for 8 hours before I decided to shut it down.
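In case the way I'm invoking parallel is part of the problem, this is the fully spelled-out form of what I believe I'm running (the -j job count here is just a guess on my part; as I understand it, parallel defaults to one job per core and groups each job's output before it reaches the redirect):

find . -name "*.gz" | parallel -j 8 zgrep -H -e pattern1 -e pattern2 -e pattern3 -e pattern4 {} >> results/melanys.json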
I hadn't thought that zgrep would really take this long -- my hope was to zgrep the relevant lines out of this data set and then load them into a structure more suitable for analysis, maybe a local database.
Is there a way to speed up zgrep?