I'm successfully searching for a list of suspicious IPs (one per line in a text file, ips.txt) across a directory of compressed log files with this command:

root@yop# find /mylogs/ -exec zgrep -i -f ips.txt {} \; > ips.result.txt

I now want to use parallel with it to speed up the search, but I can't find the correct arguments for it at the moment. I mean: use the pattern file (one pattern per line) and also export the results to a file.

Is there a parallel guru who can help, please?

The closest command I found was this: grep-or-anything-else-many-files-with-multiprocessor-power

But I wasn't able to use it with a file of patterns and export the results to a file either...

Please help, thanks all.

  • Depending on how many logfiles you have, you may just be able to background your jobs to parallelise them. Also, if you are looking for IP addresses without needing wildcards, you may find `zfgrep` faster too. – Mark Setchell Feb 25 '14 at 11:22
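
For illustration, here is the question's command with fixed-string matching swapped in, as Mark suggests. This is a sketch, not taken from the thread: zgrep passes -F through to grep, so each line of ips.txt is matched as a literal string rather than as a regex, which is typically faster for plain IP addresses.

# -F: treat every line of ips.txt as a fixed string, not a pattern
find /mylogs/ -exec zgrep -F -i -f ips.txt {} \; > ips.result.txt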

2 Answers


If you just want to run multiple jobs at once, consider using GNU parallel:

parallel zgrep -i -f ips.txt :::: <(find /mylogs -type f) > results.txt
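
If you need to cap how many zgrep processes run at once, parallel's `-j` option sets the job limit. The value 8 below is only illustrative; tune it to your core count:

# run at most 8 zgrep jobs concurrently
parallel -j 8 zgrep -i -f ips.txt :::: <(find /mylogs -type f) > results.txt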
  • It is not recommended to use $() with `:::`. The reason is that if the $() returns filenames containing space, then these will be split by the shell. It is better to do: `find /mylogs -type f | parallel zgrep -i -f ips.txt > results.txt` or even: `parallel zgrep -i -f ips.txt :::: <(find . -type f) > results.txt` but if the files are all in the same dir this will work, too: `parallel zgrep -i -f ips.txt ::: * > results.txt` – Ole Tange Feb 25 '14 at 13:13
  • Ole is correct. It may also be worth mentioning that GNU parallel supports the `\0` (null) delimiter when one specifies the `-0` flag. So the following can be used to process files with strange file names: `find /mylogs -type f -print0 | parallel -0 zgrep -i -f ips.txt > results.txt` – Steve Feb 25 '14 at 13:46
  • I think I got the point a little bit for the arguments... I need to understand the manual :) **Steve**: `find /my/logs/ -type f -print0 | parallel -0 zgrep -i -f ips.txt > results.txt` gives `/bin/bash: -c: option requires an argument`, `/bin/zgrep: line 161: 1: missing pattern; try '/bin/zgrep --help' for help`, `/bin/bash: -c: option requires an argument`, `/bin/bash: ips.txt: command not found`. And `parallel zgrep -i -f ips.txt :::: <(find /mys/logs -type f) > results.txt` gives `/bin/zgrep: line 161: 1: missing pattern; try '/bin/zgrep --help' for help` – mastarah Feb 25 '14 at 14:18
  • @mastarah: The commands work fine for me in my testing. Are you _actually_ using GNU parallel? What's the output of: `parallel -V | head -n 1`? – Steve Feb 25 '14 at 14:31
  • `$ parallel -V | head -n 1` gives `WARNING: YOU ARE USING --tollef. IF THINGS ARE ACTING WEIRD USE --gnu.` I'm going to add --gnu and test... OK, it's working with: **parallel --gnu zgrep -i -f ips.txt :::: <(find /mylogs -type f) > results.txt** (a persistent fix is sketched after these comments) – mastarah Feb 25 '14 at 14:43
  • @mastarah: Excellent. Glad to hear you got that sorted :-) – Steve Feb 25 '14 at 22:37
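
An aside not from the original thread: on systems where parallel defaults to `--tollef`, GNU parallel reads default options from `~/.parallel/config`, so `--gnu` can be made permanent rather than typed on every invocation. A minimal sketch, assuming a standard GNU parallel install:

# make --gnu the default for all future parallel invocations
echo --gnu >> ~/.parallel/config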

How about looping over the files and putting each one into a background job? As Mark commented, this may not be suitable if you have a very large number of log files. It also assumes you are not running anything else in the background.

mkdir results

while read -r f; do
    # one background zgrep per file; each writes its own result file
    # (basename avoids recreating the source directory tree under results/)
    zgrep -i -f ips.txt "$f" > results/"$(basename "$f")".result &
done < <(find /mylogs/ -type f)

wait

cat results/* > ip.results.txt
rm -rf results

You can limit the number of files to search by using head and/or tail, e.g. to search only the first 50 files, change the loop's input to:

done < <(find /mylogs/ -type f | head -n 50)

Then the next 50:

done < <(find /mylogs/ -type f | head -n 100 | tail -n 50)

And so on.
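
As an illustrative generalisation (the BATCH and SIZE variables are made up for this sketch, which reuses the results directory created above), the head/tail arithmetic can be parameterised so that batch N covers files (N-1)*SIZE+1 through N*SIZE:

BATCH=2   # which batch to process (batch 2 = files 51-100)
SIZE=50   # files per batch

while read -r f; do
    zgrep -i -f ips.txt "$f" > results/"$(basename "$f")".result &
done < <(find /mylogs/ -type f | head -n $((BATCH * SIZE)) | tail -n "$SIZE")

wait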

  • That might result in lines from different files interleaving in the output. – flx Feb 25 '14 at 12:18
  • Amended to avoid writing to same file – Josh Jolly Feb 25 '14 at 12:27
  • Yes, sure, for the background jobs. I'll also try that one. My question was more about performance (I suppose), because there are a lot of files in the directory and I want to increase the search speed :) I'll run some tests with background jobs to get a timing and compare it with parallel. Is there a parallel equivalent for this anyway? Thanks flx, Josh, for your answers! – mastarah Feb 25 '14 at 12:43
  • No need to do the waits in a loop, just do a single wait with no parameters to wait for them all. – Mark Setchell Feb 25 '14 at 13:25