
I have a large number of small files to be searched. I have been looking for a good de facto multi-threaded version of grep but could not find one. How can I improve my usage of grep? As of now I am doing this:

grep -R "string" >> Strings
Legend

2 Answers


If you have xargs installed and are on a multi-core processor, you can benefit from the following, in case anyone is interested.

Environment:

Processor: Dual Quad-core 2.4GHz
Memory: 32 GB
Number of files: 584450
Total Size: ~ 35 GB

Tests:

1. Find the necessary files, pipe them to xargs and tell it to execute 8 instances.

time find ./ -name "*.ext" -print0 | xargs -0 -n1 -P8 grep -H "string" >> Strings_find8

real    3m24.358s
user    1m27.654s
sys     9m40.316s

2. Find the necessary files, pipe them to xargs and tell it to execute 4 instances.

time find ./ -name "*.ext" -print0 | xargs -0 -n1 -P4 grep -H "string" >> Strings

real    16m3.051s
user    0m56.012s
sys     8m42.540s

3. Suggested by @Stephen: Find the necessary files and use find's -exec ... + instead of xargs.

time find ./ -name "*.ext" -exec grep -H "string" {} \+ >> Strings

real    53m45.438s
user    0m5.829s
sys     0m40.778s

4. Regular recursive grep.

time grep -R "string" >> Strings

real    235m12.823s
user    38m57.763s
sys     38m8.301s

For my purposes, the first command worked just fine.

Legend
  • Might I suggest you use find's `-print0` with xargs' `-0` to delimit file names with the NUL character, so you don't get into trouble with file names that contain spaces, newlines or other garbage characters. – SiegeX Mar 05 '11 at 00:05
  • +2 interesting answer. Cheers. A. – armandino Mar 05 '11 at 00:55
  • nice! would try to use this more often :) – ken Mar 05 '11 at 01:02
  • I'd like to see the results over the same fileset of `time find ./ -name "*.ext" -exec grep -H "string" {} \+ >> Strings_findExec` (the `\+` terminating the find doing essentially the same as the `find|xargs` combo) – Stephen P Mar 05 '11 at 01:19
@Stephen Not quite the same, `xargs` allows you to utilize multiprocessor capability with the `-P` flag, whereas POSIX-2004 compliant versions of `find` terminated with `+` act the same as if you were to pass `-P1` to `xargs`, i.e. only one processor is utilized – SiegeX Mar 05 '11 at 01:34
  • @SiegeX That's what I was wondering - would the timings come out the same as the `grep -R "string" >> Strings` version? – Stephen P Mar 05 '11 at 01:36
  • @Stephen: I updated my post with the new results. :) Not sure about the behavior though. – Legend Mar 05 '11 at 02:30
  • If you have a multicore CPU, you could pipe the output of find to GNU parallel to do parallel grepping (see the sketch after this comment list). – fpmurphy Mar 05 '11 at 15:53
  • @fpmurphy: Actually, on the system that I am running my experiments, parallel is not installed and I was used to xargs :) But thanks for the tip though! – Legend Mar 05 '11 at 18:36
  • @fpmurphy isn't that exactly what `xargs -P` does? After I heard of this switch to `xargs`, I never really understood the purpose of GNU parallel. – Christian Dec 11 '12 at 17:55
  • @Christian: [here](http://www.gnu.org/software/parallel/man.html#differences_between_xargs_and_gnu_parallel) is a link to GNU parallel documentation that compares xargs and parallel. – Thor Sep 25 '13 at 02:07
  • Using this approach, I get an error with a large number of files: http://stackoverflow.com/questions/19694379/how-to-use-grep-with-large-millions-number-of-files-to-search-for-string-and-g – Watt Oct 30 '13 at 21:49
  • Why adding `-n1` to `xargs`? On my tests, it is 20 times faster without this option (or with a fair value like `-n 1024`). – Jérôme Pouiller Mar 29 '16 at 08:22
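
For reference, a minimal sketch of the GNU parallel variant that @fpmurphy mentions in the comments above, assuming GNU parallel is installed; the -j8 job count and the Strings_parallel output file name are illustrative, and this command was not timed on the file set above:

time find ./ -name "*.ext" -print0 | parallel -0 -j8 grep -H "string" {} >> Strings_parallel

Like xargs -n1, this runs one grep per file; parallel's -X option would batch several file names per invocation, much like a larger -n value with xargs.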

Wondering why -n1 is used below; wouldn't it be faster to use a higher value (say -n8, or to leave it out so xargs will do the right thing)?

xargs -0 -n1 -P8 grep -H "string"

It seems it would be more efficient to give each forked grep more than one file to process (I assume -n1 puts only one file name in argv for each grep). As I see it, we should be able to give the highest n possible on the system (based on the argc/argv maximum length limitation), so the setup cost of bringing up a new grep process is not incurred more often than necessary.

Nayan
  • Leaving it out is not a good idea since this will break if `find` returns a lot of files, and even if that doesn't occur, I doubt `xargs` has a very good way of knowing beforehand how many files to give to each instance of grep. The forking overhead will depend on the average file size of course, and an unlucky combination of n small files for one `grep` and n very large ones for the other could eat up even n=2 very easily. On my downloads dir, I picked a value of n=64 out of thin air and gained 3% though. – Christian Dec 11 '12 at 18:20
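
A sketch of the batched variant discussed above, assuming GNU xargs (or any xargs with -P); the batch size of 64 and the Strings_find8_n64 output file name are illustrative and would need benchmarking on the actual file set:

time find ./ -name "*.ext" -print0 | xargs -0 -n64 -P8 grep -H "string" >> Strings_find8_n64

Each grep invocation now receives up to 64 file names, so the process start-up cost is amortized, while -P8 still keeps eight greps running concurrently. -H is kept so matches are prefixed with the file name even if a batch happens to contain only a single file.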