I would like to efficiently search through a few hundred log files for ~200 filenames.

I can easily do this using grep's -f option, with the needles listed one per line in a file.

However, there are a few problems:

  • I'm interested in doing this efficiently, as discussed in "How to use grep efficiently?"
  • I want to know all the matches for each search term (i.e. filename) in all log files separately. grep -f reports matches in the order it encounters them in each file, not grouped by needle.
  • I would like to know when a filename is not matched anywhere.

Hardware: 2.7 GHz i7 MacBook Pro with 16 GB of RAM.

Using grep -ron -f needle * gives me:

access_log-2013-01-01:88298:google
access_log-2013-01-01:88304:google
access_log-2013-01-01:88320:test
access_log-2013-01-01:88336:google
access_log-2013-01-02:396244:test
access_log-2013-01-02:396256:google
access_log-2013-01-02:396262:google

where needle contains:

google
test

The problem here is that the whole directory is searched for any match from needle, and the process is single-threaded, so it takes forever. There is also no explicit indication when a needle fails to match anywhere.

kayaker243

2 Answers

How about combining grep and find in a bash script?

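# $(cat ...) splits on whitespace, so each needle must be a single word here.
for needle in $(cat needles.txt); do
    echo "$needle"
    # find exits nonzero if any grep invocation exits nonzero (no match or error).
    matches=$(find . -type f -exec grep -nH -e "$needle" {} +)
    if [[ $? -eq 0 ]] ; then
        if [[ -z "$matches" ]] ; then
            echo "No matches found"
        else
            echo "$matches"
        fi
    else
        echo "Search failed / no matches"
    fi
    echo
done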

needles.txt contains a list of your target filenames.

To read the needles (which now can contain spaces) line-by-line from the file, use this version:

# IFS= and -r keep leading whitespace and backslashes in each needle intact.
while IFS= read -r needle ; do
    echo "$needle"
    matches=$(find . -type f -exec grep -nH -e "$needle" {} +)
    if [[ $? -eq 0 ]] ; then
        if [[ -z "$matches" ]] ; then
            echo "No matches found"
        else
            echo "$matches"
        fi
    else
        echo "Search failed / no matches"
    fi
    echo
done < needles.txt

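If you parallelize the search with xargs, the exit status $? is no longer zero even on success, because xargs reports failure as soon as any single grep invocation finds no match in its file. Checking only whether the output is empty may be less safe, but works for me: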

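# IFS= and -r keep leading whitespace and backslashes in each needle intact.
while IFS= read -r needle ; do
    echo "$needle"
    # -n1 runs one grep per file; -P2 runs two of them in parallel.
    matches=$(find . -type f -print0 | xargs -0 -n1 -P2 grep -nH -e "$needle")
    if [[ -z "$matches" ]] ; then
        echo "No matches found"
    else
        echo "$matches"
    fi
    echo
done < needles.txt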
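
The parallel version above has to ignore the exit status: xargs returns nonzero whenever any one grep invocation finds nothing in its file (123 with GNU xargs). A minimal sketch of wrapping the search in a function follows. The name search_needle, its two arguments, and the use of -F (treating the needle as a literal string rather than a regular expression, which suits filenames and terms such as GET /term/) are assumptions for illustration, not part of the original answer.

```shell
#!/usr/bin/env bash

# search_needle NEEDLE DIR (hypothetical helper)
# Prints every match of NEEDLE in the files under DIR, one grep per file,
# two greps at a time. Returns 0 if anything matched, 1 otherwise.
search_needle() {
    local needle=$1 dir=$2 matches
    # -F: literal string match (assumption; drop it if needles are regexes).
    # "|| true": the xargs status is nonzero when some file has no match,
    # so it is deliberately discarded and only the output is inspected.
    matches=$(find "$dir" -type f -print0 |
        xargs -0 -n1 -P2 grep -nH -F -e "$needle") || true
    if [[ -n $matches ]]; then
        printf '%s\n' "$matches"
        return 0
    fi
    return 1
}
```

For example, search_needle 'GET /term/' . || echo "No matches found" reports one needle. Because the greps run concurrently, the order of lines coming from different files is not guaranteed.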
rerx
  • Thanks! I modified it slightly to use xargs to spread grep over 8 processes. `matches=$(find . -type f -print0 | xargs -0 -n1 -P8 grep -nH -E $needle)`. This appears to work. However, it turns out I do need to match for spaces - the term I want to search is actually `GET /term/`. Including a backslash before the term in needles.txt fails, seeming to exit execution. Quoting `$needle` seems to prevent evaluation of `$needle`. Any suggestion? – kayaker243 Sep 25 '13 at 21:22
  • @kayaker243 Can you adapt the version in the edit to your needs? Thanks for pointing out the parallelization with xargs, that is new to me. – rerx Sep 25 '13 at 21:37
  • no, my bash skills aren't up to the challenge of dealing with whitespace in this context :( – kayaker243 Sep 25 '13 at 21:39
  • Does the second version above not work with needles that contain spaces? Just put `GET /term/` on a single line in needles.txt. – rerx Sep 25 '13 at 22:13
  • Yes, putting needles with spaces on their own lines causes the script to die. – kayaker243 Sep 25 '13 at 22:20
  • For me with the xargs call included the `if [[ 0 == $? ]] ` no longer worked correctly. Does the third snippet above work for you? – rerx Sep 25 '13 at 23:01
  • Ah, forgot to mention I had to comment that portion out. – kayaker243 Sep 25 '13 at 23:10
  • @kayaker: Can you post your current code and a simple test case that fails? – rerx Sep 26 '13 at 07:25

To determine which needles have no matches, you can take the output from grep and:

  1. Use awk or something similar to extract just the matched strings to a separate file.
  2. Do sort -u filename -o temp1
  3. Concatenate the needles file to temp1
  4. Do sort temp1 -o temp2
  5. uniq -u temp2 > temp3

temp3 will contain the needles that were never matched. This works because every matched needle then appears twice in temp2 and every unmatched needle only once, assuming the needles file itself has no duplicate lines.

There might be a more concise way to do that. Steps 1 and 2 get a list of the unique needles that are found in the files.

Say your needles file contains:

google
foo
bar

And grep finds foo and bar in multiple files, but doesn't find google. Step 1 would create a file like:

foo
bar
bar
foo
foo
bar
foo

sort -u will create:

bar
foo

Concatenating the needles file gives:

bar
foo
google
foo
bar

Sorting gives:

bar
bar
foo
foo
google

And the final uniq -u command will output a single line:

google
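
The procedure above can be sketched as one shell function. This is an illustration under stated assumptions: the name unmatched_needles and its arguments are hypothetical, the needles file has no duplicate or empty lines and lives outside the searched directory, and the needles are treated as literal strings (-F). grep -o prints only the matched text, which stands in for the awk extraction step.

```shell
#!/usr/bin/env bash

# unmatched_needles NEEDLES_FILE DIR (hypothetical helper)
# Prints the needles from NEEDLES_FILE that occur in no file under DIR.
unmatched_needles() {
    local needles=$1 dir=$2 tmp
    tmp=$(mktemp) || return 1
    # -r recurse, -h omit file names, -o print only the matched text,
    # -F literal strings. sort -u leaves each matched needle once.
    grep -rhoF -f "$needles" "$dir" | sort -u > "$tmp"
    # Append the full needle list: matched needles now appear twice,
    # unmatched needles exactly once.
    cat "$needles" >> "$tmp"
    # uniq -u keeps only lines that occur once in the sorted input.
    sort "$tmp" | uniq -u
    rm -f "$tmp"
}
```

On the example above (needles google, foo and bar, with only foo and bar present in the logs) it prints google. One caveat: when one needle is a substring of another, grep -o reports only one match at each position, so the other needle can be reported as unmatched.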
Jim Mischel