Search for a few hundred filenames in a few hundred log files

Question

I would like to efficiently search through a few hundred log files for ~200 filenames.

I can easily do this using grep's -f option and putting the needles in a file.

However, there are a few problems:

I'm interested in doing this efficiently, as in How to use grep efficiently?
I want to know all the matches for each search term (i.e. filename) in all log files separately. grep -f reports matches in the order it finds them in each file, not grouped by needle.
I would like to know when a filename is not matched anywhere.

Hardware: 2.7 GHz i7 MacBook Pro with 16 GB of RAM.

Using grep -ron -f needle * gives me:

access_log-2013-01-01:88298:google
access_log-2013-01-01:88304:google
access_log-2013-01-01:88320:test
access_log-2013-01-01:88336:google
access_log-2013-01-02:396244:test
access_log-2013-01-02:396256:google
access_log-2013-01-02:396262:google

where needle contains:

google
test

The problem here is that the whole directory is searched for any match from needle, and the process is single-threaded, so it takes forever. There is also no explicit indication when a needle is not found anywhere.

Do any of the filenames contain spaces? Also, will there be times when a filename is appended to other text, or will it always be separated by whitespace/start of line/end of line? – Desidero, Sep 25 '13 at 00:36
@Desidero filenames don't contain spaces. Filenames may be appended to other text. Think /foor/bar/baz/needle.txt – kayaker243, Sep 25 '13 at 06:07
@kayaker243, assuming you had a solution for this problem, what would the output look like? Give us an example of input and output. – michael501, Sep 25 '13 at 17:09

rerx · Accepted Answer · 2013-09-25T22:59:20.663

How about combining grep and find in a bash script?

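# $(cat ...) splits on whitespace, so needles must not contain spaces here
for needle in $(cat needles.txt); do
    echo "$needle"
    # -n prints line numbers, -H always prints the file name
    matches=$(find . -type f -exec grep -nH -e "$needle" {} +)
    if [[ 0 == $? ]] ; then
        if [[ -z "$matches" ]] ; then
            echo "No matches found"
        else
            echo "$matches"
        fi
    else
        echo "Search failed / no matches"
    fi
    echo
done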

needles.txt contains a list of your target filenames.
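
Since the needles are filenames such as needle.txt, grep would by default treat the dot as a regular-expression wildcard. A minimal sketch, assuming literal matching is what is wanted: the helper name search_needle and its optional directory argument are hypothetical and not part of the original answer; -F is grep's fixed-string option.

```shell
#!/bin/sh
# search_needle NEEDLE [DIR]
# Hypothetical helper: print every line under DIR (default: .) that contains
# NEEDLE as a literal string, prefixed with file name and line number.
search_needle() {
    # -F: fixed string, so the "." in "needle.txt" only matches a dot
    # -r: recurse into DIR, -n: line numbers, -H: always print the file name
    grep -rnHF -e "$1" "${2:-.}"
}
```

Called as search_needle 'needle.txt' /path/to/logs, it returns grep's own exit status, so it can stand in for the find/grep command inside the loop.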

To read the needles (which now can contain spaces) line-by-line from the file, use this version:

# IFS= and -r keep whitespace and backslashes in each needle intact
while IFS= read -r needle ; do
    echo "$needle"
    matches=$(find . -type f -exec grep -nH -e "$needle" {} +)
    if [[ 0 == $? ]] ; then
        if [[ -z "$matches" ]] ; then
            echo "No matches found"
        else
            echo "$matches"
        fi
    else
        echo "Search failed / no matches"
    fi
    echo
done < needles.txt
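
How the lines are read matters for needles such as GET /term/ or ones containing a backslash: a plain read trims surrounding whitespace and treats backslashes as escape characters. The following sketch shows a reading loop that passes each line through unchanged; the function name read_needles is made up for illustration.

```shell
#!/bin/sh
# read_needles FILE
# Hypothetical helper: print each needle exactly as written, one per line.
read_needles() {
    # IFS= keeps leading/trailing whitespace; -r keeps backslashes literal.
    # The "|| [ -n ... ]" part also picks up a last line without a newline.
    while IFS= read -r needle || [ -n "$needle" ]; do
        printf '%s\n' "$needle"
    done < "$1"
}
```

Redirecting the file into the loop with < also avoids the extra cat process.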

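If you combine this with xargs, the exit status $? is no longer zero even on success: with -n1 every file gets its own grep process, and xargs reports a non-zero status as soon as any of them exits non-zero, which happens for every file without a match. Checking only whether the output is empty may be less safe, but works for me: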

while IFS= read -r needle ; do
    echo "$needle"
    # -print0 / -0 keep unusual file names intact; -P2 runs two greps in parallel
    matches=$(find . -type f -print0 | xargs -0 -n1 -P2 grep -nH -e "$needle")
    if [[ -z "$matches" ]] ; then
        echo "No matches found"
    else
        echo "$matches"
    fi
    echo
done < needles.txt
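
A single grep invocation distinguishes three outcomes through its exit status: 0 when at least one line matched, 1 when nothing matched, and a value greater than 1 when an error occurred. A sketch of how that could be used, assuming one recursive grep per needle instead of find plus xargs (so it gives up the parallelism); report_needle is a hypothetical name.

```shell
#!/bin/sh
# report_needle NEEDLE [DIR]
# Hypothetical helper: print the needle, then its matches or a status message.
report_needle() {
    matches=$(grep -rnHF -e "$1" "${2:-.}")
    status=$?                      # exit status of the grep above
    printf '%s\n' "$1"
    case $status in
        0) printf '%s\n' "$matches" ;;
        1) echo "No matches found" ;;
        *)
            # An error (e.g. an unreadable file) may still come with matches
            if [ -n "$matches" ]; then printf '%s\n' "$matches"; fi
            echo "Search failed (grep exit status $status)"
            ;;
    esac
}
```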

edited Sep 25 '13 at 22:59

answered Sep 25 '13 at 20:01

Thanks! I modified it slightly to use xargs to spread grep over 8 processes. `matches=$(find . -type f -print0 | xargs -0 -n1 -P8 grep -nH -E $needle)`. This appears to work. However, it turns out I do need to match for spaces - the term I want to search is actually `GET /term/`. Including a backslash before the term in needles.txt fails, seeming to exit execution. Quoting `$needle` seems to prevent evaluation of `$needle`. Any suggestion? – kayaker243 Sep 25 '13 at 21:22
@kayaker243 Can you adapt the version in the edit to your needs? Thanks for pointing out the parallelization with xargs, that is new to me. – rerx Sep 25 '13 at 21:37
no, my bash skills aren't up to the challenge of dealing with whitespace in this context :( – kayaker243 Sep 25 '13 at 21:39
Does the second version above not work with needles that contain spaces? Just put `GET /term/` on a single line in needles.txt. – rerx Sep 25 '13 at 22:13
Yes, putting needles with spaces on their own lines causes the script to die. – kayaker243 Sep 25 '13 at 22:20
For me with the xargs call included the `if [[ 0 == $? ]] ` no longer worked correctly. Does the third snippet above work for you? – rerx Sep 25 '13 at 23:01
Ah, forgot to mention I had to comment that portion out. – kayaker243 Sep 25 '13 at 23:10
@kayaker: Can you post your current code and a simple test case that fails? – rerx Sep 26 '13 at 07:25

Jim Mischel · Answer 2 · 2013-09-25T20:51:41.797

To determine which needles no longer have matches, you can take the output from grep and:

1. Use awk or something similar to extract just the matched strings to a separate file.
2. Do sort --uniq filename -o temp1
3. Concatenate the needles file to temp1
4. Do sort temp1 -o temp2
5. uniq -u temp2 > temp3

temp3 will contain the needles that are no longer used.

There might be a more concise way to do that. Steps 1 and 2 get a list of the unique needles that are found in the files.
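
The procedure can be sketched as one shell function that follows the worked example in this answer (extract, sort --uniq, append needles, sort, uniq -u). This is only a sketch under several assumptions: grep -o stands in for the awk extraction, the needles file has no duplicates or empty lines and lives outside the log directory, and no needle is a substring of another. The name unmatched_needles is hypothetical.

```shell
#!/bin/sh
# unmatched_needles NEEDLES_FILE LOG_DIR
# Hypothetical helper: print the needles that occur in no file under LOG_DIR.
unmatched_needles() (
    work=$(mktemp -d) || exit 1
    # Extract the matched strings: -o prints only the matched text,
    # -h drops file names, -F treats each needle as a literal string.
    grep -rohF -f "$1" "$2" > "$work/found"
    # Reduce them to the unique needles that were found.
    sort -u -o "$work/temp1" "$work/found"
    # Append the full needle list.
    cat "$1" >> "$work/temp1"
    # Sort so that duplicates become adjacent.
    sort -o "$work/temp2" "$work/temp1"
    # Keep only lines that occur once: the needles without any match.
    uniq -u "$work/temp2"
    rm -rf "$work"
)
```

Usage would be along the lines of unmatched_needles needles.txt logs/ > temp3.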

Say your needles file contains:

google
foo
bar

And grep finds foo and bar in multiple files, but doesn't find google. Step 1 would create a file like:

foo
bar
bar
foo
foo
bar
foo

sort --uniq will create:

bar
foo

Concatenating the needles file gives:

bar
foo
google
foo
bar

Sorting gives:

bar
bar
foo
foo
google

And the final uniq -u command will output a single line:

google
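
As for a more concise way, one possibility is comm, which compares two sorted files line by line. This is a swapped-in technique rather than the method described in this answer, and the helper name unmatched_with_comm is hypothetical; it assumes the matched strings have already been extracted to a file.

```shell
#!/bin/sh
# unmatched_with_comm NEEDLES_FILE FOUND_FILE
# Hypothetical helper: print needles that do not appear in FOUND_FILE.
unmatched_with_comm() (
    work=$(mktemp -d) || exit 1
    # comm requires both inputs to be sorted; -u also removes duplicates
    sort -u "$1" > "$work/needles.sorted"
    sort -u "$2" > "$work/found.sorted"
    # -2 hides lines only in the second file, -3 hides lines common to both,
    # leaving the lines that appear only in the needle list
    comm -23 "$work/needles.sorted" "$work/found.sorted"
    rm -rf "$work"
)
```

Unlike the uniq -u approach, this would still report an unmatched needle that is listed twice in the needles file.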
