I have a number of search logs that I want to compare against certain dictionary files. Once I process the search logs to filter out certain entries and get all the search terms into separate lines, what is an easy way to figure out how many of the search terms are present in the dictionary file?

Chris Henry

1 Answer

I'll set the preparation of the input aside and assume these inputs:

Search log: one search term per line, no repetitions, something like this:

car
tramway
bus
train
skate
rollerblade
bike

Dictionary: one dictionary word per line, no repetitions, something like this:

car
tramway
bus
train
bike
aeroplane
submarine

If you want to select the lines from the search log that are also in the dictionary, you can do it like this:

grep -f dictionary search_log

It'll return

car
tramway
bus
train
bike

And if you want the number of these words, just pipe the output to wc -l:

grep -f dictionary search_log | wc -l

And the result will be 5.
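
To try this end to end, here is a minimal sketch that recreates the two sample files from above in a scratch directory and counts the matches. The `workdir` variable and the use of `mktemp` are assumptions made for illustration only; `grep -c` is swapped in for `| wc -l` because it prints the number of matching lines directly.

```shell
#!/bin/sh
# Recreate the sample inputs from the answer in a temporary directory.
workdir=$(mktemp -d)
printf '%s\n' car tramway bus train skate rollerblade bike > "$workdir/search_log"
printf '%s\n' car tramway bus train bike aeroplane submarine > "$workdir/dictionary"

# -f reads one pattern per line from the dictionary file;
# -c prints the count of matching lines instead of the lines themselves.
count=$(grep -c -f "$workdir/dictionary" "$workdir/search_log")
echo "$count"

# Remove the scratch files again.
rm -rf "$workdir"
```

With the sample data this prints 5, the same number as the `wc -l` pipeline.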

mkudlacek
  • +1 for `grep -f file` – chmeee Jul 27 '10 at 09:18
  • The only problem with that is partial matches in `search_log` will be counted more than once. Is there a way to only count exact matches? Sorry, probably should've specified that in my question... – Chris Henry Jul 27 '10 at 17:58
  • The -f parameter accepts a file with regexp patterns, so a line in the dictionary can look like this: `^bike$`, which defines an exact match. – mkudlacek Jul 28 '10 at 06:26
  • Great, exactly what I needed. Any tips on performance? This takes really long. – Chris Henry Jul 28 '10 at 14:19
  • I went through the manual, and if you run `grep -x -f dictionary search_log` you don't have to surround the words in the dictionary with ^$; grep will only select lines that match a pattern in full. That could help. Otherwise you can split the dictionary file into parts and run several grep instances. – mkudlacek Jul 28 '10 at 14:42
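
The exact-match and performance points from the comments can be sketched together as follows. This assumes the dictionary holds literal words rather than regular expressions, so `-F` (fixed strings) can be combined with `-x` (whole-line match); fixed-string matching is usually much faster than regexp matching for large pattern files, though no timings were taken here. The sample words and the `workdir` variable are made up for illustration, and the `awk` variant is an alternative technique, not something proposed in the thread.

```shell
#!/bin/sh
# Small sample where substring matches inflate the count.
workdir=$(mktemp -d)
printf '%s\n' car carpet racecar bus > "$workdir/search_log"
printf '%s\n' car bus train > "$workdir/dictionary"

# Plain -f: every line containing "car" or "bus" anywhere is counted.
regex_count=$(grep -c -f "$workdir/dictionary" "$workdir/search_log")

# -F treats patterns as literal strings, -x requires the whole line to match.
exact_count=$(grep -c -F -x -f "$workdir/dictionary" "$workdir/search_log")

# Alternative: load the dictionary into an awk array, then look up each term.
awk_count=$(awk 'NR == FNR { dict[$0] = 1; next }
                 ($0 in dict) { n++ }
                 END { print n + 0 }' "$workdir/dictionary" "$workdir/search_log")

echo "regex: $regex_count, exact: $exact_count, awk: $awk_count"
rm -rf "$workdir"
```

Here the plain `-f` run counts 4 lines (car, carpet, racecar, bus), while the two exact-match variants count only 2 (car and bus).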