
My problem, in a nutshell, is that I want only the English words from the output I get when I run 'strings' on a binary file. Currently the file I run it on dumps a lot of garbage to the screen, and I'm only interested in the parts that are, well, words.

After poking around here, I see that grep -f combined with a Linux dictionary file will do what I want, but it is slow.

Is there a faster alternative available, or is it really just that hard to match English words?

Gus
  • Yeah, I get that brute-force matching will be hard, but I was hoping for a utility that used something like [Aho-Corasick matching](http://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_string_matching_algorithm) to preprocess the match list and be faster, without having to write it myself. (It's possible that grep is already doing that and is still that slow, and if so I'll just have to live with it.) – Gus Dec 26 '12 at 18:07
  • fgrep searches for any of a list of fixed strings using the Aho–Corasick string matching algorithm. - http://en.wikipedia.org/wiki/Grep. Note how the link to grep was right in that article you linked to? – Zoredache Dec 26 '12 at 18:17
  • I see 'formed the basis of the original' implying 'doesn't anymore'. Which makes sense when you consider that grep treats each input line as a mini regex, which would preclude using that algorithm to match fixed strings – Gus Dec 26 '12 at 18:28
  • @Zoredache, fgrep (or grep -F) works. Performance improved from minutes to less than a second. Make it an answer and I'll accept it. – Gus Dec 26 '12 at 19:09

2 Answers


It's not hard to match; the problem is that you're matching a possibly long list against a really long list. It takes a long time simply because of the sheer number of comparisons that have to be made.
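One way to cut down the number of comparisons, offered here only as a rough sketch and not as part of the original answer, is to load the dictionary into a hash table once and then look up each token directly. The function name `dict_lookup` is hypothetical, and the sketch assumes a POSIX awk and a one-word-per-line, non-empty dictionary file:

```shell
# Hypothetical helper: reads text on stdin, prints every token found in the dictionary.
# Assumes a one-word-per-line, non-empty dictionary (default path is an assumption).
dict_lookup() {
    dict=${1:-/usr/share/dict/words}
    awk '
        # First file (the dictionary): remember every word as an array key.
        NR == FNR { words[$0] = 1; next }
        # Remaining input (stdin): split on non-letters, look each token up.
        {
            n = split($0, tok, /[^A-Za-z]+/)
            for (i = 1; i <= n; i++)
                if (tok[i] in words) print tok[i]
        }
    ' "$dict" -
}
```

It would be used as `strings fileToScan | dict_lookup`. Each token costs one hash lookup, so the work grows with the size of the input and not with input size times dictionary size.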

John

Grep can use a faster matching algorithm when it knows that it's only matching fixed strings (as opposed to regular expressions). You enable this behavior by supplying the -F option or by using the fgrep command.

The full command is:

strings fileToScan | grep -F -f /usr/share/dict/words

This assumes the dictionary file is present at /usr/share/dict/words.
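Note that the command above keeps any line that contains a dictionary word as a substring, so a line like "xqzthe9" would still pass. A possible refinement, sketched here under the assumption that your grep supports the common -w (whole-word) and -o (print only the match) options, is to print just the matched words. The function name `dict_filter` is hypothetical:

```shell
# Hypothetical helper: reads text on stdin, prints each distinct whole-word match.
# The default dictionary path is an assumption; pass another path as $1 if needed.
dict_filter() {
    dict=${1:-/usr/share/dict/words}
    # -F: fixed strings, -f: patterns from file,
    # -w: match whole words only, -o: print only the matched text.
    grep -o -w -F -f "$dict" | sort -u
}
```

Usage would be `strings fileToScan | dict_filter`. Word lists may contain very short entries such as single letters, so the output can still need some pruning.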

Gus