Using strings (the command) to find only English words

Question

My problem in a nutshell is that I want the list of English words that are output when I run 'strings' on a binary file. Currently, running it on my file dumps a lot of trash to the screen, and I'm only interested in words that are, well, words.

After poking around here, I see that grep -f combined with a Linux dictionary file will do what I want, but it is slow.

Is there a faster alternative available, or is it really just that hard to match English words?

Yeah, I get that brute force matching will be hard, but I was hoping for a utility that used something like [Aho-Corasick matching](http://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_string_matching_algorithm) to preprocess the match list and be faster, without having to write it myself. (It's possible that grep is already doing that and is still that slow, and if so I'll just have to live with it.) — Gus, Dec 26 '12 at 18:07
fgrep searches for any of a list of fixed strings using the Aho–Corasick string matching algorithm. - http://en.wikipedia.org/wiki/Grep. Note how the link to grep was right in that article you linked to? — Zoredache, Dec 26 '12 at 18:17
I see 'formed the basis of the original' implying 'doesn't anymore'. Which makes sense when you consider that grep treats each input line as a mini regex, which would preclude using that algorithm to match fixed strings — Gus, Dec 26 '12 at 18:28
@Zoredache, fgrep (or grep -F) works. Performance improved from minutes to less than a second. Make it an answer and I'll accept it. — Gus, Dec 26 '12 at 19:09

score 2 · Answer 1 · answered Dec 26 '12 at 17:59

It's not hard to match; the problem is that you're matching a possibly long list against a really long list. It takes a long time simply due to the sheer number of comparisons that have to be made.

John


score 0 · Accepted Answer · answered Jan 04 '13 at 15:08

Grep can use a faster matching algorithm when it knows that it's only matching fixed strings (as opposed to regular expressions). You enable this behavior by supplying the -F option or by using the fgrep command.
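
As a small illustration of what -F changes, here is a sketch using a made-up pattern and made-up input lines (neither is from the question). Without -F the dot in the pattern is a regex wildcard; with -F it is just a literal character.

```shell
# Hypothetical sample input: two lines, only one contains a literal "a.c".
input=$(printf 'abc\na.c\n')

# Default mode: the pattern is a regular expression, so "." matches any character.
regex_hits=$(printf '%s\n' "$input" | grep -c 'a.c')

# -F mode: the pattern is a fixed string, so only the literal "a.c" matches.
fixed_hits=$(printf '%s\n' "$input" | grep -c -F 'a.c')

echo "regex: $regex_hits, fixed: $fixed_hits"
```

Because no pattern needs regex interpretation in -F mode, grep is free to treat the whole word list as a set of plain strings.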

The full command is:

strings fileToScan | grep -F -f /usr/share/dict/words

This assumes the dictionary file is present at /usr/share/dict/words.
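
One thing worth checking before relying on the output: with -F -f, a line is kept if any dictionary word appears anywhere inside it, so short dictionary entries can let a lot of junk through. The sketch below uses a tiny hypothetical word list and fake strings output (both invented for illustration) to compare the plain command with a stricter variant that adds -x (whole-line match) and -i (ignore case). Whether the stricter form suits your data is an assumption to test, not a given.

```shell
# Hypothetical stand-ins for the real files: a tiny word list and some
# fake "strings" output. Neither comes from the original question.
tmpdir=$(mktemp -d)
printf 'cat\nhello\nworld\n' > "$tmpdir/words"
printf 'xq9#\nhello\nconcatenate\nWorld\n@@zz\n' > "$tmpdir/strings_output"

# Plain -F -f: a line matches if any dictionary word occurs anywhere in it,
# so "concatenate" is kept because it contains "cat".
substring_hits=$(grep -F -f "$tmpdir/words" "$tmpdir/strings_output")

# -x keeps only lines that are exactly a dictionary word; -i ignores case.
exact_hits=$(grep -F -x -i -f "$tmpdir/words" "$tmpdir/strings_output")

printf 'substring matches:\n%s\n' "$substring_hits"
printf 'exact matches:\n%s\n' "$exact_hits"

# Clean up the temporary files.
rm -r "$tmpdir"
```

If lines should be kept when they merely contain a dictionary word as a separate token, -w (whole-word match) is a middle ground between the two variants shown here.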
