
I'm looking for a tool (ideally), or failing that an API, to search text for instances of any word from a large dictionary of words across a large number of text files. "Words" in my case are actually file names, but they won't contain spaces.

A fast algorithm might build a DFA (deterministic finite automaton) by reading the dictionary once, and then find instances of the dictionary words in a single pass over any number of files.
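For illustration, here is a minimal sketch of that idea in Java using the Aho-Corasick construction (my own sketch, not code from the question or answers): the automaton is built from the dictionary once, and scanning then costs roughly one transition per input character, independent of dictionary size.

```java
import java.util.*;

// Minimal Aho-Corasick sketch: build an automaton from a dictionary,
// then scan text in a single pass, collecting every dictionary word found.
class AhoCorasick {
    static class Node {
        Map<Character, Node> next = new HashMap<>();
        Node fail;                                // failure link
        List<String> output = new ArrayList<>();  // words ending at this node
    }

    final Node root = new Node();

    AhoCorasick(Collection<String> dictionary) {
        // 1. Build the trie of dictionary words.
        for (String word : dictionary) {
            Node cur = root;
            for (char c : word.toCharArray())
                cur = cur.next.computeIfAbsent(c, k -> new Node());
            cur.output.add(word);
        }
        // 2. Set failure links breadth-first (BFS guarantees a node's
        //    parent fail link is resolved before the node itself).
        Deque<Node> queue = new ArrayDeque<>();
        for (Node child : root.next.values()) {
            child.fail = root;
            queue.add(child);
        }
        while (!queue.isEmpty()) {
            Node node = queue.poll();
            for (Map.Entry<Character, Node> e : node.next.entrySet()) {
                char c = e.getKey();
                Node child = e.getValue();
                Node f = node.fail;
                while (f != null && !f.next.containsKey(c)) f = f.fail;
                child.fail = (f == null) ? root : f.next.get(c);
                child.output.addAll(child.fail.output); // inherit matches
                queue.add(child);
            }
        }
    }

    /** Returns every dictionary word occurrence in the text, in match order. */
    List<String> search(String text) {
        List<String> hits = new ArrayList<>();
        Node cur = root;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            while (cur != root && !cur.next.containsKey(c)) cur = cur.fail;
            cur = cur.next.getOrDefault(c, root);
            hits.addAll(cur.output);
        }
        return hits;
    }
}
```

With the classic dictionary `{he, she, his, hers}`, searching `"ushers"` reports `she`, `he`, and `hers`, including overlapping matches, which a per-token hash lookup would miss.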

Note: I want exact text matching, not fuzzy matching as in this SO question: Algorithm wanted: Find all words of a dictionary that are similar to words in a free text

Tony O'Hagan

2 Answers


Have you looked at Lucene? There's a Java and a .NET version.

http://lucene.apache.org/java/docs/index.html

Boas Enkler

I'd load the dictionary of words into a HashMap or "Dictionary", then read the file in line by line or word by word, checking whether the map contains the word.
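A sketch of this answer's approach in Java (my own illustration, not the answerer's code): a `HashSet` gives O(1) expected lookup per token, but note it only finds whole whitespace-separated tokens, not words embedded in other text.

```java
import java.util.*;

// Per-token dictionary lookup: split the input on whitespace and check
// each token against a hash set. The tokenization rule (split on
// whitespace) is an assumption for illustration.
class DictScan {
    static List<String> findKnownWords(Set<String> dictionary, String text) {
        List<String> hits = new ArrayList<>();
        for (String token : text.split("\\s+"))
            if (dictionary.contains(token))
                hits.add(token);
        return hits;
    }
}
```

In practice you would feed it each line of each file; the lookup cost per token stays constant as the dictionary grows, though unlike an automaton it cannot match a dictionary word that appears as a substring of a larger token.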

NoBugs
  • Sorry, this will be far too slow. I'm looking for an algorithm that can read a text stream with constant search cost (it does not increase as the number of dictionary words increases). I think what I might be looking for is a Perfect Hash function. http://en.wikipedia.org/wiki/Perfect_hash_function – Tony O'Hagan Jul 14 '11 at 03:24
  • This is still not as good as a DFA approach that can simply read a text stream as a byte sequence and emit match events. – Tony O'Hagan Jul 14 '11 at 04:11
  • Ahh, looks like [fgrep](http://ss64.com/bash/fgrep.html) does the job. It implements the [Aho–Corasick string matching algorithm](http://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_string_matching_algorithm). – Tony O'Hagan Jul 14 '11 at 04:25