
I'm looking for a tool (ideally), or failing that an API, to search text for instances of any word from a large dictionary of words across a large number of text files. "Words" in my case are actually file names, but they won't contain spaces.

A fast algorithm might build a DFA (deterministic finite automaton) by reading the dictionary once, and then find instances of the dictionary words in a single pass over any number of files.
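For illustration, here is a minimal sketch of that idea in Java using the Aho-Corasick construction (my own sketch, not code from the question or answers): the automaton is built from the dictionary once, and scanning then costs roughly one transition per input character, independent of dictionary size.

```java
import java.util.*;

// Minimal Aho-Corasick sketch: build an automaton from a dictionary,
// then scan text in a single pass, collecting every dictionary word found.
class AhoCorasick {
    static class Node {
        Map<Character, Node> next = new HashMap<>();
        Node fail;                                // failure link
        List<String> output = new ArrayList<>();  // words ending at this node
    }

    final Node root = new Node();

    AhoCorasick(Collection<String> dictionary) {
        // 1. Build the trie of dictionary words.
        for (String word : dictionary) {
            Node cur = root;
            for (char c : word.toCharArray())
                cur = cur.next.computeIfAbsent(c, k -> new Node());
            cur.output.add(word);
        }
        // 2. Set failure links breadth-first (BFS guarantees a node's
        //    parent fail link is resolved before the node itself).
        Deque<Node> queue = new ArrayDeque<>();
        for (Node child : root.next.values()) {
            child.fail = root;
            queue.add(child);
        }
        while (!queue.isEmpty()) {
            Node node = queue.poll();
            for (Map.Entry<Character, Node> e : node.next.entrySet()) {
                char c = e.getKey();
                Node child = e.getValue();
                Node f = node.fail;
                while (f != null && !f.next.containsKey(c)) f = f.fail;
                child.fail = (f == null) ? root : f.next.get(c);
                child.output.addAll(child.fail.output); // inherit matches
                queue.add(child);
            }
        }
    }

    /** Returns every dictionary word occurrence in the text, in match order. */
    List<String> search(String text) {
        List<String> hits = new ArrayList<>();
        Node cur = root;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            while (cur != root && !cur.next.containsKey(c)) cur = cur.fail;
            cur = cur.next.getOrDefault(c, root);
            hits.addAll(cur.output);
        }
        return hits;
    }
}
```

With the classic dictionary `{he, she, his, hers}`, searching `"ushers"` reports `she`, `he`, and `hers`, including overlapping matches, which a per-token hash lookup would miss.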

Note: I want exact text matching, not fuzzy matching as in this SO question: Algorithm wanted: Find all words of a dictionary that are similar to words in a free text

Tony O'Hagan

2 Answers


Have you looked at Lucene? There's a Java and a .NET version.

http://lucene.apache.org/java/docs/index.html

Boas Enkler

I'd load the dictionary of words into a HashMap or "Dictionary", then read the file in line by line or word by word, checking whether the map contains the word.
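A sketch of this answer's approach in Java (my own illustration, not the answerer's code): a `HashSet` gives O(1) expected lookup per token, but note it only finds whole whitespace-separated tokens, not words embedded in other text.

```java
import java.util.*;

// Per-token dictionary lookup: split the input on whitespace and check
// each token against a hash set. The tokenization rule (split on
// whitespace) is an assumption for illustration.
class DictScan {
    static List<String> findKnownWords(Set<String> dictionary, String text) {
        List<String> hits = new ArrayList<>();
        for (String token : text.split("\\s+"))
            if (dictionary.contains(token))
                hits.add(token);
        return hits;
    }
}
```

In practice you would feed it each line of each file; the lookup cost per token stays constant as the dictionary grows, though unlike an automaton it cannot match a dictionary word that appears as a substring of a larger token.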

NoBugs
  • Sorry, this will be far too slow. I'm looking for an algorithm that can read a text stream with constant search cost (it does not increase as the number of dictionary words increases). I think what I might be looking for is a Perfect Hash function. http://en.wikipedia.org/wiki/Perfect_hash_function – Tony O'Hagan Jul 14 '11 at 03:24
  • This is still not as good as a DFA approach that can simply read a text stream as a byte sequence and emit match events. – Tony O'Hagan Jul 14 '11 at 04:11
  • Ahh, looks like [fgrep](http://ss64.com/bash/fgrep.html) does the job. It implements the [Aho–Corasick string matching algorithm](http://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_string_matching_algorithm). – Tony O'Hagan Jul 14 '11 at 04:25