
I am working with GATE (a Java-based NLP framework) and want to find words that partially match entries in a dictionary. For example, I have a disease dictionary with the following terms:

Congestive cardiac failure
Congestive Heart Failure
Colon Cancer
...
(thousands more terms)

Let's assume I have the string "Father had cardiac failure last year". From this string I want to identify "cardiac failure" as a partial match, because it occurs as part of a term in the dictionary.

I have seen some discussions on similar subjects for Python, JS and C#, but I am not sure what would help in this case. I wonder if I can use Aho-Corasick here.

Sap
  • @eowl hey thanks, i just committed to the proposal, but as of now I can not post my question over there, correct? – Sap Jan 10 '12 at 06:15
  • @eowl I used the same login as SO and thus it shows me the same name but not the same score. – Sap Jan 10 '12 at 12:28

3 Answers


Maybe you should use Lucene. Treat each line of the dictionary as a document, and each sentence in the text as a query.

cyborg
  • I thought of that, but I am sure that this technique will need a lot of iterations and match quality will not be great. As of now I am already using Lucene with the same dictionary for free text search and I get vivid results when I make a query like "knee pain" – Sap Jan 06 '12 at 11:29

One question that arises is which substrings you want to include in the search. If you included all substrings, just "Heart" would also be a match, but that is not really a disease. Maybe all right-aligned (word-)substrings (perhaps with length > 1) would be acceptable.
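To make the "right-aligned word substrings" idea concrete, here is a minimal sketch in plain Java. It assumes that "length > 1" is counted in words, and the class and method names (`SuffixIndex`, `build`) are made up for illustration; they are not part of GATE. Each lower-cased word suffix is mapped back to the dictionary terms it came from.

```java
import java.util.*;

class SuffixIndex {
    /**
     * Maps every lower-cased word suffix of at least minWords words
     * to the set of dictionary terms that end with it.
     */
    static Map<String, Set<String>> build(List<String> terms, int minWords) {
        Map<String, Set<String>> index = new HashMap<>();
        for (String term : terms) {
            String[] words = term.trim().toLowerCase(Locale.ROOT).split("\\s+");
            // start = 0 yields the full term; larger values drop leading words
            for (int start = 0; start + minWords <= words.length; start++) {
                String suffix = String.join(" ",
                        Arrays.copyOfRange(words, start, words.length));
                index.computeIfAbsent(suffix, k -> new TreeSet<>()).add(term);
            }
        }
        return index;
    }

    public static void main(String[] args) {
        List<String> terms = Arrays.asList(
                "Congestive cardiac failure",
                "Congestive Heart Failure",
                "Colon Cancer");
        // Print each suffix together with the terms it was derived from
        build(terms, 2).forEach((suffix, sources) ->
                System.out.println(suffix + " <- " + sources));
    }
}
```

With `minWords` set to 2, "cardiac failure" and "heart failure" end up in the index, while single words such as "failure" or "heart" do not.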

So one thing you could do is to train the Aho-Corasick pattern matcher with the substrings you want to include. To keep track of which dictionary term each substring came from, you would probably need to modify the algorithm a bit (if keeping that information is important) or build another data structure to look it up afterwards.

In any case, I would convert the disease list and the documents you want to search to lower case before training/matching. If there is a chance of misspellings, there are also papers on fuzzy Aho-Corasick automata.
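As a rough illustration of the matching step, here is a small, self-contained Aho-Corasick sketch in plain Java. It is not GATE's API and not a tuned implementation; the class names (`AhoCorasick`, `Match`) and the word-boundary check are assumptions added for this example. Patterns and text are both lower-cased, as suggested above, and a hit is reported only when it starts and ends on a word boundary.

```java
import java.util.*;

class AhoCorasick {
    private static final class Node {
        final Map<Character, Node> next = new HashMap<>();
        Node fail;                                   // longest proper suffix that is also in the trie
        final List<String> outputs = new ArrayList<>();
    }

    /** A hit: the pattern and its [start, end) offsets in the lower-cased text. */
    static final class Match {
        final String pattern;
        final int start;
        final int end;
        Match(String pattern, int start, int end) {
            this.pattern = pattern; this.start = start; this.end = end;
        }
    }

    private final Node root = new Node();

    AhoCorasick(Collection<String> patterns) {
        // Phase 1: insert every lower-cased pattern into the trie
        for (String p : patterns) {
            String key = p.toLowerCase(Locale.ROOT);
            if (key.isEmpty()) continue;
            Node node = root;
            for (char c : key.toCharArray()) {
                node = node.next.computeIfAbsent(c, k -> new Node());
            }
            if (!node.outputs.contains(key)) node.outputs.add(key);
        }
        // Phase 2: breadth-first pass to set failure links and inherit outputs
        Deque<Node> queue = new ArrayDeque<>();
        for (Node child : root.next.values()) {
            child.fail = root;
            queue.add(child);
        }
        while (!queue.isEmpty()) {
            Node node = queue.remove();
            for (Map.Entry<Character, Node> e : node.next.entrySet()) {
                Node child = e.getValue();
                Node f = node.fail;
                while (f != root && !f.next.containsKey(e.getKey())) f = f.fail;
                Node target = f.next.get(e.getKey());
                child.fail = (target != null) ? target : root;
                child.outputs.addAll(child.fail.outputs);
                queue.add(child);
            }
        }
    }

    /** Scans the text once and returns all patterns found on word boundaries. */
    List<Match> findAll(String text) {
        String lower = text.toLowerCase(Locale.ROOT);
        List<Match> matches = new ArrayList<>();
        Node node = root;
        for (int i = 0; i < lower.length(); i++) {
            char c = lower.charAt(i);
            while (node != root && !node.next.containsKey(c)) node = node.fail;
            Node step = node.next.get(c);
            node = (step != null) ? step : root;
            for (String p : node.outputs) {
                int start = i + 1 - p.length();
                if (isBoundary(lower, start - 1) && isBoundary(lower, i + 1)) {
                    matches.add(new Match(p, start, i + 1));
                }
            }
        }
        return matches;
    }

    // True if idx is outside the string or points at a non-alphanumeric character
    private static boolean isBoundary(String s, int idx) {
        return idx < 0 || idx >= s.length() || !Character.isLetterOrDigit(s.charAt(idx));
    }

    public static void main(String[] args) {
        AhoCorasick matcher = new AhoCorasick(Arrays.asList(
                "congestive cardiac failure", "cardiac failure", "heart failure", "colon cancer"));
        for (Match m : matcher.findAll("Father had cardiac failure last year")) {
            System.out.println(m.pattern + " @ " + m.start + "-" + m.end);
        }
    }
}
```

The patterns would be whichever substrings you decided to include. To recover the original dictionary term for a hit, you could look up `Match.pattern` in a separate map from substring to source terms, which keeps the automaton itself unmodified.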

tobigue

The UIMA Concept Mapper annotator add-on includes functionality similar to what you are looking for. You may consider:

dedek
zdepablo