I have around 500-1000 entities, each with a name and string content. To find how these entities are connected, every name has to be searched for in every content field. Entities can be edited, so I might have to rebuild the connections for an edited entity by searching for its name in all content fields again.

Exact string matching (.indexOf or .contains) is not an option as there are further rules:

  • names can consist of multiple words and predefined special chars (_, /, -, ,, ...)
  • names may be surrounded by special chars and will still be recognized
  • names may be followed by predefined plural endings (s, es, ...) and will still be recognized

Example names: fine apple juice, apple, app, _n, n

Example content: apps are like fine apple juice_n

This content matches all example names.

edit: Clarification on rule 2: a match must not be part of something like "appxxy" or other gibberish; it must be a word (or words) delimited by blanks or the special chars.
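To make the three rules concrete, here is a minimal regex sketch in Java. It is only one possible reading of the rules, not a tested solution: the class name `NameMatcher`, the delimiter set (whitespace plus `_ / - ,`) and the plural endings (`s`, `es`) are assumptions standing in for the "predefined" lists, and names are assumed to be non-empty. The boundary check is skipped on the left when the name itself starts with a special char, so that `_n` can still match inside `juice_n`.

```java
import java.util.regex.Pattern;

final class NameMatcher {
    // Assumption: whitespace plus the special chars listed in the question act as delimiters.
    private static final String DELIMS = "\\s_/,\\-";
    // Assumption: the allowed plural endings; the longer ending is tried first.
    private static final String PLURALS = "(?:es|s)?";

    private static boolean isDelimiter(char c) {
        return Character.isWhitespace(c) || "_/,-".indexOf(c) >= 0;
    }

    // Builds a pattern that matches the name only as a delimited word,
    // optionally followed by one plural ending.
    public static Pattern patternFor(String name) {
        // "Not preceded by a non-delimiter" also succeeds at the start of the input.
        // If the name itself begins with a delimiter (e.g. "_n"), no left check is needed.
        String left = isDelimiter(name.charAt(0)) ? "" : "(?<![^" + DELIMS + "])";
        // "Not followed by a non-delimiter" also succeeds at the end of the input.
        String right = PLURALS + "(?![^" + DELIMS + "])";
        return Pattern.compile(left + Pattern.quote(name) + right);
    }

    public static boolean mentions(String content, String name) {
        return patternFor(name).matcher(content).find();
    }

    public static void main(String[] args) {
        String content = "apps are like fine apple juice_n";
        String[] names = {"fine apple juice", "apple", "app", "_n", "n"};
        for (String name : names) {
            System.out.println(name + " -> " + mentions(content, name));
        }
        System.out.println("app in appxxy -> " + mentions("appxxy", "app"));
    }
}
```

With these assumptions, `main` reports `true` for each of the five example names and `false` for `app` in `appxxy`. In a real run the patterns would presumably be compiled once per name and reused across all content fields, and recompiled only for an entity whose name was edited.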

I have looked through various possible solutions such as Aho-Corasick, using regex, string-search, regex pattern, Apache Lucene or using a custom Scanner with WordDetector. However, I'm lost when choosing which one is best suited to my purpose and performs best, as I'm not very experienced in programming.

  • for 500-1000 entries I would go with a regexp. –  Mar 16 '14 at 11:49
  • That's the number of entities, yes. I hope you are aware, though, that to find all connections for 1000 entities I have to run 1000*1000 searches (every name against every content field), where each content field has up to 1000 characters. –  Mar 16 '14 at 12:10
  • Because of the phrasing rules you've described, I would suggest Aho-Corasick as your best bet. While building the structure the first time is onerous, the update is pretty quick. That being said, it comes down to the implementation in your environment. There is no substitute for testing. – Bob Dalgleish Mar 16 '14 at 12:23
  • @BobDalgleish: It's linear time, how much better can it get? oO – Niklas B. Mar 16 '14 at 15:53
  • @BobDalgleish: I looked at the two Aho-Corasick implementations linked on Wikipedia. I can't see a way to customize them for any of my rules. Could you give an example of how to find a name "app" followed by an allowed plural ending ("apps") but NOT get a result on unallowed endings ("appxxy")? Also, I edited the original question a bit. –  Mar 16 '14 at 16:32
  • @user2762016 Just add an additional node to the automaton. A simpler way would be to just insert both `app` and `apps`. You can introduce a custom "boundary" character to your string and add all words to the automaton surrounded by that character (`@app@` instead of `app`) – Niklas B. Mar 16 '14 at 17:30
  • @Niklas B.: Thanks for your help. The only thing I understand is adding all words to the trie, which would turn a simple "app" into hundreds of words (e.g. -app-, -app_, -apps, ...). Isn't this VERY inefficient? If I have misunderstood, an example would be appreciated. –  Mar 16 '14 at 19:39
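
One way to read the boundary-character suggestion from the comments without the blow-up into hundreds of variants: normalize the text so that every delimiter becomes the same boundary character, then only the name and its plural forms need to be stored. The sketch below is hypothetical (the class `BoundaryNormalizer`, the `@` boundary, the delimiter set and the plural endings are all assumptions), and it uses `String.contains` as a plain stand-in for an Aho-Corasick automaton, which would take the same keywords.

```java
import java.util.ArrayList;
import java.util.List;

final class BoundaryNormalizer {
    // Assumption: '@' never occurs in names or content.
    private static final char BOUNDARY = '@';
    // Assumption: the allowed plural endings.
    private static final String[] PLURALS = {"s", "es"};

    private static boolean isDelimiter(char c) {
        // Assumption: whitespace plus the special chars from the question.
        return Character.isWhitespace(c) || "_/,-".indexOf(c) >= 0;
    }

    // Replaces every run of delimiters with a single boundary char and
    // wraps the whole text in boundaries: "juice_n" -> "@juice@n@".
    public static String normalize(String text) {
        StringBuilder sb = new StringBuilder().append(BOUNDARY);
        for (char c : text.toCharArray()) {
            char out = isDelimiter(c) ? BOUNDARY : c;
            if (out == BOUNDARY && sb.charAt(sb.length() - 1) == BOUNDARY) {
                continue; // collapse consecutive boundaries
            }
            sb.append(out);
        }
        if (sb.charAt(sb.length() - 1) != BOUNDARY) {
            sb.append(BOUNDARY);
        }
        return sb.toString();
    }

    // Keywords for one name: the bare form plus one per plural ending,
    // e.g. "app" -> "@app@", "@apps@", "@appes@".
    public static List<String> keywords(String name) {
        String wrapped = normalize(name);
        String stem = wrapped.substring(0, wrapped.length() - 1);
        List<String> result = new ArrayList<>();
        result.add(stem + BOUNDARY);
        for (String plural : PLURALS) {
            result.add(stem + plural + BOUNDARY);
        }
        return result;
    }

    // Stand-in for the automaton lookup: true if any keyword occurs.
    public static boolean mentions(String normalizedContent, String name) {
        for (String keyword : keywords(name)) {
            if (normalizedContent.contains(keyword)) {
                return true;
            }
        }
        return false;
    }
}
```

Under these assumptions each name contributes one keyword per plural ending plus one, rather than one per delimiter combination. The trade-off is that normalization erases the difference between delimiters, so `_n` and `n` produce the same keyword and runs of several delimiters are treated as one.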

0 Answers