I have around 500-1000 entities, each with a name and a string content field. To find out how these entities are connected, every name has to be searched for in every content field. Entities can be edited, so when one changes I may have to rebuild its connections by searching for its name in all content fields again.
Exact string matching (.indexOf or .contains) is not an option as there are further rules:
- names can consist of multiple words and predefined special chars (_, /, -, ,, ...)
- names may be surrounded by special chars and will still be recognized
- names may be followed by predefined plural endings (s, es, ...) and will still be recognized
Example names: fine apple juice, apple, app, _n, n
Example content: apps are like fine apple juice_n
This content matches all of the example names.
Edit: clarification on rule 2: a name must not match inside something like "appxxy" or other gibberish; it only counts when it appears as a word (or words) separated by blanks (or the special chars).
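To make the rules concrete, here is a minimal regex-based sketch of one way to read them. It is an illustration under assumptions, not a recommendation of regex over the other options: the class name `NameMatcher` and its methods are hypothetical, the plural endings are limited to the two listed (`s`, `es`), and any character that is not a letter or digit is treated as a separator, which is broader than a predefined list of special chars. It also assumes that a name starting with a special char (such as `_n`) may directly follow a word, since the example content is expected to match `_n`.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

final class NameMatcher {
    // Assumption: only the two plural endings named in the rules; extend as needed.
    private static final String PLURAL = "(?:es|s)?";
    // A letter or digit directly next to a match would make it part of a longer word.
    private static final String WORD_CHAR = "[\\p{L}\\p{N}]";

    // One precompiled pattern per name, kept in insertion order.
    private final Map<String, Pattern> patterns = new LinkedHashMap<>();

    NameMatcher(List<String> names) {
        for (String name : names) {
            patterns.put(name, compile(name));
        }
    }

    // Builds the pattern for a single (non-empty) name.
    static Pattern compile(String name) {
        // A name that starts with a special char brings its own separator, so the
        // left-hand check only applies when the name starts with a letter or digit.
        String left = Character.isLetterOrDigit(name.charAt(0))
                ? "(?<!" + WORD_CHAR + ")"
                : "";
        // Pattern.quote keeps blanks and special chars in the name literal.
        return Pattern.compile(left + Pattern.quote(name) + PLURAL + "(?!" + WORD_CHAR + ")");
    }

    // Returns every known name that occurs in the given content.
    List<String> namesIn(String content) {
        List<String> found = new ArrayList<>();
        for (Map.Entry<String, Pattern> e : patterns.entrySet()) {
            if (e.getValue().matcher(content).find()) {
                found.add(e.getKey());
            }
        }
        return found;
    }

    public static void main(String[] args) {
        NameMatcher m = new NameMatcher(
                List.of("fine apple juice", "apple", "app", "_n", "n"));
        // prints [fine apple juice, apple, app, _n, n]
        System.out.println(m.namesIn("apps are like fine apple juice_n"));
        // prints []
        System.out.println(m.namesIn("appxxy"));
    }
}
```

With this reading, rebuilding the connections for an edited entity means compiling one pattern for its name and running it over all content fields, plus running all existing patterns over its own new content.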
I have looked through various possible solutions, such as Aho-Corasick, using regex, string-search, regex pattern, Apache Lucene, or using a custom Scanner with WordDetector. However, I'm lost when it comes to choosing which one is best suited to my purpose and performs best, as I'm not very experienced in programming.