I'm using Java, and I have a fairly large set (~15,000) of keywords (strings) and a document (a string) in which these keywords occur repeatedly.
I'd like to find the index of each occurrence of a keyword in the document, preferring longer keywords (the ones with the most characters) when several match at the same position. For example, if my keywords were "water", "bottle", "drank", and "water bottle", and my document were "I drank from my water bottle", I'd like a result of:
2 drank
16 water bottle
My initial attempt was to build a trie from the keywords and walk through the document character by character, recording the starting index whenever a substring matched a keyword. However, some of the keywords are prefixes of longer keywords (for example, "water" and "water bottle"), and my code never finds the longer one: it records the index of "water" as soon as it matches and then starts over from the root.
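To make the desired behaviour concrete, here is a minimal sketch of what I think a longest-match version of the trie scan would look like. Instead of stopping at the first keyword, it keeps walking the trie and remembers the end of the longest keyword seen so far. The class and method names (`LongestKeywordMatcher`, `findAll`, `Match`) are just ones I made up for illustration, and it assumes matches must not overlap and that no word-boundary check is needed, which may not be what I ultimately want.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: a trie-based, leftmost-longest keyword scanner.
final class LongestKeywordMatcher {

    /** One trie node; 'terminal' marks the end of a complete keyword. */
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean terminal;
    }

    /** A match: start index in the document plus the keyword found there. */
    static final class Match {
        final int index;
        final String keyword;

        Match(int index, String keyword) {
            this.index = index;
            this.keyword = keyword;
        }

        @Override
        public String toString() {
            return index + " " + keyword;
        }
    }

    private final Node root = new Node();

    LongestKeywordMatcher(Iterable<String> keywords) {
        for (String keyword : keywords) {
            if (keyword.isEmpty()) {
                continue; // an empty keyword would match everywhere
            }
            Node node = root;
            for (int i = 0; i < keyword.length(); i++) {
                node = node.children.computeIfAbsent(keyword.charAt(i), c -> new Node());
            }
            node.terminal = true;
        }
    }

    /** Returns non-overlapping matches, taking the longest keyword at each start position. */
    List<Match> findAll(String document) {
        List<Match> matches = new ArrayList<>();
        int start = 0;
        while (start < document.length()) {
            Node node = root;
            int longestEnd = -1; // exclusive end of the longest keyword starting at 'start'
            for (int i = start; i < document.length(); i++) {
                node = node.children.get(document.charAt(i));
                if (node == null) {
                    break; // no keyword continues with this character
                }
                if (node.terminal) {
                    longestEnd = i + 1; // remember it, but keep looking for a longer one
                }
            }
            if (longestEnd < 0) {
                start++; // nothing starts here; try the next character
            } else {
                matches.add(new Match(start, document.substring(start, longestEnd)));
                start = longestEnd; // resume after the match so matches don't overlap
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        LongestKeywordMatcher matcher =
                new LongestKeywordMatcher(List.of("water", "bottle", "drank", "water bottle"));
        for (Match m : matcher.findAll("I drank from my water bottle")) {
            System.out.println(m); // prints "2 drank" then "16 water bottle"
        }
    }
}
```

As written, this would also match a keyword in the middle of a longer word (for example "water" inside "underwater"), and in the worst case it rescans up to the longest keyword's length from every position, so I'm not sure it is the right approach for 15,000 keywords.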
If it matters, the keywords may contain lowercase letters, uppercase letters, spaces, hyphens, and apostrophes, and matching is case-sensitive.
So, any help in finding the longest keywords would be much appreciated. Thanks.