How can I tokenize a text efficiently?

Question

Given a text (T) and a dictionary (D), how can I find all words that occur in the text?

A1. One can assume that characters in T repeat only rarely, for example because T is in Chinese.

A2. Iterating over D is, as one may suspect, costly. D should therefore be preprocessed or broken down; put simply, multiple iterations over D should be avoided.

A3. The maximum length of a word is L, which is small compared to the length of the text.

B1. The simplest solution might be to iterate over D for every substring of sensible length (at most L) in T. This method would guarantee that all words are found. However, it seems vastly inefficient.
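To make B1 concrete, here is a minimal sketch of that brute-force baseline. The function name `find_words_naive` and the parameter `max_len` (standing in for L) are my own placeholders, and the sample dictionary is made up for illustration.

```python
def find_words_naive(text, dictionary, max_len):
    """Brute force as in B1: scan all of D for every substring of length <= max_len."""
    found = set()
    for start in range(len(text)):
        for length in range(1, max_len + 1):
            end = start + length
            if end > len(text):
                break
            candidate = text[start:end]
            # Linear scan over D for every candidate: this is the costly part.
            for word in dictionary:
                if word == candidate:
                    found.add(word)
                    break
    return found


sample_dictionary = ["十三", "十三五", "规划", "三五", "你好"]  # hypothetical D
words = find_words_naive("十三五规划", sample_dictionary, max_len=3)
```

The cost is roughly |T| · L · |D| string comparisons, which is why the scan over D is the part worth replacing.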

B2. Another idea would be to iterate over the text once, collect all characters that occur in T into a set, and then proceed as in B1 to find all words.

B3. This variation would work like B2, but would require (or assert) that D is in lexicographical order. That means it would only check words with the same starting character. I could possibly also use a lookahead on the characters following my current character in T. I would iterate over D just once and over T multiple times, which seems bearable.
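One possible reading of B3, sketched under my own assumptions: keep D as a sorted list and binary-search each candidate substring instead of scanning D. This is not exactly the single pass over D described above. It relies on the fact that, in a sorted list, all words sharing a prefix are contiguous, so extending a substring can stop as soon as no word starts with it. The names `find_words_sorted` and `max_len` are placeholders.

```python
from bisect import bisect_left


def find_words_sorted(text, dictionary, max_len):
    """Look up substrings of text in a lexicographically sorted copy of D."""
    sorted_words = sorted(set(dictionary))
    found = set()
    for start in range(len(text)):
        last_end = min(start + max_len, len(text))
        for end in range(start + 1, last_end + 1):
            candidate = text[start:end]
            # First word >= candidate; any word with this prefix must begin here.
            idx = bisect_left(sorted_words, candidate)
            if idx == len(sorted_words) or not sorted_words[idx].startswith(candidate):
                break  # no dictionary word starts with this prefix, stop extending
            if sorted_words[idx] == candidate:
                found.add(candidate)
    return found


sample_dictionary = ["十三", "十三五", "规划", "三五", "你好"]  # hypothetical D
words = find_words_sorted("十三五规划", sample_dictionary, max_len=3)
```

Each lookup costs O(log |D|) comparisons rather than O(|D|), and the prefix check plays the role of the lookahead mentioned in B3.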

B4. Here I would also proceed as in B3, but reorder D so that words more likely to occur in T are checked earlier. The problem here: how do I find out which words are more likely to occur? I would first have to digest a lot more data, and then be sure that what I measure is actually what I want to measure.

Surely, there are many other possibilities, likely more sophisticated ones. But what is the current state of the art? How can one do this / approach this problem best?

Hashmaps. However, this question does not seem appropriate for SO as it is asking for an opinion from many possible solutions — Paul S., May 26 '18 at 14:12
Not really, I am not interested in anyone's favourite algorithm, however in how this is doable. If you consider any algorithm opinion based, then there wouldn't be much legit questions/answers left here. — Imago, May 26 '18 at 14:15
Process the dictionary and convert to a data structure where strings can be looked up efficiently, such as a [trie](https://en.wikipedia.org/wiki/Trie) or a hash table. Note that in actual Chinese text words are not separated by spaces, which makes the problem much more interesting; there is no way to parse a Chinese text into words without an understanding of Chinese -- is 十三五 (shi san wu) one word, two words or three words? — AlexP, May 26 '18 at 14:25
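
For reference, the trie suggestion from the last comment could look roughly like this. It is a minimal sketch using nested dicts; `END`, `build_trie` and `find_words_trie` are names I made up, and it reports every dictionary word at every position, without trying to decide which segmentation is the right one.

```python
END = None  # sentinel key marking "a word ends here"; never collides with a character


def build_trie(dictionary):
    """Build a nested-dict trie from the words in D (done once, up front)."""
    root = {}
    for word in dictionary:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = True
    return root


def find_words_trie(text, trie):
    """Return (start_index, word) for every dictionary word occurring in text."""
    matches = []
    for start in range(len(text)):
        node = trie
        for end in range(start, len(text)):
            node = node.get(text[end])
            if node is None:
                break  # no dictionary word continues with this character
            if END in node:
                matches.append((start, text[start:end + 1]))
    return matches


sample_dictionary = ["十三", "十三五", "规划", "三五", "你好"]  # hypothetical D
trie = build_trie(sample_dictionary)
matches = find_words_trie("十三五规划", trie)
```

The inner loop walks at most L characters from each start position, so the scan costs about |T| · L dictionary steps regardless of the size of D. The Aho-Corasick algorithm builds on the same structure to avoid restarting from the root at every position.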
