Let's say I have a list of phrases in my vocabulary:
- University of Texas Dallas
- University of Tokyo
- University of Toronto
and three documents:
- doc1: I study at University of Texas Dallas and it's awesome
- doc2: I study at University of Tokyo and it's awesome
- doc3: I study at University of Toronto and it's awesome

A whitespace tokenizer (with lowercasing) would produce the following tokens in the index:
- doc1: ["i", "study", "at", "university", "of", "texas", "dallas", "and", "it's", "awesome"]
- doc2: ["i", "study", "at", "university", "of", "tokyo", "and", "it's", "awesome"]
- doc3: ["i", "study", "at", "university", "of", "toronto", "and", "it's", "awesome"]

However, since I have a known vocabulary, I would like to tokenize at the phrase level and get:
- doc1: ["i", "study", "at", "university of texas dallas", "and", "it's", "awesome"]
- doc2: ["i", "study", "at", "university of tokyo", "and", "it's", "awesome"]
- doc3: ["i", "study", "at", "university of toronto", "and", "it's", "awesome"]
How can I achieve this kind of phrase tokenization, given a known list of vocabulary phrases?
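For context, here is a minimal sketch of the behavior I'm after in plain Python: a greedy longest-match pass over whitespace-split words, trying longer vocabulary phrases before shorter ones. The `phrase_tokenize` helper is hypothetical (not from any library), just to make the desired output concrete:

```python
def phrase_tokenize(text, vocabulary):
    """Greedy longest-match phrase tokenizer over whitespace-split words."""
    # Pre-split each vocabulary phrase into a lowercase word tuple,
    # longest phrases first so overlapping shorter phrases don't win.
    phrases = sorted(
        (tuple(p.lower().split()) for p in vocabulary),
        key=len,
        reverse=True,
    )
    words = text.lower().split()
    tokens, i = [], 0
    while i < len(words):
        for phrase in phrases:
            if tuple(words[i:i + len(phrase)]) == phrase:
                tokens.append(" ".join(phrase))  # emit the phrase as one token
                i += len(phrase)
                break
        else:  # no vocabulary phrase starts at position i
            tokens.append(words[i])
            i += 1
    return tokens


vocabulary = [
    "University of Texas Dallas",
    "University of Tokyo",
    "University of Toronto",
]
print(phrase_tokenize("I study at University of Texas Dallas and it's awesome", vocabulary))
# ['i', 'study', 'at', 'university of texas dallas', 'and', "it's", 'awesome']
```

(If you use NLTK, its `MWETokenizer` implements the same multi-word-expression idea.) I'm mainly asking whether there is a standard analyzer/tokenizer-level way to do this rather than post-processing tokens myself.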