
Basically, let's say I have a list of phrases in a vocabulary:

- University of Texas Dallas
- University of Tokyo
- University of Toronto

Let's say I have 3 documents:

- doc1: I study at University of Texas Dallas and its awsome 
- doc2: I study at University of Tokyo and its awsome            
- doc3: I study at University of Toronto and its awsome

Using a whitespace tokenizer, the following tokens would be identified in the index:

- doc1: ["i", "study", "at", "university", "of", "texas", "dallas", "and", "its", "awsome"]
- doc2: ["i", "study", "at", "university", "of", "tokyo", "and", "its", "awsome"]
- doc3: ["i", "study", "at", "university", "of", "toronto", "and", "its", "awsome"]
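The whitespace tokenization above (plus lowercasing) can be sketched in plain Python; `whitespace_tokenize` here is just an illustration, not a search-engine API:

```python
def whitespace_tokenize(text):
    """Lowercase the text and split it on whitespace."""
    return text.lower().split()

doc1 = "I study at University of Texas Dallas and its awsome"
print(whitespace_tokenize(doc1))
# → ['i', 'study', 'at', 'university', 'of', 'texas', 'dallas', 'and', 'its', 'awsome']
```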

However, since I have a known vocabulary, I would like to tokenize by phrase and get the following:

- doc1: ["i", "study", "at", "university of texas dallas", "and", "its", "awsome"]
- doc2: ["i", "study", "at", "university of tokyo", "and", "its", "awsome"]
- doc3: ["i", "study", "at", "university of toronto", "and", "its", "awsome"]

How do I achieve phrase tokenization given I have a list of phrases in vocabulary?
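Independent of any particular search engine, the behavior I'm after can be sketched as a greedy longest-match tokenizer over the vocabulary (`phrase_tokenize` is a hypothetical helper for illustration, not a built-in analyzer):

```python
def phrase_tokenize(text, vocabulary):
    """Tokenize on whitespace, but emit any vocabulary phrase as a single token.

    Phrases are tried longest-first so a longer phrase wins over a shorter
    one that shares a prefix.
    """
    phrases = sorted((p.lower().split() for p in vocabulary),
                     key=len, reverse=True)
    words = text.lower().split()
    tokens, i = [], 0
    while i < len(words):
        for phrase in phrases:
            # Greedy match: does the phrase start at position i?
            if words[i:i + len(phrase)] == phrase:
                tokens.append(" ".join(phrase))
                i += len(phrase)
                break
        else:
            # No phrase matched here; emit the single word.
            tokens.append(words[i])
            i += 1
    return tokens

vocab = ["University of Texas Dallas",
         "University of Tokyo",
         "University of Toronto"]
print(phrase_tokenize("I study at University of Tokyo and its awsome", vocab))
# → ['i', 'study', 'at', 'university of tokyo', 'and', 'its', 'awsome']
```

Indexed this way, a query for just "tokyo" no longer matches the single term "university of tokyo".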

Kaushik J
  • Just curious, why do you need to index it as a single term? Phrase/Span queries should solve the querying problem. – Nirmal Jun 02 '20 at 19:18
  • If I do not index it as a single token, then when I search for "Toronto" I get the result "university of Toronto"; similarly, when I search for "texas" I get the result "university of texas Dallas". My application needs to avoid these results. PS: sorry for the poor example – Kaushik J Jun 03 '20 at 06:41
  • @Nirmal Basically, I don't want to match a subset of the words in a vocabulary phrase. It should match only the full phrase "University of Tokyo", not "Tokyo" alone. – Kaushik J Jun 03 '20 at 06:43
  • I am not sure if there is a built-in tokenizer that does this based on a dictionary. This may be a good reference - https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-tokenizers.html – Nirmal Jun 04 '20 at 18:02

0 Answers