I'm creating a search engine to search a list of roughly 20k English phrases, each one being a few words long.

I've looked into ways to build the search engine, and I'm currently using sklearn's TfidfVectorizer with cosine similarity to compute the ranking scores.
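
For reference, a minimal sketch of that kind of setup might look like the following. The phrases, the `search` helper, and the `top_k` parameter are placeholders made up for illustration, not the real data or code.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Placeholder phrases standing in for the ~20k real ones (assumption).
phrases = [
    "machine learning model",
    "deep learning network",
    "search engine ranking",
    "inverted index lookup",
]

vectorizer = TfidfVectorizer()
# Sparse TF-IDF matrix with one row per phrase and one column per term.
doc_matrix = vectorizer.fit_transform(phrases)


def search(query, top_k=3):
    """Return up to top_k (phrase, score) pairs ranked by cosine similarity."""
    query_vec = vectorizer.transform([query])
    # Compares the query against every row, i.e. every phrase.
    scores = cosine_similarity(query_vec, doc_matrix).ravel()
    ranked = scores.argsort()[::-1][:top_k]
    # Drop phrases that share no term with the query.
    return [(phrases[i], float(scores[i])) for i in ranked if scores[i] > 0]
```

Written this way, every query is scored against all rows of the matrix, which is the step an index would be expected to narrow down.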

From what I understand, information retrieval has separate retrieval and ranking phases. However, I'm confused about how a data structure like an inverted index could be used to speed up the search before applying TfidfVectorizer. It seems that TfidfVectorizer creates a term-document matrix, which is different from an index. Could you just store the TF and IDF values in an inverted index and compute cosine similarity at run time? Ideally I want autocomplete of phrases, so I would need to store edge n-grams as well, and a boolean model isn't useful here.
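
To make the idea concrete, here is a rough sketch of what storing TF-IDF weights in an inverted index could look like, using only the standard library. Everything in it is an assumption for illustration: the `TfidfInvertedIndex` class, the `edge_ngrams` helper, the `min_n` default, whitespace tokenisation, and the smoothed IDF formula (chosen to mirror the one sklearn documents for `smooth_idf=True`). Edge n-grams are generated only at index time, so a partially typed query word matches any phrase containing a word with that prefix.

```python
import math
from collections import Counter, defaultdict


def edge_ngrams(word, min_n=2):
    """Prefixes of `word` from length min_n up to the whole word."""
    return [word[:n] for n in range(min(min_n, len(word)), len(word) + 1)]


def index_terms(phrase):
    """Lowercase, split on whitespace, and expand each word into edge n-grams."""
    return [g for w in phrase.lower().split() for g in edge_ngrams(w)]


class TfidfInvertedIndex:
    def __init__(self, phrases):
        self.phrases = list(phrases)
        n_docs = len(self.phrases)
        term_counts = [Counter(index_terms(p)) for p in self.phrases]

        # Document frequency per term, then a smoothed IDF.
        df = Counter(t for counts in term_counts for t in counts)
        self.idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}

        # Postings: term -> list of (doc_id, L2-normalised TF-IDF weight).
        self.postings = defaultdict(list)
        for doc_id, counts in enumerate(term_counts):
            weights = {t: tf * self.idf[t] for t, tf in counts.items()}
            norm = math.sqrt(sum(w * w for w in weights.values()))
            for t, w in weights.items():
                self.postings[t].append((doc_id, w / norm))

    def search(self, query, top_k=5):
        """Rank only the phrases that share at least one term with the query."""
        q_counts = Counter(query.lower().split())
        q_weights = {t: tf * self.idf[t] for t, tf in q_counts.items() if t in self.idf}
        q_norm = math.sqrt(sum(w * w for w in q_weights.values()))
        if q_norm == 0:
            return []

        # Accumulate dot products by walking the postings of the query terms.
        scores = defaultdict(float)
        for t, qw in q_weights.items():
            for doc_id, dw in self.postings[t]:
                scores[doc_id] += qw * dw

        # Document weights are already normalised, so dividing by the
        # query norm yields the cosine similarity.
        ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]
        return [(self.phrases[d], s / q_norm) for d, s in ranked]
```

With this layout, something like `TfidfInvertedIndex(phrases).search("sea")` would only touch the postings list for the term "sea", so retrieval and ranking happen in one pass over the candidate phrases rather than over the full matrix.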
