I am trying to build a user-query-based autosuggest. I have a bunch of aggregated queries like:
QUERY COUNT
"harry potter" 100
"iron man" 93
"harry pott" 32
"harr pott" 5
with around 200,000 rows. As you can see, some users make extensive use of prefix search, typing only the first letters of each word. Queries like those in the example should be aggregated into the full "harry potter" row.
Now, assuming that the majority of users search with full words, I think I can do that aggregation efficiently (avoiding a nested for-loop over the whole index) in the following way:
I sort the tokens in each query alphabetically and generate a map "first_token" from prefixes to documents, like:
"h" "harry potter"
"ha" "harry potter"
"har" "harry potter"
"harr" "harry potter"
"harry" "harry potter"
and a corresponding map "second_token", and so forth:
"p" "harry potter"
"po" "harry potter"
"pot" "harry potter"
"pott" "harry potter"
"potte" "harry potter"
"potter" "harry potter"
Then I iterate from top to bottom, and for each element like "harr pott" I check whether there is an entry in both "first_token" and "second_token" pointing to the same document, e.g. "harry potter", such that this document is not identical to the original ("harr pott") and has a higher score. In that case I aggregate the two. The runtime of this should be O(index_size * max_number_of_tokens).
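To make the idea concrete, here is a minimal sketch of what I have in mind, using plain dicts in place of real trie maps (the sample data, variable names, and the tie-breaking by highest count are my own assumptions, not a finished implementation):

```python
from collections import defaultdict

# Hypothetical sample of aggregated (query, count) rows, sorted by count descending.
queries = [
    ("harry potter", 100),
    ("iron man", 93),
    ("harry pott", 32),
    ("harr pott", 5),
]
counts = dict(queries)

# One prefix map per token position:
# prefix_maps[i][p] = set of queries whose i-th (alphabetically sorted) token starts with p.
prefix_maps = defaultdict(lambda: defaultdict(set))
for query, _ in queries:
    for i, token in enumerate(sorted(query.split())):
        for j in range(1, len(token) + 1):
            prefix_maps[i][token[:j]].add(query)

aggregated = dict(counts)
for query, count in queries:
    tokens = sorted(query.split())
    # Queries whose i-th token extends our i-th token, for every position at once.
    candidate_sets = [prefix_maps[i][tok] for i, tok in enumerate(tokens)]
    candidates = set.intersection(*candidate_sets) - {query}
    # Fold into the best-scoring candidate, if any beats the original.
    best = max(candidates, key=counts.get, default=None)
    if best is not None and counts[best] > count:
        aggregated[best] = aggregated.get(best, 0) + aggregated.pop(query)

# Both "harry pott" and "harr pott" fold into "harry potter".
print(aggregated)  # {'harry potter': 137, 'iron man': 93}
```

The nested prefix loop is where I would hope a trie library could replace my hand-rolled maps, since each token of length k currently contributes k dictionary entries.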
Now I was wondering whether there is any Python library that could make implementing all of this easier. Coming from Java/JS, I am not very familiar with Python yet; I just know it has lots of tools for NLP.
Can anything in NLTK (or similar) help me here? I think there should at least be a tool for vectorizing strings. Perhaps with that, the "starts-with" operation could be done as a simple lookup, without generating trie maps manually?