I am trying to build a user-query-based autosuggest. I have a bunch of aggregated queries like:
QUERY COUNT
"harry potter" 100
"iron man" 93
"harry pott" 32
"harr pott" 5
with around 200,000 rows. As you can see, some users make extensive use of prefix search, typing only the first letters of each word. Queries like those in the example should be aggregated into the full "harry potter" row.
Now, assuming that the majority of users search with full words, I think I can do that aggregation efficiently (avoiding a nested for-loop over the whole index) in the following way:
I sort the tokens in each query alphabetically and generate a map "first_token" from prefixes to documents, like:
"h" "harry potter"
"ha" "harry potter"
"har" "harry potter"
"harr" "harry potter"
"harry" "harry potter"
and a corresponding map "second_token", and so forth:
"p" "harry potter"
"po" "harry potter"
"pot" "harry potter"
"pott" "harry potter"
"potte" "harry potter"
"potter" "harry potter"
Then I iterate from top to bottom, and for each element like "harr pott" I check whether there is an entry in both "first_token" and "second_token" pointing to the same document, e.g. "harry potter", such that this document is not identical to the original ("harr pott") and has a higher score. In that case I aggregate the two. The runtime of this should be O(index_size * max_number_of_tokens).
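To make the idea concrete, here is a minimal sketch of what I have in mind, using plain dicts in place of real trie maps (the sample data, variable names, and the tie-breaking by highest count are my own assumptions, not a finished implementation):

```python
from collections import defaultdict

# Hypothetical sample of aggregated (query, count) rows, sorted by count descending.
queries = [
    ("harry potter", 100),
    ("iron man", 93),
    ("harry pott", 32),
    ("harr pott", 5),
]
counts = dict(queries)

# One prefix map per token position:
# prefix_maps[i][p] = set of queries whose i-th (alphabetically sorted) token starts with p.
prefix_maps = defaultdict(lambda: defaultdict(set))
for query, _ in queries:
    for i, token in enumerate(sorted(query.split())):
        for j in range(1, len(token) + 1):
            prefix_maps[i][token[:j]].add(query)

aggregated = dict(counts)
for query, count in queries:
    tokens = sorted(query.split())
    # Queries whose i-th token extends our i-th token, for every position at once.
    candidate_sets = [prefix_maps[i][tok] for i, tok in enumerate(tokens)]
    candidates = set.intersection(*candidate_sets) - {query}
    # Fold into the best-scoring candidate, if any beats the original.
    best = max(candidates, key=counts.get, default=None)
    if best is not None and counts[best] > count:
        aggregated[best] = aggregated.get(best, 0) + aggregated.pop(query)

# Both "harry pott" and "harr pott" fold into "harry potter".
print(aggregated)  # {'harry potter': 137, 'iron man': 93}
```

The nested prefix loop is where I would hope a trie library could replace my hand-rolled maps, since each token of length k currently contributes k dictionary entries.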
Now I was wondering whether there is any Python library that could make implementing all of this easier. Coming from Java/JS, I am not very familiar with Python yet; I just know it has lots of tools for NLP.
Can anything in NLTK (or similar) help me here? I think there should at least be a tool for vectorizing strings. Perhaps with that, the "starts-with" operation could be done as a simple lookup, without generating trie maps manually?