I need to match a really large list of keywords (>1,000,000) in a string efficiently using Python. I found some really good libraries which try to do this fast:

1) FlashText (https://github.com/vi3k6i5/flashtext)

2) Aho-Corasick algorithm implementations, e.g. pyahocorasick (see the sketch below), etc.
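
For context, these libraries find keywords as contiguous substrings, which is exactly what my requirement below relaxes. A rough sketch with pyahocorasick (the stored values are my own choice):

import ahocorasick

automaton = ahocorasick.Automaton()
for idx, keyword in enumerate(['XXXX YYYY', 'ZZZZ']):
    automaton.add_word(keyword, (idx, keyword))
automaton.make_automaton()

# iter() yields (end_index, value) for every keyword that occurs
# as a contiguous substring of the haystack
for end_index, (idx, keyword) in automaton.iter('text containing XXXX YYYY'):
    print(keyword, 'ends at', end_index)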

However, I have a peculiar requirement: in my context a keyword, say 'XXXX YYYY', should return a match if my string is 'XXXX is a very good indication of YYYY'. Note that here 'XXXX YYYY' does not occur as a substring, but XXXX and YYYY are both present in the string, and that is good enough for a match.

I know how to do it naively. What I am looking for is efficiency. Are there any more good libraries for this?
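
By "naively" I mean something like the following rough sketch (whitespace tokenization only, ignoring punctuation; the function name is mine):

def naive_match(keywords, text):
    # a keyword matches if every one of its space-separated tokens
    # appears somewhere in the text, in any order
    tokens = set(text.split())
    return [kw for kw in keywords
            if all(part in tokens for part in kw.split())]

# naive_match(['XXXX YYYY'], 'XXXX is a very good indication of YYYY')
# -> ['XXXX YYYY']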

suzee
  • It'd be nice to know your naive solution, so as not to repeat it. One idea could be to remove everything that is not in your keyword list from the string, and then apply one of the fast libraries. – Maciek Jan 15 '18 at 07:43
  • @Maciek By naive I mean: convert each multi-word keyword into a list of tokens and match each token separately with an AND condition (this is without using the fast libraries). What you suggest assumes that YYYY occurs after XXXX, which might not be true either. – suzee Jan 15 '18 at 07:46
  • OK, I see. It was not clear from your question. – Maciek Jan 15 '18 at 07:56
  • Are you looking for patterns which are space-separated or something more general? – tripleee Jan 15 '18 at 08:04
  • @tripleee Space- and punctuation-separated would be best for my case – suzee Jan 15 '18 at 08:14
  • Split into tokens, look for the longest or most uncommon and then see if the match also contains the other(s)? – tripleee Jan 15 '18 at 08:33
  • How long are your keywords? For "xx yy", "xx is yy" is presumably OK, but what about "yy is xx"? And "xxyy"? – obgnaw Jan 22 '18 at 01:34
  • so say you have 2 keywords {cost benefit analysis} and {machine learning} - you want to find documents that contain those keywords or any permutation of those keywords, possibly with some non-keyword characters in between the constituents of the keyword. Does 'the cost benefit of different machine learning techniques requires more analysis' match only {machine learning} or both keywords? – Matti Lyra Jan 22 '18 at 11:16
  • @MattiLyra Both – suzee Jan 23 '18 at 01:24
  • @obgnaw "xx is yy" and "yy is xx" are OK; "xxyy" is not. – suzee Jan 23 '18 at 01:25

2 Answers


This falls into the "naive" camp, but here's a method that uses sets as food for thought:

docs = [
    """ Here's a sentence with dog and apple in it """,
    """ Here's a sentence with dog and poodle in it """,
    """ Here's a sentence with poodle and apple in it """,
    """ Here's a dog with and apple and a poodle in it """,
    """ Here's an apple with a dog to show that order is irrelevant """
]

query = ['dog', 'apple']

def get_similar(query, docs):
    res = []
    query_set = set(query)
    for doc in docs:
        # keep the doc if every query term appears among its tokens
        if query_set.issubset(doc.split()):
            res.append(doc)
    return res

print(get_similar(query, docs))

This returns:

[" Here's a sentence with dog and apple in it ", 
" Here's a dog with and apple and a poodle in it ", 
" Here's an apple with a dog to show that order is irrelevant "]

Of course, this still scans every document per query, but set membership tests are O(1) on average, so it's much, much quicker than doing the same with lists.
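
If the same corpus is queried many times, a sketch of one way to scale this further (my own extension, using the same whitespace tokenization): build an inverted index from token to document ids once, so each query only touches the posting sets of its own tokens instead of scanning every document:

from collections import defaultdict

# build once: token -> set of ids of the docs containing that token
inverted = defaultdict(set)
for doc_id, doc in enumerate(docs):
    for token in doc.split():
        inverted[token].add(doc_id)

def get_similar_indexed(query):
    if not query:
        return []
    # intersect the posting sets; only docs containing every token survive
    postings = [inverted.get(token, set()) for token in query]
    return [docs[i] for i in sorted(set.intersection(*postings))]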


Part 2 is that Elasticsearch is a great candidate for this if you're willing to put in the effort and you're dealing with lots and lots of data.
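
If you go that route, a rough sketch with the official elasticsearch Python client (pre-8.x style API; this assumes a node running locally, and the index name 'docs' is made up):

from elasticsearch import Elasticsearch

es = Elasticsearch()  # assumes a node at localhost:9200

for i, doc in enumerate(docs):
    es.index(index='docs', id=i, body={'text': doc})
es.indices.refresh(index='docs')

# 'operator': 'and' requires every term to be present, in any order
res = es.search(index='docs', body={
    'query': {'match': {'text': {'query': 'dog apple', 'operator': 'and'}}}
})
for hit in res['hits']['hits']:
    print(hit['_source']['text'])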

Derek Janni

What you ask sounds like a full-text search task. There's a Python search package called Whoosh. @derek's corpus can be indexed and searched in memory like the following.

from whoosh.filedb.filestore import RamStorage
from whoosh.qparser import QueryParser
from whoosh import fields


texts = [
    "Here's a sentence with dog and apple in it",
    "Here's a sentence with dog and poodle in it",
    "Here's a sentence with poodle and apple in it",
    "Here's a dog with and apple and a poodle in it",
    "Here's an apple with a dog to show that order is irrelevant"
]

schema = fields.Schema(text=fields.TEXT(stored=True))
storage = RamStorage()
index = storage.create_index(schema)  # create_index returns a ready-to-use index

writer = index.writer()
for t in texts:
    writer.add_document(text = t)
writer.commit()

# the default query parser ANDs the terms, so both 'dog' and 'apple'
# must be present, in any order
query = QueryParser('text', schema).parse('dog apple')
results = index.searcher().search(query)

for r in results:
    print(r)

This produces:

<Hit {'text': "Here's a sentence with dog and apple in it"}>
<Hit {'text': "Here's a dog with and apple and a poodle in it"}>
<Hit {'text': "Here's an apple with a dog to show that order is irrelevant"}>

You can also persist your index using FileStorage as described in How to index documents.
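
A minimal sketch of the on-disk variant (the directory name is arbitrary):

import os
from whoosh import index

# store the index in a directory instead of RAM
if not os.path.exists("indexdir"):
    os.mkdir("indexdir")
ix = index.create_in("indexdir", schema)
# ... add documents with ix.writer() as above, then in later sessions:
ix = index.open_dir("indexdir")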

saaj