I need to match a really large list of keywords (>1,000,000) in a string efficiently using Python. I found some really good libraries which try to do this fast:

1) FlashText (https://github.com/vi3k6i5/flashtext)

2) Aho-Corasick algorithm implementations, e.g. pyahocorasick (see the sketch below), etc.
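
For context, these libraries find keywords as contiguous substrings, which is exactly what my requirement below relaxes. A rough sketch with pyahocorasick (the stored values are my own choice):

import ahocorasick

automaton = ahocorasick.Automaton()
for idx, keyword in enumerate(['XXXX YYYY', 'ZZZZ']):
    automaton.add_word(keyword, (idx, keyword))
automaton.make_automaton()

# iter() yields (end_index, value) for every keyword that occurs
# as a contiguous substring of the haystack
for end_index, (idx, keyword) in automaton.iter('text containing XXXX YYYY'):
    print(keyword, 'ends at', end_index)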

However, I have a peculiar requirement: in my context a keyword, say 'XXXX YYYY', should return a match if my string is 'XXXX is a very good indication of YYYY'. Note that here 'XXXX YYYY' does not occur as a substring, but XXXX and YYYY are both present in the string, and that is good enough for a match.

I know how to do it naively. What I am looking for is efficiency. Are there any more good libraries for this?
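
By "naively" I mean something like the following rough sketch (whitespace tokenization only, ignoring punctuation; the function name is mine):

def naive_match(keywords, text):
    # a keyword matches if every one of its space-separated tokens
    # appears somewhere in the text, in any order
    tokens = set(text.split())
    return [kw for kw in keywords
            if all(part in tokens for part in kw.split())]

# naive_match(['XXXX YYYY'], 'XXXX is a very good indication of YYYY')
# -> ['XXXX YYYY']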

suzee
  • It'd be nice to know your naive solution, so as not to repeat it. One idea could be to remove everything that is not in your keyword list from the string, and then apply one of the fast libraries. – Maciek Jan 15 '18 at 07:43
  • @Maciek By naive I mean: convert each multi-word keyword into a list of tokens and match each token separately with an AND condition (this is without using the fast libraries). What you suggest assumes that YYYY occurs after XXXX, which might not be true either. – suzee Jan 15 '18 at 07:46
  • OK, I see. It was not clear from your question. – Maciek Jan 15 '18 at 07:56
  • Are you looking for patterns which are space-separated or something more general? – tripleee Jan 15 '18 at 08:04
  • @tripleee Space- and punctuation-separated would be best for my case – suzee Jan 15 '18 at 08:14
  • Split into tokens, look for the longest or most uncommon and then see if the match also contains the other(s)? – tripleee Jan 15 '18 at 08:33
  • How long are your keywords? For "xx yy", "xx is yy" is presumably OK, but what about "yy is xx"? And "xxyy"? – obgnaw Jan 22 '18 at 01:34
  • so say you have 2 keywords {cost benefit analysis} and {machine learning} - you want to find documents that contain those keywords or any permutation of those keywords, possibly with some non-keyword characters in between the constituents of the keyword. Does 'the cost benefit of different machine learning techniques requires more analysis' match only {machine learning} or both keywords? – Matti Lyra Jan 22 '18 at 11:16
  • @MattiLyra Both – suzee Jan 23 '18 at 01:24
  • @obgnaw "xx is yy" and "yy is xx" are OK; "xxyy" is not. – suzee Jan 23 '18 at 01:25

2 Answers


This falls into the "naive" camp, but here's a method that uses sets as food for thought:

docs = [
    """ Here's a sentence with dog and apple in it """,
    """ Here's a sentence with dog and poodle in it """,
    """ Here's a sentence with poodle and apple in it """,
    """ Here's a dog with and apple and a poodle in it """,
    """ Here's an apple with a dog to show that order is irrelevant """
]

query = ['dog', 'apple']

def get_similar(query, docs):
    res = []
    query_set = set(query)
    for doc in docs:
        # keep the doc if every query term appears among its tokens
        if query_set.issubset(doc.split()):
            res.append(doc)
    return res

print(get_similar(query, docs))

This returns:

[" Here's a sentence with dog and apple in it ", 
" Here's a dog with and apple and a poodle in it ", 
" Here's an apple with a dog to show that order is irrelevant "]

Of course, this still scans every document per query, but set membership tests are O(1) on average, so it's much, much quicker than doing the same with lists.
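
If the same corpus is queried many times, a sketch of one way to scale this further (my own extension, using the same whitespace tokenization): build an inverted index from token to document ids once, so each query only touches the posting sets of its own tokens instead of scanning every document:

from collections import defaultdict

# build once: token -> set of ids of the docs containing that token
inverted = defaultdict(set)
for doc_id, doc in enumerate(docs):
    for token in doc.split():
        inverted[token].add(doc_id)

def get_similar_indexed(query):
    if not query:
        return []
    # intersect the posting sets; only docs containing every token survive
    postings = [inverted.get(token, set()) for token in query]
    return [docs[i] for i in sorted(set.intersection(*postings))]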


Part 2 is that Elasticsearch is a great candidate for this if you're willing to put in the effort and you're dealing with lots and lots of data.
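
If you go that route, a rough sketch with the official elasticsearch Python client (pre-8.x style API; this assumes a node running locally, and the index name 'docs' is made up):

from elasticsearch import Elasticsearch

es = Elasticsearch()  # assumes a node at localhost:9200

for i, doc in enumerate(docs):
    es.index(index='docs', id=i, body={'text': doc})
es.indices.refresh(index='docs')

# 'operator': 'and' requires every term to be present, in any order
res = es.search(index='docs', body={
    'query': {'match': {'text': {'query': 'dog apple', 'operator': 'and'}}}
})
for hit in res['hits']['hits']:
    print(hit['_source']['text'])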

Derek Janni

What you ask sounds like a full-text search task. There's a Python search package called Whoosh. @derek's corpus can be indexed and searched in memory like the following.

from whoosh.filedb.filestore import RamStorage
from whoosh.qparser import QueryParser
from whoosh import fields


texts = [
    "Here's a sentence with dog and apple in it",
    "Here's a sentence with dog and poodle in it",
    "Here's a sentence with poodle and apple in it",
    "Here's a dog with and apple and a poodle in it",
    "Here's an apple with a dog to show that order is irrelevant"
]

schema = fields.Schema(text=fields.TEXT(stored=True))
storage = RamStorage()
index = storage.create_index(schema)  # create_index returns a ready-to-use index

writer = index.writer()
for t in texts:
    writer.add_document(text = t)
writer.commit()

# the default query parser ANDs the terms, so both 'dog' and 'apple'
# must be present, in any order
query = QueryParser('text', schema).parse('dog apple')
results = index.searcher().search(query)

for r in results:
    print(r)

This produces:

<Hit {'text': "Here's a sentence with dog and apple in it"}>
<Hit {'text': "Here's a dog with and apple and a poodle in it"}>
<Hit {'text': "Here's an apple with a dog to show that order is irrelevant"}>

You can also persist your index using FileStorage as described in How to index documents.
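
A minimal sketch of the on-disk variant (the directory name is arbitrary):

import os
from whoosh import index

# store the index in a directory instead of RAM
if not os.path.exists("indexdir"):
    os.mkdir("indexdir")
ix = index.create_in("indexdir", schema)
# ... add documents with ix.writer() as above, then in later sessions:
ix = index.open_dir("indexdir")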

saaj