
I am looking for a library that would let me do something like the following:

matches(
    user_input="hello world how are you what are you doing",
    keywords='+world -tigers "how are" -"bye bye"'
)

Basically I want it to match strings based on presence of words, absence of words and sequences of words. I don't need a search engine a la Solr, because strings will not be known in advance and will only be searched once. Does such a library already exist, and if so, where would I find it? Or am I doomed to creating a regex generator?

jfs
ipartola
  • Try nltk.org. That's the natural language processing library for Python. – Kelvin Feb 26 '15 at 22:10
  • Not sure what size of data you are looking to match over, but Lucene/Solr is the best option for a larger scale application - http://lucene.apache.org/solr/ . Also look at [pysolr](https://github.com/toastdriven/pysolr). – Shashank Agarwal Feb 26 '15 at 22:15
  • I am looking to match very small amounts of data: strings of under 100 words, using keyword rules of only a few keywords. After the matching is done, I no longer have a use for the original string, so I don't think Solr is what I need. I also don't need the search to be fuzzy or language-specific. – ipartola Feb 26 '15 at 22:19

2 Answers


The third-party `regex` module supports named lists:

import regex

def match_words(words, string):
    return regex.search(r"\b\L<words>\b", string, words=words)

def match(string, include_words, exclude_words):
    return (match_words(include_words, string) and
            not match_words(exclude_words, string))

Example:

if match("hello world how are you what are you doing",
         include_words=["world", "how are"],
         exclude_words=["tigers", "bye bye"]):
    print('matches')

You could implement named lists using the standard `re` module, e.g.:

import re

def match_words(words, string):
    re_words = '|'.join(map(re.escape, sorted(words, key=len, reverse=True)))
    return re.search(r"\b(?:{words})\b".format(words=re_words), string)
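As a quick check of this fallback on the question's sample string (the descending sort by length puts longer phrases first, so they are tried before their prefixes):

```python
import re

def match_words(words, string):
    # Longest alternatives first, so a longer phrase wins over its prefix
    re_words = '|'.join(map(re.escape, sorted(words, key=len, reverse=True)))
    return re.search(r"\b(?:{words})\b".format(words=re_words), string)

s = "hello world how are you what are you doing"
print(bool(match_words(["world", "how are"], s)))   # -> True
print(bool(match_words(["tigers", "bye bye"], s)))  # -> False
```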

> how do I build the list of included and excluded words based on the +, -, and "" grammar?

You could use shlex.split():

import shlex

include_words, exclude_words = [], []
for word in shlex.split('+world -tigers "how are" -"bye bye"'):
    (exclude_words if word.startswith('-') else include_words).append(word.lstrip('-+'))

print(include_words, exclude_words)
# -> ['world', 'how are'] ['tigers', 'bye bye']
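Putting the two pieces together, a `matches()` function with the signature from the question could look like this. This is a sketch: the guard for an empty word list (so that an empty exclusion list never blocks a match) is my own addition.

```python
import re
import shlex

def match_words(words, string):
    # An empty list matches nothing (assumption; avoids an empty alternation)
    if not words:
        return None
    re_words = '|'.join(map(re.escape, sorted(words, key=len, reverse=True)))
    return re.search(r"\b(?:{words})\b".format(words=re_words), string)

def matches(user_input, keywords):
    include_words, exclude_words = [], []
    for word in shlex.split(keywords):
        (exclude_words if word.startswith('-') else include_words).append(word.lstrip('-+'))
    return bool(match_words(include_words, user_input) and
                not match_words(exclude_words, user_input))

print(matches(user_input="hello world how are you what are you doing",
              keywords='+world -tigers "how are" -"bye bye"'))  # -> True
```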
jfs
  • Clever and probably faster than Amrita's solution, but I don't think this helps me with the keyword grammar either, unless I am missing some magic in the re_words creation. – ipartola Feb 27 '15 at 03:46
  • I've updated the answer to show how re_words are used in implementing the match() function and how to parse `'+world -tigers "how are" -"bye bye"'`. – jfs Feb 27 '15 at 08:15
  • Perfect. Didn't realize that shlex would do this with just the split function. This is perfect! – ipartola Feb 27 '15 at 15:48

From the example you have given, you do not need a regex unless you are looking for patterns/expressions within words.

    d="---your string ---"
    mylist= d.split()
    M=[]
    Excl=["---excluded words---"]
    for word in mylist:
        if word not in Excl:
            M.append(word)
    print(M)

You can write a generic function which can be used with any string list and exclusion list.
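Such a generic function might look like this (a sketch; the function name and placeholder inputs are my own):

```python
def filter_excluded(string, excluded):
    # Keep only the words that are not in the exclusion list;
    # a set makes each membership test O(1)
    excl = set(excluded)
    return [word for word in string.split() if word not in excl]

print(filter_excluded("hello world how are you", ["world"]))
# -> ['hello', 'how', 'are', 'you']
```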

Amrita Sawant
  • Sure, that works, but how do I build the list of included and excluded words based on the +, -, and "" grammar? Is there a ready made solution, or will I have to use lex to create it? – ipartola Feb 27 '15 at 03:44
  • *"you do not need Regex"* but regexps help. The regex-based solution looks at the string only twice. Your solution looks at it `len(mylist)` times, i.e. `~100` times. – jfs Feb 27 '15 at 08:00