
I am looking for a library that would let me do something like the following:

matches(
    user_input="hello world how are you what are you doing",
    keywords='+world -tigers "how are" -"bye bye"'
)

Basically I want it to match strings based on presence of words, absence of words and sequences of words. I don't need a search engine a la Solr, because strings will not be known in advance and will only be searched once. Does such a library already exist, and if so, where would I find it? Or am I doomed to creating a regex generator?

jfs
ipartola
  • Try nltk.org. That's the natural language processing library for Python. – Kelvin Feb 26 '15 at 22:10
  • Not sure what size of data you are looking to match over, but Lucene/Solr is the best option for a larger scale application - http://lucene.apache.org/solr/ . Also look at [pysolr](https://github.com/toastdriven/pysolr). – Shashank Agarwal Feb 26 '15 at 22:15
  • I am looking to match very small amounts of data: strings of under 100 words, using keyword rules of only a few keywords. After the matching is done, I no longer have a use for the original string, so I don't think Solr is what I need. I also don't need the search to be fuzzy or language-specific. – ipartola Feb 26 '15 at 22:19

2 Answers


The third-party `regex` module supports named lists:

import regex

def match_words(words, string):
    return regex.search(r"\b\L<words>\b", string, words=words)

def match(string, include_words, exclude_words):
    return (match_words(include_words, string) and
            not match_words(exclude_words, string))

Example:

if match("hello world how are you what are you doing",
         include_words=["world", "how are"],
         exclude_words=["tigers", "bye bye"]):
    print('matches')

You could implement named lists using the standard `re` module, e.g.:

import re

def match_words(words, string):
    re_words = '|'.join(map(re.escape, sorted(words, key=len, reverse=True)))
    return re.search(r"\b(?:{words})\b".format(words=re_words), string)
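As a quick check of this fallback on the question's sample string (the descending sort by length puts longer phrases first, so they are tried before their prefixes):

```python
import re

def match_words(words, string):
    # Longest alternatives first, so a longer phrase wins over its prefix
    re_words = '|'.join(map(re.escape, sorted(words, key=len, reverse=True)))
    return re.search(r"\b(?:{words})\b".format(words=re_words), string)

s = "hello world how are you what are you doing"
print(bool(match_words(["world", "how are"], s)))   # -> True
print(bool(match_words(["tigers", "bye bye"], s)))  # -> False
```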

> how do I build the list of included and excluded words based on the +, -, and "" grammar?

You could use shlex.split():

import shlex

include_words, exclude_words = [], []
for word in shlex.split('+world -tigers "how are" -"bye bye"'):
    (exclude_words if word.startswith('-') else include_words).append(word.lstrip('-+'))

print(include_words, exclude_words)
# -> ['world', 'how are'] ['tigers', 'bye bye']
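Putting the two pieces together, a `matches()` function with the signature from the question could look like this. This is a sketch: the guard for an empty word list (so that an empty exclusion list never blocks a match) is my own addition.

```python
import re
import shlex

def match_words(words, string):
    # An empty list matches nothing (assumption; avoids an empty alternation)
    if not words:
        return None
    re_words = '|'.join(map(re.escape, sorted(words, key=len, reverse=True)))
    return re.search(r"\b(?:{words})\b".format(words=re_words), string)

def matches(user_input, keywords):
    include_words, exclude_words = [], []
    for word in shlex.split(keywords):
        (exclude_words if word.startswith('-') else include_words).append(word.lstrip('-+'))
    return bool(match_words(include_words, user_input) and
                not match_words(exclude_words, user_input))

print(matches(user_input="hello world how are you what are you doing",
              keywords='+world -tigers "how are" -"bye bye"'))  # -> True
```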
jfs
  • Clever and probably faster than Amrita's solution, but I don't think this helps me with the keyword grammar either, unless I am missing some magic in the re_words creation. – ipartola Feb 27 '15 at 03:46
  • I've updated the answer to show how re_words are used in implementing the match() function and how to parse `'+world -tigers "how are" -"bye bye"'`. – jfs Feb 27 '15 at 08:15
  • Perfect. Didn't realize that shlex would do this with just the split function. This is perfect! – ipartola Feb 27 '15 at 15:48

From the example you have given, you do not need a regex unless you are looking for patterns/expressions within words.

    d="---your string ---"
    mylist= d.split()
    M=[]
    Excl=["---excluded words---"]
    for word in mylist:
        if word not in Excl:
            M.append(word)
    print(M)

You can write a generic function which can be used with any string list and exclusion list.
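Such a generic function might look like this (a sketch; the function name and placeholder inputs are my own):

```python
def filter_excluded(string, excluded):
    # Keep only the words that are not in the exclusion list;
    # a set makes each membership test O(1)
    excl = set(excluded)
    return [word for word in string.split() if word not in excl]

print(filter_excluded("hello world how are you", ["world"]))
# -> ['hello', 'how', 'are', 'you']
```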

Amrita Sawant
  • Sure, that works, but how do I build the list of included and excluded words based on the +, -, and "" grammar? Is there a ready made solution, or will I have to use lex to create it? – ipartola Feb 27 '15 at 03:44
  • *"you do not need Regex"* but regexps help. The regex-based solution looks at the string only twice. Your solution looks at it `len(mylist)` times, i.e. `~100` times. – jfs Feb 27 '15 at 08:00