The following is not ideal, but it should get you started. It uses nltk to split your text into words, then builds a set containing the stem of each word, skipping any stop words. It does this for both your sample text and the sample query.
If the intersection of the two sets contains every word in the query, the text is considered a match.
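The matching rule on its own, independent of nltk, amounts to a subset test. Here is a minimal sketch of just that step. The stem sets are typed in by hand for illustration, and `is_match` is a hypothetical helper name, not something the script defines:

```python
# Hand-written stem sets for illustration only; the full script derives
# these with nltk rather than hard-coding them.
query_stems = {"engag", "prognosi", "surviv"}
text_stems_full = {"arteri", "high", "blood", "pressur", "engag", "prognosi", "surviv", "patient"}
text_stems_missing = {"arteri", "high", "blood", "pressur", "engag", "surviv", "patient"}

def is_match(query_stems, text_stems):
    """Return True when every query stem also occurs in the text."""
    # Same result as len(query_stems & text_stems) == len(query_stems).
    return query_stems <= text_stems

print(is_match(query_stems, text_stems_full))     # True
print(is_match(query_stems, text_stems_missing))  # False
```

The full script below computes both sets with nltk and applies the same test through the size of the intersection.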
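# One-time setup: the tokenizer and stop-word data must be downloaded first,
# e.g. nltk.download('punkt') and nltk.download('stopwords')
# (resource names can differ between nltk versions).
import nltk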
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
stop_words = stopwords.words('english')
ps = PorterStemmer()
def get_word_set(text):
    return set(ps.stem(word) for word in word_tokenize(text) if word not in stop_words)
text1 = "The arterial high blood pressure may engage the prognosis for survival of the patient as a result of complications. TENSTATEN enters within the framework of a preventive treatment(processing). His(Her,Its) report(relationship) efficiency / effects unwanted is important. diuretics, medicine of first intention of which TENSTATEN, is. The therapeutic alternatives are very numerous."
text2 = "The arterial high blood pressure may engage the for survival of the patient as a result of complications. TENSTATEN enters within the framework of a preventive treatment(processing). His(Her,Its) report(relationship) efficiency / effects unwanted is important. diuretics, medicine of first intention of which TENSTATEN, is. The therapeutic alternatives are very numerous."
query = "engage the prognosis for survival"
set_query = get_word_set(query)
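# Written for Python 2 (print statements); use print(...) on Python 3.
for text in [text1, text2]:
    set_text = get_word_set(text)
    intersection = set_query & set_text
    print "Query:", set_query
    print "Test:", set_text
    print "Intersection:", intersection
    # A match means every stem in the query also appears in the text.
    print "Match:", len(intersection) == len(set_query)
    print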
The script provides two texts: the first matches and the second does not. It prints the following output to show what it is doing:
Query: set([u'prognosi', u'engag', u'surviv'])
Test: set([u'medicin', u'prevent', u'effici', u'engag', u'Her', u'process', u'within', u'surviv', u'high', u'pressur', u'result', u'framework', u'diuret', u')', u'(', u',', u'/', u'.', u'numer', u'Hi', u'treatment', u'import', u'complic', u'altern', u'patient', u'relationship', u'may', u'arteri', u'effect', u'prognosi', u'intent', u'blood', u'report', u'The', u'TENSTATEN', u'unwant', u'It', u'therapeut', u'enter', u'first'])
Intersection: set([u'prognosi', u'engag', u'surviv'])
Match: True
Query: set([u'prognosi', u'engag', u'surviv'])
Test: set([u'medicin', u'prevent', u'effici', u'engag', u'Her', u'process', u'within', u'surviv', u'high', u'pressur', u'result', u'diuret', u')', u'(', u',', u'/', u'.', u'numer', u'Hi', u'treatment', u'import', u'complic', u'altern', u'patient', u'relationship', u'may', u'arteri', u'effect', u'framework', u'intent', u'blood', u'report', u'The', u'TENSTATEN', u'unwant', u'It', u'therapeut', u'enter', u'first'])
Intersection: set([u'engag', u'surviv'])
Match: False
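The output above also shows part of why this is not ideal: punctuation tokens and capitalised stop words such as 'The' end up in the set, because the stop-word list is lowercase and nothing removes punctuation. One possible refinement is to lowercase the text and keep only alphabetic tokens before filtering. The sketch below is an assumption-laden illustration, not a drop-in replacement: the regex tokenizer, the tiny stop-word set and the identity stemmer are stand-ins so the example runs without nltk data.

```python
import re

def get_word_set(text, stem, stop_words):
    """Lowercase, keep alphabetic tokens only, drop stop words, then stem."""
    # Stand-in tokenizer (assumption): runs of ASCII letters in the lowercased text.
    tokens = re.findall(r"[a-z]+", text.lower())
    stop_words = set(stop_words)  # set membership avoids scanning a list per token
    return {stem(token) for token in tokens if token not in stop_words}

# Illustrative stand-ins; with nltk you would pass ps.stem and
# stopwords.words('english') instead.
demo_stop_words = {"the", "for", "of", "is"}
identity_stem = lambda word: word

words = get_word_set("The prognosis for survival, of the patient.",
                     identity_stem, demo_stop_words)
print(sorted(words))  # ['patient', 'prognosis', 'survival']
```

Passing the stemmer and stop-word list as arguments keeps the normalisation step separate from the nltk-specific pieces, so either can be swapped out.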