2

I have a large sample text, for example:

"The arterial high blood pressure may engage the prognosis for survival of the patient as a result of complications. TENSTATEN enters within the framework of a preventive treatment(processing). His(Her,Its) report(relationship) efficiency / effects unwanted is important. diuretics, medicine of first intention of which TENSTATEN, is. The therapeutic alternatives are very numerous."

I am trying to detect whether "engage the prognosis for survival" occurs in the text, but in a fuzzy way. For example, "has engage the progronosis of survival" must return a positive answer too.

I looked into fuzzywuzzy, nltk and the new regex fuzzy functions, but I didn't find a way to do this:

if [anything similar (>90%) to "that sentence"] in mybigtext:
    print True
Martin Evans
  • I'm new here, but I think this should solve your problem: http://stackoverflow.com/questions/30449452/python-fuzzy-text-search?rq=1 – opeonikute Feb 29 '16 at 18:08
  • Have a look at [gensim](https://radimrehurek.com/gensim/index.html), especially the [similarity section](https://radimrehurek.com/gensim/tut3.html). – Jan Feb 29 '16 at 20:23

3 Answers

1

The following is not ideal, but it should get you started. It uses nltk to split your text into words, then builds a set containing the stem of each word, filtering out any stop words. It does this for both your sample text and the sample query.

If the intersection of the two sets contains all of the words in the query, it is considered a match.

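import nltk

from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

# The tokenizer models and the stop-word list must already be installed,
# e.g. via nltk.download().
stop_words = stopwords.words('english')
ps = PorterStemmer()

def get_word_set(text):
    # Tokenize, drop stop words (case-sensitive test), and stem what is left.
    return set(ps.stem(word) for word in word_tokenize(text) if word not in stop_words)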

text1 = "The arterial high blood pressure may engage the prognosis for survival of the patient as a result of complications. TENSTATEN enters within the framework of a preventive treatment(processing). His(Her,Its) report(relationship) efficiency / effects unwanted is important. diuretics, medicine of first intention of which TENSTATEN, is. The therapeutic alternatives are very numerous."
text2 = "The arterial high blood pressure may engage the for survival of the patient as a result of complications. TENSTATEN enters within the framework of a preventive treatment(processing). His(Her,Its) report(relationship) efficiency / effects unwanted is important. diuretics, medicine of first intention of which TENSTATEN, is. The therapeutic alternatives are very numerous."

query = "engage the prognosis for survival"

set_query = get_word_set(query)
for text in [text1, text2]:
    set_text = get_word_set(text)
    intersection = set_query & set_text

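    # print() with a single formatted string runs on both Python 2 and 3.
    print("Query: {}".format(set_query))
    print("Test: {}".format(set_text))
    print("Intersection: {}".format(intersection))
    print("Match: {}".format(len(intersection) == len(set_query)))
    print("")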

The script tests two texts: the first passes and the second does not. It produces the following output to show what it is doing:

Query: set([u'prognosi', u'engag', u'surviv'])
Test: set([u'medicin', u'prevent', u'effici', u'engag', u'Her', u'process', u'within', u'surviv', u'high', u'pressur', u'result', u'framework', u'diuret', u')', u'(', u',', u'/', u'.', u'numer', u'Hi', u'treatment', u'import', u'complic', u'altern', u'patient', u'relationship', u'may', u'arteri', u'effect', u'prognosi', u'intent', u'blood', u'report', u'The', u'TENSTATEN', u'unwant', u'It', u'therapeut', u'enter', u'first'])
Intersection: set([u'prognosi', u'engag', u'surviv'])
Match: True

Query: set([u'prognosi', u'engag', u'surviv'])
Test: set([u'medicin', u'prevent', u'effici', u'engag', u'Her', u'process', u'within', u'surviv', u'high', u'pressur', u'result', u'diuret', u')', u'(', u',', u'/', u'.', u'numer', u'Hi', u'treatment', u'import', u'complic', u'altern', u'patient', u'relationship', u'may', u'arteri', u'effect', u'framework', u'intent', u'blood', u'report', u'The', u'TENSTATEN', u'unwant', u'It', u'therapeut', u'enter', u'first'])
Intersection: set([u'engag', u'surviv'])
Match: False
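
The all-or-nothing test above can be relaxed into a score. The following is only a sketch under stated assumptions: it swaps nltk's tokenizer and Porter stemmer for a plain lowercase regex split so that it runs with the standard library alone, `STOP_WORDS` is a tiny hand-written list, the two texts are shortened to their first sentence, and the names `simple_word_set` and `overlap_ratio` and the 0.9 threshold are illustrative, not part of the script above.

```python
import re

# Hypothetical, deliberately tiny stop-word list; nltk's list is much longer.
STOP_WORDS = {"the", "for", "of", "a", "as", "may", "is"}

def simple_word_set(text):
    """Lowercase the text, keep runs of letters, and drop stop words (no stemming)."""
    words = re.findall(r"[a-z]+", text.lower())
    return set(w for w in words if w not in STOP_WORDS)

def overlap_ratio(query, text):
    """Fraction of the query's words that also occur somewhere in the text."""
    set_query = simple_word_set(query)
    if not set_query:
        return 0.0
    return len(set_query & simple_word_set(text)) / float(len(set_query))

query = "engage the prognosis for survival"
# Shortened versions of the two sample texts (first sentence only).
text1 = "The arterial high blood pressure may engage the prognosis for survival of the patient."
text2 = "The arterial high blood pressure may engage the for survival of the patient."

for text in [text1, text2]:
    ratio = overlap_ratio(query, text)
    print("Ratio: {:.2f}  Match: {}".format(ratio, ratio >= 0.9))
```

Like the original script, this ignores word order and how far apart the words are, so it only says that most of the query's words appear somewhere in the text.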
Martin Evans
1

Using the regex module, first split by sentences then test if the fuzzy pattern is in the sentence:

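import regex  # third-party module (pip install regex), not the built-in re

tgt = "The arterial high blood pressure may engage the prognosis for survival of the patient as a result of complications. TENSTATEN enters within the framework of a preventive treatment(processing). His(Her,Its) report(relationship) efficiency / effects unwanted is important. diuretics, medicine of first intention of which TENSTATEN, is. The therapeutic alternatives are very numerous."

# Split into sentences: after ., ?, ! or ; followed by whitespace and an uppercase letter.
for sentence in regex.split(r'(?<=[.?!;])\s+(?=\p{Lu})', tgt):
    # (?e) turns on ENHANCEMATCH; {e<N} allows fewer than N errors in the group.
    pat = r'(?e)((?:has engage the progronosis of survival){e<%i})'
    # Error budget: one fifth of the pattern string's length (regex syntax included).
    pat = pat % int(len(pat) / 5)
    m = regex.search(pat, sentence)
    if m:
        # m.fuzzy_counts is (substitutions, insertions, deletions).
        print("'{}'\n\tfuzzy matches\n'{}'\n\twith \n{} substitutions, {} insertions, {} deletions".format(pat, m.group(1), *m.fuzzy_counts))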

Prints:

'(?e)((?:has engage the progronosis of survival){e<10})'
    fuzzy matches
'may engage the prognosis for survival'
    with 
3 substitutions, 1 insertions, 2 deletions
dawg
  • So by limiting the individual fuzzy_counts numbers, I could tell the difference between 'has engage the prognosis' and 'do not engage the prognosis'. That seems perfect, thanks! I'll try it and mark my problem as solved if it works. – Mickael_Paris Mar 02 '16 at 12:38
0

The function below prints a match for each word of the phrase that is contained in the text. You could adapt it to check for complete phrases in a text.

Here's the function I made:

def FuzzySearch(text, phrase):
    """Print each word of phrase that occurs in text (exact substring test)."""
    for word in phrase.split(" "):
        if word in text:
            print("Match! Found " + word + " in text")
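
To check for a whole phrase rather than single words, one option is to slide a window of words across the text and score each window against the phrase. The sketch below uses `difflib.SequenceMatcher` from the standard library. The function name `fuzzy_contains`, the `slack` parameter and the 0.8 threshold are assumptions chosen for illustration, and the ratio that difflib reports is its own similarity measure, not an edit-distance percentage, so the threshold would need tuning on real data.

```python
from difflib import SequenceMatcher

def fuzzy_contains(text, phrase, threshold=0.8, slack=1):
    """Return (best_ratio, best_window) if some run of words in text is
    similar enough to phrase, else None."""
    words = text.split()
    n = len(phrase.split())
    best_ratio, best_window = 0.0, ""
    # Try windows slightly shorter and longer than the phrase itself.
    for size in range(max(1, n - slack), n + slack + 1):
        for start in range(0, max(1, len(words) - size + 1)):
            window = " ".join(words[start:start + size])
            ratio = SequenceMatcher(None, phrase.lower(), window.lower()).ratio()
            if ratio > best_ratio:
                best_ratio, best_window = ratio, window
    if best_ratio >= threshold:
        return best_ratio, best_window
    return None

text = ("The arterial high blood pressure may engage the prognosis for "
        "survival of the patient as a result of complications.")
result = fuzzy_contains(text, "has engage the progronosis of survival")
if result:
    print("Match! Closest window: '{}' (ratio {:.2f})".format(result[1], result[0]))
```

Every window is compared against the phrase, so the cost grows with the length of the text. For large inputs it may be worth splitting into sentences first, as the regex answer does.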
Hazim Sager