
I have the following text and want to isolate the part of the sentence related to each keyword, in this case keywords = ['pizza', 'chips'].

text = "The pizza is great but the chips aren't the best"

Expected Output:

{'pizza': 'The pizza is great'}
{'chips': "the chips aren't the best"}

I have tried using the spaCy DependencyMatcher, but admittedly I'm not quite sure how it works. I tried the following pattern for chips, which yields no matches.

import spacy
from spacy.matcher import DependencyMatcher

nlp = spacy.load("en_core_web_sm")

pattern = [
    {
        "RIGHT_ID": "chips_id",
        "RIGHT_ATTRS": {"ORTH": "chips"}
    },
    {
        "LEFT_ID": "chips_id",
        "REL_OP": "<<",
        "RIGHT_ID": "other_words",
        "RIGHT_ATTRS": {"POS": "*"}
    }
]

matcher = DependencyMatcher(nlp.vocab)
matcher.add("chips", [pattern])

doc = nlp("The pizza is great but the chips aren't the best")
for id_, (_, other_words) in matcher(doc):
    print(doc[other_words])
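(One likely reason this finds nothing: {"POS": "*"} matches only tokens whose part-of-speech tag is literally "*", which never occurs. Assuming the DependencyMatcher accepts the same empty-dict wildcard as the token-based Matcher, the second node could be written like this; a sketch, not a verified fix:)

pattern = [
    {
        "RIGHT_ID": "chips_id",
        "RIGHT_ATTRS": {"ORTH": "chips"}
    },
    {
        "LEFT_ID": "chips_id",
        "REL_OP": "<<",            # chips_id is a descendant of other_words
        "RIGHT_ID": "other_words",
        "RIGHT_ATTRS": {}          # empty dict: match any token (assumption)
    }
]

Re-running the matcher code above with this pattern should then print the syntactic ancestors of "chips" rather than nothing.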

Edit:

Additional example sentences:

example_sentences = [
    "The pizza's are just OK, the chips is stiff and the service mediocre",
    "Then the mains came and the pizza - these we're really average - chips had loads of oil and was poor",
    "Nice pizza freshly made to order food is priced well, but chips are not so keenly priced.",
    "The pizzas and chips taste really good and the Tango Ice Blast was refreshing"
]
Ali
  • Will the sentences you need to handle be similar in structure to the example you are using? – Richard K Yu Jan 02 '22 at 22:46
  • Yes, the example sentence provided is a good representation of the text I will need to handle. I have updated the question with more example sentences. – Ali Jan 02 '22 at 23:07
  • Is it alright if I post a preliminary solution so we can both check it out? The solution I have works for the first sentence you put and some of the example sentences, but some of the other example sentences we may need to modify in some way before SpaCy can work effectively on them – Richard K Yu Jan 02 '22 at 23:21
  • It looks like you're doing sentence simplification for the purposes of aspect-based sentiment analysis. spaCy gives you the tools to do that, but if you're not familiar with these problems already it'll be kind of involved. I recommend looking at the Jurafsky and Martin book (free online) sections on dependency parsing and sentiment analysis to get started. https://web.stanford.edu/~jurafsky/slp3/ – polm23 Jan 03 '22 at 05:20

2 Answers


You could use the following function:

def spliter(text: str, keyword: list, number_of_words: int):
    L = text.split()
    sentences = dict()
    for k in L:
        if k in keyword:
            n = L.index(k)  # position of the first occurrence of the keyword
            if len(L) - n - 1 > number_of_words:
                sentences.update({k: ' '.join(L[n:n + number_of_words])})
            else:
                # fewer than number_of_words words remain: take the rest
                sentences.update({k: ' '.join(L[n:])})
    return sentences

Note: number_of_words defines how many words you want to keep, starting at the desired keyword itself.
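For example, called with the question's text and keywords:

spliter("The pizza is great but the chips aren't the best", ['pizza', 'chips'], 3)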

Output: for number_of_words = 3 you get:

{'pizza': 'pizza is great', 'chips': "chips aren't the best"}
Ayyoub ESSADEQ
  • Unfortunately, this won't be suitable for my use case since the `number_of_words` argument is static. This approach will fail with different sentence structures. – Ali Jan 03 '22 at 08:19

Here is my attempt at a very limited solution to your problem, since I do not know how extensive you will want this to be.

I utilized code from this answer in order to address the problem.

import spacy
import re

en = spacy.load('en_core_web_sm')

text = "The pizza is great but the chips aren't the best"

doc = en(text)

seen = set()  # keep track of covered words

chunks = []
for sent in doc.sents:
    # conjoined clauses hang off the sentence root via the 'conj' relation
    heads = [cc for cc in sent.root.children if cc.dep_ == 'conj']

    for head in heads:
        # take the whole subtree of each conjoined clause as one chunk
        words = [ww for ww in head.subtree]
        for word in words:
            seen.add(word)
        chunk = ' '.join([ww.text for ww in words])
        chunks.append((head.i, chunk))

    # whatever was not covered above belongs to the root's own clause
    unseen = [ww for ww in sent if ww not in seen]
    chunk = ' '.join([ww.text for ww in unseen])
    chunks.append((sent.root.i, chunk))

chunks = sorted(chunks, key=lambda x: x[0])  # restore document order


output_dict = {}

# use the noun chunks (minus a leading "the ") as candidate keys
for np in doc.noun_chunks:
    insensitive_the = re.compile(re.escape('the '), re.IGNORECASE)
    new_np = insensitive_the.sub('', np.text)
    output_dict[new_np] = ''

# assign each clause chunk to every key it mentions
for ii, chunk in chunks:
    for key in output_dict:
        if key in chunk:
            output_dict[key] = chunk

print(output_dict)

The output I get is: [screenshot of the resulting dictionary]

I am aware there are a few problems:

  1. The conjunction 'but' should not be in the value of the pizza key.
  2. The tokens "are n't" should be joined back into "aren't" in the second value of the dictionary.

However, I believe we can fix this with more information about what sort of sentences you are dealing with. For instance, if the sentences are simple enough, we might keep a list of conjunctions to strip from all the values of the dict; a sketch of that cleanup follows.
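A minimal sketch of that cleanup, assuming the sentences stay simple (the conjunction list and the helper name are my own, not part of the code above):

CONJUNCTIONS = {'and', 'but', 'or', 'so', 'yet'}  # assumed list; extend as needed

def clean_value(value: str) -> str:
    # rejoin contractions that spaCy tokenization split apart, e.g. "are n't" -> "aren't"
    value = value.replace(" n't", "n't")
    # drop leading and trailing conjunctions
    words = value.split()
    while words and words[0].lower() in CONJUNCTIONS:
        words.pop(0)
    while words and words[-1].lower() in CONJUNCTIONS:
        words.pop()
    return ' '.join(words)

output_dict = {key: clean_value(value) for key, value in output_dict.items()}

On the first sentence this should turn "The pizza is great but" into "The pizza is great" and "the chips are n't the best" into "the chips aren't the best".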

Update with example sentences: [screenshot of the output for the additional example sentences]

As you can see, spaCy struggles a bit with the punctuation, and it has no way of knowing that you presumably only want food items as keys in the dictionary; filtering the keys against your keyword list would address the latter, as sketched below.
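A minimal sketch of that filter, reusing the keyword list from the question (this is my addition, not part of the original code):

keywords = ['pizza', 'chips']  # the keyword list from the question

# keep only dictionary keys that mention one of the target keywords
filtered = {key: value for key, value in output_dict.items()
            if any(kw in key for kw in keywords)}
print(filtered)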

Richard K Yu