
I have a text which is made up of a list of basic sentences, such as "she is a doctor", "he is a good person", and so forth. I'm trying to write a program which will return only the nouns and the preceding pronoun (e.g. she, he, it). I need them to print as a pair, for example (she, doctor) or (he, person). I'm using SpaCy as this will allow me to work with similar texts in French and German as well.

This is the closest thing I've found elsewhere on this site as to what I need. What I've been trying so far is to produce a list of nouns in the text and then search the text for nouns in the list, and print the noun and the word 3 places before it (since this is the pattern for most of the sentences, and most is good enough for my purposes). This is what I've got for creating the list:

def spacy_tag(text):
  text_open = codecs.open(text, encoding='latin1').read()
  parsed_text = nlp_en(text_open)
  tokens = list([(token, token.tag_) for token in parsed_text])
  list1 = []
  for token, token.tag_ in tokens:
    if token.tag_ == 'NN':
      list1.append(token)
  return(list1)

However, when I try to do anything with it, I get an error message. I've tried using enumerate but I couldn't get that to work either. This is the current code I have for searching the text for the words in the list (I haven't gotten around to adding the part which should print the word several places beforehand as I'm still stuck on the searching part):

def spacy_search(text, list):
  text_open = codecs.open(text, encoding='latin1').read()
  for word in text_open:
   if word in list:
     print(word)

The error I get is at line 4, "if word in list:", and it says "TypeError: Argument 'other' has incorrect type (expected spacy.tokens.token.Token, got str)"

Is there a more efficient way of printing a PRP, NN pair using SpaCy? And alternatively, how can I amend my code to work so it searches the text for the nouns in the list? (It doesn't need to be a particularly elegant solution, it just needs to produce a result).

Wiktor Stribiżew
beatrixx

2 Answers


You've taken the wrong approach. Try this instead:

First, append the attributes of every token in the sentence to a list:

tokenized = []
for token in doc:
    tokenized.append((token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
                      token.shape_, token.is_alpha, token.is_stop, token.head,
                      token.left_edge, token.right_edge, token.ent_type_))

Then write a function that receives one of those token tuples, checks whether its POS is 'NOUN' and its tag is 'NN', and returns its head:

def get_head(token):
    # token is one of the tuples built above; index 2 is pos_, 3 is tag_, 8 is head
    if token[2] == 'NOUN' and token[3] == 'NN':
        return token[8]

Now, if the returned head is a PRON, you've found what you're looking for; if not, feed the head token back into the function and repeat.
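One way to realize that climb without spaCy is the sketch below, using a hypothetical `Tok` stand-in (not spaCy's actual Token API). Note that at each step it also inspects the head's children, since in the parses shown below the pronoun hangs off the verb as an nsubj child rather than being anyone's head:

```python
# Minimal stand-in for a parsed token (hypothetical class, for illustration only).
class Tok:
    def __init__(self, text, pos):
        self.text, self.pos_ = text, pos
        self.head = self          # spaCy convention: the root is its own head
        self.children = []

def attach(head, child):
    child.head = head
    head.children.append(child)

def pronoun_for(noun):
    """Climb from a noun to its head; at each step, look for a PRON
    among the head's children. Returns None if the root is reached
    without finding one."""
    node = noun
    while True:
        node = node.head
        for child in node.children:
            if child.pos_ == "PRON":
                return child
        if node.head is node:     # reached the root without a match
            return None

# "she is a doctor": is -> {she, doctor}, doctor -> {a}
she = Tok("she", "PRON")
is_ = Tok("is", "AUX")
a = Tok("a", "DET")
doctor = Tok("doctor", "NOUN")
attach(is_, she)
attach(is_, doctor)
attach(doctor, a)

pronoun_for(doctor)  # -> the "she" token
```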

Below you can see the running example for:

sentences=["she is a doctor", "he is a good person"]

('she', 'she', 'PRON', 'PRP', 'nsubj', 'xxx', True, True, is, she, she, '')
('is', 'be', 'AUX', 'VBZ', 'ROOT', 'xx', True, True, is, she, doctor, '')
('a', 'a', 'DET', 'DT', 'det', 'x', True, True, doctor, a, a, '')
('doctor', 'doctor', 'NOUN', 'NN', 'attr', 'xxxx', True, False, is, a, doctor, '')

So the first call returns is, the second call returns she, and then you stop.

And the same for:

('he', 'he', 'PRON', 'PRP', 'nsubj', 'xx', True, True, is, he, he, '')
('is', 'be', 'AUX', 'VBZ', 'ROOT', 'xx', True, True, is, he, person, '')
('a', 'a', 'DET', 'DT', 'det', 'x', True, True, person, a, a, '')
('good', 'good', 'ADJ', 'JJ', 'amod', 'xxxx', True, False, person, good, good, '')
('person', 'person', 'NOUN', 'NN', 'attr', 'xxxx', True, False, is, a, person, '')

So the first call returns is, the second call returns he, and then you stop.

Dharman
Nir Elbaz

Here is a clean way to implement your intended approach.

# put your nouns of interest here
NOUN_LIST = ["doctor", ...]

def find_stuff(text):
    doc = nlp(text)
    if len(doc) < 4: return None # too short
    
    for tok in doc[3:]:
        if tok.pos_ == "NOUN" and tok.text in NOUN_LIST and doc[tok.i-3].pos_ == "PRON":
            return (doc[tok.i-3].text, tok.text)

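To see exactly where the fixed offset breaks down, here is the same 3-words-back heuristic on plain word lists, with no spaCy at all (`NOUN_LIST` and `PRONOUNS` here are illustrative):

```python
# The same fixed-offset heuristic on plain word lists, for illustration.
NOUN_LIST = ["doctor", "person"]
PRONOUNS = {"she", "he", "it"}

def find_pair(words):
    # Look for a known noun whose word three places back is a pronoun.
    for i, w in enumerate(words):
        if i >= 3 and w in NOUN_LIST and words[i - 3] in PRONOUNS:
            return (words[i - 3], w)
    return None

find_pair("she is a doctor".split())      # -> ('she', 'doctor')
find_pair("he is a good person".split())  # -> None
```

The second sentence fails because the adjective pushes the pronoun four places back instead of three, which is exactly the kind of fragility the dependency-based approach avoids.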
As the other answer mentioned, though, this approach is the wrong one: what you really want are the subject and the object (or predicate nominative) of each sentence. You should use the DependencyMatcher for that. Here's an example:

from spacy.matcher import DependencyMatcher
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("she is a good person")

pattern = [
  # anchor token: verb, usually "is"
  {
    "RIGHT_ID": "verb",
    "RIGHT_ATTRS": {"POS": "AUX"}
  },
  # verb -> pronoun
  {
    "LEFT_ID": "verb",
    "REL_OP": ">",
    "RIGHT_ID": "pronoun",
    "RIGHT_ATTRS": {"DEP": "nsubj", "POS": "PRON"}
  },
  # predicate nominatives have "attr" relation
  {
    "LEFT_ID": "verb",
    "REL_OP": ">",
    "RIGHT_ID": "target",
    "RIGHT_ATTRS": {"DEP": "attr", "POS": "NOUN"}
  }
]

matcher = DependencyMatcher(nlp.vocab)
matcher.add("PREDNOM", [pattern])
matches = matcher(doc)

for match_id, (verb, pron, target) in matches:
    print(doc[pron], doc[verb], doc[target])

You can check dependency relations using displacy. You can learn more about what they are in the Jurafsky and Martin book.
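For a quick look at a parse without even loading a model, displacy also accepts hand-written tokens and arcs in manual mode. The parse below is hand-specified to match the structure discussed above, so treat it as an illustration rather than model output:

```python
from spacy import displacy

# Hand-specified parse of "she is a doctor" (manual mode: no model needed).
parse = {
    "words": [
        {"text": "she", "tag": "PRON"},
        {"text": "is", "tag": "AUX"},
        {"text": "a", "tag": "DET"},
        {"text": "doctor", "tag": "NOUN"},
    ],
    "arcs": [
        {"start": 0, "end": 1, "label": "nsubj", "dir": "left"},
        {"start": 2, "end": 3, "label": "det", "dir": "left"},
        {"start": 1, "end": 3, "label": "attr", "dir": "right"},
    ],
}

# Outside Jupyter, render() returns the SVG markup as a string;
# displacy.serve(parse, style="dep", manual=True) would serve it in a browser.
svg = displacy.render(parse, style="dep", manual=True, jupyter=False)
```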

polm23
  • 14,456
  • 7
  • 35
  • 59
  • Thank you! The version of SpaCy I'm able to run is an earlier version and doesn't have Dependency Matcher, from what I can tell, but the first approach works and is suitable for what I need. Except for one issue - my texts consist of a list of sentences separated by line breaks, which SpaCy isn't parsing as separate sentences, so it's working for the first sentence and not the rest of the text. Do you know how I can tell it to read each line of the text file separately? – beatrixx Jan 03 '22 at 20:43
  • DependencyMatcher has been around for a while - out of curiosity, what version of spaCy are you using and why can't you upgrade? – polm23 Jan 04 '22 at 04:21
  • Also, for the sentences, if you have a separate question then make a new question rather than asking in comments. – polm23 Jan 04 '22 at 04:22
  • I'm not sure what version I'm running, but I'm getting an error which is apparently specific to older versions of SpaCy and the solution is to upgrade it. I'm using Google Colaboratory. I might be wrong about not being able to upgrade but I've gotten the impression that I'd need to download it locally. I need to share the Google Colab file with other people and can't rely on them being able to download things locally. I'm pretty short on time though so I'm honestly not too worried about working that one out. – beatrixx Jan 04 '22 at 04:44
  • Uh, OK. I am absolutely sure you can use the latest version of spaCy in colab easily, and you don't have to download anything locally, so you might want to look into doing that. – polm23 Jan 04 '22 at 05:31