22

Can Python + NLTK be used to identify the subject of a sentence? From what I have learned till now is that a sentence can be broken into a head and its dependents. For e.g. "I shot an elephant". In this sentence, I and elephant are dependents to shot. But How do I discern that the subject in this sentence is I.

singhalc
  • 343
  • 1
  • 2
  • 8
  • I wanted to link my question to a similar one asked previously - [link](http://stackoverflow.com/questions/8841569/find-subject-in-incomplete-sentence-with-nltk?rq=1) and an answer given by @mjv. Perhaps the author of the question and/or the responder can shed more light. Thanks. – singhalc Feb 20 '15 at 22:18

7 Answers7

31

You can use Spacy.

Code

import spacy
nlp = spacy.load('en')
sent = "I shot an elephant"
doc=nlp(sent)

sub_toks = [tok for tok in doc if (tok.dep_ == "nsubj") ]

print(sub_toks) 
Ronak Shah
  • 377,200
  • 20
  • 156
  • 213
Sohel Khan
  • 661
  • 2
  • 8
  • 14
15

As NLTK book (exercise 29) says, "One common way of defining the subject of a sentence S in English is as the noun phrase that is the child of S and the sibling of VP."

Look at tree example: indeed, "I" is the noun phrase that is the child of S that is the sibling of VP, while "elephant" is not.

Nikita Astrakhantsev
  • 4,701
  • 1
  • 15
  • 26
  • 2
    Thanks for pointing me to the appropriate section. I was able to identify the NP using the examples in the book, but I understand now that identifying the subject will be a combination of two criteria- child of S and sibling of VP. Can you also point me to a code example that identifies the subject in a sentence? Thanks. – singhalc Feb 20 '15 at 22:14
  • 3
    This is an old post, but how do you generate the tree without manually defining it? I haven't seen that yet. – John Sly May 17 '17 at 20:03
6

English language has two voices: Active voice and passive voice. Lets take most used voice: Active voice.

It follows subject-verb-object model. To mark the subject, write a rule set with POS tags. Tag the sentence I[NOUN] shot[VERB] an elephant[NOUN]. If you see the first noun is subject, then there is a verb and then there is an object.

If you want to make it more complicated, a sentence- I shot an elephant with a gun. Here the prepositions or subordinate conjunctions like with, at, in can be given roles. Here the sentence will be tagged as I[NOUN] shot[VERB] an elephant[NOUN] with[IN] a gun[NOUN]. You can easily say that word with gets instrumentative role. You can build a rule based system to get role of every word in the sentence.

Also look at the patterns in passive voice and write rules for the same.

rishi
  • 2,564
  • 6
  • 25
  • 47
1

rake_nltk (pip install rake_nltk) is a python library that wraps nltk and apparently uses the RAKE algorithm.

from rake_nltk import Rake

rake = Rake()

kw = rake.extract_keywords_from_text("Can Python + NLTK be used to identify the subject of a sentence?")

ranked_phrases = rake.get_ranked_phrases()

print(ranked_phrases)

# outputs the keywords ordered by rank
>>> ['used', 'subject', 'sentence', 'python', 'nltk', 'identify']

By default the stopword list from nltk is used. You can provide your custom stopword list and punctuation chars by passing them in the constructor:

rake = Rake(stopwords='mystopwords.txt', punctuations=''',;:!@#$%^*/\''')

By default string.punctuation is used for punctuation.

The constructor also accepts a language keyword which can be any language supported by nltk.

ccpizza
  • 28,968
  • 18
  • 162
  • 169
0

Stanford Corenlp Tool can also be used to extract Subject-Relation-Object information of a sentence.

Attaching screenshot of same:

enter image description here

krish___na
  • 692
  • 7
  • 14
0

code using spacy : here the doc is the sentence and dep='nsubj' for subject and 'dobj' for object

import spacy 
nlp = spacy.load('en_core_web_lg')


def get_subject_object_phrase(doc, dep):
    doc = nlp(doc)
    for token in doc:
        if dep in token.dep_:
            subtree = list(token.subtree)
            start = subtree[0].i
            end = subtree[-1].i + 1
    return str(doc[start:end])
Abd Hendi
  • 1
  • 1
-3

You can paper over the issue by doing something like doc = nlp(text.decode('utf8')), but this will likely bring you more bugs in future.

Credits: https://github.com/explosion/spaCy/issues/380

roschach
  • 8,390
  • 14
  • 74
  • 124
cronos
  • 1
  • 1