How to identify the subject of a sentence?

Question

Can Python + NLTK be used to identify the subject of a sentence? From what I have learned till now is that a sentence can be broken into a head and its dependents. For e.g. "I shot an elephant". In this sentence, I and elephant are dependents to shot. But How do I discern that the subject in this sentence is I.

I wanted to link my question to a similar one asked previously - [link](http://stackoverflow.com/questions/8841569/find-subject-in-incomplete-sentence-with-nltk?rq=1) and an answer given by @mjv. Perhaps the author of the question and/or the responder can shed more light. Thanks. — singhalc, Feb 20 '15 at 22:18

score 31 · Answer 1 · edited Dec 23 '17 at 14:34

31

You can use Spacy.

Code

import spacy
nlp = spacy.load('en')
sent = "I shot an elephant"
doc=nlp(sent)

sub_toks = [tok for tok in doc if (tok.dep_ == "nsubj") ]

print(sub_toks)

edited Dec 23 '17 at 14:34

Ronak Shah

377,200
20
156
213

answered Dec 17 '16 at 19:17

Sohel Khan

661
2
8
14

score 15 · Accepted Answer · answered Feb 20 '15 at 16:30

15

As NLTK book (exercise 29) says, "One common way of defining the subject of a sentence S in English is as the noun phrase that is the child of S and the sibling of VP."

Look at tree example: indeed, "I" is the noun phrase that is the child of S that is the sibling of VP, while "elephant" is not.

answered Feb 20 '15 at 16:30

Nikita Astrakhantsev

4,701
1
15
26

2

Thanks for pointing me to the appropriate section. I was able to identify the NP using the examples in the book, but I understand now that identifying the subject will be a combination of two criteria- child of S and sibling of VP. Can you also point me to a code example that identifies the subject in a sentence? Thanks. – singhalc Feb 20 '15 at 22:14
3

This is an old post, but how do you generate the tree without manually defining it? I haven't seen that yet. – John Sly May 17 '17 at 20:03

score 6 · Answer 3 · answered Feb 20 '15 at 18:00

English language has two voices: Active voice and passive voice. Lets take most used voice: Active voice.

It follows subject-verb-object model. To mark the subject, write a rule set with POS tags. Tag the sentence I[NOUN] shot[VERB] an elephant[NOUN]. If you see the first noun is subject, then there is a verb and then there is an object.

If you want to make it more complicated, a sentence- I shot an elephant with a gun. Here the prepositions or subordinate conjunctions like with, at, in can be given roles. Here the sentence will be tagged as I[NOUN] shot[VERB] an elephant[NOUN] with[IN] a gun[NOUN]. You can easily say that word with gets instrumentative role. You can build a rule based system to get role of every word in the sentence.

Also look at the patterns in passive voice and write rules for the same.

How to find the active verb and passive verb? – Sundeep Pidugu May 18 '20 at 01:34 — Sundeep Pidugu, May 18 '20 at 01:34

ccpizza · Answer 4 · 2020-10-20T07:53:15.233

rake_nltk (pip install rake_nltk) is a python library that wraps nltk and apparently uses the RAKE algorithm.

from rake_nltk import Rake

rake = Rake()

kw = rake.extract_keywords_from_text("Can Python + NLTK be used to identify the subject of a sentence?")

ranked_phrases = rake.get_ranked_phrases()

print(ranked_phrases)

# outputs the keywords ordered by rank
>>> ['used', 'subject', 'sentence', 'python', 'nltk', 'identify']

By default the stopword list from nltk is used. You can provide your custom stopword list and punctuation chars by passing them in the constructor:

rake = Rake(stopwords='mystopwords.txt', punctuations=''',;:!@#$%^*/\''')

By default string.punctuation is used for punctuation.

The constructor also accepts a language keyword which can be any language supported by nltk.

is there any way in js ? – Chitrang Sharma Jul 01 '21 at 10:32 — Chitrang Sharma, Jul 01 '21 at 10:32

score 0 · Answer 5 · answered Sep 14 '20 at 09:02

0

Stanford Corenlp Tool can also be used to extract Subject-Relation-Object information of a sentence.

Attaching screenshot of same:

answered Sep 14 '20 at 09:02

krish___na

692
7
14

Abd Hendi · Answer 6 · 2021-08-14T13:19:34.253

0

code using spacy : here the doc is the sentence and dep='nsubj' for subject and 'dobj' for object

import spacy 
nlp = spacy.load('en_core_web_lg')


def get_subject_object_phrase(doc, dep):
    doc = nlp(doc)
    for token in doc:
        if dep in token.dep_:
            subtree = list(token.subtree)
            start = subtree[0].i
            end = subtree[-1].i + 1
    return str(doc[start:end])

edited Aug 14 '21 at 13:19

answered Aug 14 '21 at 13:10

Abd Hendi

1
1

score -3 · Answer 7 · edited Feb 06 '19 at 16:55

-3

You can paper over the issue by doing something like doc = nlp(text.decode('utf8')), but this will likely bring you more bugs in future.

Credits: https://github.com/explosion/spaCy/issues/380

edited Feb 06 '19 at 16:55

roschach

8,390
14
74
124

answered Oct 23 '17 at 22:01

cronos

1
1

1

? I don't understand this answer. – rayryeng Dec 27 '18 at 16:06
2

That issue is about giving non-unicode data to Spacy. Nothing to do with this question. – Darren Cook Feb 06 '19 at 14:42

How to identify the subject of a sentence?

7 Answers7

Code

Linked