
I am trying to find the span (start index, end index) of each noun phrase in a given sentence. The following is the code I use to extract the noun phrases:

import nltk

sent = nltk.word_tokenize(a)
sent_pos=nltk.pos_tag(sent)
grammar = r"""
    NBAR:
        {<NN.*|JJ>*<NN.*>}  # Nouns and Adjectives, terminated with Nouns

    NP:
        {<NBAR>}
        {<NBAR><IN><NBAR>}  # Above, connected with in/of/etc...
    VP:
        {<VBD><PP>?}
        {<VBZ><PP>?}
        {<VB><PP>?}
        {<VBN><PP>?}
        {<VBG><PP>?}
        {<VBP><PP>?}
"""

cp = nltk.RegexpParser(grammar)
result = cp.parse(sent_pos)
nounPhrases = []
for subtree in result.subtrees(filter=lambda t: t.label() == 'NP'):
  np = ''
  for x in subtree.leaves():
    np = np + ' ' + x[0]
  nounPhrases.append(np.strip())

For a = "The American Civil War, also known as the War between the States or simply the Civil War, was a civil war fought from 1861 to 1865 in the United States after several Southern slave states declared their secession and formed the Confederate States of America.", the noun phrases extracted are

['American Civil War', 'War', 'States', 'Civil War', 'civil war fought', 'United States', 'several Southern', 'states', 'secession', 'Confederate States', 'America'].

Now I need to find the span (start and end token position) of each noun phrase. For example, the spans of the above noun phrases would be

[(1,3), (9,9), (12, 12), (16, 17), (21, 23), ....].

I'm fairly new to NLTK and I've looked into http://www.nltk.org/_modules/nltk/tree.html. I tried to use Tree.treepositions() but I couldn't manage to extract absolute positions using these indices. Any help would be greatly appreciated. Thank You!

Corleone

3 Answers


NLTK has no built-in function that returns the offsets of strings/tokens, as highlighted in https://github.com/nltk/nltk/issues/1214

But you can use an ngram searcher that is used by the RIBES score from https://github.com/nltk/nltk/blob/develop/nltk/translate/ribes_score.py#L123

>>> from nltk import word_tokenize
>>> from nltk.translate.ribes_score import position_of_ngram
>>> s = word_tokenize("The American Civil War, also known as the War between the States or simply the Civil War, was a civil war fought from 1861 to 1865 in the United States after several Southern slave states declared their secession and formed the Confederate States of America.")
>>> position_of_ngram(tuple('American Civil War'.split()), s)
1
>>> position_of_ngram(tuple('Confederate States of America'.split()), s)
43

(It returns the starting position of the query ngram)

alvas
  • Thanks for the reply! But in the case of the NP *'War'*, there are multiple occurrences of the same word in the sentence, so `position_of_ngram(tuple('War'.split()), s)` will return the index of the first occurrence, which is 3, whereas the NP extracted is at index 9. Is there a workaround for this? Thanks again! – Corleone Apr 25 '16 at 05:57
  • @Corleone, since there's no notion of "offset", the best you can do is to either get the first instance of an ngram or recursively get the "next" instance. – alvas Apr 25 '16 at 10:45
  • @alvas, Thanks! I think I'll do that. I tried to accept the answer but my karma isn't good enough as of now :/ – Corleone Apr 25 '16 at 12:41
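
As a follow-up to the comments above, here is a minimal sketch of collecting every occurrence of an n-gram instead of only the first one. `ngram_positions` is a hypothetical helper written for illustration; it is not part of NLTK and works on any list of tokens. You would still have to decide which occurrence belongs to which noun phrase, for example by taking them in the order the phrases were extracted.

```python
def ngram_positions(ngram, tokens):
    """Return the start index of every occurrence of `ngram` in `tokens`."""
    ngram = tuple(ngram)
    n = len(ngram)
    if n == 0:
        return []  # an empty query matches nothing useful
    # Slide a window of length n over the tokens and compare it to the query
    return [
        i for i in range(len(tokens) - n + 1)
        if tuple(tokens[i:i + n]) == ngram
    ]


# Hand-tokenized fragment of the sentence from the question
tokens = "The American Civil War , also known as the War between the States".split()

print(ngram_positions(("War",), tokens))          # [3, 9]
print(ngram_positions(("Civil", "War"), tokens))  # [2]
```

An end index, if needed, is just `start + len(ngram) - 1`.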

Here is another approach that augments the tokens with their absolute positions in a tree string. The absolute positions can now be extracted from the leaves of any subtree.

from nltk.tree import ParentedTree

def add_indices_to_terminals(treestring):
    tree = ParentedTree.fromstring(treestring)
    for idx, _ in enumerate(tree.leaves()):
        tree_location = tree.leaf_treeposition(idx)
        non_terminal = tree[tree_location[:-1]]
        non_terminal[0] = non_terminal[0] + "_" + str(idx)
    return str(tree)

Example use case

>>> treestring = "(S (NP (NNP John)) (VP (V runs)))"
>>> print(add_indices_to_terminals(treestring))
(S (NP (NNP John_0)) (VP (V runs_1)))
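
To read a span back out of the suffixed leaves, something along these lines should work. This is only a sketch: `span_from_indexed_leaves` is a hypothetical helper, and it assumes the leaves come from a subtree of the indexed tree (for example `subtree.leaves()`), so that each one ends in `_<index>`.

```python
def span_from_indexed_leaves(leaves):
    """Return the inclusive (start, end) token span encoded in suffixed leaves."""
    # Split on the last underscore only, so tokens that contain "_" still work
    indices = [int(leaf.rsplit("_", 1)[1]) for leaf in leaves]
    return min(indices), max(indices)


print(span_from_indexed_leaves(["John_0"]))            # (0, 0)
print(span_from_indexed_leaves(["John_0", "runs_1"]))  # (0, 1)
```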
shyamupa

I obtained the token offsets of every node in a constituency parse tree with the code below:

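def get_tok_idx_of_tree(t, mapping_label_2_tok_idx, count_label, i):
    # Fills mapping_label_2_tok_idx in place with "<label>_<n>" -> (start, end) token offsets.
    # count_label is a one-element list used as a running node counter;
    # i is the index of t among its siblings.
    # Initial call: mapping = {}; get_tok_idx_of_tree(tree, mapping, [0], 0)
    # Relies on dicts preserving insertion order (Python 3.7+).
    if isinstance(t, str):
        return  # leaves (plain tokens) carry no label

    if count_label[0] == 0:
        # Root node: starts at the first token
        idx_start = 0
    elif i == 0:
        # First child: starts where its parent (the last node recorded) starts
        idx_start = mapping_label_2_tok_idx[list(mapping_label_2_tok_idx.keys())[-1]][0]
    else:
        # Later child: starts right after the end of the last node recorded
        idx_start = mapping_label_2_tok_idx[list(mapping_label_2_tok_idx.keys())[-1]][1] + 1

    idx_end = idx_start + len(t.leaves()) - 1
    mapping_label_2_tok_idx[t.label() + "_" + str(count_label[0])] = (idx_start, idx_end)
    count_label[0] += 1

    for i, child in enumerate(t):
        get_tok_idx_of_tree(child, mapping_label_2_tok_idx, count_label, i)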

    

Below is a constituent tree:

[image: constituent tree]

Output of the above code:

{'ROOT_0': (0, 3), 'S_1': (0, 3), 'VP_2': (0, 2), 'VB_3': (0, 0), 'NP_4': (1, 2), 'DT_5': (1, 1), 'NN_6': (2, 2), '._7': (3, 3)}
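
For the original question (spans of the `NP` chunks only), a more compact variant is to carry a running token index through a recursive walk. The sketch below is not the code above, just an illustration under some assumptions: `labeled_spans` is a hypothetical helper, and it only relies on nodes being iterable and exposing `label()`, which `nltk.Tree` does. The small `Node` class is a stand-in so the example runs without NLTK, and the tree is a hand-built fragment rather than real parser output.

```python
def labeled_spans(tree, wanted_label):
    """Collect inclusive (start, end) token spans of subtrees with `wanted_label`."""
    spans = []

    def walk(node, start):
        # Leaves (plain strings or (word, tag) tuples) have no label() method
        if not hasattr(node, "label"):
            return start + 1
        end = start
        for child in node:
            end = walk(child, end)  # each child begins where the previous one ended
        if node.label() == wanted_label:
            spans.append((start, end - 1))
        return end

    walk(tree, 0)
    return sorted(spans)


class Node(list):
    """Hypothetical stand-in for nltk.Tree: a list of children plus a label."""

    def __init__(self, label, children):
        super().__init__(children)
        self._label = label

    def label(self):
        return self._label


# Hand-built fragment: "The American Civil War , also known as the War"
tree = Node("S", [
    ("The", "DT"),
    Node("NP", [Node("NBAR", [("American", "NNP"), ("Civil", "NNP"), ("War", "NNP")])]),
    (",", ","), ("also", "RB"), ("known", "VBN"), ("as", "IN"), ("the", "DT"),
    Node("NP", [Node("NBAR", [("War", "NNP")])]),
])

print(labeled_spans(tree, "NP"))  # [(1, 3), (9, 9)]
```

With the question's code, the call would presumably be `labeled_spans(cp.parse(sent_pos), 'NP')`. Because the walk counts tokens instead of searching for strings, repeated words such as "War" are not a problem.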
krish___na