
I'm new to Python and struggle with the concept of data types and their conversions.

I have sentences in NLTK Tree format (obtained from the Stanford parser and converted to NLTK trees). I need to apply functions written for the output of the NLTK chunker. However, the tree I have differs from what the chunker produces. Both are NLTK trees, but the structure of their elements seems to be different (see below).

Could you please help me convert an NLTK tree to the NLTK chunker output format?

Thanks in advance!

Here is an NLTK Chunker output:

(S
  (NP Pierre/NNP Vinken/NNP)
  ,/,
  (NP 61/CD years/NNS old/JJ)
  ,/,
  will/MD
  join/VB
  (NP the/DT board/NN)
  as/IN
  (NP a/DT nonexecutive/JJ director/NN Nov./NNP 29/CD)
  ./.)

Here it is printed element by element, with the type of each element:

class 'nltk.tree.Tree' (NP Pierre/NNP Vinken/NNP)
type 'tuple' (',', ',')
class 'nltk.tree.Tree' (NP 61/CD years/NNS old/JJ)
type 'tuple' (',', ',')
type 'tuple' ('will', 'MD')
type 'tuple' ('join', 'VB')
class 'nltk.tree.Tree' (NP the/DT board/NN)
type 'tuple' ('as', 'IN')
class 'nltk.tree.Tree' (NP a/DT nonexecutive/JJ director/NN Nov./NNP 29/CD)
type 'tuple' ('.', '.')

Here is a "pure" NLTK Tree output (exactly as in the NLTK docs):

(S
  (NP
    (NP (NNP Pierre) (NNP Vinken))
    (, ,)
    (ADJP (NP (CD 61) (NNS years)) (JJ old))
    (, ,))
  (VP
    (MD will)
    (VP
      (VB join)
      (NP (DT the) (NN board))
      (PP (IN as) (NP (DT a) (JJ nonexecutive) (NN director) (NNP Nov.) (CD 29)))
      ))
  (. .))

Here it is printed element by element, with the type of each element:

class 'nltk.tree.Tree' (NP
  (NP (NNP Pierre) (NNP Vinken))
  (, ,)
  (ADJP (NP (CD 61) (NNS years)) (JJ old))
  (, ,))
class 'nltk.tree.Tree' (NP (NNP Pierre) (NNP Vinken))
class 'nltk.tree.Tree' (NNP Pierre)
type 'str' Pierre
class 'nltk.tree.Tree' (NNP Vinken)
type 'str' Vinken
class 'nltk.tree.Tree' (, ,)
type 'str' ,
class 'nltk.tree.Tree' (ADJP (NP (CD 61) (NNS years)) (JJ old))
class 'nltk.tree.Tree' (NP (CD 61) (NNS years))
class 'nltk.tree.Tree' (CD 61)
type 'str' 61
class 'nltk.tree.Tree' (NNS years)
type 'str' years
class 'nltk.tree.Tree' (JJ old)
type 'str' old
class 'nltk.tree.Tree' (, ,)
type 'str' ,
class 'nltk.tree.Tree' (VP
  (MD will)
  (VP
    (VB join)
    (NP (DT the) (NN board))
    (PP (IN as) (NP (DT a) (JJ nonexecutive) (NN director)))
    (NP (NNP Nov.) (CD 29))))
class 'nltk.tree.Tree' (MD will)
type 'str' will
class 'nltk.tree.Tree' (VP
  (VB join)
  (NP (DT the) (NN board))
  (PP (IN as) (NP (DT a) (JJ nonexecutive) (NN director)))
  (NP (NNP Nov.) (CD 29)))
class 'nltk.tree.Tree' (VB join)
type 'str' join
class 'nltk.tree.Tree' (NP (DT the) (NN board))
class 'nltk.tree.Tree' (DT the)
type 'str' the
class 'nltk.tree.Tree' (NN board)
type 'str' board
class 'nltk.tree.Tree' (PP (IN as) (NP (DT a) (JJ nonexecutive) (NN director)))
class 'nltk.tree.Tree' (IN as)
type 'str' as
class 'nltk.tree.Tree' (NP (DT a) (JJ nonexecutive) (NN director))
class 'nltk.tree.Tree' (DT a)
type 'str' a
class 'nltk.tree.Tree' (JJ nonexecutive)
type 'str' nonexecutive
class 'nltk.tree.Tree' (NN director)
type 'str' director
class 'nltk.tree.Tree' (NP (NNP Nov.) (CD 29))
class 'nltk.tree.Tree' (NNP Nov.)
type 'str' Nov.
class 'nltk.tree.Tree' (CD 29)
type 'str' 29
class 'nltk.tree.Tree' (. .)
type 'str' .
uzla
  • Welcome to Stack Overflow. I've edited your post because the comments in the block quotes were confusing for potential answerers: they looked like part of the code. – alvas Dec 30 '13 at 04:51
  • How did you get the NLTK chunker's output? What code generated it? – alvas Dec 30 '13 at 05:22
  • It is just a standard recursive walk through the tree: `def trav(tree): for tree_el in tree: print str(type(tree_el)), print tree_el if isinstance(tree_el, nltk.tree.Tree): trav(tree_el)` – uzla Dec 30 '13 at 16:48
  • Can you add a link showing how you got the second tree format "exactly as in the NLTK docs"? – alvas Dec 30 '13 at 17:59
  • That's easy: instantiate the nltk.Tree class and feed a suitably marked-up sentence to it: `tree = nltk.Tree(sent)` and then `print tree`. The sentence itself is built by a custom function from the Stanford parser output. It is too long and irrelevant to dump here. The point is that the output exactly matches the NLTK documentation requirements. – uzla Dec 30 '13 at 19:26

1 Answer


Partial answer (i.e., no code):

NLTK represents chunked data using the Tree class, which is really designed for arbitrary syntactic trees. A chunked sentence is a tree with just one level of grouping, so to go from a full parse to a chunked structure you need to discard all but one kind of non-recursive group. Which groups? That depends on your application, since there are different kinds of "chunks" (e.g., named entities).
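To make the difference between the two shapes concrete, here is a minimal sketch, assuming NLTK 3 (where `Tree.fromstring`, `Tree.leaves()` and `Tree.pos()` are available). The toy sentence is made up for illustration and is not from the question. In a full parse the leaves are plain strings under part-of-speech nodes; in a chunk tree the leaves are `(word, tag)` tuples under a single level of grouping.

```python
from nltk.tree import Tree

# A tiny full parse (illustrative sentence, not from the question).
full = Tree.fromstring("(S (NP (DT the) (NN board)) (VP (VB meets)))")

# Leaves of a full parse are plain strings.
words = full.leaves()   # ['the', 'board', 'meets']

# pos() pairs every leaf with the label of the node directly above it.
tagged = full.pos()     # [('the', 'DT'), ('board', 'NN'), ('meets', 'VB')]

# A chunk-style tree: one level of grouping over (word, tag) tuples.
chunked = Tree("S", [Tree("NP", tagged[:2]), tagged[2]])
print(chunked)
```

So the conversion mostly comes down to two things: turning each part-of-speech node into a tuple, and deciding which phrase nodes survive as chunks.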

Your example shows NP chunks, so you could walk the tree and omit all structure except for the top level of NP (or the lowest level, if you want to break up complex NP chunks into small ones).
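Although this answer stops short of code, the walk described above might be sketched roughly as follows. This is an illustration under stated assumptions, not a tested drop-in: `to_chunks` and its parameters are hypothetical names, NLTK 3 is assumed, and every word is assumed to sit under its own part-of-speech node (as in Penn Treebank style parses). The example tree is a shortened version of the sentence in the question.

```python
from nltk.tree import Tree

def to_chunks(tree, chunk_label="NP", lowest=False):
    """Flatten a full parse into a one-level chunk tree (hypothetical helper)."""

    def is_preterminal(node):
        # A POS node such as (NNP Pierre): exactly one child, and it is a string.
        return isinstance(node, Tree) and len(node) == 1 and isinstance(node[0], str)

    def is_chunk(node):
        if not isinstance(node, Tree) or node.label() != chunk_label:
            return False
        if not lowest:
            return True  # keep the topmost node carrying the chunk label
        # Lowest level only: reject nodes that contain another chunk-labelled node.
        return not any(sub.label() == chunk_label
                       for sub in node.subtrees() if sub is not node)

    def walk(node, out):
        for child in node:
            if is_preterminal(child):
                out.append((child[0], child.label()))          # (word, tag) tuple
            elif is_chunk(child):
                out.append(Tree(chunk_label, child.pos()))      # flat chunk
            elif isinstance(child, Tree):
                walk(child, out)                                # drop this level
        return out

    return Tree(tree.label(), walk(tree, []))

parse = Tree.fromstring(
    "(S (NP (NP (NNP Pierre) (NNP Vinken)) (, ,)"
    " (ADJP (NP (CD 61) (NNS years)) (JJ old)) (, ,))"
    " (VP (MD will) (VP (VB join) (NP (DT the) (NN board)))) (. .))"
)
top = to_chunks(parse)                # topmost NPs become chunks
low = to_chunks(parse, lowest=True)   # only innermost NPs become chunks
print(top)
print(low)
```

Neither variant is guaranteed to reproduce what a particular chunker would output. For instance, with `lowest=True` the word "old" ends up outside the "61 years" chunk, because the parser attached it to an ADJP rather than to the NP. Which grouping is right depends on what the downstream functions expect.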

alexis