Google SyntaxNet gives output like this:

saw VBD ROOT
 +-- Alice NNP nsubj
 |   +-- , , punct
 |   +-- reading VBG rcmod
 |       +-- who WP nsubj
 |       +-- had VBD aux
 |       +-- been VBN aux
 |       +-- about IN prep
 |           +-- SyntaxNet NNP pobj
 +-- , , punct
 +-- Bob NNP dobj
 +-- in IN prep
 |   +-- hallway NN pobj
 |       +-- the DT det
 +-- yesterday NN tmod
 +-- . . punct

I want to use Python to read and parse this output (string data) and print it in 'labelled bracket notation', like this:

[saw [Alice [,] [reading [who][had][been][about [SyntaxNet]]]][,][Bob][in [hallway [the]]][yesterday][.]]

Can you help me?

CPUU

1 Answer

Instead of parsing the tree, you can make SyntaxNet output everything in CoNLL format, which is easier to parse. The CoNLL output for your sentence looks like this:

1       Alice   _       NOUN    NNP     _       10      nsubj   _       _
2       ,       _       .       ,       _       1       punct   _       _
3       who     _       PRON    WP      _       6       nsubj   _       _
4       had     _       VERB    VBD     _       6       aux     _       _
5       been    _       VERB    VBN     _       6       aux     _       _
6       reading _       VERB    VBG     _       1       rcmod   _       _
7       about   _       ADP     IN      _       6       prep    _       _
8       SyntaxNet       _       NOUN    NNP     _       7       pobj    _       _
9       ,       _       .       ,       _       10      punct   _       _
10      saw     _       VERB    VBD     _       0       ROOT    _       _
11      Bob     _       NOUN    NNP     _       10      dobj    _       _
12      in      _       ADP     IN      _       10      prep    _       _
13      the     _       DET     DT      _       14      det     _       _
14      hallway _       NOUN    NN      _       12      pobj    _       _
15      yesterday       _       NOUN    NN      _       10      tmod    _       _
16      .       _       .       .       _       10      punct   _       _

The meaning of each column is described in the CoNLL data format documentation (http://ilk.uvt.nl/conll/#dataformat). The only columns we are concerned with at the moment are the first (the ID of the word), the second (the word itself) and the seventh (the head, in other words the ID of the parent). The root node has a head of 0.
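To make those three columns concrete, here is a minimal sketch that splits one row of the output above and picks out ID, FORM and HEAD. The helper name parse_row is made up for this example and is not part of SyntaxNet. It assumes no field is empty or contains whitespace.

```python
def parse_row(row):
    """Return (id, form, head) from one 10-column CoNLL row."""
    # split() with no argument splits on any run of tabs or spaces.
    columns = row.split()
    return int(columns[0]), columns[1], int(columns[6])

# Row 6 of the sentence above, written with tab separators.
row = "6\treading\t_\tVERB\tVBG\t_\t1\trcmod\t_\t_"
print(parse_row(row))  # (6, 'reading', 1)
```

So "reading" is word 6 and its parent is word 1, "Alice".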

To get the CoNLL format, we just have to comment out the last few lines of demo.sh (which I assume you used to get your output):

$PARSER_EVAL \
  --input=$INPUT_FORMAT \
  --output=stdout-conll \
  --hidden_layer_sizes=64 \
  --arg_prefix=brain_tagger \
  --graph_builder=structured \
  --task_context=$MODEL_DIR/context.pbtxt \
  --model_path=$MODEL_DIR/tagger-params \
  --slim_model \
  --batch_size=1024 \
  --alsologtostderr \
   | \
  $PARSER_EVAL \
  --input=stdin-conll \
  --output=stdout-conll \
  --hidden_layer_sizes=512,512 \
  --arg_prefix=brain_parser \
  --graph_builder=structured \
  --task_context=$MODEL_DIR/context.pbtxt \
  --model_path=$MODEL_DIR/parser-params \
  --slim_model \
  --batch_size=1024 \
  --alsologtostderr #\
#  | \
#  bazel-bin/syntaxnet/conll2tree \
#  --task_context=$MODEL_DIR/context.pbtxt \
#  --alsologtostderr

(Don't forget to also comment out the trailing backslash on the last --alsologtostderr line, as shown above.)

(where I got this trick from, see the comment)

When I run demo.sh myself I get a lot of output I do not need. How to get rid of it I leave for you to figure out (let me know :)). For now I copied the relevant part to a file so I can pipe it into the Python program I'm going to write. If you can get rid of the extra output, you should be able to pipe demo.sh directly into the Python program as well.
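One possible way to drop the extra lines, sketched under the assumption that they end up mixed into the same stream as the parse: keep only lines that look like CoNLL rows (ten fields with a numeric ID and HEAD). The function name conll_rows is hypothetical, the first sample line is made up rather than real SyntaxNet output, and I have not checked this against everything demo.sh prints.

```python
def conll_rows(lines):
    """Yield only the lines that look like 10-column CoNLL rows."""
    for line in lines:
        columns = line.split()
        # A data row has exactly ten fields, with a numeric ID and HEAD.
        if len(columns) == 10 and columns[0].isdigit() and columns[6].isdigit():
            yield line

sample = [
    "made-up log line, not real SyntaxNet output\n",
    "1\tAlice\t_\tNOUN\tNNP\t_\t10\tnsubj\t_\t_\n",
    "\n",
]
for row in conll_rows(sample):
    print(row.rstrip())
```

In a real script you would pass sys.stdin instead of sample. Note that this also drops blank lines, so it only suits input with one sentence at a time.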

Note: I'm fairly new to Python, so feel free to improve my code.

First, we just want to read the CoNLL data from standard input. I like to put each word in a nice class.

#!/usr/bin/env python

import sys

# Conll data format:
# http://ilk.uvt.nl/conll/#dataformat
#
# The only parts we need:
# 1: ID
# 2: FORM (The original word)
# 7: HEAD (The ID of its parent)

class Word:
    "A class containing the information of a single line from a conll file."

    def __init__(self, columns):
        self.id = int(columns[0])
        self.form = columns[1]
        self.head = int(columns[6])
        self.children = []

# Read the CoNLL input and put it in a list of words.
words = []
for line in sys.stdin:
    # Split on any whitespace (tabs or spaces); this also drops the newline.
    columns = line.split()

    # Skip blank lines, which would otherwise raise an IndexError in Word.
    if not columns:
        continue

    words.append(Word(columns))

Nice, but it isn't a tree structure yet. We have to do a little more work.

I could loop over the whole list once per word to find its children, but that would be inefficient. Instead I group the words by their parent, so getting every child of a given parent is just a quick lookup.

# Group the words by their head (parent); index 0 holds the root word.
lookup = [[] for _ in range(len(words) + 1)]
for word in words:
    lookup[word.head].append(word)

Create a tree structure:

# Build a tree
def buildTree(head):
    "Find the children for the given head in the lookup, recursively"

    # Get all the children of this parent.
    children = lookup[head]

    # Get the children of the children.
    for child in children:
        child.children = buildTree(child.id)

    return children

# Get the root word. Exactly one word should have head 0. The function returns
# a list of children, so just take the first one.
tree = buildTree(0)[0] # Start with head = 0 (the parent of the ROOT word)
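As an aside, here is an alternative sketch that is not the approach above: instead of a lookup list plus recursion, index the words by ID and attach each one to its parent in a single pass. Node and link are hypothetical names used only for this example, with Node standing in for the Word class.

```python
class Node:
    """Hypothetical stand-in for the Word class above."""
    def __init__(self, id, form, head):
        self.id = id
        self.form = form
        self.head = head
        self.children = []

def link(nodes):
    """Attach each node to its parent and return the root (head == 0)."""
    by_id = dict((node.id, node) for node in nodes)
    root = None
    for node in nodes:
        if node.head == 0:
            root = node
        else:
            by_id[node.head].children.append(node)
    return root

# Two made-up rows: "the" (1) depends on "hallway" (2), which is the root.
root = link([Node(1, "the", 2), Node(2, "hallway", 0)])
print(root.form + " -> " + ", ".join(child.form for child in root.children))
```

Children end up in input order either way, so the printed brackets should come out the same.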

To print the tree in the new format, you can add the special methods __str__ and __repr__ to the Word class:

def __str__(self):
    if len(self.children) == 0:
        return "[" + self.form + "]"
    else:
        return "[" + self.form + " " + "".join(str(child) for child in self.children) + "]"

def __repr__(self):
    return self.__str__()

Now you can just do this:

print(tree)

And pipe it like so:

cat input.conll | ./my_parser.py

or directly from SyntaxNet:

echo "Alice, who had been reading about SyntaxNet, saw Bob in the hallway yesterday." | syntaxnet/demo.sh | ./my_parser.py
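For reference, here is a self-contained sketch that puts the pieces together, with one addition: it handles several sentences, assuming the output follows the CoNLL convention of separating sentences with a blank line. The function names sentences and to_brackets are my own, the sample rows are made up, and children are assigned straight from the lookup table rather than through the recursive buildTree.

```python
class Word:
    """One CoNLL row: ID, FORM and HEAD, plus the children found later."""
    def __init__(self, columns):
        self.id = int(columns[0])
        self.form = columns[1]
        self.head = int(columns[6])
        self.children = []

    def __str__(self):
        if not self.children:
            return "[" + self.form + "]"
        return "[" + self.form + " " + "".join(str(c) for c in self.children) + "]"

def sentences(lines):
    """Group rows into sentences; a blank line ends a sentence."""
    words = []
    for line in lines:
        columns = line.split()
        if columns:
            words.append(Word(columns))
        elif words:
            yield words
            words = []
    if words:
        yield words

def to_brackets(words):
    """Build the tree for one sentence and return its bracket string."""
    # Group the words by head; index 0 holds the root word.
    lookup = [[] for _ in range(len(words) + 1)]
    for word in words:
        lookup[word.head].append(word)
    for word in words:
        word.children = lookup[word.id]
    return str(lookup[0][0])

# Two short made-up sentences, separated by a blank line.
sample = """\
1 the _ DET DT _ 2 det _ _
2 hallway _ NOUN NN _ 0 ROOT _ _

1 Alice _ NOUN NNP _ 2 nsubj _ _
2 saw _ VERB VBD _ 0 ROOT _ _
3 Bob _ NOUN NNP _ 2 dobj _ _
"""
for words in sentences(sample.splitlines()):
    print(to_brackets(words))
```

To use it as a filter, pass sys.stdin to sentences() instead of sample.splitlines().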

user3389196
  • Thanks very much! That gave me an idea: I modified syntaxnet/conll2tree.py, which already builds a tree, and used the logic of your "def __str__(self):" function when printing. So good! Thanks – CPUU Jun 14 '16 at 02:03
  • I use a client-server design. When the client sends a sentence, the server runs the parser (which prints a lot of information I don't need) and replies with only the output tree. – CPUU Jun 14 '16 at 02:04