Instead of parsing the tree, you can make SyntaxNet output everything in CoNLL format, which is easier to parse. The CoNLL output for your sentence looks like this:
1 Alice _ NOUN NNP _ 10 nsubj _ _
2 , _ . , _ 1 punct _ _
3 who _ PRON WP _ 6 nsubj _ _
4 had _ VERB VBD _ 6 aux _ _
5 been _ VERB VBN _ 6 aux _ _
6 reading _ VERB VBG _ 1 rcmod _ _
7 about _ ADP IN _ 6 prep _ _
8 SyntaxNet _ NOUN NNP _ 7 pobj _ _
9 , _ . , _ 10 punct _ _
10 saw _ VERB VBD _ 0 ROOT _ _
11 Bob _ NOUN NNP _ 10 dobj _ _
12 in _ ADP IN _ 10 prep _ _
13 the _ DET DT _ 14 det _ _
14 hallway _ NOUN NN _ 12 pobj _ _
15 yesterday _ NOUN NN _ 10 tmod _ _
16 . _ . . _ 10 punct _ _
The meaning of each column is described at http://ilk.uvt.nl/conll/#dataformat. The only columns we are concerned with at the moment are the first (the ID of the word), the second (the word itself) and the seventh (the head, in other words the ID of the parent). The root node has a parent of 0.
To get the CoNLL format we just have to comment out the last few lines of demo.sh (which I assume you used to get your output):
$PARSER_EVAL \
--input=$INPUT_FORMAT \
--output=stdout-conll \
--hidden_layer_sizes=64 \
--arg_prefix=brain_tagger \
--graph_builder=structured \
--task_context=$MODEL_DIR/context.pbtxt \
--model_path=$MODEL_DIR/tagger-params \
--slim_model \
--batch_size=1024 \
--alsologtostderr \
| \
$PARSER_EVAL \
--input=stdin-conll \
--output=stdout-conll \
--hidden_layer_sizes=512,512 \
--arg_prefix=brain_parser \
--graph_builder=structured \
--task_context=$MODEL_DIR/context.pbtxt \
--model_path=$MODEL_DIR/parser-params \
--slim_model \
--batch_size=1024 \
--alsologtostderr #\
# | \
# bazel-bin/syntaxnet/conll2tree \
# --task_context=$MODEL_DIR/context.pbtxt \
# --alsologtostderr
(don't forget to comment out the backslash after the last --alsologtostderr that is still active, as shown above)
(where I got this trick from, see the comment)
When I run demo.sh myself, I get a lot of information I do not need. I leave it to you to figure out how to get rid of that (let me know :)).
I copied the relevant part to a file for now so I can pipe it into the Python program I'm going to write. If you can get rid of the extra information, you should be able to pipe demo.sh directly into the Python program as well.
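One possible way to drop the extra information inside the Python program itself is to keep only lines that look like CoNLL rows. This is a minimal sketch, not something SyntaxNet provides: the helper names `is_conll_row` and `conll_rows` are made up for illustration, the noise line in the sample is invented, and it assumes that a data row always has exactly 10 columns with a numeric ID and HEAD.

```python
def is_conll_row(line):
    """Return True if the line looks like a 10-column CoNLL row."""
    columns = line.split()
    # Assumption: a data row has exactly 10 columns, with a numeric ID
    # (column 1) and a numeric HEAD (column 7).
    return len(columns) == 10 and columns[0].isdigit() and columns[6].isdigit()


def conll_rows(lines):
    """Yield only the lines that look like CoNLL rows, dropping everything else."""
    for line in lines:
        if is_conll_row(line):
            yield line


# Made-up sample: one line of unrelated text followed by two CoNLL rows.
sample = [
    "some unrelated log line",
    "1 Alice _ NOUN NNP _ 2 nsubj _ _",
    "2 saw _ VERB VBD _ 0 ROOT _ _",
    "",
]
for row in conll_rows(sample):
    print(row)
```

With something like this in place, the reading loop could iterate over `conll_rows(sys.stdin)` instead of `sys.stdin`.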
Note: I'm fairly new to Python, so feel free to improve my code.
First, we just want to read the conll file from the input. I like to put each word in a nice class.
#!/usr/bin/env python
import sys

# CoNLL data format:
# http://ilk.uvt.nl/conll/#dataformat
#
# The only parts we need:
# 1: ID
# 2: FORM (The original word)
# 7: HEAD (The ID of its parent)
class Word:
    "A class containing the information of a single line from a CoNLL file."
    def __init__(self, columns):
        self.id = int(columns[0])
        self.form = columns[1]
        self.head = int(columns[6])
        self.children = []

# Read the CoNLL input and put it in a list of words.
words = []
for line in sys.stdin:
    # Split on any whitespace (spaces or tabs); this also drops the newline
    # and any empty columns, and gives a real list on Python 2 and 3.
    columns = line.split()
    # Skip blank lines.
    if columns:
        words.append(Word(columns))
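The reading loop above treats the whole input as a single sentence. CoNLL files conventionally separate sentences with a blank line and restart the IDs at 1, so input with several sentences would need to be split first. A minimal sketch of that step follows; the function name `split_sentences` and the sample rows are made up for illustration.

```python
def split_sentences(lines):
    """Group CoNLL rows into sentences, using blank lines as separators."""
    sentence = []
    for line in lines:
        columns = line.split()
        if columns:
            sentence.append(columns)
        elif sentence:
            # A blank line closes the current sentence.
            yield sentence
            sentence = []
    if sentence:
        # The last sentence may not be followed by a blank line.
        yield sentence


# Made-up sample with two short sentences.
sample = [
    "1 Alice _ NOUN NNP _ 2 nsubj _ _",
    "2 saw _ VERB VBD _ 0 ROOT _ _",
    "",
    "1 Bob _ NOUN NNP _ 2 nsubj _ _",
    "2 left _ VERB VBD _ 0 ROOT _ _",
]
sentences = list(split_sentences(sample))
print([[columns[1] for columns in s] for s in sentences])
# [['Alice', 'saw'], ['Bob', 'left']]
```

Each yielded sentence is a list of column lists, so every one of them could be turned into its own list of words and its own tree.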
Nice, but it isn't a tree structure yet. We have to do a little more work.
I could loop over the whole list again and again to look up the children of every word, but that would be inefficient. Instead, I group the words by their parent, so that getting every child of a given parent is just a quick lookup.
# Group the words by their head (parent).
# Index 0 holds the children of the root; index i holds the children of word i.
lookup = [[] for _ in range(len(words) + 1)]
for word in words:
    lookup[word.head].append(word)
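To make the lookup table concrete, here is a small worked example on a three-word sentence. It uses plain `(id, form, head)` tuples instead of the `Word` class so that it runs on its own; the sentence and the variable names are only illustrative.

```python
# (id, form, head) triples for "Alice saw Bob"; head 0 marks the root.
rows = [(1, "Alice", 2), (2, "saw", 0), (3, "Bob", 2)]

# One bucket per possible head: index 0 for the root, 1..len(rows) for the words.
buckets = [[] for _ in range(len(rows) + 1)]
for word_id, form, head in rows:
    buckets[head].append(form)

print(buckets)  # [['saw'], [], ['Alice', 'Bob'], []]
```

Building the table takes one pass over the words, and finding the children of a word afterwards is a single list index.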
Create a tree structure:
# Build a tree
def buildTree(head):
    "Find the children for the given head in the lookup, recursively"
    # Get all the children of this parent.
    children = lookup[head]
    # Get the children of the children.
    for child in children:
        child.children = buildTree(child.id)
    return children

# Get the root's child. There should only be one child. The function returns an
# array of children so just get the first one.
tree = buildTree(0)[0] # Start with head = 0 (which is the ROOT node)
To print the tree in a new format, you could override a couple of methods in the Word class:
    def __str__(self):
        if len(self.children) == 0:
            return "[" + self.form + "]"
        else:
            return "[" + self.form + " " + "".join(str(child) for child in self.children) + "]"
    def __repr__(self):
        return self.__str__()
Now you can just do this:
print(tree)
And pipe it like so:
cat input.conll | ./my_parser.py
or directly from SyntaxNet:
echo "Alice, who had been reading about SyntaxNet, saw Bob in the hallway yesterday." | syntaxnet/demo.sh | ./my_parser.py
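To see what the bracket notation looks like without running SyntaxNet, here is a small self-contained sketch. The `Node` class is a hypothetical stand-in for `Word` that keeps only the form and the children, and the tree is built by hand for the made-up sentence "Alice saw the hallway".

```python
class Node:
    """Hypothetical stand-in for the Word class: just a form and its children."""
    def __init__(self, form, children=None):
        self.form = form
        self.children = children or []

    def __str__(self):
        # A leaf is printed as [form]; an inner node as [form child1child2...].
        if not self.children:
            return "[" + self.form + "]"
        return "[" + self.form + " " + "".join(str(c) for c in self.children) + "]"


# "saw" is the root; "Alice" and "hallway" depend on it, "the" depends on "hallway".
tree = Node("saw", [Node("Alice"), Node("hallway", [Node("the")])])
print(tree)  # [saw [Alice][hallway [the]]]
```

Every word opens its own bracket, and its children are nested inside it in the order they appear in the sentence.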