I have a JSON file on which I need to do the following:
- Run just the "text" field of the JSON through SyntaxNet.
- From the SyntaxNet output, create a new JSON field that looks like:
text_syntaxnet = [{'word': <WORD1>, 'position': <word_position>, 'pos_tag': <POS_TAG>}, {...}]
- Add this new JSON field to the original JSON that came in as input.
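To make the target shape concrete, here is a minimal sketch of one record after the new field is added. The sentence, the tags, and the names record and line_out are hypothetical, chosen only for illustration; positions are kept as strings because they would come straight from splitting the tagger's text output.

```python
import json

# Hypothetical input record; only the "text" field matters here.
record = {"text": "Bob runs"}

# Hypothetical tagger result, one dict per word.
record["text_syntaxnet"] = [
    {"word": "Bob", "position": "1", "pos_tag": "NNP"},
    {"word": "runs", "position": "2", "pos_tag": "VBZ"},
]

# Serialize back to a single line so it can be streamed as one record.
line_out = json.dumps(record, sort_keys=True)
```

Emitting exactly one JSON document per output line keeps the result easy to consume as one record per input line.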
I am doing this using Pig Streaming. I would like to stream the input data to the script parse.py, whose contents are:
import sys
import re
import subprocess
import json

def create_new_json_field(tags_list):
    word_tags = {}
    new_json_field = []
    for line in tags_list:
        line = line.strip()
        if not line:
            continue
        else:
            words = line.split()
            word_tags['word'] = words[1]
            word_tags['position'] = words[0]
            word_tags['pos_tag'] = words[4]
            new_json_field.append(word_tags.copy())
    return new_json_field

def main(argv):
    try:
        for line in sys.stdin:
            json_original = json.loads(line)
            print json_original
            tags = subprocess.check_output('./parse.sh %s' % line, shell=True)
            tags_list = tags.split('\n')
            new_json_field = create_new_json_field(tags_list)
            result = json_original['text_syntaxnet'] = new_json_field
            print new_json_field
            print result
    except Exception as e:
        sys.stdout.write(str(e))

main(sys.argv)
The contents of parse.sh are:
#!/bin/sh
cd ........../models/syntaxnet
jq --raw-output '.["text"]' | syntaxnet/demo.sh
The part of the code that calls parse.sh does not work; everything else works. I am not sure whether the problem is the syntax of the command or an environment issue. Could someone please help me debug this?
NOTE: The subprocess call works when parse.py does not use the for line in sys.stdin loop. But I want to keep the loop, because I need to parse the input line by line and then create the JSON objects.
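One detail that may be relevant: parse.sh reads its input from standard input (jq is given no file argument), while parse.py passes the line as a command-line argument. Below is a minimal sketch, assuming that mismatch matters, of feeding one line to a child process's stdin instead. The helper run_with_stdin and the stand-in child command are hypothetical; in parse.py the command would be ['./parse.sh'].

```python
import subprocess
import sys

def run_with_stdin(command, text):
    """Run `command` (an argument list) and write `text` to its stdin."""
    proc = subprocess.Popen(command,
                            stdin=subprocess.PIPE,
                            stdout=subprocess.PIPE)
    # communicate() writes the bytes, closes stdin, and collects stdout.
    out, _ = proc.communicate(text.encode('utf-8'))
    if proc.returncode != 0:
        raise RuntimeError('command failed with exit code %d' % proc.returncode)
    return out.decode('utf-8')

# Stand-in child that upper-cases whatever arrives on stdin (hypothetical;
# it only demonstrates that the child receives the line through stdin).
echo_upper = [sys.executable, '-c',
              'import sys; sys.stdout.write(sys.stdin.read().upper())']
result = run_with_stdin(echo_upper, '{"text": "hello"}\n')
```

Passing the line through a pipe also avoids shell quoting problems, since JSON text contains quotes and spaces that a shell would otherwise try to interpret.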
Thanks!