I have a JSON file on which I need to do the following:

  1. Run just the "text" field of each JSON record through SyntaxNet.
  2. From the SyntaxNet output, create a new JSON field that looks like: text_syntaxnet = [{'word': <WORD1>, 'position': <word_position>, 'pos_tag': <POS_TAG>}, {...}]
  3. Add this new field to the original JSON record that came in as input.
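
As an illustration of steps 2 and 3, the sketch below builds the text_syntaxnet field from tagger output and attaches it to a record. It assumes whitespace-separated, CoNLL-style lines in which column 0 is the position, column 1 the word, and column 4 the POS tag (the same indices parse.py relies on). The name add_syntaxnet_field and the sample data are hypothetical, and the code targets Python 3 rather than the Python 2 used in the question.

```python
import json


def add_syntaxnet_field(record, tagged_lines):
    """Return a copy of record with a 'text_syntaxnet' list added."""
    tokens = []
    for raw in tagged_lines:
        cols = raw.split()
        # Skip blank or truncated lines rather than raising IndexError.
        if len(cols) < 5:
            continue
        tokens.append({"word": cols[1], "position": cols[0], "pos_tag": cols[4]})
    enriched = dict(record)  # shallow copy; the input record is left untouched
    enriched["text_syntaxnet"] = tokens
    return enriched


if __name__ == "__main__":
    sample_record = {"id": 7, "text": "Bob runs"}  # made-up input record
    sample_lines = [
        "1\tBob\t_\tNOUN\tNNP\t_\t2\tnsubj\t_\t_",
        "2\truns\t_\tVERB\tVBZ\t_\t0\tROOT\t_\t_",
        "",
    ]
    print(json.dumps(add_syntaxnet_field(sample_record, sample_lines)))
```

Building a fresh dict per token avoids the shared-dictionary pattern that otherwise needs an explicit copy() on every append.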

I am doing this using Pig Streaming. I would like to stream the input data to the script parse.py, whose contents are:

import sys
import re
import subprocess
import json


def create_new_json_field(tags_list):
    word_tags = {}
    new_json_field = []
    for line in tags_list:
        line = line.strip()
        if not line:
            continue
        else:
            words = line.split()
            word_tags['word'] = words[1]
            word_tags['position'] = words[0]
            word_tags['pos_tag'] = words[4]
            new_json_field.append(word_tags.copy())
    return new_json_field


def main(argv):
    try:
        for line in sys.stdin:
            json_original = json.loads(line)
            print json_original
            tags = subprocess.check_output('./parse.sh %s' % line, shell=True)
            tags_list = tags.split('\n')
            new_json_field = create_new_json_field(tags_list)
            result = json_original['text_syntaxnet'] = new_json_field
            print new_json_field
            print result
    except Exception as e:
        sys.stdout.write(str(e))

main(sys.argv)
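
A side note on the output of the script above: `print json_original` writes a Python dict representation rather than JSON, and writing exceptions to stdout mixes error text into the data stream. The sketch below is one way to emit exactly one JSON document per input line and keep errors on stderr, assuming that is what the downstream consumer expects. The names process_stream and transform are hypothetical, and the code is Python 3.

```python
import io
import json
import sys


def process_stream(lines, transform, out=sys.stdout, err=sys.stderr):
    """Apply transform to each JSON line; write one JSON line per record."""
    written = 0
    for number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue
        try:
            record = transform(json.loads(line))
        except Exception as exc:  # keep going; report the bad line on stderr
            err.write("line %d skipped: %s\n" % (number, exc))
            continue
        out.write(json.dumps(record) + "\n")
        written += 1
    return written


if __name__ == "__main__":
    # In-memory demo with made-up records; a real run would pass sys.stdin.
    demo_input = io.StringIO('{"text": "hello"}\nnot json\n')
    process_stream(demo_input, lambda rec: dict(rec, text_syntaxnet=[]))
```

Catching the exception per line, instead of around the whole loop, means one malformed record does not stop the rest of the input from being processed.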

The contents of parse.sh are:

#!/bin/sh
cd ........../models/syntaxnet
jq --raw-output '.["text"]' | syntaxnet/demo.sh

The part of the code that calls parse.sh does not work; everything else works. I am not sure whether the problem is the syntax of the command or an environment issue. Could someone please help me debug this?

NOTE: The subprocess call works when I don't use `for line in sys.stdin` in parse.py. But I want to do it this way because I want to parse the input line by line and then create the JSON objects.
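
One possible explanation, offered as an assumption rather than a confirmed diagnosis: `'./parse.sh %s' % line` passes the JSON to the script as shell arguments, while the jq command inside parse.sh reads its input from standard input. Because the child process inherits the parent's stdin, it would compete with the `for line in sys.stdin` loop for the same stream, which is consistent with the call working once the loop is removed. The Python 3 sketch below instead hands each line to the child on its own stdin. The helper run_with_stdin is hypothetical, and the demo command is only a stand-in for ./parse.sh.

```python
import subprocess
import sys


def run_with_stdin(command, line):
    """Run command, feed it one line on stdin, and return its stdout as text."""
    completed = subprocess.run(
        command,
        input=line,                # sent to the child's stdin, not as an argument
        stdout=subprocess.PIPE,
        universal_newlines=True,   # exchange str instead of bytes
        check=True,                # raise CalledProcessError on non-zero exit
    )
    return completed.stdout


if __name__ == "__main__":
    # Stand-in child process that upper-cases its stdin; in the question's
    # setup the command would be ["./parse.sh"] instead.
    demo = [sys.executable, "-c",
            "import sys; sys.stdout.write(sys.stdin.read().upper())"]
    print(run_with_stdin(demo, '{"text": "hello"}\n'), end="")
```

Passing the command as a list without shell=True also sidesteps shell quoting problems that arise when raw JSON is interpolated into a command string.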

Thanks!

  • What is ....... from `cd ........../models/syntaxnet`? – CristiFati Sep 12 '16 at 19:15
  • I changed to the models/syntaxnet directory, from where I need to run the syntaxnet/demo.sh command to get SyntaxNet running. It is a long path, so I just wrote it that way. – kskp Sep 12 '16 at 20:15
  • Oh, I thought it was literal. One thing about iterating over `sys.stdin`: when does that stop? When you hit Ctrl + C? And are all the previously entered lines processed at that time? Hint: you could write the contents of those lines to a file once, and have your program iterate over that file, instead of typing them every time you run your program. Also, what exactly doesn't work? What's the error? What should it do? – CristiFati Sep 12 '16 at 20:35
