
I have a JSON Lines (.jsonl) file in which each line contains a sentence and the tokens found in that sentence. I want to extract the tokens from every line of the file, but my loop only returns the tokens from the last line.

This is the input.

{"text":"This is the first sentence.","_input_hash":2083129218,"_task_hash":-536378640,"spans":[],"meta":{"score":0.5,"pattern":65},"answer":"accept","tokens":[
{"text":"This","id":0},
{"text":"is","id":1},
{"text":"the","id":2},
{"text":"first","id":3},
{"text":"sentence","id":4},
{"text":".","id":5}]}
{"text":"This is the second sentence.","_input_hash":2083129218,"_task_hash":-536378640,"spans":[],"meta":{"score":0.5,"pattern":65},"answer":"accept","tokens":[
{"text":"This","id":0},
{"text":"is","id":1},
{"text":"the","id":2},
{"text":"second","id":3},
{"text":"sentence","id":4},
{"text":".","id":5}]}

I have tried running the following code:

with jsonlines.open('path/to/file') as reader:
    for obj in reader:
        data = obj['tokens'] # just extract the tokens
        data = [(i['text'], i['id']) for i in data] # elements from the tokens

data

The actual result:

[('This', 0), ('is', 1), ('the', 2), ('first', 3), ('sentence', 4), ('.', 5)]

The result I want to get:

[Image: spreadsheet screenshot of the desired output]

Additional question

Some tokens contain a "label" instead of an "id". How could I incorporate that into the code? An example would be:

{"text":"This is the first sentence.","_input_hash":2083129218,"_task_hash":-536378640,"spans":[],"meta":{"score":0.5,"pattern":65},"answer":"accept","tokens":[
{"text":"This","id":0},
{"text":"is","id":1},
{"text":"the","id":2},
{"text":"first","id":3},
{"text":"sentence","id":4},
{"text":".","id":5}]}
{"text":"This is coded in python.","_input_hash":2083129218,"_task_hash":-536378640,"spans":[],"meta":{"score":0.5,"pattern":65},"answer":"accept","tokens":[
{"text":"This","id":0},
{"text":"is","id":1},
{"text":"coded","id":2},
{"text":"in","id":3},
{"text":"python","label":"Programming"},
{"text":".","id":5}]}
  • You can use triple backticks to format your data, and correct the data yourself: replace | by { and \ by }. Take a look at https://stackoverflow.com/editing-help#syntax-highlighting – Devesh Kumar Singh May 26 '19 at 14:20
  • BTW, `jq -r '[.text, .id] | @csv'` might do what you want with no Python at all. – Charles Duffy May 26 '19 at 14:42
  • Beyond that, I'm surprised you're getting only the *first* line rather than only the *last* one, since your code reassigns `data` over and over. If you want a list of results, it would make sense to have a separate list object. (OTOH, since you're showing us a spreadsheet screenshot, not a Python data structure, we need to guess what kind of data structure you want that spreadsheet to map to). – Charles Duffy May 26 '19 at 14:43
  • Thanks @bjornvandijkman, I added an answer below, please check it out – Devesh Kumar Singh May 26 '19 at 14:52

2 Answers

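import csv
import jsonlines

# Write one row per token; sentence numbers and token IDs are shifted to start at 1
with open('data.csv', 'w', newline='') as f, jsonlines.open('path/to/file') as reader:
    writer = csv.writer(f)
    writer.writerow(['Sentence', 'Word', 'ID'])
    for sentence_no, obj in enumerate(reader, start=1):
        for token in obj['tokens']:
            writer.writerow([sentence_no, token['text'], token['id'] + 1])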
Smart Manoj

Some issues/changes in the code

  • You are reassigning the variable data on every iteration of the loop, so you only see the result for the last JSON line. Instead, you want to extend a list on each iteration.

  • You want to use enumerate on the reader iterator to get the index of each line, which is used as the first item of each tuple.

The code then changes to

import jsonlines

data = []
# Open the JSON Lines file
with jsonlines.open('file.txt') as reader:
    # Iterate over each line of the reader, with its index, via enumerate
    for idx, obj in enumerate(reader):

        # Extend the result with one (sentence number, text, id) tuple per token
        data.extend([(idx+1, i['text'], i['id']+1) for i in obj['tokens']])

print(data)

Or more compact by making a double for-loop in the list comprehension itself

import jsonlines

#Open the file, iterate over the tokens and make the tuples
result = [(idx+1, i['text'], i['id']+1) for idx, obj in enumerate(jsonlines.open('file.txt')) for i in obj['tokens']]

print(result)

The output will be

[
(1, 'This', 1), 
(1, 'is', 2), 
(1, 'the', 3), 
(1, 'first', 4), 
(1, 'sentence', 5), 
(1, '.', 6), 
(2, 'This', 1), 
(2, 'is', 2), 
(2, 'the', 3), 
(2, 'second', 4), 
(2, 'sentence', 5), 
(2, '.', 6)
]
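
For the additional question about tokens that carry a "label" instead of an "id", one possible approach is to check which key is present before reading it. The sketch below is only an illustration under a few assumptions: it parses the lines with the standard-library json module rather than jsonlines so that it runs on its own, the helper name extract_rows and the sample line are hypothetical, and it assumes a label should simply take the place of the id in the last tuple position (with None when a token has neither key).

```python
import json


def extract_rows(lines):
    """Return (sentence_no, text, id_or_label) tuples from JSON Lines strings.

    Hypothetical helper: `lines` is any iterable of strings, one JSON object each.
    """
    rows = []
    # Drop blank lines first so that sentence numbers stay consecutive
    non_blank = (line for line in lines if line.strip())
    for sentence_no, line in enumerate(non_blank, start=1):
        obj = json.loads(line)
        for token in obj["tokens"]:
            if "id" in token:
                value = token["id"] + 1  # shift ids to start at 1, as in the code above
            else:
                value = token.get("label")  # assumption: None if neither key exists
            rows.append((sentence_no, token["text"], value))
    return rows


# Shortened, made-up sample line with one "id" token and one "label" token
sample = [
    '{"text": "This is coded in python.", "tokens": '
    '[{"text": "in", "id": 3}, {"text": "python", "label": "Programming"}]}',
]
print(extract_rows(sample))
```

With jsonlines, the same inner loop should apply unchanged to each obj yielded by the reader. If ids and labels need to end up in separate columns instead, the tuple could carry both token.get("id") and token.get("label").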
Devesh Kumar Singh