How to extract elements from each line in a jsonline file?

Question

I have a jsonl file which contains per line both a sentence and the tokens that are found in that sentence. I wish to extract the tokens from each line in the JSON lines file, but my loop only returns the tokens from the last line.

This is the input.

{"text":"This is the first sentence.","_input_hash":2083129218,"_task_hash":-536378640,"spans":[],"meta":{"score":0.5,"pattern":65},"answer":"accept","tokens":[
{"text":"This","id":0},
{"text":"is","id":1},
{"text":"the","id":2},
{"text":"first","id":3},
{"text":"sentence","id":4},
{"text":".","id":5}]}
{"text":"This is the second sentence.","_input_hash":2083129218,"_task_hash":-536378640,"spans":[],"meta":{"score":0.5,"pattern":65},"answer":"accept","tokens":[
{"text":"This","id":0},
{"text":"is","id":1},
{"text":"the","id":2},
{"text":"second","id":3},
{"text":"sentence","id":4},
{"text":".","id":5}]}

I have tried running the following code:

with jsonlines.open('path/to/file') as reader:
    for obj in reader:
        data = obj['tokens'] # just extract the tokens
        data = [(i['text'], i['id']) for i in data] # elements from the tokens

data

The actual result:

[('This', 0), ('is', 1), ('the', 2), ('first', 3), ('sentence', 4), ('.', 5)]

The result I want to get (shown as a spreadsheet screenshot in the original post):

Additional question

Some tokens contain a "label" instead of an "id". How could I incorporate that into the code? An example would be:

{"text":"This is the first sentence.","_input_hash":2083129218,"_task_hash":-536378640,"spans":[],"meta":{"score":0.5,"pattern":65},"answer":"accept","tokens":[
{"text":"This","id":0},
{"text":"is","id":1},
{"text":"the","id":2},
{"text":"first","id":3},
{"text":"sentence","id":4},
{"text":".","id":5}]}
{"text":"This is coded in python.","_input_hash":2083129218,"_task_hash":-536378640,"spans":[],"meta":{"score":0.5,"pattern":65},"answer":"accept","tokens":[
{"text":"This","id":0},
{"text":"is","id":1},
{"text":"coded","id":2},
{"text":"in","id":3},
{"text":"python","label":"Programming"},
{"text":".","id":5}]}

You can use triple backticks to format your data, and correct the data yourself: replace | by { and \ by }. Take a look at https://stackoverflow.com/editing-help#syntax-highlighting — Devesh Kumar Singh, May 26 '19 at 14:20
BTW, `jq -r '[.text, .id] | @csv'` might do what you want with no Python at all. — Charles Duffy, May 26 '19 at 14:42
Beyond that -- I'm surprised you're getting only the *first* line, not only the *last* one, since your code reassigns `data` over and over. If you want a list of results, it would make sense to have a separate list object. (OTOH, since you're showing us a spreadsheet screenshot, not a Python datastructure, we need to guess what kind of datastructure you want that spreadsheet to map to). — Charles Duffy, May 26 '19 at 14:43
Thanks @bjornvandijkman, I added an answer below, please check it out — Devesh Kumar Singh, May 26 '19 at 14:52

score 1 · Answer 1 · answered May 26 '19 at 14:48

import jsonlines

# Write one space-separated row per token: sentence number, word, 1-based token id
with open('data.csv', 'w') as f, jsonlines.open('path/to/file') as reader:
    print('Sentence', 'Word', 'ID', file=f)
    for sentence_no, obj in enumerate(reader):
        for i in obj['tokens']:
            print(sentence_no + 1, i['text'], i['id'] + 1, file=f)

Devesh Kumar Singh · Accepted Answer · 2019-05-26T14:54:52.007

Some issues/changes in the code

You reassign the variable data on every iteration of the loop, so you only see the result for the last JSON line; instead, you want to extend a list on each iteration.
Use enumerate on the reader iterator to get the sentence index, which becomes the first item of each tuple.

The code then changes to

import jsonlines

data = []
# Open the JSON Lines file
with jsonlines.open('file.txt') as reader:
    # Iterate over each line of the reader; enumerate supplies the 0-based line index
    for idx, obj in enumerate(reader):

        # Extend the result with one (sentence number, text, 1-based id) tuple per token
        data.extend([(idx+1, i['text'], i['id']+1) for i in obj['tokens']])

print(data)

Or, more compactly, with a double for-loop in the list comprehension itself

import jsonlines

#Open the file, iterate over the tokens and make the tuples
result = [(idx+1, i['text'], i['id']+1) for idx, obj in enumerate(jsonlines.open('file.txt')) for i in obj['tokens']]

print(result)

The output will be

[
(1, 'This', 1), 
(1, 'is', 2), 
(1, 'the', 3), 
(1, 'first', 4), 
(1, 'sentence', 5), 
(1, '.', 6), 
(2, 'This', 1), 
(2, 'is', 2), 
(2, 'the', 3), 
(2, 'second', 4), 
(2, 'sentence', 5), 
(2, '.', 6)
]
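
The additional question (tokens that carry a "label" instead of an "id") is not covered by the code above, which would raise a KeyError on i['id'] for such a token. One possible way to handle it, as a sketch rather than a definitive solution, is to look both keys up with dict.get and store None for whichever one is missing. The function name extract_tokens and the four-column tuple layout are assumptions made for this illustration; the in-memory records stand in for the parsed lines of the file.

```python
def extract_tokens(records):
    """Flatten records into (sentence_no, text, id_or_None, label_or_None) tuples."""
    rows = []
    for sentence_no, record in enumerate(records, start=1):
        for token in record.get("tokens", []):
            token_id = token.get("id")  # None when the token has no "id"
            rows.append((
                sentence_no,
                token["text"],
                # Keep the 1-based numbering used in the answer above
                token_id + 1 if token_id is not None else None,
                token.get("label"),  # None when the token has no "label"
            ))
    return rows


# Hypothetical in-memory records, shortened from the example in the question
records = [
    {"text": "This is coded in python.", "tokens": [
        {"text": "in", "id": 3},
        {"text": "python", "label": "Programming"},
    ]},
]
rows = extract_tokens(records)
print(rows)
# [(1, 'in', 4, None), (1, 'python', None, 'Programming')]
```

Because the function only needs an iterable of dicts, the reader returned by jsonlines.open('file.txt') can be passed to it directly inside a with block.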
