I am extracting extra fields from a JSONL file using json2csv.py (compiled using twarc), and am having trouble extracting some text fields that are held within an array. This is the array, and I want to be able to pull out just the hashtag text.

"entities": {
      "hashtags": [
        {
          "text": "NoJusticeNoPeace",
          "indices": [
            65,
            82
          ]
        },
        {
          "text": "justiceforNaledi",
          "indices": [
            83,
            100
          ]
        },

I am able to extract other fields that aren't inside arrays using this command:

python json2csv.py tweets_may.jsonl -e full_text retweeted_status.extended_tweet.full_text > testfull_text.csv

However, I can't work out how to pull out the array, or elements of it. The text of an individual hashtag can be identified with the path retweeted_status.extended_tweet.entities.hashtags.0.text, so I've tried using:

python json2csv.py tweets_may.jsonl -e all_hashtags retweeted_status.extended_tweet.entities.hashtags.0.text > testhash.csv

But this just returns an empty column. Ideally I would like to be able to pull out all occurrences of 'text' within the 'hashtags' array into either a single column or separate columns.

2 Answers

As Adam already said, you can just use the json module to access this kind of file.

For instance, when I have the following in file.jsonl:

{
    "entities": {
        "hashtags": [
            {
            "text": "NoJusticeNoPeace",
            "indices": [
                65,
                82
            ]
            },
            {
            "text": "justiceforNaledi",
            "indices": [
                83,
                100
            ]
            }
        ]
    }
}

To access the information stored in this file you can do the following:

import json

with open('file.jsonl','r') as file:
    jsonl = json.load(file)

This jsonl variable is now just a dictionary you can access like you normally would.

hashtags = jsonl['entities']['hashtags']
print(hashtags[0]['text'])
# prints: NoJusticeNoPeace
print(hashtags[1]['indices'])
# prints: [83, 100]
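
Note that json.load works here because file.jsonl holds a single object. A real JSONL file holds one JSON object per line, so it has to be parsed line by line. Here is a minimal sketch of that, assuming every line is a complete tweet object shaped like the example above; the helper names hashtag_texts and read_jsonl are hypothetical, and the sample lines stand in for the file contents:

```python
import json


def hashtag_texts(tweet):
    """Return every hashtag 'text' value in one tweet dict (empty list if none)."""
    hashtags = tweet.get("entities", {}).get("hashtags", [])
    return [tag["text"] for tag in hashtags]


def read_jsonl(lines):
    """Parse an iterable of JSONL lines into dicts, skipping blank lines."""
    return [json.loads(line) for line in lines if line.strip()]


# Two sample lines standing in for the contents of a .jsonl file (assumed data).
sample_lines = [
    '{"entities": {"hashtags": [{"text": "NoJusticeNoPeace", "indices": [65, 82]},'
    ' {"text": "justiceforNaledi", "indices": [83, 100]}]}}',
    '{"entities": {"hashtags": []}}',
]

for tweet in read_jsonl(sample_lines):
    print(hashtag_texts(tweet))
```

With a real file you would pass the open file object instead of the sample list, for example read_jsonl(open('tweets_may.jsonl')), since iterating over a file yields its lines.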
Daan Klijn

json module: json encoder and decoder

JSON (JavaScript Object Notation), specified by RFC 7159 (which obsoletes RFC 4627) and by ECMA-404, is a lightweight data interchange format inspired by JavaScript object literal syntax (although it is not a strict subset of JavaScript)...

I encourage you to read more in the Python documentation for the json encoder and decoder module.

Following up on my comment: the json module and json.load() do all the work for you. Just import it and call its API.

If you are on Python 3.x:

import json
import pprint
json_file_path="t.json"

json_data = {}

with open(json_file_path,'r') as jp:
    json_data=json.load(jp)
    pprint.pprint(json_data)
    # since hashtags is a list (JSON array), we access its elements by index:
    var = json_data['entities']['hashtags'][0]['text']
    print("var is : {}".format(var))
    print("var type is : {}".format(type(var)))    

Python 3.x console output of the above code:

{'entities': {'hashtags': [{'indices': [65, 82], 'text': 'NoJusticeNoPeace'},
                           {'indices': [83, 100], 'text': 'justiceforNaledi'}]}}
var is : NoJusticeNoPeace
var type is : <class 'str'>

On Python 2.x the only change is to omit the parentheses from the print lines. But there is one major difference between the outputs of the above script.

On Python 3 the dictionary values are of type str, which is ready for use. On Python 2 they are of type <type 'unicode'>, so be aware. You can convert one with str(var), but that raises UnicodeEncodeError if the text contains non-ASCII characters; var.encode('utf-8') is the safer conversion.
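
To get back to the original goal of a single CSV column holding all the hashtags, the same json calls can be combined with the csv module. The following is only a sketch under stated assumptions: the dotted path is the one quoted in the question, the helper names get_path and write_hashtag_csv are hypothetical, the column name all_hashtags is borrowed from the question's command, and the sample tweets are made up to match the structure shown there.

```python
import csv
import io
import json

# Path quoted in the question; adjust it if your tweets are shaped differently.
HASHTAG_PATH = "retweeted_status.extended_tweet.entities.hashtags"


def get_path(obj, dotted_path, default=None):
    """Follow a dotted path such as 'a.b.c' through nested dicts."""
    for key in dotted_path.split("."):
        if not isinstance(obj, dict) or key not in obj:
            return default
        obj = obj[key]
    return obj


def write_hashtag_csv(jsonl_lines, out, path=HASHTAG_PATH):
    """Write one CSV row per tweet, with all hashtag texts joined by spaces."""
    writer = csv.writer(out)
    writer.writerow(["all_hashtags"])
    for line in jsonl_lines:
        if not line.strip():
            continue  # skip blank lines
        tweet = json.loads(line)
        hashtags = get_path(tweet, path, default=[])
        writer.writerow([" ".join(tag["text"] for tag in hashtags)])


# Assumed sample data: one retweet with two hashtags, one tweet without the path.
with_tags = {"retweeted_status": {"extended_tweet": {"entities": {"hashtags": [
    {"text": "NoJusticeNoPeace", "indices": [65, 82]},
    {"text": "justiceforNaledi", "indices": [83, 100]},
]}}}}
sample_lines = [json.dumps(with_tags), json.dumps({"full_text": "no retweet here"})]

out = io.StringIO()
write_hashtag_csv(sample_lines, out)
print(out.getvalue())
```

To write a real file, replace the StringIO object with open('testhash.csv', 'w', newline='') and pass the opened JSONL file as jsonl_lines. Joining with a space keeps everything in one column; writing the list elements as separate fields would give one column per hashtag instead.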

Adam