Reading nested JSON file in Pandas dataframe

Question

I have a JSON file with the following structure (it's not the complete json file, but the structure is the same):

{"data":[{"referenced_tweets":[{"type":"retweeted","id":"xxxxxxx"}],"text":"abcdefghijkl","created_at":"2020-03-09T00:11:41.000Z","author_id":"xxxxx","id":"xxxxxxxxxxx"},{"referenced_tweets":[{"type":"retweeted","id":"xxxxxxxxxxxx"}],"text":"abcdefghijkl","created_at":"2020-03-09T00:11:41.000Z","author_id":"xxxxxxxx","id":"xxxxxxxxxxx"}]}
..... 

//The rest of json continues with the same structure, but referenced_tweets is not always present

My question: How can I load this data into a dataframe with these columns: type, id(referenced_tweet id), text, created_at, author_id, and id (tweet id)?

What I could do so far: I could get the following columns:

referenced_tweets	text	cerated_at	author_id	id (tweet id)
[{'type': 'xx', 'id': 'xxx'}]	xxx	xxxx	xxxxx	xxxxxxxxxxxx

Here is the code to get the above table:

with open('Test_SampleRetweets.json') as json_file:
    data_list = json.load(json_file)

df1 = json_normalize(data_list, 'data')
df1.head()

However, I'd like to get the type and id (in referenced_tweets) in separate columns and I could get the following so far:

type	id (referenced_tweet id)
xxxx	xxxxxxxxxxxxxxxxxxxxxxx

and here is the code to get the above table:

df2 = json_normalize(data_list, record_path=['data','referenced_tweets'], errors='ignore')
df2.head()

What is the problem? I'd like to get everything in one table, i.e., a table similar to the first one here but with type and id in separate columns (like the 2nd table). So, the final columns should be : type, id (referenced_tweet id), text, created_at, author_id, and id (tweet id).

dukkee · Accepted Answer · 2021-03-27T10:40:47.437

1

import pandas as pd


with open('Test_SampleRetweets.json') as json_file:
    raw_data = json.load(json_file)


data = []
for item in raw_data["data"]:
    item["tweet_id"] = item["id"]
    item.update(item["referenced_tweets"][0])
    del item["referenced_tweets"]
    data.append(item)


df1 = pd.DataFrame(data)
print(df1.head())

edited Mar 27 '21 at 10:40

answered Mar 27 '21 at 10:00

dukkee

1,112
1
9
17

Thanks for your answer. Your code almost gives me the columns I need. I just don't know why the `id (tweet id)` column is missing. Do you have any idea? – mOna Mar 27 '21 at 10:37
There should be two `id` columns. One id refers to referenced_tweet id and the other one (the missing one) is tweet id – mOna Mar 27 '21 at 10:38
@mOna ah sorry, I see, sec, i will update my answer – dukkee Mar 27 '21 at 10:40
@mOna it was fixed – dukkee Mar 27 '21 at 10:41
Thanks for the update. Could I ask why only `id` was missing? I mean all other columns (e.g., author_id, conversation_id) were present – mOna Mar 27 '21 at 10:44
Sorry, I just have one more question. Could I ask what does `[0]`.. in `item.update(item["referenced_tweets"][0])` means? and why you used [0]? – mOna Mar 30 '21 at 16:07
@mOna, sure, because in your original structure value of "referenced_tweets" isn't a dict, but a list with one element (this dict), therefore we need to use access by index 0 in this list – dukkee Mar 30 '21 at 19:44
Got it. Thank you! – mOna Apr 01 '21 at 11:15

Racooneer · Answer 2 · 2021-03-27T11:05:39.670

when working with a nested json in json_normalize(), you need to work with the meta parameter to get the fields in the meta level. So, essentially what you are doing is taking the nested and normalizing it and than left joining several other field from a level above. Apparently, you can combine this for several nested fields, see this for reference.

import json
import pandas as pd

j = '{"data":[{"referenced_tweets":[{"type":"retweeted","id":"xxxxxxx"}],"text":"abcdefghijkl","created_at":"2020-03-09T00:11:41.000Z","author_id":"xxxxx","id":"xxxxxxxxxxx"},{"referenced_tweets":[{"type":"retweeted","id":"xxxxxxxxxxxx"}],"text":"abcdefghijkl","created_at":"2020-03-09T00:11:41.000Z","author_id":"xxxxxxxx","id":"xxxxxxxxxxx"}]}'
j = json.loads(j)

# since you have id twice, it's a bit more complicated and you need to 
# introduce a meta prefix
df = pd.json_normalize(
    j,
    record_path=["data", 'referenced_tweets'],
    meta_prefix="data.",
    meta=[["data", "text"], ["data", "created_at"], ["data", "author_id"], ["data", "id"]]
    )
print(df)

resulting in:

        type            id data.data.text      data.data.created_at  \
0  retweeted       xxxxxxx   abcdefghijkl  2020-03-09T00:11:41.000Z   
1  retweeted  xxxxxxxxxxxx   abcdefghijkl  2020-03-09T00:11:41.000Z   

  data.data.author_id data.data.id  
0               xxxxx  xxxxxxxxxxx  
1            xxxxxxxx  xxxxxxxxxx

I would prefere it this way, since it seems simpler to handle

df = pd.json_normalize(
    j["data"],
    record_path=['referenced_tweets'],
    meta_prefix="data.",
    meta=["text", "created_at", "author_id", "id"]
    )
print(df)

resulting in:

        type            id     data.text           data.created_at  \
0  retweeted       xxxxxxx  abcdefghijkl  2020-03-09T00:11:41.000Z   
1  retweeted  xxxxxxxxxxxx  abcdefghijkl  2020-03-09T00:11:41.000Z   

  data.author_id      data.id  
0          xxxxx  xxxxxxxxxxx  
1       xxxxxxxx  xxxxxxxxxxx

I get this error when I run your code: `module 'pandas' has no attribute 'json_normalize'`. I imported the following too: `import json`, `from pandas.io.json import json_normalize` and `import pandas as pd` — mOna, Mar 27 '21 at 11:19
you could update pandas, you are probably using something before v.1.2.3. `json_normalize` was in io tools before that version. You may get a future waring now when using `pandas.io.json` — Racooneer, Mar 27 '21 at 11:36

Reading nested JSON file in Pandas dataframe

2 Answers2