0

I have a JSON file with the following structure (it's not the complete json file, but the structure is the same):

{"data":[{"referenced_tweets":[{"type":"retweeted","id":"xxxxxxx"}],"text":"abcdefghijkl","created_at":"2020-03-09T00:11:41.000Z","author_id":"xxxxx","id":"xxxxxxxxxxx"},{"referenced_tweets":[{"type":"retweeted","id":"xxxxxxxxxxxx"}],"text":"abcdefghijkl","created_at":"2020-03-09T00:11:41.000Z","author_id":"xxxxxxxx","id":"xxxxxxxxxxx"}]}
..... 

//The rest of json continues with the same structure, but referenced_tweets is not always present  

My question: How can I load this data into a dataframe with these columns: type, id(referenced_tweet id), text, created_at, author_id, and id (tweet id)?

What I could do so far: I could get the following columns:

referenced_tweets text cerated_at author_id id (tweet id)
[{'type': 'xx', 'id': 'xxx'}] xxx xxxx xxxxx xxxxxxxxxxxx

Here is the code to get the above table:

with open('Test_SampleRetweets.json') as json_file:
    data_list = json.load(json_file)

df1 = json_normalize(data_list, 'data')
df1.head()

However, I'd like to get the type and id (in referenced_tweets) in separate columns and I could get the following so far:

type id (referenced_tweet id)
xxxx xxxxxxxxxxxxxxxxxxxxxxx

and here is the code to get the above table:

df2 = json_normalize(data_list, record_path=['data','referenced_tweets'], errors='ignore')
df2.head()

What is the problem? I'd like to get everything in one table, i.e., a table similar to the first one here but with type and id in separate columns (like the 2nd table). So, the final columns should be : type, id (referenced_tweet id), text, created_at, author_id, and id (tweet id).

halfer
  • 19,824
  • 17
  • 99
  • 186
mOna
  • 2,341
  • 9
  • 36
  • 60

2 Answers2

1
import pandas as pd


with open('Test_SampleRetweets.json') as json_file:
    raw_data = json.load(json_file)


data = []
for item in raw_data["data"]:
    item["tweet_id"] = item["id"]
    item.update(item["referenced_tweets"][0])
    del item["referenced_tweets"]
    data.append(item)


df1 = pd.DataFrame(data)
print(df1.head())
dukkee
  • 1,112
  • 1
  • 9
  • 17
  • Thanks for your answer. Your code almost gives me the columns I need. I just don't know why the `id (tweet id)` column is missing. Do you have any idea? – mOna Mar 27 '21 at 10:37
  • There should be two `id` columns. One id refers to referenced_tweet id and the other one (the missing one) is tweet id – mOna Mar 27 '21 at 10:38
  • @mOna ah sorry, I see, sec, i will update my answer – dukkee Mar 27 '21 at 10:40
  • @mOna it was fixed – dukkee Mar 27 '21 at 10:41
  • Thanks for the update. Could I ask why only `id` was missing? I mean all other columns (e.g., author_id, conversation_id) were present – mOna Mar 27 '21 at 10:44
  • Sorry, I just have one more question. Could I ask what does `[0]`.. in `item.update(item["referenced_tweets"][0])` means? and why you used [0]? – mOna Mar 30 '21 at 16:07
  • @mOna, sure, because in your original structure value of "referenced_tweets" isn't a dict, but a list with one element (this dict), therefore we need to use access by index 0 in this list – dukkee Mar 30 '21 at 19:44
  • Got it. Thank you! – mOna Apr 01 '21 at 11:15
1

when working with a nested json in json_normalize(), you need to work with the meta parameter to get the fields in the meta level. So, essentially what you are doing is taking the nested and normalizing it and than left joining several other field from a level above. Apparently, you can combine this for several nested fields, see this for reference.

import json
import pandas as pd

j = '{"data":[{"referenced_tweets":[{"type":"retweeted","id":"xxxxxxx"}],"text":"abcdefghijkl","created_at":"2020-03-09T00:11:41.000Z","author_id":"xxxxx","id":"xxxxxxxxxxx"},{"referenced_tweets":[{"type":"retweeted","id":"xxxxxxxxxxxx"}],"text":"abcdefghijkl","created_at":"2020-03-09T00:11:41.000Z","author_id":"xxxxxxxx","id":"xxxxxxxxxxx"}]}'
j = json.loads(j)

# since you have id twice, it's a bit more complicated and you need to 
# introduce a meta prefix
df = pd.json_normalize(
    j,
    record_path=["data", 'referenced_tweets'],
    meta_prefix="data.",
    meta=[["data", "text"], ["data", "created_at"], ["data", "author_id"], ["data", "id"]]
    )
print(df)

resulting in:

        type            id data.data.text      data.data.created_at  \
0  retweeted       xxxxxxx   abcdefghijkl  2020-03-09T00:11:41.000Z   
1  retweeted  xxxxxxxxxxxx   abcdefghijkl  2020-03-09T00:11:41.000Z   

  data.data.author_id data.data.id  
0               xxxxx  xxxxxxxxxxx  
1            xxxxxxxx  xxxxxxxxxx

I would prefere it this way, since it seems simpler to handle

df = pd.json_normalize(
    j["data"],
    record_path=['referenced_tweets'],
    meta_prefix="data.",
    meta=["text", "created_at", "author_id", "id"]
    )
print(df)

resulting in:

        type            id     data.text           data.created_at  \
0  retweeted       xxxxxxx  abcdefghijkl  2020-03-09T00:11:41.000Z   
1  retweeted  xxxxxxxxxxxx  abcdefghijkl  2020-03-09T00:11:41.000Z   

  data.author_id      data.id  
0          xxxxx  xxxxxxxxxxx  
1       xxxxxxxx  xxxxxxxxxxx 
Racooneer
  • 329
  • 1
  • 2
  • 11
  • Thanks for your answer. I appreciate it. – mOna Mar 27 '21 at 11:14
  • I get this error when I run your code: `module 'pandas' has no attribute 'json_normalize'`. I imported the following too: `import json`, `from pandas.io.json import json_normalize` and `import pandas as pd` – mOna Mar 27 '21 at 11:19
  • you could update pandas, you are probably using something before v.1.2.3. `json_normalize` was in io tools before that version. You may get a future waring now when using `pandas.io.json` – Racooneer Mar 27 '21 at 11:36