How to read a column of JSONL data that may be null in a JSON file using pandas.read_json

Question

In the sample data shown below, I try to select multiple columns using df[["mentioneduser","place","hashtags"]]:

{..."mentionedUsers": null,"place": {"_type": "snscrape.modules.twitter.Place", "fullName": "Berlin, Germany", "name": "Berlin", "type": "city", "country": "Germany", "countryCode": "DE"}, "hashtags": null, ...}

Now I try to select the country inside the place column using df[["mentioneduser","country","hashtags"]], and it gives me this error:

KeyError: "['country'] not in index"

Note that in some records the place column contains only null, e.g. {..."mentionedUsers": null,"place": null, "hashtags": null, ...}. Is that the reason this does not work? If so, is there a way to work around it?

Edit: sample data

{"_type": "snscrape.modules.twitter.Tweet", "url": "https://twitter.com/QruxB/status/1344431618279284739", "date": "2020-12-30T23:54:10+00:00", "content": "a fucking gay furry", "renderedContent": "a fucking gay furry", "id": 1344431618279284739, "user": {"_type": "snscrape.modules.twitter.User", "username": "QruxB", "id": 1283418985522966532, "displayname": "Qrux_bot", "description": "Bot made by: @qrux5", "rawDescription": "Bot made by: @qrux5", "descriptionUrls": null, "verified": false, "created": "2020-07-15T15:11:51+00:00", "followersCount": 21, "friendsCount": 7, "statusesCount": 40918, "favouritesCount": 1, "listedCount": 2, "mediaCount": 1, "location": "", "protected": false, "linkUrl": "https://www.youtube.com/watch?v=j4Ph02gzqmY", "linkTcourl": null, "profileImageUrl": "https://pbs.twimg.com/profile_images/1329475104867295232/SSCHVJTw_normal.jpg", "profileBannerUrl": "https://pbs.twimg.com/profile_banners/1283418985522966532/1597230131", "url": "https://twitter.com/QruxB"}, "replyCount": 0, "retweetCount": 0, "likeCount": 0, "quoteCount": 0, "conversationId": 1344431618279284739, "lang": "en", "source": "<a href=\"https://cheapbotsdonequick.com\" rel=\"nofollow\">Cheap Bots, Done Quick!</a>", "sourceUrl": "https://cheapbotsdonequick.com", "sourceLabel": "Cheap Bots, Done Quick!", "outlinks": null, "tcooutlinks": null, "media": null, "retweetedTweet": null, "quotedTweet": null, "inReplyToTweetId": null, "inReplyToUser": null, "mentionedUsers": null, "coordinates": null, "place": {"_type": "snscrape.modules.twitter.Place", "fullName": "Berlin, Germany", "name": "Berlin", "type": "city", "country": "Germany", "countryCode": "DE"}, "hashtags": null, "cashtags": null}

Edit: upload my data structures

Goh Jia Yi · Answer 1 · 2021-05-28T09:46:22.273

mentionedUsers, place and hashtags are imported as columns, whereas country is imported as a row label under the place column (see below).

                  mentionedUsers                           place  hashtags
_type                        NaN  snscrape.modules.twitter.Place       NaN
username                     NaN                             NaN       NaN
id                           NaN                             NaN       NaN
displayname                  NaN                             NaN       NaN
description                  NaN                             NaN       NaN
rawDescription               NaN                             NaN       NaN
descriptionUrls              NaN                             NaN       NaN
verified                     NaN                             NaN       NaN
created                      NaN                             NaN       NaN
followersCount               NaN                             NaN       NaN
friendsCount                 NaN                             NaN       NaN
statusesCount                NaN                             NaN       NaN
favouritesCount              NaN                             NaN       NaN
listedCount                  NaN                             NaN       NaN
mediaCount                   NaN                             NaN       NaN
location                     NaN                             NaN       NaN
protected                    NaN                             NaN       NaN
linkUrl                      NaN                             NaN       NaN
linkTcourl                   NaN                             NaN       NaN
profileImageUrl              NaN                             NaN       NaN
profileBannerUrl             NaN                             NaN       NaN
url                          NaN                             NaN       NaN
fullName                     NaN                 Berlin, Germany       NaN
name                         NaN                          Berlin       NaN
type                         NaN                            city       NaN
country                      NaN                         Germany       NaN
countryCode                  NaN                              DE       NaN

df[["mentionedUsers","country","hashtags"]] will not work since "country" is under "place". Thus, if you want to get country you will have to use df["place"].loc["country"].

Your best bet is to perform some data manipulation, something like below:

df = df[["mentionedUsers","place","hashtags"]]
country = df["place"].loc["country"]
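
If the file is line-delimited (one tweet per line), another option is to read it with lines=True and pull the country out of each place value. The following is only a minimal sketch under that assumption: the two records are hypothetical, trimmed-down versions of the sample, and country_of is a helper name made up for illustration. It treats a null place as "no country" instead of raising an error.

```python
import io

import pandas as pd

# Hypothetical, trimmed-down records modelled on the sample data:
# one tweet with a place and one where "place" is null.
jsonl = (
    '{"content": "a", "mentionedUsers": null, '
    '"place": {"country": "Germany", "countryCode": "DE"}, "hashtags": null}\n'
    '{"content": "b", "mentionedUsers": null, "place": null, "hashtags": null}\n'
)

# lines=True tells pandas to parse one JSON object per line (JSONL).
df = pd.read_json(io.StringIO(jsonl), lines=True)


def country_of(place):
    """Return the country of a place dict, or None when place is null."""
    return place.get("country") if isinstance(place, dict) else None


# Build a real "country" column so it can be selected like any other column.
df["country"] = df["place"].apply(country_of)
result = df[["mentionedUsers", "country", "hashtags"]]
print(result)
```

With a real file, the io.StringIO(jsonl) argument would be replaced by the path to the JSONL file.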

On a side note, I don't see the point of importing the JSON data in table form just to build a CSV for plotting a graph. I would suggest using the json library instead. (Note: This suggestion is made without knowing the full data structure, given that mentionedUsers and hashtags are null in the sample JSON provided.)

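import json

# Assumes test.json holds a single JSON object (one tweet), as in the sample.
with open("test.json") as f:
    tweet = json.load(f)

mentionedUsers = tweet['mentionedUsers']
# "place" can be null, in which case there is no country to read.
country = tweet['place']['country'] if tweet['place'] else None
hashtags = tweet['hashtags']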
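
For a JSONL file with many tweets, the same idea can be applied line by line. The sketch below is an illustration only, not a definitive implementation: extract_rows is a hypothetical helper, the two records are made-up stand-ins for the real data, and in-memory buffers replace the real input and output files. It keeps only content and country (the structure mentioned in the comments) and writes them out as CSV.

```python
import csv
import io
import json


def extract_rows(lines):
    """Collect content and country from an iterable of JSONL lines."""
    rows = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        tweet = json.loads(line)
        # "place" may be null; fall back to an empty dict so .get() is safe.
        place = tweet.get("place") or {}
        rows.append({"content": tweet.get("content"),
                     "country": place.get("country")})
    return rows


# Hypothetical stand-in for the real JSONL file (normally: open("tweets.jsonl")).
sample = io.StringIO(
    '{"content": "a", "place": {"country": "Germany"}}\n'
    '{"content": "b", "place": null}\n'
)
rows = extract_rows(sample)

# Write the rows as CSV; a missing country becomes an empty field.
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=["content", "country"])
writer.writeheader()
writer.writerows(rows)
csv_text = out.getvalue()
print(csv_text)
```

Tweets without a place still produce a row, so the row count matches the number of tweets; they could be filtered out before plotting if only located tweets are wanted.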

Is it possible to read it directly like this: df = df[["mentioneduser","place","hashtags"]["country"]] ? — someone u don't know, May 28 '21 at 07:06
But right now I get the error 'KeyError: 'country'' when I use your method `test_country_df=test_df[["content","place"]] test_country_df["country"]=test_country_df["place"]["country"] print(test_country_df[["content","country"]])` — someone u don't know, May 28 '21 at 07:23
Apologies, I've fixed the bug; it should work fine now. `mentioneduser, place and hashtags` are imported as columns, whereas `country` is imported as a row under the `place` column, so you can't do what you suggested in the first comment. Do you have an idea of what you want your data structure to look like? Then perhaps I can advise you on how to read your JSON file. If you do not need a table structure like the one you are getting from pandas, json.loads() is frequently used. — Goh Jia Yi, May 28 '21 at 07:35
By the way, I'm still getting the same error on the same line with your fix `test_country_df = test_df[["content","place"]] country = test_country_df["place"].loc["country"] print(test_country_df[["content","country"]])` — someone u don't know, May 28 '21 at 08:08
I'm trying to process data scraped from Twitter into JSONL (the data structure: content and country) and pass it into a CSV file to generate a bar chart. — someone u don't know, May 28 '21 at 08:14
I realised that you had the typo mentioneduser instead of mentionedUsers, so I've changed that in my answer as well. I don't see a need to use pandas to import the JSON file; there are far too many empty fields in the imported table. I've updated my answer, and you should consider using the json library. — Goh Jia Yi, May 28 '21 at 09:49
Is there any way to DM you, e.g. on Reddit or Discord, so I can discuss my situation with you more easily? — someone u don't know, May 28 '21 at 09:57
I have answered the main question posed: how to get the country value out. I would recommend posting a more specific question to discuss how to extract the information you need into CSV. — Goh Jia Yi, May 28 '21 at 11:15
