JSON file with different array lengths in Python

Question

I want to explore the population data freely available online at https://www.nomisweb.co.uk/api/v01/dataset/NM_31_1.jsonstat.json . It contains population details for the UK from 1981 to 2017. The code I have used so far is below:

import requests
import json
import pandas
json_url = 'https://www.nomisweb.co.uk/api/v01/dataset/NM_31_1.jsonstat.json'

# download the data
j = requests.get(url=json_url) 

# load the json
content = json.loads(j.content)

list(content.keys())

The last line of code above gives me the below output:

 ['version',
'class',
'label',
'source',
'updated',
'value',
'id',
'size',
'role',
'dimension',
'extension']

I then tried to look at the lengths of 'value', 'size' and 'role':

print(len(content['value']))
print(len(content['size']))
print(len(content['role']))

And I got the results as below:

22200
5
3

As we can see, the lengths are very different. I cannot convert it into a dataframe because they are all different lengths. How can I change this into a meaningful format so that I can start exploring it? I am required to do the analysis below:

1. A table showing the male, female and total population in columns, per UK region in rows, as well as the UK total, for the most recent year

2. Exploratory data analysis to show how the population progressed by regions and age groups

Serge Ballesta · Answer 1 · 2019-02-27T14:53:13.803

You should first read everything in the JSON file except value, because the other fields explain what the value field contains. It is a flattened multidimensional array with dimensions content['size'], that is 37x4x3x25x2 (37 * 4 * 3 * 25 * 2 = 22200, which matches the length of value), and each dimension is described in content['dimension']. The first dimension is time, with 37 years from 1981 to 2017, then geography with Wales, Scotland, Northern Ireland and England_and_Wales. Next comes sex with Male, Female and Total, followed by age with 25 classes. The last dimension holds the measures: the first is the total number of persons, and the second is the corresponding percentage.
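To make the flattening concrete, here is a minimal sketch of how a position along each dimension maps to an offset in the flat value list. It assumes row-major order (last dimension varies fastest), which is what the JSON-stat format specifies; flat_index is a hypothetical helper name, not part of any library.

```python
def flat_index(position, size):
    """Row-major offset of a cell, given its position along each dimension."""
    offset = 0
    for pos, dim in zip(position, size):
        if not 0 <= pos < dim:
            raise IndexError(f"position {pos} out of range for dimension of size {dim}")
        offset = offset * dim + pos
    return offset


# Sizes quoted above: time, geography, sex, age, measures
size = [37, 4, 3, 25, 2]

first_cell_second_year = flat_index([1, 0, 0, 0, 0], size)
last_cell = flat_index([36, 3, 2, 24, 1], size)
```

With these sizes, moving one step along the time dimension skips 4 * 3 * 25 * 2 = 600 values, and the last cell sits at offset 22199, the final element of the 22200-long list.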

Long story short, only content['value'] will be used to feed the dataframe, but you first need to understand how.

Because of the 5 dimensions, it is probably better to load the values into a NumPy array first...
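A rough sketch of that idea follows, run on a tiny made-up dataset rather than the real download. It assumes the file follows the JSON-stat 2.0 layout (id, size, dimension[...]['category'] with index and optional label) and that value is a dense list; jsonstat_to_frame and category_ids are hypothetical helper names, and the toy numbers are invented.

```python
import itertools

import numpy as np
import pandas as pd


def category_ids(dimension):
    """Return the category ids of one dimension in positional order."""
    category = dimension["category"]
    index = category.get("index")
    if index is None:
        # A single-category dimension may carry only "label".
        return list(category["label"])
    if isinstance(index, dict):
        # Object form maps category id -> position.
        return sorted(index, key=index.get)
    return list(index)


def jsonstat_to_frame(content):
    """Flatten a dataset with a dense "value" list into a long DataFrame."""
    dim_ids = content["id"]
    labels = []
    for dim_id in dim_ids:
        dimension = content["dimension"][dim_id]
        names = dimension["category"].get("label", {})
        labels.append([names.get(c, c) for c in category_ids(dimension)])
    # Row-major order: the last dimension varies fastest, which is the
    # order in which itertools.product yields its combinations.
    frame = pd.DataFrame(list(itertools.product(*labels)), columns=dim_ids)
    frame["value"] = np.array(content["value"], dtype=float)  # null -> NaN
    return frame


# Invented toy dataset with the same overall shape as a JSON-stat file
toy = {
    "id": ["time", "sex", "measures"],
    "size": [2, 3, 1],
    "dimension": {
        "time": {"category": {"index": ["2016", "2017"]}},
        "sex": {"category": {"index": {"M": 0, "F": 1, "T": 2},
                             "label": {"M": "Male", "F": "Female", "T": "Total"}}},
        "measures": {"category": {"label": {"20100": "value"}}},
    },
    "value": [10, 11, 21, 12, None, 25],
}

cube = np.array(toy["value"], dtype=float).reshape(toy["size"])
frame = jsonstat_to_frame(toy)
```

The cube is handy for slicing by position, while the long frame has one row per cell and can be filtered and pivoted with the usual pandas tools.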

score 0 · Answer 2 · answered Feb 27 '19 at 11:45

The data is a complex JSON file and, as you stated correctly, the data frame columns need to be of equal length. What that really means is that you need to understand how the records are stored inside the dataset.

I would advise you to use a JSON viewer or prettifier to inspect the file first and understand its structure.

Only then will you be able to see which data you need to load into the DataFrame. For example, there is obviously no need to load the 'version' and 'class' values into the DataFrame, as they are not part of any record but are metadata about the dataset itself.
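If no viewer is at hand, the same inspection can be done from Python. The sketch below is only one possible approach: summarize is a hypothetical helper that truncates long lists so the overall structure stays readable, and the sample dictionary is invented for illustration.

```python
import json


def summarize(obj, max_items=3):
    """Return a compact description of a decoded JSON value."""
    if isinstance(obj, dict):
        return {key: summarize(value, max_items) for key, value in obj.items()}
    if isinstance(obj, list):
        preview = [summarize(item, max_items) for item in obj[:max_items]]
        return {"list_len": len(obj), "first_items": preview}
    return obj


# Invented sample; with the real data, pass the decoded `content` instead
sample = {"version": "2.0", "value": [1, 2, 3, 4, 5], "size": [5]}
summary = summarize(sample)
print(json.dumps(summary, indent=2))
```

Long arrays such as 'value' collapse to their length plus a short preview, which makes the metadata fields easier to spot.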

score 0 · Answer 3 · edited Apr 22 '19 at 01:44


This is the JSON-stat format. See https://json-stat.org. You can use the Python libraries pyjstat or json.stat.py to load the data into a pandas DataFrame.

You can also explore this dataset using the JSON-stat explorer.

janbrus · answered Apr 21 '19 at 20:04 · edited by Khalid Ali Apr 22 '19 at 01:44
