How Do I Start Pulling Apart This Block of JSON Data?

Question

I'd like to make a program that makes offline copies of math questions from Khan Academy. I have a huge 21.6MB text file that contains data on all of their exercises, but I have no idea how to start analyzing it, much less start pulling the questions from it.

Here is a pastebin containing a sample of the JSON data. If you want to see all of it, you can find it here. Warning for long load time.

I've never used JSON before, but I wrote up a quick Python script to try to load up individual "sub-blocks" (or equivalent, correct term) of data.

import sys
import json

exercises = open("exercises.txt", "r+b")
byte = 0
frontbracket = 0
backbracket = 0
while byte < 1000: #while byte < character we want to read up to
                   #keep at 1000 for testing purposes
    char = exercises.read(1)
    sys.stdout.write(char)
    #Here we decide what to do based on what char we have
    if str(char) == "{":
        frontbracket = byte
        while True:
            char = exercises.read(1)
            if str(char)=="}":
                backbracket=byte
                break
        exercises.seek(frontbracket)
        block = exercises.read(backbracket-frontbracket)
        print "Block is " + str(backbracket-frontbracket) + " bytes long"
        jsonblock = json.loads(block)
        sys.stdout.write(block)
        print jsonblock["translated_display_name"]
        print "\nENDBLOCK\n"


    byte = byte + 1

You're not taking nested brackets into consideration... you can use a stack to keep track of that. — dfranca, Aug 16 '16 at 15:19
You can use something like [jsonlint](http://jsonlint.com/) to copy/paste portions of the JSON to at least get it in a readable format. I think I've found a repeated pattern in there but do you know what data you're actually looking for? Once you load it into `json` then you can start treating it just as nested lists and dictionaries. — roganjosh, Aug 16 '16 at 15:20
This appears to be your repeating pattern: http://pastebin.com/4nSnLEFZ — roganjosh, Aug 16 '16 at 15:25
@danielfranca thank you for the advice. Is an array that passes starting bracket positions first-in-last-out to the block = exercises.read... line of code efficient? — Xeneficus, Aug 16 '16 at 15:29
@roganjosh thank you for that amazing resource! I don't actually know if the data I want is directly in there, but I suspect I can at least build a table listing the exercises and links to their individual pages from it. — Xeneficus, Aug 16 '16 at 15:29
@Xeneficus you're welcome :) Even if your copy/paste is not valid json (i.e. you didn't get a complete string) it will still attempt to format it for you. The pastebin I linked is only the repeated pattern from the "small" portion that you posted. I don't know whether the API returns something later on that's in a completely different kind of structure. Is there no documentation to help you with this? — roganjosh, Aug 16 '16 at 15:32
Another comment: you only increment byte after the inner loop, which won't work, I assume. Use something to keep track of open/close brackets (a stack or a simple integer variable). I don't think you need to re-read the file; you can just concatenate the content you're reading. — dfranca, Aug 16 '16 at 15:35
@roganjosh The webpage I linked is supposed to work with their API, but I've been having trouble getting it to work, so I just took the raw data. Their documentation only really shows what calls are available. — Xeneficus, Aug 16 '16 at 15:37
@roganjosh no, I got so frustrated with their API I just went in directly and copied the data it was supposed to get. — Xeneficus, Aug 16 '16 at 15:39
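
The bracket-tracking idea from the comments above can be sketched roughly as follows. This is a minimal illustration written for Python 3, not the asker's script: the function name `iter_top_level_objects` and the sample string are made up for the example. It uses a simple integer depth counter rather than a stack, and it skips over braces that appear inside quoted strings, which a naive scan would miscount.

```python
import json

def iter_top_level_objects(text):
    """Yield each outermost {...} block in text, tracking nesting depth."""
    depth = 0          # number of '{' currently open
    start = None       # index of the '{' that opened the current outermost block
    in_string = False  # True while inside a JSON string literal
    escaped = False    # True right after a backslash inside a string
    for i, ch in enumerate(text):
        if in_string:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
            continue
        if ch == '"':
            in_string = True
        elif ch == "{":
            if depth == 0:
                start = i
            depth += 1
        elif ch == "}" and depth > 0:
            depth -= 1
            if depth == 0:
                yield text[start:i + 1]

# Hypothetical sample: nested objects plus braces hidden inside a string value.
sample = '[{"name": "a", "tags": {"x": 1}}, {"name": "b } {", "tags": {}}]'
blocks = [json.loads(b) for b in iter_top_level_objects(sample)]
print([b["name"] for b in blocks])
```

For a file that is already complete, valid JSON, calling `json.load` on the whole file avoids the manual scanning altogether; a scanner like this is mainly useful when only a fragment of the data is available.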

roganjosh · Accepted Answer · 2016-08-16T16:15:27.120

Ok, the repeated pattern appears to be this: http://pastebin.com/4nSnLEFZ

To get an idea of the structure of the response, you can use JSONlint to copy/paste portions of your string and 'validate'. Even if the portion you copied is not valid, it will still format it into something you can actually read.
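
If you would rather inspect the structure locally, something along these lines should do a similar job with the standard `json` module. It is only a sketch, written for Python 3: the `raw` string is a made-up, heavily shortened record, and unlike JSONlint this approach needs the text to be valid JSON before it can format anything.

```python
import json

# Hypothetical, shortened record; the real entries carry many more fields.
raw = (
    '[{"translated_display_name": "Example exercise", '
    '"translated_problem_types": [{"items": [{"sha": "abc"}]}]}]'
)

data = json.loads(raw)  # raises ValueError if raw is not valid JSON
first = data[0]

# Indented dump of one record to make the nesting visible.
pretty = json.dumps(first, indent=2, sort_keys=True)
print(pretty)

# The keys of one record give a quick map of what is available.
print(sorted(first.keys()))
```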

First, I have used the requests library to pull the JSON for you. It's a super-simple library for things like this. The API is slow to respond because it seems you're pulling everything, but it should work fine.

Once you get a response from the API, you can convert it directly to Python objects using .json(). What you have is essentially a mixture of nested lists and dictionaries that you can iterate through to pull out specific details. In my example below, my_list2 has to use a try/except structure because it seems that some of the entries do not have two items in the list under translated_problem_types. In that case, indexing the second item raises an IndexError and the code just stores 'None' instead. You might have to use trial and error for such things.

Finally, since you haven't used JSON before, it's also worth noting that a JSON object behaves like a dictionary: you are not guaranteed the order in which its keys arrive, so look values up by key rather than by position. In this case the outermost structure is a list, so in theory the entries may come back in a consistent order, but don't rely on it; we don't know how the list is constructed.

import requests

api_call = requests.get('https://www.khanacademy.org/api/v1/exercises')
json_response = api_call.json()

# Assume we first want to list "author name" with "author key".
# This loops through the repeated pattern in the pastebin
# and accesses each item as a dictionary.
my_list1 = []

for item in json_response:
    my_list1.append([item['author_name'], item['author_key']])

print(my_list1[0:5])

# Now let's assume we want the 'sha' of the SECOND entry in translated_problem_types
# to also be listed with author name

my_list2 = []

for item in json_response:
    try:
        the_second_entry = item['translated_problem_types'][0]['items'][1]['sha']
    except IndexError:
        the_second_entry = 'None'

    my_list2.append([item['author_name'], item['author_key'], the_second_entry])
print(my_list2[0:5])
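
If you already have the response saved to disk, as in the question, you can probably skip the network call and parse the file directly. The following is a rough sketch for Python 3 under a few assumptions: `load_exercises` and `build_table` are names invented for this example, the key `relative_url` is a guess at the field holding each exercise's link, and the two records are made-up stand-ins written to a temporary file in place of the real data.

```python
import json
import os
import tempfile

def load_exercises(path):
    """Parse a saved copy of the API response into Python lists and dicts."""
    with open(path, "r", encoding="utf-8") as handle:
        return json.load(handle)

def build_table(exercises):
    """Return (display name, relative URL) pairs; None where a key is absent."""
    return [
        (item.get("translated_display_name"), item.get("relative_url"))
        for item in exercises
    ]

# Stand-in for the real file: two made-up records written to a temp file.
sample = [
    {"translated_display_name": "Example A", "relative_url": "/exercise/example-a"},
    {"translated_display_name": "Example B"},
]
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    json.dump(sample, tmp)

table = build_table(load_exercises(tmp.name))
os.remove(tmp.name)
print(table)
```

Using `dict.get` here means a record without the expected key yields `None` rather than stopping the loop, which seems a reasonable default while you are still exploring the data.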

Thank you very much! Now I'm off to figure out how to pull the question text using this as a table. I'm thinking web scraper directed to the "relative-url" item. Thank you again! — Xeneficus, Aug 16 '16 at 16:15
@Xeneficus very welcome :) The API response is cumbersome to dig through in the way that I have. You should probably try understanding a bit more about the API itself to target your request and only get the info you're interested in. This will a) improve response time and b) stop ridiculous things like `item['translated_problem_types'][0]['items'][1]['sha']` for a single value. Best of luck :) — roganjosh, Aug 16 '16 at 16:18
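
Until the request itself can be narrowed down, one way to make long lookups like that less fragile is a small helper that walks a path of keys and indexes and falls back to a default when any step is missing. This is only a sketch for Python 3; `dig` is a hypothetical helper, not part of requests or the json module, and the sample record is made up.

```python
def dig(obj, *path, default=None):
    """Follow keys/indexes in path through nested dicts and lists.

    Returns default as soon as any step is missing or not indexable.
    """
    current = obj
    for step in path:
        try:
            current = current[step]
        except (KeyError, IndexError, TypeError):
            return default
    return current

# Made-up record with only one entry under 'items'.
item = {"translated_problem_types": [{"items": [{"sha": "first"}]}]}

print(dig(item, "translated_problem_types", 0, "items", 0, "sha"))
print(dig(item, "translated_problem_types", 0, "items", 1, "sha"))
```

The second call returns the default instead of raising, so it covers both a missing list position and a missing key without a separate try/except at each call site.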
