0

I am trying to extract data from a REST API using python and put it into one neat JSON file, and having difficulty. The date is rather lengthy, with a total of nearly 4,000 records, but the max record allowed by the API is 100.

I've tried using some other examples to get through the code, and so far this is what I'm using (censoring the API URL and auth key, for the sake of confidentiality):

import requests
import json
from requests.structures import CaseInsensitiveDict

url = "https://api.airtable.com/v0/CENSORED/Vendors?maxRecords=100"

headers = CaseInsensitiveDict()
headers["Authorization"] = "Bearer CENSORED"

resp = requests.get(url, headers=headers)

resp.content.decode("utf-8")

vendors = []
new_results = True
page = 1
while new_results:
    centiblock = requests.get(url + f"&page={page}", headers=headers).json()
    new_results = centiblock.get("results", [])
    vendors.extend(centiblock)
    page += 1

full_directory = json.dumps(vendors, indent=4)

print(full_directory)

For the life of me, I cannot figure out why it isn't working. The output keeps coming out as just:

[
    "records"
]

If I play around with the print statement at the end, I can get it to print centiblock (so named for being a block of 100 records at a time) just fine - it gives me 100 records in un-formated text. However, if I try printing vendors at the end, the output is:

['records']

...which leads me to guess that somehow, the vendors array is not getting filled with the data. I suspect that I need to modify the get request where I define new_results, but I'm not sure how.

For reference, this is a censored look at how the json data begins, when I format and print out one centiblock:

{
    "records": [
        {
            "id": "XXX",
            "createdTime": "2018-10-15T19:23:59.000Z",
            "fields": {
                "Vendor Name": "XXX",
                "Main Phone": "XXX",
                "Street": "XXX",

Can anyone see where I'm going wrong? Thanks in advance!

Lif
  • 5
  • 2

1 Answers1

0

When you are extending vendors with centiblock, your are giving a dict to the extend function. extend is expecting an Iterable, so that works, but when you iterate over a python dict, you only iterate over the keys of the dict. In this case, ['records'].

Note as well, that your loop condition becomes False after the first iteration, because centiblock.get("results", []) returns [], since "results" is not a key of the output of the API. and [] has a truthiness value of False.

Hence to correct those errors you need to get the correct field from the API into new_results, and extend vendors with new_results, which is itself an array. Note that on the last iteration, new_results will be the empty list, which means vendors won't be extended with any null value, and will contain exactly what you need:

This should look like:

import requests
import json
from requests.structures import CaseInsensitiveDict

url = "https://api.airtable.com/v0/CENSORED/Vendors?maxRecords=100"

headers = CaseInsensitiveDict()
headers["Authorization"] = "Bearer CENSORED"

resp = requests.get(url, headers=headers)

resp.content.decode("utf-8")

vendors = []
new_results = True
page = 1
while len(new_results) > 0:
    centiblock = requests.get(url + f"&page={page}", headers=headers).json()
    new_results = centiblock.get("records", [])
    vendors.extend(new_results)
    page += 1

full_directory = json.dumps(vendors, indent=4)

print(full_directory)

Note that I replaced the while new_results with a while len(new_results)>0 which is equivalent in this case, but more readable, and better practice in general.

Florent Monin
  • 1,278
  • 1
  • 3
  • 16
  • Hey, thank you - it's definitely getting somewhere! However, the new error message is: `TypeError: object of type 'bool' has no len()` I can fairly easily guess at that - a boolean doesn't have a length, so it doesn't know what to make of new_results equaling "True". Should I replace the boolean with a random array with a couple numbers in it? When I tried that, it started endlessly processing, and I'm not sure if that's an error, or if it just needs time to sift through our nearly 4,000 records. – Lif Oct 06 '22 at 22:52
  • Oh right, I did forget about the initialisation! Yes, absolutely, set `new_results = [""]` or anything (it shouldn't matter since the value should be overwritten before it is read again). – Florent Monin Oct 06 '22 at 23:13
  • It is highly possible that the processing is taking a lot of time. If you want to check that the processing is actually happening, you can print the page number for instance, or the length of `vendors` at each iteration to follow the progress being made – Florent Monin Oct 06 '22 at 23:14
  • Hmm... So I let it run for an hour, and then decided to interrupt it and see what was going on. I changed the loop to `while page < 32:` because there should not be more than 32 pages (our records are slightly less than I thought, at just over 3,000), and decided to give that a shot. To my dismay, it seems that all it's doing is repeating the same 100 entries, over and over, and thus was caught in an infinite loop. Somehow, pagination isn't working. I greatly appreciate your help, but if you can see where I went wrong this time, that would ALSO be greatly appreciated! – Lif Oct 07 '22 at 00:15
  • Regarding the Airtable pagination, [this question](https://stackoverflow.com/q/62096868/20121320) can probably help you. long story short, the api requires you to manually do the pagination, so you need to get the `offset` parameter from your previous query, and add it to the next, and you also need to remove the `maxRecords=100` parameters, which doesn't set the page size, but only tells the api to send 100 rows max – Florent Monin Oct 07 '22 at 00:51
  • You’re very welcome! Glad to have helped you!If this answer solved your problem, don’t hesitate to mark it as accepted :) – Florent Monin Oct 10 '22 at 20:27