
So, I'm scratching my head with this one. Using HubSpot's API, I need to get a list of ALL the companies in my client's "portal" (account). Sadly, the standard API call only returns 100 companies at a time. Each response, however, includes two parameters which make paging through the results possible.

One of those is "has-more": true (this tells you whether you can expect any more pages) and the other is "offset": 12345678 (the timestamp to offset the next request by).
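For reference, a single page of the response has roughly this shape (the values here are illustrative, not real data from my portal):

```python
import json

# Illustrative shape of one paginated response page (values made up):
sample_page = json.loads("""
{
    "companies": [
        {"companyId": 123, "properties": {"name": {"value": "Acme Ltd"}}}
    ],
    "has-more": true,
    "offset": 12345678
}
""")
```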

You pass these two parameters back into the next API call to get the next page. So, for example, the initial API call might look like:

"https://api.hubapi.com/companies/v2/companies/?hapikey={hapikey}".format(hapikey=wta_hubspot_api_key)

Whereas the follow up calls might look like:

"https://api.hubapi.com/companies/v2/companies/?hapikey={hapikey}&offset={offset}".format(hapikey=wta_hubspot_api_key, offset=offset)

So this is what I've tried so far:

#!/usr/bin/python
# -*- coding: utf-8 -*-

import sys
import os.path
import requests
import json
import csv
import glob2
import shutil
import time
import time as howLong
from time import sleep
from time import gmtime, strftime

HubSpot_Customer_Portal_ID = "XXXXXX"

wta_hubspot_api_key = "XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX"

findCSV = glob2.glob('*contact*.csv')

theDate = strftime("%Y-%m-%d", gmtime())
theTime = strftime("%H:%M:%S", gmtime())

try:
    testData = findCSV[0]
except IndexError:
    print("\nSynchronisation attempted on {date} at {time}: There are no \"contact\" CSVs, please upload one and try again.\n".format(date=theDate, time=theTime))
    print("====================================================================================================================\n")
    sys.exit()

for theCSV in findCSV:

    def get_companies():
        create_get_recent_companies_call = "https://api.hubapi.com/companies/v2/companies/?hapikey={hapikey}".format(hapikey=wta_hubspot_api_key)
        headers = {'content-type': 'application/json'}
        create_get_recent_companies_response = requests.get(create_get_recent_companies_call, headers=headers)
        if create_get_recent_companies_response.status_code == 200:

            offset = create_get_recent_companies_response.json()[u'offset']
            hasMore = create_get_recent_companies_response.json()[u'has-more']

            while hasMore == True:
                for i in create_get_recent_companies_response.json()[u'companies']:
                    get_more_companies_call = "https://api.hubapi.com/companies/v2/companies/?hapikey={hapikey}&offset={offset}".format(hapikey=wta_hubspot_api_key, offset=offset)
                    get_more_companies_call_response = requests.get(get_more_companies_call, headers=headers)
                    companyName = i[u'properties'][u'name'][u'value']
                    print("{companyName}".format(companyName=companyName))


        else:
            print("Something went wrong, check the supplied field values.\n")
            print(json.dumps(create_get_recent_companies_response.json(), sort_keys=True, indent=4))

    if __name__ == "__main__":
        get_companies()
        sys.exit()

The problem is that it just keeps returning the same initial 100 results; this is happening because "has-more": true is set on the initial call, so it'll just keep returning the same ones...

My ideal scenario is that I'm able to parse ALL the companies across approximately 120 response pages (there are around 12,000 companies). As I pass through each page, I'd like to append its JSON content to a list, so that eventually I have a list containing the JSON responses of all 120 pages, which I can then parse for use in a different function.
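To sketch the shape I'm after — collect every page by following "has-more"/"offset" — something like this, where fetch_page is a stand-in for the real requests.get call (none of the names below come from the HubSpot client itself):

```python
def fetch_all_pages(fetch_page):
    """Collect every response page into a list by following has-more/offset.

    fetch_page(offset) must return a dict with 'companies', 'has-more'
    and 'offset' keys; offset is None for the first call.
    """
    pages = []
    offset = None
    while True:
        page = fetch_page(offset)
        pages.append(page)
        if not page["has-more"]:
            break
        offset = page["offset"]
    return pages

# Stub standing in for the real API call, purely to exercise the flow:
def fake_fetch(offset):
    start = offset or 0
    nxt = start + 100
    return {"companies": list(range(start, nxt)),
            "has-more": nxt < 300,
            "offset": nxt}

pages = fetch_all_pages(fake_fetch)
```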

I am in desperate need of a solution :(

This is the function I am replacing in my main script:

            def get_companies():

                create_get_recent_companies_call = "https://api.hubapi.com/companies/v2/companies/recent/modified?hapikey={hapikey}".format(hapikey=wta_hubspot_api_key)
                headers = {'content-type': 'application/json'}
                create_get_recent_companies_response = requests.get(create_get_recent_companies_call, headers=headers)
                if create_get_recent_companies_response.status_code == 200:

                    for i in create_get_recent_companies_response.json()[u'results']:
                        company_name = i[u'properties'][u'name'][u'value']
                        #print(company_name)
                        if row[0].lower() == str(company_name).lower():
                            contact_company_id = i[u'companyId']
                            #print(contact_company_id)
                            return contact_company_id
                else:
                    print("Something went wrong, check the supplied field values.\n")
                    #print(json.dumps(create_get_recent_companies_response.json(), sort_keys=True, indent=4))
Marko

1 Answer


The problem seems to be that:

  • You get the offset in your first call, but don't do anything with the actual companies data that this call returns.
  • You then use this same offset in your while loop; you never use the new one from subsequent calls. This is why you get the same companies every time.

I think this code for get_companies() should work for you. I can't test it, obviously, but hopefully it is OK:

def get_companies():
    create_get_recent_companies_call = "https://api.hubapi.com/companies/v2/companies/?hapikey={hapikey}".format(hapikey=wta_hubspot_api_key)
    headers = {'content-type': 'application/json'}
    create_get_recent_companies_response = requests.get(create_get_recent_companies_call, headers=headers)
    if create_get_recent_companies_response.status_code == 200:

        while True:
            for i in create_get_recent_companies_response.json()[u'companies']:
                companyName = i[u'properties'][u'name'][u'value']
                print("{companyName}".format(companyName=companyName))
            offset = create_get_recent_companies_response.json()[u'offset']
            hasMore = create_get_recent_companies_response.json()[u'has-more']
            if not hasMore:
                break
            else:
                create_get_recent_companies_call = "https://api.hubapi.com/companies/v2/companies/?hapikey={hapikey}&offset={offset}".format(hapikey=wta_hubspot_api_key, offset=offset)
                create_get_recent_companies_response = requests.get(create_get_recent_companies_call, headers=headers)

    else:
        print("Something went wrong, check the supplied field values.\n")
        print(json.dumps(create_get_recent_companies_response.json(), sort_keys=True, indent=4))

Strictly, the else after the break isn't required, but it is in keeping with the Zen of Python: "Explicit is better than implicit".

Note that you are only checking for a 200 response code once; if something goes wrong inside your loop you will miss it. You should probably put all your calls inside the loop and check for a proper response every time.
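Sketching that idea — FakeResponse and get_page are illustrative names, with a stub in place of requests.get so the pattern can be shown without hitting the API — the per-call check might look like:

```python
class FakeResponse:
    # Stand-in for a requests.Response object, just to illustrate the check.
    def __init__(self, status_code, payload):
        self.status_code = status_code
        self._payload = payload

    def json(self):
        return self._payload


def get_page(fetch, url):
    # Fetch one page and fail loudly on a non-200 status, so an error
    # mid-pagination is not silently swallowed.
    response = fetch(url)
    if response.status_code != 200:
        raise RuntimeError(
            "Request failed with status {}".format(response.status_code))
    return response.json()


ok = get_page(lambda url: FakeResponse(200, {"has-more": False, "companies": []}),
              "https://api.hubapi.com/companies/v2/companies/")
```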

SiHa
  • Hi @SiHa, thanks for responding - unfortunately that also returned the same result, albeit returning the first 100 straight away instead of one-by-one (which is an improvement!) – Marko Mar 22 '16 at 10:43
  • @Marko Sorry, I'd missed the fact that you were using different names (`create_get_recent_companies...` vs `get_more_companies_call`) inside and outside your while loop. This meant that, in my first draft, although more results were being fetched in the loop, the *first* response was being iterated through each time. I've changed the names now so that they are the same. Hopefully it will work now. – SiHa Mar 22 '16 at 11:00
  • @SiHa you're an absolute legend - that totally worked. I have one more question. The script above is a "test script" - I was attempting to narrow down the functionality away from the main script. Back in the main script however, the function I was needing to replace is the one I've now added above... what do you think is the best way to collect the results of each page of results? I was going to try to append it to a list, or do you think I could just "return" it as was being done above originally? – Marko Mar 22 '16 at 11:12
  • To elaborate - the `get_companies()` function actually sits inside another function, and that function gets run for each row in a CSV. I guess ideally I'd probably be better off using the test script to pull down ALL of the companies into a local file, and then use that to extract the companyId for all the other operations, it would be much faster than extracting all 120 pages for ~9000 rows... – Marko Mar 22 '16 at 11:32
  • I agree, pulling all the data down once, then processing it locally is probably quicker. What structure you return from `get_companies()` would depend very much on what data you need; If it's only the names then a simple list would do, but if you need associated data as well, then probably a dictionary (or maybe just the raw JSON) would be best. You have what you need to proceed further now. If you get stuck again, it would be best to post another question. Happy coding! – SiHa Mar 22 '16 at 12:23
  • No worries @SiHa, I've managed to write certain portions of the returned data into a JSON document - now I just need to parse it and hopefully that will work! Thank you so much for your help, you've saved me many stressful hours! – Marko Mar 22 '16 at 13:05
  • Unsurprisingly @SiHa, I've run into another wall - I'd love to know what you think is happening here - http://stackoverflow.com/questions/36157696/unable-to-parse-json-file-keep-getting-valueerror-extra-data :( – Marko Mar 22 '16 at 14:53
  • Sorry - I'm no expert at JSON. – SiHa Mar 22 '16 at 14:56
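Following the comment thread's suggestion to pull all the data down once and process it locally, the lookup could be sketched like this (build_company_index is a hypothetical helper, not something from the thread or the HubSpot client):

```python
def build_company_index(pages):
    # Hypothetical helper: flatten the collected response pages into a
    # case-insensitive company-name -> companyId lookup, so each CSV row
    # can be matched without another API call.
    index = {}
    for page in pages:
        for company in page["companies"]:
            name = company["properties"]["name"]["value"]
            index[name.lower()] = company["companyId"]
    return index

# Example with one made-up page of results:
pages = [{"companies": [
    {"companyId": 42, "properties": {"name": {"value": "Acme Ltd"}}}
]}]
index = build_company_index(pages)
```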