I created the following script to download images from an API endpoint, and it works as intended. The problem is that it is rather slow, because all the requests have to wait on each other. What is the correct way to keep the steps sequential for each item I want to fetch, but run the items themselves in parallel? The API is from an online service called ServiceM8. What I hope to achieve is:

  • fetch all possible job ids => keep the name and other info
  • fetch the name of the customer
  • fetch each attachment of a job

These three steps should be done for each job, so the work could run in parallel per job, since the jobs do not have to wait on each other.

Update:

The problem I do not understand is how to bundle, for example, the three calls per item into one unit, since it is only per item that I can do things in parallel. So, for example, when I want to

  • fetch item (fetch name => fetch description => fetch id)

it is the fetch item calls that I want to run in parallel (a sketch of this pattern follows below).
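
One way to do this (a minimal sketch, not the original script: the fetch_job helper and the thread-pool size are my own illustration, and the date filtering from the script below is omitted) is to bundle the sequential calls for one job into a single function and then run that function for many jobs in parallel with a thread pool:

import requests
from concurrent.futures import ThreadPoolExecutor

user = "test@test.com"  # placeholder credentials
passw = "test"

def fetch_job(job):
    # the steps stay sequential *within* one job...
    company = requests.get(
        "https://api.servicem8.com/api_1.0/Company/{}.json".format(job['company_uuid']),
        auth=(user, passw)).json()
    attachments = requests.get(
        "https://api.servicem8.com/api_1.0/Attachment.json?%24filter=related_object_uuid%20eq%20{}".format(job['uuid']),
        auth=(user, passw)).json()
    return job['uuid'], company['name'], attachments

jobs = requests.get("https://api.servicem8.com/api_1.0/job.json",
                    auth=(user, passw)).json()

# ...but the jobs themselves are processed in parallel
with ThreadPoolExecutor(max_workers=5) as pool:
    results = list(pool.map(fetch_job, jobs))

pool.map keeps the results in the same order as the input jobs, so each result can still be matched back to the job it belongs to.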

The current code I have is working but rather slow:

import requests
import dateutil.parser
import shutil
import os

user = "test@test.com"
passw = "test"

print("Read json")
url = "https://api.servicem8.com/api_1.0/job.json"
r = requests.get(url, auth=(user, passw))

print("finished reading jobs.json file")
scheduled_jobs = []
if r.status_code == 200:
    for item in r.json():
        scheduled_date = item['job_is_scheduled_until_stamp']
        try:
            parsed_date = dateutil.parser.parse(scheduled_date)
            if parsed_date.year == 2016:
                if parsed_date.month == 10:
                    if parsed_date.day == 10:
                        url_customer = "https://api.servicem8.com/api_1.0/Company/{}.json".format(item[
                                                                                                  'company_uuid'])
                        c = requests.get(url_customer, auth=(user, passw))
                        cus_name = c.json()['name']
                        scheduled_jobs.append(
                            [item['uuid'], item['generated_job_id'], cus_name])

        except ValueError:
            pass

    for job in scheduled_jobs:
        print("fetch for job {}".format(job))
        url = "https://api.servicem8.com/api_1.0/Attachment.json?%24filter=related_object_uuid%20eq%20{}".format(job[
                                                                                                                 0])
        r = requests.get(url, auth=(user, passw))
        if not r.json():
            continue
        for attachment in r.json():
            if attachment['active'] == 1 and attachment['file_type'] != '.pdf':
                print("fetch for attachment {}".format(attachment))
                url_staff = "https://api.servicem8.com/api_1.0/Staff.json?%24filter=uuid%20eq%20{}".format(
                    attachment['created_by_staff_uuid'])
                s = requests.get(url_staff, auth=(user, passw))
                for staff in s.json():
                    tech = "{}_{}".format(staff['first'], staff['last'])

                url = "https://api.servicem8.com/api_1.0/Attachment/{}.file".format(attachment[
                                                                                    'uuid'])
                r = requests.get(url, auth=(user, passw), stream=True)
                if r.status_code == 200:
                    creation_date = dateutil.parser.parse(
                        attachment['timestamp']).strftime("%d.%m.%y")
                    if not os.path.exists(os.getcwd() + "/{}/{}".format(job[2], job[1])):
                        os.makedirs(os.getcwd() + "/{}/{}".format(job[2], job[1]))
                    path = os.getcwd() + "/{}/{}/SC -O {} {}{}".format(
                        job[2], job[1], creation_date, tech.upper(), attachment['file_type'])
                    print("writing file to path {}".format(path))
                    with open(path, 'wb') as f:
                        r.raw.decode_content = True
                        shutil.copyfileobj(r.raw, f)
else:
    print(r.text)

Update [14/10]: I updated the code in the following way with some of the hints given. Thanks a lot for that. The only thing left to optimize, I guess, is the attachment downloading (see the sketch after the updated code), but it is working fine now. A funny thing I learned is that you cannot create a folder named CON on a Windows machine :-) I did not know that.

I use pandas as well, just to try to avoid some loops over my list of dicts, but I am not sure it is already the most performant approach. The longest part is actually reading in the full JSON files. I read them in completely because I could not find a way of telling the API to return only the jobs from September 2016. The API query function seems to work on eq/lt/gt.
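
As a side note, since the filter already supports eq/lt/gt, the jobs might be narrowed down server-side instead of reading the full file. This is only a sketch: whether two conditions can be combined with "and", and which date literal format the ServiceM8 API expects, are assumptions I have not verified.

import requests

user = "test@test.com"  # placeholder credentials
passw = "test"

# hypothetical date-range filter: only jobs scheduled within September 2016
flt = ("job_is_scheduled_until_stamp gt '2016-08-31 23:59:59' "
       "and job_is_scheduled_until_stamp lt '2016-10-01 00:00:00'")
r = requests.get("https://api.servicem8.com/api_1.0/job.json",
                 params={"$filter": flt}, auth=(user, passw))
if r.status_code == 200:
    september_jobs = r.json()

Here is the updated code: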

import requests
import dateutil.parser
import shutil
import os
import pandas as pd

user = ""
passw = ""

FOLDER = os.getcwd()
headers = {"Accept-Encoding": "gzip, deflate"}

import grequests
urls = [
    'https://api.servicem8.com/api_1.0/job.json',
    'https://api.servicem8.com/api_1.0/Attachment.json',
    'https://api.servicem8.com/api_1.0/Staff.json',
    'https://api.servicem8.com/api_1.0/Company.json'
]

#Create a set of unsent Requests:

print("Read json files")
rs = (grequests.get(u, auth=(user, passw), headers=headers) for u in urls)
#Send them all at the same time:
jobs,attachments,staffs,companies = grequests.map(rs)

#create dataframes
df_jobs = pd.DataFrame(jobs.json())
df_attachments = pd.DataFrame(attachments.json())
df_staffs = pd.DataFrame(staffs.json())
df_companies = pd.DataFrame(companies.json())

#url_customer = "https://api.servicem8.com/api_1.0/Company/{}.json".format(item['company_uuid'])
#c = requests.get(url_customer, auth=(user, passw))

#url = "https://api.servicem8.com/api_1.0/job.json"
#jobs = requests.get(url, auth=(user, passw), headers=headers)


#print("Reading attachments json")
#url = "https://api.servicem8.com/api_1.0/Attachment.json"
#attachments = requests.get(url, auth=(user, passw), headers=headers)

#print("Reading staff.json")
#url_staff = "https://api.servicem8.com/api_1.0/Staff.json"
#staffs = requests.get(url_staff, auth=(user, passw))

scheduled_jobs = []

if jobs.status_code == 200:
    print("finished reading json file")
    for job in jobs.json():
        scheduled_date = job['job_is_scheduled_until_stamp']
        try:
            parsed_date = dateutil.parser.parse(scheduled_date)
            if parsed_date.year == 2016:
                if parsed_date.month == 9:
                    cus_name = df_companies[df_companies.uuid == job['company_uuid']].iloc[0]['name'].upper()
                    cus_name = cus_name.replace('/', '')
                    scheduled_jobs.append([job['uuid'], job['generated_job_id'], cus_name])

        except ValueError:
            pass
    print("{} jobs to fetch".format(len(scheduled_jobs)))

    for job in scheduled_jobs:
        print("fetch for job attachments {}".format(job))
        #url = "https://api.servicem8.com/api_1.0/Attachment.json?%24filter=related_object_uuid%20eq%20{}".format(job[0])

        # the full attachment list was fetched once up front; filter per job below
        for attachment in attachments.json():
            if attachment['related_object_uuid'] == job[0]:
                if attachment['active'] == 1 and attachment['file_type'] != '.pdf' and attachment['attachment_source'] != 'INVOICE_SIGNOFF':
                    for staff in staffs.json():
                        if staff['uuid'] == attachment['created_by_staff_uuid']:
                            tech = "{}_{}".format(
                                staff['first'].split()[-1].strip(), staff['last'])

                    creation_timestamp = dateutil.parser.parse(
                        attachment['timestamp'])
                    creation_date = creation_timestamp.strftime("%d.%m.%y")
                    creation_time = creation_timestamp.strftime("%H_%M_%S")

                    path = FOLDER + "/{}/{}/SC_-O_D{}_T{}_{}{}".format(
                        job[2], job[1], creation_date, creation_time, tech.upper(), attachment['file_type'])

                    # fetch attachment

                    if not os.path.isfile(path):
                        url = "https://api.servicem8.com/api_1.0/Attachment/{}.file".format(attachment[
                                                                                            'uuid'])
                        r = requests.get(url, auth=(user, passw), stream=True)
                        if r.status_code == 200:
                            if not os.path.exists(FOLDER + "/{}/{}".format(job[2], job[1])):
                                os.makedirs(
                                    FOLDER + "/{}/{}".format(job[2], job[1]))

                            print("writing file to path {}".format(path))
                            with open(path, 'wb') as f:
                                r.raw.decode_content = True
                                shutil.copyfileobj(r.raw, f)
                    else:
                        print("file already exists")
else:
    print(jobs.text)
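
For the attachment downloads themselves, a minimal sketch of the same thread-pool idea: download_tasks is a hypothetical list of (path, attachment_uuid) pairs collected in the loop above (with the target folders already created), and user/passw are the credentials from the script. The worker count is kept small because of the rate limit mentioned in the comments below.

import shutil
import requests
from concurrent.futures import ThreadPoolExecutor

def download_attachment(task):
    path, attachment_uuid = task
    url = "https://api.servicem8.com/api_1.0/Attachment/{}.file".format(attachment_uuid)
    r = requests.get(url, auth=(user, passw), stream=True)
    if r.status_code == 200:
        with open(path, 'wb') as f:
            r.raw.decode_content = True
            shutil.copyfileobj(r.raw, f)

# download_tasks: list of (path, attachment_uuid) pairs built in the loop above
with ThreadPoolExecutor(max_workers=4) as pool:
    list(pool.map(download_attachment, download_tasks))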
Koen
  • Careful with selecting a method to do this, as the ServiceM8 API is rate limited, and too many simultaneous requests results in a "HTTP/1.1 429 Too Many Requests" However, what you could do is progressively resolve the attachment links, but instead of downloading them as you go; build a url file out of them. From the file you could use a number of methods to download them simultaneously. Where you have this line: `r = requests.get(url, auth=(user, passw), stream=True)` the `r.url` response will contain the direct "https://data-cdn.servicem8.com/...." link which wouldn't be rate limited. – hmedia1 Oct 13 '16 at 02:26
  • Two other simple steps that should greatly improve the efficiency of this: **1.** Instead of making a call to the Attachment API for every Job uuid, just grab the entire Attachments file in a single request and filter the related_object_uuid with the job uuids which you got in one hit **2.** Once you have downloaded an attachment successfully, store the attachment uuid in a file or database somewhere, and skip any iterations where the uuid has already been processed - that way every time you run the attachment downloader, you are quickly retrieving only new attachments. – hmedia1 Oct 13 '16 at 02:34
  • ....continued... the method you currently have runs an API request for every file before testing whether or not the file currently exists. – hmedia1 Oct 13 '16 at 02:39
  • PS: This is not a duplicate question - as it's not directly about simultaneous downloads using python, but more about how to make the OPs attachment downloader more efficient as per his original post, for which there are a number of ways. – hmedia1 Oct 13 '16 at 02:43
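
A minimal sketch of the second suggestion from the comments (the file name downloaded_uuids.txt and the helper names are just an illustration): keep the uuids of successfully downloaded attachments in a small text file and skip them on later runs, so only new attachments are fetched.

import os

SEEN_FILE = "downloaded_uuids.txt"  # hypothetical bookkeeping file

def load_seen_uuids():
    if not os.path.isfile(SEEN_FILE):
        return set()
    with open(SEEN_FILE) as f:
        return {line.strip() for line in f if line.strip()}

def mark_seen(uuid):
    with open(SEEN_FILE, "a") as f:
        f.write(uuid + "\n")

seen = load_seen_uuids()
# inside the attachment loop:
#     if attachment['uuid'] in seen:
#         continue
#     ...download the file...
#     mark_seen(attachment['uuid'])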

1 Answer

The general idea is to use asynchronous URL requests, and there is a Python module named grequests for that: https://github.com/kennethreitz/grequests

From the documentation:

import grequests
urls = [
    'http://www.heroku.com',
    'http://python-tablib.org',
    'http://httpbin.org',
    'http://python-requests.org',
    'http://fakedomain/',
    'http://kennethreitz.com'
]
#Create a set of unsent Requests:
rs = (grequests.get(u) for u in urls)
#Send them all at the same time:
grequests.map(rs)

And the response:

[<Response [200]>, <Response [200]>, <Response [200]>, <Response [200]>, None, <Response [200]>]
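
Since grequests.map returns the responses in the same order as the requests passed in, information from one batch of requests can still be matched to the item it belongs to, which is what building per-item filenames needs. A minimal sketch (the job uuids are placeholders, and user/passw are assumed to be defined as in the question):

import grequests

job_uuids = ['uuid-1', 'uuid-2']  # placeholder job uuids
urls = ["https://api.servicem8.com/api_1.0/Attachment.json?%24filter=related_object_uuid%20eq%20{}".format(u)
        for u in job_uuids]
rs = (grequests.get(u, auth=(user, passw)) for u in urls)
responses = grequests.map(rs)

# responses[i] corresponds to job_uuids[i], so the data needed for the
# filename is still available per job
for job_uuid, resp in zip(job_uuids, responses):
    if resp is not None and resp.status_code == 200:
        attachments = resp.json()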

Jose Cherian
  • Could you give an example of how to go from my version to a grequests version? I need the information from the other HTTP requests to create the filename for the file to save. – Koen Oct 10 '16 at 23:22