
I have about 130 asynchronous GET requests being sent using httpx and asyncio in Python, via a proxy I created myself on AWS.

In the Python script, I print the time just before each request is sent and can see that they are all sent within less than 70ms of each other. However, I have also timed the duration of each request by taking the current time immediately after it returns, and some requests take up to 30 seconds! The responses are spread fairly evenly over that time, so I get back roughly 3-5 responses every second for 30 seconds.

I used tcpdump and Wireshark to look at the packets coming back, and all the application data arrives within 4 seconds (including the TCP handshakes), so I don't understand the reason for the delay in Python.

The TCP teardowns happen up to 35 seconds later, so maybe that could be the reason for the delay? Does httpx wait for the connection to close (FIN and ACK) before the awaited httpx.get() call unblocks and the response can be read?

What can I try to speed this up?

Here is a simplified version of my code:

import asyncio
from datetime import datetime

import httpx

from utils import store_data, get_proxy_addr


CLIENT = None

async def get_and_store_thing_data(thing):
    t0 = datetime.now()
    res = await CLIENT.get('https://www.placetogetdata.com', params={'thing': thing})
    t1 = datetime.now()
    # It's this line that shows the time is anywhere from 0-30 seconds for the
    # request to return
    print(f'time taken: {t1-t0}')
    data = res.json()
    store_data(data)
    return data


def get_tasks(things):
    tasks = []
    for thing in things:
        task = get_and_store_thing_data(thing)
        tasks.append(task)

    return tasks


async def run_tasks(tasks, proxy_addr):
    global CLIENT
    CLIENT = httpx.AsyncClient(proxies={'https://': proxy_addr})
    try:
        await asyncio.wait(tasks)
    finally:
        await CLIENT.aclose()


def run(things):
    proxy_addr = get_proxy_addr()
    tasks = get_tasks(things)
    asyncio.run(run_tasks(tasks, proxy_addr))
  • All these requests are being made in a single event loop? Could it be that it's taking 30 seconds to process all the previous responses? Can you add some code? – Iain Shelvington Nov 29 '21 at 15:32
  • Does each request go to a unique domain or are they all going to the same one? If the latter, there's a good chance you're being throttled by the server. – dirn Nov 29 '21 at 15:42
  • @IainShelvington Please see the code example I added. They are all being run with a single asyncio.run() – Jonny Shanahan Nov 29 '21 at 15:54
  • @dirn Yes, all going to the same domain....however they are not being throttled, since I have created 26 proxy servers in AWS all served by a single network load balancer, and I get all the data back for all requests each time....when I don't use the proxy I get an error from the domain and am banned for 5 mins. Plus I have looked at the returning packets using Wireshark and all the application data comes back in a few seconds. It seems to be a delay locally between the packets coming back and the Python code unblocking the awaited httpx.get coroutine – Jonny Shanahan Nov 29 '21 at 15:55
  • @JonnyShanahan if you remove everything other than the request and the print of the time taken from your function, is the time taken drastically reduced? I have a hunch that 130 calls to `store_data` is taking 30 seconds – Iain Shelvington Nov 29 '21 at 16:09
  • @IainShelvington Oh wow! Yes, I just tried it and it fixed the problem....I thought that this wouldn't be an issue because it's happening inside the asynchronous function...but am I thinking about it incorrectly? Does each store_data() call need to wait for the previous one to finish? I thought they could all be storing data at the same time, but maybe storing data on hard disk (my temporary solution) can't be done asynchronously like HTTP requests? – Jonny Shanahan Nov 29 '21 at 16:15
  • @JonnyShanahan your async code is running in a single thread, only one task runs at a time. – Iain Shelvington Nov 29 '21 at 16:17
  • @IainShelvington yes but the requests are sent asynchronously...each request is sent out less than 1ms from the previous one, even though the time of each response is about 500ms. I thought it would work the same way for writes to the file system – Jonny Shanahan Nov 29 '21 at 16:20
  • Ahhh, but store_data() is not an asynchronous function...so they will be storing sequentially. I will look into an asynchronous file I/O package in Python. Thanks for your help! – Jonny Shanahan Nov 29 '21 at 16:24
  • @JonnyShanahan each task is run one at a time, each one triggers a request and yields control back to the main loop via the `await` call. You then have 130 tasks all waiting for their request to finish and the event loop is in an infinite while loop checking if any tasks should be resumed. When the first request finishes, the event loop resumes that task, when that task finishes the event loop then goes back to checking if any tasks need to be resumed. The event loop only runs one task at a time so by the time the last task is resumed a fair amount of time has passed – Iain Shelvington Nov 29 '21 at 16:26
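
Following the diagnosis in the comments, here is a minimal sketch of the kind of fix they point towards: run the synchronous `store_data` call in a worker thread so the event loop stays free to resume the other requests as their responses arrive. This is only a sketch, not a confirmed solution; it assumes Python 3.9+ for `asyncio.to_thread` (on older versions `loop.run_in_executor` plays the same role) and reuses `store_data`, `get_proxy_addr`, and the URL from the question.

import asyncio
from datetime import datetime

import httpx

from utils import store_data, get_proxy_addr


async def get_and_store_thing_data(client, thing):
    t0 = datetime.now()
    res = await client.get('https://www.placetogetdata.com', params={'thing': thing})
    print(f'time taken: {datetime.now() - t0}')
    data = res.json()
    # The disk write is synchronous, so hand it to a worker thread instead of
    # blocking the event loop while the other tasks wait to be resumed.
    await asyncio.to_thread(store_data, data)
    return data


async def run_all(things):
    proxy_addr = get_proxy_addr()
    async with httpx.AsyncClient(proxies={'https://': proxy_addr}) as client:
        return await asyncio.gather(
            *(get_and_store_thing_data(client, thing) for thing in things)
        )


# usage: asyncio.run(run_all(things))

An asynchronous file I/O package (the last comment mentions looking into one; aiofiles is a common choice) would be an alternative way to keep the write itself off the event loop.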
