
I have a streaming application that almost continuously consumes incoming data, sends an HTTP request for each value, and does something with the returned result.

To speed things up I've used the asyncio and aiohttp libraries in Python 3.7, but it becomes hard to debug given how fast the data moves.

This is what my code looks like:

import asyncio
import logging
import os
import sys
import time
import traceback
from json import loads

import xmltodict
from aiohttp import ClientSession
from aiokafka import AIOKafkaConsumer

logger = logging.getLogger(__name__)
# KafkaConfigs, doSomething, to_dict and simplySaveDataIntoDataBase are defined elsewhere

'''
Gets the final requests
'''
async def apiRequest(info, url, session, reqType, post_data=''):
    if reqType:
        async with session.post(url, data=post_data) as response:
            info['response'] = await response.text()
    else:
        async with session.get(url + post_data) as response:
            info['response'] = await response.text()
    logger.debug(info)
    return info

'''
Loops through a batch and sends the requests out concurrently
'''
async def getRequests(data, listOfData):
    tasks = []
    async with ClientSession() as session:
        for reqData in listOfData:
            try:
                task = asyncio.ensure_future(apiRequest(**reqData))
                tasks.append(task)
            except Exception as e:
                print(e)
                exc_type, exc_obj, exc_tb = sys.exc_info()
                fname = os.path.split(exc_tb.tb_frame.f_code.co_filename)[1]
                print(exc_type, fname, exc_tb.tb_lineno)
        responses = await asyncio.gather(*tasks)
    return responses #list of APIResponses

'''
Streams data in and prepares batches to send for requests
'''
async def Kconsumer(data, loop, batchsize=100):
    consumer = AIOKafkaConsumer(**KafkaConfigs)
    await consumer.start()
    dataPoints = []
    startTime = time.time()
    async for msg in consumer:
        try:
            sys.stdout.flush()
            consumedMsg = loads(msg.value.decode('utf-8'))
            if consumedMsg['tid']:
                dataPoints.append(consumedMsg)
            if len(dataPoints) == batchsize or time.time() - startTime > 5:
                '''
                #1: The task below goes and sends HTTP GET requests in bulk using aiohttp
                '''
                task = asyncio.ensure_future(getRequests(data, dataPoints))
                res = await asyncio.gather(*[task])
                if task.done():
                    outputs = []
                    '''
                    #2: Does some ETL on the returned values
                    '''
                    ids = await asyncio.gather(*[doSomething(**{'tid': x['tid'],
                                            'cid': x['cid'], 'tn': x['tn'],
                                            'id': x['id'], 'ix': x['ix'],
                                            'ac': x['ac'], 'output': to_dict(xmltodict.parse(x['response'], encoding='utf-8')),
                                            'loop': loop, 'option': 1}) for x in res[0]])
                    simplySaveDataIntoDataBase(ids) # This is where I see some missing data in the database
                dataPoints = []
                startTime = time.time()
        except Exception as e:
            logger.error(e)
            logger.error(traceback.format_exc())
            exc_type, exc_obj, exc_tb = sys.exc_info()
            fname = os.path.split(exc_tb.tb_frame.f_code.co_filename)[1]
            logger.error(str(exc_type) + ' ' + str(fname) + ' ' + str(exc_tb.tb_lineno))


if __name__ == '__main__':
    loop = asyncio.get_event_loop()
    asyncio.ensure_future(Kconsumer(data, loop, batchsize=100))
    loop.run_forever()

Does the ensure_future need to be awaited? How does aiohttp handle requests that come a little later than the others? Shouldn't it hold the whole batch back instead of forgetting about it altogether?

Augmented Jacob

1 Answer


Does the ensure_future need to be awaited?

Yes, and your code is doing that already. await asyncio.gather(*tasks) awaits the provided tasks and returns their results in the same order.

Note that await asyncio.gather(*[task]) doesn't make sense, because it is equivalent to await asyncio.gather(task), which is again equivalent to await task. In other words, when you need the result of getRequests(data, dataPoints), you can write res = await getRequests(data, dataPoints) without the ceremony of first calling ensure_future() and then calling gather().
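Applied to the #1 block in your Kconsumer, the whole thing collapses to a direct await. A minimal sketch using the names from your question; once the await returns, the work is finished, so both the task.done() check and the res[0] indexing disappear:

# Sketch: inside the batch branch of Kconsumer
res = await getRequests(data, dataPoints)
# res is now the list of responses itself; iterate over res, not res[0]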

In fact, you almost never need to call ensure_future yourself:

  • if you need to await multiple tasks, you can pass coroutine objects directly to gather, e.g. gather(coroutine1(), coroutine2()).
  • if you need to spawn a background task, you can call asyncio.create_task(coroutine(...)); both patterns are sketched below.
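
A minimal sketch of both patterns (fetch_text, demo, and the URLs are hypothetical names, not from your code):

import asyncio
from aiohttp import ClientSession

async def fetch_text(session, url):
    async with session.get(url) as response:
        return await response.text()

async def demo(urls):
    async with ClientSession() as session:
        # Pattern 1: pass coroutine objects directly to gather.
        bodies = await asyncio.gather(*(fetch_text(session, u) for u in urls))

        # Pattern 2: spawn a background task, do other work, await it later.
        background = asyncio.create_task(fetch_text(session, urls[0]))
        first = await background
        return bodies, first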

How does aiohttp handle requests that come a little later than the others? Shouldn't it hold the whole batch back instead of forgetting about it altogether?

If you use gather, all requests must finish before any of their results are returned. (That is not aiohttp policy; it's how gather works.) If you need to implement a timeout, you can use asyncio.wait_for or similar.
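
For example, you could cap each individual request so that a slow response is recorded rather than stalling or discarding the whole batch. A minimal sketch wrapping your apiRequest; the 5-second timeout is an arbitrary choice:

async def apiRequestCapped(info, url, session, reqType, post_data='', timeout=5):
    # Wrap the question's apiRequest in a per-request timeout.
    try:
        return await asyncio.wait_for(
            apiRequest(info, url, session, reqType, post_data),
            timeout=timeout)
    except asyncio.TimeoutError:
        # Mark the timed-out request instead of losing the whole batch.
        info['response'] = None
        return info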

user4815162342
  • thanks for your answer. I do see how I could've used await task instead of gathering, but as for the last part about gathering, I don't think it always waits for everything to finish. As you can see in my code, I had to ensure the tasks were done using an if task.done(): because otherwise my functions would fail with something like `Received future expected char or str`. – Augmented Jacob Mar 06 '19 at 19:38
  • @AugmentedJacob "Received future expected xxx" indicates that you were missing an `await`, or that you didn't call `future.result()`, regardless of whether the future is done. `gather` guarantees that, if it returns without raising, all the futures will be done. Otherwise, it couldn't populate its result tuple. – user4815162342 Mar 06 '19 at 20:04
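
To make that concrete, a minimal illustrative sketch (some_coroutine and process are hypothetical placeholders):

task = asyncio.create_task(some_coroutine())
process(task)          # wrong: process() receives a Task, not a str
process(await task)    # right: awaiting yields the coroutine's actual result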