Expanding on my comments above: the bottleneck in your process is likely to be posting data to the network rather than reading the file from disk. I have run a set of benchmarks below to illustrate this:
| Method | Time (s) |
|--------|----------|
| original method with print statement | 10.600501500070095 |
| reading file but not writing out results | 1.5313809999497607 |
| posting to local web server with requests library | 8.359419499989599 |
| posting to local web server with aiohttp | 7.359877599985339 |
In your example, you are reading a file from disk and then printing it to the console. Writing to the console is much slower than reading from disk, because each print blocks on terminal output; in this benchmark it made the process roughly 7 times slower.
There are two key processes here: reading the data from disk and writing it to the network. Thanks to the operating system's predictive (read-ahead) caching, the fastest way to read data off disk is for a single process to read it sequentially. If anything, having multiple readers will slow reading down.
The next optimisation you might make is to read from disk and write to the network at the same time. As you mention, the GIL means Python threads cannot run Python bytecode in parallel; however, there is no reason why the libraries Python calls into can't be multithreaded. A module written in C, like numpy, can start new threads which do work in parallel with your Python code. asyncio works along similar lines: you create tasks, which are handed to an event loop that keeps checking them until everything is finished. The event loop itself runs in a single thread and multiplexes non-blocking network I/O; the objects it drives may use thread pools in the background (for example for DNS lookups, or for blocking calls submitted via `run_in_executor`), but from a Python perspective your code just waits until everything is done and respects the GIL.
So, to answer your question: as I understand it, asyncio is neither multiprocessing nor, for the most part, multithreading. It schedules non-blocking I/O cooperatively on one thread, falling back to threads behind the scenes only where needed, and all the Python code you write is single threaded as usual.
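To make the "threads behind the scenes" point concrete, here is a minimal sketch showing blocking work pushed onto background worker threads while the calling Python code stays single threaded. It uses `asyncio.to_thread` (available from Python 3.9); `blocking_io` is a stand-in for any blocking call such as a C extension or a disk read.

```python
import asyncio
import threading
import time


def blocking_io(n):
    # Stand-in for a blocking call; asyncio runs it in a worker
    # thread from its default thread pool, not in the main thread.
    time.sleep(0.1)
    return threading.current_thread().name


async def main():
    # Three blocking calls run concurrently in background threads,
    # while this coroutine (on the main thread) just awaits them.
    return await asyncio.gather(
        *(asyncio.to_thread(blocking_io, i) for i in range(3))
    )


thread_names = asyncio.run(main())
print(thread_names)
```

From your code's point of view this is still ordinary single-threaded awaiting; the thread pool is an implementation detail of the event loop.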
The test code reads the csv file in chunks and creates tasks to send each chunk to the server. There is also a very simple server implementation for accepting the requests. In this case you see a small gain in performance over the requests library, of the same order as the time taken to read the file from disk. Note that the posting methods serialise each chunk to JSON, a step the read-only method skips entirely. In practice, the upload is likely to be much slower and bandwidth constrained, so it is unlikely to benefit much, if at all, from multiprocessing.
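As a side note on the design: `TCPConnector(limit=10)` in the test code caps concurrency at the connection level; an `asyncio.Semaphore` achieves a similar cap at the task level. A minimal sketch, with a hypothetical `fake_post` coroutine standing in for the real `session.post` call:

```python
import asyncio


async def fake_post(data, sem):
    # Hold the semaphore while "uploading", so at most 10 posts
    # run at once; fake_post stands in for an aiohttp session.post.
    async with sem:
        await asyncio.sleep(0.01)  # simulated network latency
        return len(data)


async def upload_all(chunks):
    sem = asyncio.Semaphore(10)  # same cap as TCPConnector(limit=10)
    return await asyncio.gather(*(fake_post(c, sem) for c in chunks))


sizes = asyncio.run(upload_all([list(range(500)) for _ in range(20)]))
print(sizes)
```

The semaphore approach is useful when you want to limit in-flight requests without constraining the session's connection pool.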
Test code:
```python
import asyncio
import os
import random
import timeit

import pandas as pd
import requests
from aiohttp import ClientSession, TCPConnector


def create_test_data(path, col_count, chunk_count, chunk_size):
    col_names = [f'col-{hex(i)}' for i in range(col_count)]
    header = True
    if os.path.exists(path):
        os.remove(path)
    for chunk in range(chunk_count):
        data = {c: [random.randint(0, 10000) for i in range(chunk_size)]
                for c in col_names}
        df = pd.DataFrame(data)
        df.to_csv(path, mode='a', header=header, index=False)
        header = False


server = 'http://127.0.0.1:8080'
src = r'./bigfile.csv'
# create_test_data(src, 50, 100, 1000)


def method0():
    """original method with print statement"""
    for file_chunk in pd.read_csv(src, chunksize=500):
        print(file_chunk)


def method1():
    """reading file but not writing out results"""
    for file_chunk in pd.read_csv(src, chunksize=500):
        pass


def method2():
    """posting to local web server with requests library"""
    for file_chunk in pd.read_csv(src, chunksize=500):
        with requests.post(server, json=file_chunk.to_dict()) as resp:
            pass


async def do_post(session, data):
    # pass the dict directly; json=json.dumps(data) would double-encode it
    async with session.post(server, json=data):
        pass


async def upload_data():
    tasks = []
    connector = TCPConnector(limit=10)
    async with ClientSession(connector=connector) as session:
        for file_chunk in pd.read_csv(src, chunksize=500):
            tasks.append(asyncio.create_task(do_post(session, file_chunk.to_dict())))
        await asyncio.gather(*tasks)


def method3():
    """posting to local web server with aiohttp"""
    asyncio.run(upload_data())


t0 = timeit.timeit(lambda: method0(), number=1)
t1 = timeit.timeit(lambda: method1(), number=1)
t2 = timeit.timeit(lambda: method2(), number=1)
t3 = timeit.timeit(lambda: method3(), number=1)

print('| Method | Time |')
print('|------------------ |------|')
print(f'| {method0.__doc__} | {t0} |')
print(f'| {method1.__doc__} | {t1} |')
print(f'| {method2.__doc__} | {t2} |')
print(f'| {method3.__doc__} | {t3} |')
```
Server code:
```python
import asyncio

from aiohttp import web


async def handler(request):
    data = await request.json()
    await asyncio.sleep(0.1)  # simulate some processing time per request
    return web.json_response({'read': f'{len(data)}', 'status': 'OK'})


app = web.Application()
app.add_routes([
    web.get('/', handler),
    web.post('/', handler),
])
web.run_app(app, port=8080, host='localhost')
```