
So, this is my use case: I need to read data from a pipe-separated CSV, process it (just map the fields to variables), and then send it to a third party.

The CSV is currently around 600 MB with about 1 million rows, and it can grow or vary in size.

I tried loading it in chunks using pandas and printing each chunk to the console:

import time
import pandas as pd

start = time.perf_counter()

def do_something():
    for file_chunk in pd.read_csv(r".\excel.csv", encoding='utf-16', sep="|", chunksize=500):
        print(file_chunk)

if __name__ == "__main__":
    do_something()
    finish = time.perf_counter()
    print(finish - start)

It takes around 170 seconds on average.

Then I tried Python multiprocessing with two processes:

import time
import pandas as pd
from multiprocessing import Process

start = time.perf_counter()

def do_something():
    for file_chunk in pd.read_csv(r".\excel.csv", encoding='utf-16', sep="|", chunksize=500):
        print(file_chunk)

if __name__ == "__main__":
    p1 = Process(target=do_something)
    p2 = Process(target=do_something)

    p1.start()
    p2.start()

    p1.join()
    p2.join()
    finish = time.perf_counter()
    print(finish - start)
    

190 seconds

So it took around 1.5x to 2x the time I expected, instead of being faster.

So, the question is: what would be the best way to handle this use case? Both reading the CSV and making the network requests need to be done optimally.

That Coder
  • It is a little unclear what you are trying to do. However, if you are trying to check the performance of any code, you shouldn't print the data to the console. In many cases, and certainly in this case, printing the data will be much slower than reading it from disk. I would expect your code to be somewhere between 10 and 100 times faster if you remove the print statement. Multithreading the file read will also make it slower, as you are introducing contention. – John M. Nov 04 '22 at 09:22
  • I am trying to do this: read the CSV file, then send the rows to a third party / make a POST request with them. My constraints are: it is a large CSV file, it needs to be done in a multithreaded/multiprocessing fashion, and it needs to be as fast as possible. I can remove printing to the console, but reading the file is an essential part of my task. – That Coder Nov 04 '22 at 09:54
  • I haven't yet implemented sending each row to the third party via a network request – That Coder Nov 04 '22 at 09:55
  • If you are posting the data over the network, then reading the CSV file isn't going to be your bottleneck. A typical hard disk reads at 80–160 MB/s, and you would need a pretty fast network connection before reading becomes the bottleneck. The bottleneck will most likely be writing to your outgoing connection. There are lots of things you can do to improve the performance of outgoing data, some harder work than others. Is this a one-off, or is it something you need to do regularly? Does the data need to arrive at the third party in the same order as in the CSV? – John M. Nov 04 '22 at 14:02
  • I would probably look at using [`asyncio`](https://docs.python.org/3/library/asyncio.html). If the blocking part of the process is going to be writing outgoing data, using asyncio you can let the system deal with this on another thread while you read the next chunk of your CSV. I would also make sure you're compressing outgoing data with something like gzip. [This question](https://stackoverflow.com/questions/51699817/python-async-post-requests) discusses something similar to your problem. – John M. Nov 04 '22 at 14:17
  • asyncio uses multithreading rather than multiprocessing, right? And there's the GIL in Python to be handled. – That Coder Nov 06 '22 at 02:34

1 Answer


Expanding on my comments above: the bottleneck in your process is likely to be posting data to the network rather than reading the file from disk. I have run a set of benchmarks below to illustrate this:

| Method | Time (s) |
|--------|----------|
| original method with print statement | 10.600501500070095 |
| reading file but not writing out results | 1.5313809999497607 |
| posting to local web server with requests library | 8.359419499989599 |
| posting to local web server with aiohttp | 7.359877599985339 |

In your example, you are reading a file from disk and then printing to the console. Writing to the console is much slower than reading from disk due to contention updating the console. In this example, writing to the console made the process about 7 times slower.

There are two key processes here: reading the data from disk and writing it to the network. Due to the disk's predictive caching, the fastest way to read data off disk is for a single process to read it sequentially, moving forward. If anything, having multiple readers will slow reading down.

The next optimisation step you might look to make is reading from disk and writing to the network at the same time. As you mention, the GIL means that pure Python bytecode cannot run in parallel across threads; however, there is no reason why the libraries Python calls can't be multithreaded or release the GIL while they wait. If you imagine a Python module written in C, like numpy, there is no reason why it can't kick off new threads that do work in parallel with your Python code. asyncio takes a related approach: you create tasks, hand them to an event loop, and the loop keeps checking them until everything is finished. Most of the waiting happens on non-blocking sockets handled by the operating system, and some operations may be delegated to thread pools in the background, but from a Python perspective your code simply waits until everything is done and respects the GIL constraints.

So, to answer your question: asyncio itself runs your code on a single thread with an event loop rather than using multiprocessing; blocking work can be pushed to a thread pool behind the scenes (for example via `run_in_executor`), but all the Python code you write runs single-threaded as usual.
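
To make the thread-pool point concrete: if one step really does block (heavy parsing, mapping rows to variables), asyncio can push it onto a thread pool explicitly with `run_in_executor` while the event loop keeps other tasks moving. This is only a minimal sketch, assuming the same pipe-separated file as in the question; `parse_chunk` is just a placeholder for whatever per-chunk processing you need:

import asyncio
import pandas as pd

def parse_chunk(chunk):
    # placeholder for whatever per-chunk work you need (mapping fields to variables, etc.)
    return chunk

async def process_file(path):
    loop = asyncio.get_running_loop()
    # note: the read itself still runs on the event-loop thread in this sketch
    for chunk in pd.read_csv(path, sep="|", encoding="utf-16", chunksize=500):
        # hand the blocking step to the default thread pool so the event loop
        # can keep servicing any pending network tasks while it runs
        processed = await loop.run_in_executor(None, parse_chunk, chunk)
        # ...create an upload task for `processed` here...

# asyncio.run(process_file(r".\excel.csv"))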

The test code reads the CSV file in chunks and creates tasks to send that data to the server. There is also a very simple server implementation for accepting the requests. In this case you see a small gain in performance over the requests library, of the same order as the time taken to read the file from disk. Note that the posting methods serialise the data as JSON, while the read-only method skips that step. Also note that in practice the upload is likely to be much slower and bandwidth-constrained, so it is unlikely to benefit much, if at all, from multiprocessing.
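
One further optimisation mentioned in the comments is compressing the outgoing data. Assuming the receiving endpoint accepts `Content-Encoding: gzip` (worth confirming with the third party), each chunk could be compressed before posting; `do_post_gzip` below is a hedged variant of the `do_post` coroutine in the test code, not something I have benchmarked:

import gzip
import json
from aiohttp import ClientSession

async def do_post_gzip(session: ClientSession, url: str, data: dict):
    # serialise the chunk once, then gzip it; tabular JSON usually compresses well
    body = gzip.compress(json.dumps(data).encode("utf-8"))
    headers = {"Content-Type": "application/json", "Content-Encoding": "gzip"}
    async with session.post(url, data=body, headers=headers) as resp:
        resp.raise_for_status()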

Test code:

import random, os, timeit, requests, asyncio
import pandas as pd
from aiohttp import ClientSession
from aiohttp import TCPConnector
import json

def create_test_data(path, col_count, chunk_count, chunk_size):
    # build a CSV of random integers, appending one chunk at a time
    col_names = [f'col-{hex(i)}' for i in range(col_count)]
    header = True
    if os.path.exists(path):
        os.remove(path)
    for chunk in range(chunk_count):
        data = {c: [random.randint(0, 10000) for i in range(chunk_size)] for c in col_names}
        df = pd.DataFrame(data)
        df.to_csv(path, mode='a', header=header, index=False)
        header = False

server = 'http://127.0.0.1:8080'
src = r'./bigfile.csv'
#create_test_data(src, 50, 100, 1000)

def method0():
    """original method with print statement"""
    for file_chunk in pd.read_csv(src, chunksize=500):
        print(file_chunk)

def method1():
    """reading file but not writing out results"""
    for file_chunk in pd.read_csv(src, chunksize=500):
        pass

def method2():
    """posting to local web server with requests library"""
    for file_chunk in pd.read_csv(src, chunksize=500):
        with requests.post(server, json=file_chunk.to_dict()) as resp:
            pass

async def do_post(session, data):
    # pass the dict straight to aiohttp's json= argument so it is serialised once
    async with session.post(server, json=data):
        pass

async def upload_data():
    tasks = []
    connector = TCPConnector(limit=10)
    async with ClientSession(connector=connector) as session:
        for file_chunk in pd.read_csv(src, chunksize=500):
            tasks.append(asyncio.create_task(do_post(session, file_chunk.to_dict())))
        await asyncio.gather(*tasks)

def method3():
    """posting to local web server wtih aiohttp"""
    asyncio.run(upload_data())

t0, t1, t2, t3 = None, None, None, None
t0 = timeit.timeit(lambda: method0(), number=1)
t1 = timeit.timeit(lambda: method1(), number=1)
t2 = timeit.timeit(lambda: method2(), number=1)
t3 = timeit.timeit(lambda: method3(), number=1)

print(f'| Method                                  | Time |')
print(f'|------------------                       |------|')
print(f'| {method0.__doc__}                       | {t0} |')
print(f'| {method1.__doc__}                       | {t1} |')
print(f'| {method2.__doc__}                       | {t2} |')
print(f'| {method3.__doc__}                       | {t3} |')

Server code:

from aiohttp import web
import asyncio

async def handler(request):
    data = await request.json()
    await asyncio.sleep(0.1)  # simulate some per-request processing latency
    return web.json_response({'read': f'{len(data)}', 'status': 'OK'})

app = web.Application()
app.add_routes([
    web.get('/', handler),
    web.post('/', handler)])

web.run_app(app, port=8080, host='localhost')
John M.
  • Thanks a lot for this brief explanation. So, it'd be feasible to read 500 rows as a chunk linearly and make a POST request using asyncio? – That Coder Nov 07 '22 at 04:43
  • Yes that should be feasible. The first limit you are likely to run into is how much data the web server you are posting to will accept in one request. – John M. Nov 07 '22 at 07:23
  • It is 500 KB per batch request and 32 KB per object. The third party is nothing but Segment. – That Coder Nov 07 '22 at 08:15
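
Given those limits (500 KB per batch, 32 KB per object), here is a rough sketch of how rows could be grouped before posting. The byte counting is approximate and `MAX_BATCH_BYTES` is just the figure quoted in the comment, so treat this as a starting point rather than a tested client for that API:

import json
import pandas as pd

MAX_BATCH_BYTES = 500 * 1024  # batch limit quoted in the comments

def iter_batches(path, chunksize=500):
    """Yield lists of row dicts whose combined JSON size stays under the batch limit."""
    batch, batch_bytes = [], 0
    for chunk in pd.read_csv(path, sep="|", encoding="utf-16", chunksize=chunksize):
        for row in chunk.to_dict(orient="records"):
            # rough size estimate; ignores a few bytes of list/comma overhead
            row_bytes = len(json.dumps(row).encode("utf-8"))
            if batch and batch_bytes + row_bytes > MAX_BATCH_BYTES:
                yield batch
                batch, batch_bytes = [], 0
            batch.append(row)
            batch_bytes += row_bytes
    if batch:
        yield batch  # flush the final partial batch

# Each yielded batch can then be handed to an asyncio task that POSTs it,
# in the same way upload_data() in the answer creates one task per chunk.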