I have a list dataframe_chunk which contains chunks of a very large pandas dataframe. I would like to write each chunk to a different CSV file, and to do so in parallel. However, I see the files being written sequentially and I'm not sure why. Here's the code:

import concurrent.futures as cfu

def write_chunk_to_file(chunk, fpath):  
    chunk.to_csv(fpath, sep=',', header=False, index=False)

pool = cfu.ThreadPoolExecutor(N_CORES)

futures = []
for i in range(N_CORES):
    fpath = '/path_to_files_'+str(i)+'.csv'
    futures.append(pool.submit( write_chunk_to_file(dataframe_chunk[i], fpath)))

for f in cfu.as_completed(futures):
    print("finished at ",time.time())

Any clues?

elelias

2 Answers

One thing that is stated in the Python 2.7.x threading docs but not in the 3.x docs is that Python cannot achieve true parallelism with the threading library because of the Global Interpreter Lock (GIL): only one thread executes Python bytecode at a time.

You should try using concurrent.futures with ProcessPoolExecutor, which runs each job in a separate process and can therefore achieve true parallelism on a multi-core CPU.
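
For instance, here is a minimal sketch of that approach. It is not the asker's exact code: the dummy dataframe, chunk count and output paths are stand-ins for the question's dataframe_chunk and file names. Note that submit takes the callable and its arguments separately, rather than the result of calling the function.

import concurrent.futures as cfu
import time

import numpy as np
import pandas as pd

N_CORES = 4

def write_chunk_to_file(chunk, fpath):
    chunk.to_csv(fpath, sep=',', header=False, index=False)

if __name__ == '__main__':
    # Stand-in for the question's dataframe_chunk list: split a dummy frame
    # into N_CORES pieces (replace with the real chunks).
    big = pd.DataFrame(np.random.rand(1000000, 4))
    dataframe_chunk = np.array_split(big, N_CORES)

    with cfu.ProcessPoolExecutor(max_workers=N_CORES) as pool:
        futures = [pool.submit(write_chunk_to_file, dataframe_chunk[i],
                               './path_to_files_' + str(i) + '.csv')
                   for i in range(N_CORES)]
        for f in cfu.as_completed(futures):
            f.result()  # re-raise any exception from the worker
            print("finished at", time.time())

Keep in mind that each chunk gets pickled and shipped to a worker process, so for very large chunks the copying overhead can eat into the speed-up.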

Update

Here is your program adapted to use the multiprocessing library instead:

#!/usr/bin/env python3

from multiprocessing import Process

import os
import time

N_CORES = 8

def write_chunk_to_file(chunk, fpath):
    # Dummy workload standing in for chunk.to_csv(...) so the demo is
    # self-contained; `chunk` is just the loop index here.
    with open(fpath, "w") as f:
        for x in range(10000000):
            f.write(str(x))

processes = []

print("my pid:", os.getpid())
input("Hit return to start:")

start = time.time()
print("Started at:", start)

os.makedirs("./tmp", exist_ok=True)  # make sure the output directory exists

for i in range(N_CORES):
    fpath = './tmp/file-' + str(i) + '.csv'
    p = Process(target=write_chunk_to_file, args=(i, fpath))
    processes.append(p)

for p in processes:
    p.start()

print("All jobs started.")

for p in processes:
    p.join()

print("All jobs finished at", time.time())

You can monitor the jobs with this shell command in another window:

while true; do clear; pstree 12345; ls -l tmp; sleep 1; done

(Replace 12345 with the pid emitted by the script.)

ErikR

Your code probably does work: it starts writing the 2nd and subsequent files while the 1st chunk is still being written, and so on. It will be slightly faster than the simple synchronous version because the syscalls follow each other sooner.

But from the kernel's perspective the I/O syscalls still arrive one after another from a single Python process, so the files are still created serially, just at a higher rate.
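
If you want to check how much the writes actually overlap, a small sketch along these lines (the timed_write helper and the dummy data are hypothetical, not from the question) prints a start and end timestamp for each chunk, so you can see whether the intervals interleave:

import concurrent.futures as cfu
import threading
import time

import numpy as np
import pandas as pd

def timed_write(chunk, fpath):
    # Write one chunk and report when the write started/ended and which thread ran it.
    t0 = time.time()
    chunk.to_csv(fpath, sep=',', header=False, index=False)
    print(fpath, "thread", threading.get_ident(),
          "start %.3f end %.3f" % (t0, time.time()))

if __name__ == '__main__':
    # Stand-in chunks; use the real dataframe_chunk list instead.
    chunks = np.array_split(pd.DataFrame(np.random.rand(200000, 4)), 4)
    with cfu.ThreadPoolExecutor(4) as pool:
        for i, c in enumerate(chunks):
            pool.submit(timed_write, c, './timed_' + str(i) + '.csv')
        # leaving the with-block waits for all submitted writes to finish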

jiggunjer