
Currently I am using the code below, but it is taking too much time, as I am converting the whole Dask DataFrame to an in-memory buffer and then uploading it to S3 with a multipart upload.

import io

import boto3
from boto3.s3.transfer import TransferConfig

def multi_part_upload_with_s3(file_buffer_obj, BUCKET_NAME, key_path):
    s3 = boto3.resource('s3')
    # 25 KB multipart threshold/chunk size, up to 10 concurrent upload threads
    config = TransferConfig(multipart_threshold=1024 * 25, max_concurrency=10,
                            multipart_chunksize=1024 * 25, use_threads=True)
    s3.meta.client.upload_fileobj(file_buffer_obj, BUCKET_NAME, key_path, Config=config)

# Materialise the whole Dask DataFrame on one machine, write it to an in-memory
# CSV, then re-wrap the text as bytes for upload_fileobj
target_buffer_old = io.StringIO()
ddf.compute().to_csv(target_buffer_old, sep=",")
target_buffer_old = io.BytesIO(target_buffer_old.getvalue().encode())

multi_part_upload_with_s3(target_buffer_old, "bucket", "key/file.csv")
Simson

1 Answer


I advise you to write to separate S3 files in parallel using Dask (which is its default way of working) and then use a multipart upload to merge the outputs into a single object. You could use the s3fs merge method to do this. Note that you will want to write without headers, so that header rows do not end up in the middle of the merged file.
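A minimal sketch of that approach, assuming ddf is the Dask DataFrame from the question and that the bucket and key names are placeholders:

import s3fs

fs = s3fs.S3FileSystem()

# Write the partitions in parallel as separate part files, without headers,
# so no header rows appear in the middle of the merged output. to_csv
# returns the list of paths it wrote.
paths = ddf.to_csv("s3://bucket/key/parts-*.csv", sep=",",
                   header=False, index=False)

# Server-side multipart copy of the part files into a single object.
fs.merge("bucket/key/file.csv", paths)

Keep in mind that S3's multipart API generally requires every part except the last to be at least 5 MB, so the part files written by Dask should not be too small.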

mdurant