
Currently I am using the code below, but it is taking too much time, as I am converting the whole Dask DataFrame to an in-memory buffer and then uploading it to S3 with a multipart upload.

import io

import boto3
from boto3.s3.transfer import TransferConfig

def multi_part_upload_with_s3(file_buffer_obj, BUCKET_NAME, key_path):
    s3 = boto3.resource('s3')
    # 25 KB multipart threshold/chunk size, up to 10 concurrent upload threads
    config = TransferConfig(multipart_threshold=1024 * 25, max_concurrency=10,
                            multipart_chunksize=1024 * 25, use_threads=True)
    s3.meta.client.upload_fileobj(file_buffer_obj, BUCKET_NAME, key_path, Config=config)

# Materialise the whole Dask DataFrame on one machine, write it to an in-memory
# CSV, then re-wrap the text as bytes for upload_fileobj
target_buffer_old = io.StringIO()
ddf.compute().to_csv(target_buffer_old, sep=",")
target_buffer_old = io.BytesIO(target_buffer_old.getvalue().encode())

multi_part_upload_with_s3(target_buffer_old, "bucket", "key/file.csv")
Simson

1 Answer


I advise you to write to separate S3 files in parallel using Dask (which is its default way of working) and then use a multipart upload to merge the outputs into a single object. You could use the s3fs merge method to do this. Note that you will want to write without headers, so that header rows do not end up in the middle of the merged file.
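A minimal sketch of that approach, assuming ddf is the Dask DataFrame from the question and that the bucket and key names are placeholders:

import s3fs

fs = s3fs.S3FileSystem()

# Write the partitions in parallel as separate part files, without headers,
# so no header rows appear in the middle of the merged output. to_csv
# returns the list of paths it wrote.
paths = ddf.to_csv("s3://bucket/key/parts-*.csv", sep=",",
                   header=False, index=False)

# Server-side multipart copy of the part files into a single object.
fs.merge("bucket/key/file.csv", paths)

Keep in mind that S3's multipart API generally requires every part except the last to be at least 5 MB, so the part files written by Dask should not be too small.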

mdurant