
I have a 10+ GB CSV file arriving in my S3 bucket every day, and I want to add an extra column to that file and save it back to S3. I am using MWAA (Amazon Managed Workflows for Apache Airflow) for this task, but it is failing due to the large file size.

I tried using boto3.client('s3').get_object() and then reading the body with object['Body'].read().decode('utf-8').
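For reference, this is roughly the approach that fails (bucket and key names are placeholders). read() buffers the entire object in memory, which is what breaks on a 10+ GB file:

```python
import boto3

s3 = boto3.client('s3')

# get_object() returns a streaming body, but calling read() on it
# pulls the whole 10+ GB file into memory at once
obj = s3.get_object(Bucket='my-bucket', Key='incoming/data.csv')
body = obj['Body'].read().decode('utf-8')  # exhausts the worker's RAM
```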

  • What do you mean by "add an extra column"? Will it be blank? If not, how do you determine the values for the "extra column"? – John Rotenstein Oct 28 '22 at 03:29
  • That extra column will have the file name as its value. I found a library called smart_open; with it, I am able to process the 10 GB file and add a new column using rstrip and append on each line (roughly as in the sketch below). – Abhi Oct 30 '22 at 00:03
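A minimal sketch of the streaming approach described in that comment, assuming hypothetical bucket paths and column name. smart_open streams S3 objects line by line, so the whole file never sits in memory:

```python
from smart_open import open as s_open

src = 's3://my-bucket/incoming/data.csv'     # hypothetical paths
dst = 's3://my-bucket/processed/data.csv'
file_name = 'data.csv'

with s_open(src, 'r') as fin, s_open(dst, 'w') as fout:
    # Extend the header row with the new column name
    header = fin.readline().rstrip('\n')
    fout.write(header + ',source_file\n')
    # Append the file name to every data row as it streams through
    for line in fin:
        fout.write(line.rstrip('\n') + ',' + file_name + '\n')
```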

1 Answer


The simple method would be:

  • Download the file
  • Modify the file
  • Upload the file

Attempting to read() a 10+ GB file into memory is not a good idea. Downloading it to disk using download_file() will work much better. You can then modify it locally however you wish and upload the resulting file, for example:
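A minimal sketch of that download/modify/upload flow, assuming hypothetical bucket and key names and enough local disk space for the file:

```python
import boto3

s3 = boto3.client('s3')
bucket = 'my-bucket'            # hypothetical names
key = 'incoming/data.csv'

# download_file() streams the object to disk in chunks,
# so memory use stays small regardless of file size
s3.download_file(bucket, key, '/tmp/data.csv')

# Modify the file line by line, adding the file name as a new column
with open('/tmp/data.csv') as fin, open('/tmp/data_out.csv', 'w') as fout:
    header = fin.readline().rstrip('\n')
    fout.write(header + ',source_file\n')
    for line in fin:
        fout.write(line.rstrip('\n') + ',' + key.split('/')[-1] + '\n')

s3.upload_file('/tmp/data_out.csv', bucket, 'processed/data.csv')
```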

I recommend running such a script on an Amazon EC2 instance, or in an AWS Lambda function, so that the data stays within AWS. This will be much faster than transferring the data across the Internet (and therefore lower cost, too).

If the extra column is merely a calculation based on the existing columns, then you could use Amazon Athena:

  • Define a table based upon the format and location of the incoming file
  • Use CREATE TABLE AS to select data from the incoming 'table' and output your desired data -- you can specify the S3 location where the output should be written

However, I suspect that Amazon Athena would produce multiple output files rather than one big file.
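A sketch of what that could look like via boto3, assuming a hypothetical database, table names, and output locations; the source table must already be defined over the incoming file's S3 location:

```python
import boto3

athena = boto3.client('athena')

# CREATE TABLE AS (CTAS) query: reads the incoming 'table' and writes
# the result, with the extra column, to the given S3 location
ctas = """
CREATE TABLE output_table
WITH (external_location = 's3://my-bucket/output/', format = 'TEXTFILE')
AS SELECT *, 'data.csv' AS source_file
FROM incoming_table
"""

athena.start_query_execution(
    QueryString=ctas,
    QueryExecutionContext={'Database': 'my_database'},
    ResultConfiguration={'OutputLocation': 's3://my-bucket/athena-results/'},
)
```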

John Rotenstein