
I have a very large (~300 GB) .tar.gz file. Extracting it (with tar -xzvf file.tar.gz) yields many .json.xz files. I want to decompress these and upload the raw .json files to S3 without saving anything locally (I don't have the disk space to do this). I understand I could spin up an EC2 instance with enough space to extract and upload the files, but I am wondering how (or whether) it can be done directly.

I have tried various versions of tar -xzvf file.tar.gz | aws s3 cp - s3://the-bucket, but this still extracts locally; it also seems to produce .json.xz files rather than raw .json. I've tried to adapt this response from this question, which zips and uploads a file, but haven't had any success yet.

I'm working on Ubuntu 16.04 and am quite new to Linux, so any help is much appreciated!


1 Answer


I think this is how I would do it. There may be more elegant/efficient solutions:

tar --list -zf file.tar.gz | while read -r item
do
    # -O writes the extracted member to stdout; -f must come last,
    # immediately before the archive name
    tar -xzOf file.tar.gz "$item" | aws s3 cp - "s3://the-bucket/$item"
done

So you're iterating over the files in the archive, extracting them one-by-one to stdout and uploading them directly to S3 without first going to disk.

This assumes there is nothing too funny going on with the names of the items in your tar file (quoting "$item" handles spaces, but names containing newlines would still break the loop).
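Since you mentioned wanting raw .json rather than .json.xz in the bucket, you could also decompress each member on the fly before uploading. Here is a minimal sketch of that variation, assuming xz-utils is installed and that simply stripping the .xz suffix gives the object key you want:

tar --list -zf file.tar.gz | while read -r item
do
    # decompress the .xz stream and upload it under the name minus the .xz suffix
    tar -xzOf file.tar.gz "$item" | xz -dc | aws s3 cp - "s3://the-bucket/${item%.xz}"
done

Be aware that each tar -xzOf invocation has to scan the compressed archive from the beginning to find the requested member, so with a 300 GB archive containing many files this loop can take a long time.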

Sean Bright
  • Thanks for this. The only change I needed was that the -f flag needs to come last; this ended up not working for me, but it seems to be a problem with uploading large files to S3 and not with your solution. – hardwhere Aug 21 '19 at 20:34
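If the large-file failures were multipart-upload errors, the --expected-size option to aws s3 cp may help: when reading from stdin the CLI cannot know the stream size in advance, and for very large streams it needs a hint so it can pick part sizes that stay under S3's 10,000-part limit. A sketch of the upload line inside the loop above, with a placeholder size of 100 GB that you would adjust to your largest file:

    # --expected-size is the approximate stream size in bytes (here a guessed 100 GB)
    tar -xzOf file.tar.gz "$item" | xz -dc | aws s3 cp - "s3://the-bucket/${item%.xz}" --expected-size 107374182400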