
I'm doing some processing on Hive. Usually, the result is a folder (on S3) with multiple files (named with some random letters and numbers, in order) that I can just 'cat' together.
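For context, the usual workflow looks something like this (bucket, folder, and output names are placeholders):

# Download all the part files, then concatenate them in name order.
s3cmd get --recursive s3://bucket_name/folder_name/ ./parts/
cat ./parts/* > combined_output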

But for reports, I only need the first and the last file in the folder. If the files number in the hundreds, I can simply download them via the web GUI.

But if they number in the thousands, scrolling down is a pain. Not to mention, Amazon loads the listing on the fly as needed, rather than showing it all at once.

I tried s3cmd get but my experience with that is basic at best. I end up downloading the contents of the entire folder.

As far as I know one can pipe in extra commands, but I'm not sure how to do that.

So, how do I use s3cmd get to download only the last file in a specific folder?

Thanks.

zack_falcon

1 Answer


I guess this command should work for you:

s3cmd get $(s3cmd ls s3://bucket_name/folder_name/ | tail -1 | awk '{ print $4 }')

`tail -1` picks the last line of the folder listing, and `awk '{ print $4 }'` picks the name of the file (the fourth field of the listing output).

For the first file, just replace `tail -1` with `head -1`.
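To see why the fourth field is the file name: `s3cmd ls` prints the date, time, size, and object URL on each line. A made-up sample listing for illustration:

$ s3cmd ls s3://bucket_name/folder_name/
2016-07-27 17:21      1024   s3://bucket_name/folder_name/000000_0
2016-07-27 17:21      2048   s3://bucket_name/folder_name/000001_0
2016-07-27 17:21       512   s3://bucket_name/folder_name/000002_0

`awk '{ print $4 }'` pulls out that last column, which `s3cmd get` can consume directly.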

Taha Husain
    That would make two requests to S3 to retrieve the same list of files which could take a long time if the files number in the thousands. You can get the first and last lines with a single `awk` script: `s3cmd ls s3://bucket_name/folder_name/ | awk 'NR == 1 { print $4 }END{ print $4 }' | xargs s3cmd get`. – jwadsack Jul 27 '16 at 17:21
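Building on that comment, a slightly more readable sketch that lists the folder once and fetches both files (bucket and folder names are placeholders, and it assumes the listing contains only the part files, with no DIR lines from sub-prefixes):

# List the folder once, then reuse the listing for both lookups.
listing=$(s3cmd ls s3://bucket_name/folder_name/)

# First and last entries; $4 is the object URL column.
first=$(printf '%s\n' "$listing" | head -1 | awk '{ print $4 }')
last=$(printf '%s\n' "$listing" | tail -1 | awk '{ print $4 }')

s3cmd get "$first"
s3cmd get "$last"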