
It is always possible to use s3distcp to copy a file (or set of files) to another location in S3, but is it possible, using mapred or any other Hadoop/EMR functionality, to take a random sample (or every nth line) of the file(s) and write it to a new location in S3? The goal is to avoid the time spent copying the data down to the local machine and uploading it again to S3.

Here's the time-consuming code I want to replace:

aws s3 cp s3://... localLocation
cat localLocation | awk '{if(NR%10==0) print $0}' > samp.txt
aws s3 cp samp.txt s3://..anotherLocation
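
For reference, here's the kind of map-only Hadoop Streaming job I have in mind (a rough sketch; the streaming-jar path varies by EMR release, and the bucket paths are placeholders). Note that awk's NR counts lines per mapper, so this keeps every 10th line within each input split rather than every 10th line globally:

# Mapper: pass through every 10th line of its split.
cat > sample.sh <<'EOF'
#!/bin/bash
awk 'NR % 10 == 0'
EOF
chmod +x sample.sh

# Map-only job: no reducers, output written straight back to S3.
hadoop jar /usr/lib/hadoop/hadoop-streaming.jar \
  -D mapreduce.job.reduces=0 \
  -files sample.sh \
  -mapper sample.sh \
  -input s3://mybucket/input/ \
  -output s3://mybucket/sampled/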
Kuber
  • FWIW `cat localLocation | awk '{if(NR%10==0) print $0}'` can be written as just `awk '!(NR%10)' localLocation` – Ed Morton Nov 30 '15 at 16:01
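
Building on that comment, the temporary files can be dropped entirely by streaming through stdin/stdout with `aws s3 cp` and `-` (the data still passes through the local machine, just without touching disk; bucket paths are placeholders):

aws s3 cp s3://mybucket/input.txt - | awk '!(NR%10)' | aws s3 cp - s3://mybucket/samp.txt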

1 Answer


When retrieving a file from Amazon S3, the entire file must be downloaded; random access is not supported.

Matt Houser
  • Technically... S3 supports HTTP `Range: bytes=${first}-${last}` requests, but the ranges are bytes, not lines. That's not to say it's going to be useful here without a lot of creativity, of course, but S3 does support random access on reads. – Michael - sqlbot Nov 30 '15 at 19:21
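
To illustrate that comment, a ranged read via the CLI looks like this (bucket and key are hypothetical; the response body is just the requested bytes, so the caller still has to reassemble line boundaries):

# Fetch only the first KiB of the object (byte range is inclusive).
aws s3api get-object --bucket mybucket --key input.txt --range bytes=0-1023 part.txt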