
It is always possible to use s3distcp to copy a file (or set of files) to another location in S3, but is it possible, using mapred or any other Hadoop/EMR functionality, to take a random sample (or every nth line) of the file(s) and write it to a new location in S3? The goal is to avoid the time spent copying the data down to the local machine and uploading it again to S3.

Here's the time-consuming code I want to replace:

aws s3 cp s3://... localLocation
cat localLocation | awk '{if(NR%10==0) print $0}' > samp.txt
aws s3 cp samp.txt s3://..anotherLocation
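
For reference, here's the kind of map-only Hadoop Streaming job I have in mind (a rough sketch; the streaming-jar path varies by EMR release, and the bucket paths are placeholders). Note that awk's NR counts lines per mapper, so this keeps every 10th line within each input split rather than every 10th line globally:

# Mapper: pass through every 10th line of its split.
cat > sample.sh <<'EOF'
#!/bin/bash
awk 'NR % 10 == 0'
EOF
chmod +x sample.sh

# Map-only job: no reducers, output written straight back to S3.
hadoop jar /usr/lib/hadoop/hadoop-streaming.jar \
  -D mapreduce.job.reduces=0 \
  -files sample.sh \
  -mapper sample.sh \
  -input s3://mybucket/input/ \
  -output s3://mybucket/sampled/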
Kuber
  • FWIW `cat localLocation | awk '{if(NR%10==0) print $0}'` can be written as just `awk '!(NR%10)' localLocation` – Ed Morton Nov 30 '15 at 16:01
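
Building on that comment, the temporary files can be dropped entirely by streaming through stdin/stdout with `aws s3 cp` and `-` (the data still passes through the local machine, just without touching disk; bucket paths are placeholders):

aws s3 cp s3://mybucket/input.txt - | awk '!(NR%10)' | aws s3 cp - s3://mybucket/samp.txt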

1 Answer


When retrieving a file from Amazon S3, the entire file must be downloaded; random access is not supported.

Matt Houser
  • Technically... S3 supports HTTP `Range: bytes=${first}-${last}` requests, but the ranges are bytes, not lines. That's not to say it's going to be useful here without a lot of creativity, of course, but S3 does support random access on reads. – Michael - sqlbot Nov 30 '15 at 19:21
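
To illustrate that comment, a ranged read via the CLI looks like this (bucket and key are hypothetical; the response body is just the requested bytes, so the caller still has to reassemble line boundaries):

# Fetch only the first KiB of the object (byte range is inclusive).
aws s3api get-object --bucket mybucket --key input.txt --range bytes=0-1023 part.txt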