Partially reading a tar.gz file from Amazon S3

Question

I'm trying to extract specific files from a tar.gz archive stored in Amazon S3 without reading all of its bytes, because the archives can be huge and I only need 2 or 3 files from each one.

I'm using the AWS Java SDK. Here's the code (exception handling omitted):

AWSCredentials credentials = new BasicAWSCredentials("accessKey", "secretKey");
AWSCredentialsProvider credentialsProvider = new AWSStaticCredentialsProvider(credentials);
AmazonS3 s3Client = AmazonS3ClientBuilder.standard().withRegion(Regions.US_EAST_1).withCredentials(credentialsProvider).build();
S3Object object = s3Client.getObject("bucketname", "file.tar.gz");
S3ObjectInputStream objectContent = object.getObjectContent();

TarArchiveInputStream tarInputStream = new TarArchiveInputStream(new GZIPInputStream(objectContent));
TarArchiveEntry currentEntry;
while((currentEntry = tarInputStream.getNextTarEntry()) != null) {
    if(currentEntry.getName().equals("1/foo.bar") && currentEntry.isFile()) {
        FileOutputStream entryOs = new FileOutputStream("foo.bar");
        IOUtils.copy(tarInputStream, entryOs);
        entryOs.close();
        break;
    }
}
objectContent.abort();  // Warning at this line
tarInputStream.close(); // warning at this line

When I use this method, I get a warning that not all the bytes from the stream were read, which is intentional.

WARNING: Not all bytes were read from the S3ObjectInputStream, aborting HTTP connection. This is likely an error and may result in sub-optimal behavior. Request only the bytes you need via a ranged GET or drain the input stream after use.

Is it necessary to drain the stream and what would be the downsides of not doing it? Can I just ignore the warning?

score 2 · Answer 1 · edited Apr 11 '19 at 09:25


You don't have to worry about the warning. It only tells you that the HTTP connection will be closed and that any data you have not yet read will be discarded. Since close() delegates to abort(), you get the warning from either call.
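For comparison, the warning text itself names draining as the alternative to aborting. The sketch below shows roughly what draining means, using only the JDK: read and discard whatever is left so the stream reaches end-of-file before it is closed. `StreamDrainer` and `drain` are hypothetical names chosen for illustration, not part of the AWS SDK.

```java
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical helper for illustration; not an AWS SDK class. */
final class StreamDrainer {

    private StreamDrainer() {}

    /**
     * Reads and discards every remaining byte of the stream.
     * Returns how many bytes were thrown away.
     */
    static long drain(InputStream in) throws IOException {
        byte[] buffer = new byte[8192];
        long discarded = 0;
        int read;
        while ((read = in.read(buffer)) != -1) {
            discarded += read;
        }
        return discarded;
    }
}
```

Draining still downloads the rest of the object, so for a huge archive it gives up the saving the question is after. As far as I understand, its main benefit is that the HTTP connection can be reused afterwards, whereas abort() drops the connection.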

Note that savings are not guaranteed: if the files you are interested in are located towards the end of the archive, you end up reading almost the whole archive anyway.

S3's HTTP server supports range requests, so if you could influence the format of the archive, or generate some metadata while it is created, you could skip ahead or perhaps request only the file you are interested in.
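To make the metadata idea concrete, here is a minimal sketch, assuming the archive is an uncompressed tar. A gzip stream generally has to be decompressed from the beginning, so byte offsets inside a .tar.gz do not map to ranges of the stored object. `TarRangeIndex` is a hypothetical helper, not part of the AWS SDK or Commons Compress. It walks the 512-byte tar headers once (for example while the archive is being created) and records where each entry's data starts. It ignores long-name extensions and the ustar prefix field.

```java
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper for illustration; assumes a plain, uncompressed tar. */
final class TarRangeIndex {

    private static final int BLOCK = 512; // tar headers and data padding use 512-byte blocks

    private TarRangeIndex() {}

    /** Maps entry name to {offset of first data byte, size in bytes}. */
    static Map<String, long[]> build(InputStream tar) throws IOException {
        Map<String, long[]> index = new LinkedHashMap<>();
        DataInputStream in = new DataInputStream(tar);
        byte[] header = new byte[BLOCK];
        long position = 0;
        while (true) {
            try {
                in.readFully(header);
            } catch (EOFException e) {
                break; // stream ended without the usual zero blocks
            }
            position += BLOCK;
            if (header[0] == 0) {
                break; // an all-zero block marks the end of the archive
            }
            String name = field(header, 0, 100);
            // The size field is 12 bytes of octal ASCII at offset 124.
            long size = Long.parseLong(field(header, 124, 12).trim(), 8);
            index.put(name, new long[] {position, size});
            long padded = (size + BLOCK - 1) / BLOCK * BLOCK;
            skipFully(in, padded);
            position += padded;
        }
        return index;
    }

    /** Builds an HTTP Range header value; both ends are inclusive. */
    static String rangeHeader(long[] offsetAndSize) {
        if (offsetAndSize[1] <= 0) {
            throw new IllegalArgumentException("empty entry has no byte range");
        }
        long start = offsetAndSize[0];
        long end = start + offsetAndSize[1] - 1;
        return "bytes=" + start + "-" + end;
    }

    /** Decodes a NUL-terminated ASCII header field. */
    private static String field(byte[] header, int offset, int length) {
        int end = offset;
        while (end < offset + length && header[end] != 0) {
            end++;
        }
        return new String(header, offset, end - offset, StandardCharsets.US_ASCII);
    }

    private static void skipFully(InputStream in, long n) throws IOException {
        byte[] buffer = new byte[8192];
        while (n > 0) {
            int read = in.read(buffer, 0, (int) Math.min(buffer.length, n));
            if (read < 0) {
                throw new EOFException("truncated tar entry");
            }
            n -= read;
        }
    }
}
```

With an index like this stored next to the archive, the SDK version used in the question could fetch a single entry with something like `new GetObjectRequest("bucketname", "file.tar").withRange(start, end)`, where `start` and `end` are the inclusive bounds computed above. The object key `file.tar` is a made-up example.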

edited Apr 11 '19 at 09:25

Ashishkumar Singh

answered Aug 02 '17 at 11:05

diginoise

Yes, I may be reading the whole file sometimes, but in most cases I think I would be able to save on some reading; and no, I can't influence how the file is uploaded. My question is whether this warning can be ignored and what impact it can have if the stream is not drained. – ares Aug 02 '17 at 11:09

Fair comment: the warning can be ignored. It tells you that you will lose whatever is still in transit, because the HTTP connection will be terminated. `close()` delegates to `abort()`, hence it also causes this warning. Added to the answer now. – diginoise Aug 02 '17 at 11:19
