
Trying to figure out if it's possible to download a specific file, or a range of bytes, from an uncompressed TAR archive in S3.

The use case can be described like this:

  • The TAR file is generated by my application (so we have control of that)
  • The TAR file lives in an S3 bucket
  • The TAR file is named archive.tar
  • The TAR file contains two files: metadata.txt and payload.png
  • metadata.txt is guaranteed to always be of size "n" bytes, where "n" is relatively small
  • payload.png can be any size and thus can be a very large file (> 1 GB)
  • My application needs to be able to download metadata.txt to understand how to process the TAR file, and I don't want the application to have to download the whole TAR file just for the metadata.txt file

Ideally, at any given point, I should only ever have the metadata.txt file open in memory, and never the entire TAR archive or any part of payload.png. I don't want to incur the network or memory overhead of downloading a huge TAR archive just to read the small metadata.txt file it contains.

I've noticed S3ObjectInputStream in the AWS SDK, but I'm not sure how to use it with a TAR file for my use case.

Anyone ever implement something similar or have any pointers to references I can check out to help with this?

  • Yes, you can specify the byte range that you want in a 'get object' request. As long as you have some kind of index of the contents of the TAR file and it's not compressed or encrypted, it sounds like this could work. – jarmod Feb 12 '19 at 19:36
  • Only one question: why so complex? TAR does not compress files, so if you need to process its contents separately, it's much simpler to put these files in a separate directory and process them one by one. Isn't it? – Bor Laze Feb 12 '19 at 19:39

1 Answer


Yes, it’s possible for an uncompressed tarball: the file format has a header record for each file, which you can use to check the archive's contents.

I'm more of a Python than a Java guy, but take a look at my implementation of tarball range requests here and docs here.

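In short: every file in the archive is preceded by a 512-byte header block. The file name always comes first in that header and is padded with NULL b"\x00" bytes; the header also stores the file length (as an octal ASCII string at byte offset 124). So you range-request the header, read the file length from it, and that tells you how many bytes of file data follow. The data is itself padded with NULL bytes up to the next multiple of 512, so take the file length modulo 512 to determine the end-of-file padding, and hence where the next header starts. Then keep iterating until you are 1024 bytes before the end of the archive (you can send a HEAD request to get the total bytes, or read it from the Content-Range header sent back when you execute a range request, AKA partial content request). The 1024-before-the-end part is because there are at least 2 empty blocks of 512 bytes at the end of a tar archive.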
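The header arithmetic above can be sketched in Python roughly as follows. This is a minimal sketch rather than the linked implementation: `parse_header`, `padded`, `read_member` and `fetch_range` are hypothetical names, and it assumes a plain ustar-style archive (name in bytes 0-99, size as octal ASCII in bytes 124-135) with no PAX or GNU extension headers and members smaller than 8 GiB. `fetch_range(start, end)` stands in for whatever performs the inclusive byte-range request; with boto3 that could be `s3.get_object(Bucket=..., Key=..., Range=f"bytes={start}-{end}")["Body"].read()`. The demo uses an in-memory archive so it runs without S3.

```python
import io
import tarfile

BLOCK = 512  # headers and data padding both work in 512-byte blocks


def parse_header(block):
    """Return (name, size) from one 512-byte header block (ustar layout assumed)."""
    name = block[0:100].split(b"\x00", 1)[0].decode("utf-8")
    size_field = block[124:136].strip(b"\x00 ").decode("ascii")
    return name, int(size_field or "0", 8)  # the size is stored in octal


def padded(size):
    """Round a member's size up to the next multiple of 512."""
    return (size + BLOCK - 1) // BLOCK * BLOCK


def read_member(fetch_range, offset=0):
    """Fetch one member's header and data; also return the next header offset."""
    header = fetch_range(offset, offset + BLOCK - 1)
    name, size = parse_header(header)
    data_start = offset + BLOCK
    data = fetch_range(data_start, data_start + size - 1) if size else b""
    return name, data, data_start + padded(size)


# Demo: build a small archive in memory (stand-in for archive.tar in S3).
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w", format=tarfile.USTAR_FORMAT) as tar:
    for member_name, payload in [("metadata.txt", b"hello"),
                                 ("payload.png", bytes(2000))]:
        info = tarfile.TarInfo(member_name)
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
archive = buf.getvalue()


def fetch_range(start, end):
    """Hypothetical stand-in for an inclusive HTTP range request."""
    return archive[start:end + 1]


name, data, next_header = read_member(fetch_range)
print(name, data, next_header)
```

If the application always writes metadata.txt first and its size "n" is fixed, a single request for bytes 0 to 512 + n - 1 should return the header and the contents together, so the second request in `read_member` could be dropped.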

When iterating, it's probably sensible to check whether each block where you expect to find a file header is actually all NULL bytes, as this indicates you've entered the end-of-archive blocks (the spec seems to say "at least 2 empty blocks", so there may be more). But if you control the tar files being generated, maybe you wouldn't need to bother.
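A rough sketch of that iteration, with the all-NULL check included, might look like the following. `list_members` and the `fetch_range(start, end)` callable are hypothetical names (the callable stands in for an inclusive byte-range request against S3), and the same assumptions apply: a plain ustar-style layout, no PAX or GNU extension headers, and sizes that fit the octal field. The demo archive is written by Python's `tarfile`, which pads the end with more than two NULL blocks, so the check is what actually stops the loop here.

```python
import io
import tarfile

BLOCK = 512


def list_members(fetch_range, total_size):
    """Walk the headers and return (name, data_offset, size) for each member."""
    members = []
    offset = 0
    # Stop 1024 bytes early: the archive ends with at least two NULL blocks.
    while offset < total_size - 2 * BLOCK:
        header = fetch_range(offset, offset + BLOCK - 1)
        if not any(header):  # all-NULL block: end-of-archive padding reached
            break
        name = header[0:100].split(b"\x00", 1)[0].decode("utf-8")
        size = int(header[124:136].strip(b"\x00 ").decode("ascii") or "0", 8)
        members.append((name, offset + BLOCK, size))
        # Skip the data plus its NULL padding up to the next 512-byte boundary.
        offset += BLOCK + (size + BLOCK - 1) // BLOCK * BLOCK
    return members


# Demo against an in-memory archive (stand-in for the object in S3).
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w", format=tarfile.USTAR_FORMAT) as tar:
    for member_name, payload in [("metadata.txt", b"hello"),
                                 ("payload.png", b"\x89PNG" + bytes(1996))]:
        info = tarfile.TarInfo(member_name)
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
archive = buf.getvalue()

members = list_members(lambda start, end: archive[start:end + 1], len(archive))
print(members)
```

Only the 512-byte headers are requested, so listing the archive costs one small range request per member plus one for the first NULL block, regardless of how large payload.png is.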

Louis Maddox