
I'm using the NOUNZ data compiler on OSX (or Linux), which automatically generates a massive directory structure of static HTML files (hundreds of thousands and sometimes millions of files).

A simplified example of the generated directory tree looks something like the following...

[image: simplified example of the generated directory tree]

Normally, if I want to move the entire tree to a remote web server, I simply tar and compress the tree, using the commands:

tar -cvf HTML.tar HTML
gzip HTML.tar

This generates a tar-ed and compressed file called HTML.tar.gz

I can then FTP or SCP that file to the remote web server and uncompress and untar it using the following commands:

gzip -d HTML.tar.gz
tar -xvf HTML.tar

This will result in the exact same file tree on the web server that was generated by the data compiler on the local machine.
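
For reference, the two steps in each direction can also be collapsed into single commands using tar's built-in z (gzip) option; this is just shorthand for the same round trip:

tar -czvf HTML.tar.gz HTML    # create and gzip-compress the archive in one step
tar -xzvf HTML.tar.gz         # on the web server: decompress and extract in one step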

THE PROBLEM: I'd like to mimic the same behavior as above using Amazon Web Services (AWS) Simple Storage Service (S3).

MY QUESTION: What is the best way to mimic the same (or similar) behavior, where I can move the entire tar-ed and compressed tree from a local server to AWS S3, and then uncompress and untar the file to recreate the entire directory structure?

The tar and gzip commands are not part of the S3 CLI, so I need a solid way of moving a directory structure that can contain millions of files (a move that may happen once a day). Moving and recreating everything without first tar-ing and compressing would be VERY slow.

NOTE: Just an FYI that when the data compiler runs, it always deletes the entire old tree and regenerates an entire new tree, resulting in completely new inodes for all directories and files. This means "incremental" copies and syncs are not viable. I need to move the whole tree, each time.

skytaker
Information Technology
  • You say it recreates all the files, but do the file contents actually change? You can do a sync using md5 hashes to check if the files have actually changed using the `aws s3 sync` command. – Mark B Nov 09 '16 at 04:55
  • Yes, contents of existing files may change. There are three outcomes that can happen when the compiler runs: 1) new folders and/or files can be added; 2) existing folders and/or files may be deleted; 3) existing file contents may (and often do) change. Keep in mind that `aws s3 sync` may take a long time to pump millions of files across the pipe. – Information Technology Nov 09 '16 at 07:23

1 Answer


S3 isn't going to uncompress files for you. You have to push the files to S3 in the state you want S3 to store them in. The aws s3 sync command (or a similar tool that does incremental updates based on the MD5 hash) is going to be your best option. You could probably split up the sync command into multiple, parallel sync commands. Perhaps run one process per subdirectory.
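
As a minimal sketch of that parallel approach, assuming the generated tree sits in ./HTML locally and a hypothetical bucket named my-html-bucket (substitute your own bucket and prefix), something like this could work; the --delete flag also removes objects whose source files disappeared since the last compiler run:

cd HTML || exit 1
for dir in */ ; do
    # one background sync per top-level subdirectory; --delete prunes
    # objects whose source files no longer exist
    aws s3 sync "$dir" "s3://my-html-bucket/HTML/$dir" --delete &
done
wait    # block until every background sync finishes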

Regarding your comment that aws s3 sync "may take a long time to pump millions of files across the pipe", you should zip up the files and push them to an EC2 server first if you aren't already doing this on EC2. You should be using an EC2 server in the same region as the S3 bucket, an instance type with 10Gbps network performance, and the EC2 server should have Enhanced Networking enabled. This will give you the fastest possible connection to S3.
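
A rough sketch of that flow, again using the hypothetical my-html-bucket and a placeholder hostname ec2-host for the in-region EC2 instance, would be something like:

# push the single compressed file to the EC2 instance (one fast transfer)
scp HTML.tar.gz ec2-user@ec2-host:/tmp/

# on the EC2 instance: extract, then sync to S3 over the in-region network
ssh ec2-user@ec2-host 'cd /tmp && tar -xzf HTML.tar.gz && aws s3 sync HTML s3://my-html-bucket/HTML --delete'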

Mark B
  • It sounds "functional" but not as clean, simple or efficient as simply tar-ing, compressing, sending, decompressing, and untar-ing. There has to be a better way than splitting up `aws s3 sync` for each directory, especially since new directories might dynamically show up and old directories might be deleted with each new compiler run. I wonder if AWS will be smart enough to add more of the fundamental unix commands to the S3 CLI, in order to make it more user-friendly and compatible. – Information Technology Nov 10 '16 at 17:29
  • You should be able to write a script to spawn sync tasks based on the directories. You shouldn't need to hard-code the directories. It might not be as clean or simple as you would like, but if you want to use S3 you are going to have to come to terms with the limitations involved and stop trying to treat it like a unix server. S3 is simply storage, not a "server" as you are implying. It can't uncompress files for you because that would require CPU usage, which S3 does not provide. – Mark B Nov 10 '16 at 17:39
  • I appreciate the help and can see that your suggestion can work but you have to admit that it sounds very much like a hack, due to the lack of some simple CLI commands that should already be there. – Information Technology Nov 11 '16 at 22:53
  • I disagree. Cli commands can't add features to a service that doesn't support those features. The cli doesn't run on S3, it runs on your server, so how could it add unzip support to S3? Regardless, this isn't the venue for complaints or suggestions for AWS. Take that to the AWS forums. – Mark B Nov 11 '16 at 23:15
  • Hi Mark. Just for clarification, the CLI runs on both the client and the service side. It is a set of RPCs. – Information Technology Dec 06 '16 at 04:00
  • The CLI is just a tool for making calls to the AWS API. The API runs on the service side. The CLI runs locally – Mark B Dec 06 '16 at 04:20