
I have a fairly large amount of data (~30G, split into ~100 files) I'd like to transfer between S3 and EC2: when I fire up the EC2 instances I'd like to copy the data from S3 to EC2 local disks as quickly as I can, and when I'm done processing I'd like to copy the results back to S3.

I'm looking for a tool that'll do a fast / parallel copy of the data back and forth. I have several scripts hacked up, including one that does a decent job, so I'm not looking for pointers to basic libraries; I'm looking for something fast and reliable.

Parand

5 Answers


Unfortunately, Adam's suggestion won't work, as his understanding of EBS is mistaken (although I wish he were right, and I've often thought it should work that way). EBS has nothing to do with S3; it only gives you an "external drive" for EC2 instances that exists separately from, but can be attached to, an instance. You still have to copy data between S3 and EC2, although there are no data transfer costs between the two.

You didn't mention the operating system of your instance, so I can't give tailored information. A popular command-line tool I use is http://s3tools.org/s3cmd ... it is written in Python, so according to its website it should work on Windows as well as Linux, although I only ever use it on Linux. You could easily whip up a quick script around its built-in "sync" command, which works similarly to rsync, and trigger it every time you finish processing your data. You could also use the recursive put and get commands to move data only when needed.
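As a minimal sketch of such a wrapper script (the bucket name and local path below are placeholders, not anything from the question; the commands are echoed rather than executed so it's safe to run anywhere):

```shell
#!/bin/sh
# Sketch of a two-way sync wrapper around s3cmd (hypothetical names).
BUCKET="s3://my-bucket/dataset"   # placeholder bucket/prefix
LOCAL="/mnt/data"                 # placeholder instance-local directory

# Pull the input data down at startup, push results back when done.
PULL="s3cmd sync $BUCKET/ $LOCAL/"
PUSH="s3cmd sync --delete-removed $LOCAL/ $BUCKET/"

# Echo instead of executing, so the sketch runs without credentials
# or s3cmd installed; drop the quotes-and-echo to run for real.
echo "$PULL"
echo "$PUSH"
```

Note that plain sync only copies; the --delete-removed flag is what makes it also remove destination objects that were deleted on the source.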

There are also graphical tools for Windows, such as CloudBerry Pro, that offer some command-line options and let you set up scheduled commands. http://s3tools.org/s3cmd is probably the easiest.

jedierikb
Tyler
  • Good answer, but maybe it's worth noting that s3cmd doesn't support the --delete option with sync, which means that if you delete something on the source it will still remain on the destination :( – golja Jul 08 '12 at 00:48
  • I will need to look into that with our backup scripts. I could've sworn there was some way --delete worked with sync. Although, I do remember it takes some fandangling to get just right. The script has saved me time in numerous areas by far though! – Tyler Jul 12 '12 at 08:15
  • Bit late replying - my answer wasn't suggesting using EBS to copy between S3 and EC2, but *instead* of S3. I'll update to clarify. – Adam Hopkinson Dec 01 '12 at 19:42
  • Though this doesn't really add to the answer of this particular question, it is worth noting that I checked our scripts and the sync command DOES offer a --delete option for @golja; however, the correct syntax is --delete-removed – Tyler Jul 01 '13 at 00:34

By now, there is a sync command in the AWS Command line tools, that should do the trick: http://docs.aws.amazon.com/cli/latest/reference/s3/sync.html

On startup: aws s3 sync s3://mybucket /mylocalfolder

before shutdown: aws s3 sync /mylocalfolder s3://mybucket

Of course, the details are always fun to work out, e.g. how parallel it is (and whether you can make it more parallel, and whether that's any faster given the virtual nature of the whole setup).
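If you want to experiment with the parallelism, the AWS CLI exposes transfer tuning in ~/.aws/config; a sketch (the values below are illustrative, not recommendations):

```ini
# ~/.aws/config — S3 transfer settings for the default profile
[default]
s3 =
  max_concurrent_requests = 20
  multipart_chunksize = 16MB
```

The same settings can also be written with, e.g., aws configure set default.s3.max_concurrent_requests 20.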

Btw hope you're still working on this... or somebody is. ;)

Gyuri

I think you might be better off using an Elastic Block Store (EBS) volume to store your files instead of S3. An EBS volume is akin to a 'drive' that can be mounted on your EC2 instance without having to copy the data each time, thereby allowing you to persist your data between EC2 instances without having to write to or read from S3 each time.

http://aws.amazon.com/ebs/

Adam Hopkinson
  • This is a good suggestion. The one drawback of EBS volumes is they can only be mounted on an instance that is running in the same availability zone as the volume. E.g. a volume in us-east-1a cannot be used by an instance in us-east-1b. So if one cannot or would prefer not to run an instance in that zone (due to problems, or simply a shortage of capacity) one cannot use the volume. – c-urchin Jan 30 '13 at 18:48
  • This is not how EBS works. EBS is NOT a drive on S3. EBS doesn't read/write data to S3 except when creating a Snapshot or creating an EBS volume from a Snapshot. – Chris M. May 06 '13 at 23:09
  • I wasn't meaning to say it *was* a drive on S3, I meant it was *like* a drive - if you want to use S3-like storage in the way you would use a drive, EBS is a good fit. – Adam Hopkinson May 07 '13 at 07:31
  • As far as I understand, EBS can fail, whereas S3 has backups. This may be another thing that should be thought of before deciding to use EBS instead of S3. – shashi Jan 15 '14 at 20:42

Install the s3cmd package with

yum install s3cmd

or

sudo apt-get install s3cmd

depending on your OS, then copy data with, for example:

s3cmd get s3://tecadmin/file.txt

The ls command can also list the files.

For more details, see this.
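For many files, s3cmd's recursive get/put are the relevant variants; a sketch with placeholder paths under the same bucket (commands echoed rather than executed so it runs without credentials):

```shell
#!/bin/sh
# Recursive copy with s3cmd (the prefixes below are placeholders).
BUCKET="s3://tecadmin"

GET="s3cmd get --recursive $BUCKET/input/ ./input/"
PUT="s3cmd put --recursive ./results/ $BUCKET/results/"
LIST="s3cmd ls $BUCKET/"

# Echoed so the sketch is safe to run anywhere; remove the echoes
# to actually transfer.
echo "$GET"
echo "$PUT"
echo "$LIST"
```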

Vikas Hardia

For me the simplest form is:

wget http://s3.amazonaws.com/my_bucket/my_folder/my_file.ext

run from a PuTTY session on the instance. (Note this only works for objects that are publicly readable.)

RRuiz