
I am trying to find the most cost-effective way of doing this and would appreciate any help:

  • I have 100s of millions of files. Each file is under 1MB (usually around 100KB)
  • In total this is over 5 TB of data as of now, and it will grow weekly
  • I cannot merge/concatenate the files together; the files must be stored as-is
  • Query and download requirements are basic: around 1 million files will be selected and downloaded per month
  • I am not worried about S3 storage, data retrieval, or data scan costs.

My question is: when I upload 100s of millions of files, does this count as one PUT request per file (meaning one per object)? If so, just the cost to upload the data will be massive. If I upload a directory with a million files, is that one PUT request?

What if I zip the 100 million files on-prem, upload the zip, and use Lambda to unzip it? Would that count as one PUT request?

Any advice?

rogerwhite
  • Each object requires its own PUT request, and that is true whether you transfer from on-prem, Lambda, or some other service like Snow. – Anon Coward Aug 13 '21 at 15:26
  • @AnonCoward Thanks for the update. Is there a way to batch the uploads? Any way to reduce the cost? – rogerwhite Aug 13 '21 at 15:48
  • @Perimosh I don't think your calculator (with an estimate of $180K) helps. The question is: what is the most cost-effective way to do this? – rogerwhite Aug 13 '21 at 16:20
  • There you go, that's a valid comment. Thanks! Going back to your question: you can't upload directories. In fact, there is no such concept in S3. The S3 console shows you the objects as a folder structure, but technically they are all object keys. "foo.txt" is one key; "path/to/my/key/foo.txt" is another key. – Perimosh Aug 13 '21 at 16:50
  • @rogerwhite There's no way I know of to get around it, short of doing something mildly crazy like merging the files into one zip file (or another container format) and changing a bunch of code to read the inner files with byte-range requests. – Anon Coward Aug 13 '21 at 17:59
  • I did some digging and I don't think there is a way to avoid the one PUT request per file, even if you use something like Snowball. The cheapest way would be to just upload the files from on-prem, but you'd still be paying $2.5k+ in PUT request fees alone for 500 million files – JD D Aug 13 '21 at 18:08
  • Can you elaborate a bit on what you plan to do with the files after they're uploaded to S3, and why you must maintain the existing file structure? That might stimulate some relevant thoughts on how to reduce the upload cost. – jscott Aug 13 '21 at 23:03
  • @Perimosh Your answer was downvoted/deleted because it is a "link-only answer". It is better if you can provide a self-contained answer. Pointing to external content is welcome, but the answer shouldn't _require_ going to an external location to answer the question. – John Rotenstein Aug 14 '21 at 04:16

1 Answer


You say that you have "100s of millions of files", so I shall assume you have 400 million objects, making 40TB of storage. Please adjust accordingly. I have shown my calculations so that people can help identify my errors.

Initial upload

PUT requests in Amazon S3 are charged at $0.005 per 1,000 requests. Therefore, 400 million PUTs would cost $2000. (.005*400m/1000)
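
As a rough sanity check, here is that arithmetic in Python (the 400 million object count is my assumption from above; adjust it to your real number):

```python
# Rough one-time PUT cost estimate; the object count is an assumption, not a measured figure.
objects = 400_000_000
put_price_per_1000 = 0.005            # USD per 1,000 PUT requests (S3 Standard)

put_cost = objects / 1000 * put_price_per_1000
print(f"One-time PUT cost: ${put_cost:,.0f}")   # -> $2,000
```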

This cost cannot be avoided if you wish to create them all as individual objects.

Future uploads would incur the same cost of $5 per million objects.

Storage

Standard storage costs $0.023 per GB per month, so storing 400 million 100KB objects would cost $920/month. (.023*400m*100/1m)
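
The same storage arithmetic, again assuming 400 million objects at roughly 100KB each:

```python
# Rough monthly storage estimate; object count and average size are assumptions from above.
objects = 400_000_000
avg_size_kb = 100
price_per_gb_month = 0.023            # S3 Standard

total_gb = objects * avg_size_kb / 1_000_000     # ~40,000 GB (40 TB)
print(f"~{total_gb:,.0f} GB -> ${total_gb * price_per_gb_month:,.0f}/month")   # -> $920/month
```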

Storage costs can be reduced by using lower-cost Storage Classes.

Access

GET requests are $0.0004 per 1,000 requests, so downloading 1 million objects each month would cost 40c/month. (.0004*1m/1000)

If the data is being transferred to the Internet, Data Transfer costs of $0.09 per GB would apply. The Data Transfer cost of downloading 1 million 100KB objects would be $9/month. (.09*1m*100/1m)
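
And the monthly access estimate, assuming 1 million downloads of ~100KB objects going out to the Internet:

```python
# Rough monthly access estimate; download count and object size are assumptions from above.
downloads = 1_000_000
get_price_per_1000 = 0.0004           # USD per 1,000 GET requests
transfer_price_per_gb = 0.09          # Data Transfer out to the Internet

get_cost = downloads / 1000 * get_price_per_1000                       # -> $0.40/month
transfer_cost = downloads * 100 / 1_000_000 * transfer_price_per_gb    # ~100 GB -> $9/month
print(f"GET: ${get_cost:.2f}/month, transfer: ${transfer_cost:.0f}/month")
```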

Analysis

You seem to be most concerned about the initial cost of uploading 100s of millions of objects at $5 per million.

However, storage will also be significant, at $2.30/month per million objects ($920/month for 400 million objects). That ongoing cost is likely to dwarf the one-time cost of the initial upload.
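
To make that comparison concrete, under the same assumptions the recurring storage bill overtakes the one-time upload cost within a few months:

```python
# Rough break-even between the one-time PUT cost and the recurring storage cost
# (same assumed object count and size as above).
objects = 400_000_000
put_cost = objects / 1000 * 0.005                        # $2,000 one-time
storage_per_month = objects * 100 / 1_000_000 * 0.023    # ~$920/month

print(f"Storage matches the upload cost after ~{put_cost / storage_per_month:.1f} months")  # ~2.2
```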

Some alternatives would be:

  • Store the data on-premises (disk storage is roughly $100 per 4TB, so 400m files (40TB) would require about $1000 of disks, but you would want extra drives for redundancy), or
  • Store the data in a database: there are no 'PUT' costs for databases, but you would need to pay for running the database. This might work out to a lower cost, or
  • Combine the data in the files (which you say you do not wish to do), but in a way that can be easily split apart again, for example by marking records with an identifier for easy extraction (see the sketch after this list), or
  • Use a different storage service, such as Digital Ocean, which does not appear to have a 'PUT' cost.
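
For the "combine the files" option, here is a minimal sketch of how the read side could work, assuming you pack many small files into one object and keep your own index of (offset, length) per original file. The bucket name, key, and offsets below are made-up placeholders, not anything from your setup:

```python
# Hedged sketch: read one packed file back out of a combined S3 object with a byte-range GET.
# Bucket, key, and offsets are hypothetical; you would look them up in your own index.
import boto3

s3 = boto3.client("s3")

def read_packed_file(bucket: str, key: str, offset: int, length: int) -> bytes:
    # HTTP Range headers are inclusive at both ends, hence the -1.
    resp = s3.get_object(
        Bucket=bucket,
        Key=key,
        Range=f"bytes={offset}-{offset + length - 1}",
    )
    return resp["Body"].read()

# Example: a ~100KB file packed at byte offset 52_428_800 of a container object.
data = read_packed_file("my-archive-bucket", "packs/pack-000001.bin", 52_428_800, 100 * 1024)
```

This keeps the upload at one PUT per container object instead of one per file, at the cost of maintaining the offset index yourself.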
John Rotenstein
  • Many thanks, very helpful. I am at my wits' end trying to figure out how to manage my big data (well, big for me). DynamoDB is going to cost an arm, a leg and a kidney too. – rogerwhite Aug 14 '21 at 13:29