I have a project where there will be about 80 million objects in an S3 bucket. Every day, I will be deleting about 4 million and adding 4 million. The object names will be in a pseudo directory structure:

/012345/0123456789abcdef0123456789abcdef

For deletion, I will need to list all objects with a prefix of 012345/ and then delete them. I am concerned about the time this LIST operation will take. While it seems clear that S3's access time for an individual object does not increase with the number of objects in the bucket, I haven't found anything definitive saying that a LIST operation over 80 million objects, searching for the ~10 objects that share a given prefix, will remain fast in such a large bucket.
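
For concreteness, here is a minimal sketch of that delete-by-prefix workflow, assuming Python with boto3 (the bucket name is a placeholder):

```python
import boto3

s3 = boto3.client("s3")

BUCKET = "my-bucket"   # placeholder bucket name
PREFIX = "012345/"     # the pseudo-directory to remove

# list_objects_v2 returns at most 1,000 keys per page; the paginator handles
# continuation tokens, and delete_objects accepts up to 1,000 keys per call,
# so issuing one delete per page of results works out.
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
    contents = page.get("Contents", [])
    if not contents:
        continue
    s3.delete_objects(
        Bucket=BUCKET,
        Delete={"Objects": [{"Key": obj["Key"]} for obj in contents]},
    )
```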

In a side comment on a question (from 2008) about the maximum number of objects that can be stored in a bucket, someone wrote:

In my experience, LIST operations do take (linearly) longer as object count increases, but this is probably a symptom of the increased I/O required on the Amazon servers, and down the wire to your client.

From the Amazon S3 documentation:

There is no limit to the number of objects that can be stored in a bucket and no difference in performance whether you use many buckets or just a few. You can store all of your objects in a single bucket, or you can organize them across several buckets.

While I am inclined to believe the Amazon documentation, it isn't entirely clear which operations that statement refers to.

Before committing to this expensive plan, I would like to know definitively whether LIST operations that search by prefix remain fast when buckets contain millions of objects. If someone has real-world experience with such large buckets, I would love to hear your input.

Brad
  • Hey, two years have passed since your question. Can you tell us whether you ended up building the system with S3 list-by-prefix, and how it performed? – Kaplan Ilya Jul 27 '16 at 16:22
  • @KaplanIlya I did end up using prefix, but I don't remember how well it did. Sorry! – Brad Jul 01 '17 at 00:57

2 Answers


Prefix searches are fast, if you've chosen the prefixes correctly. Here's an explanation: https://cloudnative.io/blog/2015/01/aws-s3-performance-tuning/
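
I can't verify exactly what the (now archived) article recommends, but the idea commonly described at the time was to choose key prefixes that distribute evenly, e.g. deriving the leading characters from a hash of the asset identifier. A hypothetical sketch in Python (make_key is an illustrative helper, not anything from the article):

```python
import hashlib

def make_key(asset_id: str) -> str:
    """Derive the 6-character 'directory' prefix from a hash of the asset id,
    so keys spread evenly across the keyspace while the prefix can still be
    recomputed later for a prefix-scoped LIST."""
    prefix = hashlib.md5(asset_id.encode()).hexdigest()[:6]
    return f"{prefix}/{asset_id}"

# Prints a key of the form "<6 hex chars>/<asset id>", matching the
# pseudo-directory layout in the question.
print(make_key("0123456789abcdef0123456789abcdef"))
```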

r3m0t
  • I don't know why this doesn't have more upvotes; it specifically speaks to the question and addresses the OP's issue, even at 80M+ records (although I'd additionally look at an object-retirement strategy) – MrMesees Feb 03 '18 at 23:18
  • @MrMesees because my answer was 3 years late :) – r3m0t Feb 04 '18 at 19:45
  • @MrMesees and because the suggested solution is exactly what the other answer already points to: keep an index somewhere else. – Josep Valls Apr 17 '18 at 16:42
  • The original link is dead; an archived version is available here: https://web.archive.org/web/20160327092959/http://cloudnative.io/blog/2015/01/aws-s3-performance-tuning/ – rbu Jul 19 '22 at 13:30

I've never seen a problem, but why would you ever list a million files just to pull a few out of the list? It wouldn't be an S3 performance issue; it's more likely just the call itself taking longer as it returns more data.

Why not store the file names in a database, index them, and then query from there? That would be a better solution, I'd think.
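
As a rough sketch of what such an index could look like, assuming SQLite purely for illustration (the table and column names are made up):

```python
import sqlite3

conn = sqlite3.connect("assets.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS s3_objects (
        prefix TEXT NOT NULL,   -- e.g. '012345'
        key    TEXT NOT NULL    -- full S3 key
    )
""")
# Index the prefix column so lookups stay fast regardless of row count.
conn.execute("CREATE INDEX IF NOT EXISTS idx_prefix ON s3_objects (prefix)")

# Record each key as it is uploaded...
conn.execute(
    "INSERT INTO s3_objects (prefix, key) VALUES (?, ?)",
    ("012345", "012345/0123456789abcdef0123456789abcdef"),
)
conn.commit()

# ...and later find everything to delete without any S3 LIST call at all.
keys = [row[0] for row in
        conn.execute("SELECT key FROM s3_objects WHERE prefix = ?", ("012345",))]
print(keys)
```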

  • It's a valid question, and an option I am considering for a future enhancement. Unfortunately right now, the system creating objects on S3 has no knowledge of the database and cannot connect to it. There is a system (made up of a bunch of legacy Perl scripts I'm desperately trying to kill off) which uploads an original asset to S3. From there, another system creates derived assets (resized, cropped, and re-compressed photos) on the fly and uploads the derived versions to S3 in this pseudo-directory structure. The only way around this is to have yet another database to keep track of.... – Brad Jul 31 '14 at 15:01
  • ... the derived assets, but if S3 is fast enough with its list operations, there would be no need for that other database. – Brad Jul 31 '14 at 15:01
  • I think parsing the list on your end would be slower than the speed at which S3 returns it, so I suspect you'd be OK. I haven't dealt with a size that big, but I doubt Amazon is lying here. We have many directories that are far larger than best practice for other filesystems (where they perform really badly), but in S3 they work great. – Paul Frederiksen Jul 31 '14 at 15:17
  • To be clear, I'm talking about listing objects by prefix, so S3 is only going to return those 4 or 5 objects with the prefix I've specified. And I don't think Amazon is lying; it's just that at that point the documentation seems to be talking about getting and putting objects, not necessarily listing by prefix. – Brad Jul 31 '14 at 15:18
  • In that case I can't imagine there being any kind of noticeable performance hit. If there is, I think posting to the AWS forums would be in order. You might even email your sales manager or open a ticket informing them you will be doing this; that might help them guarantee the performance for you. – Paul Frederiksen Jul 31 '14 at 15:35