Why does s3cmd du give different results depending on slash at end of path?

Question

s3cmd du -H s3://bucketabc/prefix/further-prefix

gives 21G

s3cmd du -H s3://bucketabc/prefix/further-prefix/

gives 10G.

There are no files directly in there, just four "subdirectories."

I have five buckets which are near-copies and this only happens in two of them. The others show 10G consistently.

The only apparent difference between buckets -- and a seemingly irrelevant one -- is that the two which give 10G with or without the slash have one more subdirectory than the others, with a single 138M file in it.

Why 21G vs 10G? Which is the right answer?

s3cmd is a dated program, but what does it give if you add `--verbose`? Does it list the files? (if so, paste the lines here) — tedder42, May 18 '15 at 16:51
@Michael-sqlbot Good point, the one with the slash returns less (10G) — Joshua Fox, May 19 '15 at 04:54
@tedder42 du -H --verbose generates the same output as without --verbose (no additional info) — Joshua Fox, May 19 '15 at 04:55

score 2 · Accepted Answer · answered May 19 '15 at 10:19

In the S3 REST API, when iterating through objects, you often specify a key prefix, which is a left-anchored substring matching all the key values you want returned.

When you tell S3 you want foo/, what you are, of course, asking for is foo/*.

Perhaps less intuitive is the fact that asking for foo is really asking for foo*, which would include foo*/*.

It's a prefix match. Any key with a matching prefix will be included, so the prefix foo would include not only foo/* but also foobar/*, etc.
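To make the prefix match concrete, here is a minimal sketch in plain Python that does not call S3 at all. The key names and sizes are made up for illustration, and `du` is a hypothetical helper that only mimics summing sizes over a left-anchored prefix match; it is not how s3cmd is implemented.

```python
# Hypothetical bucket contents: key -> size (made-up values, arbitrary units).
objects = {
    "prefix/further-prefix/a/file1": 4,
    "prefix/further-prefix/b/file2": 6,
    "prefix/further-prefix-old/c/file3": 11,
}

def du(objects, prefix):
    """Sum the sizes of all keys that start with the given prefix."""
    return sum(size for key, size in objects.items() if key.startswith(prefix))

# Without the trailing slash, the sibling "further-prefix-old/" also matches.
without_slash = du(objects, "prefix/further-prefix")
# With the trailing slash, only keys under "further-prefix/" match.
with_slash = du(objects, "prefix/further-prefix/")

print(without_slash, with_slash)
```

In this toy data the extra total comes entirely from the sibling key whose name merely begins with the same characters.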

This is why some of us are so seemingly fond of issuing the friendly reminder that "S3 is not a filesystem, it is an object store," even though at some level, you already knew that. It doesn't precisely follow filesystem semantics. This, I would suggest, is one of the reasons the sometimes subtle-seeming distinctions are important.

Unlike a filesystem, the directory hierarchy in S3 is not really there. It's a convenient illusion based on the / character. The folders you can create in the console are similarly an illusion -- they're empty objects the console lets you add in order to create the appearance of a hierarchy before you actually have any keys with that prefix in the bucket. So, there is no concept of objects actually being "in" folders, they're just "under" folders.
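The "illusion" can be sketched the same way. When a list request supplies a delimiter, S3 groups keys that share the text up to the next delimiter into common prefixes instead of returning them individually. The function below is a hypothetical, in-memory imitation of one such listing over made-up keys, not the real API.

```python
def list_level(keys, prefix, delimiter="/"):
    """Mimic one delimiter-based listing: return (direct keys, common prefixes)."""
    contents, common_prefixes = [], set()
    for key in sorted(keys):
        if not key.startswith(prefix):
            continue  # left-anchored prefix match, as in a listing request
        rest = key[len(prefix):]
        if delimiter in rest:
            # Roll everything up to the next delimiter into one "folder".
            common_prefixes.add(prefix + rest.split(delimiter, 1)[0] + delimiter)
        else:
            contents.append(key)
    return contents, sorted(common_prefixes)

# Hypothetical keys for illustration only.
keys = ["foo/a.txt", "foo/bar/b.txt", "foo/bar/baz/c.txt", "foobar/d.txt"]

contents, prefixes = list_level(keys, "foo/")
loose_contents, loose_prefixes = list_level(keys, "foo")

print(contents, prefixes)
print(loose_contents, loose_prefixes)
```

Each call returns only one level, so walking the whole "tree" this way takes one listing per "folder". The second call also shows that the prefix `foo` surfaces `foobar/` alongside `foo/`.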

Without the trailing slash, I suspect you are matching more than you anticipate, because of the prefix-matching paradigm.

Yes, that's it. There were other "directories" that I didn't notice earlier and are only in those two buckets. I was aware that AWS "directories" are not really directories, but found that some functionality treats slash as a special separator. Anyway, you got it. — Joshua Fox, May 19 '15 at 15:10
Yes, `/` is often treated as a path delimiter, but at the API level you have to specify it explicitly for that to happen... and if it's used, you only get the content of exactly one "directory" level, so you'd have to keep recursing down, down, down, sending additional requests, hurting performance and increasing costs with a potentially large number of requests. — Michael - sqlbot, May 19 '15 at 17:07
