
When you look at a profile picture on a social networking site like Twitter, the image file is stored at a URL like:

http://a1.twimg.com/profile_images/1082228637/a-smile_twitter_100.jpg

or sometimes with a date somewhere in the path, like 20110912. The only immediate benefit I can think of is that it prevents a bot from walking through your storage and downloading every file sequentially. Am I missing any other benefits? What is the best way to go about randomizing the paths?

I am using Amazon S3, so I will have one subdomain serving all my static content. My plan is to store an integer ID in my database and then concatenate it onto the base URL to form the object's location, as in the sketch below.
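For illustration, a minimal Perl sketch of that plan (the subdomain and file extension are hypothetical placeholders):

    # Build the object URL by concatenating a base URL with the integer id.
    my $id  = 1082228637;                                      # auto-increment id from MySQL
    my $url = "http://static.example.com/profile_images/$id.jpg";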

Adam

3 Answers


One reason I cryptographically scramble identifiers in public URLs is so that the business' rate of growth is not always public.

If the current ids can be deduced simply by creating a new user account or uploading an image, then an outside person can calculate the growth rate (or an upper limit) by doing this on a regular basis and seeing how many ids were used during the elapsed time.

Whether it's stagnating or exploding exponentially, I want to control the release of this information instead of letting competitors or business analysts deduce it for themselves.

Offline examples of this are invoice and check numbers. If you get billed by or paid by a company on a regular basis, then you can see how many invoices or checks they write in that time period.

Here's a CPAN (Perl) module I maintain that scrambles 32-bit ids using reversible (two-way) encryption based on Skipjack:

http://metacpan.org/pod/Crypt::Skip32

It's a direct translation of the Skip32 algorithm written in C by Greg Rose:

http://www.qualcomm.com.au/PublicationsDocs/skip32.c

This approach maps each 32-bit id to an (effectively random) corresponding 32-bit number, which can be reversed back into the original id. You don't have to save anything extra in your database.

I convert the scrambled id into 8 hex digits for displaying in URLs.
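For reference, a minimal sketch of that flow with Crypt::Skip32; the 10-byte key below is a made-up example, and a real deployment would use its own secret:

    use Crypt::Skip32;

    my $key    = pack("H20", "112233445566778899AA");  # 10-byte (80-bit) key -- example only
    my $cipher = Crypt::Skip32->new($key);

    my $id        = 3493209676;                          # internal auto-increment id
    my $scrambled = unpack("N", $cipher->encrypt(pack("N", $id)));
    my $hex       = sprintf("%08x", $scrambled);         # 8 hex digits for the URL

    # The mapping is reversible, so nothing extra needs to be stored:
    my $original  = unpack("N", $cipher->decrypt(pack("N", hex($hex))));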

Once your ids approach 4.29 billion (the 32-bit limit) you'll need to plan for extending the URL structure to support more, but I like having shorter URLs for as long as possible.

Eric Hammond

Changing URLs is a safe way to invalidate outdated (cached) assets.

It is also a necessity if you want to allow users to store private images. A path deducible from the user's account name or id would render privacy settings useless as soon as you store assets on a CDN; see the sketch below.
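As a hedged sketch (not part of the original answer), one way to make such paths non-deducible is to derive them from a server-side secret; the module choice, secret, and URL layout here are illustrative assumptions:

    use Digest::SHA qw(hmac_sha256_hex);

    my $secret   = 'replace-with-a-long-random-server-side-secret';  # never sent to clients
    my $image_id = 1082228637;

    # HMAC of the id with the secret: stable per image, but not guessable from the id alone.
    my $token = substr(hmac_sha256_hex($image_id, $secret), 0, 32);
    my $url   = "http://static.example.com/private/$token.jpg";

Anyone who sees the page gets only the token, so incrementing an id no longer lets a visitor enumerate other users' private images.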

rbq

Mainly, it prevents name collisions. More than one person might upload "IMG_0001.JPG", for example. You also avoid limits on the number of files in one directory, and you can shard images across multiple servers - there's no way a huge site like Twitter or Facebook could store all photos on one server, no matter how large.
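A rough sketch of one common hash-sharded layout (the digest choice and directory depth are illustrative assumptions, not how Twitter or Facebook actually do it):

    use Digest::MD5 qw(md5_hex);

    my $image_id = 1082228637;
    my $digest   = md5_hex($image_id);                       # 32 hex characters
    my ($d1, $d2) = (substr($digest, 0, 2), substr($digest, 2, 2));

    # Two levels of 256 directories spread files across 65,536 buckets,
    # and the digest-based filename avoids collisions between identical upload names.
    my $path = "/images/$d1/$d2/$digest.jpg";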

ceejayoz
  • I understand what you mean but Twitter uses Amazon S3 so they don't have to worry about the concept of a server or a directory. They could store a trillion objects (or as many as they have) in one directory and never have to worry about it. In my case, I'm using an auto-increment integer column in MySQL to act as the corresponding filename on S3 so naming collisions shouldn't be an issue. So do you think there is a good way to prevent bots from downloading all your files systematically? – Adam Oct 09 '11 at 17:07
  • They sure as hell have to worry about it if they want to list the files in a directory looking for a specific one. – ceejayoz Oct 09 '11 at 17:09
  • They have to store metadata about a file's location somewhere else (database, JSON document, etc.). I know for a fact Twitter uses Amazon S3, and if they really wanted to, they could store all images under a1.twimg.com/*. Amazon's cloud handles the hardware, so at a high level you don't have to think in terms of directories. Since I'm asking about S3 specifically, sharding and clustering are not an issue in this case. – Adam Oct 09 '11 at 17:51
  • As ceejayoz mentioned, listing the objects in a particular path is problematic when you have them all in a single "folder". – A.J. Brown Jan 21 '13 at 16:20