52

I'm thinking about developing my own PHP-based gallery for storing lots of pictures, maybe in the tens of thousands.

In the database I'll store the URL of each image, but here's the problem: I know it's impractical to have all of them sitting in the same directory on the server, as it would slow access to a crawl. So how would you store all of them? Some kind of tree based on the name of the JPEG/PNG?

What rules for partitioning the images would you recommend?

(It will be aimed at cheap shared hosting, so no tinkering with the server is possible.)

Pete Kirkham
Saiyine

12 Answers

50

We had a similar problem in the past and found a nice solution:

  • Give each image a unique GUID.
  • Create a database record for each image containing the name, location, GUID, and possible locations of sub-images (thumbnails, reduced size, etc.).
  • Use the first one or two characters of the GUID to determine the top-level folder.
  • If a folder gets too many files, split again. Update the references and you are ready to go.
  • If the number of files and accesses gets too high, you can spread the folders over different file servers.

In our experience, the GUIDs give a more or less uniform distribution, and it worked like a charm.
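A minimal PHP sketch of this scheme (the helper names are mine, and the "GUID" here is just a random hex string rather than a formal RFC 4122 UUID):

```php
<?php
// Generate a pseudo-GUID as 32 hex characters (random_bytes requires PHP 7+).
function makeImageGuid(): string {
    return bin2hex(random_bytes(16));
}

// Use the first two characters of the GUID as the top-level folder.
function guidToRelativePath(string $guid): string {
    return substr($guid, 0, 2) . '/' . $guid;
}
```

With two hex characters you get 256 top-level folders; taking the next two characters as a subfolder is the "split again" step from the list above.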

Links which might help to generate a unique ID:

Aaron Digulla
Toon Krijthe
  • If you use a database anyway, why not just make it a blob and let the database worry about it? – falstro Jan 15 '09 at 11:26
  • 3
    because of performance; database calls are usually really expensive, especially for binary data like images. – Mike Geise Jan 15 '09 at 11:32
  • 3
    not to mention that serving images out of the database means you pretty much always send the data, whereas if you serve from the file system you can let the browser/server handle caching of images – MikeJ Mar 24 '09 at 12:31
  • 7
    @Gamecat IMHO, much better than generating UUIDs is to simply hash the filename and use its beginning as a directory name. This way you need no database, since you can always recompute the hash, which is much faster than the database access. (I see you mentioned SHA-1, but didn't advise this explicitly.) – maaartinus Mar 21 '11 at 18:04
  • 2
    @maaartinus, you are probably right. But we already had a database (for a CMS) we just needed to link with the pictures and this worked great for us. – Toon Krijthe Mar 21 '11 at 19:56
  • I see (I'm going to use a hash and a database, too). – maaartinus Mar 21 '11 at 23:07
  • 2
    If you have an integer unique ID, a simple way to do it is break it into three levels: xxx/yyy/filename.jpg. This way you can use the unique ID. For example, if the id is 100789, it would be stored as 100/789/filename.jpg. Then you have up to 1,000 directories in each level. And a total of 1,000,000 files. And, you can have multiple filenames based on resolution: thumbnail.jpg, small.jpg, etc. – B Seven May 03 '11 at 07:27
  • I'd also recommend salting your hash with a known constant when generating folder names. This prevents uploaders from easily determining the folder where you are putting their files (they can hash their own images with sha1 if they know that's how you create folders). – Steve Midgley Nov 10 '14 at 18:33
11

I worked on an Electronic Document Management system a few years ago, and we did pretty much what Gamecat and wic suggested.

That is, assign each image a unique ID, and use that to derive a relative path to the image file. We used a MOD scheme similar to what wic suggested, but we allowed 1024 folders/files at each level, with 3 levels, so we could support ~1G files.

We stripped the extension off the files, however. The DB record contained the MIME type, so the extension was not needed.
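A hedged PHP sketch of that derivation (the function name is mine): split the ID into three base-1024 digits, giving 1024 entries per level and up to 1024³ ≈ 1G files:

```php
<?php
// Map a numeric ID to a 3-level relative path, 1024 entries per level.
function idToPath(int $id): string {
    $level1 = intdiv($id, 1024 * 1024) % 1024; // top-level folder, 0..1023
    $level2 = intdiv($id, 1024) % 1024;        // second-level folder
    $file   = $id % 1024;                      // file name (extension stripped)
    return sprintf('%04d/%04d/%04d', $level1, $level2, $file);
}
```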

I would not recommend storing the full URL in the DB record, only the Image ID. If you store the URL you can't move or restructure your storage without converting your DB. A relative URL would be ok since that way you can at least move the image repository around, but you'll get more flexibility if you just store the ID and derive the URL.

Also, I would not recommend allowing direct references to your image files from the web. Instead, provide a URL to a server-side program (e.g., Java Servlet), with the Image ID being supplied in the URL Query (http://url.com/GetImage?imageID=1234).

The servlet can use that ID to look up the DB record, determine MIME Type, derive the actual location, check for security restrictions, logging, etc.
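In PHP (the OP's language), the equivalent of that servlet is a small front-controller script. A sketch, where the directory scheme, base path, and helper name are all illustrative assumptions:

```php
<?php
// Hypothetical helper: map a numeric image ID to its storage path.
// Layout assumed here: two directory levels of up to 1,000 dirs each.
function imageStoragePath(int $id, string $base = '/var/imagestore'): string {
    $top = intdiv($id, 1000000) % 1000;
    $mid = intdiv($id, 1000) % 1000;
    return sprintf('%s/%03d/%03d/%d', $base, $top, $mid, $id);
}

// A serving script (e.g. image.php?id=1234) would then look up the MIME
// type for the ID in the database, send it as the Content-Type header,
// and stream the file:
//   header('Content-Type: ' . $mimeType);
//   readfile(imageStoragePath($id));
```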

Clayton
  • good points. does the servlet request still allow for caching? I am looking at a similar problem, but in my app the transfer time is critical, so I was looking for ways to cache the images on the client. Am I dreaming? – MikeJ Mar 24 '09 at 12:35
  • @MikeJ: You could create a separate class for access to the images. That class would know how to derive a path from an id, etc. It could also contain a cache, possibly as a hashtable that you manage yourself, or maybe a canned cache class. Servlet would get images from this object, not from disk. – Clayton Mar 24 '09 at 17:05
9

I usually just use the numerical database ID (auto_increment) and then use the modulo (%) operator to figure out where to put the file. Simple and scalable. For instance, the path to the image with ID 12345 could be created like this:

12345 % 100 = 45
12345 % 1000 = 345

Ends up in:

/home/joe/images/345/45/12345.png

Or something like that.
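A hedged PHP sketch of this scheme (the function name is mine):

```php
<?php
// Build the relative path from the ID's modulo values,
// e.g. 12345 -> 345/45/12345.png
function modPath(int $id, string $ext = 'png'): string {
    $outer = $id % 1000; // 12345 % 1000 = 345
    $inner = $id % 100;  // 12345 % 100  = 45
    return sprintf('%d/%d/%d.%s', $outer, $inner, $id, $ext);
}
```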

If you're using Linux with ext3 as the filesystem, be aware that there are limits on the number of directories and files you can have in one directory. The limit is about 32,000 subdirectories, so you should always strive to keep the number of directories low.

Martin Wickman
  • 12
    What's the purpose in having both '345' and '45'? Seems like each of your first-level directories (like '345') will have exactly one subdirectory (in this case '45'). – Dustin Boswell Nov 05 '10 at 08:36
7

I know it's impractical to have all of them sitting in the same directory on the server, as it would slow access to a crawl.

This is an assumption.

I have designed systems where we had millions of files stored flat in one directory, and it worked great. It's also the easiest system to program. Most server filesystems support this without a problem (although you'd have to check which one you were using).

http://www.databasesandlife.com/flat-directories/

Adrian Smith
  • 2
    Thanks for sharing. The OP mentioned PHP, and one practical issue is that FTP access to a directory with a large number of files can time out. – James P. May 05 '11 at 20:14
  • 2
    I think it's important to say, as you do in your blog article, that *some* file systems support very large numbers of files in a single folder. And in my experience, some (other) file systems work outside their stated specification for large numbers of files, but not all file operations will work. If you're going to store very large numbers of files in a single folder, test it out first! That said, why not just tree-balance the folder structure with a hash of some kind? – Steve Midgley Nov 10 '14 at 18:36
6

When saving files associated with auto_increment IDs, I use something like the following, which creates three directory levels, each containing up to 1,000 dirs, with up to 100 files in each third-level directory. This supports ~100 billion files.

If $id = 99532455444, the following returns /995/324/554/44:

function getFileDirectory($id) {
    // Split the ID into three directory levels of up to 1,000 dirs each,
    // plus a 0..99 file name: 99532455444 -> /995/324/554/44.
    $level1 = intdiv($id, 100000000);          // top 3 digits
    $level2 = intdiv($id % 100000000, 100000); // next 3 digits
    $level3 = intdiv($id % 100000, 100);       // next 3 digits
    $file   = $id % 100;                       // last 2 digits

    return '/' . sprintf("%03d", $level1)
         . '/' . sprintf("%03d", $level2)
         . '/' . sprintf("%03d", $level3)
         . '/' . $file;
}
Isaac
2

Look at the XFS filesystem. It supports an essentially unlimited number of files, and Linux supports it. http://oss.sgi.com/projects/xfs/papers/xfs_usenix/index.html

EXTROMEDIA
1

You could always have a DateTime column in the table and then store the images in folders named after the month/year (or even month/day/year) the images were added to the table.

Example:

2009
  01
    01
    02
    03
    31

This way you end up with folders no more than 3 levels deep.
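A minimal PHP sketch, assuming the DateTime column's value is available as a string:

```php
<?php
// Build a year/month/day folder path from the row's DateTime value.
function dateFolder(string $dateTime): string {
    return date('Y/m/d', strtotime($dateTime));
}
```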

Mike Geise
1

I am currently facing this problem, and what Isaac wrote got me interested in the idea, though my function differs a little.

function _getFilePath($id) {
    // Zero-pad the ID to six digits, then split it into three
    // two-character chunks: 43524 -> "043524" -> 04/35/24.jpg
    $id = sprintf("%06d", $id);
    $level = array();
    for ($lvl = 3; $lvl >= 1; $lvl--)
        $level[$lvl] = substr($id, ($lvl * 2) - 2, 2);
    return implode('/', array_reverse($level)) . '.jpg';
}

My images only number in the thousands, so I only go up to a 999999 limit; that splits into 99/99/99.jpg, or 43524 into 04/35/24.jpg.

Tom van der Woerdt
Mikhail
0

Use the hierarchy of the file system. Naming your images something like 001/002/003/004.jpg would be very helpful. Partitioning is a different story, though: it could be random, content-based, creation-date-based, etc. It really depends on what your application is.

PolyThinker
0

You could check out the strategy the Apple iPod uses for storing its multimedia content: folders one level deep and files with fixed-width names. I believe the Apple engineers invested a lot of time in testing their solution, so it may bring some instant benefit to you.

Boris Pavlović
0

If the pictures you're handling are digital photographs, you could use EXIF data to sort them, for example by capture date.
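For example, a hedged PHP sketch (the helper name is mine): exif_read_data() exposes the capture date as DateTimeOriginal in "YYYY:MM:DD HH:MM:SS" format, which maps naturally to a year/month folder:

```php
<?php
// Turn an EXIF DateTimeOriginal string ("2009:01:15 11:26:00")
// into a year/month folder path ("2009/01").
function exifDateToFolder(string $exifDate): string {
    return substr($exifDate, 0, 4) . '/' . substr($exifDate, 5, 2);
}

// Usage with a real photo (requires the PHP exif extension):
//   $exif = exif_read_data('photo.jpg');
//   $folder = exifDateToFolder($exif['DateTimeOriginal']);
```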

Keltia
0

You can store the images in the database as blobs (varbinary for mssql). That way you don't have to worry about the storage or directory structure. The only downside is that you can't easily browse the files, but that would be hard in a balanced directory tree anyway.

Mats Fredriksson
  • IMO this is bad advice. 1. Your DB will soon become huge, which brings other issues. 2. It won't be possible to cache images using a caching proxy like nginx or HAProxy, which are extremely fast for static content. 3. The DB will become a bottleneck under pretty low load. – Roman Podlinov Apr 30 '13 at 15:41