
Possible Duplicate:
Storing a large number of images

Hello,

I want to be able to scale to millions of user profile pics on my LAMP Server using PHP.

I currently store all images in one folder, which is a big no-no, so I want to spread them out into many folders and sub-folders (e.g. aa/bb/ etc...).

What is the best and most efficient way of doing that, especially if I do not want to have to call the DB to get the filename/path for that user's profile pic?

I'm thinking of hashing the username and using the first 4 characters of that hash to generate/locate the path for that user's profile pic. That way I wouldn't have to fetch anything extra from the DB, since I will always have the user's username. For example, if the first 4 characters of the username's hash were "aabb", I would store that user's profile pic under aa/bb/username/profile.jpg. That should theoretically allow me to scale to millions of users without adding anything to the DB, while spreading all the pics evenly throughout the aa/bb/ folder structure.
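Something like this minimal PHP sketch is what I have in mind (the `IMAGE_ROOT` constant and the helper name are just placeholders):

```php
<?php
// Hypothetical root folder for all profile pics.
define('IMAGE_ROOT', '/var/www/images');

// Derive the storage path from a hash of the username alone,
// so no DB lookup is needed to find the pic.
function profilePicPath($username) {
    $hash = md5($username);            // hex string, e.g. "aabbf0..."
    $a = substr($hash, 0, 2);          // first folder level:  "aa"
    $b = substr($hash, 2, 2);          // second folder level: "bb"
    return IMAGE_ROOT . "/$a/$b/$username/profile.jpg";
}

echo profilePicPath('johndoe');
// IMAGE_ROOT/xx/yy/johndoe/profile.jpg, where xx/yy come from the hash
```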

Any ideas/input?

Thanks!

  • That may cause images with the same name to be put in one folder, i.e. a new image may replace an older one. – Hossein Mar 17 '11 at 18:12

1 Answer


Depends on how your users are organized.

  1. I guess they all have a unique ID. If you know it, you can store files like 0xx/007.png, 8xx/824.png and 547xx/54723.png. This cuts the number of items in the main folder by a factor of 100, and every folder contains at most 100 items (see the sketch after this list).

  2. If only a restricted set of characters is allowed in your usernames, you could use them directly, but I would not generally recommend that; it can get dangerous if you don't know what you're doing. Filenames would look like ma/master_of_desaster.png, ki/king_cool.png and so/some_other_infantile_name.png.

  3. Using hashes is a great idea. If it's not about security (it seems it's not), you can reduce CPU overhead by using a short checksum algorithm instead of a complex but secure hash algorithm. Just think of CRC32. Filenames would look like [CRC32sum]/[USER_ID].png (also shown in the sketch below).
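A quick PHP sketch of schemes #1 and #3 (the function names, the `.png` extension and the zero-padding are just illustrative choices):

```php
<?php
// Scheme #1: shard by numeric ID, grouping files 100 per folder.
// ID 54723 -> "547xx/54723.png", ID 7 -> "0xx/007.png".
function idPath($id) {
    $prefix = (int) ($id / 100);                            // drop the last 2 digits
    $file   = str_pad((string) $id, 3, '0', STR_PAD_LEFT);  // pad short IDs
    return $prefix . 'xx/' . $file . '.png';
}

// Scheme #3: shard by a cheap checksum of the username.
// crc32() is built into PHP and much faster than md5/sha1;
// sprintf('%u') avoids negative results on 32-bit systems.
function crcPath($username, $id) {
    return sprintf('%u', crc32($username)) . '/' . $id . '.png';
}

echo idPath(54723), "\n";         // 547xx/54723.png
echo crcPath('king_cool', 824);   // [crc32 of "king_cool"]/824.png
```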

Daniel Böhmer
  • Thanks for the quick reply, halo. The usernames are definitely unique and I also have a unique numeric userid assigned to each user which I can also use. Considering that, which alternative would you suggest between #1 and #3? Also, with respect to #3, which "short checksum algorithm" would you suggest in PHP? I was planning on simply using md5 but CPU overhead is definitely a concern, especially once I'm serving up millions of images and each one has to go through the same hashing procedure to be retrieved. Thanks again! – PleaseHelpMe Mar 17 '11 at 18:20
  • md5 the username, then create the directory (or check whether it exists) based on the first couple of chars of the md5, then save a reference to the image in a db – Lawrence Cherone Mar 17 '11 at 18:27
  • Solution #1 is faster and can easily be used with 2 or 3 iterations to decrease the number of items per folder by a factor of 10,000 or 1,000,000. Additionally you can skip 3 or 4 digits per folder level. For solution #3, CRC32 is the best that comes to my mind. See http://php.net/manual/en/function.crc32.php – Daniel Böhmer Mar 17 '11 at 18:27
  • Also keep in mind that users may be able to change their name (if not now, then possibly in the future). Thus it's a good idea to choose an algorithm based on the ID. – Daniel Böhmer Mar 17 '11 at 18:28
  • halo, I would appreciate some clarifications regarding Solution #1: How would I deal with early user IDs such as '1' or '14'? What is the purpose of the 'xx' in your original example? How would I store user ID '1' vs. user ID '1000000'? Thanks again. – PleaseHelpMe Mar 17 '11 at 18:43
  • I think you can work this out yourself. I added `xx` to indicate that 2 digits were skipped. You can name the folders and files any way you like. The main point of the idea is to group files into 100s or 1000s by all digits except the last 2 or 3. To answer your question: in the example given in the answer your files would be `0xx/001.png` and `10000xx/1000000.png`. Again, there is no problem with modifying the naming scheme in its details. – Daniel Böhmer Mar 17 '11 at 18:47
  • I'm trying to wrap my head around this new way of thinking about the problem. If I group them as per your example (10000xx), then I will have thousands of sub-folders within the images folder. So I'm trying to think of the best way to break it up into more than one layer of sub-folders, something like 0xx/0xx/1.png and 10xx/0xx/1000000.png, but I'm having difficulty coming up with the best way to do that without having too many files or sub-folders within a folder. Any suggestions? Thanks for your time, I appreciate it. – PleaseHelpMe Mar 17 '11 at 19:10
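One way to get the two-level layout asked about in the last comment is to zero-pad the ID and split it into fixed-width chunks (a sketch; the 7-digit padding and the 3/2 split are arbitrary choices):

```php
<?php
// Pad the ID to 7 digits, then use digits 1-3 and 4-5 as folder levels.
// Capacity: 1,000 top folders x 100 subfolders x 100 files = 10 million IDs.
function shardedPath($id) {
    $s = str_pad((string) $id, 7, '0', STR_PAD_LEFT);   // 1 -> "0000001"
    return substr($s, 0, 3) . '/' . substr($s, 3, 2) . '/' . $id . '.png';
}

echo shardedPath(1), "\n";        // 000/00/1.png
echo shardedPath(1000000), "\n";  // 100/00/1000000.png
```

With this split no folder ever holds more than about 1,000 entries, and the path is still derived from the ID alone, with no DB lookup.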