0

I'm in the process of redesigning my web app which is mostly derived of user photo's. Our photo volume is generally in the low millions or upwards to 35 per user.

Anyhow what I'm looking to do is redesign how we store the photo's on the file system and reference them in the database. Our current system works, but not without it's flaws.

Currently I store them as such

user database table

pk 1 
photo_count 12

Which ends up becoming directory

storage/000/000/000/000/001/1_640x480.png

thru

storage/000/000/000/000/001/12_640x480.png

directory is derived from user pk

first number in the file name is the sort order

the file name refers to the photo size.

This is a very efficient way to store the photo's in the database, however it doesn't come without it's flaws. Whenever the sort changes we must first write the changes to a temp directory then overwrite all the photos in the main directory which isn't all that efficient. We also export photo's to other sites and the problem with our current system is if a photo is modified the name never changes thus the third party site never knows to refresh photo's from the feed. Last major issue has to do with the photo count falling out of sync with the directory count. This results in us generating photo URLs from the database photo_count which may or may not exist which causes some third party sites to fail on the photo import job.

my purposed solution was to do the following, but I'd like an expert opinion.

user database table
pk "1"
photos "stored as a comma separated list of photo names generated from SHA-1" example: 

f56c0de1c61fdb926e79e8a0a65bd12930c9.jpg,ec1c55bfb660548a6770238668c4b117d92f.jpg

my thought was to store the photo sort order in the list, so if the order changes all I'd have to do is rearrange the list rather than rename photos.

I figured I could probably continue deriving my directory structure from the user pk although I'd prefer using some sort of hash, I'm just not sure how to reference it in the database if that was the preferred method. Does it just get stored in another column? example

00e4 becomes /00/e4/

The only other issue I seem to run into is the photo size assuming I still need to store thumbnails. Would it be recommended to suffix the file name with _thumb.jpg?

I think this would all address third party feeds since every photo gets a unique name which is always changed upon modification.

Does anybody have an expert opinion on this subject? I'm not sure this is the best solution, so I'd like to hear what others are doing. Thanks so much.

Code Junkie
  • 7,602
  • 26
  • 79
  • 141

1 Answers1

1

Not sure I understood the question completely, but I'll describe they way we solve the same task (with the only difference that we made a library of arbitrary documents).

Primary DB entry is docs table:

Id              -- file id, pk
UserId          -- owner id
Size            -- file size in bytes
FileName        -- name to display like 'img.jpg'
ContentType     -- image/jpg
FileLocation    -- location of the file on filesystem.
                -- in our case it was something like {storage-root}/{userid}/{guid}
ThumbLocation   -- location of a preview version of the same file.
Created         -- upload time

So upload algorithm:

  • Get a file from a user.
  • Create empty Document object conforming to the structure of the table above.
  • Set doc.FileName to the original filename ("img.jpg").
  • Caclulate destination folder at filesystem per template '/files/{userid}'.
  • Generate unique file name (Guid.New()) and save its value in doc.FileLocation.
  • Write file to doc.FileLocation.
  • Set doc.Size to the original file size.
  • If it's an image, generate preview file, generate it's name: doc.ThumbLocation = doc.FileLocation + "thumb".
  • Write thumb file.
  • save doc to DB.

When user requests file list (by url like /files/{userid}):

  • Read all user's documents from a database
  • (Note that sorting options are not affected by physical files and their locations. All required information is already in the table).
  • Show user a list of documents.
  • Each record shows doc.FileName, doc.Size, and a link pointing to /files/{userid}/{fileid}

When user requests a file by url like /files/{userid}/{fileid}):

  • Read doc from the database
  • Load file from doc.FileLocation
  • Send it to the user.

As a conclusion: we manage physical files and locations only on upload time. The rest of operations are based only on the data we have put to the database. Then getting files content we rely on paths written to the table as it and do not bother their exact meaning.

mikalai
  • 1,746
  • 13
  • 23
  • Thanks mikalai, it appears your creating an entirely separate table to manage your images. That was definitely one approach I've considered. It also looks as if your storing a lot more info than what I'd need. Here's what I ended up doing, I used a photo processor to repair images and convert them all to .jpg excluding extension from db. Encoded in the file name is the file location which I generate from a UUID. example file name 3145d5 becomes /31/45/db.jpg. This enables equal file distribution and around 65 million photos. lastly, I store in db like so, 1860ec,a81685,etc which manages sort. – Code Junkie Oct 02 '13 at 15:09
  • Temp photos and thumbs I encode into the file name _temp1860ec or 1860ec_thumb and then just use business logic to build out my urls. I like this approach do to the fact I can eliminate a join in the db which hopefully will speed things up a bit. – Code Junkie Oct 02 '13 at 15:19
  • @George - you eliminate DB joins by the price of physical files operations, which is much more expensive, I think. Anyway, you're the boss) – mikalai Oct 03 '13 at 08:00
  • it's an interesting point, I would love to do a bench test against the two for comparison. Approx how many photo's are you running on your site? I'd be curious of your performance. We seem to average around 3.5 million and I'm not apposed to changing things if your suggestion is in fact less expensive. – Code Junkie Oct 03 '13 at 12:57
  • No, not so many, it's an internal org library. But I'm sure such "index" approach is correct. – mikalai Oct 04 '13 at 07:50