3

Please consider the following:

I am storing around 1.2 million TIF files, each between 40 KB and 120 KB in size.

These documents are stored on a Windows server with an NTFS file system.

The storage path for each document is built from the following variables:

  • client
  • document type
  • image folder
  • actual image

See below:

C:\<client_id>\<doc_type_id>\image001\1.TIF

Example

C:\1\3\image001\1.TIF
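
For reference, here is a minimal PHP sketch of how a path under the current scheme might be assembled (the function and variable names are illustrative, not from the actual system):

    <?php
    // Build the on-disk path for an image under the current scheme:
    // C:\<client_id>\<doc_type_id>\<image_folder>\<image_no>.TIF
    function buildImagePath($clientId, $docTypeId, $imageFolder, $imageNo)
    {
        return 'C:\\' . $clientId . '\\' . $docTypeId . '\\'
             . $imageFolder . '\\' . $imageNo . '.TIF';
    }

    echo buildImagePath(1, 3, 'image001', 1); // C:\1\3\image001\1.TIF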

It is a PHP-hosted system.

Performance is acceptable at this stage, but I want to know the best strategy going forward, considering that the number of customers and documents is going to increase dramatically.

I am looking at replacing the complete storage layer with the Jackrabbit content repository.

Would this be the way to go? Or

Is storing the documents in a format like:

  • Customer
  • Document type
  • Day of the year (Julian day) on which the document was imported
  • Current user
  • 6-digit unique code

Example

C:\1\1\167\2\453257\image001\image.TIF

going to be just as efficient?
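
A minimal PHP sketch of how the proposed path could be generated (the day-of-year, user, and code values are assumptions; a real implementation would have to guarantee that the 6-digit code is actually unique):

    <?php
    // Proposed scheme:
    // C:\<customer>\<doc_type>\<day_of_year>\<user_id>\<code>\image001\image.TIF
    function buildProposedPath($customerId, $docTypeId, $userId)
    {
        $dayOfYear = date('z') + 1;                        // 1..366
        $code      = sprintf('%06d', mt_rand(0, 999999));  // uniqueness check omitted
        return 'C:\\' . implode('\\', array($customerId, $docTypeId, $dayOfYear,
                                            $userId, $code, 'image001'))
             . '\\image.TIF';
    }

    echo buildProposedPath(1, 1, 2); // e.g. C:\1\1\167\2\453257\image001\image.TIF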

Please take all other CMS-versus-file-system considerations out of the picture, e.g. versioning and data backup.

Thanks.

Koekiebox
  • Can you elaborate on what kinds of access patterns you expect? – Amber Sep 04 '09 at 22:32
  • The paths would be stored in the database. Users would run queries based on columns stored in the database; the path for whichever result the user selects would then be retrieved and the image displayed. – Koekiebox Sep 05 '09 at 12:04
  • If it is working, don't change it until you need to; just separate the code that reads the images out into its own method(s) so you can change it IF you need to later on. – Ian Ringrose Sep 10 '09 at 09:24

3 Answers

4

Honestly? I don't think it matters until you get to a certain size (and I can't, for the life of me, remember what that size is...). The thing is to find a method and then stick with it, hopefully in such a way that you never need to touch it again. My own advice, without anything as convincing as evidence to support it, is something akin to your own suggestion:

c:\<customer_id>\<document_year>\<document_month>\<document_day>\actual_file.tif
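
As a rough PHP sketch of that layout (the function name and arguments are mine, purely illustrative):

    <?php
    // c:\<customer_id>\<year>\<month>\<day>\actual_file.tif
    function buildDatePath($customerId, $fileName, $timestamp = null)
    {
        if ($timestamp === null) {
            $timestamp = time(); // default to "now", i.e. the import date
        }
        return 'C:\\' . $customerId . '\\'
             . date('Y', $timestamp) . '\\'
             . date('m', $timestamp) . '\\'
             . date('d', $timestamp) . '\\' . $fileName;
    }

    echo buildDatePath(1, 'actual_file.tif'); // e.g. C:\1\2009\09\05\actual_file.tif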

I'd also suggest that, depending on your server setup and on each customer's data volume or account type, it might be worth giving each customer their own drive/partition.

Bear in mind that, without some sort of user control or permissions system, file paths could be predictably guessed and browsed (as if you didn't know this already... I know, I'm sorry). The fact that you raised the bullet point of a 'six-digit unique code' suggests that you don't need a path of common format, but I would suggest that a common format (whatever format you end up choosing) would be a better idea.

Back in my Windows days I sorted my own directories around each file's primary relation; it'd be considered a 'tag' nowadays (c:\documents and settings\university\year1\module21\assignment1.doc, for example), and this made it easier to find things later. Your customers appear to have their directory structure enforced by you, but finding what they did last week is easier if they only have to traverse dates; remembering where they put something once they get down to the six-digit-number-named folders is going to be, well, difficult. At best.

David Thomas
2

Your question is very similar to this one. Is your load primarily reading your images, or writing them? If it's read scalability you need, the linked post describes memcached, which is probably all you need. Jackrabbit has loads more features, but is aimed more at hierarchical text storage; I'm not sure it will do any better performance-wise on your images. Also, if you do choose Jackrabbit, make sure your content hierarchy is deep enough for it to stay efficient: any parent with 10,000 or more children is going to have sub-par performance.
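
To sketch the memcached idea in PHP (using the pecl/memcached extension; the server address, key scheme, and expiry here are assumptions):

    <?php
    // Read-through cache: try memcached first, fall back to disk and
    // populate the cache on a miss. 40-120 KB images fit comfortably
    // under memcached's default 1 MB value limit.
    $mc = new Memcached();
    $mc->addServer('127.0.0.1', 11211);

    function fetchImage(Memcached $mc, $path)
    {
        $key  = 'img:' . md5($path);
        $data = $mc->get($key);
        if ($data === false) {                 // cache miss
            $data = file_get_contents($path);
            $mc->set($key, $data, 3600);       // keep for an hour
        }
        return $data;
    }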

DaveParillo
  • memcache will only help if a small number of images are read a lot AND you have more than one server. Otherwise just use a 64-bit system and put lots of RAM in the file server; let the OS do the caching for you. – Ian Ringrose Sep 10 '09 at 09:23
1

The storage strategy you proposed would need to be revisited if you intend to move your content to different machines (SAN/NAS). To do that, you would strip all the customer data from the path and instead create a hash, which you save in the database and link to the file you are accessing. That way you are left with a folder structure something like this:

NAS1/00/01/86/63/54/89/image01/image.tiff
NAS2/00/02/46/62/22/11/image02/image.tiff
...
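
A sketch of how such a hashed path could be derived in PHP (the root prefix, hash choice, and two-character segment width are assumptions based on the example above; the imageNN folder level is omitted for brevity):

    <?php
    // Derive the storage path from a hash of the document ID so that no
    // customer data appears on disk; the path-to-document mapping lives
    // in the database.
    function hashedPath($documentId, $root = 'NAS1')
    {
        $hash     = md5((string) $documentId);
        $segments = str_split(substr($hash, 0, 12), 2); // six 2-char levels
        return $root . '/' . implode('/', $segments) . '/image.tiff';
    }

    echo hashedPath(1); // NAS1/c4/ca/42/38/a0/b9/image.tiff

With two hex characters per level, no directory ever holds more than 256 children, which stays well clear of the slowdown around 10,000 children mentioned below.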

I would also recommend you take a gander at MogileFS. All you need to do to speed it up is put some sort of proxy in front of it, and all should be well.

And as Dave mentioned, make sure you don't have too many children in one folder; things tend to get quite sluggish around 10,000.

Miha Hribar