
I have a site with more than 100k static files in a single directory (600k+ dirs and files in total). I guess I could get a VPS to host it without inode issues, but it won't be a high traffic site, so I'd rather use a cheap webhost.

I'm thinking of storing the files in a MySQL table indexed by URL path and serving them through PHP. Are there better approaches?

EDIT: Just to clarify, this is NOT the same as storing images in the DB. I'm talking about HTML pages.

Ivan

2 Answers


I think your best approach would be not to store them in a database in the first place. Storing and serving files is what a file system does best; there is no realistic way a database can do this more efficiently than a normal file system.

If you were to store them in a database then, given the size of the pages, you would want a BLOB or TEXT column for the content and, for efficiency, you would hash the URL and index that hash column rather than indexing some huge VARCHAR field.
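To make that concrete, a minimal sketch of that lookup might look like this (the `pages` table, its column names and the use of PDO are assumptions for illustration, not something given in the question):

```php
<?php
// Hypothetical schema assumed by this sketch:
//   CREATE TABLE pages (
//     url_hash BINARY(16)   NOT NULL PRIMARY KEY,  -- md5 of the URL path
//     url      VARCHAR(255) NOT NULL,              -- kept for reference
//     body     MEDIUMTEXT   NOT NULL               -- the static HTML
//   );

$pdo = new PDO('mysql:host=localhost;dbname=site;charset=utf8mb4', 'user', 'pass');

// Look the page up by the binary md5 of the requested path.
$stmt = $pdo->prepare('SELECT body FROM pages WHERE url_hash = ?');
$stmt->execute([md5($_SERVER['REQUEST_URI'], true)]);

$body = $stmt->fetchColumn();
if ($body === false) {
    http_response_code(404);
    exit;
}
echo $body;
```

The hash keeps the index small and fixed-width; the original URL column is only there so the rows remain human-readable.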

However, as you've said they are static, there really isn't any point in this. Instead, have your webserver add some long caching headers to the pages so they are stored locally for future hits from the same client.
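For example, if the pages end up being served through PHP anyway (as the question suggests), the headers could be emitted in the script itself; otherwise the same headers can be set in the webserver's configuration. The one-year lifetime below is just an assumption, so pick whatever suits how often the pages change:

```php
<?php
// Sketch: long-lived caching headers so repeat visitors are served from their
// local cache rather than hitting the server again.
$maxAge = 31536000; // one year, an arbitrary choice for illustration
header('Cache-Control: public, max-age=' . $maxAge);
header('Expires: ' . gmdate('D, d M Y H:i:s', time() + $maxAge) . ' GMT');
```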

[Edit 1 - in response to comment]

I was answering the question with the information given and keeping it generic where information wasn't provided by the OP.

It depends on how much of the VARCHAR you index, which in turn depends on the length of the data stored (URL / path / page name) that you're indexing.

If you're indexing fewer than about 45 characters for only 100k rows, I guess it wouldn't make much difference: a hash will use less memory, but at that scale the gap in size and performance is probably negligible either way.
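For comparison, the prefix-index alternative being discussed would look something like this (again only a sketch; the table and column names are assumptions carried over from above):

```php
<?php
// Sketch: index only the first 45 characters of the URL column instead of
// maintaining a separate hash column. Table/column names are assumptions.
$pdo = new PDO('mysql:host=localhost;dbname=site;charset=utf8mb4', 'user', 'pass');
$pdo->exec('ALTER TABLE pages ADD INDEX idx_url (url(45))');

// Equality lookups can then use the prefix index directly on the URL.
$stmt = $pdo->prepare('SELECT body FROM pages WHERE url = ?');
$stmt->execute([$_SERVER['REQUEST_URI']]);
$body = $stmt->fetchColumn();
```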

I answered it as the OP asked about the database, but I still can't see any reason why you would want to put the files there in the first place: it will be slower than using the file system. Why connect to the database, deal with network latency (unless the database is on the same box, which is unlikely on a web host), query an index, fetch a row, push that data through the database driver and stream the output to the response, when the webserver can produce the same outcome with far fewer CPU cycles and, compared to a database, a fraction of the memory usage?

Steve
  • Why is having a medium sized (let's be honest it's unlikely to be 200 bytes even) VARCHAR field indexed such a big problem? Yes, it uses up a lot of storage, but a lot less than using the database as a file-system. The files are static so there's little index maintenance that needs to be done and you don't have to calculate the URL hash on the fly every time you want to find a file (thereby delaying your index look-up before you've even started it). Lastly, and most minor, there's no chance of a collision. – Ben Mar 23 '13 at 21:10
  • @Ben my answer is too long; I'll add an edit – Steve Mar 23 '13 at 21:24
  • Thanks for your answer. I'm no expert in this stuff, but my previous research showed that some filesystems have a 32k entries per directory limit. Plus, `df -i` shows I have about 500k inodes free, so that's another issue. Right now I'm not yet close to the 500k limit, so I'm tempted to go the filesystem route until it becomes a problem, although I hope it won't be a huge pain to migrate later on. – Ivan Mar 24 '13 at 03:29

Yes, a filesystem is a database. All the filesystems I've come across in the last 10 years can easily accommodate this number of files in a directory, and directories are implemented as trees (some use B-trees, but structures with bigger fanout, such as H-trees, work better for this kind of application).

(Actually, given the choice I'd recommend structuring it as a hierarchy of directories, e.g. using subdirectories based on the first 2 letters of the filename or on an md5 hash of the content; it'd make managing the content a lot easier without compromising performance. A sketch of this follows.)
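A rough sketch of what I mean (the two-level bucket depth and the use of md5 on the filename are just one way to do it, assumed here for illustration):

```php
<?php
// Sketch: derive a sharded path for a file so no single directory grows huge.
// Bucket depth and the choice of md5 are assumptions for illustration.
function shardedPath(string $filename, string $baseDir = '/var/www/pages'): string
{
    $hash = md5($filename);
    // The first two pairs of hex characters become the bucket directories,
    // e.g. <baseDir>/ab/cd/<filename>.
    return $baseDir . '/' . substr($hash, 0, 2) . '/' . substr($hash, 2, 2) . '/' . $filename;
}

$target = shardedPath('about-us.html');
if (!is_dir(dirname($target))) {
    mkdir(dirname($target), 0755, true); // create the bucket directories on demand
}
```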

Relational databases are all about storing small pieces of structured data; they are not an efficient way to manage large, variable-sized content.

I don't have any benchmarks to hand, but just as I'd pick a station wagon over a sports motorcycle to move several petabytes of data quickly, I'd go with a suitable filesystem (such as Btrfs or ext4; ZFS would do the job too, but it's not a good choice on anything other than Solaris, and it's questionable whether Solaris makes any sense for a webserver).

The problem is that cheap hosting companies rarely provide this level of information up front.

Note that a small tweak to the filesystem behaviour can yield big improvements in performance: in your case, if running on Linux, I'd recommend reducing vfs_cache_pressure significantly. But this requires root access.

An alternative approach would be to use a document database rather than a relational database (not a key/value store). These are a type of schema-free (NoSQL) database designed for fast replication and for handling large data structures, so they would provide a more scalable solution if that's a concern, e.g. RavenDB. You could use a key/value store, but those are rarely optimized to handle large data payloads.

I'd only consider MySQL if you have a very strong reason other than what you've described here.

symcbean
  • Performance is not the most important issue, given how simple the DB queries will be. As you say, if I could choose the filesystem this would be moot, but alas, I want to host on a shared webhost. Oh, also, partitioning the files into letter directories is a good idea, but then the anchors in the HTML pages wouldn't work. I'd also use a document database, but all I have access to is MySQL. – Ivan Mar 24 '13 at 03:27