
Which type of file system is best suited to storing images for a social-networking website with around 50 thousand users?

What I mean is: how should I lay out the directories? What should the folder hierarchy for storing images be (for example, by album or by user)?

I know Facebook uses Haystack now, but before that it used plain NFS. What directory hierarchy would be used on NFS?

thegrinner
Shivam Agrawal

1 Answer


There is no "best" way to do this from a filesystem perspective. NFS, for example, has no set "hierarchy" other than the directories that you create in the NFS share where you're writing the photos.

Each underlying filesystem type (meaning the server-side filesystem that NFS serves files from, not NFS itself) has its own performance characteristics, but most modern ones have a reasonably fast way (O(1), or at least O(log n)) to look up a file in a directory. For that reason, almost any directory structure will give you "not terrible" performance. You should therefore base the decision on what makes your application easiest to write and maintain, especially since you have a relatively small number of users right now.

That said, if I wanted a relatively simple solution, I would give each photo either a long random number in hex (like b16eabce1f694f9bb754f3d84ba4b73e) or a checksum of its contents (such as the output of running md5/md5sum on the photo file, like 5983392e6eaaf5fb7d7ec95357cf0480), and then split that string into a "directory" prefix and a "filename" suffix, like 5983392e6/eaaf5fb7d7ec95357cf0480.jpg. Where you make the split determines how many files end up in each directory: every extra hex character in the prefix multiplies the number of possible directories by 16, so a shorter prefix means fewer directories with more files in each. Then I'd store the number/checksum as a column in the database table that tracks uploaded photos.
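A minimal sketch of that naming scheme in Python, for illustration only. The function names (random_photo_id, checksum_photo_id, photo_path) and the default prefix length of 9 are assumptions chosen to reproduce the example path above, not part of any library:

```python
import hashlib
import secrets
from pathlib import PurePosixPath


def random_photo_id() -> str:
    """Return 128 random bits as a 32-character hex string."""
    return secrets.token_hex(16)


def checksum_photo_id(data: bytes) -> str:
    """Return the MD5 digest of the photo bytes as a 32-character hex string."""
    return hashlib.md5(data).hexdigest()


def photo_path(photo_id: str, prefix_len: int = 9, ext: str = ".jpg") -> PurePosixPath:
    """Split a hex ID into a directory prefix and a filename suffix.

    prefix_len is a hypothetical tuning knob: each extra hex character
    multiplies the number of possible directories by 16.
    """
    if not 0 < prefix_len < len(photo_id):
        raise ValueError("prefix_len must fall strictly inside the ID")
    return PurePosixPath(photo_id[:prefix_len]) / (photo_id[prefix_len:] + ext)


print(photo_path("5983392e6eaaf5fb7d7ec95357cf0480"))
# 5983392e6/eaaf5fb7d7ec95357cf0480.jpg
```

With a 9-character prefix there are 16^9 possible directories, so nearly every photo would land in a directory of its own; a prefix of two or three characters (256 or 4,096 directories) may be a more practical starting point.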

The tradeoffs between these two approaches are mostly performance-related: generating a random number is much faster than checksumming a whole file, but a checksum lets you notice when the same photo has been uploaded more than once and store it only once (if that's likely to be common on your website, which I have no idea about :-) ). Cryptographic hashes also produce very evenly distributed values, so you are very unlikely to end up with a disproportionate number of photos in one particular directory (even if an attacker knows which checksum algorithm you're using). One caveat: MD5 is no longer collision-resistant, so if uploads are untrusted and you rely on the checksum alone to identify a photo, a stronger hash such as SHA-256 is the safer choice.
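To make the deduplication point concrete, here is a small sketch under stated assumptions: PhotoStore is a hypothetical in-memory stand-in for the database table and the file storage, and it uses SHA-256 rather than the md5 mentioned above, because MD5 collisions can be crafted deliberately.

```python
import hashlib


class PhotoStore:
    """Toy in-memory stand-in for the photos table and the file storage."""

    def __init__(self) -> None:
        self.blobs = {}    # checksum -> photo bytes (stored once per distinct photo)
        self.uploads = []  # (user, checksum) rows, one per upload

    def upload(self, user: str, data: bytes) -> tuple[str, bool]:
        """Record an upload; return (checksum, whether new bytes were stored)."""
        digest = hashlib.sha256(data).hexdigest()
        is_new = digest not in self.blobs
        if is_new:
            self.blobs[digest] = data
        self.uploads.append((user, digest))
        return digest, is_new


store = PhotoStore()
first = store.upload("alice", b"same picture bytes")
second = store.upload("bob", b"same picture bytes")
print(first[1], second[1], len(store.blobs))
# True False 1
```

Note that deduplicated storage also means a file can only be deleted once no upload row references its checksum any more.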

If you ever find that the split point you chose no longer scales because it puts too many files in each directory, you can add another level of directory nesting, for instance by switching from 5983392e6/eaaf5fb7d7ec95357cf0480.jpg to 5983392e6/eaaf5fb7/d7ec95357cf0480.jpg, and move the existing files to their new paths. Also, if a single NFS server can no longer handle the load by itself, you could use the prefix to distribute the photos across multiple NFS servers instead of just across multiple directories.
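Both ideas can be sketched in a few lines of Python. The function names, the split points (9, 17), and the server hostnames are illustrative assumptions, and taking the prefix modulo the server count is just one simple way to map a prefix to a server:

```python
def nested_path(photo_id: str, split_points=(9, 17), ext: str = ".jpg") -> str:
    """Cut a hex ID at each split point to build a multi-level path."""
    parts = []
    start = 0
    for end in split_points:
        parts.append(photo_id[start:end])
        start = end
    parts.append(photo_id[start:] + ext)  # whatever is left becomes the filename
    return "/".join(parts)


def pick_server(photo_id: str, servers: list) -> str:
    """Choose a server from the first 8 hex characters of the ID."""
    return servers[int(photo_id[:8], 16) % len(servers)]


photo_id = "5983392e6eaaf5fb7d7ec95357cf0480"
print(nested_path(photo_id))
# 5983392e6/eaaf5fb7/d7ec95357cf0480.jpg

# Hypothetical hostnames, for illustration only.
servers = ["nfs1.example.com", "nfs2.example.com", "nfs3.example.com"]
print(pick_server(photo_id, servers))
```

One design caveat: with plain modulo, changing the number of servers reassigns most photos to a different server, so a fixed prefix-to-server table stored in configuration may be easier to grow.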

Dan