Fast distributed file system for small files

Question

Our company has five million users, and we store their code files. Users can edit and add files, as in a web IDE, and the web IDE lists each user's files. We implement these operations with PHP functions such as readdir, file_get_contents and file_put_contents. We have been using MooseFS, but reading the files from our program is slow, and loading times in particular are poor.

So we need to replace the file system. We have a huge number of small files: which distributed file system should we use? Any advice would be appreciated.

I experience the same problem: very low performance with small files (not MooseFS though; I tried Gluster and Google disk storage). I'm curious what you finally ended up with. - dev_null, Dec 09 '20 at 13:51

duffymo · Answer 1 · 2016-08-06T14:28:35.937

Five million entries is small for a relational database. I'd wonder why you feel the need to store these in a file system.

Does every user require that all files be loaded on startup? If yes, I'd wonder about the design of the system. That operation is O(N) no matter how you design it.

If you put those five million small files into a relational or NoSQL database, and then let each user connect to it and query for the particular ones they want, then you eliminate the need to load them repeatedly on startup. Problem solved.
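As a rough illustration of that idea, here is a minimal sketch in Python using SQLite. The table name `user_files`, its columns, and the helper functions are all hypothetical; a real deployment would more likely use a server database and the equivalent PHP calls. It only shows that listing one user's files and fetching a single file become indexed queries rather than directory scans.

```python
import sqlite3


def open_store(db_path=":memory:"):
    """Open (or create) the hypothetical file store."""
    conn = sqlite3.connect(db_path)
    # The composite primary key gives an index on (user_id, path),
    # so per-user listings and single-file lookups avoid full scans.
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS user_files (
            user_id INTEGER NOT NULL,
            path    TEXT    NOT NULL,
            content BLOB    NOT NULL,
            PRIMARY KEY (user_id, path)
        )
        """
    )
    return conn


def put_file(conn, user_id, path, content):
    """Insert or overwrite one file (roughly file_put_contents)."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO user_files (user_id, path, content) "
            "VALUES (?, ?, ?)",
            (user_id, path, content),
        )


def get_file(conn, user_id, path):
    """Return the file's bytes, or None if missing (roughly file_get_contents)."""
    row = conn.execute(
        "SELECT content FROM user_files WHERE user_id = ? AND path = ?",
        (user_id, path),
    ).fetchone()
    return row[0] if row else None


def list_files(conn, user_id):
    """List one user's paths without reading any contents (roughly readdir)."""
    rows = conn.execute(
        "SELECT path FROM user_files WHERE user_id = ? ORDER BY path",
        (user_id,),
    )
    return [r[0] for r in rows]
```

With this layout, the IDE's file list for a user is a single `list_files(conn, user_id)` call, and file contents are fetched only when a file is actually opened.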

score 0 · Answer 2 · answered Sep 01 '16 at 11:38

In any distributed filesystem, network latency between the filesystem's components is one of the most crucial factors for operations on small files: it should be as low as possible (around 0.1 ms). The best way to achieve this is to use a reliable switch and connect all machines to that same switch.
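One way to check whether per-file latency is the bottleneck is to time a directory listing and the individual reads separately. The following is a minimal Python sketch; the function name `time_small_file_reads` is made up for illustration, and the directory you pass in is assumed to sit on the mount you want to test.

```python
import os
import time


def time_small_file_reads(directory):
    """Return (file_count, listing_seconds, mean_read_seconds) for one directory."""
    # Time the directory listing on its own (roughly what readdir costs).
    start = time.perf_counter()
    names = [entry.name for entry in os.scandir(directory) if entry.is_file()]
    listing_seconds = time.perf_counter() - start

    # Time each whole-file read (roughly what file_get_contents costs).
    read_total = 0.0
    for name in names:
        start = time.perf_counter()
        with open(os.path.join(directory, name), "rb") as fh:
            fh.read()
        read_total += time.perf_counter() - start

    mean_read = read_total / len(names) if names else 0.0
    return len(names), listing_seconds, mean_read
```

Running it once against a directory on the distributed mount and once against a copy on local disk gives a rough comparison. Repeated runs may be served from the client's cache, so the first run is usually the more telling one.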

Also, the greatest strength of distributed filesystems (especially MooseFS) is scalability: the more nodes you have (and the more your workload is distributed, i.e. run simultaneously on more than one mount), the faster the cluster is.
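A related client-side trick, which is my own illustration rather than anything specific to MooseFS, is to issue many small reads at the same time so that the per-file round trips overlap instead of adding up. A minimal Python sketch, with hypothetical helper names and an arbitrary default of 16 workers:

```python
from concurrent.futures import ThreadPoolExecutor


def read_file(path):
    """Read one file fully and return its bytes."""
    with open(path, "rb") as fh:
        return fh.read()


def read_many(paths, max_workers=16):
    """Read files concurrently and return {path: bytes}."""
    paths = list(paths)
    # Threads spend most of their time waiting on I/O, so the waits overlap.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        contents = list(pool.map(read_file, paths))
    return dict(zip(paths, contents))
```

Whether this helps, and what worker count is sensible, depends on the cluster and should be measured rather than assumed.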

If you use MooseFS, please check out MooseFS 3.0, because operations on small files have been improved since version 3.0. This is the easy route for now, because you don't have to make a "revolution" (before upgrading, remember to back up /var/lib/mfs on the Master Server, i.e. the metadata). MooseFS can handle small files well, so maybe there is a problem in the configuration?
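The backup step mentioned above could be sketched as a plain archive of the metadata directory. This is a generic Python example, not a MooseFS tool: the function name, the archive naming scheme, and the destination path in the usage note are assumptions, and whether the Master Server should be stopped before copying is something to confirm in the MooseFS documentation.

```python
import os
import tarfile
import time


def backup_directory(src_dir, dest_dir):
    """Archive src_dir into a timestamped .tar.gz under dest_dir; return its path."""
    os.makedirs(dest_dir, exist_ok=True)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    archive_path = os.path.join(dest_dir, f"mfs-metadata-{stamp}.tar.gz")
    with tarfile.open(archive_path, "w:gz") as tar:
        # Store entries under the directory's own name (e.g. "mfs/...").
        tar.add(src_dir, arcname=os.path.basename(os.path.normpath(src_dir)))
    return archive_path
```

For example, `backup_directory("/var/lib/mfs", "/root/backups")` would write an archive of the metadata directory into the hypothetical `/root/backups` location.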

Additionally, in MooseFS (still considering small-file operations), one of the most important things is for the Master Server to have a high CPU clock (e.g. 3.7 GHz) with a small number of CPU cores, and to have energy-saving options disabled in the BIOS, because the Master Server is a single-threaded process. For Chunkservers and Clients the situation is different: they are multi-threaded, so you'll get better results with multicore CPUs.

Additionally, as stated in the MooseFS Best Practices, paragraph 4, "Virtual Machines and MooseFS":

[...] we do not recommend running MooseFS components (especially Master Server(s)) on Virtual Machines.

So if you run MooseFS on VMs, you may in fact get poor results.
