I don't understand this passage from the Google File System paper:
> A small file consists of a small number of chunks, perhaps just one. The chunkservers storing those chunks may become hot spots if many clients are accessing the same file.
What difference does the file being small make? Aren't large files that are accessed by many clients equally likely to cause problems?
Here is what I've thought about and read so far:
- I assume (correct me if I'm wrong) that the chunks of a large file are stored on different chunkservers, thereby distributing load. In that scenario, say 1000 clients each read the file, fetching 1/100th of it from each of 100 chunkservers. Each chunkserver still ends up getting 1000 requests. Isn't that the same as 1000 clients accessing a single small file? Either way, a server gets 1000 requests, whether they are for a small file or for parts of a larger one. (See the toy sketch after this list.)
- I read a little about sparse files. According to the paper, small files fill up one chunk or a few chunks. So, to my understanding, small files don't need to be reconstructed from scattered pieces, and I've therefore eliminated sparseness as the probable cause of the hot spots.
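
To make my counting concrete, here is a toy Python sketch of the scenario in the first bullet. The constants and function names are purely my own illustration; it assumes each chunk of the large file sits on a different chunkserver and ignores replication:

```python
from collections import Counter

# Toy model of my counting argument. All names and numbers here
# (NUM_CLIENTS, CHUNKS_IN_LARGE_FILE, ...) are my own illustration,
# not anything from the paper; replication is ignored.
NUM_CLIENTS = 1000
CHUNKS_IN_LARGE_FILE = 100  # assume each chunk lives on its own chunkserver
CHUNKS_IN_SMALL_FILE = 1    # small file fits in a single chunk

def requests_per_chunkserver(num_chunks):
    """Every client reads the whole file once, sending one read
    request per chunk to the chunkserver holding that chunk."""
    load = Counter()
    for _client in range(NUM_CLIENTS):
        for chunk_id in range(num_chunks):
            load[chunk_id] += 1
    return load

large_file_load = requests_per_chunkserver(CHUNKS_IN_LARGE_FILE)
small_file_load = requests_per_chunkserver(CHUNKS_IN_SMALL_FILE)

# Both print 1000: each chunkserver sees the same number of requests
# whether it holds the only chunk of a small file or one chunk of a big one.
print(max(large_file_load.values()))
print(max(small_file_load.values()))
```

By this counting, the per-chunkserver request total looks identical in both cases, which is exactly why I don't see why the paper singles out small files.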