
I don't understand this from the Google File System paper:

A small file consists of a small number of chunks, perhaps just one. The chunkservers storing those chunks may become hot spots if many clients are accessing the same file.

What difference does a small file make? Aren't large files being accessed by many clients equally likely to cause problems?

I've thought / read the following:

  • I assume (correct me if I'm wrong) that the chunks of large files are stored on different chunkservers, thereby distributing load. In such a scenario, say 1000 clients access 1/100th of the file from each chunkserver. So each chunkserver inevitably ends up getting 1000 requests. (Isn't it the same as 1000 clients accessing a single small file? The server gets 1000 requests for the small file, or 1000 requests for parts of a larger file. See the sketch after this list.)
  • I read a little about sparse files. According to the paper, small files fill up one chunk or a few chunks, so to my understanding small files aren't reconstructed, and I've eliminated this as a probable cause of the hot spots.
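To make the first bullet concrete, here's a rough back-of-envelope sketch in Python (the 1000-client / 100-chunkserver numbers are hypothetical) of why the totals look identical to me:

    CLIENTS = 1000
    CHUNKSERVERS = 100  # a large file spread across 100 chunkservers

    # Large file: every client eventually reads every chunk, so each of the
    # 100 chunkservers serves 1000 requests in total over the whole read.
    total_per_server_large = CLIENTS

    # Small file: its single chunk lives on one chunkserver, which likewise
    # serves 1000 requests in total.
    total_per_server_small = CLIENTS

    print(total_per_server_large == total_per_server_small)  # True - totals match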
Abhirath Mahipal
  • "In such a scenario say 1000 clients access 1/100th of the file from each chunkserver. So each chunkserver inevitably ends up getting 1000 requests." Can you expand on your thoughts here more? If a client accesses 1/100th of a file, only 1/100th of the chunkservers will be contacted per client. The idea the paper is getting at is that for large files the access pattern is effectively a random distribution across chunkservers, *but not all at once*. – GManNickG Oct 05 '17 at 04:25
  • @GManNickG A large file is stored on 100 chunkservers. 1000 clients need that particular file. All of them would eventually need data from the 100 chunkservers, so each chunkserver will invariably end up serving 1000 clients. Even if there was a random distribution, wouldn't a single request made by each client equal the load generated by a small file? More importantly, are parts of large files stored on different chunkservers? – Abhirath Mahipal Oct 05 '17 at 04:29
  • Gotcha. In your scenario all the chunkservers will eventually serve their chunk 1000 times, yes, but the instantaneous load will be lower. 1000 clients asking a single server for data at once is a hot spot; 1000 clients over 100 chunkservers means your instantaneous load on any server is lower, assuming clients aren't just contacting every chunkserver at the same time. However, I believe the intended interpretation of the paper's point is that in practical applications, not all clients end up reading the entire file at all, in which case a chunkserver just handles (e.g.) one request. – GManNickG Oct 05 '17 at 04:35
  • @GManNickG Thanks, that makes sense, especially "not all clients end up reading the entire file at all". Perhaps you could copy all this to an answer :) – Abhirath Mahipal Oct 05 '17 at 05:23

1 Answer


Some of the subsequent text in the paper can help clarify:

However, hot spots did develop when GFS was first used by a batch-queue system: an executable was written to GFS as a single-chunk file and then started on hundreds of machines at the same time. The few chunkservers storing this executable were overloaded by hundreds of simultaneous requests. We fixed this problem by storing such executables with a higher replication factor and by making the batch-queue system stagger application start times. A potential long-term solution is to allow clients to read data from other clients in such situations.

If 1000 clients want to read a small file at the same time, the N chunkservers holding its only chunk will receive 1000 / N simultaneous requests. This sudden load is what's meant by a hot spot.
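As a minimal sketch of that arithmetic (1000 clients is the number from your scenario; 3 is the paper's default replication factor):

    clients = 1000
    replicas = 3  # GFS replicates each chunk 3 times by default

    # All clients hit the chunk's replicas at the same instant, so each
    # replica's chunkserver sees roughly clients / replicas concurrent requests.
    simultaneous_per_server = clients // replicas
    print(simultaneous_per_server)  # 333 simultaneous requests per chunkserver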

Large files aren't going to be read all at once by a given client (after all, they are large). Instead, they're going to load some portion of the file, work on it, then move on to the next portion.

In a sharding (MapReduce, Hadoop) scenario, workers may not even read the same chunks at all; each of N clients reads a distinct 1/N of the file's chunks.
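Roughly, the split looks like this (a hypothetical round-robin assignment; real MapReduce input splitting is more involved):

    num_chunks = 100  # chunks in the large file
    num_workers = 10  # N workers sharing the read

    # Worker w reads chunks w, w + N, w + 2N, ... - disjoint from the others.
    assignment = {w: list(range(w, num_chunks, num_workers))
                  for w in range(num_workers)}

    # Every chunk is read by exactly one worker, so no chunkserver is hit
    # more than once per pass over the file.
    covered = sorted(c for chunks in assignment.values() for c in chunks)
    assert covered == list(range(num_chunks))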

Even in a non-sharding scenario, in practice clients will not be perfectly synchronized. They may all end up reading the entire file, but with a random access pattern, so statistically there is no hotspotting. Or, if they do read it sequentially, they will drift out of sync because of differences in workload (unless you're purposefully synchronizing the clients... but don't do that).
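You can see the effect with a toy simulation (hypothetical numbers again): if 1000 unsynchronized clients each happen to be at a random position in a 100-chunk file at a given instant, no chunk is anywhere near 1000 concurrent readers.

    import random
    from collections import Counter

    CLIENTS, CHUNKS = 1000, 100
    random.seed(0)  # deterministic toy run

    # Snapshot of one instant: each client is currently reading some chunk.
    positions = [random.randrange(CHUNKS) for _ in range(CLIENTS)]
    load = Counter(positions)

    print(max(load.values()))  # on the order of 10-20 concurrent reads, not 1000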

So even with lots of clients, larger files see less hotspotting due to the nature of the work that large files entail. It's not guaranteed, which I think is what you're saying in your question, but in practice distributed clients won't work in tandem on every chunk of a multi-chunk file.

GManNickG
  • Say a large number of clients access different files on the same chunkserver - will it become a hot spot? (I essentially want to know whether the issue is caused by access to the same region of the hard disk or by the increased load.) – Abhirath Mahipal Oct 06 '17 at 00:27
  • It's never formally defined, but the term hot spot is usually used to refer to any single object that causes high load. So "this file/chunk/banana/shoe is a hot spot" just means "this thing is causing higher than usual load". Different files having chunks that happen to reside on the same chunkserver wouldn't normally be considered a hotspot - that's just regular load on the system. – GManNickG Oct 06 '17 at 00:56
  • The issue with hotspotting isn't necessarily one thing; maybe the network interface to the machine gets overloaded, maybe the bandwidth on the machine cannot keep up with the requests, etc. Remember, this chunkserver is being shared among all clients that want any chunk it has, so hotspotting could simply be "this chunk is taking too many resources away from other chunk accesses". – GManNickG Oct 06 '17 at 00:56
  • Thanks a ton for your input :) – Abhirath Mahipal Oct 06 '17 at 01:32