
I am preparing for system design interviews, and since I have little experience with this topic, I bought the "Grokking the System Design Interview" course from educative.io, which was recommended by several websites. Although I have read it, there are several things I did not manage to understand, so it would be helpful if someone could answer my questions.

  1. Since I have no experience with NoSQL, I find it difficult to choose the proper database system. Several times the course gives no reasoning for why it chose one database over another. For example, in the chapter "Designing Youtube or Netflix" the authors chose MySQL as the database with no explanation. In the same chapter we have the following non-functional requirement:

"The system should be highly available. Consistency can take a hit (in the interest of availability); if a user doesn’t see a video for a while, it should be fine."

Following the above hint, taking into account the size of the system, and applying the material from the "CAP theorem" chapter, it seems to me that Cassandra or CouchDB would be a better choice. What am I missing here?

The same question applies to "Designing Facebook’s Newsfeed".

  2. Is the CAP theorem still applicable?

What I mean is: according to the "CAP theorem" chapter, HBase is good at consistency and partition tolerance, but according to the HBase documentation, it also supports high availability since version 2.X. So it seems to me that it is a one-size-fits-all solution for database storage, which goes against the CAP theorem, unless something was sacrificed for HA. What am I missing here?

  3. The numbers for how much RAM/storage/bandwidth a single machine can handle are somewhat inconsistent across the course; I guess they are outdated. What are the current numbers for (a) regular computers and (b) modern servers?

  4. Almost every chapter has a part called "Capacity Estimation and Constraints", but what is calculated there changes from chapter to chapter. Sometimes only storage is calculated, often bandwidth too, sometimes QPS is added, and sometimes there are task-specific metrics. How do I know what I should calculate for a specific task?

Thanks in advance!

suho
  • These may be good individual questions - you should really re-ask each of them individually, unless you can explain how they are inextricably linked (it doesn't seem that way to me). You could edit this question down to just one of them. – StayOnTarget Jan 15 '20 at 21:44

1 Answer


Each database is different and fulfills different requirements. I recommend you read the Dynamo paper, familiarize yourself with the rest of the terminology used in it (two-phase locking, leader/follower, multi-leader, async/sync replication, quorums), and learn what guarantees the different databases provide. Now to the questions:

  1. MySQL can be configured to prioritize availability over consistency through its asynchronous replication model (the leader doesn't wait for acknowledgement from its followers before committing a write; if the leader crashes before the data has propagated to the followers, that data is lost), so it is one of the suitable options here.

  2. According to the HBase documentation, HBase guarantees strong consistency, even at the cost of availability. The promise of high availability is for reads, not for writes, i.e. you can read (possibly stale) data while the rest of the system recovers from a failure, until it can accept writes again. From the documentation:

    because of this single homing of the reads to a single location, if the server becomes unavailable, the regions of the table that were hosted in the region server become unavailable for some time.

    Since all writes still have to go through the primary region, the writes are not highly-available (meaning they might block for some time if the region becomes unavailable).

  3. The numbers used are estimates made by the candidate, i.e. you decide the specs of a single hypothetical server and how many such servers you would need in order to scale and meet the storage/throughput requirements.

  4. You don't know in advance. You can make a guess based on the requirements (e.g. whether it's a data storage system, a streaming service, etc.), but I wouldn't recommend it. Instead, ask the interviewer which area they are interested in and make estimates for that. The interview, especially the system design part, is a discussion, so don't follow a template to the letter. Recognize the different areas of the system you could tackle, and approach them based on the interviewer's interest.
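
To make points 3 and 4 concrete, here is a minimal back-of-the-envelope sketch in Python. Everything in it is an assumption for illustration only: the `ServerSpec` fields, the `estimate` helper, and all the numbers are hypothetical and come neither from the course nor from any real hardware. In an interview you would pick your own figures and state them out loud.

```python
import math
from dataclasses import dataclass

SECONDS_PER_DAY = 24 * 60 * 60


@dataclass
class ServerSpec:
    """Specs of one hypothetical server; the candidate picks these numbers."""
    storage_tb: float
    max_qps: float
    bandwidth_gbps: float


def estimate(daily_requests, avg_object_mb, new_objects_per_day,
             retention_days, peak_factor, server):
    """Rough estimates for QPS, storage, egress bandwidth and server count."""
    avg_qps = daily_requests / SECONDS_PER_DAY
    peak_qps = avg_qps * peak_factor
    # MB -> TB using decimal units (1 TB = 1,000,000 MB)
    storage_tb = new_objects_per_day * avg_object_mb * retention_days / 1_000_000
    # MB/s -> Gbit/s (8 bits per byte, 1,000 megabits per gigabit)
    egress_gbps = peak_qps * avg_object_mb * 8 / 1_000
    # The tightest resource determines how many servers are needed.
    servers = max(
        math.ceil(storage_tb / server.storage_tb),
        math.ceil(peak_qps / server.max_qps),
        math.ceil(egress_gbps / server.bandwidth_gbps),
    )
    return {
        "avg_qps": avg_qps,
        "peak_qps": peak_qps,
        "storage_tb": storage_tb,
        "egress_gbps": egress_gbps,
        "servers": servers,
    }


if __name__ == "__main__":
    # Made-up workload and server, purely for illustration.
    server = ServerSpec(storage_tb=10, max_qps=500, bandwidth_gbps=10)
    print(estimate(
        daily_requests=86_400_000,
        avg_object_mb=0.5,
        new_objects_per_day=1_000_000,
        retention_days=365,
        peak_factor=2,
        server=server,
    ))
```

With these made-up inputs, storage is the bottleneck rather than QPS or bandwidth. That kind of observation is what the estimation step is for: it tells you and the interviewer which dimension deserves the discussion.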

rhytonix