
As per this article, every cluster has its own storage.

"A cluster hosts millions of users (how many depends on the age of the hardware) and is a self-contained set of servers including: Frontend servers – Servers that check for viruses and host the code that talks to your browser or mail client, using protocols such as POP3 and DeltaSync. Backend servers – SQL and file storage servers, spam filters, storage of monitoring and spam data, directory agents and servers handling inbound and outbound mail. Load balancers – Hardware and software used to distribute the load more evenly for faster performance."

I am guessing that the cluster a user gets assigned to is decided by geography (IP address). In that case, if I send myself an email from Germany and then check my email after I arrive in the US, I would be assigned to different clusters (and hence different SQL databases). So for me to be able to see that email in the US, does that mean all the databases in all the clusters are constantly synchronized?

developer747

1 Answer


Geography is most likely how you are assigned to a cluster (think of it like a content delivery network). I think you're right on with that assumption.

Of course, I cannot say for certain how this all works, but based on my experience with other large-scale providers, my thoughts are as follows:

The emails are stored redundantly within a cluster (so the loss of a machine or hard drive means nothing), and the clusters themselves are (probably) also replicated to a geographically separate location, making large-scale outages and disasters less impactful for the end user. This push/pull happens constantly within the data cluster (think of a file system like HDFS) to ensure n-level redundancy. Because the chances of you logging into a system in a different cluster within any given hour are minimal, there isn't a huge need to sync the information across clusters in real time (in terms of availability). It probably happens on the order of minutes/hours, and as fast as their machines can run to ensure data durability.
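To make the "n-level redundancy" idea concrete, here is a toy sketch in Python. It is purely illustrative and is not how Hotmail (or HDFS) actually places data: the `Cluster` class, the node names, and the hash-based placement are all my own assumptions. It only shows that writing each message to n distinct nodes lets a read survive the loss of n-1 of them.

```python
import hashlib


class Cluster:
    """Toy cluster (hypothetical): each message is copied to n distinct nodes."""

    def __init__(self, node_names, n_replicas=3):
        if n_replicas > len(node_names):
            raise ValueError("need at least n_replicas nodes")
        # Each node is modelled as a plain dict: message_id -> body.
        self.nodes = {name: {} for name in node_names}
        self.n_replicas = n_replicas

    def replica_nodes(self, message_id):
        # Assumed placement rule: hash the id to pick a starting node,
        # then take the next n nodes in sorted (ring) order.
        names = sorted(self.nodes)
        digest = hashlib.sha256(message_id.encode()).hexdigest()
        start = int(digest, 16) % len(names)
        return [names[(start + i) % len(names)] for i in range(self.n_replicas)]

    def store(self, message_id, body):
        # Write the message to every replica node.
        for name in self.replica_nodes(message_id):
            self.nodes[name][message_id] = body

    def read(self, message_id, failed=()):
        # Return the first copy found on a node that has not failed.
        for name in self.replica_nodes(message_id):
            if name not in failed and message_id in self.nodes[name]:
                return self.nodes[name][message_id]
        return None


cluster = Cluster(["node-a", "node-b", "node-c", "node-d"], n_replicas=3)
cluster.store("msg-1", "hello")
replicas = cluster.replica_nodes("msg-1")
# Even with two of the three replica nodes down, the message is still readable.
print(cluster.read("msg-1", failed=replicas[:2]))
```

A real system would also re-replicate data from a failed node onto a healthy one, which this sketch leaves out.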

The cluster setup is probably similar to Amazon's: east and west coast clusters, a European cluster, and, depending on where a lot of the other users are, an Asian cluster (or two or three). The push/pull of data between these probably isn't on the scale of minutes, but maybe hours.
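The delayed push/pull between regional clusters could look something like the sketch below. Again, this is a guess at the general pattern rather than a description of the real system: `RegionalCluster`, `sync`, and the idea of a per-cluster `pending` list are hypothetical names I made up for illustration. The point is that a message delivered in one region only becomes visible in another after a batch sync pass has run.

```python
class RegionalCluster:
    """Toy regional cluster (hypothetical) with its own mailbox storage."""

    def __init__(self, name):
        self.name = name
        self.mailboxes = {}  # user -> list of messages stored in this region
        self.pending = []    # local writes not yet pushed to other regions

    def deliver(self, user, message):
        # A new message is stored locally first and queued for later sync.
        self.mailboxes.setdefault(user, []).append(message)
        self.pending.append((user, message))

    def inbox(self, user):
        return list(self.mailboxes.get(user, []))


def sync(clusters):
    """One batch pass: push every cluster's pending writes to all its peers."""
    for source in clusters:
        for user, message in source.pending:
            for target in clusters:
                if target is not source:
                    # Copies go straight into storage, not into the target's
                    # pending list, so they are not propagated a second time.
                    target.mailboxes.setdefault(user, []).append(message)
        source.pending.clear()


eu = RegionalCluster("europe")
us = RegionalCluster("us-east")

eu.deliver("me", "hi from Germany")
print(us.inbox("me"))  # nothing yet: the sync pass has not run

sync([eu, us])         # in practice this would run on a schedule
print(us.inbox("me"))  # the message is now visible in the US cluster
```

How often `sync` runs is the knob being discussed here: the longer the interval, the longer the window in which the two regions disagree.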

All of this redundancy and synchronization is important to keep in mind for other services too: as the article you mentioned points out, services such as SkyDrive and Messenger all share this same infrastructure.

Mike