
Let's say I have a table (let's call it BigTable) which could experience 5,000,000 INSERTs per day (with possibly just as many SELECTs). Each row inserted is about 50KB.

These daily INSERTs are split across 5 clients equally (the table has a FK called ClientID). There is never a need to SELECT or JOIN data across multiple clients.

I am worried about the database performance as this table grows, so I have come up with two solutions.

SOLUTION 1:

  • Partition BigTable by ClientID (a rough T-SQL sketch follows at the end of this section)
  • Store each partition on a separate hard disk on the server (using Azure blob storage).
  • Move all data which is more than 1 month old (archive data, which still needs to be queryable) into another set of READONLY partitions.

Essentially this means the following partitions on their own storage devices:

  • Primary (all data excluding BigTable)
  • ClientA's BigTable (5,000,000 rows per day / 5 clients x 30 days = 30,000,000 rows)
  • ClientB's BigTable (30,000,000 rows)
  • ClientC's BigTable (30,000,000 rows)
  • ClientD's BigTable (30,000,000 rows)
  • ClientE's BigTable (30,000,000 rows)
  • ClientA's BigTable archive
  • ClientB's BigTable archive
  • ClientC's BigTable archive
  • ClientD's BigTable archive
  • ClientE's BigTable archive

The number of rows in the archive tables will be roughly (5,000,000 x age of DB in days) - 150,000,000 (the rows still held in the active partitions). This is still a huge amount of data, but it will only be used for drawing up the odd report.

SQL Server will be hosted on a 14GB, 8-core Azure VM.
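For reference, a minimal T-SQL sketch of what the ClientID partitioning might look like (the filegroup names and BigTable columns are placeholders, and the filegroups are assumed to already exist on the separate data disks):

    -- Assumed filegroups fg_ClientA .. fg_ClientE, each created on its own data disk beforehand.
    -- RANGE LEFT with boundaries 1..4 gives five partitions, one per ClientID (1..5).
    CREATE PARTITION FUNCTION pfClient (int)
    AS RANGE LEFT FOR VALUES (1, 2, 3, 4);

    CREATE PARTITION SCHEME psClient
    AS PARTITION pfClient
    TO (fg_ClientA, fg_ClientB, fg_ClientC, fg_ClientD, fg_ClientE);

    -- Placeholder columns; the partitioning column must be part of the clustered key.
    CREATE TABLE dbo.BigTable
    (
        BigTableID bigint IDENTITY(1,1) NOT NULL,
        ClientID   int            NOT NULL,
        CreatedUtc datetime2      NOT NULL,
        Payload    varbinary(max) NULL,   -- the ~50KB row body
        CONSTRAINT PK_BigTable PRIMARY KEY CLUSTERED (ClientID, BigTableID)
    )
    ON psClient (ClientID);

The monthly archive would then have to be handled separately (for example per-client archive tables on their own READONLY filegroups), since a table can only be partitioned on one column.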

SOLUTION 2:

The other option is to host a separate database for each client. This means each will have its own dedicated SQL Server machine. Partitioning will still happen for archive data.

This option is not optimal because of the physical separation of the data. Having to manage updates to multiple databases could be very problematic. Having separate database connections for each client will also be a consideration for the developers.

Could anyone perhaps advise on these options?

Dave New
  • I've experienced the database per client solution before and have no complaints. As the number of clients increases, it'll be easier to move them onto separate servers if needed (due to performance or space issues). There are some really good answers here: http://stackoverflow.com/questions/8704588/one-database-per-client-or-all-clients-in-one-database-which-one-should-i-use-f – jdl Feb 18 '13 at 14:42
  • So you would have separate tables per ClientID, rather than partitioning by ClientID, and then partition each of those by some datetime field. This would work fine. You could then create a partitioned view on top of your tables to give the illusion of a single client table (see the sketch below these comments). – muhmud Feb 18 '13 at 15:59
  • You should seriously consider engaging Microsoft directly for this project. SQL Server works best with low latency storage. Windows Azure Blob storage may not have low enough latency to be usable in this context. – StrayCatDBA Feb 18 '13 at 16:48
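
To illustrate the partitioned-view approach from muhmud's comment, a minimal sketch (the per-client table names and columns are hypothetical; in this design the view stands in for the single BigTable):

    -- One table per client; the CHECK constraint on ClientID is what lets the
    -- optimizer prune the view down to a single table for single-client queries.
    CREATE TABLE dbo.BigTable_ClientA
    (
        BigTableID bigint         NOT NULL,
        ClientID   int            NOT NULL CHECK (ClientID = 1),
        CreatedUtc datetime2      NOT NULL,
        Payload    varbinary(max) NULL,
        CONSTRAINT PK_BigTable_ClientA PRIMARY KEY CLUSTERED (ClientID, BigTableID)
    );
    -- ...repeat for BigTable_ClientB .. BigTable_ClientE, each CHECKed to its own ClientID.

    CREATE VIEW dbo.BigTableAll
    AS
    SELECT BigTableID, ClientID, CreatedUtc, Payload FROM dbo.BigTable_ClientA
    UNION ALL
    SELECT BigTableID, ClientID, CreatedUtc, Payload FROM dbo.BigTable_ClientB;
    -- ...UNION ALL the remaining client tables.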

3 Answers


Since you have tagged this with [azure] and [sql-server], I assume that you are trying to do this in Windows Azure. If that is the case then a) partitioning by client is not necessarily a good idea, and b) SQL may not be the best (complete) fit for your problem.

When building scalable architectures, the partitioning strategy shouldn't be based on something specific like 'client', but rather something more arbitrary. The reason is simple — unless clients have a reason to be separate, such as not wanting their data mixed with others, or different SLAs per client, then the choice of 'client' as a partition may not render the optimal result. If 80% of your business is generated by a single client, you have not solved your problem, and still have to maintain n separate databases for marginal load.

5 mil database inserts per day is not a big number, but may be a big number for SQL Server hosted in Azure IaaS, or Azure SQL Database — due to the performance of the underlying commodity hardware. Before determining how to partition SQL, ask yourself two questions. First, what are the usages and performance characteristics that you want from the data? (Does it have to be immediately consistent? Can you process data asynchronously?) Secondly, have you mapped those characteristics against other data store technologies? Have you considered Table Storage (or non-MS solutions like Redis)?

You may find, after trying out a few options that:

  • SQL is a good store for some of the data, some of the time.
  • Much of the processing can be done asynchronously, so the peak performance of inserts is of little concern (and doing 5 mil inserts over a 24-hour period is not a problem).
  • SQL may not be suited to long-term storage.
  • Querying of older data can be done effectively using map-reduce, rather than SQL queries.

For example, I have an app that tracks vehicles at one-second intervals. It is targeted at 100,000 vehicles but architected in such a way as to be able to scale up to millions without changing any code or databases. In the medium term, it has to cope with 72 mil inserts per day. All of this runs on a single Windows Azure SQL database that is less than 10GB, plus a whole bunch of table storage.

The reason this works is that although I want to archive all data (72 mil rows), I don't need complex SQL query access to it, so it sits happily in table storage. What I store in SQL is a summary of the data. In my example, I am only interested in a vehicle's journey (start and end position, distance travelled, etc.), which means that I only need two or three rows per vehicle per day in SQL, greatly reducing the load on the database.

Also, my bottleneck is in the collection of data, so I add the data immediately to a (Windows Azure) queue and worry about the summarising of data in a separate workload.
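As a rough illustration of how small that SQL footprint is, a guessed shape for such a journey-summary table (the names and columns here are assumptions, not the actual schema):

    -- Two or three rows like this per vehicle per day, instead of one row per second.
    CREATE TABLE dbo.VehicleJourney
    (
        VehicleID  int          NOT NULL,
        StartedUtc datetime2    NOT NULL,
        EndedUtc   datetime2    NOT NULL,
        StartLat   decimal(9,6) NOT NULL,
        StartLon   decimal(9,6) NOT NULL,
        EndLat     decimal(9,6) NOT NULL,
        EndLon     decimal(9,6) NOT NULL,
        DistanceKm decimal(9,2) NOT NULL,
        CONSTRAINT PK_VehicleJourney PRIMARY KEY CLUSTERED (VehicleID, StartedUtc)
    );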

This answer may be a bit long, but it is intended to get you to think about your data model more carefully, rather than just thinking about how to solve the problem with SQL. For more detail, have a look at the data model in CALM.

Simon Munro
  • Great answer - thanks. Yes, we are using Azure. I have had a brief look at Azure Table Storage. My main concern is data processing. For example, we might need to get the last 100 rows for a client with some simple criteria (maybe where IsActive=1). We do this in Entity Framework. Should I perhaps wire up calls to this 'table' in EF so that it fetches the data from Azure Table Storage under the hood? This way the developer will be abstracted away from having to specifically deal with fetching data from Azure Table Storage. – Dave New Feb 19 '13 at 10:02
  • Note that the data processing on this table is minimal (very very few JOINs occur). – Dave New Feb 19 '13 at 10:03
  • Be careful architecting with an ORM (EF) as the starting point. Start with your data store, and then look at data access. If you have scalability concerns EF is all but guaranteed to paint you into a corner that you cannot get out of. Table storage takes some work to figure out the best partition/row key - depending on usage. Your point that there are few joins means that a key-value store is probably a good fit. – Simon Munro Feb 19 '13 at 11:04
  • @davenewza I also thought of using table storage to store long-term metrics data, but how do you manage metrics with different frequencies? Do you use multiple tables? I think that Table Storage is the Azure storage service with the best price/value ratio for this usage. – Andrea Cattaneo Sep 18 '18 at 21:53

A few ideas for you: 1) Use Azure Tables instead of SQL. Have PartitionKey = ClientID. Each table can be 200TB and support 20k IOPS, and each partition 2k IOPS; since the clients would be logically separated, you would get a natural load balance (Azure load balances by partition). This would also save you from having to run/manage an XL VM 24x7 (i.e. much cheaper). The storage costs would be the same, since the data drive for the VM is backed by Azure storage anyway. 5M inserts per day is only ~60/sec, so there would be a lot of headroom for growth. This is especially true because you are doing fairly simplistic INSERT/SELECT operations and would not cross client boundaries.

2) If you want to do a per-client DB, I would go with SQL Azure. The provisioning is much faster and each DB is a separate scale unit (this would prevent one client from creating an issue for others). You can also change each DB dynamically as that client's needs change.

3) If you want a single monolithic DB, I would go with SQL Server on a VM. Create multiple Data drives and mount them as a striped set. For an XL VM, you can have up to 16 drives mapped. This will limit the maximum size of the DB to 16TB, so you would need to have some mechanism for aging out/archiving/garbage collecting.
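
As a variation on the striped set, SQL Server can also stripe at the file level by putting one data file per attached disk into a single filegroup; a minimal sketch, assuming a hypothetical database name, drive letters and sizes:

    -- One file per mounted Azure data disk; SQL Server spreads allocations
    -- across the files in the filegroup, so I/O is distributed over all disks.
    ALTER DATABASE BigDb ADD FILEGROUP fg_data;

    ALTER DATABASE BigDb
    ADD FILE (NAME = N'BigDb_data1', FILENAME = N'F:\Data\BigDb_data1.ndf', SIZE = 100GB)
    TO FILEGROUP fg_data;

    ALTER DATABASE BigDb
    ADD FILE (NAME = N'BigDb_data2', FILENAME = N'G:\Data\BigDb_data2.ndf', SIZE = 100GB)
    TO FILEGROUP fg_data;
    -- ...repeat for each of the (up to 16) attached data disks.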

Pat Filoteo
  • I wouldn't use the ClientID as the PartitionKey, as that does not guarantee an even distribution; I prefer something like the last 1 or 2 digits of the client ID for partition sharding. If you want extreme performance, add another layer on top that splits those into groups, and each group uses a different storage account. – JTtheGeek Jul 29 '13 at 18:59
  • Actually, that would lead to a lower ultimate scale point than using the client ID (assuming >100 clients) unless you used the full 20k IOPS of an account. The reason is that each partition can be individually scaled from the table; so it is OK to have one hot partition and 'n' cold ones. What you absolutely want to avoid is an append only scenario (i.e. always writing to the last partition/last row). This effectively precludes the load balancing the table gives you. To avoid this, use a client ID that is non-incremental (e.g. GUID). – Pat Filoteo Jul 29 '13 at 21:32
  • Assuming the client id is a Guid, using the last 2 chars creates 256 partitions; need more... use the last 3 = 16^3 = 4096 partitions, MORE?! last 4 = 65k partitions. We build a dictionary on startup mapping N digits into X partitions for fast lookup and insert. It's fairly easy to add 1 more layer to map subsets of those partitions into different storage accounts. Queries use Tasks to execute across the partitions in parallel and Task.Wait to reaggregate. We easily hit the IO threshold. – JTtheGeek Jul 31 '13 at 00:24

You need to think not only about access performance but also about disaster recovery performance. At roughly 1.5TB of new data per client for the active month alone (30 million rows at ~50KB each), I strongly recommend keeping the clients in separate databases.

If you have decent Continuous Integration and an automated deployment process, keeping the database schemas in sync should not be that big of a problem.

Sebastian Meine