
At present we send emails to subscribers on our mailing list. Each email contains a tracking pixel that points to a PHP file, which then inserts a row into our SQL Server database.

The current average for email tracking pixel inserts is about 80,000 a day, and about 800,000 inserts equates to about 1GB of hard disk space, so roughly 1GB every 10 days.

As well as this, we have other inserts and tracking data going into the same SQL Server database that the websites use. For space, performance, licensing cost and horizontal scaling reasons I want to move this analytic tracking data away from the SQL Server DB. The website doesn't need this tracking data, so moving these write-heavy inserts out would leave the website DB serving only the website.

Table Structure at the moment

TrackingPixelId | UserId | Code | Medium | Source | DateViewed | SessionId

9109616 | 1234 | 'BULLETIN120115' | 'email' | 'BULLETIN120115' | {datetime} | bf7e2f801...

Column Info

TrackingPixelId: PK, integer, auto-increment

UserId: integer

Code, Medium and Source: strings/varchars

DateViewed: DateTime, e.g. 2015-01-13 06:18:24.920

SessionId: e.g. fa5cac87896e1c7b423051fffdb836a6

Code and Source are essentially the same thing: the unique ID of the email mail-out that was opened.
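For reference, here is a rough sketch of how one of these rows might look as a document if this data moved to MongoDB (pymongo; the `analytics` database and `email_tracking` collection names are just placeholders):

```python
from datetime import datetime
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # placeholder connection string
db = client["analytics"]                           # hypothetical database name

# One tracking-pixel row expressed as a document. MongoDB supplies the _id
# (an ObjectId) automatically, taking the place of the auto-increment TrackingPixelId.
doc = {
    "userId": 1234,
    "code": "BULLETIN120115",
    "medium": "email",
    "source": "BULLETIN120115",
    "dateViewed": datetime(2015, 1, 13, 6, 18, 24, 920000),
    "sessionId": "fa5cac87896e1c7b423051fffdb836a6",
}
db.email_tracking.insert_one(doc)
```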

In terms of reporting, we will be looking at the volume of opened emails: daily, weekly, monthly and yearly reports, and the reports won't need to be generated instantly. So the workload is lots of writes per day but only a handful of reads, probably wanting the most recent data first.

So considering all of these factors what would be the best shard key?

mrjamesmyers

1 Answer


First of all, I would caution you not to jump into sharding right away. With your data volume, you should be able to have a single replica set handle your traffic. The real trigger point for sharding is when the working set becomes larger than is reasonable for the RAM on a single machine.
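To put the numbers from the question in perspective, here's a back-of-the-envelope growth estimate (a sketch assuming the insert rate stays constant; adjust for your actual row size):

```python
# Rough growth estimate from the figures in the question.
inserts_per_day = 80_000
rows_per_gb = 800_000            # "about 800,000 inserts equates to about 1GB"
bytes_per_gb = 1024 ** 3

bytes_per_row = bytes_per_gb / rows_per_gb   # ~1.3 KB per row
gb_per_day = inserts_per_day / rows_per_gb   # ~0.1 GB/day
gb_per_year = gb_per_day * 365               # ~36.5 GB/year

print(f"{bytes_per_row:.0f} bytes/row, {gb_per_day:.2f} GB/day, {gb_per_year:.1f} GB/year")
```

At roughly 36-37 GB a year, even several years of data can sit comfortably on one replica set as long as the indexed working set stays in RAM.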

That said, given that you are mostly concerned with write performance and will tolerate slower reads for reporting, I think a hashed shard key will work best. You could hash some unique id or hash a MongoDB-generated ObjectId value. This will guarantee that writes are spread evenly through the cluster. Reads will be scatter-gather, but it sounds like that is tolerable to get very good write scaling.
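If you do shard later, enabling a hashed shard key on `_id` looks roughly like this (a sketch using pymongo against the mongos router; the host, database and collection names are placeholders):

```python
from pymongo import MongoClient

# Connect to the mongos router (address is a placeholder).
client = MongoClient("mongodb://mongos-host:27017")

# Enable sharding on the database, then shard the collection on a hashed _id,
# which distributes inserts evenly across the shards.
client.admin.command("enableSharding", "analytics")
client.admin.command(
    "shardCollection",
    "analytics.email_tracking",
    key={"_id": "hashed"},
)
```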

Plus, since you're interested in generating daily, weekly, ... reports, I think you can employ a hierarchical aggregation strategy to minimize and amortize the read load, which will be much more costly for the cluster than the write load because of the shard key choice. This will let you generate your reports incrementally on new data and then query the finished report without having to generate it at the time of querying. I'd also suggest using aggregation pipelines instead of map/reduce to generate the reporting data whenever possible, even though the manual page uses map/reduce.
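A daily rollup along those lines might look roughly like this (a sketch; the `opens_daily` collection and its field names are illustrative, and weekly/monthly/yearly reports would be built the same way from the daily documents rather than from the raw tracking data):

```python
from datetime import datetime
from pymongo import MongoClient

client = MongoClient("mongodb://mongos-host:27017")  # placeholder address
db = client["analytics"]

# Count yesterday's opens per mail-out code in one aggregation pass.
start = datetime(2015, 1, 13)
end = datetime(2015, 1, 14)

results = db.email_tracking.aggregate([
    {"$match": {"dateViewed": {"$gte": start, "$lt": end}}},
    {"$group": {"_id": "$code", "opens": {"$sum": 1}}},
])

# Upsert the totals into a pre-aggregated collection so reports read
# opens_daily instead of re-scanning the raw tracking documents.
for row in results:
    db.opens_daily.update_one(
        {"code": row["_id"], "day": start},
        {"$set": {"opens": row["opens"]}},
        upsert=True,
    )
```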

wdberkeley