I am looking for some input on how to scale out a Windows Service that is currently running at my company. We are using .NET 4.0 (can and will be upgraded to 4.5 at some point in the future) and running this on Windows Server 2012.
About the service
The service's job is to query for new rows in a logging table (We're working with an Oracle database), process the information, create and/or update a bunch of rows in 5 other tables (let's call them Tracking tables), update the logging table and repeat.
The logging table has large amounts of XML (can go up to 20 MB per row) which needs to be selected and saved in the other 5 Tracking tables. New rows are added all the time at the maximum rate of 500,000 rows an hour.
The Tracking tables' traffic is much higher, ranging from 90,000 new rows in the smallest one to potentially millions of rows in the largest table, each hour. Not to mention that there are Update operations on those tables as well.
About the data being processed
I feel this bit is important for finding a solution based on how these objects are grouped and processed. The data structure looks like this:
public class Report
{
public long Id { get; set; }
public DateTime CreateTime { get; set; }
public Guid MessageId { get; set; }
public string XmlData { get; set; }
}
public class Message
{
public Guid Id { get; set; }
}
- Report is the logging data I need to select and process
- For every Message there are on average 5 Reports. This can vary between 1 to hundreds in some cases.
- Message has a bunch of other collections and other relations, but they are irrelevant to the question.
Today the Windows Service we have barely manages the load on a 16-core server (I don't remember the full specs, but it's safe to say this machine is a beast). I have been tasked with finding a way to scale out and add more machines that will process all this data and not interfere with the other instances.
Currently each Message gets it's own Thread and handles the relevant reports. We handle reports in batches, grouped by their MessageId to reduce the number of DB queries to a minimum when processing the data.
Limitations
- At this stage I am allowed to re-write this service from scratch using any architecture I see fit.
- Should an instance crash, the other instances need to be able to pick up where the crashed one left. No data can be lost.
- This processing needs to be as close to real-time as possible from the reports being inserted into the database.
I'm looking for any input or advice on how to build such a project. I assume the services will need to be stateless, or is there a way to synchronize caches for all the instances somehow? How should I coordinate between all the instances and make sure they're not processing the same data? How can I distribute the load equally between them? And of course, how to handle an instance crashing and not completing it's work?
EDIT
Removed irrelevant information