Scaling out Windows Services

Question

I am looking for some input on how to scale out a Windows Service that is currently running at my company. We are using .NET 4.0 (can and will be upgraded to 4.5 at some point in the future) and running this on Windows Server 2012.

About the service
The service's job is to query for new rows in a logging table (We're working with an Oracle database), process the information, create and/or update a bunch of rows in 5 other tables (let's call them Tracking tables), update the logging table and repeat.

The logging table has large amounts of XML (can go up to 20 MB per row) which needs to be selected and saved in the other 5 Tracking tables. New rows are added all the time at the maximum rate of 500,000 rows an hour.
The Tracking tables' traffic is much higher, ranging from 90,000 new rows in the smallest one to potentially millions of rows in the largest table, each hour. Not to mention that there are Update operations on those tables as well.

About the data being processed
I feel this bit is important for finding a solution based on how these objects are grouped and processed. The data structure looks like this:

public class Report
{
    public long Id { get; set; }
    public DateTime CreateTime { get; set; }
    public Guid MessageId { get; set; }
    public string XmlData { get; set; }
}

public class Message
{
    public Guid Id { get; set; }
}

Report is the logging data I need to select and process.
For every Message there are on average 5 Reports. This can vary from 1 to hundreds in some cases.
Message has a bunch of other collections and other relations, but they are irrelevant to the question.

Today the Windows Service we have barely manages the load on a 16-core server (I don't remember the full specs, but it's safe to say this machine is a beast). I have been tasked with finding a way to scale out and add more machines that will process all this data and not interfere with the other instances.

Currently each Message gets its own thread, which handles the relevant reports. We handle reports in batches, grouped by their MessageId, to keep the number of DB queries to a minimum when processing the data.
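To make the batching concrete, here is a minimal sketch of grouping one fetched batch of reports by MessageId. It is written in Python only to keep it short (in C# the same idea maps onto a LINQ GroupBy). The trimmed-down Report type and the group_by_message helper are illustrative assumptions, not the service's actual code.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List
from uuid import UUID


@dataclass
class Report:
    """Trimmed stand-in for the Report class in the question (assumption)."""
    id: int
    message_id: UUID
    xml_data: str


def group_by_message(reports: Iterable[Report]) -> Dict[UUID, List[Report]]:
    """Bucket a fetched batch by MessageId, keeping each bucket ordered by Id."""
    batches: Dict[UUID, List[Report]] = defaultdict(list)
    for report in sorted(reports, key=lambda r: r.id):
        batches[report.message_id].append(report)
    return dict(batches)
```

Each bucket can then be handed to one worker, so all reports for a given Message are handled together and the per-message lookups happen once per batch.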

Limitations

At this stage I am allowed to re-write this service from scratch using any architecture I see fit.
Should an instance crash, the other instances need to be able to pick up where the crashed one left off. No data can be lost.
This processing needs to be as close to real-time as possible from the reports being inserted into the database.

I'm looking for any input or advice on how to build such a project. I assume the services will need to be stateless, or is there a way to synchronize caches for all the instances somehow? How should I coordinate between all the instances and make sure they're not processing the same data? How can I distribute the load equally between them? And of course, how do I handle an instance crashing and not completing its work?

EDIT
Removed irrelevant information

This *sounds* like an ETL process. Have you considered looking at something like SQL Server Integration Services (SSIS) and writing packages that can be scheduled to run to regularly perform this process? — Russ Cam, Feb 04 '13 at 22:50
We use Oracle and the higher ups don't want to hear a word about SQL Server, unfortunately. — Artless, Feb 05 '13 at 05:08
I was thinking only the SSIS part of it and not the database engine :) Alternatives would be something like Pentaho Data Integration (http://www.pentaho.com/explore/pentaho-data-integration/) or Talend etl analytics (http://www.talend.com/solutions/etl-analytics) — Russ Cam, Feb 05 '13 at 09:52

meklarian · Answer 1 · 2013-02-04T22:43:24.220

For your work items, Windows Workflow is probably your quickest means to refactor your service.

Windows Workflow Foundation @ MSDN

The most useful thing you'll get out of WF is workflow persistence, where a properly designed workflow may resume from a Persist point, should anything happen to the workflow after the last point at which it was saved.

Workflow Persistence @ MSDN

This includes the ability for a workflow to be recovered from another process should any other process crash while processing the workflow. The resuming process doesn't need to be on the same machine if you use the shared workflow store. Note that all recoverable workflows require the use of the workflow store.

For work distribution, you have a couple of options.

A service to produce messages, combined with host-based load balancing via workflow invocation using WCF endpoints via the WorkflowService class. Note that you'll probably want to use the design-mode editor here to construct entry methods rather than manually set up Receive and corresponding SendReply handlers (these map to WCF methods). You would likely call the service for every Message, and perhaps also call the service for every Report. Note that the CanCreateInstance property is important here. Every invocation tied to it will create a running instance that runs independently.
WorkflowService Class (System.ServiceModel.Activities) @ MSDN
Receive Class (System.ServiceModel.Activities) @ MSDN
Receive.CanCreateInstance Property (System.ServiceModel.Activities) @ MSDN
SendReply Class (System.ServiceModel.Activities) @ MSDN

Use a service bus that has Queue support. At the minimum, you want something that potentially accepts input from any number of clients, and whose outputs may be uniquely identified and handled exactly once. A few that come to mind are NServiceBus, MSMQ, RabbitMQ, and ZeroMQ. Out of the items mentioned here, only NServiceBus is .NET-ready out of the box. In a cloud context, your options also include platform-specific offerings such as Azure Service Bus and Amazon SQS.
NServiceBus
MSMQ @ MSDN
RabbitMQ
ZeroMQ
Azure Service Bus @ MSDN
Amazon SQS @ Amazon AWS
Note that the service bus is just the glue between a producer that will initiate Messages and a consumer that can exist on any number of machines to read from the queue. Similarly, you can use this indirection for Report generation. Your consumer will create workflow instances that may then use workflow persistence.

Windows AppFabric may be used to host workflows, allowing you to use many techniques that apply to IIS load balancing to distribute your work. I don't personally have any experience with it, so there's not much I can say for it other than it has good monitoring support out of the box.
How to: Host a Workflow Service with Windows App Fabric @ MSDN
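As a rough illustration of the queue option, the sketch below shows the competing-consumers idea: several workers pull from one queue and each item is handed to exactly one of them. It uses Python's in-process queue.Queue purely as a stand-in for a real broker such as MSMQ or RabbitMQ. The run_consumers function and the worker names are hypothetical, and a real broker would add acknowledgements and redelivery so that a crashed consumer's message is not lost.

```python
import queue
import threading
from typing import List, Tuple


def run_consumers(message_ids: List[str], worker_count: int) -> List[Tuple[str, str]]:
    """Drain a shared queue with several workers.

    Returns (message id, worker name) pairs, one per handled message.
    """
    work: "queue.Queue[str]" = queue.Queue()
    for message_id in message_ids:
        work.put(message_id)

    handled: List[Tuple[str, str]] = []
    lock = threading.Lock()

    def consume(name: str) -> None:
        while True:
            try:
                # Each get hands the item to exactly one consumer.
                message_id = work.get_nowait()
            except queue.Empty:
                return  # queue drained, worker exits
            with lock:
                handled.append((message_id, name))
            work.task_done()

    workers = [
        threading.Thread(target=consume, args=(f"worker-{i}",))
        for i in range(worker_count)
    ]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    return handled
```

With a real service bus the workers would live in separate processes on separate machines, but the shape of the consumer loop stays the same.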

Thanks! I'll have to do some reading and testing, and see what my company is willing to do. — Artless, Feb 05 '13 at 05:22
Given your comment to the reporting solution comment on your question, I should warn you that the persistence store that ships with WF relies on MS SQL Server, which may be a dealbreaker for your company. It might be worth seeing if you can get MSDE working as a persistence store to avoid having to setup a MSSQL instance. — meklarian, Feb 05 '13 at 05:27

score 2 · Accepted Answer · answered Mar 11 '13 at 18:16

I solved this by coding all this scalability and redundancy stuff on my own. I will explain what I did and how I did it, should anyone ever need this.

I created a few processes in each instance to keep track of the others and know which records the particular instance can process. On startup, the instance registers itself in the database (if it's not already registered) in a table called Instances. This table has the following columns:

Id                 Number
MachineName        Varchar2
LastActive         Timestamp
IsMaster           Number(1)

After registering (creating a row in this table if the instance's MachineName wasn't found), the instance starts pinging this table every second in a separate thread, updating its LastActive column. It then selects all the rows from this table and makes sure that the Master Instance (more on that later) is still alive, meaning that its LastActive time is within the last 10 seconds. If the master instance has stopped responding, the instance assumes control and sets itself as master. On the next iteration it makes sure that there is only one master (in case another instance assumed control at the same time), and if there is more than one, it yields to the instance with the lowest Id.
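The per-iteration decision described above could be sketched roughly as follows. This is an illustrative Python model, not the actual service code: the Instance dataclass mirrors the Instances table, while active_instances and next_is_master are hypothetical helper names. It assumes the calling instance has just refreshed its own LastActive.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List

# An instance counts as alive if it pinged within the last 10 seconds.
HEARTBEAT_TIMEOUT = timedelta(seconds=10)


@dataclass
class Instance:
    """One row of the Instances table."""
    id: int
    machine_name: str
    last_active: datetime
    is_master: bool


def active_instances(rows: List[Instance], now: datetime) -> List[Instance]:
    """Rows whose heartbeat is recent enough, sorted by Id."""
    alive = [r for r in rows if now - r.last_active <= HEARTBEAT_TIMEOUT]
    return sorted(alive, key=lambda r: r.id)


def next_is_master(my_id: int, rows: List[Instance], now: datetime) -> bool:
    """Decide whether this instance should flag itself as master."""
    live_masters = [r.id for r in active_instances(rows, now) if r.is_master]
    if not live_masters:
        return True  # master is missing or timed out: assume control
    # One or more live masters: only the lowest Id keeps the role.
    return my_id == min(live_masters)
```

If instances 2 and 3 both notice that master 1 has timed out, both return True on that pass; on the following pass instance 3 sees that 2 is also a master and yields, leaving a single master.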

What is the master instance?
The service's job is to scan a logging table and process that data so people can filter and read through it easily. I didn't state this in my question, but it might be relevant here. We have a bunch of ESB servers writing multiple records to the logging table per request, and my service's job is to keep track of them in near real-time. Since they're writing their logs asynchronously, I could potentially get a "finished processing request A" entry before the "started processing request A" entry in the log. So, I have some code that sorts those records and makes sure my service processes the data in the correct order. Because I needed to scale out this service, only one instance can run this logic, to avoid lots of unnecessary DB queries and possibly insane bugs.
This is where the Master Instance comes in. Only it executes this sorting logic and temporarily saves the log record Ids in another table called ReportAssignment. This table's job is to keep track of which records were processed and by whom. Once processing is complete, the record is deleted. The table looks like this:

RecordId        Number
InstanceId      Number    Nullable

The master instance sorts the log entries and inserts their Ids here. Every service instance checks this table at 1-second intervals for new records that either aren't being processed by anyone or are assigned to an inactive instance, and for which the record's Id modulo the number of active instances equals the index of the current instance in a sorted array of all the active instances (acquired during the pinging process). The query looks somewhat like this:

SELECT * FROM ReportAssignment
WHERE (InstanceId IS NULL OR InstanceId NOT IN (1, 2, 3))   -- 1, 2, 3 are the active instances
AND MOD(RecordId, 3) = 0    -- 0 is the index of the current instance in the list of active instances

Why do I need to do this?

The other two instances would query for MOD(RecordId, 3) = 1 and MOD(RecordId, 3) = 2.
MOD(RecordId, [instanceCount]) = [indexOfCurrentInstance] ensures that the records are distributed evenly between all instances.
InstanceId NOT IN (1, 2, 3) allows the instances to take over records that were being processed by an instance that crashed, without touching records already claimed by active instances when a new instance is added.

Once an instance queries for these records, it executes an update command, setting the InstanceId to its own, and then queries the logging table for records with those Ids. When processing is complete, it deletes the records from ReportAssignment.
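For anyone who wants to try the select, claim and delete cycle, here is a small runnable sketch using Python's built-in sqlite3 with an in-memory database standing in for Oracle. The function names (create_schema, claim_records, complete_records) are made up for illustration. SQLite accepts the % operator, whereas Oracle would use MOD(RecordId, n).

```python
import sqlite3
from typing import List


def create_schema(conn: sqlite3.Connection) -> None:
    """Create a ReportAssignment table like the one described above."""
    conn.execute(
        "CREATE TABLE ReportAssignment (RecordId INTEGER PRIMARY KEY, InstanceId INTEGER NULL)"
    )


def claim_records(conn: sqlite3.Connection, instance_id: int, active_ids: List[int]) -> List[int]:
    """Select this instance's share of unowned or orphaned records and mark them as its own."""
    active = sorted(active_ids)
    index = active.index(instance_id)  # position in the sorted list of active instances
    placeholders = ",".join("?" for _ in active)
    rows = conn.execute(
        "SELECT RecordId FROM ReportAssignment "
        f"WHERE (InstanceId IS NULL OR InstanceId NOT IN ({placeholders})) "
        "AND RecordId % ? = ? ORDER BY RecordId",
        (*active, len(active), index),
    ).fetchall()
    record_ids = [row[0] for row in rows]
    conn.executemany(
        "UPDATE ReportAssignment SET InstanceId = ? WHERE RecordId = ?",
        [(instance_id, record_id) for record_id in record_ids],
    )
    conn.commit()
    return record_ids


def complete_records(conn: sqlite3.Connection, record_ids: List[int]) -> None:
    """Remove assignments once their log records have been processed."""
    conn.executemany(
        "DELETE FROM ReportAssignment WHERE RecordId = ?",
        [(record_id,) for record_id in record_ids],
    )
    conn.commit()
```

With active instances 1, 2 and 3, instance 1 claims the records whose Id is divisible by 3, including any still assigned to an instance that is no longer in the active list. A second call returns nothing, because those rows now carry an active InstanceId.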

Overall I am very pleased with this. It scales nicely, ensures that no data is lost should the instance go down, and there were nearly no alterations to the existing code we have.
