I'm new to Rx.NET, but I have a business scenario that I think warrants it. However, I'm still having trouble wrapping my head around the initial design.
The Problem
- I have a large set of items, let's say 600k.
- I have a way of pulling these from a database in batches (let's say 1,000 at a shot)
- I want to run a process on these items in parallel, at most x at a time (let's say 50)
- When we're done, I need to know that because I need to spit out some additional stats and ensure the long-running process returns.
This seems ideal for reactive extensions -- I have:
- Something that is feeding a list over time
- A set of things that I want to do to those items as they come in
- A need to handle errors
- A need to handle completions.
Where I'm getting started
This seems like I would have a list of items as an observable, and my "looping" process that gets those items from the database would "push" them into this observable, and then the subscription to that observable would take over.
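To make the "push" idea concrete, here is a minimal sketch, assuming the System.Reactive package. `PushSketch`, `FetchPage`, and the use of `int` as the item type are hypothetical stand-ins, not part of the real code. The loop calls `OnNext` for each item and `OnCompleted` once the pages run out, which is what triggers the subscriber's completion callback.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Reactive.Subjects;

public static class PushSketch
{
    // Hypothetical stand-in for the paged database call:
    // returns an empty list once the data runs out.
    static List<int> FetchPage(int pageNum, int batchSize, int total) =>
        Enumerable.Range(pageNum * batchSize, batchSize).Where(i => i < total).ToList();

    public static (int Seen, bool Completed) Run(int total, int batchSize)
    {
        var items = new Subject<int>();
        var seen = 0;
        var completed = false;

        // Subscribe before pushing: a Subject does not replay earlier items.
        using var subscription = items.Subscribe(
            item => seen++,                                    // per-item work
            ex => Console.WriteLine($"Failed: {ex.Message}"),  // error handling
            () => completed = true);                           // completion

        for (var pageNum = 0; ; pageNum++)
        {
            var page = FetchPage(pageNum, batchSize, total);
            if (page.Count == 0) break;
            foreach (var item in page) items.OnNext(item);
        }

        // Signal that no more items will arrive.
        items.OnCompleted();
        return (seen, completed);
    }
}
```

This sketch is synchronous and single-threaded; it only illustrates the OnNext/OnCompleted contract, not the bounded parallelism.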
Where I'm getting stuck
- I'm a little unsure of the syntax
- I'm a little unsure of how to handle the x degrees of parallelism with a limit
- I'm unsure how I'll actually know when I hit completion. Would my loop that's pulling from the database call "OnCompleted()" instead of "OnNext" at that point?
I'm hoping someone can help me break down, conceptually, what I'm looking for so I can better wrap my head around it. Thanks!
Code v3 -- Much better but the method still exits too quickly.
This is starting to feel much better, but I know it's not quite there.
public override async Task ProcessAsync(DataLoadRequest dataLoadRequest, Func<string, Task> createTrackingPayload)
{
    _requestParameters = Deserialize<SchoolETLRequestParameters>(dataLoadRequest.DataExtractorParams);
    WireUpDependencies();

    //This is the new retriever which allows records to be "paged" (e.g. returns empty list for pageNum > 0 on the ones that don't have paging.)
    _recordsToProcessRetriever = new SettingBasedRecordsRetriever(_propertyRepository, _requestParameters.RunType, _requestParameters.ResidentialProfileIDOverrides, _processorSettings.MaxBatchesToProcess, _etlLogger);

    var query = Observable.Range(0, int.MaxValue)
        .Select(pageNum => _recordsToProcessRetriever.GetResProfIDsToProcess(pageNum, _processorSettings.BatchSize))
        .TakeWhile(resProfList => resProfList.Any())
        .SelectMany(records => records)
        .Select(resProf => Observable.Start(() => Task.Run(() => _schoolDataProcessor.ProcessSchoolsAsync(resProf)).Result))
        .Merge(maxConcurrent: _processorSettings.ParallelProperties);

    var subscription = query.Subscribe(async trackingRequests =>
        {
            await CreateRequests(trackingRequests, createTrackingPayload);
            var numberOfAttachments = SumOfRequestType(trackingRequests, TrackingRecordRequestType.AttachSchool);
            var numberOfDetachments = SumOfRequestType(trackingRequests, TrackingRecordRequestType.DetachSchool);
            var numberOfAssignmentTypeUpdates = SumOfRequestType(trackingRequests, TrackingRecordRequestType.UpdateAssignmentType);
            _etlLogger.Info("Extractor generated {0} attachments, {1} detachments, and {2} assignment type changes.",
                numberOfAttachments, numberOfDetachments, numberOfAssignmentTypeUpdates);
        },
        () =>
        {
            _etlLogger.Info("Finished! Woohoo!");
        });
}
Problems with v3
- The ProcessAsync method still finishes before all of the items have processed in the background. Normally I'd be fine with that, but in our case, the framework I'm using needs to wait until all of the tracking requests have been created (e.g. until CreateTrackingRequests has been called for each batch of results).
Is it possible to await all operations being completed within this?
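For reference, one way this could look, sketched under assumptions rather than as a drop-in fix: in Rx.NET an observable can be awaited, and the await finishes only after the sequence completes (or rethrows its error). `AwaitSketch`, `ProcessItemAsync`, and the `int` item type below are hypothetical stand-ins. The sketch uses `Observable.FromAsync` so that each task starts only when `Merge` subscribes to it, which lets `Merge` enforce the concurrency limit. Note that an `async` lambda passed to `Subscribe` becomes `async void`, so nothing waits for it; keeping the async work inside the query avoids that.

```csharp
using System;
using System.Reactive.Linq;
using System.Threading;
using System.Threading.Tasks;

public static class AwaitSketch
{
    // Hypothetical stand-in for the per-item async work.
    static async Task<int> ProcessItemAsync(int item)
    {
        await Task.Delay(10);
        return item * 2;
    }

    public static async Task<int> RunAsync(int itemCount, int maxParallel)
    {
        var processed = 0;

        var query = Observable.Range(0, itemCount)
            // FromAsync defers each task until Merge subscribes to it,
            // so at most maxParallel tasks are in flight at once.
            .Select(item => Observable.FromAsync(() => ProcessItemAsync(item)))
            .Merge(maxParallel)
            // Side effect per result; stands in for logging or stats.
            .Do(_ => Interlocked.Increment(ref processed));

        // Completes only after the whole sequence has completed.
        // LastOrDefaultAsync avoids an exception when the sequence is empty.
        await query.LastOrDefaultAsync();
        return processed;
    }
}
```

With this shape, the calling method can simply `await` the query before returning, instead of calling `Subscribe` and exiting immediately.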
Update: Additional information about the problem
In this case, we don't know what is going to produce the observables until run-time. The app is passed in a command, which amounts to either:
- "New Records": hits a method that returns the results of a specific sproc
- "Specific Record": for testing; hits a method that hits a separate sproc for specific given values
- "All Records": hits a method that goes into a continuous paging loop, looping through 600k records in pages of x (defined by a setting).
The first two scenarios sound like I could easily pass them right into an observable with no problem. However, the last one seems like I'd have to loop through a bunch of sets of observables in this case, which isn't the behavior I want (I want all 600k items to end up in a large queue and be processed 50 at a time).
My hope was that I could have one method that "throws things on the queue", and have the processing task continuously pull from that in batches of 50.
A note: all those methods that call the sprocs return the exact same thing: a list of IThing (obfuscated out of necessity).
I've wired all of those repository functions, etc. into my processor as dependencies, so calling ProcessStuffForMyThing(List<IThing>) takes care of that whole process, and works fine in parallel using the same object (no need to new it up each time).
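To illustrate the three run types feeding one pipeline, here is a rough sketch in which each command produces the same `IObservable<T>`, so the downstream processing does not need to know where the items came from. Everything here is a hypothetical stand-in: `SourceSketch`, `RunType`, the three `Get...` methods, the hard-coded sizes, and `int` in place of IThing.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Reactive.Linq;

public static class SourceSketch
{
    public enum RunType { NewRecords, SpecificRecord, AllRecords }

    // Hypothetical stand-ins for the sproc-backed repository methods.
    static List<int> GetNewRecords() => new List<int> { 1, 2, 3 };
    static List<int> GetSpecificRecord(int id) => new List<int> { id };
    static List<int> GetPage(int pageNum, int pageSize, int total) =>
        Enumerable.Range(pageNum * pageSize, pageSize).Where(i => i < total).ToList();

    // Every run type ends up as one flat stream of items.
    public static IObservable<int> CreateSource(RunType runType) => runType switch
    {
        RunType.NewRecords => GetNewRecords().ToObservable(),
        RunType.SpecificRecord => GetSpecificRecord(42).ToObservable(),
        // Keep requesting pages until an empty one comes back,
        // then flatten the pages into individual items.
        _ => Observable.Range(0, int.MaxValue)
                .Select(pageNum => GetPage(pageNum, 1000, 2500))
                .TakeWhile(page => page.Count > 0)
                .SelectMany(page => page),
    };
}
```

The result of `CreateSource` could then be passed through a `Select` plus `Merge(maxConcurrent)` stage to get the "50 at a time" behavior regardless of which command was given.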