ReactiveExtensions: Stop an observable from returning before the tasks it has spun off have finished?

Question

I'm new to Rx .NET but I have a business scenario that I think warrants it. However, I'm still having trouble wrapping my head around the initial design.

The Problem

I have a large set of items, let's say 600k.
I have a way of pulling these from a database in batches (let's say 1,000 at a shot)
I would run a process on these items in parallel, at most x amount at a time (let's say 50 at a time)
When we're done, I need to know that because I need to spit out some additional stats and ensure the long-running process returns.

This seems ideal for reactive extensions -- I have:

Something that is feeding a list over time
A set of things that I want to do to those items as they come in
A need to handle errors
A need to handle completions.

Where I'm getting started

This seems like I would have a list of items as an observable, and my "looping" process that gets those items from the database would "push" them into this observable, and then the subscription to that observable would take over.

Where I'm getting stuck

I'm a little unsure of the syntax
I'm a little unsure of how to handle the x degrees of parallelism with a limit
I'm unsure how I'll actually know when I hit completion. Would my loop that's pulling from the database call "OnComplete()" instead of "OnNext" at that point?

I'm hoping someone can help me break down, conceptually, what I'm looking for so I can better wrap my head around it. Thanks!

Code v3 -- Much better but the method still exits too quickly.

This is starting to feel much better, but I know it's not quite there.

public override async Task ProcessAsync(DataLoadRequest dataLoadRequest, Func<string, Task> createTrackingPayload)
{
    _requestParameters = Deserialize<SchoolETLRequestParameters>(dataLoadRequest.DataExtractorParams);
    WireUpDependencies();

    //This is the new retriever which allows records to be "paged" (e.g. returns empty list for pageNum > 0 on the ones that don't have paging.)
    _recordsToProcessRetriever = new SettingBasedRecordsRetriever(_propertyRepository, _requestParameters.RunType, _requestParameters.ResidentialProfileIDOverrides, _processorSettings.MaxBatchesToProcess, _etlLogger);

    var query = Observable.Range(0, int.MaxValue)
        .Select(pageNum => _recordsToProcessRetriever.GetResProfIDsToProcess(pageNum, _processorSettings.BatchSize))
        .TakeWhile(resProfList => resProfList.Any())
        .SelectMany(records => records)
        .Select(resProf => Observable.Start(() => Task.Run(()=> _schoolDataProcessor.ProcessSchoolsAsync(resProf)).Result))
        .Merge(maxConcurrent: _processorSettings.ParallelProperties);

    var subscription = query.Subscribe(async trackingRequests =>
        {
            await CreateRequests(trackingRequests, createTrackingPayload);
            var numberOfAttachments = SumOfRequestType(trackingRequests, TrackingRecordRequestType.AttachSchool);
            var numberOfDetachments = SumOfRequestType(trackingRequests, TrackingRecordRequestType.DetachSchool);
            var numberOfAssignmentTypeUpdates = SumOfRequestType(trackingRequests, TrackingRecordRequestType.UpdateAssignmentType);

            _etlLogger.Info("Extractor generated {0} attachments, {1} detachments, and {2} assignment type changes.",
                numberOfAttachments, numberOfDetachments, numberOfAssignmentTypeUpdates);
        },
        () =>
        {
            _etlLogger.Info("Finished! Woohoo!");
        });
}

Problems with v3

The ProcessAsync method still finishes before all of the items have processed in the background. Normally I'd be fine with that, but in our case, the framework I'm using needs to wait until all of the tracking requests have been created (e.g. until CreateTrackingRequests has been called for each batch of results).

Is it possible to await all operations being completed within this?

Update: Additional information about the problem

In this case, we don't know what is going to produce the observables until run-time. The app is passed in a command, which amounts to either:

"New Records": hits a method that returns the results of a specific sproc
"Specific Record": for testing; hits a method that hits a separate sproc for specific given values
"All Records": hits a method that goes into a continuous paging loop, looping through 600k records in pages of x (defined by a setting).

The first two scenarios sound like I could easily pass them right into an observable with no problem. However, the last one seems like I'd have to loop through a bunch of sets of observables in this case, which isn't the behavior I want (I want all 600k items to end up in a large queue and be processed 50 at a time).

My hope was that I could have one method that "throws things on the queue", and have the processing task continuously pull from that in batches of 50.

A note: all those methods that call the sprocs return the exact same thing -- a list of IThing (obfuscated out of necessity).

I've wired all of those repository functions, etc. into my processor AS dependencies, so calling ProcessStuffForMyThing(List<IThing>) takes care of that whole process, and works fine in parallel using the same object (no need to new it up each time).

Enigmativity · Accepted Answer · 2015-07-10T00:25:02.513

You have a number of issues with your code that you should fix. The mistakes you're making I've seen many times - everyone seems to go down the same path. It really boils down to changing your thinking from procedural to functional.

To start with, Rx has a lot of operators designed to make your life easier. One of them is Observable.Using. Its job is to spin up a disposable resource, build an observable, and dispose of the resource when the observable completes. Just perfect for reading records from a database.

Your code seems to have an already open database connection and you're pumping records out via a subject. You should avoid having external state (the data processor) and you should avoid using subjects. There's almost always an observable operator you can use.

The other thing you're doing that you probably shouldn't is mixing your monads - or more specifically observables and tasks. There are operators in Rx for turning tasks into observables, but they are there for interfacing with existing code and shouldn't be used as a tool in your observables. The rule is to try to get into an observable and stay there until you are ready to subscribe to your data.

I felt that your code was a little fragmented to understand exactly what was being called where, so I wrote a general purpose bit of code that I think covers off on what you need. Here's the query:

var pageSize = 4;

Func<Record, Result> process = r =>
{
    Thread.Sleep(100); // Only here to demonstrate parallelism
    return new Result(r.ID);
};

var query =
    Observable
        .Using(
            () => new DataProcessor(),
            dc =>
                Observable
                    .Range(0, int.MaxValue)
                    .Select(n => dc.GetRecords(n, pageSize))
                    .TakeWhile(rs => rs.Any())
                    .SelectMany(rs => rs)
                    .Select(r => Observable.Start(() => process(r)))
                    .Merge(maxConcurrent: 4));

var subscription =
    query
        .Subscribe(
            r => Console.WriteLine(r.ID),
            () => Console.WriteLine("Done."));

I've clearly taken some shortcuts with your code, but in essence it is much the same (I hope).

This code is runnable if you add in the following classes:

public class DataProcessor : IDisposable
{
    public DataProcessor() { Console.WriteLine("Opened."); }
    public void Dispose() { Console.WriteLine("Closed."); }
    public IEnumerable<Record> GetRecords(int page, int count)
    {
        Console.WriteLine("Reading.");
        Thread.Sleep(100);
        var records = page <= 5
            ? Enumerable
                .Range(0, count < 5 ? count : count / 2)
                .Select(x => new Record())
                .ToArray()
            : new Record[] { };
        Console.WriteLine("Read.");
        return records;
    }
}

public class Record
{
    private static int __counter = 0;
    public Record() { this.ID = __counter++; }
    public int ID { get; private set; }
}

public class Result
{
    public Result(int id) { this.ID = id; }
    public int ID { get; private set; } 
}

When I run it I get this result:

Opened.
Reading.
Read.
Reading.
0
2
3
1
Read.
Reading.
7
Read.
5
6
4
Reading.
10
11
9
8
Read.
Reading.
15
12
Read.
14
Reading.
13
17
19
18
16
Read.
Reading.
21
Read.
20
22
23
Done.
Closed.

You can see that it is being processed in parallel. You can see that the observable is completing. Also you can see that the database is opening, and then closing once the observable is done.

Let me know if this helps.

Thank you for this thorough answer! I'm still digesting. I didn't think it was relevant at the time, but based on what you've answered I think I should note that, depending on a variable, these results are actually coming from one of three different methods (that each hit a different stored proc). The only one that pages is the "all records" one; the other two just return one list. I'm stuck with the procs and data access framework as they're mandatory :( Still, these are all really good tips. I'm looking forward to my ah-ha moment with functional; I know there's a lot of power there. — SeanKilleen, Jul 10 '15 at 01:49
@SeanKilleen - If you can post the signatures of the procs and data access framework I could have a go at refactoring my solution to fit. — Enigmativity, Jul 10 '15 at 01:59
Thanks for that generous offer! I've added a new section to the bottom of the question with updated information that I hope explains better what I'm trying to accomplish. — SeanKilleen, Jul 10 '15 at 12:21
I also like the idea of adding a paging signature (even if it's a dummy) to each signature for retrieving data. Then I could throw it behind an interface and call the same signature no matter which one it is. — SeanKilleen, Jul 10 '15 at 12:32
I updated the code to reflect "v3" with your changes (and less obfuscation). Works great! The last part I'm wrapping my head around is how to stop the `ProcessAsync` method from completing until all of the calls to `CreateTrackingRecords` have been made. — SeanKilleen, Jul 10 '15 at 13:35

score 1 · Answer 2 · answered Jul 09 '15 at 23:13

First, I would not recommend rolling your own enumeration conversion. If you have an IEnumerable<T> you can use the .ToObservable() extension that will handle the enumeration for you.
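As a minimal sketch of that first point (the `items` list here is a hypothetical stand-in for whatever the repository returns), `ToObservable()` from `System.Reactive.Linq` pushes each element through `OnNext` and then signals completion, so no hand-written enumeration loop is needed:

```csharp
using System;
using System.Collections.Generic;
using System.Reactive.Linq;

public static class ToObservableSketch
{
    public static void Main()
    {
        // Hypothetical stand-in for a list returned by a repository call.
        IEnumerable<int> items = new List<int> { 1, 2, 3 };

        // ToObservable() enumerates the source, calling OnNext for each element
        // and OnCompleted once the enumeration ends.
        items.ToObservable()
            .Subscribe(
                x => Console.WriteLine("Got {0}", x),
                () => Console.WriteLine("Done."));
    }
}
```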

Second, you should handle the results of an Observable in the Subscribe method. Right now your method will return immediately after the enumeration because you don't appear to actually await anything in your async method. If you have to use the current method signature, you can take advantage of an Observable also being awaitable.

So here is my suggested code structure (warning untested):

public override async Task ProcessAsync(Request theRequest,
                             Func<string, Task> createTrackingPayload) // not my design^TM
{
    // ...do stuff with the request, wire up some dependencies, etc.
    // End goal is to call createTrackingPayload with some things.
    // `items` is the IEnumerable of things to process, obtained above.
    await items.ToObservable()
        .Select(thing => Observable.FromAsync(async () =>
        {
            var requests = await _dataProcessor.DoSomethingAsync(thing);
            if (requests != null && requests.Any())
            {
                var numberOfType1 = SumOfRequestType(requests, TrackingRecordRequestType.Type1);
                var numberOfType2 = SumOfRequestType(requests, TrackingRecordRequestType.DetachSchool);
                var numberOfType3 = SumOfRequestType(requests, TrackingRecordRequestType.UpdateAssignmentType);

                // Something that will iterate over the list and call the function we need to call.
                await CreateRequests(requests, createTrackingPayload);
                return requests.Count();
            }
            return 0;
        }))
        .Merge(maxConcurrent: _processorSettings.DegreeofParallelism)
        .Do(x => _logger.Info("processed {0} items.", x))
        .Aggregate(0, (acc, x) => acc + x);
}

Basically, the idea here is that you await the completion of the Observable, which gives you the last value it emitted before completing. By adding the Do and the Aggregate you can move the logging logic out of your processing logic.
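To make the "await gives you the last value" behaviour concrete, here is a small self-contained sketch. `ProcessAsync` below is a hypothetical stand-in for the real per-item work, and the numbers are made up for illustration. Because `Aggregate` with a seed emits exactly one value (the running total) when the source completes, the awaited result is that total:

```csharp
using System;
using System.Reactive.Linq;
using System.Threading.Tasks;

public static class AwaitObservableSketch
{
    // Hypothetical stand-in for the real work; returns a count for the item.
    private static async Task<int> ProcessAsync(int item)
    {
        await Task.Delay(50);
        return item * 2;
    }

    public static async Task Main()
    {
        // Awaiting an IObservable<T> subscribes to it and yields its last value.
        int total = await Observable.Range(1, 5)
            .Select(item => Observable.FromAsync(() => ProcessAsync(item)))
            .Merge(maxConcurrent: 2) // at most two ProcessAsync calls in flight
            .Aggregate(0, (acc, x) => acc + x);

        Console.WriteLine(total); // prints 30 (2 + 4 + 6 + 8 + 10)
    }
}
```

Note that awaiting an observable that completes without emitting anything throws, which is one reason a seeded `Aggregate` is convenient at the end of the chain.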

score 0 · Answer 3 · edited May 23 '17 at 11:44

I'm giving the credit here to Enigmativity because their answer is what led me to my (mostly) correct place.

The code that does what I need it to is below, with the exception of one minor issue with the sequence evaluating multiple times.

var query = Observable.Range(0, int.MaxValue)
    .Select(pageNum =>
        {
            _etlLogger.Info("Calling GetResProfIDsToProcess with pageNum of {0}", pageNum);
            return _recordsToProcessRetriever.GetResProfIDsToProcess(pageNum, _processorSettings.BatchSize);
        })
    .TakeWhile(resProfList => resProfList.Any())
    .SelectMany(records => records.Where(x=> _determiner.ShouldProcess(x)))
    .Select(resProf => Observable.Start(async () => await _schoolDataProcessor.ProcessSchoolsAsync(resProf)))
    .Merge(maxConcurrent: _processorSettings.ParallelProperties)
    .Do(async trackingRequests =>
    {
        await CreateRequests(trackingRequests.Result, createTrackingPayload);

        var numberOfAttachments = SumOfRequestType(trackingRequests.Result, TrackingRecordRequestType.AttachSchool);
        var numberOfDetachments = SumOfRequestType(trackingRequests.Result, TrackingRecordRequestType.DetachSchool);
        var numberOfAssignmentTypeUpdates = SumOfRequestType(trackingRequests.Result,
            TrackingRecordRequestType.UpdateAssignmentType);

        _etlLogger.Info("Extractor generated {0} attachments, {1} detachments, and {2} assignment type changes.",
            numberOfAttachments, numberOfDetachments, numberOfAssignmentTypeUpdates);
    });

var subscription = query.Subscribe(
trackingRequests =>
{
    //Nothing really needs to happen here. Technically we're just doing something when it's done.
}, 
() =>
{
    _etlLogger.Info("Finished! Woohoo!");
});
await query.Wait();
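The multiple evaluation most likely comes from the query being cold and subscribed to twice: once by `Subscribe` and once more by `Wait()`. The `async` lambda passed to `Do` is also not awaited by the pipeline, so completion can be signalled before `CreateRequests` has finished. A sketch of one way around both, assuming the request creation can move inside `Observable.FromAsync` (every helper below is a hypothetical stand-in for the real retriever and processor), is to subscribe exactly once by awaiting the query:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Reactive;
using System.Reactive.Linq;
using System.Threading.Tasks;

public static class SingleSubscriptionSketch
{
    // Hypothetical stand-in for the paged retriever: three pages, then empty.
    private static IEnumerable<int> GetPage(int pageNum, int pageSize) =>
        pageNum < 3
            ? Enumerable.Range(pageNum * pageSize, pageSize).ToList()
            : new List<int>();

    // Hypothetical stand-in for ProcessSchoolsAsync.
    private static async Task<string> ProcessItemAsync(int id)
    {
        await Task.Delay(20);
        return "request-" + id;
    }

    // Hypothetical stand-in for CreateRequests.
    private static Task CreateRequestsAsync(string request)
    {
        Console.WriteLine("Created {0}", request);
        return Task.CompletedTask;
    }

    public static async Task Main()
    {
        var query = Observable.Range(0, int.MaxValue)
            .Select(pageNum => GetPage(pageNum, 4))
            .TakeWhile(page => page.Any())
            .SelectMany(page => page)
            .Select(id => Observable.FromAsync(async () =>
            {
                // Both steps are awaited inside the inner observable, so the
                // merged sequence cannot complete until they have finished.
                var request = await ProcessItemAsync(id);
                await CreateRequestsAsync(request);
                return Unit.Default;
            }))
            .Merge(maxConcurrent: 2);

        // The await is the only subscription. LastOrDefaultAsync avoids an
        // exception if there turn out to be no records to process.
        await query.LastOrDefaultAsync();
        Console.WriteLine("Finished.");
    }
}
```

With this shape the enclosing `async` method does not return until every item has been processed and its requests created, and the pages are only read once.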