3

I have a stream of data records being loaded from database. I can't store and load all of them into memory because there are millions of them. The caller should process records one by one (of course I have no guarantee).

My first try was to return lazy sequence of IEnumerable<Records> which would be loaded on demand and returned by yield return statement.

But I couldn't use await/async (which was used to get data from a database) in this method because yield return requires a return type of IEnumerable<>. In a result I cannot use async and Task<IEnumerable<>>.

Reading this convinced me to try Reactive Extensions as I can await async methods and return IObservable<>.

But as far as I've figured out as soon as someone subscribes to my observable the method which pulls the data is being invoked and it will pull all the data at once.

This is how part of my method my method looks like:

IList<int> ids = (...);
return Observable.Create<NitemonkeyRegistration>(async obs => 
    {
        using (SqlDataReader reader = await command.ExecuteReaderAsync())
            {
                if (!reader.HasRows)
                    obs.OnCompleted();

                while (await reader.ReadAsync())
                    ids.Add(reader.GetInt32(reader.GetOrdinal("RegistrationId")));

                for (int i = 0; i < ids.Count; i += 1000)
                {
                    //heavy database operations
                    var registrations = await GetRegistrationsByIds(connection, ids.Skip(i).Take(1000));
                    foreach (var pulledReg in registrations)
                    {
                        obs.OnNext(pulledReg);
                    }

                }

            }
     });

Can I put the caller in control so when he calls .Next() on the observable then my code pulls the data on demand?

How can I implement something that is similar to yield return using reactive extensions?

UPDATE

This is my consumer code:

var cancellationTokenSource = new CancellationTokenSource();
await Observable.ForEachAsync<NitemonkeyRegistration>(niteMonkeySales, async (record, i) =>
            {
                try
                {
                     await SomethingAwaitableWhichCanTakeSeconds(record);
                }
                catch(Exception e)
                {
                    // add logging
                    // this cancels the loop but also the IObservable
                    cancellationTokenSource.Cancel();
                    // can't rethrow because line
                    // above will cause errored http response already created
                }
            }, cancellationTokenSource.Token);

Problem with this is that new records are being pushed not waiting for awaitable task to complete. I can do this with .Wait() and not async lambda but the thread will be wasted waiting for a lengthy network operation to complete.

Might be important: this is a ASP.NET WEB API service.

mbudnik
  • 2,087
  • 15
  • 34

3 Answers3

2

Rx allows describing "push sequences", where the producer pushes values to the observers. If your requirement is to "pull" values from the source, I think what you are looking for is the Interactive Extensions Async Library (check-out this Channel 9 Video). It defines the IAsyncEnumerable<T> type together with a full set of LINQ operators, which allows describing pull-based sequences with asynchronous behavior (the downside though is that yield return does not work with that type (yet at least), so you'll probably need to write your own IAsyncEnumerator<T> implementation).

Jonas Chapuis
  • 697
  • 5
  • 9
  • Thanks for this. I will look into this. But how about I can cope with pushing sequences but I would like the pusher wait for subscriber to handle the data before pushing more. I'm trying to use Observable.ForEachAsync for this but as an action which handles the data I'm passing an async lambda (because I have some awaitable tasks in the body). But when I pass empty action with only await Task.Delay(10000) the loop continues to process next record without awaiting for this task to complete. I will edit my question with some more code. – mbudnik Feb 24 '15 at 15:07
  • Ok I am not sure the `ForEachAsync` operator is what you need then. The `OnNext()` on the observer will naturally only return when the observer has finished processing the element (unless the observer is an `ObserveOn()` operator which can introduce concurrency via a scheduler, i.e. the `ThreadPoolScheduler`). In other words, if you just `Subscribe()` to your observable you should be fine, the observable source will grab records as the observer processes them. – Jonas Chapuis Feb 24 '15 at 15:49
  • I might have omitted important thing here. This is a WEB API service and while handling request I need something I could await for. I can't return from method without being sure all the data is processed. I can await for Observable.ForEachAsync (that's why I've used it) but can't wait for all the items to be processed using Subscribe because it is not blocking or awaitable. – mbudnik Feb 24 '15 at 16:00
  • Ok I see: you can actually await an IObservable sequence. The behavior is to return the last element (before the OnCompleted), or throw the OnError that was observed (if the sequence contains no elements, you'll get an InvalidOperationException). So you should be able to write the following: `await niteMonkeySales.ObserveOn(ThreadPoolScheduler.Instance).Do(SomethingWhichCanTakeSeconds)` – Jonas Chapuis Feb 24 '15 at 16:05
  • Just tried it: http://pastebin.com/Fsp0x9YV. It doesn't wait for the task inside lambda before going to the next record. – mbudnik Feb 24 '15 at 16:39
  • Ok yes the underlying assumption in the piece of code above was that `SomethingWhichCanTakeSeconds()` was not some asynchronous piece of code, but some synchronous piece of work (possibly involving some locking), that you would launch on the ThreadPoolScheduler. If `SomethingWhichCanTakeSeconds()` is an asynchronous operation, you could make it an observable and use the `SelectMany()` operator, but that won't provide any backpressure mechanism. For backpressure, either you compromise a bit the asynchronicity by blocking on the thread pool or you need some additional mechanism. – Jonas Chapuis Feb 25 '15 at 07:55
  • I have myself used a custom subject for that purpose, which I called ValveSubject (http://stackoverflow.com/questions/28602104/valvesubject-a-queuing-subject-for-rx-with-built-in-buffering-open-close-opera/). This subject allows opening and closing the stream. You could write a solution based on that, in which the consumer tracks the number of pending jobs and closes the valve when this reaches some threshold number, and reopens it when it goes below another threshold. – Jonas Chapuis Feb 25 '15 at 07:59
  • Thanks for help, I'll let you know what I've used. I think I will try with AsyncEnumerable. – mbudnik Feb 25 '15 at 10:50
0

Do you need to use reactive extensions?

Your first attempt might have been on the right track.

Have a look at this answer to another question.

The problem could be with the query rather than the client side code.

As noted in the linked question, you might have to rewrite the query to make sure it streams the data to the client properly.


Update:

You should try to break up GetRegistrationsById into two blocks.

  1. Get the SqlDataReader to run the query. You can await this part.
  2. Use the SqlDataReader returned and iterate through it using yield return.

Here is an example, which is loosely based on your code sample.

    IList<int> ids = new List<int>();

    private async void doWork()
    {
        var connection = new SqlConnection(...);
        connection.Open();

        SqlCommand command = new SqlCommand("SELECT registrationId FROM someTable", connection);

        using (SqlDataReader reader = await command.ExecuteReaderAsync())
        {
            while (await reader.ReadAsync())
            {
                ids.Add(reader.GetInt32(reader.GetOrdinal("RegistrationId")));
            }

            reader.Close();

            //heavy database operations
            // Part 1 of whatever GetRegistrationsByIds does would go into GetRegistrationReader().
            var registrationReader = await Task.Run(() => GetRegistrationReader(connection, ids));

            // Part 2 of whatever GetRegistrationsByIds does for each 
            // Registration would go into GetRegistrations().
            var registrationEnumerator = GetRegistrations(orderReader);

            foreach (var registration in registrationEnumerator)
            {
                // Do whatever you need to do for each registration
                listBox1.Items.Add(registration.Id);
            }

        }
    }

    private IEnumerable<Registration> GetRegistrations(SqlDataReader reader)
    {
        while (reader.Read())
        {
            // You would do whatever you need to do to each registration here.
            var registration = new Registration{ Id = reader.GetInt32(reader.GetOrdinal("RegistrationId")) };
            yield return registration;
        }

    }

    private SqlDataReader GetRegistrationReader(SqlConnection connection, IList<int> ints)
    {
        // Some query that returns a lot of rows.
        // Ideally it would written to stream directly from the 
        // database server, rather than buffer the data to the client
        // side.
        SqlCommand command = new SqlCommand("SELECT * from registrations", connection);

        return command.ExecuteReader();
    }

    internal class Registration
    {
        public int Id;
        // ... other fields, etc.
    }
Community
  • 1
  • 1
Chris O'Neill
  • 1,742
  • 12
  • 14
  • The answer to this question you mentioned is a method returning IEnumerable with use of yield return. I want to use the async API for SqlReader but can't return Task from method which uses yield. – mbudnik Feb 24 '15 at 15:10
  • Thanks for the effort. It might even work correctly but what about I want to use async all the way down? Do you know SqlReader has reader.ReadAsync method? Sorry for being so picky but I want to be sure I use the best solution possible. – mbudnik Feb 24 '15 at 16:13
  • Sure. *Best* solution might be difficult to define, because introducing Rx.Net adds additional complexity / dependencies. Even ReadAsync() will block eventually, it just does it in a different thread. So if you are concerned about reader.Read() blocking, just do the `foreach (var registration in registrationEnumerator)` in its own `Task` and await that. – Chris O'Neill Feb 24 '15 at 16:35
  • 2
    You may be suprised but there is no thread blocking if you use async APIs all the way down. I have already asked it here http://stackoverflow.com/questions/21879606/async-await-threading-internals and got an interesting answer =] – mbudnik Feb 24 '15 at 16:42
  • Interesting. I wasn't aware from the question that the goal was to exclusively use async APIs. – Chris O'Neill Feb 24 '15 at 17:15
0

Rx.NET doesn't have a lot of built-in backpressure operators currently.

Using something like TPL Dataflow is probabnly a better fit for your problem.

Anyway, I think you can use a BlockingCollection to limit the rate you pull from the database:

// maximum of 10 items in buffer
var buffer = new BlockingCollection<NitemonkeyRegistration>(10);
niteMonkeySales.Subscribe(t => buffer.Add(t), () => buffer.CompleteAdd());

foreach (var item in buffer.GetConsumingEnumerable())
{
    try
    {
         await SomethingAwaitableWhichCanTakeSeconds(record);
    }
    catch(Exception e)
    {
        // add logging
        // this cancels the loop but also the IObservable
        cancellationTokenSource.Cancel();
        // can't rethrow because line
        // above will cause errored http response already created
    }
}
Brandon
  • 38,310
  • 8
  • 82
  • 87