While trying to traverse a directory tree efficiently, I tried an Rx solution described here. It works for shallow trees, but it is not usable for deep ones: the default scheduler creates too many threads, which slows the traversal down.

Here's the code I use:

public static void TestTreeTraversal()
{
    Func<DirectoryInfo, IObservable<DirectoryInfo>> recurse = null;
    recurse = i => Observable.Return(i)
                    .Concat(i.GetDirInfos().ToObservable().SelectMany(d => recurse(d)))
                    .ObserveOn(Scheduler.Default);
    var obs = recurse(new DirectoryInfo(@"C:\"));
    var result = obs.ToEnumerable().ToList();
}

public static IEnumerable<DirectoryInfo> GetDirInfos(this DirectoryInfo dir)
{
    IEnumerable<DirectoryInfo> dirs = null;
    try
    {
        dirs = dir.EnumerateDirectories("*", SearchOption.TopDirectoryOnly);
    }
    catch (Exception)
    {
        yield break;
    }
    foreach (DirectoryInfo d in dirs)
        yield return d;
}

If you remove ObserveOn(Scheduler.Default), the function runs at the same speed as a single-threaded recursive function. With ObserveOn, it seems that a thread is created each time SelectMany is called, which slows the process down dramatically.

Is there a way to control/limit the maximum number of threads the Scheduler can use at the same time?

Is there another way to write such a parallel tree traversal with Rx, without falling into this parallelism pitfall?

rducom

1 Answer

It could be done in Rx with this overload of the Merge operator, perhaps by passing Environment.ProcessorCount to the maxConcurrent parameter.
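As a rough illustration of that idea, here is a minimal sketch, not a tested solution. It assumes the System.Reactive package and a partitioning strategy chosen only for the example: each first-level subdirectory is walked sequentially, and `Merge(maxConcurrent)` is applied once, at the end, to limit how many of those walks run at the same time. The names `BoundedRxTraversal`, `SafeChildren`, `Walk` and `Traverse` are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Reactive.Concurrency;
using System.Reactive.Linq;

public static class BoundedRxTraversal
{
    // Hypothetical helper: child directories, or none if the directory cannot be read.
    static DirectoryInfo[] SafeChildren(DirectoryInfo dir)
    {
        try { return dir.GetDirectories(); }
        catch (UnauthorizedAccessException) { return new DirectoryInfo[0]; }
        catch (IOException) { return new DirectoryInfo[0]; }
    }

    // Plain sequential depth-first walk of one subtree (no Rx, no extra threads).
    static IEnumerable<DirectoryInfo> Walk(DirectoryInfo root)
    {
        var pending = new Stack<DirectoryInfo>();
        pending.Push(root);
        while (pending.Count > 0)
        {
            var dir = pending.Pop();
            yield return dir;
            foreach (var child in SafeChildren(dir)) pending.Push(child);
        }
    }

    // One inner observable per first-level subdirectory; Merge subscribes to at
    // most maxConcurrent of them at a time, so concurrency is bounded in one place.
    public static IObservable<DirectoryInfo> Traverse(DirectoryInfo root, int maxConcurrent)
    {
        var subtrees = SafeChildren(root)
            .ToObservable()
            .Select(d => Walk(d).ToObservable(Scheduler.Default));

        return Observable.Return(root).Concat(subtrees.Merge(maxConcurrent));
    }
}
```

It could be called as `BoundedRxTraversal.Traverse(new DirectoryInfo(@"C:\"), Environment.ProcessorCount).ToEnumerable().ToList()`. Partitioning on the first level only is a simplification; a very unbalanced tree would need a finer strategy.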

However, Rx is designed to work over IObservable<T> for natively asynchronous processing. You can certainly convert an IEnumerable<T> into an IObservable<T> and process it in parallel, as you've done here, but that goes against the grain in Rx.

A more natural solution to this problem is PLINQ, which begins with an IEnumerable<T> and is designed for partitioning a query into parallel processes, implicitly taking into account the number of physical processors available.
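A minimal sketch of what that could look like, using only the base class library. It is an illustration under assumptions, not a benchmarked implementation: the first-level subdirectories are handed to PLINQ, and each one is walked sequentially inside the query. The names `PlinqTraversal`, `SafeChildren`, `Walk` and `Traverse` are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

public static class PlinqTraversal
{
    // Hypothetical helper: child directories, or none if the directory cannot be read.
    static DirectoryInfo[] SafeChildren(DirectoryInfo dir)
    {
        try { return dir.GetDirectories(); }
        catch (UnauthorizedAccessException) { return new DirectoryInfo[0]; }
        catch (IOException) { return new DirectoryInfo[0]; }
    }

    // Sequential depth-first walk of one subtree.
    static IEnumerable<DirectoryInfo> Walk(DirectoryInfo root)
    {
        var pending = new Stack<DirectoryInfo>();
        pending.Push(root);
        while (pending.Count > 0)
        {
            var dir = pending.Pop();
            yield return dir;
            foreach (var child in SafeChildren(dir)) pending.Push(child);
        }
    }

    // PLINQ decides how to spread the first-level subtrees across worker threads;
    // no explicit degree of parallelism is set here.
    public static List<DirectoryInfo> Traverse(DirectoryInfo root)
    {
        var result = new List<DirectoryInfo> { root };
        result.AddRange(SafeChildren(root).AsParallel().SelectMany(d => Walk(d)));
        return result;
    }
}
```

The order of the results is not guaranteed, since the query does not call `AsOrdered()`.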

Rx is mostly about taming concurrency, while PLINQ is mostly about introducing it.

Untested:

Func<DirectoryInfo, ParallelQuery<DirectoryInfo>> recurse = null;

recurse = dir => new[] { dir }.AsParallel()
  .Concat(dir.GetDirInfos().AsParallel().SelectMany(recurse));

var result = recurse(new DirectoryInfo(@"C:\")).ToList();
Dave Sexton
  • I have translated GetDirInfos into an Observable version and tried it with Merge(8). The same problem happens: too many threads are created. It is called like this: Concat(i.GetObservableDirInfos().Select(d => recurse(d)).Merge(8)).ObserveOn(Scheduler.Default); Your untested code doesn't work. I already have a working recursive parallel version with a Parallel.ForEach over a BlockingCollection using a GetConsumingPartitioner() extension. What I'd like to achieve is the same result, but with Reactive Extensions. – rducom Jan 22 '15 at 16:19
  • How did you translate `GetDirInfos` into an observable? Are you taking advantage of the native file I/O asynchrony? If not, then my recommendation to use PLINQ still stands. If so, then your problem is that you're applying `Merge` recursively. Rx doesn't constrain concurrency across calls to `Merge`. You'd have to apply `Merge` to the end of your query only. – Dave Sexton Jan 22 '15 at 16:23
  • Keep in mind that even using `Merge` is going to require you to come up with your own partitioning strategy. Perhaps that's the confusion. You can't just insert `Merge` into your existing query. That query specifically introduces concurrency for *every* directory. That's why PLINQ is more appropriate. – Dave Sexton Jan 22 '15 at 16:36
  • I use Merge inside the Concat() method, since Concat() takes an IObservable as its argument. I understand that the recursion is the root of the problem, and that multiple threads are created inside the SelectMany() or Merge() methods. I also know there are other ways to do a parallel tree traversal (and I repeat, I already have a working one). What I'm asking is: "Is there another way to write such a parallel tree traversal with Rx?" – rducom Jan 22 '15 at 16:38
  • Neither `SelectMany` nor `Merge` introduces concurrency. The `ObserveOn` operator is the only operator that's introducing concurrency, and it's doing it for *every* directory. That's the query you chose to use. If your question is whether Rx can partition your sequence for you, then the answer is no. You have to choose what parts of the recursion you want to execute in parallel yourself. – Dave Sexton Jan 22 '15 at 16:41
  • I believed that ObserveOn just enables or disables concurrency by specifying a free-threaded or multi-threaded scheduler. I also believed that the "single-threaded" nature of Rx makes the SelectMany operator one of the few places where parallelism happens and threads are created. I agree with you about the partitioning problem; that's exactly the issue. I don't expect Rx to partition things for me. I'm hoping for some ideas on how to do it the right way :) What I'm looking for is a correct Rx way of approaching tree traversal that lets parallelism happen efficiently. – rducom Jan 22 '15 at 17:29
  • Rx is free-threaded. `IScheduler` is the only thing that can create threads. The operators don't care about threads at all. The only purpose of `ObserveOn` is to use an `IScheduler` to create an asynchronous boundary. Observables represent concurrency in Rx, so any operator that deals with multiple observables can handle concurrency, of which `SelectMany` is just a single example. `SelectMany` doesn't know anything about `IScheduler` itself; it doesn't even have any overloads that accept an `IScheduler`. – Dave Sexton Jan 22 '15 at 20:45
  • If you're simply converting an enumerable into an observable, which I suspect that you are, then PLINQ is the correct approach. – Dave Sexton Jan 22 '15 at 20:46
  • I would agree with Dave Sexton here. It seems you have a hammer (Rx) and are looking for a nail. However, as the problem space is not really about querying a sequence or sequences of events, you are finding that it does not behave as you wish. PLINQ does seem to be the correct tool here. Could you explain why you need an Rx solution to this? – Lee Campbell Jan 25 '15 at 22:35
  • The problem posted was just an example for playing around with Rx, trying to mix recursion and parallelism. I agree that using PLINQ is far more intuitive, simple and efficient. While exploring the Rx source code, I noticed that the DefaultConcurrencyAbstractionLayer class creates a new Thread on each call to StartThread(), and I found no code limiting the maximum number of threads. I think Rx is still young and, at the moment, is not intended for parallelism. However, it seems there's a lot of room for future integration of efficient parallelism primitives. I hope it happens soon. – rducom Jan 28 '15 at 13:29
  • I doubt Rx will ever solve the partitioning problem for you. That's the point that I think you're continuing to ignore. [Partitioning](http://blogs.msdn.com/b/pfxteam/archive/2009/05/28/9648672.aspx) is explicitly handled by PLINQ. It can do this efficiently because it's pull-based. It pulls however much it needs. Rx is push-based, thus there's only one way to force parallelization: by controlling the number of simultaneous calls to `Subscribe`. That's exactly how `Merge(int maxConcurrent)` works. – Dave Sexton Jan 28 '15 at 18:22
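
The last point in the comments above, that `Merge(int maxConcurrent)` limits how many inner sequences are subscribed at once, can be observed with a small experiment. This is only a sketch, assuming the System.Reactive package; `MergeConcurrencyDemo` and `PeakActive` are hypothetical names, and the 20 ms sleep stands in for real work.

```csharp
using System;
using System.Linq;
using System.Reactive.Concurrency;
using System.Reactive.Linq;
using System.Threading;

public static class MergeConcurrencyDemo
{
    // Runs sourceCount work items through Merge(maxConcurrent) and returns the
    // highest number of items that were executing at the same moment.
    public static int PeakActive(int sourceCount, int maxConcurrent)
    {
        var gate = new object();
        int active = 0, peak = 0;

        var sources = Enumerable.Range(0, sourceCount).Select(i =>
            // Defer postpones the work until Merge actually subscribes.
            Observable.Defer(() => Observable.Start(() =>
            {
                lock (gate) { active++; peak = Math.Max(peak, active); }
                Thread.Sleep(20);               // placeholder for real work
                lock (gate) { active--; }
                return i;
            }, Scheduler.Default)));

        // Block until every inner sequence has completed.
        sources.ToObservable().Merge(maxConcurrent).ToEnumerable().ToList();
        return peak;
    }
}
```

With `PeakActive(8, 2)`, the returned peak should not exceed 2, because a new inner sequence is only subscribed once a running one has completed.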