How to zip 2 sequences based on property (zip, join)

Question

I would like to zip the items of 2 sequences based on a common property similar to joining them when using enumerables. How can I make the second test pass?

using NUnit.Framework;
using System;
using System.Collections.Generic;
using System.Linq;
using System.Reactive.Linq;
using System.Threading.Tasks;

public class SequenceTests
{
    private class Entry
    {
        public Entry(DateTime timestamp, string value)
        {
            Timestamp = timestamp;
            Value = value;
        }

        public DateTime Timestamp { get; }

        public string Value { get; }
    }

    private readonly IEnumerable<Entry> Tasks = new List<Entry>
    {
        new Entry(new DateTime(2021, 6, 6), "Do homework"),
        new Entry(new DateTime(2021, 6, 7), "Buy groceries"), // <-- This date is also in the People collection!
        new Entry(new DateTime(2021, 6, 8), "Walk the dog"),
    };

    private readonly IEnumerable<Entry> People = new List<Entry>
    {
        new Entry(new DateTime(2021, 6, 4), "Peter"),
        new Entry(new DateTime(2021, 6, 5), "Jane"),
        new Entry(new DateTime(2021, 6, 7), "Paul"), // <-- This date is also in the Tasks collection!
        new Entry(new DateTime(2021, 6, 9), "Mary"),
    };

    private class Assignment
    {
        public string Task { get; set; }

        public string Person { get; set; }
    }

    [Test]
    public void Join_two_collections_should_succeed()
    {
        var assignments = Tasks
            .Join(People, 
                task => task.Timestamp,
                person => person.Timestamp,
                (task, person) => new Assignment { Task = task.Value, Person = person.Value });

        Assert.AreEqual(1, assignments.Count());
        Assert.AreEqual("Buy groceries", assignments.First().Task);
        Assert.AreEqual("Paul", assignments.First().Person);
    }

    [Test]
    public async Task Zip_two_sequences_should_succeed()
    {
        var tasks = Observable.ToObservable(Tasks);
        var people = Observable.ToObservable(People);

        var sequence = tasks
            .Zip(people)
            .Select(pair => new Assignment { Task = pair.First.Value, Person = pair.Second.Value });

        var assignments = await sequence.ToList();

        Assert.AreEqual(1, assignments.Count);
        Assert.AreEqual("Buy groceries", assignments.First().Task);
        Assert.AreEqual("Paul", assignments.First().Person);
    }
}

Theodor Zoulias · Answer 1 · 2021-08-08T06:40:30.767

Here is a custom Join operator that could be used in order to solve this problem. It is based on the Merge, GroupByUntil and SelectMany operators:

/// <summary>
/// Correlates the elements of two sequences based on matching keys. Results are
/// produced for all combinations of correlated elements that have an overlapping
/// duration.
/// </summary>
public static IObservable<TResult> Join<TLeft, TRight, TKey, TResult>(
    this IObservable<TLeft> left,
    IObservable<TRight> right,
    Func<TLeft, TKey> leftKeySelector,
    Func<TRight, TKey> rightKeySelector,
    Func<TLeft, TRight, TResult> resultSelector,
    TimeSpan? keyDuration = null,
    IEqualityComparer<TKey> keyComparer = null)
{
    // Arguments validation omitted
    keyComparer ??= EqualityComparer<TKey>.Default;
    var groupDuration = keyDuration.HasValue ?
        Observable.Timer(keyDuration.Value) : Observable.Never<long>();
    return left
        .Select(x => (x, (TRight)default, Type: 1, Key: leftKeySelector(x)))
        .Merge(right.Select(x => ((TLeft)default, x, Type: 2, Key: rightKeySelector(x))))
        .GroupByUntil(e => e.Key, _ => groupDuration, keyComparer)
        .Select(g => (
            g.Where(e => e.Type == 1).Select(e => e.Item1),
            g.Where(e => e.Type == 2).Select(e => e.Item2).Replay().AutoConnect(0)
        ))
        .SelectMany(g => g.Item1.SelectMany(_ => g.Item2, resultSelector));
}

Usage example:

IObservable<Assignment> sequence = tasks
    .Join(people, t => t.Timestamp, p => p.Timestamp,
        (t, p) => new Assignment { Task = t.Value, Person = p.Value });

It should be noted that this problem cannot be solved with guaranteed 100% correctness without buffering all the elements that the two source sequences produce. Obviously this is not going to scale well if the sequences are infinite.

If sacrificing absolute correctness in favor of scalability is acceptable, the optional keyDuration argument can be used to configure the maximum duration that a stored key (and its associated elements) is kept in memory. An expired key can be reborn if new elements with that key are later produced by the left or right sequence.

The above implementation performs reasonably well with sequences containing a large number of elements. Joining two same-sized sequences, each having 100,000 elements, takes ~8 seconds on my PC.

score 1 · Answer 2 · answered Jun 20 '21 at 15:57

I don't like either of the posted answers. Both of them are variations on the same theme: keep all members of both sequences in memory indefinitely, iterate over the entire right sequence whenever a new left element comes in, and incrementally check the left key whenever a new right element comes in. Both answers use O(L + R) memory indefinitely and have O(R * L) time complexity (where L and R are the sizes of the left and right sequences).

If we were dealing with collections (or enumerables), that would be a sufficient answer. But we're not: we're dealing with observables, and the answers should acknowledge that. In an actual use case there could be large time gaps between elements. The question is posed as a test case stemming from an enumerable. If it were simply an enumerable, the right answer would be to convert back to Enumerable and use Linq's Join. If there's a possibility of a long-running process with time gaps, the answer should acknowledge that you may want to join only on elements that have occurred within some period of time, releasing memory in the process.
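
For the purely enumerable case mentioned above, a minimal sketch of converting back and using LINQ's Join might look like the following. It assumes the System.Reactive package is referenced; the class name EnumerableJoinDemo is hypothetical, and tuples stand in for the question's Entry class. Note that ToEnumerable blocks until the observables complete, so this only makes sense for finite sequences.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Reactive.Linq;

// Hypothetical helper for illustration; not part of the original answers.
public static class EnumerableJoinDemo
{
    public static List<string> JoinAsEnumerables()
    {
        // Tuples stand in for the question's Entry class (assumption).
        var tasks = new[]
        {
            (Timestamp: new DateTime(2021, 6, 6), Value: "Do homework"),
            (Timestamp: new DateTime(2021, 6, 7), Value: "Buy groceries"),
            (Timestamp: new DateTime(2021, 6, 8), Value: "Walk the dog"),
        }.ToObservable();

        var people = new[]
        {
            (Timestamp: new DateTime(2021, 6, 4), Value: "Peter"),
            (Timestamp: new DateTime(2021, 6, 5), Value: "Jane"),
            (Timestamp: new DateTime(2021, 6, 7), Value: "Paul"),
            (Timestamp: new DateTime(2021, 6, 9), Value: "Mary"),
        }.ToObservable();

        // ToEnumerable turns each observable back into a blocking enumerable,
        // after which the ordinary Enumerable.Join applies.
        return tasks.ToEnumerable()
            .Join(people.ToEnumerable(),
                t => t.Timestamp,
                p => p.Timestamp,
                (t, p) => $"{t.Value} -> {p.Value}")
            .ToList();
    }
}
```

Calling EnumerableJoinDemo.JoinAsEnumerables() on this data should yield the single pair for June 7.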

This satisfies the test answer, while allowing for a time box:

var sequence = tasks.Join(people,
        _ => Observable.Timer(TimeSpan.FromSeconds(.5)),
        _ => Observable.Timer(TimeSpan.FromSeconds(.5)),
        (t, p) => (task: t, person: p)
    )
    .Where(t => t.person.Timestamp == t.task.Timestamp)
    .Select(t => new Assignment { Task = t.task.Value, Person = t.person.Value });

This creates a 0.5-second window for each element, meaning a left element and a right element will match if they are emitted within 0.5 seconds of each other. After 0.5 seconds, each element is released from memory. If, for whatever reason, you didn't want to release anything and preferred to hold all objects in memory indefinitely, this would suffice:

var sequence = tasks.Join(people,
        _ => Observable.Never<Unit>(),
        _ => Observable.Never<Unit>(),
        (t, p) => (task: t, person: p)
    )
    .Where(t => t.person.Timestamp == t.task.Timestamp)
    .Select(t => new Assignment { Task = t.task.Value, Person = t.person.Value });

Nice! I didn't know that the `Join` operator worked this way. This may not satisfy the OP's requirements though, because it sacrifices correctness for the sake of scalability. Any selected time window could be more or less arbitrary, and could result in lost correlations. And passing `Observable.Never()` as `durationSelector` makes this solution only marginally more performant than Enigmativity's [solution](https://stackoverflow.com/a/67955053/11178549). It takes 15 sec to join two 5,000-sized sequences on my PC, demonstrating an O(n²) time complexity. – Theodor Zoulias, Jun 20 '21 at 19:46

score 0 · Accepted Answer · answered Jun 13 '21 at 05:07

The observable Zip operator works just the same as the enumerable version. You didn't use that in the first test so it's not likely to be the operator you need here.
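
To illustrate why Zip is the wrong tool, here is a small sketch using the enumerable Zip on the question's data. The class name ZipByPositionDemo is hypothetical, and tuples stand in for the Entry class. Zip pairs elements by position only, so the timestamps play no role in the result.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper for illustration; not part of the original answer.
public static class ZipByPositionDemo
{
    public static List<string> PairByPosition()
    {
        // Tuples stand in for the question's Entry class (assumption).
        var tasks = new[]
        {
            (Timestamp: new DateTime(2021, 6, 6), Value: "Do homework"),
            (Timestamp: new DateTime(2021, 6, 7), Value: "Buy groceries"),
            (Timestamp: new DateTime(2021, 6, 8), Value: "Walk the dog"),
        };

        var people = new[]
        {
            (Timestamp: new DateTime(2021, 6, 4), Value: "Peter"),
            (Timestamp: new DateTime(2021, 6, 5), Value: "Jane"),
            (Timestamp: new DateTime(2021, 6, 7), Value: "Paul"),
            (Timestamp: new DateTime(2021, 6, 9), Value: "Mary"),
        };

        // Zip pairs the 1st with the 1st, the 2nd with the 2nd, and so on.
        // It stops at the shorter sequence, so "Mary" is never paired.
        return tasks
            .Zip(people, (t, p) => $"{t.Value} -> {p.Value}")
            .ToList();
    }
}
```

On this data the call produces three pairs ("Do homework -> Peter", "Buy groceries -> Jane", "Walk the dog -> Paul"), none of which share a timestamp, which is why the second test in the question fails.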

What you need is simply the SelectMany operator.

Try this query:

var sequence =
    from t in tasks
    from p in people
    where t.Timestamp == p.Timestamp
    select new Assignment { Task = t.Value, Person = p.Value };

That works with your test.
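
For readers more comfortable with method syntax, the query above can be written out roughly the way the C# compiler translates it: a SelectMany with a result selector, followed by Where and Select. This is a sketch that assumes the System.Reactive package is referenced; the class name QueryDesugaringDemo is hypothetical and tuples stand in for the Entry class. Note that `t => people` is evaluated once per task, so `people` is subscribed to once for each element of `tasks`.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Reactive.Linq;

// Hypothetical helper for illustration; not part of the original answer.
public static class QueryDesugaringDemo
{
    public static IList<string> JoinByTimestamp()
    {
        // Tuples stand in for the question's Entry class (assumption).
        var tasks = new[]
        {
            (Timestamp: new DateTime(2021, 6, 6), Value: "Do homework"),
            (Timestamp: new DateTime(2021, 6, 7), Value: "Buy groceries"),
            (Timestamp: new DateTime(2021, 6, 8), Value: "Walk the dog"),
        }.ToObservable();

        var people = new[]
        {
            (Timestamp: new DateTime(2021, 6, 4), Value: "Peter"),
            (Timestamp: new DateTime(2021, 6, 5), Value: "Jane"),
            (Timestamp: new DateTime(2021, 6, 7), Value: "Paul"),
            (Timestamp: new DateTime(2021, 6, 9), Value: "Mary"),
        }.ToObservable();

        var sequence = tasks
            // "from t in tasks from p in people": every task is combined
            // with every person (a new subscription to people per task).
            .SelectMany(t => people, (t, p) => new { t, p })
            // "where t.Timestamp == p.Timestamp"
            .Where(x => x.t.Timestamp == x.p.Timestamp)
            // "select ..." (a string here instead of an Assignment)
            .Select(x => $"{x.t.Value} -> {x.p.Value}");

        // Block until the sequence completes and return the collected results.
        return sequence.ToList().Wait();
    }
}
```

Because every task is paired with every person before filtering, the work grows with the product of the two sequence lengths.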

Enigmativity


Perfect! Thank you very much. Also for clarifying that I was mixing zip and join the wrong way. – Martin Komischke Jun 13 '21 at 08:15
Isn't it important to `Publish` the `people` sequence? Or the LINQ query syntax does it automatically? – Theodor Zoulias Jun 13 '21 at 10:06
@TheodorZoulias - If you publish the people sequence you change the semantics. It would be the equivalent of calling `from t in tasks.Take(1) from p in people ...`. – Enigmativity Jun 14 '21 at 03:07
To be honest I am not familiar with the query syntax, and it confuses me quite a bit. If you could rewrite it in method syntax, I would be able to understand better what's going on. If I understand it correctly it's translated to `tasks.SelectMany(t => people.Where(p => t.Timestamp == p.Timestamp))`, which should cause a new subscription to the `people` sequence for each element in the `tasks` sequence. – Theodor Zoulias Jun 14 '21 at 03:18
I just tested it. This query causes indeed multiple subscriptions to the `people` sequence. So this solution is currently problematic. It can be fixed quite easily by `Publish`ing the `people` sequence though. – Theodor Zoulias Jun 18 '21 at 04:38
@TheodorZoulias - Can you post your test code? It doesn't match my results. – Enigmativity Jun 18 '21 at 04:42
Yeap, [here](https://dotnetfiddle.net/ymfO6t) is my code. The output is in a comment at the bottom. – Theodor Zoulias Jun 18 '21 at 04:45
You can't fix it by publishing `People` - that doesn't produce the same results. – Enigmativity Jun 18 '21 at 04:47
`People.Publish(ps => from t in Tasks from p in ps select new { t, p })` is equivalent to `from t in Tasks.Take(1) from p in People select new { t, p }`, but not `from t in Tasks from p in People select new { t, p }`. You're changing the semantics by publishing `People`. – Enigmativity Jun 18 '21 at 04:48
[Here](https://dotnetfiddle.net/lJjf8J) is how it can be fixed with `Publish` (using method syntax instead of query syntax). The output is again at the bottom. It is the same output, minus the multiple subscriptions to the people sequence. – Theodor Zoulias Jun 18 '21 at 04:54
@TheodorZoulias - It's a bogus test because both sequences are increasing. Change `people` to use `Observable.Range(1, 10).Select(x => 10 - x)` and try again. The output changes when you don't use `.Publish`. – Enigmativity Jun 18 '21 at 05:06
You are right. The plain vanilla `Publish` doesn't cut it. To get the same results a buffered publish (`Replay`) is needed instead. – Theodor Zoulias Jun 18 '21 at 05:18
Beyond the multiple-subscriptions problem, this solution should also be used with caution when the `tasks` and `people` sequences contain a large number of elements. I just tested it with sequences containing 2,000 elements each, and this solution took 8 seconds on my PC, just for printing the count of the joined pairs (in Release mode, without the debugger attached). It took 37 seconds for 4,000-sized sequences. This solution has an O(n²) computational complexity, which is not optimal to say the least. – Theodor Zoulias Jun 18 '21 at 05:25
@TheodorZoulias - `SelectMany` is O(n²). It's hardly surprising. – Enigmativity Jun 18 '21 at 06:14
Enigmativity btw take a look at this survey on GitHub: [LINQ Usage Survey](https://github.com/dotnet/runtime/issues/76205). People express their opinion/preference regarding the query syntax vs method syntax dilemma. – Theodor Zoulias Sep 28 '22 at 12:22
