What is the fastest non-LINQ algorithm to 'pair up' matching items from multiple separate lists?

Question

IMPORTANT NOTE

To the people who flagged this as a duplicate, please understand we do NOT want a LINQ-based solution. Our real-world example has several original lists in the tens-of-thousands range and LINQ-based solutions are not performant enough for our needs since they have to walk the lists several times to perform their function, expanding with each new source list.

That is why we are specifically looking for a non-LINQ algorithm, such as the one suggested in this answer below where they walk all lists simultaneously, and only once, via enumerators. That seems to be the best so far, but I am wondering if there are others.

Now back to the question...

For the sake of explaining our issue, consider this hypothetical problem:

I have multiple lists, but to keep this example simple, let's limit it to two, ListA and ListB, both of which are of type List<int>. Their data is as follows:

List A    List B
  1         2
  2         3
  4         4
  5         6
  6         8
  8         9
  9        10

...however the real lists can have tens of thousands of rows.

We next have a class called ListPairing that's simply defined as follows:

public class ListPairing
{
    public int? ASide{ get; set; }
    public int? BSide{ get; set; }
}

where each 'side' property represents one of the lists. (i.e. if there were four lists, it would also have a CSide and a DSide.)

What we are trying to do is construct a List<ListPairing> with the data initialized as follows:

A Side    B Side
  1         -
  2         2
  -         3
  4         4
  5         -
  6         6
  8         8
  9         9
  -        10

Again, note there is no row with '7', since that value appears in neither list.

As you can see, the results look like a full outer join. However, please see the update below.

Now to get things started, we can simply do this...

// ToList() materializes the query so new pairings can be appended during the back-fill.
var finalList = ListA.Select(valA => new ListPairing(){ ASide = valA} ).ToList();

Which yields...

A Side    B Side
  1         -
  2         -
  4         -
  5         -
  6         -
  8         -
  9         -

Now we want to back-fill the values from List B. This requires first checking whether a ListPairing already exists whose ASide matches the value from List B and, if so, setting its BSide.

If there is no existing ListPairing with a matching ASide, a new ListPairing is instantiated with only the BSide set (ASide is left null).

However, I get the feeling that's not the most efficient way to do this considering all of the required 'FindFirst' calls it would take. (These lists can be tens of thousands of items long.)

Taking a union of those lists once up front, on the other hand, yields the following values...

1, 2, 3, 4, 5, 6, 8, 9, 10 (Note there is no #7)

My thinking was to somehow use that ordered union of the values, then 'walking' both lists simultaneously, building up ListPairings as needed. That eliminates repeated calls to FindFirst, but I'm wondering if that's the most efficient way to do this.
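As a rough illustration of the union-then-walk idea described above, here is a minimal sketch. The names `UnionWalkSketch` and `PairViaUnion` are made up for this example, and it assumes both lists are sorted ascending and contain no duplicates, as in the sample data.

```csharp
using System;
using System.Collections.Generic;

public class ListPairing
{
    public int? ASide { get; set; }
    public int? BSide { get; set; }
}

// Hypothetical helper, not part of the original question.
public static class UnionWalkSketch
{
    // Assumes listA and listB are sorted ascending with no duplicates.
    public static List<ListPairing> PairViaUnion(List<int> listA, List<int> listB)
    {
        // Ordered union of every key that appears in either list.
        var keys = new SortedSet<int>(listA);
        keys.UnionWith(listB);

        var result = new List<ListPairing>(keys.Count);
        int a = 0, b = 0; // current position in each source list

        foreach (int key in keys)
        {
            var pairing = new ListPairing();
            // A list contributes to this row only if its current head equals the key.
            if (a < listA.Count && listA[a] == key) { pairing.ASide = key; a++; }
            if (b < listB.Count && listB[b] == key) { pairing.BSide = key; b++; }
            result.Add(pairing);
        }
        return result;
    }
}
```

Each source list is still walked only once, but building the sorted union costs extra time and memory up front, which is the part a direct simultaneous walk would avoid.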

Thoughts?

Update

People have suggested this is a duplicate of getting a full outer join using LINQ because the results are the same...

I am not after a LINQ full outer join. I'm after a performant algorithm.

As such, I have updated the question.

The reason I bring this up is the LINQ needed to perform that functionality is much too slow for our needs. In our model, there are actually four lists, and each can be in the tens of thousands of rows. That's why I suggested the 'Union' approach of the IDs at the very end to get the list of unique 'keys' to walk through, but I think the posted answer on doing the same but with the enumerators is an even better approach as you don't need the list of IDs up front. This would yield a single pass through all items in the lists simultaneously which would easily out-perform the LINQ-based approach.
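Since the real model has four lists, here is one possible way the single-pass walk might generalize to any number of sorted lists. This is only a sketch: `MultiListSketch` and `PairUpAll` are hypothetical names, each output row is a plain `int?[]` with one slot per source list (standing in for ASide, BSide, CSide, DSide), and every list is assumed to be sorted ascending with no duplicates.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of the original question.
public static class MultiListSketch
{
    // Returns one row per distinct key; row[i] is null when list i lacks that key.
    // Assumes every list is sorted ascending with no duplicates.
    public static List<int?[]> PairUpAll(params List<int>[] lists)
    {
        var positions = new int[lists.Length]; // current index into each list
        var rows = new List<int?[]>();

        while (true)
        {
            // Find the smallest head value among the lists that still have items.
            int? min = null;
            for (int i = 0; i < lists.Length; i++)
            {
                if (positions[i] < lists[i].Count)
                {
                    int head = lists[i][positions[i]];
                    if (min == null || head < min.Value) min = head;
                }
            }
            if (min == null) break; // every list is exhausted

            // Emit one row and advance each list whose head equals the minimum.
            var row = new int?[lists.Length];
            for (int i = 0; i < lists.Length; i++)
            {
                if (positions[i] < lists[i].Count && lists[i][positions[i]] == min.Value)
                {
                    row[i] = min;
                    positions[i]++;
                }
            }
            rows.Add(row);
        }
        return rows;
    }
}
```

A call such as `MultiListSketch.PairUpAll(listA, listB, listC, listD)` visits every element exactly once. Scanning all heads per row is fine for a handful of lists; with many lists a priority queue over the heads would likely be a better fit.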

Are the lists already sorted? If so, I don't think you need a union or join: just walking both lists simultaneously should be enough, shouldn't it? You can pick up what values you'll need as you go. – Rup, Apr 09 '13 at 23:29

score 3 · Accepted Answer · edited Apr 12 '13 at 22:29

This didn't turn out as neat as I'd hoped, but if both input lists are sorted then you can just walk through them together, comparing the head elements of each one: if they're equal then you have a pair; otherwise emit the smaller one on its own and advance that list.

public static IEnumerable<ListPairing> PairUpLists(IEnumerable<int> sortedAList,
                                                   IEnumerable<int> sortedBList)
{
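    // IEnumerator<T> is IDisposable (per Servy's comment), so dispose both when
    // iteration ends. `using` declarations need C# 8 or later; on older compilers
    // use using (...) { } blocks with braces around the rest of the method.
    using var aEnum = sortedAList.GetEnumerator();
    using var bEnum = sortedBList.GetEnumerator();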
    bool haveA = aEnum.MoveNext();
    bool haveB = bEnum.MoveNext();

    while (haveA && haveB)
    {
        // We still have values left on both lists.
        int comparison = aEnum.Current.CompareTo(bEnum.Current);
        if (comparison < 0)
        {
            // The heads of the two remaining sequences do not match and A's is
            // lower. Generate a partial pair with the head of A and advance the
            // enumerator.
            yield return new ListPairing() {ASide = aEnum.Current};
            haveA = aEnum.MoveNext();
        }
        else if (comparison == 0)
        {
            // The heads of the two sequences match. Generate a pair.
            yield return new ListPairing() {
                    ASide = aEnum.Current,
                    BSide = bEnum.Current
                };
            // Advance both enumerators
            haveA = aEnum.MoveNext();
            haveB = bEnum.MoveNext();
        }
        else
        {
            // No match and B is the lowest. Generate a partial pair with B.
            yield return new ListPairing() {BSide = bEnum.Current};
            // and advance the enumerator
            haveB = bEnum.MoveNext();
        }
    }
    if (haveA)
    {
        // We still have elements on list A but list B is exhausted.
        do
        {
            // Generate a partial pair for all remaining A elements.
            yield return new ListPairing() { ASide = aEnum.Current };
        } while (aEnum.MoveNext());
    }
    else if (haveB)
    {
        // List A is exhausted but we still have elements on list B.
        do
        {
            // Generate a partial pair for all remaining B elements.
            yield return new ListPairing() { BSide = bEnum.Current };
        } while (bEnum.MoveNext());
    }
}

edited Apr 12 '13 at 22:29

Mark A. Donohoe

answered Apr 09 '13 at 23:48

Rup

Remember that `IEnumerator` is `IDisposable`, you need to dispose of it, ideally through wrapping it in a `using` statement. – Servy Apr 10 '13 at 13:53
I didn't know that. Sure, the first two `var` lines should be `using (var...)` then, and braces around everything else. I've got my test project on another machine - I'll try that later and edit it in. – Rup Apr 10 '13 at 14:44
I think this is close, but the issue is it looks like you're putting all the paired versions up-front, and it would 'bomb out' on the first non-paired match and doesn't look like it would catch any more after that. Still, this gives me an idea. If you can rework your answer a little, I can give you the vote. – Mark A. Donohoe Apr 12 '13 at 10:45
@Rup, Also, we have access to the source lists directly inside the function (members of the same class) and we want to return the actual List not an enumerator. I think that means you'd have full control over the enumerators for disposal, perhaps wrapping them in simple 'using' statements. – Mark A. Donohoe Apr 12 '13 at 11:30
Right - in that case you can just maintain list indexes and not bother with enumerators. I wasn't sure whether I wanted to do that when I sat down to write it and ended up not doing, but it doesn't really matter. No, the first loop handles all values until one of the two lists runs out, and will generate both complete pairs and single values. The two loops at the bottom just finish off whichever list was left and they won't both run. – Rup Apr 12 '13 at 13:19
Yeah, I caught my error in assuming they were *matched* pairs, not just both having values. You're right. I did add an 'else' to the last 'if' check though. And it does look like enumerators will be fine since I'm not yielding like you're doing. I'm just inserting into the list. Whatever the case, this is enough to get the vote. Thanks! – Mark A. Donohoe Apr 12 '13 at 22:31
@Servy, are you sure about that? (your IDisposable statement?) It's not in the documentation. (http://msdn.microsoft.com/en-us/library/system.collections.ienumerator.aspx) nor in the metadata for IEnumerator. Where did you get your information? – Mark A. Donohoe Apr 14 '13 at 00:30
More info: http://stackoverflow.com/questions/232558/why-ienumerator-of-t-inherts-from-idisposable-but-non-generic-ienumerator-does – Mark A. Donohoe Apr 14 '13 at 00:31
@MarqueIV I was referring to the generic version of the interface, since in this context it is the generic version that is used. As your own link indicates; the generic version of the interface implements `IDisposable`. – Servy Apr 14 '13 at 19:01
I just re-visited this because of another similar issue we're having and I really have to say I like this approach! Thanks again, @Rup! – Mark A. Donohoe Mar 14 '14 at 14:45
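The index-based variant discussed in the comments above (returning the actual List rather than yielding, and tracking positions instead of enumerators) might look roughly like the following. It is a sketch only: `IndexWalkSketch` is a made-up class name, and both inputs are assumed to be sorted ascending with no duplicates.

```csharp
using System;
using System.Collections.Generic;

public class ListPairing
{
    public int? ASide { get; set; }
    public int? BSide { get; set; }
}

// Hypothetical helper illustrating the index-based walk.
public static class IndexWalkSketch
{
    // Assumes sortedA and sortedB are sorted ascending with no duplicates.
    public static List<ListPairing> PairUpLists(List<int> sortedA, List<int> sortedB)
    {
        // Upper bound on the row count; avoids resizing at the cost of some slack.
        var result = new List<ListPairing>(sortedA.Count + sortedB.Count);
        int a = 0, b = 0;

        // Both lists still have items: compare the heads.
        while (a < sortedA.Count && b < sortedB.Count)
        {
            int comparison = sortedA[a].CompareTo(sortedB[b]);
            if (comparison < 0)
                result.Add(new ListPairing { ASide = sortedA[a++] });
            else if (comparison > 0)
                result.Add(new ListPairing { BSide = sortedB[b++] });
            else
                result.Add(new ListPairing { ASide = sortedA[a++], BSide = sortedB[b++] });
        }

        // At most one of these loops runs: finish whichever list is left over.
        while (a < sortedA.Count) result.Add(new ListPairing { ASide = sortedA[a++] });
        while (b < sortedB.Count) result.Add(new ListPairing { BSide = sortedB[b++] });
        return result;
    }
}
```

Because nothing is yielded lazily, there are no enumerators to dispose, and the whole result is built in one pass over each list.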

score 0 · Answer 2 · Tim Jarvis · 2013-04-10T00:03:13.507

var list1 = new List<int?>(){1,2,4,5,6,8,9};
var list2 = new List<int?>(){2,3,4,6,8,9,10};

var left = from i in list1
            join k in list2 on i equals k
            into temp
            from k in temp.DefaultIfEmpty()
            select new {a = i, b = (i == k) ? k : (int?)null};


var right = from k in list2
            join i in list1 on k equals i
            into temp
            from i in temp.DefaultIfEmpty()
            select new {a = (i == k) ? i : (int?)null, b = k};

var result = left.Union(right);

If you need the ordering to be the same as in your example, then you will need to provide an index and order by that (then remove duplicates):

var result = left.Select((o,i) => new {o.a, o.b, i}).Union(right.Select((o, i) => new {o.a, o.b, i})).OrderBy( o => o.i);

result.Select( o => new {o.a, o.b}).Distinct();

edited Apr 10 '13 at 00:03

answered Apr 09 '13 at 23:24

Tim Jarvis

You're focusing on the nits. Please see my ListPairing class. I am after a List<ListPairing> result, not a simple list of results. Again, please see the question. – Mark A. Donohoe Apr 10 '13 at 01:25
@MarqueIV so all you need to do is project into your class... result.Select( o => new {o.a, o.b}).Distinct().Select(o => new ListPairing(){ASide = o.a, BSide = o.b}) – Tim Jarvis Apr 10 '13 at 04:56
Couldn't you just go straight into a ListPairing object to start with instead of the anonymous type? And for that matter, is there a way to just do the equivalent of a full outer join to begin with? If so, wouldn't that change your code from three Linq statements down to one? (This isn't rhetorical as I really don't know if there's a Linq equivalent of a full outer join.) – Mark A. Donohoe Apr 10 '13 at 13:49
Just wondering if you gave thought to my last question. Again, not a challenge or rhetorical. I'm really asking. – Mark A. Donohoe Apr 17 '13 at 03:46
Sure, in fact it's almost exactly the same code: instead of projecting an anonymous type, you can new up a declared type and initialize it. So the select new {.... simply becomes select new ListPairing {.... – Tim Jarvis Apr 17 '13 at 04:38
Thinking of deleting this now, as after your edits to the question specifying non-linq, it looks like I ignored that...now this is downvote bait. – Tim Jarvis Apr 17 '13 at 04:40
I pushed it up again to counter. You may want to just edit your question and say you wrote that before I changed it to clarify I was after an algorithm and not a LINQ solution to stave off the down-voters and are leaving it up to help others. – Mark A. Donohoe Apr 17 '13 at 04:49
