What is the best approach to matching/joining elements in two non-identical unsorted python lists of different lengths?

Question

I have for example the following python lists:

a = [1,2,1,3,2,1,1,1,2,1,1,1,1,1,3,1,2]  
b = [1,1,2,1,3,1,1,1,1,2,2,1,1,1,1,3,1,2]

and I'd like to obtain the tuples of indices of the elements that can be confidently matched, such as:

[(0,0), (1,2), (2,3), (3,4), (8,9), (14,15), (15,16), (16,17)]

The data represent the sizes of groups of people recorded arriving at and leaving a queue. The data isn't perfect, so the sums of a and b don't match, and people don't always leave in the order they arrive.

I realise it depends on several variables (or threshold parameters), but am just looking for suggestions about how best to approach the problem. I'm happy to use Pandas/Numpy/Scipy to do the job.

I've realised it's quite hard to explain. By eye, it's easy for me to match partial sequences such as 1,2,1,3, but I'm not finding it so easy to work out a good algorithmic approach.

I'm not fully understanding the specifications. For example, why is (0, 1) not in your list? a[0] == b[1]. — timgeb, Jan 13 '16 at 16:46
Thanks @timegeb and KaustavDatta. The output is just an example. I guess there are many possible match sets depending on the fuzzy matching criteria. (8,9) is preferred over (4,9) in my case as this is a rough queue of people, which generally leave in the order they arrive, but don't have to. — Incompetent Perfectionist, Jan 13 '16 at 17:01
Well, to help you we need exact specifications, not fuzzy criteria. — timgeb, Jan 13 '16 at 18:24
@timgeb I appreciate I may not have been very precise, but I was looking for an approach rather than a solution. Thanks for taking the time to look at it anyway. — Incompetent Perfectionist, Jan 14 '16 at 19:08

Padraic Cunningham · Answer 1 · 2016-01-13T18:03:50.130

0

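I don't fully understand your output, but to pair each element of a with the earliest unused equal element of b, in order:

from collections import defaultdict, deque

a = [1, 2, 1, 3, 2, 1, 1, 1, 2, 1, 1, 1, 1, 1, 3, 1, 2]
b = [1, 1, 2, 1, 3, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1, 3, 1, 2]

# Map each value in b to a queue of the positions where it occurs.
d = defaultdict(deque)
for i, j in enumerate(b):
    d[j].append(i)

# Pair each element of a with the earliest unused position of that value in b.
print([(i, d[j].popleft()) for i, j in enumerate(a)])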
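The snippet above assumes b holds at least as many copies of every value as a does; otherwise `popleft()` raises `IndexError`. A minimal sketch of a more forgiving variant, assuming that unmatched elements of a should simply be skipped (the function name `ordered_matches` is made up for illustration):

```python
from collections import defaultdict, deque

def ordered_matches(a, b):
    """Pair each element of a with the earliest unused equal element of b."""
    positions = defaultdict(deque)
    for j, value in enumerate(b):
        positions[value].append(j)

    pairs = []
    for i, value in enumerate(a):
        queue = positions.get(value)  # .get() does not create empty entries
        if queue:                     # skip values b lacks or has run out of
            pairs.append((i, queue.popleft()))
    return pairs

print(ordered_matches([1, 2, 2, 3], [2, 1]))  # [(0, 1), (1, 0)]
```

Here the second 2 and the 3 in the first list are left unmatched rather than causing an error.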

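The only way I can match your output is if we only consider elements of a that are not part of a run of repeated values, and look for each one at the same index in b or one position to the right:

from itertools import groupby
from operator import itemgetter

def pairs(a, b):
    # Group consecutive equal values of a, keeping each element's index.
    for k, v in groupby(enumerate(a), key=itemgetter(1)):
        data = next(v)
        # Only consider runs of length one (no second element in the group).
        if next(v, None) is None:
            ind, val = data
            # Check the same index in b, then the next one, staying in bounds.
            if ind < len(b) and b[ind] == val:
                yield (ind, ind)
            elif ind + 1 < len(b) and b[ind + 1] == val:
                yield (ind, ind + 1)

print(list(pairs(a, b)))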

Which would give you:

[(0, 0), (1, 2), (2, 3), (3, 4), (8, 9), (14, 15), (15, 16), (16, 17)]
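To see which elements of a the `pairs` function even considers, it may help to list the indices that sit in runs of length one. This is only an illustrative sketch; `singleton_indices` is a hypothetical helper, not part of the answer's code:

```python
from itertools import groupby

def singleton_indices(seq):
    """Return the indices of elements that are not part of a repeated run."""
    indices = []
    position = 0
    for _, run in groupby(seq):
        length = len(list(run))
        if length == 1:
            indices.append(position)
        position += length
    return indices

a = [1, 2, 1, 3, 2, 1, 1, 1, 2, 1, 1, 1, 1, 1, 3, 1, 2]
print(singleton_indices(a))  # [0, 1, 2, 3, 4, 8, 14, 15, 16]
```

Index 4 is a singleton in a but has no equal value at index 4 or 5 of b, which is why it does not appear in the result.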

edited Jan 13 '16 at 18:03

answered Jan 13 '16 at 17:12

Padraic Cunningham


Thanks. This gives me something to go on, and is generally useful, but I am finding that I get IndexError: pop from an empty deque if I change list a. Ideally it would make a best efforts match whatever the input. Not all elements from either side need to be matched. – Incompetent Perfectionist Jan 13 '16 at 17:24
That is because you need to account for the different length of the lists and different number of repeated elements, without some actual criteria to go on it is hard to suggest anything more specific – Padraic Cunningham Jan 13 '16 at 17:31
I see, thanks. I'll investigate those collections. It's my first question, so it won't mark your useful comment yet. – Incompetent Perfectionist Jan 13 '16 at 17:54
@IncompetentPerfectionist, why is there no match for `1, 1, 1,` and `1,1,1,1`, why do the indexes jump to `(8, 9)`? – Padraic Cunningham Jan 13 '16 at 17:55
Sorry about that... I realise I haven't defined this very well! In my dataset there are mostly 1s, so I was thinking I'd need to match them separately after first doing the easier match with the less frequent numbers and number sequences. There should indeed be the matches you mention though. – Incompetent Perfectionist Jan 13 '16 at 18:19
I think adding a broader sample and the expected output, with the logic behind it, would be the best approach – Padraic Cunningham Jan 13 '16 at 18:44

score 0 · Accepted Answer · answered Jan 13 '16 at 20:46

I finally realised Python has the difflib library for just this kind of thing!

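from difflib import SequenceMatcher

a = [1, 2, 1, 3, 2, 1, 1, 1, 2, 1, 1, 1, 1, 1, 3, 1, 2]
b = [1, 1, 2, 1, 3, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1, 3, 1, 2]

# autojunk=False turns off the heuristic that treats very frequent items as junk.
s = SequenceMatcher(None, a, b, autojunk=False)

# Expand each matching block into (index in a, index in b) pairs.
# The final block is a zero-size sentinel, so it contributes nothing.
matched_element_indices = []
for m in s.get_matching_blocks():
    matched_element_indices += [(m.a + i, m.b + i) for i in range(m.size)]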

It produces this:

In : matched_element_indices
Out: [(0, 1), (1, 2), (2, 3), (3, 4), (5, 6), (6, 7), (7, 8), (8, 9), 
           (10, 11), (11, 12), (12, 13), (13, 14), (14, 15), (15, 16), (16, 17)]
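Since not every element needs a partner, it can be handy to also collect the indices that were left out, for example to handle them in a second pass. A rough sketch along those lines, assuming the same `SequenceMatcher` setup as above (`match_with_leftovers` is a made-up name for illustration):

```python
from difflib import SequenceMatcher

def match_with_leftovers(a, b):
    """Return matched index pairs plus the unmatched indices of a and of b."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    pairs = [
        (block.a + k, block.b + k)
        for block in matcher.get_matching_blocks()
        for k in range(block.size)
    ]
    matched_a = {i for i, _ in pairs}
    matched_b = {j for _, j in pairs}
    unmatched_a = [i for i in range(len(a)) if i not in matched_a]
    unmatched_b = [j for j in range(len(b)) if j not in matched_b]
    return pairs, unmatched_a, unmatched_b

a = [1, 2, 1, 3, 2, 1, 1, 1, 2, 1, 1, 1, 1, 1, 3, 1, 2]
b = [1, 1, 2, 1, 3, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1, 3, 1, 2]
pairs, left_a, left_b = match_with_leftovers(a, b)
print("unmatched in a:", left_a)
print("unmatched in b:", left_b)
```

The matching blocks are always in increasing order on both sides, so this approach never pairs people who left out of arrival order; those cases end up in the leftover lists and would need separate handling.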
