
For example, I have the following Python lists:

a = [1,2,1,3,2,1,1,1,2,1,1,1,1,1,3,1,2]  
b = [1,1,2,1,3,1,1,1,1,2,2,1,1,1,1,3,1,2]  

and I'd like to obtain the tuples of indices of the elements that can be confidently matched, such as:

[(0,0), (1,2), (2,3), (3,4), (8,9), (14,15), (15,16), (16,17)]  

The data represent the sizes of groups of people recorded arriving at and leaving a queue. The data isn't perfect, so the sums of a and b don't match, and people don't always leave in the order they arrive.

I realise it depends on several variables (or threshold parameters), but am just looking for suggestions about how best to approach the problem. I'm happy to use Pandas/Numpy/Scipy to do the job.

I've realised it's quite hard to explain. By eye, it's easy for me to match the partial sequences, such as 1,2,1,3. Not finding it so easy to work out a good algorithmic approach though.

2 Answers


I don't fully understand your expected output, but to pair each element of a, in order, with the first unused element of b that has the same value:

a = [1, 2, 1, 3, 2, 1, 1, 1, 2, 1, 1, 1, 1, 1, 3, 1, 2]
b = [1, 1, 2, 1, 3, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1, 3, 1, 2]
from collections import defaultdict, deque

d = defaultdict(deque)
for i, j in enumerate(b):
    d[j].append(i)

print([(i, d[j].popleft()) for i, j in enumerate(a)])
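
The list comprehension above raises IndexError as soon as a contains more copies of a value than b does. A minimal sketch of a guarded variant follows, assuming that elements of a with no remaining partner in b should simply be skipped; the function name `greedy_pairs` is hypothetical and not part of the original answer.

```python
from collections import defaultdict, deque

def greedy_pairs(a, b):
    """Pair each element of a with the earliest unused equal element of b.

    Elements of a that have no remaining partner in b are skipped
    (an assumption; the question does not define this case).
    """
    positions = defaultdict(deque)
    for j, value in enumerate(b):
        positions[value].append(j)

    pairs = []
    for i, value in enumerate(a):
        # .get() avoids creating empty entries for values missing from b.
        queue = positions.get(value)
        if queue:  # None or an empty deque means no partner is left
            pairs.append((i, queue.popleft()))
    return pairs

print(greedy_pairs([1, 2, 2, 3], [2, 1]))  # [(0, 1), (1, 0)]
```

This stays a purely value-based, first-come pairing, so it says nothing about how confident any individual match is.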

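The only way I can match your output is to consider only elements of a that are not part of a run of repeated values, and to pair each one with the element of b at the same index or the index immediately after it: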

from itertools import groupby 
from operator import itemgetter

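def pairs(a, b):
    # Group consecutive equal values in a, keeping each element's index.
    for k, v in groupby(enumerate(a), key=itemgetter(1)):
        data = next(v)
        # Only consider values that are not part of a run of repeats.
        if next(v, None) is None:
            ind, val = data
            # Match against the same index in b, or the one just after it.
            if ind < len(b) and b[ind] == val:
                yield (ind, ind)
            elif ind + 1 < len(b) and val == b[ind + 1]:
                yield (ind, ind + 1)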

print(list(pairs(a, b)))

Which would give you:

[(0, 0), (1, 2), (2, 3), (3, 4), (8, 9), (14, 15), (15, 16), (16, 17)]
Padraic Cunningham
  • Thanks. This gives me something to go on, and is generally useful, but I find that I get IndexError: pop from an empty deque if I change list a. Ideally it would make a best-effort match whatever the input. Not all elements from either side need to be matched. – Incompetent Perfectionist Jan 13 '16 at 17:24
  • That is because you need to account for the different lengths of the lists and the different numbers of repeated elements. Without some actual criteria to go on, it is hard to suggest anything more specific. – Padraic Cunningham Jan 13 '16 at 17:31
  • I see, thanks. I'll investigate those collections. It's my first question, so it won't let me mark your useful comment yet. – Incompetent Perfectionist Jan 13 '16 at 17:54
  • @IncompetentPerfectionist, why is there no match for `1, 1, 1,` and `1,1,1,1`, why do the indexes jump to `(8, 9)`? – Padraic Cunningham Jan 13 '16 at 17:55
  • Sorry about that... I realise I haven't defined this very well! In my dataset there are mostly 1s, so I was thinking I'd need to match them separately after first doing the easier match with the less frequent numbers and number sequences. There should indeed be the matches you mention though. – Incompetent Perfectionist Jan 13 '16 at 18:19
  • I think adding a broader sample and the expected output, with the logic behind it, would be the best approach – Padraic Cunningham Jan 13 '16 at 18:44

I finally realised Python has the difflib library for just this kind of thing!

a = [1, 2, 1, 3, 2, 1, 1, 1, 2, 1, 1, 1, 1, 1, 3, 1, 2]  
b = [1, 1, 2, 1, 3, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1, 3, 1, 2]  

from difflib import SequenceMatcher  

s = SequenceMatcher(None, a, b, autojunk=False)  

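matched_element_indices = []
# Each matching block m means a[m.a:m.a + m.size] == b[m.b:m.b + m.size];
# the final block is a zero-size sentinel and contributes nothing.
for m in s.get_matching_blocks():
    matched_element_indices += [(m.a + i, m.b + i) for i in range(m.size)]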

It produces this:

In : matched_element_indices
Out: [(0, 1), (1, 2), (2, 3), (3, 4), (5, 6), (6, 7), (7, 8), (8, 9), 
           (10, 11), (11, 12), (12, 13), (13, 14), (14, 15), (15, 16), (16, 17)]
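
The question mentions threshold parameters and "confident" matches. One possible way to bring that in is sketched below, under the assumption that a match counts as confident when it belongs to a matching block of at least `min_size` consecutive elements. The function name `confident_matches` and the parameter `min_size` are hypothetical names introduced for illustration only.

```python
from difflib import SequenceMatcher

def confident_matches(a, b, min_size=2):
    """Return (index_in_a, index_in_b) pairs from matching blocks
    that are at least min_size elements long.

    Treating block length as a measure of confidence is an assumption,
    not something the question defines.
    """
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    pairs = []
    for block in matcher.get_matching_blocks():
        # The final sentinel block has size 0 and never adds any pairs.
        if block.size >= min_size:
            pairs.extend((block.a + k, block.b + k) for k in range(block.size))
    return pairs

a = [1, 2, 1, 3, 2, 1, 1, 1, 2, 1, 1, 1, 1, 1, 3, 1, 2]
b = [1, 1, 2, 1, 3, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1, 3, 1, 2]

print(confident_matches(a, b, min_size=3))
```

With `min_size=1` this returns every pair from get_matching_blocks. Larger values keep only the longer runs, which leaves the isolated 1s to be matched in a separate pass if needed.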