I have two lists:

list1 = ['a', 'b', 'c', 'd']
list2 = ['e', 'f', 'g', 'h']

I already know that some of these elements are related through another list:

ref_list = [
   ['d', 'f'], ['a', 'e'], ['b', 'g'], ['c', 'f'], ['a', 'g'],
   ['a', 'f'], ['b', 'e'], ['b', 'f'], ['c', 'e'], ['c', 'g']
]

I would like to quickly identify the two groups, one from list1 and one from list2, for which every possible pair [list1 element, list2 element] is in ref_list.
In this case the solution would be:

[['a', 'b', 'c'], ['e', 'f', 'g']]
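
To make the requirement concrete, here is a minimal sketch of a check that a candidate pair of groups satisfies it. The function name all_pairs_present is only illustrative, and this verifies a given answer rather than finding one:

```python
from itertools import product

list1 = ['a', 'b', 'c', 'd']
list2 = ['e', 'f', 'g', 'h']
ref_list = [
    ['d', 'f'], ['a', 'e'], ['b', 'g'], ['c', 'f'], ['a', 'g'],
    ['a', 'f'], ['b', 'e'], ['b', 'f'], ['c', 'e'], ['c', 'g'],
]

def all_pairs_present(group1, group2, ref_list):
    """Return True if every (x, y) with x in group1, y in group2 is in ref_list."""
    # Convert to a set of tuples so each membership test is O(1) on average.
    ref_pairs = {tuple(pair) for pair in ref_list}
    return all(pair in ref_pairs for pair in product(group1, group2))

# The expected solution passes the check.
ok = all_pairs_present(['a', 'b', 'c'], ['e', 'f', 'g'], ref_list)

# Adding 'd' fails, because ['d', 'e'] and ['d', 'g'] are not in ref_list.
not_ok = all_pairs_present(['a', 'b', 'c', 'd'], ['e', 'f', 'g'], ref_list)
```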

I can think of some ways to do this for such small lists, but I need help for the case where list1, list2 and ref_list each have thousands of elements.

Lante Dellarovere

3 Answers

Filtering both lists by set membership against ref_list seems pretty fast:

import random
import string

def rand3():
    # Build a random three-letter string from ASCII letters
    return ''.join(random.choice(string.ascii_letters) for _ in range(3))

list1 = [rand3() for _ in range(9999)]
# len(list1) == 9999
list2 = [rand3() for _ in range(9999)]
# len(list2) == 9999
ref_list = [[rand3(), rand3()] for _ in range(9999)]
# len(ref_list) == 9999

refs1 = set([t[0] for t in ref_list])
# CPU times: user 2.45 ms, sys: 348 µs, total: 2.8 ms
# Wall time: 2.2 ms
# len(refs1) == 9656 for this run

refs2 = set([t[1] for t in ref_list])
# CPU times: user 658 µs, sys: 3.92 ms, total: 4.58 ms
# Wall time: 4.02 ms
# len(refs2) == 9676 for this run

list1_filtered = [v for v in list1 if v in refs1]
# CPU times: user 1.19 ms, sys: 4.34 ms, total: 5.53 ms
# Wall time: 3.76 ms
# len(list1_filtered) == 702 for this run

list2_filtered = [v for v in list2 if v in refs2]
# CPU times: user 3.05 ms, sys: 4.29 ms, total: 7.33 ms
# Wall time: 4.51 ms
# len(list2_filtered) == 697 for this run
mVChr

You can add the first element of each pair in ref_list to a set set1 and the second element to a set set2, then use list1 = list(set1) and list2 = list(set2). Sets contain no duplicates, and this should be fast for thousands of elements, since adding to a set and testing membership (e in set1) both take O(1) time on average.
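
The steps above can be sketched as follows. This is only an illustration using the question's example data; sorted() is used instead of list() purely so the output order is deterministic:

```python
ref_list = [
    ['d', 'f'], ['a', 'e'], ['b', 'g'], ['c', 'f'], ['a', 'g'],
    ['a', 'f'], ['b', 'e'], ['b', 'f'], ['c', 'e'], ['c', 'g'],
]

set1, set2 = set(), set()
for first, second in ref_list:
    set1.add(first)    # element that belongs to list1
    set2.add(second)   # element that belongs to list2

# Sets are unordered, so sort to get a stable result.
list1 = sorted(set1)
list2 = sorted(set2)

print(list1)  # ['a', 'b', 'c', 'd']
print(list2)  # ['e', 'f', 'g']
```

Note that on this data 'd' ends up in the first group because it appears in ['d', 'f'], even though ['d', 'e'] and ['d', 'g'] are not in ref_list. The approach collects every element that occurs in some pair; it does not check that all pairs between the two groups are present.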

Jackson H

Thanks, but that would not work. Unlike the example I gave, the solution for larger list1 and list2 would have multiple group pairs. – pythonprotein Apr 29 '19 at 21:26

You can use collections.Counter to generate counts for items in ref_list and use them to filter out items in the two lists that do not occur more than once:

from collections import Counter
[[i for i in lst if counts.get(i, 0) > 1] for lst, ref in zip((list1, list2), zip(*ref_list)) for counts in (Counter(ref),)]

This returns:

[['a', 'b', 'c'], ['e', 'f', 'g']]
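
For readability, the same one-liner can be unrolled into an explicit loop. This is a sketch of equivalent logic on the question's data, with the variable names firsts, seconds, column and result chosen here for illustration:

```python
from collections import Counter

list1 = ['a', 'b', 'c', 'd']
list2 = ['e', 'f', 'g', 'h']
ref_list = [
    ['d', 'f'], ['a', 'e'], ['b', 'g'], ['c', 'f'], ['a', 'g'],
    ['a', 'f'], ['b', 'e'], ['b', 'f'], ['c', 'e'], ['c', 'g'],
]

# Split ref_list into its first and second columns.
firsts, seconds = zip(*ref_list)

result = []
for lst, column in ((list1, firsts), (list2, seconds)):
    counts = Counter(column)  # how often each item occurs in this column
    # Keep only items that occur more than once; a Counter returns 0 for
    # missing keys, so items absent from ref_list are dropped as well.
    result.append([item for item in lst if counts[item] > 1])

print(result)  # [['a', 'b', 'c'], ['e', 'f', 'g']]
```

Here 'd' is dropped because it occurs only once in the first column, and 'h' because it does not occur at all.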
blhsing