This question concerns Python 3.6. Suppose I have lists of sets, for example:

L1 = [{2,7},{2,7,8},{2,3,6,7},{1,2,4,5,7}]      
L2 = [{3,6},{1,3,4,6,7},{2,3,5,6,8}]      
L3 = [{2,5,7,8},{1,2,3,5,7,8}, {2,4,5,6,7,8}] 

I need to find all the intersection sets between each element of L1, L2, and L3. E.g.:

    {2,7}.intersection({3,6}).intersection({2,5,7,8})= empty  
    {2,7}.intersection({3,6}).intersection({1,2,3,5,7,8})= empty  
    {2,7}.intersection({3,6}).intersection({2,4,5,6,7,8})= empty  
    {2,7}.intersection({1,3,4,6,7}).intersection({2,5,7,8})= {7}  
    {2,7}.intersection({1,3,4,6,7}).intersection({1,2,3,5,7,8})= {7}  
    {2,7}.intersection({1,3,4,6,7}).intersection({2,4,5,6,7,8})= {7}

...............................

Continuing like this for every combination, we end up with the following set:

{{empty},{2},{3},{6},{7},{2,3},{2,5},{2,6},{2,8},{3,7},{4,7},{6,7}}

Suppose:
- I have many lists L1, L2, L3, ..., Ln, and I do not know in advance how many there are.
- Each of the lists L1, L2, L3, ..., Ln is big, so I cannot load all of them into memory.

My question is: is there a way to calculate that set sequentially, e.g., calculate the intersections between L1 and L2, then use the result to calculate with L3, and so on?

Håken Lid
cdt

3 Answers

1

You can first calculate all possible intersections between L1 and L2, then calculate the intersections between that set and L3 and so on.

list_generator = iter([  # some generator that produces your lists 
    [{2,7}, {2,7,8}, {2,3,6,7}, {1,2,4,5,7}],      
    [{3,6}, {1,3,4,6,7}, {2,3,5,6,8}],      
    [{2,5,7,8}, {1,2,3,5,7,8}, {2,4,5,6,7,8}], 
])
# For example, you can read the lists from a file, one list per line
# (adapt the format to your needs):
def list_generator_from_file(filename):
    with open(filename) as f:
        for line in f:
            # '2,7|2,7,8' -> [{2, 7}, {2, 7, 8}], assuming integer elements
            yield [set(map(int, part.split(','))) for part in line.strip().split('|')]
# list_generator would then be list_generator_from_file('myfile.dat')

intersections = next(list_generator)  # get first list
new_intersections = set()

for list_ in list_generator:
    for old in intersections:
        for new in list_:
            new_intersections.add(frozenset(old.intersection(new)))
    # at this point we don't need the current list any more
    intersections, new_intersections = new_intersections, set()

print(intersections)

The output looks like {frozenset({7}), frozenset({3, 7}), frozenset({3}), frozenset({6}), frozenset({2, 6}), frozenset({6, 7}), frozenset(), frozenset({8, 2}), frozenset({2, 3}), frozenset({1, 7}), frozenset({4, 7}), frozenset({2, 5}), frozenset({2})} (the order may differ), which matches your result except for the set {1,7}, which you missed.
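To try the file-based variant end to end, here is a minimal sketch. It assumes the `|`- and `,`-separated line format used in the code above and integer elements; `parse_line`, `lists_from_file`, `all_intersections`, and the temporary file are names made up for illustration, not part of the original answer.

```python
import os
import tempfile


def parse_line(line):
    """Turn '2,7|2,7,8' into [frozenset({2, 7}), frozenset({2, 7, 8})] (assumed format)."""
    return [frozenset(int(x) for x in part.split(','))
            for part in line.strip().split('|')]


def lists_from_file(filename):
    """Yield one list of sets per non-empty line, without loading the whole file."""
    with open(filename) as f:
        for line in f:
            if line.strip():
                yield parse_line(line)


def all_intersections(lists):
    """Fold the lists one at a time, keeping only the distinct intersections so far."""
    lists = iter(lists)
    current = set(next(lists))
    for list_ in lists:
        current = {old & new for old in current for new in list_}
    return current


# Write the sample data to a temporary file, then stream it back line by line.
with tempfile.NamedTemporaryFile('w', suffix='.dat', delete=False) as tmp:
    tmp.write('2,7|2,7,8|2,3,6,7|1,2,4,5,7\n')
    tmp.write('3,6|1,3,4,6,7|2,3,5,6,8\n')
    tmp.write('2,5,7,8|1,2,3,5,7,8|2,4,5,6,7,8\n')
    path = tmp.name

result = all_intersections(lists_from_file(path))
os.remove(path)
print(len(result))  # 13 distinct intersections for the sample data
```

Only the current line and the set of distinct intersections found so far are held in memory at any time.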

Norrius
  • Thank you very much. In fact, I need to read a file in which each line represents a list of sets. E.g., line 1: 2,7|2,7,8|2,3,6,7|1,2,4,5,7; line 2: 3,6|1,3,4,6,7|2,3,5,6,8; line 3: 2,5,7,8|1,2,3,5,7,8|2,4,5,6,7,8. Because I can only read the file line by line, I am going to read each line, split it, and convert it to a list of sets. Then I will calculate the intersections sequentially. This way, a list_generator containing all the lists of sets is unavailable. Based on your method, can you suggest a way to read the file line by line and then find the intersections between the lines? – cdt Mar 12 '18 at 13:53
  • @cdt This is why I made `list_generator` an iterator. You can read lines from the file and yield them one by one, using the fact that the file object you get from `open()` already is an iterator over the file's lines. – Norrius Mar 12 '18 at 14:18
  • @cdt I added an example, but I didn't test it extensively, you might need to tweak the parser. – Norrius Mar 12 '18 at 14:31
  • Thank you very much. This is exactly what I am looking for. – cdt Mar 12 '18 at 21:54
  • Norrius, I have an extra question. Suppose I split my input lists into multiple parts, find the intersections for each part, and then aggregate the results, e.g., L12=intersect(L1, L2), L345=intersect(L3,L4,L5), and L6n=intersect(L6,L7,...Ln). Does intersect(L12,L345,L6n) then give the same result as your code above? – cdt May 29 '18 at 04:53
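On the follow-up question about grouping: set intersection is associative and commutative, so intersecting the partial results should yield the same set as processing the lists one after another. A small check on the question's sample data, where `intersect` is a hypothetical helper written for this sketch:

```python
from itertools import product


def intersect(*lists):
    """Return the distinct intersections formed by taking one set from each list."""
    return {frozenset(set.intersection(*map(set, combo)))
            for combo in product(*lists)}


L1 = [{2, 7}, {2, 7, 8}, {2, 3, 6, 7}, {1, 2, 4, 5, 7}]
L2 = [{3, 6}, {1, 3, 4, 6, 7}, {2, 3, 5, 6, 8}]
L3 = [{2, 5, 7, 8}, {1, 2, 3, 5, 7, 8}, {2, 4, 5, 6, 7, 8}]

all_at_once = intersect(L1, L2, L3)
left_first = intersect(intersect(L1, L2), L3)   # group (L1, L2) first
right_first = intersect(L1, intersect(L2, L3))  # group (L2, L3) first

print(all_at_once == left_first == right_first)  # True
```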
1

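You can use functools.reduce(set.intersection, sets) to handle a variable number of inputs, and itertools.product(*nested_list_of_sets) to generate combinations with one element from each of several sequences.

By using generator functions (yield) and lazy iterators such as itertools.product, you avoid holding all combinations in memory at once: each combination is produced only when it is requested. The input lists themselves must still fit in memory, because itertools.product converts its arguments to tuples.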

import itertools
import functools

nested_list_of_sets = [
    [{2,7}, {2,7,8}, {2,3,6,7}, {1,2,4,5,7}], 
    [{3,6}, {1,3,4,6,7}, {2,3,5,6,8}],
    [{2,5,7,8}, {1,2,3,5,7,8}, {2,4,5,6,7,8}],
]

def find_intersections(sets):
    """Take a nested sequence of sets and generate intersections"""
    for combo in itertools.product(*sets):
        yield (combo, functools.reduce(set.intersection, combo))

for input_sets, output_set in find_intersections(nested_list_of_sets):
    print('{:<55}  ->   {}'.format(repr(input_sets), output_set))

Output is

({2, 7}, {3, 6}, {8, 2, 5, 7})                           ->   set()
({2, 7}, {3, 6}, {1, 2, 3, 5, 7, 8})                     ->   set()
({2, 7}, {3, 6}, {2, 4, 5, 6, 7, 8})                     ->   set()
({2, 7}, {1, 3, 4, 6, 7}, {8, 2, 5, 7})                  ->   {7}
({2, 7}, {1, 3, 4, 6, 7}, {1, 2, 3, 5, 7, 8})            ->   {7}
({2, 7}, {1, 3, 4, 6, 7}, {2, 4, 5, 6, 7, 8})            ->   {7}
({2, 7}, {2, 3, 5, 6, 8}, {8, 2, 5, 7})                  ->   {2}
({2, 7}, {2, 3, 5, 6, 8}, {1, 2, 3, 5, 7, 8})            ->   {2}
# ... etc

Online demo on repl.it
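If only the distinct intersections are needed, as in the question, the generator's results can be collected into a set of frozensets. A minimal sketch that repeats `find_intersections` from above so it runs on its own; `unique` is just an illustrative name:

```python
import functools
import itertools

nested_list_of_sets = [
    [{2, 7}, {2, 7, 8}, {2, 3, 6, 7}, {1, 2, 4, 5, 7}],
    [{3, 6}, {1, 3, 4, 6, 7}, {2, 3, 5, 6, 8}],
    [{2, 5, 7, 8}, {1, 2, 3, 5, 7, 8}, {2, 4, 5, 6, 7, 8}],
]


def find_intersections(sets):
    """Take a nested sequence of sets and generate intersections."""
    for combo in itertools.product(*sets):
        yield (combo, functools.reduce(set.intersection, combo))


# frozenset is hashable, so duplicate intersections collapse automatically.
unique = {frozenset(output_set)
          for _, output_set in find_intersections(nested_list_of_sets)}
print(len(unique))  # 13 for the sample data
```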

Håken Lid
  • Given the description that he cannot hold all the lists in memory, he'll also need an iterable, lazily loading type. In particular, it needs to be rewound for all the later lists. That also means serializing the data, and in turn likely that there's no point transforming the collections to actual hashed sets for any but the first list. – Yann Vernier Mar 10 '18 at 11:39
  • The question is a bit hard to interpret. The sample code only makes sense with all of L1, L2, L3 ... available to create a Cartesian product. The entire product is not in memory, but the input has to be available to create the product. – Håken Lid Mar 10 '18 at 11:45
  • According to the top answer to [this question](https://stackoverflow.com/questions/45586863/does-itertools-product-evaluate-its-arguments-lazily), `itertools.product` must convert its arguments to tuples. So it cannot accept lazy sequences that are too large to fit in memory. I think it might be possible to implement a Cartesian product generator that works with streams of inputs, but I believe you would need some way to rewind the sequences. – Håken Lid Mar 10 '18 at 12:10
  • Thank you for your help. In fact, I need to read a file in which each line represents a list of sets. For example, line 1: 2,7|2,7,8|2,3,6,7|1,2,4,5,7; line 2: 3,6|1,3,4,6,7|2,3,5,6,8; line 3: 2,5,7,8|1,2,3,5,7,8|2,4,5,6,7,8. I can only read the file line by line, so I am going to read each line, split it (by |), and convert it into a list of sets. Because the file is large and each line is long, I cannot read the whole file into a list. So defining nested_list_of_sets seems impossible to me. – cdt Mar 12 '18 at 13:37
  • I think it should be possible to modify my answer to get what you want. You can pass a number of lines into `find_intersections` and create a list of the output; that list can then be passed back to `find_intersections` along with some more lines from the feed. You might want to prune empty sets after each iteration if you are not interested in those. – Håken Lid Mar 12 '18 at 15:58
  • Thank you for your help. I will try your method, too. – cdt Mar 12 '18 at 21:56
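The incremental approach suggested in the comments above might look roughly like the following. It is a sketch under assumptions: `stream_intersections`, `chunk_size`, and `prune_empty` are hypothetical names, the input is any iterator of lists of sets, and this `find_intersections` yields only the intersection rather than the (combo, intersection) pair used in the answer.

```python
import functools
import itertools


def find_intersections(sets):
    """Generate the intersection of every combination, one set per sequence."""
    for combo in itertools.product(*sets):
        yield functools.reduce(frozenset.intersection, combo)


def stream_intersections(lists, chunk_size=2, prune_empty=True):
    """Consume `lists` a few at a time, carrying forward only distinct results."""
    lists = iter(lists)
    carried = None  # distinct intersections found so far
    while True:
        chunk = list(itertools.islice(lists, chunk_size))
        if not chunk:
            break
        # Use frozensets throughout so results can be stored in a set.
        chunk = [[frozenset(s) for s in list_] for list_ in chunk]
        if carried is not None:
            chunk.insert(0, carried)  # feed the previous output back in
        carried = set(find_intersections(chunk))
        if prune_empty:
            carried.discard(frozenset())
    return carried if carried is not None else set()


sample = [
    [{2, 7}, {2, 7, 8}, {2, 3, 6, 7}, {1, 2, 4, 5, 7}],
    [{3, 6}, {1, 3, 4, 6, 7}, {2, 3, 5, 6, 8}],
    [{2, 5, 7, 8}, {1, 2, 3, 5, 7, 8}, {2, 4, 5, 6, 7, 8}],
]
result = stream_intersections(iter(sample), chunk_size=2)
print(len(result))  # 12: the 13 distinct intersections minus the empty set
```

Pruning the empty set between chunks loses nothing else, because intersecting an empty set with anything is still empty.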
0

This may be what you are looking for:

res = {frozenset(frozenset(x) for x in (i, j, k)): i & j & k
       for i in L1 for j in L2 for k in L3}

Explanation

  • frozenset is required because set is not hashable. Dictionary keys must be hashable.
  • Cycle through every combination of one item from each of L1, L2, L3.
  • Calculate intersection via & operation, equivalent to set.intersection.
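For reference, a self-contained run of the comprehension on the question's sample data. The extra step that collects the distinct values is an addition for illustration (`distinct` is a made-up name), since the question asks for the set of intersections rather than a mapping:

```python
L1 = [{2, 7}, {2, 7, 8}, {2, 3, 6, 7}, {1, 2, 4, 5, 7}]
L2 = [{3, 6}, {1, 3, 4, 6, 7}, {2, 3, 5, 6, 8}]
L3 = [{2, 5, 7, 8}, {1, 2, 3, 5, 7, 8}, {2, 4, 5, 6, 7, 8}]

# Keys identify the combination of input sets, values hold its intersection.
res = {frozenset(frozenset(x) for x in (i, j, k)): i & j & k
       for i in L1 for j in L2 for k in L3}

# The distinct intersections are the distinct values of the dictionary.
distinct = {frozenset(v) for v in res.values()}
print(len(res), len(distinct))  # 36 13
```

Note that a frozenset key does not record which list each set came from; a tuple of frozensets would preserve that if it matters. This form also needs all three lists in memory and is fixed to exactly three lists.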
jpp