I have a list of bags.
If I am allowed N selections, how do I choose the N bags that maximize the number of unique items in my set?
For example:

choices = [ [A,A,Z], [B,A,E], [Z,Z,B,W], [Q], ...]

If N = 2, I would like to know that...

choices[1] and choices[2] (0-indexed)

maximize my set...

set = [B, A, E, Z, W]
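
To make the objective concrete, here is a minimal brute-force sketch on the four bags shown above. The elided "..." bags are left out and the letters are written as strings; both are assumptions for illustration, and `best_by_brute_force` is a hypothetical helper name. It tries every combination, so it is only usable on tiny inputs:

```python
from itertools import combinations

# Toy data: only the four bags spelled out in the example (assumption).
choices = [["A", "A", "Z"], ["B", "A", "E"], ["Z", "Z", "B", "W"], ["Q"]]

def best_by_brute_force(bags, n):
    """Try every combination of n bags; return (indices, covered items)."""
    best_idx, best_cover = (), set()
    for idx in combinations(range(len(bags)), n):
        # Union of the chosen bags; duplicates inside a bag collapse.
        cover = set().union(*(bags[i] for i in idx))
        if len(cover) > len(best_cover):
            best_idx, best_cover = idx, cover
    return best_idx, best_cover

idx, cover = best_by_brute_force(choices, 2)
print(idx, sorted(cover))  # (1, 2) ['A', 'B', 'E', 'W', 'Z']
```

The quantity being maximized is simply the size of the union of the chosen bags.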


I'm trying to force-fit this into a gradient descent format, but I'm having trouble creating a cost function for it. Is this a correct/reasonable approach?

What is the best way to solve this?


Notes:
Assume the list of choices is large enough that computing every possible combination of choices is not feasible.
Assume a locally optimal solution is acceptable.



Narrowing down the question...
Language: Python
Problem size: ~2 million choices; 1-100 items per bag.

@sascha - The set-covering tip was very helpful!

Instead of writing my own script from scratch, I took a shortcut and modified one from: https://cs.stackexchange.com/questions/16142/how-to-implement-greedy-set-cover-in-a-way-that-it-runs-in-linear-time

from collections import defaultdict
import random
import string
random.seed(123)  # For reproducibility

size = 100
bag_size = 5
F = []
for i in range(size):
    bag = [string.ascii_uppercase[random.randint(0, 25)] for j in range(bag_size)]
    F.append(set(bag))

print('First 5 sets... of 100')
for item in F[0:5]:
    print(item)

set_freq = {}
for bag in F:
    for item in bag:
        set_freq[item] = set_freq.get(item, 0) + 1
print('Unique items:', len(set_freq))

f_copy = [set(S) for S in F]  # Copy every set (not just the list), because the sets in F get modified in place

# First prepare a list of all sets where each element appears
D = defaultdict(list)
for y,S in enumerate(F):
    for a in S:
        D[a].append(y)

# Now group the set indices by current set size: L[k] holds the sets of size k
L = defaultdict(set)
for x,S in enumerate(F):
    L[len(S)].add(x)

E=[] # Keep track of selected sets
# Now loop over each set size
for sz in range(max(len(S) for S in F),0,-1):
    if sz in L:
        P = L[sz] # set of all sets with size = sz
        while len(P):
            x = P.pop()
            E.append(x)
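            for a in F[x]:
                # Element a is now covered: remove it from every other set that contains it
                for y in D[a]:
                    if y!=x:
                        S2 = F[y]
                        L[len(S2)].remove(y)  # y leaves its old size bucket...
                        S2.remove(a)
                        L[len(S2)].add(y)  # ...and joins the bucket one size smaller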
print('Results...\n')
print('Indices:', E)
captured = {}
for index in E:
    for item in f_copy[index]:
        captured[item] = captured.get(item, 0) + 1
print('Unique items captured:', len(captured))


Prints...

First 5 sets... of 100
{'B', 'Y', 'N', 'I', 'C'}
{'D', 'M', 'B', 'R', 'I'}
{'K', 'F', 'B', 'R'}
{'K', 'E', 'W', 'R'}
{'H', 'A', 'F', 'N', 'Y'}
Unique items: 26
Results...

Indices: [0, 10, 48, 32, 69, 7, 2, 5]
Unique items captured: 26


The part I'm missing is how to cap the number of picks: e.g. if I could only pick 3 bags, how do I maximize set coverage?
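
One possible direction, sketched under the assumption that a greedy heuristic is acceptable: picking at most N bags to cover as many items as possible is known as the maximum coverage problem, and repeatedly taking the bag that adds the most not-yet-covered items is guaranteed to reach at least (1 - 1/e), roughly 63%, of the best possible coverage. The function name `greedy_max_coverage` and the toy `choices` data below are made up for illustration and are not part of the script above:

```python
def greedy_max_coverage(bags, n):
    """Pick up to n bags, each time taking the one that adds the most new items."""
    sets = [set(bag) for bag in bags]  # copies, so the input is left untouched
    covered, picked = set(), []
    if not sets:
        return picked, covered
    for _ in range(n):
        # Marginal gain of a bag = number of its items not covered yet.
        best = max(range(len(sets)), key=lambda i: len(sets[i] - covered))
        gain = sets[best] - covered
        if not gain:
            break  # no bag adds anything new, stop early
        picked.append(best)
        covered |= gain
    return picked, covered

# Toy data (assumption): the four bags from the example at the top.
choices = [["A", "A", "Z"], ["B", "A", "E"], ["Z", "Z", "B", "W"], ["Q"]]
picked, covered = greedy_max_coverage(choices, 2)
print(picked, sorted(covered))  # [1, 2] ['A', 'B', 'E', 'W', 'Z']
```

Each round rescans every bag, so this naive version costs roughly N passes over the data; for ~2 million bags a faster bookkeeping scheme (such as the size buckets used in the script above) would probably be needed.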

  • It's easy to implement a mixed-integer programming approach (alternative: SAT solvers). In this case there are also nice ADMM-based heuristics (if non-global solutions are ok). But before tackling this problem, you should give more information about the problem size, statistics, programming-language restrictions, external-library restrictions and so on... Gradient descent does not feel natural, as a naive boolean/integer-based formulation is not necessarily differentiable, which could be a problem. Did you have a look at the similar problems of **covering** (or **set-covering**)? – sascha Jul 28 '16 at 12:17
  • How many unique individual elements are there in the bags, i.e. how many 'unique items' are there? – Ryan Jul 29 '16 at 05:06
