I have a list of bags (collections that may contain duplicate items).
If I am allowed N selections, how do I choose the N bags whose union contains the most distinct items?
e.g.
choices = [ [A,A,Z], [B,A,E], [Z,Z,B,W], [Q], ...]
If N = 2, I would like to find that...
choices[1] and choices[2]
maximize my set...
set = [B, A, E, Z, W]
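To make the objective concrete, here is a minimal brute-force sketch over only the four bags shown above (the elided "..." entries are left out). The helper name `coverage` is just for illustration. It scores a selection by the number of distinct items in the union of the chosen bags, which is only practical for a toy list like this one:

```python
from itertools import combinations

# Only the four bags written out in the example; the "..." entries are omitted.
choices = [["A", "A", "Z"], ["B", "A", "E"], ["Z", "Z", "B", "W"], ["Q"]]
N = 2

def coverage(indices):
    """Number of distinct items across the selected bags (hypothetical helper)."""
    return len(set().union(*(choices[i] for i in indices)))

# Try every combination of N bag indices and keep the one with the largest union.
best = max(combinations(range(len(choices)), N), key=coverage)
print(best, coverage(best))
```

For these four bags the best pair is indices 1 and 2, covering 5 distinct items, which matches the example.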
I'm trying to force-fit this into a gradient descent formulation, but I'm having trouble creating a cost function for it. Is this a correct/reasonable approach?
What is the best way to solve this?
Notes:
Assume the list of choices is large enough that computing every possible combination of choices is not possible.
Assume a local optimum solution is acceptable.
Narrowing down the question...
Language: Python
Problem size: ~2 million choices; each bag holds 1-100 items.
@sascha - The set-covering tip was very helpful!
Rather than writing my own script from scratch, I took a shortcut and modified one from: https://cs.stackexchange.com/questions/16142/how-to-implement-greedy-set-cover-in-a-way-that-it-runs-in-linear-time
from collections import defaultdict
import random
import string

random.seed(123)  # For reproducibility
size = 100
bag_size = 5

# Build 100 random bags of uppercase letters, stored as sets
F = []
for i in range(size):
    bag = [string.ascii_uppercase[random.randint(0, 25)] for j in range(bag_size)]
    F.append(set(bag))

print('First 5 sets... of 100')
for item in F[0:5]:
    print(item)

# Count how many bags each item appears in
set_freq = {}
for bag in F:
    for item in bag:
        set_freq[item] = set_freq.get(item, 0) + 1
print('Unique items:', len(set_freq))
f_copy = [set(S) for S in F]  # Copy each set: F.copy() is shallow and the sets in F get modified in place
# First prepare a list of all sets where each element appears
D = defaultdict(list)
for y, S in enumerate(F):
    for a in S:
        D[a].append(y)

L = defaultdict(set)
# Now place sets into an array that tells us which sets have each size
for x, S in enumerate(F):
    L[len(S)].add(x)

E = []  # Keep track of selected sets
# Now loop over each set size
for sz in range(max(len(S) for S in F), 0, -1):
    if sz in L:
        P = L[sz]  # set of all sets with size = sz
        while len(P):
            x = P.pop()
            E.append(x)
            for a in F[x]:
                for y in D[a]:
                    if y != x:
                        S2 = F[y]
                        L[len(S2)].remove(y)
                        S2.remove(a)
                        L[len(S2)].add(y)
print('Results...\n')
print('Indices:', E)

# Count the items captured by the selected sets
captured = {}
for index in E:
    for item in f_copy[index]:
        captured[item] = captured.get(item, 0) + 1
print('Unique items captured:', len(captured))
Prints...
First 5 sets... of 100
{'B', 'Y', 'N', 'I', 'C'}
{'D', 'M', 'B', 'R', 'I'}
{'K', 'F', 'B', 'R'}
{'K', 'E', 'W', 'R'}
{'H', 'A', 'F', 'N', 'Y'}
Unique items: 26
Results...
Indices: [0, 10, 48, 32, 69, 7, 2, 5]
Unique items captured: 26
The part I'm missing is how to cap the number of picks: e.g. if I could only pick 3 bags, how do I maximize set coverage?
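Capping the number of picks turns set cover into what is usually called the maximum coverage problem. One possible direction, sketched below under that assumption, is the standard greedy heuristic: stop after N picks, each time taking the bag that adds the most items not yet covered. The function name `greedy_max_coverage` is hypothetical, and this plain version rescans every bag on each pick, so it is a starting point rather than a tuned solution for ~2 million bags:

```python
def greedy_max_coverage(bags, n):
    """Pick up to n bag indices, each time taking the bag that adds the most uncovered items."""
    remaining = [set(bag) for bag in bags]  # copies, so the caller's bags are not modified
    covered = set()
    chosen = []
    if not remaining:
        return chosen, covered
    for _ in range(n):
        # Index of the bag contributing the most items not yet covered.
        best = max(range(len(remaining)), key=lambda i: len(remaining[i] - covered))
        gain = remaining[best] - covered
        if not gain:
            break  # no bag adds anything new, so stop early
        chosen.append(best)
        covered |= gain
    return chosen, covered


# Toy data from the example at the top (the "..." entries are omitted).
choices = [["A", "A", "Z"], ["B", "A", "E"], ["Z", "Z", "B", "W"], ["Q"]]
chosen, covered = greedy_max_coverage(choices, 2)
print(chosen, sorted(covered))
```

On the toy data this selects indices 1 and 2. The greedy choice is not guaranteed to be optimal in general, but it is known to cover at least a (1 - 1/e) fraction of what the best possible N bags would cover, which fits the note that a local optimum is acceptable.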