
I have a number of sets, each containing a variable number of unique numbers: each number is unique within its set and does not appear in any other set.

I'd like an algorithm, preferably implemented in Python (though any other language is fine), that finds one combination consisting of one number from each of these sets that sums to a specified number. If it helps, the same set can appear multiple times, and an element from a set can be reused.

Practical example: let's say I have the following sets:

A = {1, 3, 6, 7, 15}
B = {2, 8, 10}
C = {4, 5, 9, 11, 12}

I want to obtain a number combination with a method find_subset_combination(expected_sum, subset_list)

>>> find_subset_combination(41, [A, B, B, C, B])
[1, 8, 10, 12, 10]

A solution to this problem has already been proposed here; however, it is rather a brute-force approach. As the number of sets and their sizes will be much larger in my case, I'd like an algorithm that uses as few iterations as possible.

What approach would you suggest?

CodeTalker
  • Can the elements per set be re-used? – wim Jun 10 '20 at 19:23
  • @wim Yes it can. I will update the question to clarify – CodeTalker Jun 10 '20 at 19:27
  • The example suggests you only need to find one combination, not the total count of solutions nor an enumeration of them. Is that correct? – wim Jun 10 '20 at 19:30
  • This is NP-hard. You're not going to find an efficient, general solution. – user2357112 Jun 10 '20 at 19:33
  • @wim This is correct. As finding one combination can already be quite difficult, I am not asking for multiple solutions. I already have an idea on how to find other possible combinations from one. – CodeTalker Jun 10 '20 at 19:34
  • This sounds like a variation of the "counting change" or "knapsack" problems. The difference is that the set of possible choices is different at each iteration. – Code-Apprentice Jun 10 '20 at 19:55

1 Answer

First, let's solve this for just two sets. This is known as the 'two sum' problem. You have two sets a and b, and you want one element from each so that the pair adds up to l. For an element i of a and an element j of b, i + j = l means j = l - i. This matters because we can check whether l - i is in b in O(1) time, rather than looping through b to find it in O(b) time. This means we can solve the two sum problem in O(a) time.

Note: For brevity the provided code only produces one solution. However, changing two_sum to a generator function lets it return them all.

def two_sum(l, a, b):
    for i in a:
        if l - i in b:
            return i, l - i
    raise ValueError('No solution found')
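The note above mentions a generator variant. A minimal sketch of what that could look like follows; the name two_sum_all is hypothetical and not part of the original answer, and the sets are the ones from the question.

```python
def two_sum_all(l, a, b):
    """Yield every pair (i, j) with i in a, j in b and i + j == l.

    Hypothetical generator variant of two_sum; yields nothing
    (instead of raising) when no pair exists.
    """
    for i in a:
        # Set membership is O(1) on average, so the loop is O(len(a)).
        if l - i in b:
            yield i, l - i


A = {1, 3, 6, 7, 15}
B = {2, 8, 10}
print(sorted(two_sum_all(9, A, B)))  # [(1, 8), (7, 2)]
```

Sorting is only there to make the printed order deterministic, since set iteration order is not something to rely on.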

Next we can solve the 'four sum' problem. This time we have four sets c, d, e and f. By combining c and d into a, and e and f into b, we can use two_sum to solve the problem in O(cd + ef) space and time. To combine the sets we can just use a Cartesian product, adding the elements of each resulting tuple together.

Note: To get all results, perform a Cartesian product on all resulting a[i] and b[j].

import itertools


def combine(*sets):
    # Map each achievable sum to the list of element tuples that produce it.
    results = {}
    for keys in itertools.product(*sets):
        results.setdefault(sum(keys), []).append(keys)
    return results


def four_sum(l, c, d, e, f):
    a = combine(c, d)
    b = combine(e, f)
    i, j = two_sum(l, a, b)
    return (*a[i][0], *b[j][0])
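As a sketch of the "all results" note, the version below pairs every tuple stored under a[i] with every tuple stored under b[j]. The name four_sum_all is an assumption made for illustration, and combine is repeated (with results spelled consistently) so the block runs on its own.

```python
import itertools


def combine(*sets):
    # Map each achievable sum to the list of element tuples that produce it.
    results = {}
    for keys in itertools.product(*sets):
        results.setdefault(sum(keys), []).append(keys)
    return results


def four_sum_all(l, c, d, e, f):
    """Yield every (c, d, e, f) choice that sums to l (hypothetical helper)."""
    a = combine(c, d)
    b = combine(e, f)
    for i in a:
        j = l - i
        if j in b:
            # Every left tuple summing to i pairs with every right tuple summing to j.
            for left, right in itertools.product(a[i], b[j]):
                yield (*left, *right)


print(sorted(four_sum_all(3, {1, 2}, {1, 2}, {0}, {0})))
# [(1, 2, 0, 0), (2, 1, 0, 0)]
```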

It should be apparent that the 'three sum' problem is just a simplified version of the 'four sum' problem. The difference is that we're given a at the start rather than being asked to calculate it. This runs in O(a + ef) time and O(ef) space.

def three_sum(l, a, e, f):
    b = combine(e, f)
    i, j = two_sum(l, a, b)
    return (i, *b[j][0])

Now we have enough information to solve the 'six sum' problem. The question comes down to how we divide these sets.

  • If we decide to pair them together then we can use the 'three sum' solution to get what we want. But this may not run in the best time, as it runs in O(ab + cdef), or O(n^4) time if they're all the same size.
  • If we decide to put them in trios then we can use the 'two sum' to get what we want. This runs in O(abc + def), or O(n^3) if they're all the same size.

At this point we should have all the information to make a generic version that runs in O(n^⌈s/2⌉) time and space, where s is the number of sets passed to the function and n is the size of each set.

def n_sum(l, *sets):
    midpoint = len(sets) // 2
    a = combine(*sets[:midpoint])
    b = combine(*sets[midpoint:])
    i, j = two_sum(l, a, b)
    return (*a[i][0], *b[j][0])
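To tie this back to the question, here is a self-contained sketch that wraps n_sum in the find_subset_combination(expected_sum, subset_list) signature the question asks for. The wrapper itself is an assumption added for illustration; the helpers are the ones from this answer. Which valid combination comes back depends on iteration order, so no particular output is claimed.

```python
import itertools


def two_sum(l, a, b):
    for i in a:
        if l - i in b:
            return i, l - i
    raise ValueError('No solution found')


def combine(*sets):
    # Map each achievable sum to the list of element tuples that produce it.
    results = {}
    for keys in itertools.product(*sets):
        results.setdefault(sum(keys), []).append(keys)
    return results


def n_sum(l, *sets):
    midpoint = len(sets) // 2
    a = combine(*sets[:midpoint])
    b = combine(*sets[midpoint:])
    i, j = two_sum(l, a, b)
    return (*a[i][0], *b[j][0])


def find_subset_combination(expected_sum, subset_list):
    # Hypothetical wrapper matching the signature used in the question.
    return list(n_sum(expected_sum, *subset_list))


A = {1, 3, 6, 7, 15}
B = {2, 8, 10}
C = {4, 5, 9, 11, 12}
result = find_subset_combination(41, [A, B, B, C, B])
print(result, sum(result))  # some valid combination, and 41
```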

You can further optimize the code. The sizes of the two sides of the two sum matter quite a lot.

  • To illustrate this, imagine 4 sets of 1 number on one side and 4 sets of 1000 numbers on the other. This runs in O(1^4 + 1000^4) time, which is really bad. Instead you can balance both sides of the two sum to make the total much smaller. With 2 sets of 1 number and 2 sets of 1000 numbers on each side, the cost becomes O(1^2×1000^2 + 1^2×1000^2), or simply O(1000^2), which is far smaller than O(1000^4).

  • Expanding on the previous point if you have 3 sets of 1000 numbers and 3 sets of 10 numbers then the best solution is to put two 1000s on one side and everything else on the other side:

    • 1000^2 + 10^3×1000 = 2_000_000
    • Interlaced sorted and same size either side (10, 1000, 10), (1000, 10, 1000)
      10^2×1000 + 10×1000^2 = 10_100_000
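The arithmetic above can be checked with a small brute-force search over every way of splitting the sets into two sides. This is only a sketch under assumptions: best_split is a hypothetical helper, it expects at least two sizes, it tries all 2^s splits (cheap next to the combine calls for moderate s), and math.prod needs Python 3.8 or later.

```python
import itertools
import math


def best_split(sizes):
    """Return (cost, left_indices) minimising prod(left) + prod(right).

    Hypothetical helper: brute-forces every non-trivial split of the
    set sizes into two sides. Requires len(sizes) >= 2.
    """
    indices = range(len(sizes))
    best = None
    for r in range(1, len(sizes)):
        for left in itertools.combinations(indices, r):
            chosen = set(left)
            cost = (math.prod(sizes[i] for i in chosen)
                    + math.prod(sizes[i] for i in indices if i not in chosen))
            if best is None or cost < best[0]:
                best = (cost, left)
    return best


cost, left = best_split([1000, 1000, 1000, 10, 10, 10])
print(cost, left)  # 2000000 (0, 1)
```

The returned indices say which sets go on the left side; the remaining sets go on the right. If the split reorders the sets, the resulting tuple has to be mapped back to the original set order.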

Additionally, if each set is provided an even number of times, you can cut the running time in half by only calling combine once. For example, if the input is n_sum(l, a, b, c, a, b, c) (without the above optimizations), it should be apparent that the second call to combine is only a waste of time and space.
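A minimal sketch of that idea, assuming the caller passes only one half and the other half is the same sets in the same order. The name n_sum_mirrored is hypothetical; it looks up both sides of the two sum in the same dictionary.

```python
import itertools


def combine(*sets):
    # Map each achievable sum to the list of element tuples that produce it.
    results = {}
    for keys in itertools.product(*sets):
        results.setdefault(sum(keys), []).append(keys)
    return results


def n_sum_mirrored(l, *half):
    """Solve n_sum(l, *half, *half) with a single combine call (hypothetical)."""
    a = combine(*half)
    for i in a:
        # Both sides share the same table, so look up the complement in a itself.
        if l - i in a:
            return (*a[i][0], *a[l - i][0])
    raise ValueError('No solution found')


A = {1, 3, 6, 7, 15}
B = {2, 8, 10}
C = {4, 5, 9, 11, 12}
result = n_sum_mirrored(30, A, B, C)
print(result, sum(result))  # some valid six-element combination, and 30
```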

Peilonrayz
  • Very interesting idea, even if I'll need more time to fully understand and assimilate all these explanations. One sad note: when tested with a fairly large number of sets (20, each holding between 5 and 80 elements), the algorithm turned out to be extremely memory intensive. – CodeTalker Jun 12 '20 at 16:13