How to find N numbers whose sum is closest to K, but over multiple columns?

Question

I'm trying to solve an optimization problem: a variant of the subset sum problem in which the sum of each column must be as close as possible to a target value that is specific to that column. Another constraint is that the solution must be a sum of only 45 rows of the table.

I've already tried brute force, but it simply exhausts system resources. From what I understood reading about the problem, this is a special case of the knapsack problem, called the subset sum problem, but I want to do this over multiple columns.

To better illustrate the problem

Label | Weight | Parameter 1 | Parameter 2 | Parameter 3
Item1 |   12   |     13      |    91       |      24
Item2 |   76   |     12      |    10       |      14
Item3 |   43   |     11      |    34       |      35
Item4 |   23   |     16      |    11       |      10
Item5 |   23   |     40      |    14       |      12
Item6 |   83   |     70      |    11       |      40
Item7 |   22   |     11      |    41       |      20

I want to find only 3 rows for which:

the sum of Parameter 1 is closest to 30,
the sum of Parameter 2 is closest to 60,
the sum of Parameter 3 is closest to 70.

Please note this is an example table with example values

This is a kind of homework question, and I've already spent lots of hours trying to solve it. I know it's an optimization problem, mostly an edge case of the knapsack problem and I should use dynamic programming to solve it, but I cannot figure out how to do that for multiple constraints instead of one. I already looked into multidimensional knapsack, but couldn't figure out how to do it.

A Jupyter notebook explaining how to do it would be a great help

I would consider switching one of the tags on the question to "algorithm" to generate additional interest. — גלעד ברקן, Jul 28 '19 at 23:38
To expand on the previous comment: you need to define what is best when three rows have the sum of Param1 closest to 30, but the sum of Param2 not quite closest to 60, and another three rows where it is the opposite. Which will be considered best? You'd need an *expression* that needs to be minimised. For example `Min(abs(sum(Param1)-30) + abs(sum(Param2)-60) + abs(sum(Param3)-70))`, or maybe `Min(sqr(sum(Param1)-30) + sqr(sum(Param2)-60) + sqr(sum(Param3)-70))`, or still something else. — trincot, Jul 30 '19 at 14:10

score 1 · Accepted Answer · answered Jul 31 '19 at 20:36

You are talking of the Knapsack problem, but there are a few peculiarities:

you don't want to find an exact sum but the closest result to a value;
the problem is multidimensional;
the numbers aren't guaranteed to be positive;
you didn't provide a distance (that is, a definition of "closest").

I think your best bet is to enumerate the subsets of size K and to choose the closest sum. That is brute force, but dynamic programming may help to output the subsets and compute the sums.
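As a baseline, the plain enumeration can be sketched with `itertools.combinations`. This is only an illustration on the example table from the question: the names `ROWS`, `TARGET`, `squared_distance` and `brute_force` are made up here, and the squared Euclidean distance is an assumption, since the question does not define what "closest" means.

```python
from itertools import combinations

# Example rows: (Parameter 1, Parameter 2, Parameter 3) for Item1..Item7.
ROWS = [
    (13, 91, 24), (12, 10, 14), (11, 34, 35), (16, 11, 10),
    (40, 14, 12), (70, 11, 40), (11, 41, 20),
]
TARGET = (30, 60, 70)


def squared_distance(sums, target):
    """Squared Euclidean distance between the column sums and the target."""
    return sum((s - t) ** 2 for s, t in zip(sums, target))


def brute_force(rows, k, target):
    """Try every k-row subset and return (best distance, row indices)."""
    best = None
    for idx in combinations(range(len(rows)), k):
        # Column-wise sums of the selected rows.
        sums = tuple(sum(col) for col in zip(*(rows[i] for i in idx)))
        cur = squared_distance(sums, target)
        if best is None or cur < best[0]:
            best = (cur, idx)
    return best


print(brute_force(ROWS, 3, TARGET))  # (227, (1, 2, 3))
```

This is fine for 7 rows (35 subsets), but the number of subsets grows combinatorially with the table size, which is presumably why the brute-force attempt with 45 rows ran out of resources.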

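As pointed out in the comments, you first have to define what "closest" means. That is, define a distance. For instance, the Euclidean distance is pretty common. The function below returns its square, which ranks candidates in the same order and avoids the square root; its arguments are the differences between the column sums and the target: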

def d(p1, p2, p3):
    return p1*p1 + p2*p2 + p3*p3
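Other choices of distance are possible, for example the sum of absolute differences suggested in the comments. The sketch below puts a few candidates side by side; the function names are illustrative, and the sample differences come from a hypothetical triple of column sums (39, 55, 59) measured against the target (30, 60, 70).

```python
def squared_euclidean(p1, p2, p3):
    """Sum of squared differences (same as d above)."""
    return p1 * p1 + p2 * p2 + p3 * p3


def manhattan(p1, p2, p3):
    """Sum of absolute differences."""
    return abs(p1) + abs(p2) + abs(p3)


def chebyshev(p1, p2, p3):
    """Largest absolute difference over the three columns."""
    return max(abs(p1), abs(p2), abs(p3))


# Differences between hypothetical column sums and the target.
diff = (39 - 30, 55 - 60, 59 - 70)

print(squared_euclidean(*diff))  # 227
print(manhattan(*diff))          # 25
print(chebyshev(*diff))          # 11
```

Different distances can rank the same candidates differently, so the choice is part of the problem statement rather than an implementation detail.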

Let's extract the data from the file, more precisely, the three last values (parameters 1, 2, 3) and the index of the row:

DATA = """Label | Weight | Parameter 1 | Parameter 2 | Parameter 3
Item1 |   12   |     13      |    91       |      24
Item2 |   76   |     12      |    10       |      14
Item3 |   43   |     11      |    34       |      35
Item4 |   23   |     16      |    11       |      10
Item5 |   23   |     40      |    14       |      12
Item6 |   83   |     70      |    11       |      40
Item7 |   22   |     11      |    41       |      20"""

import io
import csv

f = io.StringIO(DATA)
reader = csv.reader(f, delimiter='|')
next(reader) # skip the header

L = [tuple([int(v) for v in row[-3:]] + [i]) for i, row in enumerate(reader)]
# [(13, 91, 24, 0), (12, 10, 14, 1), (11, 34, 35, 2), (16, 11, 10, 3), (40, 14, 12, 4), (70, 11, 40, 5), (11, 41, 20, 6)]

Now, set the number of rows K and the target T (a triplet)

N = len(L)
K = 3
T = (30, 60, 70)

It's dynamic programming, hence we need to store intermediate results. list_by_triplet_by_k is a list of nested dicts:

the index in the list is the number of rows used (we are interested in K but need to compute other values);
the key of the outer dict is the sum of "Parameter 1";
the key of the first nested dict is the sum of "Parameter 2";
the key of the second nested dict is the sum of "Parameter 3";
the value is the list of used rows.

(I didn't use a 4 dimensional array, because it would have been very sparse.)

A little trick: I initialize list_by_triplet_by_k with the negated target. If we have 0 rows, we are at -T, so every stored sum is already the difference to the target and can be passed to d directly.

# one dict per number of rows used: 0, 1, ..., K
list_by_triplet_by_k = [{} for _ in range(K + 1)]
list_by_triplet_by_k[0] = {-T[0]: {-T[1]: {-T[2]: [(-T[0], -T[1], -T[2], "target")]}}}

Let's build the subsets. Basically, we build a forest of K+1 trees with dynamic programming:

best = None
ret = []
for a, b, c, i in L:
    for k in range(0, K):
        list_by_triplet = list_by_triplet_by_k[k]
        for u in list_by_triplet.keys():
            for v in list_by_triplet[u].keys():
                for w in list_by_triplet[u][v]:
                    if (a, b, c, i) not in list_by_triplet[u][v][w]: # 0/1
                        list_by_triplet_by_k[k+1].setdefault(a+u, {}).setdefault(b+v, {})[c+w] = list_by_triplet[u][v][w] + [(a, b, c, i)]

    # compute the best match on the fly at the end (not a very useful optimization, but why not?):
    list_by_triplet = list_by_triplet_by_k[K-1]
    for u in list_by_triplet.keys():
        for v in list_by_triplet[u].keys():
            for w in list_by_triplet[u][v]:
                if (a, b, c, i) not in list_by_triplet[u][v][w]: # 0/1
                    cur = d(u+a, v+b, w+c)
                    if best is None or cur < best:
                        best = cur
                        ret = list_by_triplet[u][v][w] + [(a, b, c, i)]

There is maybe a trick to avoid duplicates by design, I don't know: I just tested if the element wasn't already in the list.
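For what it's worth, a common way to avoid duplicates by design in 0/1 knapsack-style dynamic programming is to walk the "number of rows used" dimension in descending order, so that the layer being read never contains the current row yet. The sketch below is a variant under that assumption, not the code above: it uses a flat dict keyed by the triple of sums instead of nested dicts, stores row indices only, and the names `closest_subset`, `layers` and `ROWS` are illustrative. It assumes the squared Euclidean distance and `k <= len(rows)`.

```python
# Example rows: (Parameter 1, Parameter 2, Parameter 3) for Item1..Item7.
ROWS = [
    (13, 91, 24), (12, 10, 14), (11, 34, 35), (16, 11, 10),
    (40, 14, 12), (70, 11, 40), (11, 41, 20),
]


def closest_subset(rows, k, target):
    """Return (squared distance, row indices) for the best k-row subset.

    layers[j] maps each reachable triple of column sums to one tuple of
    j row indices that produces it.
    """
    layers = [dict() for _ in range(k + 1)]
    layers[0][(0, 0, 0)] = ()
    for i, (a, b, c) in enumerate(rows):
        # Descending j: layers[j] has not been extended with row i when
        # it is read, so no row can be used twice.
        for j in range(k - 1, -1, -1):
            for (u, v, w), used in layers[j].items():
                layers[j + 1].setdefault((u + a, v + b, w + c), used + (i,))

    def dist(sums):
        return sum((s - t) ** 2 for s, t in zip(sums, target))

    best = min(layers[k], key=dist)
    return dist(best), layers[k][best]


print(closest_subset(ROWS, 3, (30, 60, 70)))  # (227, (1, 2, 3))
```

Because the membership test disappears, this ordering also sidesteps a corner case of the test-based approach: when two different subsets reach the same sums, only one of them is stored, and a valid extension of the other one could be skipped.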

Result:

print(best, ret)
# 227 [(-30, -60, -70, 'target'), (12, 10, 14, 1), (11, 34, 35, 2), (16, 11, 10, 3)]

Remarks:

See https://cs.stackexchange.com/a/43662 for more information, but I don't think that approach will work with an arbitrary distance.
It could be possible to prune the trees of possibilities with some extra assumptions.
