You are describing a variant of the knapsack problem, with a few peculiarities:
- you don't want an exact sum but the result closest to a value;
- the problem is multidimensional;
- the numbers aren't guaranteed to be positive;
- you didn't provide a distance.
I think your best bet is to enumerate the subsets of size K and choose the one whose sum is closest to the target. That is brute force, but dynamic programming can help to build the subsets and compute their sums incrementally.
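As a baseline, the plain enumeration can be sketched with `itertools.combinations`. This is only an illustration: the names `rows`, `target`, `k`, `squared_distance` and `closest_subset` are hypothetical, the sample rows are the three parameter columns of the table used in this answer, and squared Euclidean distance is assumed as the measure of closeness.

```python
from itertools import combinations

# Sample rows: (parameter 1, parameter 2, parameter 3), copied from the table.
rows = [
    (13, 91, 24), (12, 10, 14), (11, 34, 35), (16, 11, 10),
    (40, 14, 12), (70, 11, 40), (11, 41, 20),
]
target = (30, 60, 70)
k = 3

def squared_distance(point, other):
    # Assumed metric: squared Euclidean distance between two triplets.
    return sum((p - q) ** 2 for p, q in zip(point, other))

def closest_subset(rows, k, target):
    """Return (distance, indices) of the size-k subset whose column sums are closest to target."""
    best = None
    for indices in combinations(range(len(rows)), k):
        # Column-wise sums of the selected rows.
        sums = tuple(sum(rows[i][j] for i in indices) for j in range(len(target)))
        dist = squared_distance(sums, target)
        if best is None or dist < best[0]:
            best = (dist, indices)
    return best

print(closest_subset(rows, k, target))
# (227, (1, 2, 3))
```

With 7 rows and K = 3 this checks only 35 subsets, so it is also a convenient way to cross-check a faster approach on small inputs.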
As pointed out in the comments, you first have to define what "closest" means, that is, define a distance. For instance, the Euclidean distance is pretty common (the function below returns its square, which ranks candidates in the same order):
def d(p1, p2, p3):
    return p1*p1 + p2*p2 + p3*p3
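To make the role of `d` concrete, here is a small sketch that applies it to the difference between a candidate's column sums and a target. The `manhattan` function is a hypothetical alternative, included only to show that another distance could be plugged in; the values of `sums` and `target` are illustrative.

```python
import math

def d(p1, p2, p3):
    # Squared Euclidean length of the difference vector (p1, p2, p3).
    return p1*p1 + p2*p2 + p3*p3

def manhattan(p1, p2, p3):
    # Hypothetical alternative: sum of absolute differences.
    return abs(p1) + abs(p2) + abs(p3)

sums = (39, 55, 59)     # column sums of some candidate subset (illustrative)
target = (30, 60, 70)
diff = tuple(s - t for s, t in zip(sums, target))   # (9, -5, -11)

print(d(*diff))              # 227
print(math.sqrt(d(*diff)))   # about 15.07, the actual Euclidean distance
print(manhattan(*diff))      # 25
```

Skipping the square root does not change which candidate is closest, because the square root is monotonic on non-negative numbers.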
Let's extract the data from the file, more precisely the last three values (parameters 1, 2, 3) and the index of the row:
DATA = """Label | Weight | Parameter 1 | Parameter 2 | Parameter 3
Item1 | 12 | 13 | 91 | 24
Item2 | 76 | 12 | 10 | 14
Item3 | 43 | 11 | 34 | 35
Item4 | 23 | 16 | 11 | 10
Item5 | 23 | 40 | 14 | 12
Item6 | 83 | 70 | 11 | 40
Item7 | 22 | 11 | 41 | 20"""
import io
import csv
f = io.StringIO(DATA)
reader = csv.reader(f, delimiter='|')
next(reader) # skip the header
L = [tuple([int(v) for v in row[-3:]] + [i]) for i, row in enumerate(reader)]
# [(13, 91, 24, 0), (12, 10, 14, 1), (11, 34, 35, 2), (16, 11, 10, 3), (40, 14, 12, 4), (70, 11, 40, 5), (11, 41, 20, 6)]
Now, set the number of rows K and the target T (a triplet):
N = len(L)
K = 3
T = (30, 60, 70)
It's dynamic programming, hence we need to store intermediate results. list_by_triplet_by_k is a list of nested dicts:
- the index of a dict in the list is the number of rows used (we are interested in K but need to compute the other values);
- the key of the outer dict is the sum of "Parameter 1";
- the key of the first nested dict is the sum of "Parameter 2";
- the key of the second nested dict is the sum of "Parameter 3";
- the value is the list of used rows.
(I didn't use a 4-dimensional array, because it would have been very sparse.)
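A miniature version of one such nested dict may help to picture the layout. The helper `store` and the entries below are hypothetical and only illustrate how chained `setdefault` calls create the intermediate dicts on demand.

```python
# One level of the list, i.e. a single nested dict, filled by hand.
level = {}

def store(level, sums, used_rows):
    p1, p2, p3 = sums
    # setdefault returns the existing inner dict, or inserts an empty one first.
    level.setdefault(p1, {}).setdefault(p2, {})[p3] = used_rows

store(level, (12, 10, 14), [1])
store(level, (12, 10, 20), [5])   # made-up entry sharing the first two keys
store(level, (11, 34, 35), [2])

print(level)
# {12: {10: {14: [1], 20: [5]}}, 11: {34: {35: [2]}}}
```

Only the sums that are actually reached take up space, which is the advantage over a dense 4-dimensional array.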
A little trick: I initialize list_by_triplet_by_k with the target. With 0 rows, we are at -T, so every stored sum is already the offset from the target and can be passed to d directly.
# one dict per subset size 0..K (range(N) would be too short when K == N)
list_by_triplet_by_k = [{} for _ in range(K + 1)]
list_by_triplet_by_k[0] = {-T[0]: {-T[1]: {-T[2]: [(-T[0], -T[1], -T[2], "target")]}}}
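The effect of starting at -T can be checked on a few rows. This is a minimal sketch with illustrative rows; `running`, `plain_sums` and `offset` are names introduced here, not part of the code above.

```python
T = (30, 60, 70)
chosen = [(12, 10, 14), (11, 34, 35), (16, 11, 10)]   # illustrative rows

# Start the running total at -T instead of (0, 0, 0).
running = tuple(-t for t in T)
for row in chosen:
    running = tuple(r + x for r, x in zip(running, row))

# Same quantity computed the long way: plain column sums, then subtract T.
plain_sums = tuple(sum(col) for col in zip(*chosen))
offset = tuple(s - t for s, t in zip(plain_sums, T))

print(running)   # (9, -5, -11)
print(offset)    # (9, -5, -11)
```

Because the keys already hold the offset from the target, the distance to the target is just the distance of the key to the origin.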
Let's build the subsets. Basically, we build a forest of K+1 trees with dynamic programming:
best = None
ret = []
for a, b, c, i in L:
    for k in range(0, K):
        list_by_triplet = list_by_triplet_by_k[k]
        for u in list_by_triplet.keys():
            for v in list_by_triplet[u].keys():
                for w in list_by_triplet[u][v]:
                    if (a, b, c, i) not in list_by_triplet[u][v][w]: # 0/1
                        list_by_triplet_by_k[k+1].setdefault(a+u, {}).setdefault(b+v, {})[c+w] = list_by_triplet[u][v][w] + [(a, b, c, i)]

    # compute the best match on the fly at the end (not a very useful optimization, but why not?):
    list_by_triplet = list_by_triplet_by_k[K-1]
    for u in list_by_triplet.keys():
        for v in list_by_triplet[u].keys():
            for w in list_by_triplet[u][v]:
                if (a, b, c, i) not in list_by_triplet[u][v][w]: # 0/1
                    cur = d(u+a, v+b, w+c)
                    if best is None or cur < best:
                        best = cur
                        ret = list_by_triplet[u][v][w] + [(a, b, c, i)]
There may be a trick to avoid duplicates by design, I don't know: I just test whether the element is already in the list.
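One possible way to avoid duplicates by design, offered as a sketch and not as the method used above, is the usual 0/1 knapsack ordering: for each row, visit the subset sizes in descending order, so that a level is read before the current row has been added to it. The function `closest_by_dp` and the names `rows` and `levels` are hypothetical. It uses one flat dict per level, keyed by the offset triple, instead of nested dicts, and assumes k is at most the number of rows.

```python
def closest_by_dp(rows, k, target):
    """Size-k subset whose column sums are closest to target (squared Euclidean)."""
    # levels[j] maps an offset triple (sums minus target) to one tuple of j row indices.
    levels = [dict() for _ in range(k + 1)]
    levels[0][tuple(-t for t in target)] = ()
    for i, (a, b, c) in enumerate(rows):
        # Descending j: levels[j] does not contain row i yet when it is read,
        # so no membership test is needed.
        for j in range(k - 1, -1, -1):
            for (u, v, w), used in levels[j].items():
                levels[j + 1].setdefault((u + a, v + b, w + c), used + (i,))
    # The key is already the offset from the target, so its squared norm is the distance.
    return min((u*u + v*v + w*w, used) for (u, v, w), used in levels[k].items())

rows = [
    (13, 91, 24), (12, 10, 14), (11, 34, 35), (16, 11, 10),
    (40, 14, 12), (70, 11, 40), (11, 41, 20),
]
print(closest_by_dp(rows, 3, (30, 60, 70)))
# (227, (1, 2, 3))
```

Keeping a single subset per offset is enough here, because the distance depends only on the offset and not on which rows produced it.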
Result:
print(best, ret)
# 227 [(-30, -60, -70, 'target'), (12, 10, 14, 1), (11, 34, 35, 2), (16, 11, 10, 3)]
Remarks:
- See https://cs.stackexchange.com/a/43662 for background, but I don't think it will work with an arbitrary distance.
- It might be possible to prune the tree of possibilities given some extra assumptions.