
I have an array of 2 <= n <= 100 doubles:

A = [a_1, a_2, ..., a_n], a_i > 0

and an integer 2 <= k <= min(n, 20). I need to split A into k subarrays:

B1 = [a_1,     a_2,     ..., a_p]
B2 = [a_(p+1), a_(p+2), ..., a_q]
              ...
Bk = [a_(w+1), a_(w+2), ..., a_n]

such that the sums of the B's are almost equal (it is hard to give a strict definition of what this means; I'm interested in an approximate solution).

Example:

Input: A = [1, 2, 1, 2, 1], k=2
Output: [[1, 2, 1], [2, 1]] or [[1, 2], [1, 2, 1]]

I tried a probabilistic approach (sketched below):

  • sample from [1, 2, ..., n] using A as probability weights

  • cut the sample into quantiles to find a good partition,

but this was not stable enough for production.
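
In rough code, the idea was something like this (a minimal sketch reconstructed from the description above; the function name and the `samples` count are illustrative):

import random

def probabilistic_split(a, k, samples=10_000):
    # Draw indices with probability proportional to the values of A,
    # then cut at the empirical 1/k, 2/k, ... quantiles of the sample.
    draws = sorted(random.choices(range(len(a)), weights=a, k=samples))
    cuts = [draws[m * samples // k] for m in range(1, k)]
    bounds = [0] + cuts + [len(a)]
    # Cut points can repeat or land at the array ends, yielding empty
    # chunks - one source of the instability mentioned above.
    return [a[i:j] for i, j in zip(bounds, bounds[1:])]

print(probabilistic_split([1, 2, 1, 2, 1], 2))  # e.g. [[1, 2], [1, 2, 1]]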

tl;dr This question asks about 2-chunk division; I need a k-chunk division.

Pawel

1 Answer


Calculate the overall sum S of the array. Every chunk sum should then be near S / k.

Then walk through the array, accumulating a running sum R. When the overshoot (R + A[i+1]) - S/k would become larger than the undershoot S/k - R (equivalently, when 2*R + A[i+1] > 2*S/k), close the current chunk, set R = 0, and continue with the next chunk.

You can also compensate for accumulating error (if it occurs) by comparing the running sum over the first M chunks with the cumulative target M * S / k - this is what the code below does, which is why R is never reset there.

Quickly written code for the latter approach (not thoroughly checked):

def chunks(lst, k):
    s = sum(lst)
    sk = s / k  # ideal sum per chunk
    # Variant from user2052436 in the comments (helps with large outliers):
    # sk = max(s / k, max(lst))
    idx = 0
    chunkstart = 0
    r = 0  # running sum of all elements placed so far (never reset)
    res = []
    for m in range(1, k):
        for idx in range(chunkstart, len(lst)):
            km = k - m              # chunks still to be made after this one
            irest = len(lst) - idx  # elements still available
            # Close the chunk when there are only just enough elements left
            # for the remaining chunks, or when taking lst[idx] would
            # overshoot the cumulative target m * sk by more than leaving
            # it out undershoots; never close an empty chunk.
            if ((km >= irest) or (2 * r + lst[idx] > 2 * m * sk)) and (idx > chunkstart):
                res.append(lst[chunkstart:idx])
                chunkstart = idx
                break
            r += lst[idx]
    res.append(lst[chunkstart:])  # last chunk takes the remainder
    return res

print(chunks([3,1,5,2,8,3,2], 3))
print(chunks([1,1,1,100], 3))

>>> [[3, 1, 5], [2, 8], [3, 2]]
    [[1, 1], [1], [100]]
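
To judge how even a partition is, one can compare every chunk sum with the ideal value S / k; a small helper for testing (the helper is my addition, its name is arbitrary):

def max_deviation(parts):
    # Largest absolute difference between a chunk sum and the ideal S / k.
    sums = [sum(p) for p in parts]
    target = sum(sums) / len(parts)
    return max(abs(x - target) for x in sums)

print(max_deviation(chunks([3, 1, 5, 2, 8, 3, 2], 3)))  # 3.0 (ideal is 8.0)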
MBo
  • Thank you for a nice snippet! The problem with this approach is that sometimes it fails - for example `chunks([1, 1, 1, 100], 3)` will return 2 subarrays, not 3. – Pawel Aug 30 '18 at 12:16
  • Yes, we have to add a limit on chunk length (account for the `k-m` vs `len-idx` comparison) – MBo Aug 30 '18 at 12:22
  • 1
    Made this correction. Perhaps there are another hard cases. – MBo Aug 30 '18 at 13:05
  • This seems to work very well, great job! (I haven't created a test making this fail) – Pawel Aug 30 '18 at 14:38
  • Unfortunately such a hard case exists: `chunks([16, 8, 6, 4], 4)` returns `[[16], [], [8], [6, 4]]` – Pawel Aug 30 '18 at 15:39
  • Logic mistake in the algorithm - an empty chunk was added when the accumulated sum got too high. Will check. – MBo Aug 30 '18 at 15:50
  • 1
    Added check for empty chunk. – MBo Aug 30 '18 at 16:04
  • Passes all the tests I invented! – Pawel Aug 31 '18 at 16:10
  • Is there any proof for this approach? – Pham Trung Oct 24 '18 at 07:51
  • @Pham Trung No, just arbitrarily chosen and adapted heuristics. – MBo Oct 24 '18 at 08:02
  • Does not work well for `print(chunks([100,1,1,103,90], 3))`. Result: `[[100], [1, 1, 103], [90]]`. – user2052436 Nov 19 '18 at 22:56
  • `sk = max(s / k, max(lst))` fixes the above example: result `[[100, 1, 1], [103], [90]]` (a side-by-side comparison of both variants follows this thread) – user2052436 Nov 19 '18 at 23:22
  • @user2052436 But it might give a worse variant in some cases (not tested). See also Pham Trung's solution in the linked topic – MBo Nov 20 '18 at 02:25
  • It should not give worse variants: usually (especially when `n >> k`) `s / k > max(lst)`, so your average stays. If there is a huge outlier, then the chunk containing the outlier will have a sum at least equal to that outlier, so it makes sense to use the outlier instead of the average. – user2052436 Nov 20 '18 at 15:49
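
For readers who want to compare the two target choices from this thread side by side, the target can be factored out as a parameter; a minimal sketch (the `chunks2` name and the `target` parameter are illustrative, not part of the answer above):

def chunks2(lst, k, target=None):
    # chunks() from the answer, with the per-chunk target exposed
    # so that both variants discussed in the comments can be compared.
    s = sum(lst)
    sk = target if target is not None else s / k
    idx = 0
    chunkstart = 0
    r = 0
    res = []
    for m in range(1, k):
        for idx in range(chunkstart, len(lst)):
            km = k - m
            irest = len(lst) - idx
            if ((km >= irest) or (2 * r + lst[idx] > 2 * m * sk)) and (idx > chunkstart):
                res.append(lst[chunkstart:idx])
                chunkstart = idx
                break
            r += lst[idx]
    res.append(lst[chunkstart:])
    return res

a = [100, 1, 1, 103, 90]
print(chunks2(a, 3))                           # [[100], [1, 1, 103], [90]]
print(chunks2(a, 3, max(sum(a) / 3, max(a))))  # [[100, 1, 1], [103], [90]]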