3

There is a set S containing N integers each with value 1<=X<=10^6. The problem is to partition the set S into k partitions. The value of a partition is the sum of the elements present in it. Partition is to be done in such a way the total value of the set S is fairly distributed amongst the k partitions. The mathematical meaning of fair also needs to be defined (e.g. the objective could be to minimize the standard deviation of the values of the partitions from the average value of the set S (which is, sum(S)/k))

e.g. S = {10, 15, 12, 13, 30, 5}, k=3

A good partitioning would be {30}, {10, 15}, {12, 13, 5}

A bad partitioning would be {30, 5}, {10, 15}, {12, 13}

First question is to mathematically express condition for one partition to be better than the other. Second question is to how to solve the problem. The problem is NP-Hard. Are there any heuristics?

In the problem that I am trying to solve N <= (k*logX)^2 and K varies from 2 to 7.

==================================================================================

Based on other related SO questions, there are two reasonable functions to evaluate a distribution:

a) Minimize the value of the partition with the maximum value.

On a second thought, this is not a good metric. Consider, a set {100, 40, 40} to be partitioned into three subsets. This metric does not distinguish between the following two distributions even though one is clearly better than the other.

Distribution 1: {100}, {40}, {40} and Distribution 2: {100}, {40, 40}, {}

b) Minimize the maximum of the difference of any two values in a given partition i.e minimize max|A-B| for any A, B

Chris Gerken
  • 16,221
  • 6
  • 44
  • 59
Akhil
  • 2,269
  • 6
  • 32
  • 39
  • for the record, `N <= (7*ln(10^6))^2 ~= 9300`; does that sound about right? – ninjagecko Jun 23 '11 at 14:38
  • 1
    Exact same question with a really good answer here. http://stackoverflow.com/questions/6454598/stuck-with-an-interview-question-partitioning-of-an-array – Corey Alexander Jun 24 '11 at 06:49
  • [My Question](http://stackoverflow.com/questions/5458422/partition-a-set-into-k-disjoint-subset) is pretty much the same. I found the answer, i've posted it. – st0le Jun 24 '11 at 09:39
  • 1
    @Corey - Their is a gross difference between the two questions. In my problem, rearrangement is allowed, while in the referred question the number should not be rearranged – Akhil Jun 24 '11 at 17:56
  • @stOle - Your solution needs more explanation as in how the two problems are the same. Are you able to reduce one problem to the other? – Akhil Jun 24 '11 at 18:15
  • Akhil's (this) problem is slightly more general because it doesn't assume a penalty function, but rather asks for one. st0le's problem accidentally does not specify a problem definition for k>2, even though a commenter asked for clarification; even if so, it would be slightly less general because it asks for a penalty function in terms of the differences. Corey's problem is drastically different. – ninjagecko Jun 25 '11 at 06:06

3 Answers3

6

I think a good metric will be:

let the result set be s1,s2,...,sk
let MAX be max{sum(si) for each i}
f({s1,...,sk}) = Sigma(MAX-sum(si)) for each i)

the upside: a perfect distribution will yield always 0!
the downside: if there is none perferct solution, the best result will not yield 0.

a greedy heuristic for this problem will be:

sorted<-sort(S) (let's say sorted[0] is the highest)
s1=s2=...=sk= {}
for each x in sorted:
   s <- find_min() (*)
   s.add(x)

where find_min() yields s such that sum(s) <= sum(si) for each si.

this solution's will yield f (metrics defined above) such that f(sol) <= (k-1)*max{S} (from here it is a proof for this bound):


claim: for each subset s, MAX- sum(s) <= max{S}
proof - by induction: at each step, the claim is true for the temporary solution.
in each step, let MAX be max{sum(si)} at the start of the iteration (before the add)!

base: the set of subsets at start is {},{},.. MAX=sum(si)=0 for each si. 
step: assume the assumption is true for iteration i, we'll show it is also true for iteration i+1:
let s be the set that x was added to, then MAX-sum(s) <= max{S} (induction assumption).
if sum(s) + x <= MAX: we are done, MAX was not changed.
else: we sorted the elements at start, so x <= max{S}, and thus if s was chosen
   (sum(si) >= sum(s) for each si) and sum(s) + x > MAX then: for each si, sum(si) + x >=
   sum(s) + x, so sum(s)+x - sum(si) <= x <= max{S}. since sum(s)+x will be the MAX next 
   iteration, we are done.

because for each set MAX-sum(si) <= max{S} (and obviously, for the max set, MAX-sum(si)=0), at overall Sigma(MAX-sum(si)) <= (k-1)*max{S}, as promised.

EDIT :
I had some spare time, so I programmed both heuristics suggested by me and by @Akhil, and both metrics, first of all, both results are conclusive (according to Wilcoxon's pair-t test), but which is better is defined by which metric you choose, surprisingly, the algorithm which tried to minimize f() (@Akhil`s) scored lower for this same f, but higher for the second metric. @Akhil's metrics graph

@Amit's metrics graph

amit
  • 175,853
  • 27
  • 231
  • 333
  • @amit: neat solution and a very good effort. I am putting another greedy approach as an answer below. I gave it a few tries but could not prove any bound. Can you give it a try? – Akhil Jun 27 '11 at 02:19
  • @amit: can you please define the 'x' and the 'y' axis in your answer, for completeness/clarity? – Akhil Jun 27 '11 at 20:13
  • @Akhil: It is labled: for both histograms x axis is metric value (for each graph a different metric) while y axis is frequency – amit Jun 28 '11 at 08:07
  • @amit: graph?? The metric is supposed to be continuous (being the objective value). Also, what does frequency represent(frequency of what?) – Akhil Jun 28 '11 at 16:29
  • @Akhil it is a histogram, I produced 150 random samples, and ran both algorithms on each sample. then I produced the histogram with 20 bins for each metric - to try and show which is better. it is pretty standard for comparing two heuristics... – amit Jun 28 '11 at 17:28
1

One heuristic would be to spread the larger weights among the bags as evenly as possible, leaving enough smaller weights that you're now left with a subproblem with a large number of degrees of freedom. Repeat into sub-subproblems if necessary. This heuristic assumes your distribution is not too geometric, e.g. {1000} and {100, 10, 1}, and slightly assumes that your penalty function will penalize nil-assignments or very large outliers.

For example:

distributeFairly(numbers, bins):
    distributeFairlySubproblem(numbers, bins):
        n = len(numbers)
        numElementsToDefer = min(-n//3,20*k)  # modify as appropriate, e.g. to avoid len(toPlace)<len(toDefer)

        toDefer = numbers[-numElementsToDefer:]
        toPlace = numbers[:-numElementsToDefer]

        newBins = shoveThemIn(toPlace, copy(bins))
        return distributeFairlySubproblem(toDefer, newBins)

    initialGuess = distributeFairlySubproblem(sorted(numbers,reverse=True), [[]]*k)
    return anneal(initialGuess)
ninjagecko
  • 88,546
  • 24
  • 137
  • 145
1

Let the metric be to minimize max(sum(si) - sum(sj)) where si and sj are any two subsets in the resulting partition of the set S.

Lets say we have a distribution D and we need to include another element x to the distribution D. Add it to the subset s such that the metric above is minimized.

Could not prove any bounds, but intuition says that it will give a good approximation to the optimal? Anyone good at proving bounds?

Akhil
  • 2,269
  • 6
  • 32
  • 39
  • note that this metric will be O(n^2), so the algorithm will be O(n^3). I think with this metric, the same claim from my answer should apply, with a similar proof, so the bound will be f(sol) <= max{S}. I'm pretty sure that if you sort S before starting to add, you'll get better results in the average case, but these are all heuristics, just run some experiments with random samples and see which is better... (it will take around 1-2 hours to program both of these algorithms in Python). – amit Jun 27 '11 at 09:30
  • @amit: k (number of partitions) will be a small number (from 2 to 10). So, the complexity becomes (n *k^2). For the problem I am trying to solve, maximum value of n is (klogN)^2 where N<= 10^6. Therefore, this approach will not be very costly – Akhil Jun 27 '11 at 13:18
  • I had some free time, so I wrote the Python code for both algorithms and both heuristics, will edit my answer shortly with results. – amit Jun 27 '11 at 13:35
  • have a look, I editted my answer: each algorithm scored higher according to the other's metrics, which is surprising - because your suggested algorithm scored lower in the same metric it tried to minimize. – amit Jun 27 '11 at 14:05