
I'll try to explain the problem in mathematical terms.
Assume I have a set of items X = {x_1, x_2, ..., x_n}. Each item of X belongs to exactly one of the sets S_1, S_2, ..., S_5. I consider all subsets of X consisting of 5 items {x_i1, x_i2, ..., x_i5} such that x_i1 belongs to S_1, ..., x_i5 belongs to S_5.
Some of these subsets are considered correct and some are not. A subset is correct if it contains no conflicting items; I have a function f1 that determines whether a pair of items conflicts.
I also have a function f2 that compares two correct subsets and says which one is better (they may also be equal).
I need to find the best non-conflicting subset(s).

The algorithm I used:
I built all the subsets and discarded the incorrect ones. Then I sorted the correct subsets using f2 as the comparison function (quick-sort) and took the best subset(s) from the front. Since there was a huge number of subsets, this procedure took an unacceptable amount of time.

Is there a better approach in terms of running time?

UPDATED
Think of each x_i as an interval with integer endpoints. f1 returns true if two intervals do not intersect and false otherwise. f2 compares the sums of the interval lengths in the subsets.
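
For concreteness, here is a minimal sketch (in Python, illustrative names only) of the brute-force approach described above, under the interval interpretation with intervals as (start, end) tuples, f2 taken to be the total length, and touching endpoints treated as non-intersecting:

from itertools import product, combinations

def no_conflict(a, b):                  # f1: True if the two intervals do not intersect
    return a[1] <= b[0] or b[1] <= a[0]

def score(subset):                      # f2 stand-in: total length of the intervals
    return sum(e - s for s, e in subset)

def brute_force_best(groups):           # groups = [S_1, S_2, S_3, S_4, S_5], each a list of intervals
    best, best_score = [], None
    for subset in product(*groups):     # one interval from each S_i
        if all(no_conflict(a, b) for a, b in combinations(subset, 2)):
            sc = score(subset)
            if best_score is None or sc > best_score:
                best, best_score = [subset], sc
            elif sc == best_score:
                best.append(subset)     # keep ties: there may be several best subsets
    return best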

StuffHappens
    Without more detail about the function you use to rank subsets I don't think it's possible to do better; think about a function that gives weight 0 to everything except one subset, which has weight 1. The only possible way to find that subset would be to list them all unless you already know something about the function. – templatetypedef Jan 10 '12 at 08:16
  • I guess that f2 is somehow related to f1. – xyz Jan 10 '12 at 08:20
  • I updated the post so the problem is clearer. – StuffHappens Jan 10 '12 at 08:33
  • First of all, before starting to optimize, you need to figure out which part takes the most time. Make benchmarks and update the question. – Ranty Jan 10 '12 at 09:45
  • What I need here is not a micro-optimization but another approach that will solve the same problem. So I think there's no need for profiling. – StuffHappens Jan 10 '12 at 10:09
  • With the new definition given, I suspect a matroid in there somewhere. Have a look at this and your solution will be dead simple. However, I do not have the time to prove the matroid condition on your problem ATM. – LiKao Jan 10 '12 at 11:42
  • Please update if my solution below solves your purpose. – Rajendran T Jan 19 '12 at 07:10

7 Answers


Without further qualifying the domains and the evaluation function, this problem can easily be shown to be NP-complete by reducing SAT to it (i.e. let S_1, ..., S_5 be {true, false} and f2 = 1 if the formula is fulfilled and 0 if not). Hence, in that case, even without taking f1 into account you are out of luck.

If you know more about the actual structure of f1 and f2, you might have more luck. Have a look at Constraint Satisfaction Problems to find out what to look for in the structure of f1 and f2.

LiKao
  • Thanks for your answer. I updated the post so the meaning of f1 and f2 is clearer. – StuffHappens Jan 10 '12 at 10:10
  • Could you please elaborate? I don't see how a boolean expression with n variables can be reduced to 5 variables S_1,...,S_5. Also, you must take f1 into account; the algorithm for this problem might use its structure to find a solution quickly. How would you construct f1? – Ishtar Jan 10 '12 at 10:48
  • @Ishtar: I was first tackling the general case (not limited to 5 variables, not taking into account the structure of f_2). If you need to enumerate everything for n variables, you need to enumerate everything for 5 variables. However the general problem is already NP-Complete for 5 Variables, as can be easily shown by using S_i = {true,false}^(m_i) with sum(m_i) = n. Now you can encode any formula with n variables in this problem. The definition of f_1 is also very simple: f_1 = no_conflict for any two elements. For the more specific part with the structure of f_2 given, I have no idea. – LiKao Jan 10 '12 at 11:36

Think of each x_i as an interval with integer endpoints. f1 returns true if two intervals do not intersect and false otherwise. f2 compares the sums of the interval lengths in the subsets.

If I understand correctly, this means we can assign a value (its length) to each x_i in X. There is then no need to evaluate f2 on every possible solution/subset.

It's very unlikely that the five smallest x_i form the best subset. Depending on the actual data, the best subset might be the five biggest intervals. So I'd suggest sorting X by value. The general idea is to start with the highest x and keep trying to add more x's (highest first) until you have five non-overlapping ones. Most likely you will find the best subset before generating even a fraction of all the possible subsets (this depends on the specific problem, of course). In the worst case, however, this is not faster than your solution.
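
A rough sketch of this idea, assuming intervals are (start, end, group) tuples with group in 0..4; note that a single greedy pass like this is a heuristic and is not guaranteed to find the true optimum:

def greedy_candidate(intervals):
    """Pick intervals largest-first, skipping any that overlap an already
    chosen interval or whose group is already covered."""
    chosen, covered = [], set()
    for s, e, g in sorted(intervals, key=lambda iv: iv[1] - iv[0], reverse=True):
        if g in covered:
            continue
        if any(s < ce and cs < e for cs, ce, _ in chosen):   # overlaps a chosen interval
            continue
        chosen.append((s, e, g))
        covered.add(g)
        if len(covered) == 5:                                # one interval per group found
            break
    return chosen                                            # heuristic answer, up to 5 intervals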

Ishtar

If we put aside the condition to take one x from each S_i, this problem is equivalent to Maximum Weight Independent Set in an interval graph (that is, finding a maximum-weight set of pairwise non-adjacent vertices in a graph where vertices represent intervals and are connected if the corresponding intervals overlap). That problem can be solved in polynomial time.

The version here also assigns a color to each vertex, and the chosen vertices need to have all different colors. I am not sure how to solve this in polynomial time, but you can exploit the fact that there are not many colors: build a dynamic programming table T[C, x], where C is a set of colors and x is the position of an endpoint of an interval. T[C, x] should contain the maximum weight you can get from |C| intervals with the colors in C that lie to the left of x. You can then fill in the table from left to right. This should be feasible since there are only 2^5 = 32 color sets.
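
Here is a sketch of that dynamic program in Python, indexed by end-sorted interval prefix rather than by coordinate x; it assumes colors are labelled 0..4, weights are interval lengths, and intervals that merely touch at an endpoint count as non-intersecting:

from bisect import bisect_right

def best_colored_subset(intervals):
    """intervals: list of (start, end, color) with color in 0..4.
    Returns the maximum total length of pairwise non-overlapping intervals
    using each color exactly once, or None if no such selection exists."""
    NEG = float("-inf")
    ivs = sorted(intervals, key=lambda iv: iv[1])       # sort by end point
    ends = [iv[1] for iv in ivs]
    n, FULL = len(ivs), (1 << 5) - 1
    # dp[j][mask]: best weight using exactly the colors in mask,
    # drawn from the first j intervals in end-sorted order (-inf if impossible).
    dp = [[NEG] * (FULL + 1) for _ in range(n + 1)]
    for j in range(n + 1):
        dp[j][0] = 0                                    # empty color set -> weight 0
    for j in range(1, n + 1):
        start, end, color = ivs[j - 1]
        weight = end - start
        # number of earlier intervals ending no later than start(j)
        p = bisect_right(ends, start, 0, j - 1)
        for mask in range(1, FULL + 1):
            best = dp[j - 1][mask]                      # option 1: skip interval j
            if mask & (1 << color):                     # option 2: take interval j
                prev = dp[p][mask ^ (1 << color)]
                if prev != NEG:
                    best = max(best, prev + weight)
            dp[j][mask] = best
    return None if dp[n][FULL] == NEG else dp[n][FULL]

Recovering the actual chosen intervals is the usual backtracking over the dp table.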

Falk Hüffner

I have a solution that should be good if my understanding of your question is right. Let me begin with what I understand:

Each x_i is actually an interval from I1 to I2, and a set is a
combination of such intervals. A set is correct if none of its intervals
intersect, and Set1 > Set2 if the sum of interval lengths in Set1 is greater than the sum in Set2.

What I would do in this situation is something along these lines.

  1. While comparing the intervals to determine whether they intersect, do this:

    a) Sort the intervals in order of start points

    b) Compare the end point of each interval with the start point of the next to determine an overlap. Keep an integer named gap, and whenever the end of one interval and the start of the next do not overlap, increment gap by their difference.

This automatically gives you the sum of the interval lengths in the set as Endpoint(lastI) - Startpoint(firstI) - gap.

=> If you need just the best, you can keep one variable max and compare sets as they come.

=> If you need the top 5 or similar, then follow the steps below; otherwise skip them.

  1. As soon as you have the sum and the set is correct, add the sum to a min-heap of 5 elements (see the sketch after the note below). The first 5 elements go in as they are; essentially you are keeping track of the top 5 sets. When a new set's sum is less than the minimum of the heap, do nothing and ignore that set, since it is not in the top 5. When it is larger than the minimum (meaning it belongs in the top 5), replace the minimum and sift the new element down, keeping the minimum of the top 5 at the top. This always keeps the top 5 elements in the heap.

  2. Now that you have the top 5 elements, you can easily determine the best with 5 pops. :)

Note: If the intervals are in random order, this degrades to an O(n^2) solution, and each comparison would then need four if statements to check the possible overlap positions. You can sort the intervals in O(n log n) and then go through the list once to determine overlaps (n log n + n = n log n), while simultaneously collecting the top 5 sets. This should improve your performance and running time.
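
A minimal sketch of the size-5 min-heap from step 1 (here tracking only the sums; in practice you would push (sum, set) pairs):

import heapq

def track_top_k(candidate_sums, k=5):
    """Keep the k largest sums seen so far using a size-k min-heap."""
    heap = []                               # heap[0] is the smallest of the current top k
    for s in candidate_sums:
        if len(heap) < k:
            heapq.heappush(heap, s)         # first k elements go in as they are
        elif s > heap[0]:                   # beats the current k-th best
            heapq.heapreplace(heap, s)      # pop the min, push the new sum
    return sorted(heap, reverse=True)       # best first

print(track_top_k([12, 7, 30, 5, 18, 22, 9, 41]))   # [41, 30, 22, 18, 12]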


Kshitij Banerjee

This problem is a variation of the maximum weighted interval scheduling problem. The DP algorithm has polynomial complexity O(N*log(N)) with O(N) space for the basic problem, and O(2^G * N * log(N)) complexity with O(2^G * N) space for this variation, where G and N are the number of groups/subsets (5 here) and the number of intervals respectively.

If the x_i don't represent intervals, then the problem is NP-complete, as other answers have shown.

First let me explain the dynamic programming solution for maximum weighted interval scheduling, and then solve the variation problem.

  • We are given the start and end points of the intervals. Let start(i), end(i), and weight(i) be the start point, end point, and length of interval i respectively.
  • Sort the intervals based on increasing order of start point.
  • Let the sorted order of intervals be 1, 2, ... N.
  • Let next(i) be the first interval after i (in sorted order) that doesn't overlap with interval i.
  • Let's define the subproblem S(i) as the maximum total weight achievable considering only jobs i, i+1, ..., N.
  • S(1) is the solution; it considers all jobs 1, 2, ..., N and returns the maximum total weight.
  • Now let's define S(i) recursively.


S(i)  = weight(i)                               if (i == N)   // last job
      = max(weight(i) + S(next(i)), S(i+1))     otherwise

The complexity of this solution is O(N*log(N) + N): N*log(N) for finding next(i) for all jobs (binary search after sorting), and N for solving the subproblems. Space is O(N) for storing the subproblem solutions.
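
A runnable sketch of this DP, assuming intervals are (start, end) pairs, weight(i) = end(i) - start(i), and touching endpoints do not count as overlapping; it is the bottom-up form of S(i) with a binary search for next(i):

from bisect import bisect_left

def max_weight_schedule(intervals):
    """Maximum weighted interval scheduling: returns the largest total length
    of pairwise non-overlapping intervals."""
    ivs = sorted(intervals)                       # sort by start point
    starts = [s for s, _ in ivs]
    n = len(ivs)
    S = [0] * (n + 1)                             # S[i]: best weight using jobs i..n-1; S[n] = 0
    for i in range(n - 1, -1, -1):
        s, e = ivs[i]
        nxt = bisect_left(starts, e, i + 1)       # next(i): first job starting at or after end(i)
        S[i] = max((e - s) + S[nxt], S[i + 1])    # take job i, or skip it
    return S[0]

# Example: max_weight_schedule([(1, 4), (3, 5), (0, 6), (5, 7), (8, 9)]) == 7  (take (0, 6) and (8, 9))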

Now let's solve the variation of this problem.

  • Let's look at all the intervals in X collectively. Each interval belongs to one of the sets S_1, ..., S_5.
  • Let start(i), end(i), weight(i), and group(i) be the start point, end point, length, and group (which of S_1, ..., S_5 it belongs to) of interval i respectively.
  • Sort the intervals based on increasing order of start point.
  • Let the sorted order of intervals be 1, 2, ... N.
  • Let next(i) be the first interval after i (in sorted order) that doesn't overlap with interval i.
  • Let's define the subproblem S(i, pending) as the maximum total weight considering only jobs i, i+1, ..., N, where pending is the set of groups from which we still have to choose one interval each.
  • S(1, {S_1, ..., S_5}) is the solution; it considers all jobs 1, ..., N, chooses one interval from each of S_1, ..., S_5, and returns the maximum total weight.
  • Now let's define S(i, pending) recursively as follows.


S(i, pending)  = 0                              if (pending == empty_set)   // all groups covered
               = -inf                           if (i > N)                  // intervals exhausted, groups still pending
               = S(i+1, pending)                if (group(i) not in pending)
               = max(weight(i) + S(next(i), pending - {group(i)}),
                     S(i+1, pending))           otherwise

Note that I may have missed some base cases.

The complexity of this algorithm is O(2^G * N * log(N)) with O(2^G * N) space; 2^G * N is the number of subproblems.

As an estimate, for small values of G <= 10 this algorithm runs quickly even for large N >= 100,000. For medium values of G around 20, N should be low as well (say <= 10,000) for it to finish in reasonable time. And for high values of G >= 40, it is not practical.
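
A sketch of S(i, pending) as memoized recursion, assuming intervals are (start, end, group) tuples with group in 0..G-1, pending encoded as a bitmask, and touching endpoints treated as non-overlapping:

from bisect import bisect_left
from functools import lru_cache

def best_one_per_group(intervals, num_groups=5):
    """Maximum total length of non-overlapping intervals using each group
    exactly once, or None if no such combination exists."""
    NEG = float("-inf")
    ivs = sorted(intervals)                          # sort by start point
    starts = [s for s, _, _ in ivs]
    n = len(ivs)

    @lru_cache(maxsize=None)
    def S(i, pending):                               # pending: bitmask of groups still to cover
        if pending == 0:
            return 0                                 # every group covered
        if i >= n:
            return NEG                               # intervals exhausted, groups still pending
        start, end, group = ivs[i]
        best = S(i + 1, pending)                     # skip interval i
        bit = 1 << group
        if pending & bit:                            # take interval i for its group
            nxt = bisect_left(starts, end, i + 1)    # next(i): first interval starting at or after end(i)
            cand = S(nxt, pending & ~bit)
            if cand != NEG:
                best = max(best, (end - start) + cand)
        return best

    result = S(0, (1 << num_groups) - 1)
    return None if result == NEG else result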

Rajendran T
  • Thanks very much for the answer. It seems like it's going to work for me. I'll try it out before accepting your answer. – StuffHappens Jan 19 '12 at 07:11

I don't have a complete answer because you asked a very abstract question, but I will give you an idea.

Think about multithreading. For instance, you can create a thread pool with a limited number of threads, then write a recursive solution and start a new task for each branch as you descend.

What I am saying is: the better you are able to split this problem into many small tasks, the better your algorithm will be.

Think programmatically, not mathematically!
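
A rough sketch of this idea using a process pool (plain Python threads would not speed up CPU-bound work); note it only divides the constant factor of the brute-force enumeration and does not change its exponential growth, and on Windows/macOS the call must be guarded by if __name__ == '__main__':

from concurrent.futures import ProcessPoolExecutor
from itertools import combinations, product

def no_conflict(a, b):                                # intervals as (start, end)
    return a[1] <= b[0] or b[1] <= a[0]

def best_in_chunk(chunk):
    """Evaluate one chunk of candidate 5-tuples; return (score, subset) of the best correct one."""
    best = None
    for subset in chunk:
        if all(no_conflict(a, b) for a, b in combinations(subset, 2)):
            sc = sum(e - s for s, e in subset)
            if best is None or sc > best[0]:
                best = (sc, subset)
    return best

def parallel_best(groups, workers=4, chunk_size=10000):
    combos = list(product(*groups))                   # one interval from each S_i
    chunks = [combos[i:i + chunk_size] for i in range(0, len(combos), chunk_size)]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        results = [r for r in pool.map(best_in_chunk, chunks) if r is not None]
    return max(results, default=None)                 # (score, subset) of the overall best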

Ilya Gazman

Consider using a lookup table to speed up f1. Also consider inserting the subsets you discover into a sorted list as you go (merge-style insertion) instead of quicksorting everything at the end. If the domain is small and finite, you can implement some very fast merge sorts by populating sparse arrays.
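
A small sketch of both suggestions, assuming the interval version of f1 and f2 = total length; the cache acts as the lookup table for pairwise conflicts, and bisect.insort keeps the result list sorted as subsets are discovered (each insertion is O(n), so for a very large number of subsets a bounded heap of the top results, as in an earlier answer, may be preferable):

from bisect import insort
from functools import lru_cache

@lru_cache(maxsize=None)                       # lookup table: each pair is tested at most once
def f1(a, b):
    """True if the two intervals (start, end) do not conflict, i.e. do not intersect."""
    return a[1] <= b[0] or b[1] <= a[0]

def collect_sorted(correct_subsets):
    """Insert each (score, subset) into an already-sorted list as it is found,
    instead of quicksorting everything at the end. Best subsets end up last."""
    ranked = []
    for subset in correct_subsets:
        score = sum(e - s for s, e in subset)  # f2 stand-in: total interval length
        insort(ranked, (score, subset))
    return ranked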

Stephen Quan