
I am trying to solve the following problem:

  • I have several blocked sets (which may contain duplicate elements).
  • I must pick a (varying) number of elements from each blocked set to unblock it.
  • I am only allowed to pick elements that also occur in the picking set.
  • Whenever I remove an element from a blocked set I must also remove it from the picking set.
  • The picking set may contain more or fewer elements than strictly required.

Minimal example:

//Syntax for blocking sets:
//Name (number of elements required to pick): {elements in set}

//Syntax for picking set:
//Name: {elements in set}

BS1 (1): {0, 1}
BS2 (1): {1, 2}
Picking Set: {1, 2}

Possible solution:

BS1 (1): {0, 1}     <- take 1
BS2 (1): {1, 2}     <- take 2
Picking Set: {1, 2} <- remove 1, 2

If we pick 1 from BS2 instead, the problem becomes unsolvable: the picking set reduces to {2}, but BS1 only contains {0, 1}, so no further pick is possible.
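
To make the rules concrete, here is a small validity checker (my own illustrative sketch, not code from the question), using Python's Counter for multisets; the names blocked_sets, prices, and picks are assumptions of this sketch:

from collections import Counter

def is_valid_solution(blocked_sets, prices, picking_set, picks):
    """blocked_sets: list of Counters; prices: elements required per set;
    picking_set: Counter; picks[j]: Counter of elements taken from set j."""
    taken = Counter()
    for bs, price, pick in zip(blocked_sets, prices, picks):
        if sum(pick.values()) != price:  # must take exactly `price` elements
            return False
        if pick - bs:                    # every pick must occur in the blocked set
            return False
        taken += pick
    return not (taken - picking_set)     # all picks must be covered by the picking set

# Minimal example from above
blocked = [Counter({0: 1, 1: 1}), Counter({1: 1, 2: 1})]
picking = Counter({1: 1, 2: 1})
print(is_valid_solution(blocked, [1, 1], picking, [Counter({1: 1}), Counter({2: 1})]))  # True
print(is_valid_solution(blocked, [1, 1], picking, [Counter({0: 1}), Counter({1: 1})]))  # False: 0 not in picking set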

A more difficult scenario:

BS1 (1): {1, 2, 4} 
BS2 (2): {2, 3, 4} 
BS3 (3): {1, 3, 4, 4} 
Picking Set: {1, 2, 3, 4, 4, 4}

Possible Solution:

BS1 (1): {1, 2, 4}              <- take 1
BS2 (2): {2, 3, 4}              <- take 2, 4 
BS3 (3): {1, 3, 4, 4}           <- take 3, 4, 4
Picking Set: {1, 2, 3, 4, 4, 4} <- remove all

This scenario is solvable in multiple ways, but certain picks will lead to a dead end:

BS1 (1): {1, 2, 4}              <- take 1
BS2 (2): {2, 3, 4}              <- take 2, 3 
BS3 (3): {1, 3, 4, 4}           <- take 4, 4, and then dead end
Picking Set: {1, 2, 3, 4, 4, 4} <- remove all but one 4

My solution:

I wrote a recursive brute-force algorithm that tests all picking combinations for the first set, then all subsequent combinations of the next set based on that, and so on. It works, but it is slow: the number of combinations explodes. However, roughly half of the branches are successful, even for large problem instances. This makes me hopeful that a heuristic or some other method exists that can construct a valid solution directly.
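
For reference, a compact version of such a recursive brute force might look like the following sketch (my reconstruction under stated assumptions, not the asker's actual code); it tries every price-sized sub-multiset of each blocked set in turn:

from collections import Counter
from itertools import combinations

def brute_force(blocked_sets, prices, picking_set, j=0):
    """Return one valid list of picks (a Counter per blocked set), or None."""
    if j == len(blocked_sets):
        return []
    # Candidates: elements in blocked set j that are still in the picking set
    available = sorted((blocked_sets[j] & picking_set).elements())
    for combo in set(combinations(available, prices[j])):
        pick = Counter(combo)
        rest = brute_force(blocked_sets, prices, picking_set - pick, j + 1)
        if rest is not None:
            return [pick] + rest
    return None

blocked = [Counter([1, 2, 4]), Counter([2, 3, 4]), Counter([1, 3, 4, 4])]
print(brute_force(blocked, [1, 2, 3], Counter([1, 2, 3, 4, 4, 4])))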

My questions:

  • Is there a name for this problem?
  • What is the fastest way to construct a valid solution?
  • Can we do better than brute force (generate guaranteed valid solutions without trial and error)?
user4157124
  • This is a variation on the [set cover problem](https://en.wikipedia.org/wiki/Set_cover_problem). You're looking for an exact cover of the picking set. But at the same time you need partial covers of the blocked sets. That's the variation. Whether that makes the problem harder or easier, I don't know. – user3386109 Jul 01 '23 at 16:45
  • This is certainly closely related to the problems you stated, although an exact cover of the picking set is not always necessary (the picking set may contain more elements than strictly required). Is it perhaps possible to at least quickly detect whether a solution exists at all, rather than constructing one? – CircularRefraction Jul 01 '23 at 17:32
  • "(more elements may exist in the picking set than strictly required)" <-- That definitely needs to be added to the question, because that feature is not illustrated in either of the examples. – user3386109 Jul 01 '23 at 17:43
  • That feature makes the problem harder and easier at the same time. Harder because there are more combinations to check, but easier because more combinations will work. My approach to the exact cover version of the question is to attack the branching factor. I'll have to rethink that approach if the picking set has excess elements. – user3386109 Jul 01 '23 at 17:50
  • I have added this detail to the question as advised. The examples use minimal feasible picking sets, because I believe these present the highest risk of encountering invalid branches. If the picking set is too small, it is trivial to show that no solution can exist. If it is larger, the odds of encountering a dead-end branch should tend to be smaller. – CircularRefraction Jul 01 '23 at 18:34

2 Answers


I would place this problem in the class of maximum bipartite matching problems with additional constraints. It's likely impossible to give "the fastest way to construct a valid solution" without knowing limits on the set sizes, the number of blocked sets, and the number of distinct elements and their multiplicities. But I'll give an efficient polynomial-time solution: a reduction to the maximum flow problem, similar to other problems in this class. It should allow solving your problem even when the total size of the multisets is on the order of 100'000.

Definitions

Let's define a graph describing our problem.

The picking multiset is represented by one vertex a_i per distinct element v_i with multiplicity pickcnt_i. The j-th blocked multiset is represented by one vertex b_jk per distinct element u_jk with multiplicity blockcnt_jk, plus an auxiliary "bottleneck" vertex c_j. price_j denotes the number of elements that must be picked to unblock the j-th blocked multiset. Additionally, a source vertex S and a sink vertex T are defined.

v_i is usable pickcnt_i times, so S is connected to a_i by an edge with capacity pickcnt_i. Similarly, b_jk is connected to c_j with capacity blockcnt_jk. c_j is connected to T with capacity price_j to limit the progress of "partial unblocking". Finally, a_i and b_jk are connected by an edge of unlimited capacity iff v_i == u_jk.

Interpreting a flow

This graph is a flow network. Consider an arbitrary feasible flow. Each unit of flow consumes: a unit of capacity on S->a_i, modeling the removal of a single v_i from the picking multiset; a unit of capacity on b_jk->c_j, modeling the removal of a single u_jk from the j-th blocked multiset; and a unit of capacity on c_j->T, modeling a single partial unblocking. Hence it is trivial to convert between a feasible flow and a matching of picking-set and blocked-set elements.

Now consider a maximum flow. It doesn't violate any constraints of the original problem, and its value corresponds to the number of matched elements. So its value can't be higher than Σprice_j, can reach Σprice_j only by unblocking all sets, and must reach it if all sets can be unblocked. Therefore a maximum flow gives a solution to the original problem if it saturates all c_j->T edges; otherwise no solution exists.
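
To illustrate the reduction (my sketch, not code from the answer), the network can be built and solved with networkx's maximum_flow; the node labels and the function name unblock are assumptions of this sketch:

from collections import Counter
import networkx as nx  # assumed available; any max-flow library would do

def unblock(blocked_sets, prices, picking_set):
    G = nx.DiGraph()
    for v, cnt in picking_set.items():
        G.add_edge('S', ('a', v), capacity=cnt)            # S -> a_i, capacity pickcnt_i
    for j, (bs, price) in enumerate(zip(blocked_sets, prices)):
        G.add_edge(('c', j), 'T', capacity=price)          # c_j -> T, capacity price_j
        for u, cnt in bs.items():
            G.add_edge(('b', j, u), ('c', j), capacity=cnt)  # b_jk -> c_j
            if u in picking_set:
                G.add_edge(('a', u), ('b', j, u))          # omitted capacity = infinite in networkx
    value, flow = nx.maximum_flow(G, 'S', 'T')
    if value != sum(prices):                               # some c_j -> T edge not saturated
        return None
    # Decode the flow into concrete picks per blocked set
    return [Counter({u: flow[('b', j, u)][('c', j)] for u in bs
                     if flow[('b', j, u)][('c', j)] > 0})
            for j, bs in enumerate(blocked_sets)]

blocked = [Counter([1, 2, 4]), Counter([2, 3, 4]), Counter([1, 3, 4, 4])]
print(unblock(blocked, [1, 2, 3], Counter([1, 2, 3, 4, 4, 4])))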

Complexity

There are many algorithms for finding a maximum flow, with complexities favoring dense or sparse graphs, and small or unbounded capacities. Many perform better in practice than their complexity bounds suggest, particularly on special graphs like those produced by bipartite matching problems. For some such graphs there are additional theorems proving better complexity bounds. Not knowing your limits, I can't suggest a specific algorithm; I can only describe the size of the reduced problem.

The number of vertices is dominated by the sum of the unique-element counts of all sets. The number of edges is dominated by the number of valid "initial moves": which element can be used to partially unblock which multiset. The maximum flow value is the maximum number of "moves" that can be performed.

For examples of prewritten, ready-to-use maxflow implementations, you can take a look at Dinic and PushRelabel here.

maxplus
  • Interesting suggestion; it hadn't occurred to me to transform the problem in such a way. One question: if I understand correctly, the maximum flow method does indeed yield a single solution (assuming one exists), although the graph could be used to find other solutions as well (which are not necessarily maximum flows). Is that correct? – CircularRefraction Jul 02 '23 at 08:50
  • @CircularRefraction, maximum flow yields a single solution that might change upon edge reordering/vertex relabeling, yes. But every solution to the original problem has a corresponding maximum flow: if something is a solution to the original problem, it corresponds to a flow of value `Σprice_j`, which is maximum. – maxplus Jul 02 '23 at 11:51

The approach I would take to solve this problem is a technique I call attacking the branching factor.

I first became aware of this technique while writing a Sudoku solver, so I'll explain how it works using a Sudoku puzzle as an example. Here's a partially solved Sudoku that was posted on Puzzling Stack Exchange.

[Image: a partially solved Sudoku grid; small grey digits mark the available candidates in each empty square.]

The small grey numbers are the available choices for each empty square, and their count is that square's branching factor. For example, the empty squares in row 0 (the top row) each have a branching factor of 3. Blindly trying every available choice for each of those squares would result in 81 combinations to try.

Now take a look at row 5 (with the highlighted yellow square). Every square on that row has a branching factor of 2, which is only 16 combinations in total. So obviously, it's much better to start with row 5 than to start with row 0. And that's the principle that's at the heart of the technique. Don't blindly start at the upper-left square and work left-to-right top-to-bottom. Instead, identify the square with the smallest branching factor, and work on that square first.

For the example puzzle, the smallest branching factor is 2, and the yellow square happens to be one of the squares with that branching factor. The first choice to try is the 1. Choosing the 1 makes all sorts of wonderful things happen (just follow the blue arrows in the image below):

  • the branching factor of the square at {5,1} is reduced to 1, forcing the 4
  • then the square at {3,2} is reduced to 1, forcing the 3
  • then the square at {4,1} is reduced to 1, forcing the 1
  • then the square at {6,1} is reduced to 1, forcing the 3
  • then the square at {6,0} is reduced to 1, forcing the 1
  • and as a bonus, 10 other squares have their branching factors reduced (indicated by the red x's)

[Image: the same Sudoku grid, with blue arrows tracing the chain of forced squares and red x's marking the eliminated candidates.]

So by identifying the yellow square as the square with the lowest branching factor, and then choosing the 1 in that square, six squares are filled in with no additional branching. After filling in the six squares, another square with branching factor 2 needs to be chosen, and the process continues.

Applying this technique to the sample puzzle yields an answer in 160 attempts. That's pretty darn fast considering that there are 50 empty squares in the puzzle. Blindly solving the puzzle left-to-right, top-to-bottom takes 12,108 attempts. Solving the puzzle in a deliberately bad order takes 640,916,214 attempts.

To summarize the algorithm:

at each level of recursion:
   identify the choice in the problem that has the lowest branching factor
   for each of the allowed choices:
       make the choice
       update the branching factors for any other related choices
       move to the next level of recursion

Ok, now let's apply the technique to the problem posed in the question:

BS1 (1): {1, 2, 4} 
BS2 (2): {2, 3, 4} 
BS3 (3): {1, 3, 4, 4} 
Picking Set: {1, 2, 3, 4, 4, 4}

There are two types of branching in this problem, the branching within a blocked set (BS1, BS2, BS3), and the branching for the numbers (N1, N2, N3, N4).

Let's examine the sets first:

  • BS1 has a branching factor of 3, since one out of the three numbers must be chosen.
  • BS2 has a branching factor of 3, since two out of the three numbers must be chosen.
  • BS3 has a branching factor of 3, the choices are {1,3,4}, {1,4,4}, or {3,4,4}.

Now let's look at the numbers:

  • N1 appears in two sets, but only appears once in the picking set. So N1 has a branching factor of 2.
  • N2 also has a branching factor of 2.
  • N3 also has a branching factor of 2.
  • N4 has a branching factor of 3, the choices are {BS1, BS2, BS3}, {BS1, BS3, BS3}, or {BS2, BS3, BS3}.

The best branching factor is 2. N1 has that, and the choices are N1 in BS1, or N1 in BS3. Try N1 in BS1, and update the branching factors:

  • BS1 is eliminated
  • BS2's branching factor is not affected
  • BS3's branching factor is reduced to 1
  • N1 is eliminated
  • N2's branching factor is reduced to 1
  • N3's branching factor is not affected
  • N4's branching factor is reduced to 1

The lowest branching factor is 1, BS3 has that, so BS3(3,4,4) is forced. After updating the branching factors, we find that BS2(2,4) is forced, and we're done.
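
A sketch of how this might look in code for the multiset problem (my illustrative reconstruction, not the exact implementation described above): it recomputes the branching factors from scratch at every step for simplicity, and it only branches on blocked sets, not on numbers; a serious implementation would update the factors incrementally, as the comments below discuss.

from collections import Counter
from itertools import combinations

def options_for(bs, price, picking_set):
    """All distinct ways to take `price` elements from bs using the picking set."""
    available = sorted((bs & picking_set).elements())
    return sorted(set(combinations(available, price)))

def solve(blocked_sets, prices, picking_set):
    picks = [None] * len(blocked_sets)
    remaining = set(range(len(blocked_sets)))

    def recurse(picking_set):
        if not remaining:
            return True
        # Recompute branching factors; try the most constrained set first
        opts = {j: options_for(blocked_sets[j], prices[j], picking_set)
                for j in remaining}
        j = min(opts, key=lambda k: len(opts[k]))
        if not opts[j]:
            return False                       # dead end, backtrack
        remaining.discard(j)
        for combo in opts[j]:
            picks[j] = Counter(combo)
            if recurse(picking_set - picks[j]):
                return True
        picks[j] = None
        remaining.add(j)
        return False

    return picks if recurse(picking_set) else None

blocked = [Counter([1, 2, 4]), Counter([2, 3, 4]), Counter([1, 3, 4, 4])]
print(solve(blocked, [1, 2, 3], Counter([1, 2, 3, 4, 4, 4])))

Because the most constrained set is expanded first, dead ends are rare; the per-step cost of recomputing all the option lists is exactly the liability discussed in the comments below.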

user3386109
  • Thanks a lot for your answer! I already experimented with various "heuristics" in my brute-force algorithm to prefer testing specific combinations before others, and I was indeed able to vary the necessary branching drastically by doing so. However, I had not articulated so clearly what exactly I was doing, and the solutions were not as clear-cut as with your method. I will make sure to test this in the program! – CircularRefraction Jul 02 '23 at 08:40
  • 1
    I have now implemented this algorithm. Even in large problem instances, very few infeasible branches need to be explored, which is great (Often less than 10 dead ends for over 10000 successes). However, the additional cost of calculating the branching factors can become a liability if the problem happens to be less constricted. Considering that even a simple algorithm often succeeds in half it's branches, it can pay off to rely on quantity rather than quality. Nonetheless, this answer helped me greatly in understanding the problem, thanks! – CircularRefraction Jul 02 '23 at 13:13
  • @CircularRefraction You're welcome. I think you've understood the costs and benefits of this technique quite well. Finding data structures that allow incremental updates of the branching factors is an important part of the implementation. For example, in the difficult scenario, if the picking set is `{1,1,2,2,3,3,4,4,4,4}`, then brute force will work the first time (no dead ends). So using that picking set is a good way to measure the quality of your implementation. Measure the time for brute force, and the time needed for this technique. The ratio of those times indicates the quality. – user3386109 Jul 02 '23 at 18:38
  • 1
    The nature of this problem suggests that trial and error is inevitable, though the odds are not too bad. Thus, one must weigh the benefits of costly, but efficient branching vs fast, but suboptimal branching. Indeed, I found at this point streamlining of the implementation is paramount to obtaining good performance. That includes minimization of redundancy and copying operations in particular. In naive implementations, every recursive call creates a copy of the partial solution state and recalculates all branching factors from scratch. Better options are likely available. – CircularRefraction Jul 02 '23 at 19:08