I'm looking for an algorithm to solve the following problem. I have a number of subsets (1-n) of a given set (a-h). I want to find the smallest collection of subsets that will allow me to construct, by combination, all of the given subsets. This collection can contain subsets that do not exist in 1-n yet.

  a b c d e f g h
1 1
2 1   1
3   1     1   1
4 1       1
5   1         1
6 1     1   1   1
7 1       1 1   1
8 1   1       1
9 1         1   1

Below are two possible collections, the smaller of which contains seven subsets. I have denoted new subsets with an x.

1 1
x   1
x     1
x       1
x         1
x           1
x             1
x               1

1 1
x   1         
x     1
x       1        
x         1    
x           1   1
x             1

I believe this must be a known problem, but I'm not very familiar with algorithms. Any help is very much appreciated, as is a suggestion for a better topic title.

Thanks!

Update

Graph coloring gets me a long way, thanks. However, in my case subsets are allowed to overlap. For example:

  a b c d
1 1 1 1  
2 1 1 1 
3 1 1 1
4     1 1
5 1 1 1 1

Graph coloring gives me this solution:

x 1 1
x     1
x       1     

But this one is valid as well, and is smaller:

1 1 1 1  
4     1 1
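To make "construct by combination" concrete, here is a minimal sketch of a validity check, assuming each subset is encoded as a bitmask and "combination" means bitwise OR (set union). The helper names `to_mask` and `is_basis` are made up for illustration. The check relies on one observation: a collection set can only help build a target if it is contained in that target, so it is enough to OR together all contained sets and compare.

```python
from typing import Iterable

def to_mask(members: str, universe: str = "abcdefgh") -> int:
    """Encode a set such as "ac" as a bitmask over the universe (hypothetical helper)."""
    return sum(1 << universe.index(ch) for ch in members)

def is_basis(basis: Iterable[int], targets: Iterable[int]) -> bool:
    """True if every target is the union (bitwise OR) of some sets in `basis`."""
    basis = list(basis)
    for target in targets:
        union = 0
        for b in basis:
            if (b & ~target) == 0:  # b is a subset of target, so it is safe to use
                union |= b
        if union != target:
            return False
    return True

# The update example: rows 1-3 are {a,b,c}, row 4 is {c,d}, row 5 is {a,b,c,d}.
targets = [to_mask(s) for s in ("abc", "abc", "abc", "cd", "abcd")]
overlapping = [to_mask("abc"), to_mask("cd")]          # the two-set collection
coloring = [to_mask("ab"), to_mask("c"), to_mask("d")]  # the graph coloring collection

print(is_basis(overlapping, targets))          # True
print(is_basis(coloring, targets))             # True
print(is_basis([to_mask("abcd")], targets))    # False: {a,b,c} cannot be built
```

Both collections pass the check, so the only difference between them is size.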
user3170702
  • what do you mean by "by combination"? – hugomg Jan 07 '14 at 20:47
  • @ScottHunter because it could be non optimal, i.e. `a` `b` could be correlated so they always appear together which reduces the total number of subsets needed by 1 (compared to `a` `b` `c` `d` ... ) - although your approach would be most likely the fastest... – drahnr Jan 07 '14 at 20:49
  • @missingno I need to be able to obtain all of the given subsets by performing an OR operation on a number of subsets from the solution. So "combining" 1100 and 0110 yields 1110. I hope that's clearer. – user3170702 Jan 07 '14 at 21:43
  • @user3170702: That's clearer. The normal name for that is set union, btw. – hugomg Jan 07 '14 at 22:23
  • I think, your six subset solutions are not correct. For both of them I see no way to get set 8. – Henry Jan 08 '14 at 09:12
  • @Henry Thanks, I have updated the question. It's a not so good example, and even more so now. – user3170702 Jan 08 '14 at 09:39
  • Regarding the update about graph coloring: if you simply apply graph coloring to the negative (complement) of your original sets, you get the negative of your solution that allows overlapping. Allowing solution rows to overlap while minimizing the number of rows is the same as requiring that their negatives do not overlap, so you can use coloring on the negatives of the given data rows. To say it a different way: you are coloring the empty cells, and those empty cells are not allowed to overlap! – Diego Mazzaro Jan 10 '14 at 22:39

3 Answers

6

This problem is known as Set Basis, and it is NP-complete (Larry J. Stockmeyer: The set basis problem is NP-complete. Technical Report RC-5431, IBM, 1975). Its formulation as a graph problem is Bipartite Dimension. Since it is very hard to solve in general, it might be useful to check whether your data has any helpful properties (e.g., are the sets small? Is the solution small? Can all sets occur?)

I cannot think of an easy ILP formulation. Instead, you could try to reduce the problem to Clique Cover, which is better studied, using either the reduction from Kou & Wong or the one from Nor et al. I have coauthored a paper discussing algorithms for Clique Cover, and written a Clique Cover solver with both an exact solver and two heuristics.
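For very small instances, a plain brute-force search already gives an exact answer. The following is only a sketch of that idea, not the Clique Cover reduction or the solver mentioned above. It assumes sets are encoded as bitmasks, and the function names are invented for illustration. The candidate pool is limited to intersections of the given sets: a basis set is only ever used inside targets that contain it, so it can be enlarged to the intersection of those targets without breaking any union.

```python
from itertools import combinations

def candidates(targets):
    """Close the targets under intersection, dropping the empty set."""
    closed = set(targets)
    frontier = set(targets)
    while frontier:
        new = {x & y for x in frontier for y in closed} - closed - {0}
        closed |= new
        frontier = new
    return sorted(closed)

def is_basis(basis, targets):
    """True if every target equals the OR of the basis sets it contains."""
    for t in targets:
        union = 0
        for b in basis:
            if (b & ~t) == 0:  # b is a subset of t
                union |= b
        if union != t:
            return False
    return True

def smallest_basis(targets):
    """Exhaustive search by increasing size; exponential, small inputs only."""
    targets = sorted(set(targets) - {0})
    pool = candidates(targets)
    # The targets themselves always form a basis, so the loop must succeed
    # by the time k reaches len(targets).
    for k in range(len(targets) + 1):
        for combo in combinations(pool, k):
            if is_basis(combo, targets):
                return list(combo)

def mask(members, universe="abcdefgh"):
    """Encode a string of element names as a bitmask (hypothetical helper)."""
    return sum(1 << universe.index(ch) for ch in members)

# Update example from the question: {a,b,c} three times, {c,d}, {a,b,c,d}.
update = [mask(s) for s in ("abc", "abc", "abc", "cd", "abcd")]
print(smallest_basis(update))  # [7, 12], i.e. {a,b,c} and {c,d}
```

The running time grows exponentially with the number of candidate sets, which is consistent with the problem being NP-complete, so this is only practical for toy inputs like the ones in the question.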

Falk Hüffner
  • Good, that's clearly the right problem. How would you go about solving it? – harold Jan 09 '14 at 17:34
  • LP as in Ram's answer would work. Or CP or Metaheuristics. The good choice depends on the scale of your problem, your expertise and if you have additional constraints. Basically, showing that it's "NP complete" is computer science speak for something like: many algo's could work but the best algo is unknown. – Geoffrey De Smet Jan 09 '14 at 19:23
1

This problem was shown in one of the videos of Coursera's Discrete Optimization lectures. IIRC, it's called the set cover problem.

IIRC, it's NP-complete or NP-hard, so look into the typical algorithms (exact algo's for small datasets, metaheuristics for medium/big datasets) and typical frameworks (OptaPlanner, ...)

Geoffrey De Smet
  • Are you sure? This is one of the first things I looked at, and I concluded that it can't be formulated like that. Problem is, the goal is not so much to cover all elements, but to make sure that every given subset can be constructed by taking the union of some things. – harold Jan 08 '14 at 10:15
  • Good point, it's not a canonical set cover problem, but a variant, as each of your subsets is in itself a set cover problem. I wonder though: if you solve the set cover problem for the entire set, don't you have a solution for every subset too? (it might not be the most efficient subset though) – Geoffrey De Smet Jan 08 '14 at 12:13
  • @GeoffreyDeSmet Can you please elaborate? What do you mean by solving for the entire set? – user3170702 Jan 08 '14 at 14:51
1

For this variant of the Set Cover problem, here is an Integer Programming formulation approach, with row generation.

Let's denote the components a,b,c,d... by their Column number. a=1, b=2 etc.

The rows are 'subsets.' Let's say that the EXISTING subsets are S1,...Sm. (These are the ones that HAVE to be covered.)

Notation for NEW subsets

This is the step where we introduce NEW subsets. Let's call the 'atomic' subsets a_x. All a subsets have only one component.

   a1 is the subset {1,0,0,0}
   a2 is the subset {0,1,0,0}
   a3 is the subset {0,0,1,0}
   ...

Let bxy be subsets with two components.

So `b13` is the subset with component 1 and 3 being present.
b13 = {1, 0, 1, 0}
b34 = {0, 0, 1, 1} etc.

cxyz are subsets with three components.
For example, c124 = { 1, 1, 0, 1} etc.

d subsets will have 4 components
e subsets will have 5 components 
and so on.

Row Generation Step

Given an EXISTING set, we generate only the NEW a, b, c ... subsets that it needs.

For example, let's take the subset S1 = {1, 0, 1, 1}
Meaningful sets needed that can help create S1 are
a1, a3, a4. (Note that a2 is not needed since component b is not a component in S1.)
b13, b14, b34.
c134

PREPROCESSING STEP: For each Sj in EXISTING SETS, generate new sub sets, using the procedure mentioned above. We create only as many ax, bxy, cxyz dxyzw... as needed. This step is needed before the formulation step.

In the worst case, there are (2^num_components-1) subsets needed per Sj. But they are easy to generate.
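As a rough illustration of the preprocessing step, the sub-subsets of one row can be enumerated with `itertools.combinations`. This is a sketch under the assumption that a row is given as a tuple of 0/1 flags; `sub_subsets` and `label` are hypothetical helper names, and the labels follow the a/b/c naming used above.

```python
from itertools import combinations

def sub_subsets(row):
    """Yield every nonempty subset of the components present in a 0/1 row,
    as tuples of 1-based column numbers."""
    present = [i + 1 for i, bit in enumerate(row) if bit]
    for size in range(1, len(present) + 1):
        yield from combinations(present, size)

def label(components):
    """Name a subset in the style used above: a3, b13, c134, ...
    (ambiguous once column numbers reach 10; fine for a small example)."""
    prefix = "abcdefghijklmnopqrstuvwxyz"[len(components) - 1]
    return prefix + "".join(str(c) for c in components)

s1 = (1, 0, 1, 1)
print([label(c) for c in sub_subsets(s1)])
# ['a1', 'a3', 'a4', 'b13', 'b14', 'b34', 'c134']
```

A row with k components set yields 2^k - 1 sub-subsets, which matches the worst-case count given above.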

Example Problem

Now the formulation for the following problem:

  a b c d
1 1 1 1  
2 1 1 1 
3 1 1 1
4     1 1
5 1 1 1 1

We will have one constraint for each ROW. Each set has to be "covered"

For the problem above, here's the formulation

Formulation

Objective Minimize sum of all Subsets.
 Min sum (a_x) + sum (b_xy) + sum (c_xyz) + sum (d_xyzw)

Subject to:

   a1 + a2 + a3 + b12 + b13 + b23 + c123  >= 1 \\ Set 1 has to be formed
   a1 + a2 + a3 + b12 + b13 + b23 + c123  >= 1 \\ Set 2 has to be formed
   a1 + a2 + a3 + b12 + b13 + b23 + c123  >= 1 \\ Set 3 has to be formed
   a3 + a4            + b34               >= 1 \\ Set 4 has to be formed
   a1 + a2 + a3 + a4 + b12 + b13 + ... + b34 + c123 + ... + d1234  >= 1 \\ Set 5 has to be formed

 a's, b's, c's, d's Binary

Upper bound: By exploiting the fact that you need at most j subsets (Number of existing Subsets) you can even add a cut. Objective function has to be j or lower.

Hope that helps.

Ram Narasimhan
  • But the goal is not to cover the component space, the goal is to be able to construct all the given subsets. And the output is allowed to contain arbitrary sets, not just the given ones. – harold Jan 09 '14 at 15:59
  • Ah, thanks Harold. I misunderstood. I will either reformulate (or will retract the answer if I am unable to.) Thanks. – Ram Narasimhan Jan 09 '14 at 16:05
  • Please solve it, I'm really curious about the answer :) – harold Jan 09 '14 at 16:06
  • Okay, I am missing something then. Why isn't the set of all components =1 {1,1,1,1} the solution? In the updated problem by OP, why is 5 not the solution? – Ram Narasimhan Jan 09 '14 at 16:11
  • Because with only 5, you can not construct 1 or 4 by taking the union of something, you can only get 5 – harold Jan 09 '14 at 16:12