Say I have 5 collections that contain a bunch of strings (hundreds of lines).

Now I want to extract, from each of these collections, the minimum number of lines that uniquely identifies that one collection.

So if I have

Collection 1:

A B C

Collection 2:

B B C

Collection 3:

C C C

Then collection 1 would be identified by A.

Collection 2 would be identified by BB (not BC, since B and C both occur in collection 1).

Collection 3 would be identified by CC.

Is there any algorithm already out there that does this kind of thing? Name?

Thanks, Wesley

  • My first idea is that if you have 5 collections of size 500 each, you need to read more than 100 elements from each collection to say that at least 2 don't match (AAA... BBB... ... EEE...), and 251 elements for each pair (AAA... BBB...). – user unknown Oct 28 '11 at 21:43
  • Yes, sorry CC - not CCC. Order is not important. Thx for the idea. –  Oct 28 '11 at 21:49
  • I suspect this is related to figuring out whether a set of vectors is linearly independent (e.g. http://en.wikipedia.org/wiki/Vector_space_model and http://stackoverflow.com/questions/532029/how-to-check-if-m-n-sized-vectors-are-linearly-independent), and that it has a solution similar to Gaussian elimination: http://en.wikipedia.org/wiki/Gaussian_elimination – Mark Bolusmjak Oct 28 '11 at 22:27

2 Answers

If the order is not important, I would sort all the lists (collections).

Then you could check whether all 5 start with the same element, and group them by the first element.

Starting data (characters instead of strings/lines):

T A L U D
N I O S A D 
R A B E 
T A U C
D A N E B

Sorted internally:

A D U L T
A D O N I S
A B E R 
A C U T
A B E N D

Sorted:

A B E N D
A B E R 
A C U T
A D U L T
A D O N I S

Grouped (2):

(A B) E N D
(A B) E R 
(A C) U T # identified by 2 elements
(A D) U L T
(A D) O N I S

Rest grouped by 3 elements:

(A C) U T     # identified by 2 elements
(A B E) N D
(A B E) R 
(A D U) L T   # only ADU...
(A D O) N I S # only ADO...

Rest grouped by 4 elements:

(A C) U T     # AC..
(A D U) L T   # ADU...
(A D O) N I S # ADO...
(A B E N) D
(A B E R)
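
The grouping idea can be sketched in Python as follows. This is only an illustration under assumptions: the function name `shortest_unique_prefixes` is made up, and each collection is sorted strictly alphabetically, so the prefixes it finds differ from the hand-arranged rows in the example. It finds a distinguishing prefix of the sorted collection, which is not necessarily the smallest possible identifying subset.

```python
def shortest_unique_prefixes(collections):
    """For each collection, return the shortest prefix of its sorted
    elements that no other sorted collection starts with."""
    # Order is not important, so normalise each collection by sorting it.
    keys = [tuple(sorted(c)) for c in collections]
    result = []
    for i, key in enumerate(keys):
        others = [k for j, k in enumerate(keys) if j != i]
        for n in range(1, len(key) + 1):
            prefix = key[:n]
            # The prefix identifies this collection if no other one shares it.
            if all(k[:n] != prefix for k in others):
                result.append(prefix)
                break
        else:
            # Every prefix is shared, e.g. another collection starts the same way.
            result.append(None)
    return result


words = ["TALUD", "NIOSAD", "RABE", "TAUC", "DANEB"]
for word, prefix in zip(words, shortest_unique_prefixes(words)):
    print(word, prefix)
```

With the five example words, only TAUC is identified by two elements; the others need three.
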
user unknown
  • This sounds good, but I'd like to suggest an optimisation. When you are sorting, do it so that the least frequent items are at the front. In this example the symbol A is completely useless because it is contained in all sets, i.e. it does not help with the identification. L and I, however, appear in one set each (ADULT and ADONIS). As a result, you can identify ADONIS with one symbol (I) instead of three (ADO). – mbatchkarov Oct 28 '11 at 23:25
  • That would mean that you have to count the elements in the collections. Are the strings uniformly distributed in the data, or normally distributed? Without knowing such details, I wouldn't start optimizing. The example is not a good indication of anything, because it was deliberately made up purely to show a possible approach. – user unknown Oct 29 '11 at 00:07
  • ... since we only have this example: would we use the overall count of elements to define what to search for? The counts are: `(S,1), (R,1), (O,1), (L,1), (I,1), (C,1), (U,2), (T,2), (N,2), (E,2), (B,2), (D,3), (A,5)`, so S->Adonis, R->Aber, L->Adult, C->Acut, NE->Abend? It is easy for the cases where there is only one match, but how do you search for 'ABEND'? – user unknown Oct 29 '11 at 03:35
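
The counting discussed in these comments can be sketched as below. It is a minimal sketch, not a full solution: `unique_markers` is a hypothetical helper name, and it only reports the elements that occur in exactly one collection, leaving open what to do for a collection such as ABEND that has none.

```python
from collections import Counter


def unique_markers(collections):
    """For each collection, list the elements that occur in no other collection."""
    # Count in how many collections each element occurs
    # (duplicates inside one collection are counted once).
    freq = Counter(e for c in collections for e in set(c))
    return [sorted(e for e in set(c) if freq[e] == 1) for c in collections]


words = ["TALUD", "NIOSAD", "RABE", "TAUC", "DANEB"]
for word, markers in zip(words, unique_markers(words)):
    print(word, markers)
```
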

This is an easy problem to solve. You have one multiset (collection 1) (it is a "multiset" because the same element can occur multiple times), and then a number of other multisets (collections 2..N), and you want to find a minimum-size subset of collection 1 that does not occur in any of the other collections (2..N).

It is an easy problem to solve because it can be solved with simple set theory. I'll explain this first without multisets, i.e. assuming that every line can occur only once in any given set, and then explain how it works with multisets.

Let's call your collection 1 set S and the other collections sets X1 .. XN. Keeping in mind that for now the sets do not contain multiple instances of any item, it is obvious that any singleton set { a } such that a ∈ S and a ∉ Xi distinguishes S from Xi. So it is enough to calculate the set differences S - X1, ..., S - XN and then pick a minimum-size set R that shares an element with every one of these difference sets. This is the SET COVER combinatorial optimization problem (in its equivalent hitting-set form), which is NP-complete, but for your small problem (5 collections) it can be handled easily by brute force.
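
For the plain-set case, the brute force could look roughly like this. It is a sketch under assumptions: the name `min_identifying_subset` is invented for illustration, and candidate subsets are simply tried in order of increasing size, which is only practical for a handful of small collections.

```python
from itertools import combinations


def min_identifying_subset(s, others):
    """Smallest subset of s that is not contained in any set in others.

    Returns None when s is a subset of some other set, since then
    no subset of s can tell the two apart.
    """
    # For each other set, the elements of s that it lacks.
    diffs = [s - x for x in others]
    if any(not d for d in diffs):
        return None
    candidates = sorted(set().union(*diffs))
    # Try subsets from smallest to largest; the first hit is a minimum.
    for size in range(len(candidates) + 1):
        for combo in combinations(candidates, size):
            chosen = set(combo)
            if all(d & chosen for d in diffs):
                return chosen
    return None


print(min_identifying_subset({"a", "b", "c"}, [{"b", "c", "d"}, {"a", "c", "d"}]))
```

In the call above, the difference sets are { a } and { b }, so both elements are needed.
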

When the sets are actually multisets, the only change is that the distinguishing "singleton" sets become multisets containing 1 or more copies of the same element, and thus they have different costs. You can still calculate the set differences as above (you subtract element counts), but now the SET COVER combinatorial optimization part has to take into account the fact that the distinguishing elements can be multisets and not singletons. Here's an illustration of how it works for your problem when we solve for collection 3:

S = {{ c, c, c }}

X1 = {{ a, b, c }}

X2 = {{ b, b, c }}

S - X1 distinguishers: {{ c, c }}

S - X2 distinguishers: {{ c, c }}

Minimum multiset covering a distinguisher for every set: {{ c, c }}

And here is how it works when calculating for collection 1:

S = {{ a, b, c }}

X1 = {{ b, b, c }}

X2 = {{ c, c, c }}

S - X1 distinguishers: {{ a }}

S - X2 distinguishers: {{ a }}, {{ b }}

Minimum multiset covering a distinguisher for every set: {{ a }}
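
The multiset version can be sketched with `collections.Counter`. The function name `min_identifying_multiset` is an assumption made for illustration, and the exhaustive search over one distinguisher per other collection is only meant for small inputs like the five collections in the question.

```python
from collections import Counter
from itertools import product


def min_identifying_multiset(s, others):
    """Smallest sub-multiset of s that is contained in none of the others.

    Returns None if s is contained in some other collection.
    """
    s = Counter(s)
    options = []
    for x in others:
        x = Counter(x)
        # Element e distinguishes s from x when s holds more copies of e;
        # x[e] + 1 copies are then needed to exceed what x contains.
        opts = [(e, x[e] + 1) for e in s if s[e] > x[e]]
        if not opts:
            return None
        options.append(opts)
    best = None
    # Pick one distinguisher per other collection and merge the picks.
    for choice in product(*options):
        need = Counter()
        for e, n in choice:
            need[e] = max(need[e], n)
        if best is None or sum(need.values()) < sum(best.values()):
            best = need
    return best


print(min_identifying_multiset("ccc", ["abc", "bbc"]))
print(min_identifying_multiset("abc", ["bbc", "ccc"]))
print(min_identifying_multiset("bbc", ["abc", "ccc"]))
```

The three calls correspond to collections 3, 1 and 2 from the question and yield two copies of c, one a, and two copies of b respectively.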

Antti Huima