
I'm trying to come up with an algorithm for the following problem:

I've got a collection of triplets of integers - let's call these integers A, B, C. The values stored inside can be big, so it's generally impossible to create an array of size A, B, or C. The goal is to minimize the size of the collection. To do this, we're provided a simple rule that allows us to merge triplets:

  • For two triplets (A, B, C) and (A', B', C'), remove the original triplets and place the triplet (A | A', B, C) if B == B' and C == C', where | is bitwise OR. Similar rules hold for OR-ing the B values (when A and C match) and the C values (when A and B match).

In other words, if two values of two triplets are equal, remove these two triplets, bitwise OR the third values and place the result to the collection.

The greedy approach is usually misleading in cases like this, and it is here too, but I can't find a simple counterexample that would point toward a correct solution. For a list of 250 items where the correct solution is 14, the average size produced by greedy merging is about 30 (it varies from 20 to 70). The sub-optimal overhead grows as the list size increases.

I've also tried playing around with set bit counts, but I've found nothing meaningful beyond the obvious fact that, since the records are unique (which is safe to assume), each merge increases the set bit count.

Here's the stupid greedy implementation (it's just conceptual, so please don't mind the code style):

import java.util.ArrayList;
import java.util.List;

public class Record {
    long A;
    long B;
    long C;

    public static void main(String[] args) {
        List<Record> data = new ArrayList<>();
        // Fill it with some data

        boolean found;

        do {
            found = false;
            outer:
            for (int i = 0; i < data.size(); ++i) {
                for (int j = i + 1; j < data.size(); ++j) {
                    try {
                        Record r = merge(data.get(i), data.get(j));
                        found = true;
                        // remove j first so index i stays valid
                        data.remove(j);
                        data.remove(i);
                        data.add(r);
                        break outer;
                    } catch (IllegalArgumentException ignored) {
                    }
                }
            }
        } while (found);
    }

    public static Record merge(Record r1, Record r2) {
        if (r1.A == r2.A && r1.B == r2.B) {
            Record r = new Record();
            r.A = r1.A;
            r.B = r1.B;
            r.C = r1.C | r2.C;
            return r;
        }
        if (r1.A == r2.A && r1.C == r2.C) {
            Record r = new Record();
            r.A = r1.A;
            r.B = r1.B | r2.B;
            r.C = r1.C;
            return r;
        }
        if (r1.B == r2.B && r1.C == r2.C) {
            Record r = new Record();
            r.A = r1.A | r2.A;
            r.B = r1.B;
            r.C = r1.C;
            return r;
        }
        throw new IllegalArgumentException("Unable to merge these two records!");
    }
}

Do you have any idea how to solve this problem?

Danstahr
  • "The greedy approach" ... which is what exactly? – Bernhard Barker Mar 06 '14 at 11:25
  • Find some two records that match the condition, merge them, put them back and repeat while there are some. – Danstahr Mar 06 '14 at 11:26
  • Something worries me: assume you have (A, B, C), (A', B, C) and (A, B', C). There are two possible merges (A|A', B, C) or (A, B|B', C). If you apply one of the merges, the other becomes impossible. Is this the way it is ? –  Mar 06 '14 at 11:59
  • Yes, exactly. If there was only one way all the time, it'd be trivial. – Danstahr Mar 06 '14 at 12:02
  • It's not clear what you want. Do you need an in-memory data structure? Or do you need a way to store this data in a file? – pentadecagon Mar 06 '14 at 12:07
  • I need the algorithm to find the optimal way of merging so that the final size of the collection is minimal, i.e. it contains as little records as possible. – Danstahr Mar 06 '14 at 12:16
  • Have you tried experiments (like greedy merges in random order) to get some insight on how much you could gain with an optimal solution compared to a greedy one ? –  Mar 06 '14 at 13:18
  • Yes, I've done some, I've updated the question with more info. – Danstahr Mar 06 '14 at 13:38
  • Interesting problem, I really have no clue yet apart from brute-force depth first search. I tried greedy with the rule: if the merge of X and Y can immediately be merged with something, merge X and Y. If there are none of those, then if the merge of X and Y has 1 position in common with any other triple, merge X and Y. If *that* fails, just merge anything. It was worse than I expected. – harold Mar 06 '14 at 14:28
  • Are the integers 32 bit, and do you have a 64-bit cpu? – Bas Mar 06 '14 at 14:47
  • What is the effectiveness of running a randomized greedy merge a number of times (say 1000) and keeping the best ? –  Mar 06 '14 at 14:55
  • @BasBrekelmans: The integers are 64-bit and 64-bit CPU is available. – Danstahr Mar 06 '14 at 15:00
  • This could potentially be a very large problem, as the decision tree can be quite large. If there are N possible merges in the original set, choosing one of them will leave you with *at least* N-1 possible merges for the second step, but maybe more. There might even be pathological cases where you can never get to 0 merge possibilities - but I'm not sure about that... Finding an exact solution in the general case feels hard, and you may just have to be satisfied with a heuristic approximation... I might be wrong though - wouldn't be the first time... – twalberg Mar 06 '14 at 15:00
  • I can be wrong but I have the feeling that this problem has a non-local behavior, meaning that a small change in the order of the merges can have a big impact elsewhere by allowing cascaded merges. If true, this could make progressive refinement of the solution ineffective (simulated annealing, genetic optimization...). –  Mar 06 '14 at 15:00
  • @YvesDaoust: Not ideal, but not as bad as I thought. I used the same data as before, copy-pasted it four times and slightly modified its As. This pattern will be very common in the data I need this whole thing for. While the ideal solution remains 14 (for 1000 items), the best solution found during 1000 tries is 38. For 500 items, it was 37. Scales better than I thought. With some indexing, this could be useful. The size of the input data is expected to be 10^4-10^5. – Danstahr Mar 06 '14 at 15:09
  • @twalberg: The algorithm will always terminate, because each merge reduces the length of the collection by 1. So in ideal case, we end with just one record in the collection. – Danstahr Mar 06 '14 at 15:10
  • More insight: your data compresses astonishingly well. Indeed, for triples of random 64-bit values, virtually no pair would have two equal components, and there should be no compression at all. So there is strong correlation among these numbers. Maybe you should recast your problem in terms closer to the generation process of the triples. –  Mar 06 '14 at 18:14
  • More insight: maybe the data itself can be compressed. I mean when looking at the values in binary representation, some pattern may appear (like some bits being always 0, or 1's always in pairs). You can have a look at the number of distinct values occurring. For this to make sense, you should also take into account all combinations formed by pairwise or-ings. –  Mar 06 '14 at 18:17
  • More insight: you could look at the "compression forest". I mean if you consider all mergings of two triples into a new triple, you can form a graph which is a forest of binary trees. Maybe you can learn from the shape of this forest or statistics like tree depth, average branching factor... for the optimal solution. –  Mar 06 '14 at 18:20
  • More insight: I suggest the following heuristics. 1) Use the greedy merging strategy, but merging first the triples with the smallest Hamming distance. 2) same but with the greatest Hamming distance. [The rationale is to try and benefit from bitwise correlations.] –  Mar 06 '14 at 18:20
  • By the way, how do you know the optimal solution for your 500 triples problem? –  Mar 06 '14 at 18:23
  • Could you tell us a bit of the story behind this? Some context may lead to some new insights. – harold Mar 07 '14 at 06:52
  • @harold I'm given a bunch of events in time that usually have some pattern. The goal is to find the pattern. Each of the bits in the integer represents "the event occurred in this particular year/month/day". For example, the representation of March 7, 2014 would be `[1 << (2014-1970), 1 << 3, 1 << 7]`. The pattern described above allows us to compress these events so that we can say 'the event occurred every 1st in years 2000-2010'. – Danstahr Mar 07 '14 at 10:56
  • OK, then may I suggest an other idea? Represent the pattern as an NFA that accepts precisely that set of dates (easy to construct), convert to DFA, and minimize it. This is also an algorithms with worst-case exponential behaviour but it should be "usually fast", instead of "always terrible" like brute force. – harold Mar 07 '14 at 11:16
  • @harold The problem is that I don't know that pattern. I need to guess it somehow - it'd be the result of the algorithm. – Danstahr Mar 07 '14 at 11:24
  • @Danstahr yes, that's what I'm suggesting. The resulting DFA would describe the pattern. The NFA that you'd build is trivial, just the disjunction of a bunch of trivial NFA's that each accept one specific date. – harold Mar 07 '14 at 11:56
  • @harold I tried that with http://www.brics.dk/automaton/ and it gives worse results than the best-of-10 naive implementation runs. – Danstahr Mar 07 '14 at 14:25
  • Can you add some working minimal example code which is showing the algorithm you are currently using? – Flovdis Mar 07 '14 at 14:26
  • @Danstahr I don't get it, shouldn't it be optimal? – harold Mar 07 '14 at 14:45
  • @Flovdis Question updated. – Danstahr Mar 07 '14 at 15:11
  • @Danstahr: Based on your problem description, you should check out these SO answers (if you didn't do it already): http://stackoverflow.com/a/4202095/44522 and http://stackoverflow.com/a/3251229/44522 – MicSim Mar 07 '14 at 15:31
  • I think there's a polytime reduction from the NP-hard problem [3D matching](http://en.wikipedia.org/wiki/3-dimensional_matching). – David Eisenstat Mar 08 '14 at 06:19
  • @Danstahr Can you give us the input with 250 values where the result is 14? I've written some code and would like to test it. – Irfy Mar 10 '14 at 09:42

3 Answers


This is going to be a very long answer, sadly without an optimal solution (sorry). It is, however, a serious attempt at applying greedy problem solving to your problem, so it may be useful in principle. I didn't implement the last approach discussed; perhaps that approach can yield the optimal solution -- I can't guarantee that, though.

Level 0: Not really greedy

By definition, a greedy algorithm has a heuristic for choosing the next step in a way that is locally optimal, i.e. optimal right now, hoping to reach the global optimum -- which may or may not always be possible.

Your algorithm chooses any mergable pair and merges them and then moves on. It does no evaluation of what this merge implies and whether there is a better local solution. Because of this I wouldn't call your approach greedy at all. It is just a solution, an approach. I will call it the blind algorithm just so that I can succinctly refer to it in my answer. I will also use a slightly modified version of your algorithm, which, instead of removing two triplets and appending the merged triplet, removes only the second triplet and replaces the first one with the merged one. The order of the resulting triplets is different and thus the final result possibly too. Let me run this modified algorithm over a representative data set, marking to-be-merged triplets with a *:

0: 3 2 3   3 2 3   3 2 3
1: 0 1 0*  0 1 2   0 1 2
2: 1 2 0   1 2 0*  1 2 1
3: 0 1 2*
4: 1 2 1   1 2 1*
5: 0 2 0   0 2 0   0 2 0

Result: 4
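
For concreteness, the modified blind merge I traced above looks roughly like this in Java -- a sketch that reuses the Record class and merge method from the question (it would live as another static method in that class, assuming the usual java.util imports); the only change from the question's loop is that the first triplet is overwritten in place instead of appending the merged result:

// Blind merge, modified: overwrite the first triplet of the pair with the
// merged one and drop the second, so nothing is appended at the end.
static void blindMerge(List<Record> data) {
    boolean found;
    do {
        found = false;
        outer:
        for (int i = 0; i < data.size(); ++i) {
            for (int j = i + 1; j < data.size(); ++j) {
                try {
                    data.set(i, merge(data.get(i), data.get(j)));
                    data.remove(j);
                    found = true;
                    break outer;
                } catch (IllegalArgumentException ignored) {
                }
            }
        }
    } while (found);
}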

Level 1: Greedy

To have a greedy algorithm, you need to formulate the merging decision in a way that allows for comparison of options, when multiple are available. For me, the intuitive formulation of the merging decision was:

If I merge these two triplets, will the resulting set have the maximum possible number of mergable triplets, when compared to the result of merging any other two triplets from the current set?

I repeat, this is intuitive for me. I have no proof that this leads to the globally optimal solution, not even that it will lead to a better-or-equal solution than the blind algorithm -- but it fits the definition of greedy (and is very easy to implement). Let's try it on the above data set, showing, between each step, the possible merges (indicated by the indices of the triplet pairs) and the resulting number of mergables for each possible merge:

          mergables
0: 3 2 3  (1,3)->2
1: 0 1 0  (1,5)->1
2: 1 2 0  (2,4)->2
3: 0 1 2  (2,5)->2
4: 1 2 1
5: 0 2 0

Any choice except merging triplets 1 and 5 is fine; if we take the first pair, we get the same interim set as with the blind algorithm (this time I will collapse the indices to remove gaps):

          mergables
0: 3 2 3  (2,3)->0
1: 0 1 2  (2,4)->1
2: 1 2 0
3: 1 2 1
4: 0 2 0

This is where this algorithm decides differently: it chooses triplets 2 and 4 because there is still one merge possible after merging them, in contrast to the choice made by the blind algorithm:

          mergables
0: 3 2 3  (2,3)->0   3 2 3
1: 0 1 2             0 1 2
2: 1 2 0             1 2 1
3: 1 2 1

Result: 3
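
In code, one step of this heuristic could look roughly like the following sketch (again reusing Record and merge from the question, assuming the usual java.util imports; canMerge, countMergablePairs and greedyStep are names I made up for illustration):

// True if the merge rule applies to the pair, i.e. two of the three fields match.
static boolean canMerge(Record r1, Record r2) {
    return (r1.A == r2.A && r1.B == r2.B)
        || (r1.A == r2.A && r1.C == r2.C)
        || (r1.B == r2.B && r1.C == r2.C);
}

// Number of mergable pairs in the current collection.
static int countMergablePairs(List<Record> data) {
    int count = 0;
    for (int i = 0; i < data.size(); ++i)
        for (int j = i + 1; j < data.size(); ++j)
            if (canMerge(data.get(i), data.get(j))) ++count;
    return count;
}

// One greedy step: among all mergable pairs, perform the merge whose resulting
// collection still has the most mergable pairs. Returns false if nothing can be merged.
static boolean greedyStep(List<Record> data) {
    int bestI = -1, bestJ = -1, bestScore = -1;
    for (int i = 0; i < data.size(); ++i) {
        for (int j = i + 1; j < data.size(); ++j) {
            if (!canMerge(data.get(i), data.get(j))) continue;
            List<Record> next = new ArrayList<>(data);
            next.set(i, merge(next.get(i), next.get(j)));
            next.remove(j);
            int score = countMergablePairs(next);
            if (score > bestScore) { bestScore = score; bestI = i; bestJ = j; }
        }
    }
    if (bestI < 0) return false;
    data.set(bestI, merge(data.get(bestI), data.get(bestJ)));
    data.remove(bestJ);
    return true;
}

Repeating greedyStep until it returns false reproduces the traces above.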

Level 2: Very greedy

Now, a second step from this intuitive heuristic is to look ahead one merge further and ask the heuristic question then. Generalized, you would look ahead k merges further, apply the above heuristic, backtrack, and decide on the best option. This gets very verbose, so to exemplify I will only perform one step of this new heuristic with lookahead 1:

          mergables
0: 3 2 3  (1,3)->(2,3)->0
1: 0 1 0         (2,4)->1*
2: 1 2 0  (1,5)->(2,4)->0
3: 0 1 2  (2,4)->(1,3)->0
4: 1 2 1         (1,4)->0
5: 0 2 0  (2,5)->(1,3)->1*
                 (2,4)->1*

Merge sequences marked with an asterisk are the best options when this new heuristic is applied.

In case a verbal explanation is necessary:

Instead of checking how many merges are possible after each possible merge of the starting set, this time we check how many merges are possible after each possible merge of each resulting set after each possible merge of the starting set. And this is for lookahead 1. For lookahead n, you'd be seeing a very long sentence repeating the part "after each possible merge of each resulting set" n times.
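
As a sketch, the lookahead version only changes how a candidate merge is scored. Building on the canMerge/countMergablePairs helpers from the previous sketch, something like the following would do; plugging lookaheadScore(next, k) in place of countMergablePairs(next) inside greedyStep gives lookahead k:

// Score a state by trying up to `depth` further merges and returning the best
// mergable-pair count reachable along any of those paths.
static int lookaheadScore(List<Record> data, int depth) {
    int best = countMergablePairs(data);
    if (depth == 0) return best;
    for (int i = 0; i < data.size(); ++i) {
        for (int j = i + 1; j < data.size(); ++j) {
            if (!canMerge(data.get(i), data.get(j))) continue;
            List<Record> next = new ArrayList<>(data);
            next.set(i, merge(next.get(i), next.get(j)));
            next.remove(j);
            best = Math.max(best, lookaheadScore(next, depth - 1));
        }
    }
    return best;
}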

Level 3: Let's cut the greed

If you look closely, the previous approach has disastrous performance for even moderate inputs and lookaheads (*). For inputs beyond 20 triplets, anything beyond a 4-merge lookahead takes unreasonably long. The idea here is to cut merge paths that seem to be worse than an existing solution. If we want to perform lookahead 10, and a specific merge path yields fewer mergables after three merges than another path does after 5 merges, we may just as well cut the current merge path and try another one. This should save a lot of time and allow large lookaheads, which would get us closer to the globally optimal solution, hopefully. I haven't implemented this one for testing though.

(*): Assuming a large reduction of the input set is possible, the number of merges is proportional to the input size, and the lookahead approximately indicates how much you permute those merges. So you have (|input| choose lookahead) options, a binomial coefficient that for lookahead ≪ |input| can be approximated as O(|input|^lookahead) -- which can also (rightfully) be read as: you are thoroughly screwed.
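
To make the blow-up concrete: at lookahead ≈ |input| the heuristic degenerates into an exhaustive depth-first search over merge orders. A rough sketch of that degenerate case (not something I implemented; it memoizes on an order-independent key and only marks where the cutting described above would go, assuming the usual java.util imports):

// Exhaustive search: the smallest collection size reachable from `data`.
// Exponential in general; the memo only helps when the same state reappears.
static int minSize(List<Record> data, Map<String, Integer> memo) {
    String key = canonicalKey(data);
    Integer cached = memo.get(key);
    if (cached != null) return cached;
    int best = data.size();
    for (int i = 0; i < data.size(); ++i) {
        for (int j = i + 1; j < data.size(); ++j) {
            if (!canMerge(data.get(i), data.get(j))) continue;
            // a cut of the kind described above ("this path already looks
            // worse than a known one") would be applied here
            List<Record> next = new ArrayList<>(data);
            next.set(i, merge(next.get(i), next.get(j)));
            next.remove(j);
            best = Math.min(best, minSize(next, memo));
        }
    }
    memo.put(key, best);
    return best;
}

// Order-independent textual key for memoization.
static String canonicalKey(List<Record> data) {
    List<String> parts = new ArrayList<>();
    for (Record r : data) parts.add(r.A + "," + r.B + "," + r.C);
    Collections.sort(parts);
    return String.join(";", parts);
}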

Putting it all together

I was intrigued enough by this problem that I sat and coded this down in Python. Sadly, I was able to prove that different lookaheads yield possibly different results, and that even the blind algorithm occasionally gets it better than lookahead 1 or 2. This is a direct proof that the solution is not optimal (at least for lookahead ≪ |input|). See the source code and helper scripts, as well as proof-triplets on github. Be warned that, apart from memoization of merge results, I made no attempt at optimizing the code CPU-cycle-wise.

Irfy
  • Thank you for your answer. I'm accepting it as it helped me the most. In the end, I manually found a few repetitive patterns in the data, I merged them first and then I used the "best of x random naive passes" approach. Worked well. – Danstahr Apr 03 '14 at 12:11
  • Just to keep this post updated: we proved that the original problem is NP-hard. The bipartite dimension problem is a special case of the originally posted problem and is NP-complete itself. – Danstahr Oct 20 '14 at 19:37
  • Duh... I could never match given problems to well known np-complete problems well. :) – Irfy Oct 21 '14 at 23:30

I don't have the solution, but I have some ideas.

Representation

A helpful visual representation of the problem is to consider the triplets as points in 3D space. You have integers, so the records will be nodes of a grid. And two records are mergeable if and only if the nodes representing them lie on the same axis-parallel line, i.e. they agree in two of the three coordinates.

Counter-example

I found a (minimal) example where a greedy algorithm may fail. Consider the following records:

(1, 1, 1)   \ 
(2, 1, 1)   |     (3, 1, 1)  \
(1, 2, 1)   |==>  (3, 2, 1)  |==> (3, 3, 1)
(2, 2, 1)   |     (2, 2, 2)  /    (2, 2, 2)
(2, 2, 2)   /

But by choosing the wrong way, it might get stuck at three records:

(1, 1, 1)   \ 
(2, 1, 1)   |     (3, 1, 1)
(1, 2, 1)   |==>  (1, 2, 1)
(2, 2, 1)   |     (2, 2, 3)
(2, 2, 2)   /

Intuition

I feel that this problem is somewhat similar to finding a maximum matching in a graph. Most of those algorithms find the optimal solution by beginning with an arbitrary, suboptimal solution and making it 'more optimal' in each iteration by searching for augmenting paths, which have the following properties:

  1. they are easy to find (polynomial time in the number of nodes),
  2. an augmenting path and the current solution can be combined into a new solution which is strictly better than the current one,
  3. if no augmenting path is found, the current solution is optimal.

I think that the optimal solution to your problem can be found in a similar spirit.

poroszd

Based on your problem description:

I'm given a bunch of events in time that usually have some pattern. The goal is to find the pattern. Each of the bits in the integer represents "the event occurred in this particular year/month/day". For example, the representation of March 7, 2014 would be [1 << (2014-1970), 1 << 3, 1 << 7]. The pattern described above allows us to compress these events so that we can say 'the event occurred every 1st in years 2000-2010'. – Danstahr Mar 7 at 10:56

I'd like to encourage you with the answers that MicSim has pointed at, specifically

Based on your problem description, you should check out these SO answers (if you didn't do it already): stackoverflow.com/a/4202095/44522 and stackoverflow.com/a/3251229/44522 – MicSim Mar 7 at 15:31

The description of your goal is much clearer than the approach you are using. I'm afraid you won't get anywhere with the idea of merging; it sounds fragile, because the answer you get depends upon the order in which you manipulate your data. You don't want that.

It seems you need to keep data and summarize. So, you might try counting those bits instead of merging them. Try clustering algorithms, sure, but more specifically try regression analysis. I should think you would get great results using a correlation analysis if you create some auxiliary data. For example, if you create data for "Monday", "Tuesday", "first Monday of the month", "first Tuesday of the month", ... "second Monday of the month", ... "even years", "every four years", "leap years", "years without leap days", ... "years ending in 3", ...

What you have right now is "1st day of the month", "2nd day of the month", ... "1st month of the year", "2nd month of the year", ... These don't sound like sophisticated enough descriptions to find the pattern.
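
For example, something along these lines could generate such auxiliary descriptors from each event date (a sketch using java.time; the feature names are only illustrative):

import java.time.DayOfWeek;
import java.time.LocalDate;
import java.util.LinkedHashMap;
import java.util.Map;

public class DateFeatures {
    // Turn one event date into a set of boolean descriptors that a
    // correlation/regression step could work on.
    static Map<String, Boolean> features(LocalDate d) {
        Map<String, Boolean> f = new LinkedHashMap<>();
        f.put("monday", d.getDayOfWeek() == DayOfWeek.MONDAY);
        f.put("firstMondayOfMonth",
              d.getDayOfWeek() == DayOfWeek.MONDAY && d.getDayOfMonth() <= 7);
        f.put("firstOfMonth", d.getDayOfMonth() == 1);
        f.put("evenYear", d.getYear() % 2 == 0);
        f.put("everyFourYears", d.getYear() % 4 == 0);
        f.put("leapYear", d.isLeapYear());
        f.put("yearEndsIn3", d.getYear() % 10 == 3);
        return f;
    }
}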

If you feel it is necessary to continue the approach you have started, then you might treat it more as a search than a merge. What I mean is that you're going to need a criterion/measure of success. You can do the merge on the original data while requiring strictly that A == A'. Then repeat the merge on the original data while requiring B == B'. Likewise C == C'. Finally compare the results (using that criterion/measure). Do you see where this is going? Your idea of bit counting could be used as a measure.

Another point: you could do better on performance. Instead of double-looping through all your data and matching up pairs, I'd encourage you to do single passes through the data and sort it into bins. The HashMap is your friend. Make sure to implement both hashCode() and equals() on the key. Using a Map you can group data by a key (say, where month and day both match) and then accumulate the years in the value, as in the sketch below. Oh, man, this could be a lot of coding.
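
A sketch of that binning pass (keyed here on "month and day both match", taking A = years, B = months, C = days as in your date encoding; it would live as another static method in the question's Record class with the usual java.util imports, and using a two-element list as the key avoids writing hashCode()/equals() by hand, though a small key class works just as well):

// One pass: bin records by (B, C) and OR together all the A values of each bin.
static List<Record> mergeWhereBAndCMatch(List<Record> data) {
    Map<List<Long>, Long> bins = new HashMap<>();
    for (Record r : data) {
        List<Long> key = Arrays.asList(r.B, r.C);   // List has value-based equals/hashCode
        bins.merge(key, r.A, (a, b) -> a | b);      // accumulate the years
    }
    List<Record> out = new ArrayList<>();
    for (Map.Entry<List<Long>, Long> e : bins.entrySet()) {
        Record r = new Record();
        r.A = e.getValue();
        r.B = e.getKey().get(0);
        r.C = e.getKey().get(1);
        out.add(r);
    }
    return out;
}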

Finally, if execution time isn't an issue and you don't need performance, then here's something to try. Your algorithm is dependent on the ordering of the data: you get different answers based on different orderings. Your criterion for success is the answer with the smallest size after merging. So, repeatedly loop through this algorithm: shuffle the original data, do your merge, save the result. Every time through the loop, keep the result which is the smallest so far. Whenever you get a result smaller than the previous minimum, print out the number of iterations and the size. This is a very simplistic algorithm, but given enough time it will find small solutions. Based on your data size, it might take too long ...
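
A rough sketch of that loop (mergeAllPairs stands for your existing merging loop pulled out into a method; the name and the placement as a static method in the question's Record class are only for illustration):

// Shuffle-and-merge restarts: keep the smallest result seen so far and
// report whenever a new minimum is found.
static List<Record> bestOfRandomPasses(List<Record> data, int passes, java.util.Random rnd) {
    List<Record> best = null;
    for (int pass = 0; pass < passes; ++pass) {
        List<Record> copy = new ArrayList<>(data);
        java.util.Collections.shuffle(copy, rnd);
        mergeAllPairs(copy);                      // any order-dependent greedy merge
        if (best == null || copy.size() < best.size()) {
            best = copy;
            System.out.println("pass " + pass + ": new best size " + copy.size());
        }
    }
    return best;
}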

Kind Regards,

-JohnStosh

johnstosh