I have two sets of intervals: one containing “positive” intervals between, say, 0 and 100; and the other containing “negative” intervals between -100 and 0. The intervals in each set are not necessarily unique (maybe “collection” is a better word in this case than “set”), and can overlap. For example, the positive set is
{ [0,10], [5,15], [5,15], [10,15], [10,20], [25,40] }
and the negative set is
{ [-15,0], [-15,-5], [-20,-15], [-30,-25] }
The adjacent non-overlapping intervals (i.e. those intervals where the right end-point of one is equal to the left end-point of the other) within each set can be combined to form longer intervals, e.g. [0,10] + [10,20] = [0,20] and [-15,0] + [-20,-15] = [-20,0], but [0,10] and [5,15] cannot be combined into [0,15].
The positive and negative intervals may be cancelled with each other if they span exactly the same range in absolute numbers, e.g. [5,15] + [-15,-5] = 0 and [0,10] + [10,20] + [-15,0] + [-20,-15] = [0,20] + [-20,0] = 0.
I am looking for an efficient algorithm for joining and cancelling the intervals in a way that minimizes the total combined length of the remaining intervals. In the example, the remaining total length = len([5,15]) + len([10,15]) + len([25,40]) + len([-30,-25]) = 10 + 5 + 15 + 5 = 35.
Maybe this type of problem has been addressed already somewhere in the literature or here (I couldn’t find anything, but maybe it’s just because I don’t know how to formulate it in a formal way), so I would be grateful for references and links; or a solution posted here would of course also suffice.
Below are my first naive thoughts on the (very) high-level steps that could be taken. The idea is that a positive interval whose left end-point matches with a left end-point of some negative interval is "potentially cancelable" either if its right end-point matches a right end-point of some negative interval, or if one of its adjacent intervals is "potentially cancelable".
Let's use positive numbers for both sets to denote intervals' left (l) and right (r) end-points, calling them l+ / l- and r+ / r- for the positive / negative set. Set S = 0.

1. Find all left end-points such that l+ = l- = l and all right end-points such that r+ = r- = r. For each such l and each r, find n_l = min{number of positive intervals with l+ = l; number of negative intervals with l- = l} and n_r = min{number of positive intervals with r+ = r; number of negative intervals with r- = r}.
2. Find the smallest l_min from the set of matched left end-points {l} from Step 1 and find the largest r_max from the set of matched right end-points {r} from Step 1. Keep all the intervals that fall entirely between l_min and r_max for further processing in the next steps. Calculate S = S + (the total length of the intervals which do not fall entirely between the two bounds l_min and r_max).
3. Order the intervals in each set by their left end-points in ascending order.
4. At each left end-point, arrange the intervals by their length in descending order.
5. Loop over all positive intervals starting at the left-most point l_min.
6. Compare the right end-point of the interval with the set of matched right end-points r from Step 2.
7. If no match in Step 6 is found, then look for a next interval whose left end-point is equal to the right end-point of the current interval.
8. If an interval in Step 7 is found, then use it to go back to Step 6.
9. If no interval in Step 7 is found, then add the length of the current interval to the sum of lengths S. Decrease n_l corresponding to the left end-point of the current interval by 1: n_l := n_l - 1. If the resulting new value of n_l = 0 then go to Step 2. If n_l > 0 then go to Step 5 and take the next interval with the same left end-point as the current interval. Remove the current interval from further steps.
10. If a match in Step 6 is found, then use negative intervals to go to Step 5.
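To make Steps 1 and 2 concrete, here is a minimal Python sketch of just those two steps. It assumes intervals are given as (left, right) tuples and that the negative intervals are first mirrored onto the positive axis; the function names mirror, matched_endpoints and prune are only illustrative. It does not attempt Steps 3 to 10.

```python
from collections import Counter

def mirror(neg):
    """Map negative intervals onto the positive axis: [-15, -5] -> (5, 15)."""
    return [(-r, -l) for l, r in neg]

def matched_endpoints(pos, neg):
    """Step 1: end-points shared by both sets, with the counts n_l and n_r.

    `neg` is assumed to be mirrored already. Counter & Counter keeps the
    minimum of the two counts and drops keys missing from either side.
    """
    n_l = Counter(l for l, _ in pos) & Counter(l for l, _ in neg)
    n_r = Counter(r for _, r in pos) & Counter(r for _, r in neg)
    return dict(n_l), dict(n_r)

def prune(pos, neg, n_l, n_r):
    """Step 2: keep intervals inside [l_min, r_max]; add the rest to S."""
    everything = list(pos) + list(neg)
    if not n_l or not n_r:
        # No matched end-points at all: nothing is kept for later steps.
        return [], [], sum(r - l for l, r in everything)
    l_min, r_max = min(n_l), max(n_r)

    def inside(iv):
        return l_min <= iv[0] and iv[1] <= r_max

    s = sum(r - l for l, r in everything if not inside((l, r)))
    return [iv for iv in pos if inside(iv)], [iv for iv in neg if inside(iv)], s

pos = [(0, 10), (5, 15), (5, 15), (10, 15), (10, 20), (25, 40)]
neg = mirror([(-15, 0), (-15, -5), (-20, -15), (-30, -25)])
n_l, n_r = matched_endpoints(pos, neg)
kept_pos, kept_neg, S = prune(pos, neg, n_l, n_r)
```

For the example sets this gives n_l = {0: 1, 5: 1, 25: 1} and n_r = {15: 2, 20: 1}, so l_min = 0 and r_max = 20, and the two intervals starting at 25 contribute S = 15 + 5 = 20 before the main loop starts.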
work in progress...
[...]
For each set (positive S+ and negative S-) construct the longest possible combinations of intervals treating non-unique intervals as identical. Say there are N_C+ and N_C- different combinations possible each containing N_k+ and N_k- intervals after joining with k+ = 1..N_C+ and k- = 1..N_C-.
Compare these combinations between two sets (starting with those combinations which contain the longest intervals) eliminating / canceling sections which coincide.
Calculate the total remaining length.
Obviously, there are many details that have to be filled in for the above, but at this point I am not even sure if this approach guarantees finding the minimum solution.
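One way to check whether any such heuristic reaches the minimum is to compare it against an exhaustive search on small inputs. The sketch below is only a brute-force baseline (exponential time, not the efficient algorithm asked for): it enumerates every chain of end-to-end intervals in each set, cancels a positive chain against a negative chain with the same span, and recurses on what is left. It assumes intervals are (left, right) pairs with left < right; the names chains and min_remaining are made up for illustration.

```python
from functools import lru_cache

def chains(intervals):
    """Yield ((start, end), indices) for every chain of end-to-end intervals."""
    def extend(start, end, used):
        yield (start, end), used
        for i, (l, r) in enumerate(intervals):
            if l == end and i not in used:
                yield from extend(start, r, used + (i,))

    for i, (l, r) in enumerate(intervals):
        yield from extend(l, r, (i,))

def min_remaining(pos, neg):
    """Minimum total length left after joining and cancelling (brute force)."""
    # Mirror negative intervals onto the positive axis: [-15, -5] -> (5, 15).
    mirrored = [(-r, -l) for l, r in neg]
    total = sum(r - l for l, r in list(pos) + mirrored)

    @lru_cache(maxsize=None)
    def best(p, n):
        # Largest total length that can still be cancelled from p and n.
        by_span = {}
        for span, used in chains(n):
            by_span.setdefault(span, []).append(used)
        result = 0
        for span, p_used in chains(p):
            for n_used in by_span.get(span, []):
                rest_p = tuple(iv for i, iv in enumerate(p) if i not in p_used)
                rest_n = tuple(iv for i, iv in enumerate(n) if i not in n_used)
                gain = 2 * (span[1] - span[0])  # removed from both sets
                result = max(result, gain + best(rest_p, rest_n))
        return result

    p0 = tuple(sorted(map(tuple, pos)))
    n0 = tuple(sorted(map(tuple, mirrored)))
    return total - best(p0, n0)

pos = [(0, 10), (5, 15), (5, 15), (10, 15), (10, 20), (25, 40)]
neg = [(-15, 0), (-15, -5), (-20, -15), (-30, -25)]
print(min_remaining(pos, neg))  # 35
```

On the example sets it returns 35, matching the hand calculation above. Cancelling [0,10] + [10,15] against [-15,0] instead would leave a total of 45, which shows that a greedy choice of the first matching chain is not always optimal.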