2

I two sets of intervals that correspond to the same 1-dimensional (linear) space. Here is a rough visual--in reality, there are many more intervals and they are much more spread out, but this gives the basic idea.

interval graphic

Each of these intervals contains information, and I am writing a program to compare the information in one set of intervals (the red) to the information contained in the other set (the blue).

Here is my problem. I would like to partition the space into n chunks such that there is roughly an equal amount of comparison work to be done in each chunk (the amount of work depends on the number of intervals in that portion of the space). Also, the partition should not split any red or blue interval across two chunks.

So the input is two sets of intervals, and the desired output is a partition of the space such that

  • the intervals are (roughly) equally distributed across each element of the partition
  • no interval overlaps with multiple partition elements

Can anyone suggest an approach or an algorithm for doing this?

Josh Lee
  • 171,072
  • 38
  • 269
  • 275
Daniel Standage
  • 8,136
  • 19
  • 69
  • 116
  • @Daniel: Is it expensive for you to scan the entire space once before beginning comparison, in order to build a comparison list? Also, are you guaranteed to have an equal number of red and blue intervals? Is there any way to identify the sequence of an interval by examining it (e.g. sequence number encoded in a header or something)? – Eric J. Apr 01 '11 at 03:19
  • It is likely that there will not be an equal number of intervals. A quick initial scan of the linear space will probably be required and will not be expensive. I can store pointers to "interval" objects (objects with start and end coordinates) in multiple data structures if necessary, although it would be too memory intensive to store much more than that. An interval tree and a dynamic array came to mind first, but I'm trying to figure out how I can use them... – Daniel Standage Apr 01 '11 at 03:34
  • @Daniel - Am I missing something, or can't you just create a list of pointer pairs with a quick scan, then partition that list for processing? – Eric J. Apr 01 '11 at 03:44
  • 2
    There is not necessarily *any* solution, since you're only allowed to "snip" the space into segments at points that are not covered by either a red or a blue interval. E.g. if the intervals are arranged like the bricks in 2 rows of brickwork, there will be no such points. (Also you don't say whether or not the intervals are guaranteed not to overlapping within their respective sets, but my counterexample holds even if they don't.) – j_random_hacker Apr 01 '11 at 05:13
  • It is not clear what depends the workload of a chunk - you say the number of comparison, which depends on the intervalls. But which intervalls, the ones from set 1 or from set 2, or from both? If it would depend just on one of them the distribution of the intervalls in would be interresting. – flolo Apr 01 '11 at 07:50
  • @j_random_hacker Yes, it's certainly possible to construct pathological data sets like the brickwork one you mentioned, and I guess the algorithm should maintain that as a single chunk no matter what the value provided for *n*. However, as I mentioned there will be many of these intervals and they will be *much more spread out* than shown in the example graphic. That is the operating assumption. – Daniel Standage Apr 01 '11 at 11:11
  • @frolo It depends on the number of intervals from both sets. For example, it the red set contains 4000 intervals and the blue set contains 4400 and n = 4, then ideally each chunk would contain 1000 red intervals and 1100 blue intervals. Of course, the intervals are not distributed perfectly evenly, so the number of intervals in each chunk may be a little more or a little less. – Daniel Standage Apr 01 '11 at 11:15
  • @Eric J. Maybe I am missing something simple, but it still is not clear to me. Pointer pairs? Pointers to pairs of intervals? Pairs of overlapping intervals from the two sets? What about the intervals that don't overlap with anything from the other set? In principle, this sounds like an easy problem, but I'm having trouble sorting out the details. – Daniel Standage Apr 01 '11 at 11:17
  • Could you be more specific about how the work depends on both numbers? – Null Set Apr 03 '11 at 20:43

1 Answers1

2

Define a "word" to be a maximal interval in which every point belongs either to a red interval or a blue interval. No chunk can end in the middle of a word, and every union of consecutive words is a potential chunk. Now apply a minimum raggedness word-wrap algorithm to the words, where the length of a word is defined to be the number of intervals it contains (line = chunk).

qrqwe
  • 299
  • 1
  • 4
  • The tricky part to using this is guessing a line width that results in the n lines needed. – Null Set Apr 03 '11 at 20:24
  • The rest of the question makes that sound like a soft constraint. More pressing is the question of whether non-contiguous partitions are allowed. – qrqwe Apr 03 '11 at 20:31
  • Yes, a sub-optimal but near-optimal solution is acceptable, but "words" must be maintained in sorted order. – Daniel Standage Apr 04 '11 at 11:12
  • I think this is the best thing to do, using a binary search to find a good line width. I tried changing the DP recursion so that we have a fixed, known number of lines and the unknown linewidth is dependent on them, but this causes a state space blowup (= inefficiency) because now the best cost of a solution starting at the jth word (assuming an exponent of 2 in the cost function) depends not just on the sum of the lengths of lines we have already (something we can immediately get from j) but also on the number of those lines, and the sum of squares of their lengths. – j_random_hacker Apr 05 '11 at 11:54
  • I had worked with the idea of a "word" previously (I called it something else), but for some reason I didn't think of it for this problem. If there are k words, it turns out that splitting the space so that there are k/n words in each chunk is an acceptable heuristic. Not exactly what you suggested, but reminding me of the "word" concept is just what I needed! – Daniel Standage Apr 06 '11 at 00:51