Splitting apart overlapping segments

Question

Here is a vector of segments

class Segment
{
public:
   size_t left;
   size_t right;
   char ID;
   Segment(size_t a, size_t b, char c):left(a), right(b), ID(c){assert(left<right);}
};


std::vector<Segment> A = {{3, 10, 'A'}, {7, 22, 'B'}, {14, 17, 'C'} , {16, 19, 'D'}, {25, 31, 'E'}, {28, 32, 'F'}, {34, 37, 'G'}, {34, 37, 'H'}, {46, 49, 'I'}, {52, 59, 'J'}}

The vector A is sorted based on the attribute left. One could draw the content of this vector as

       -------  'A'
           ---------------  'B'
                  ---  'C'
                    ---   'D' 
                             ------  'E'
                                ----   'F'
                                      ---  'G'
                                      ---  'H'
                                                  ---  'I'
                                                        -------  'J'

I would like to create object B that contains small non-overlapping segments from those big segments in A. To get to B, we must shrink the segments that overlap with one another and create a new segment with ID X for all the places that overlap. Vector B also needs to be sorted based on the left attribute.

For the above example, the expected output is

std::vector<Segment> B = {{3, 7, 'A'}, {7, 10, `X`}, {10, 14, 'B'}, {14, 19, 'X'}, {19, 22, 'B'} , {25, 28, 'E'}, {28, 31, 'X'}, {31, 32, 'F'}, {34, 37, 'X'}, {46, 49, 'I'}, {52, 59, 'J'}}


       ----  'A'
           ---  'X'  (overlap between 'A' and 'B')
              ----  'B'
                  --------  'X' (overlap between 'B', 'C' and 'D')
                          ---  'B'  -> Note that 'B' is now split in two
                                ---  'E'
                                   ---   'X'  (overlap between 'E' and 'F')
                                      -  'F'
                                         ---  'X' (overlap between 'G' and 'H')
                                                     ---  'I'
                                                           -------  'J'

Can anyone give me a hand?

Note that, unlike in the above example, two segments in A can actually have the same ID (but then, it would be impossible for them to overlap). Note also that A is not const and can be modified during the operation. For performance consideration, note that the vector A is typically relatively short; between 1 and about a hundred (or a few hundreds) segments long. The values left and right are typically quite large (in the range 0 to about 1e9) and only few of those segments will intersect. Generally, when there are few segments, those segments will be quite wide (when size is 1, the single segment will often be about 1e9 in width). Finally, you can print the above diagram with

void print(std::vector<Segment>& v)
{
    for (auto& elem : v)
    {
        std::cout << "{" << elem.left << ", " << elem.right << ", " << elem.ID << "} ";
    }
    std::cout << "\n";

    for (auto& elem : v)
    {
        for (size_t i = 0 ; i < elem.left ; ++i)
            std::cout << " ";
        for (size_t i = 0 ; i < elem.right - elem.left ; ++i)
            std::cout << "-";
        std::cout << "  " << elem.ID << "\n";           
    }
    std::cout << "\n\n\n";  
}

An algorithm that does not need the input to be sorted would be even better.

Attempt

Just to show effort, here is an attempt that 1) is buggy and 2) would be a relatively slow implementation. Let's call "breakpoint", any right of left in the following B vector. The idea is to jumps from one breakpoint to the next by systematically searching among previous and following segments for a potential next breakpoint. In doing so, it should keep track what ID (if the distance between breakpoint matches at least one segment in A) should be given in the new segment.

std::vector<Segment> foo(std::vector<Segment>& A)
{
    if (A.size() <= 1) return A;

    std::vector<Segment> B;
    B.reserve(A.size());
    
    size_t A_index = 0;
    size_t currentPos = A[A_index].left;
    while ( A_index < A.size())
    {
        auto nextPos = A[A_index].right;

        //std::cout << "currentPos = " << currentPos << "\n";
        //std::cout << "nextPos before search = " << nextPos << "\n";

        bool isIntersection = false;
        // Search in preceding Segments
        for (size_t i = A_index - 1 ; i < A.size() ; --i)
        {
            if (A[i].right > currentPos && A[i].right < nextPos )
            {
                nextPos = A[i].right;
                isIntersection = true;
                //std::cout << "Found " << nextPos << " in preceding segment\n";
            }
        }

        // Search in following Segments
        for (size_t i = A_index+1 ; i < A.size() ; ++i)
        {
            if ( A[i].left > currentPos && A[i].left < nextPos)
            {
                nextPos = A[i].left;
                //std::cout << "Found left of " << nextPos << " in following segment\n";
                break;
            }

            if ( A[i].right > currentPos &&  A[i].right < nextPos )
            {
                nextPos = A[i].right;
                isIntersection = true;
                //std::cout << "Found right of " << nextPos << " in following segment\n";
                break;
            }
        }

        // create new Segment
        if (!isIntersection)
        {
            B.push_back({currentPos, nextPos, A[A_index].ID});
        } else
        {
            B.push_back({currentPos, nextPos, 'X'});
        }
        if (nextPos == A[A_index].right)
        {
            ++A_index;
            nextPos = A[A_index].left;
        }
        currentPos = nextPos;
    }

    return B;
}


int main()
{
    std::vector<Segment> A = {{3, 10, 'A'}, {7, 22, 'B'}, {14, 17, 'C'} , {16, 19, 'D'}, {25, 31, 'E'}, {28, 32, 'F'}, {34, 37, 'G'}, {34, 37, 'H'}, {46, 49, 'I'}, {52, 59, 'J'}};
    print(A);
    auto B = foo(A);
    print(B);
}

463035818_is_not_an_ai · Answer 1 · 2020-06-28T12:26:18.800

The following is not the most efficient, but it produces the expected output. Strategy is the following:

(It actually does not require the input vector to be sorted)
Dissect all Segments into 1 wide pieces. Store them in a map that maps the position to the IDs present at that position
Assign new IDs according to overlap
Glue the pieces again together

For convenience I am using a

struct Overlap {
    std::vector<char> IDs;
    Overlap() {}
    void add(char id) {IDs.push_back(id);}
};

And my implementation of the last step requires to drop the requirement of left<right in the constructor.

int main() {
    std::vector<Segment> A = {{3, 10, 'A'}, {7, 22, 'B'}, {14, 17, 'C'} , {16, 19, 'D'}, {25, 31, 'E'}, {28, 32, 'F'}, {34, 37, 'G'}, {34, 37, 'H'}, {46, 49, 'I'}, {52, 59, 'J'}};   
    // dissect
    std::map<size_t,Overlap> over;
    for (const auto& s : A) {
        for (size_t i = s.left; i < s.right; ++i) {
            over[i].add(s.ID);
        }
    }
    // assign new segments 
    std::map<size_t,char> pieces;
    for (const auto& o : over) {
        if (o.second.IDs.size() == 1) {
            pieces[o.first] = o.second.IDs.front();
        } else {
            pieces[o.first] = 'X';
        }
    }
    // glue them
    std::vector<Segment> result;
    auto it = pieces.begin();
    Segment current(it->first,it->first,it->second);  // here left==right !
    ++it;
    for ( ; it != pieces.end(); ++it) {
        if (it->second == current.ID) continue;
        current.right = it->first -1;
        result.push_back(current);
        current = Segment{it->first,it->first,it->second};
    }

    print(result);
}

Output:

{3, 6, A} {7, 9, X} {10, 13, B} {14, 18, X} {19, 24, B} {25, 27, E} {28, 30, X} {31, 33, F} {34, 45, X} {46, 51, I} 
   ---  A
       --  X
          ---  B
              ----  X
                   -----  B
                         --  E
                            --  X
                               --  F
                                  -----------  X
                                              -----  I

Live Example

Actually the only expensive part is 1 (O(number of segments x their width). I believe it can be made more efficient by using two containers, one sorted according to left and one according to right, to enable easier detection of overlaps. However, the above naive implementation is what I would start with. Also if the width is considerably small compared to number of segments then it might be that O(number of segments x width) is better than O( log number of segments x number of segments) for sorting.

Thank you! That's a smart and simple way to go around it. Just while you were posting your answer, I made an edit to specify that type of values that `left` and `right` take. A segment can be up to 1e9 long, so fabricating pieces of width 1 would be very eager in RAM. — Remi.b, Jun 28 '20 at 12:15
@Remi.b oh :). I don't know, the pieces don't necessarily have to be 1 wide, when there is no overlap in a huge region, or same segments spanning over a wide region you only need to store begin and end of the region. In any case I am not claiming that this is the perfect solution. I found the problem interesting and my advice is to start with something correct and simple — 463035818_is_not_an_ai, Jun 28 '20 at 12:18
@Remi.b I'd probably go for two sorted containers. Then you can pass through the "global left" to "global right" and one container helps you to know when a segment pops out, the other tells you when a segment pops in — 463035818_is_not_an_ai, Jun 28 '20 at 12:20
Something correct and simple is already much more than I managed to do! I'll try to think about these two containers and how to build them. Thanks. +1 — Remi.b, Jun 28 '20 at 12:23

cigien · Accepted Answer · 2020-07-04T15:35:09.113

Here's a solution that computes all the transition points created by segments, and then reconstructs the new segments using these points.

The algorithm is:

Every segment generates 2 transition points, one for the opening, and one for the closing of the segment.
The transition points are sorted.
Construct new segments from every adjacent pair of transition points. Each pair of points represents either:

a) an empty segment (no new segment is added)

b) a single segment (a segment with .ID is added)

c) multiple segments (a segment with 'X' is added)
Newly constructed segments might contain adjacent X segments, so they need to be merged.

First, a simple struct to store the transition points:

struct Point
{
    size_t location; 
    bool overlap;    // does this point start/close a new segment
    char ID;
};

The implementation is:

std::vector<Segment> foo(std::vector<Segment> const & segments)
{
    // generate all transition points
    std::vector<Point> points;
    for (auto const & seg : segments)
    {
        points.push_back({seg.left, true, seg.ID});
        points.push_back({seg.right, false, seg.ID});
    }

    // sort transition points
    std::sort(points.begin(), points.end(), 
      [](auto a, auto b) { return a.location < b.location; });

    std::vector<Segment> res;

    // initialize overlaps
    std::multiset<char> overs{points[0].ID};

    // for every adjacent transition point
    for(auto i = 1u; i < points.size(); ++i) 
    {
        auto &a = points[i - 1];
        auto &b = points[i];

        // if there is a jump in between transition points
        if (a.location < b.location)
           switch (overs.size())
           {
               // no segment
               case 0 : break;
               // ony one segment
               case 1 : res.push_back({a.location, b.location, *overs.begin()}); break;
               // overlapping segment
               default : res.push_back({a.location, b.location, 'X'}); break;
           }

        // update overlaps
        if (b.overlap)
           overs.insert(b.ID);
        else
           overs.erase(overs.find(b.ID));  
    }
    
    // merge adjacent 'X' overlaps 
    for(auto i = 0u; i < res.size(); ++i) 
    {
         if (res[i].ID == 'X')
         {
           auto f = std::find_if(res.begin() + i + 1, res.end(),
            [](auto r) { return r.ID != 'X'; });
           res[i].right = (f - 1)->right;
           res.erase(res.begin() + i + 1, f); 
         }
     }
        
    return res;
}

This is an O(n log(n)) algorithm.

Here's a demo.

Thank you! At first, I thought your algorithm does not need input `segments` to be sorted, which is very convenient as it saves time. However, there is one particular scenario where your algorithm fails when the input is not sorted. It is when two segments with the same ID touches each other without overlapping each other. For example, the input `A={{25, 100, 'A'}, {0, 25, 'A'}}` will output `{{0, 25, 'A'}}` instead of `{{0,100,'A'}}` (or instead of `{{25, 100, 'A'}, {0, 25, 'A'}}` which would also work in my case). Is there a way to fix that? — Remi.b, Jul 04 '20 at 15:16
@Remi.b Yes, the input segments don't need to be sorted. Also, I edited the answer to fix the case you just described by making `overs` a `multiset` instead of `set`. Also, if you want the output to be `{{0,100,'A'}}`, then in the last loop, remove the condition that removes only adjacent `X`s. It will be a couple less line of code as well :) [demo](https://godbolt.org/z/vGfj4U) — cigien, Jul 04 '20 at 15:37

Splitting apart overlapping segments

2 Answers2