
A long time ago, I made a game in which a sort of connected-component labeling was required to implement the AI. At the time, I unknowingly used the two-pass algorithm.

Recently, I learned that I can make it faster using a bit-scan based method instead. It takes 1-bit-per-pixel data as input, instead of the typical bytes-per-pixel input, and finds every linear chunk in each scan-line using the BSF instruction. Please see the code below. Cut is a struct which stores the information of a linear chunk of 1 bits in a scan-line.

// u32 and Cut are defined elsewhere; _BitScanForward is the MSVC intrinsic
// that writes the index of the lowest set bit (the BSF instruction).
Cut* get_cuts_in_row(const u32* bits, const u32* bit_final, Cut* cuts) {
    u32 working_bits = *bits;
    u32 basepos = 0, bitpos = 0;
    for (;; cuts++) {
        //find starting position
        while (!_BitScanForward(&bitpos, working_bits)) {
            bits++, basepos += 32;
            if (bits == bit_final) {
                cuts->start_pos = (short)0xFFFF;
                cuts->end_pos = (short)0xFFFF;
                return cuts + 1;
            }
            working_bits = *bits;
        }
        cuts->start_pos = short(basepos + bitpos);

        //find ending position
        working_bits = (~working_bits) & (0xFFFFFFFF << bitpos);
        while (!_BitScanForward(&bitpos, working_bits)) {
            bits++, basepos += 32;
            working_bits = ~(*bits);
        }
        working_bits = (~working_bits) & (0xFFFFFFFF << bitpos);
        cuts->end_pos = short(basepos + bitpos);
    }
}

First, it uses the BSF instruction to find the first position at which a 1 bit appears. Once found, it finds the first position at which a 0 bit appears after it, using bit inversion and bit masking, then repeats the process.
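
The scan-within-a-word step can be sketched in portable C++. `ctz32` and `runs_in_word` are illustrative names of my own, with a software bit-scan standing in for the MSVC intrinsic:

```cpp
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Software stand-in for BSF/_BitScanForward: index of the lowest set bit.
// (x must be non-zero; GCC/Clang could use __builtin_ctz instead.)
static unsigned ctz32(uint32_t x) {
    unsigned n = 0;
    while (!(x & 1u)) { x >>= 1; ++n; }
    return n;
}

// Find every run of consecutive 1 bits in a single 32-bit word using the
// invert-and-mask trick: scan to the first 1, then invert and mask to scan
// to the first 0 after it. Returns half-open [start, end) pairs.
std::vector<std::pair<unsigned, unsigned>> runs_in_word(uint32_t w) {
    std::vector<std::pair<unsigned, unsigned>> out;
    while (w) {
        unsigned start = ctz32(w);                   // first 1 bit: run begins
        uint32_t inv = ~w & (0xFFFFFFFFu << start);  // 1s mark where the run ends
        unsigned end = inv ? ctz32(inv) : 32;        // run may hit the word boundary
        out.push_back({start, end});
        if (end == 32) break;
        w &= 0xFFFFFFFFu << end;                     // drop the run just consumed
    }
    return out;
}
```

The author's version additionally carries `basepos` across word boundaries and emits a 0xFFFF sentinel at the end of each row; this sketch only shows the per-word scanning core.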

After getting the starting and ending positions of all linear chunks of 1s (I prefer to call them 'cuts') in every scan-line, it assigns labels to them in the usual CCL manner. In the first row, every cut gets a different label.

For each cut in the remaining rows, it first checks whether any cuts in the row above are connected to it. If no upper cut is connected, it gets a new label. If exactly one upper cut is connected, it copies that cut's label. If several upper cuts are connected, their labels are merged and it gets the merged one. This can be done easily with two pointers advancing through the upper and lower cuts. Here is the full code for that part.

Label* get_labels_8c(Cut* cuts, Cut* cuts_end, Label* label_next) {
    Cut* cuts_up = cuts;
    
    //generate labels for the first row
    for (; cuts->start_pos != 0xFFFF; cuts++) cuts->label = [GET NEW LABEL FROM THE POOL];
    cuts++;

    //generate labels for the rests
    for (; cuts != cuts_end; cuts++) {
        Cut* cuts_save = cuts;
        for (;; cuts++) {
            u32 start_pos = cuts->start_pos;
            if (start_pos == 0xFFFF) break;

            //Skip upper slices ends before this slice starts 
            for (; cuts_up->end_pos < start_pos; cuts_up++);

            //No upper slice meets this
            u32 end_pos = cuts->end_pos;
            if (cuts_up->start_pos > end_pos) {
                cuts->label = [GET NEW LABEL FROM THE POOL];
                continue;
            };

            Label* label = label_equiv_recursion(cuts_up->label);

            //Next upper slice can not meet this
            if (end_pos <= cuts_up->end_pos) {
                cuts->label = label;
                continue;
            }

            //Find next upper slices meet this
            for (; cuts_up->start_pos <= end_pos; cuts_up++) {
                Label* label_other = label_equiv_recursion(cuts_up->label);
                if (label != label_other) [MERGE TWO LABELS]
                if (end_pos <= cuts_up->end_pos) break;
            }
            cuts->label = label;
        }
        cuts_up = cuts_save;
    }
    return label_next;
}
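
The bracketed placeholders suggest a union-find label pool behind `label_equiv_recursion`. A minimal sketch of what such a pool might look like (`LabelPool`, `new_label`, `find` and `merge` are hypothetical names, not the author's actual implementation):

```cpp
#include <cassert>
#include <vector>

// Hypothetical union-find label pool: each label is an index into a parent
// array, and equivalent labels form a tree whose root is the canonical label.
struct LabelPool {
    std::vector<int> parent;

    int new_label() {                       // [GET NEW LABEL FROM THE POOL]
        int id = (int)parent.size();
        parent.push_back(id);               // a fresh label is its own root
        return id;
    }
    int find(int x) {                       // label_equiv_recursion analogue
        while (parent[x] != x) {
            parent[x] = parent[parent[x]];  // path halving keeps trees flat
            x = parent[x];
        }
        return x;
    }
    int merge(int a, int b) {               // [MERGE TWO LABELS]
        a = find(a); b = find(b);
        if (a != b) parent[b] = a;          // point one root at the other
        return a;
    }
};
```

A final pass over the cuts would then replace each stored label with `find(label)` to get the resolved component ids.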

After this, one can use this per-scan-line information to build the array of labels, or any other output they want, directly.

I measured the execution time of this method and found it to be much faster than the two-pass method I used previously. Surprisingly, it turned out to be much faster even when the input data is random. The bit-scanning algorithm is apparently best suited for data with a relatively simple structure, where the chunks in each scan-line are large; it wasn't designed to be used on random images.

What baffled me is that literally nobody seems to mention this method. Frankly, it doesn't seem like an idea that is hard to come up with. It's hard to believe that I'm the first one who tried it.

Perhaps, while my method is better than the primitive two-pass method, it is worse than the more developed algorithms based on the two-pass idea, and so isn't worth mentioning.

However, if the two-pass method can be improved, so can the bit-scan method. I myself found a nice improvement for 8-connectivity: it analyses two neighboring scan-lines at once by merging them with a bitwise OR. You can find the full code and a detailed explanation of how they work here.
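
As a hedged illustration of why the OR idea can work for 8-connectivity (`connected_8` is a helper name I made up, not the author's code): runs on vertically adjacent rows are 8-connected exactly when their intervals overlap or touch diagonally, and OR-ing the two rows fuses precisely such runs into a single run.

```cpp
#include <cassert>

// Whether two runs on vertically adjacent rows are 8-connected, with runs
// given as half-open bit intervals [s, e). They connect when they overlap
// or touch diagonally, i.e. each run starts no later than the other ends.
bool connected_8(unsigned s1, unsigned e1, unsigned s2, unsigned e2) {
    return s2 <= e1 && s1 <= e2;
}

// OR-ing two rows fuses exactly such runs into one run:
//   0b00110 | 0b11000 == 0b11110  (runs touch diagonally -> one fused run)
//   0b00011 | 0b11000 == 0b11011  (gap of two -> runs stay apart)
```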

I learned that there is a benchmark for CCL algorithms named YACCLAB. I'll test my algorithms there against the best CCL algorithms to see how good they really are. Before that, I want to ask a few things here.

My question is,

  1. Are these algorithms I found really new? It's still hard to believe that nobody has ever thought of a CCL algorithm using bit-scanning. If it's already a known technique, why can't I find anyone mentioning it? Were bit-scan based algorithms proven to be bad and forgotten?

  2. If I really have found a new algorithm, what should I do next? Of course I'll test it in a more reliable system like YACCLAB; I'm asking about what comes after that. What should I do to establish these algorithms as mine and spread them?

Antel
  • This just looks like using bit fiddling to do a standard CCL algorithm to me. But maybe I am missing something? – Yakk - Adam Nevraumont Jan 21 '21 at 05:37
  • Which standard CCL algorithm are you referring to? I don't know any CCL algorithm that may be called standard except two: one-component-at-a-time and the two-pass algorithm. Both are quite different from what I tried. Even if I did the same thing without bit fiddling, it would still be quite a different algorithm from those two. – Antel Jan 21 '21 at 06:54
  • So to reiterate in simpler terms: you scan once horizontally and for each row you make a set of labels. Then I'm somewhat confused about the next step, since your description is a bit handwavy. How do you effectively transform your line-labels into labels for the whole image? (E.g. how do you achieve a quick merge?) – Shamis Jan 21 '21 at 07:38
  • I'm sorry for the lack of explanation. I just wanted to know whether there has been any CCL algorithm which uses bit-scanning like mine; I didn't think I needed to explain every part of the algorithm in detail for that. I feel like asking this was itself a mistake, but the mistake has already been made, so... please wait a little. I'll add some explanation of that part too. – Antel Jan 21 '21 at 07:53

1 Answer


So far, I'm a bit sceptical

My reasoning was getting too long for a comment, so here we are. There is a lot to unpack. I like the question quite a bit even though it might be better suited for a computer science site.

The thing is, there are two layers to this question:

  1. Was a new algorithm discovered?
  2. What about the bit scanning part?

You are combining these two, so first I will explain why I would like to think about them separately:
An algorithm is a set of steps (in the more formal definition) that is language-agnostic. As such it should work even without the bit scanning.
The bit scanning, on the other hand, I would consider an optimization technique - we are using a structure that the computer is comfortable with, which can bring performance gains.

Unless we separate these two, the question gets a bit fuzzy since there are several possible scenarios that can be happening:

  1. The algorithm is new and improved and bit scanning makes it even faster. That would be awesome.
  2. The algorithm is just a new way of saying "two pass" or something similar. That would still be good if it beats the benchmarks. In this case it might be worth adding to a library for the CCL.
  3. The algorithm is a good fit for some cases but somehow fails in others (speed-wise, not correctness-wise). The bit scanning here makes the comparison difficult.
  4. The algorithm is a good fit for some cases but completely fails in others(produces incorrect result). You just didn't find a counterexample yet.

Let us assume that 4 isn't the case and we want to decide among 1 to 3. In each case, the bit scanning is making things fuzzy, since it most likely speeds things up even more - so in some cases even a slower algorithm could outperform a better one.

So first I would try to remove the bit scanning and re-evaluate the performance. After a quick look, it seems that CCL algorithms have linear complexity in the image size - you need to check every pixel at least once. The rest is the fight to lower the constant as much as possible (number of passes, number of neighbors to check, etc.). I think it is safe to assume that you can't do better than linear - so the first question is: does your algorithm improve on the complexity by a multiplicative constant? Since the algorithm is linear, that factor translates directly to performance, which is nice.

Second question would then be: Does bit scanning further improve the performance of the algorithm?

Also, since I already started thinking about it, what about a chess-board pattern and 4-connectivity? Or alternatively, a chessboard of 3x3 crosses for the 8-connectivity.
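
The chessboard worst case can be quantified: a word with every other bit set packs the maximum possible number of runs, so a bit-scan pass does maximal work per word. A small counting sketch (`run_count` is a made-up helper, not from the post):

```cpp
#include <cassert>
#include <cstdint>

// Count the runs of 1s ("cuts") in a word: a run starts wherever a bit is
// set and the bit just below it is clear.
unsigned run_count(uint32_t w) {
    uint32_t starts = w & ~(w << 1);
    unsigned n = 0;
    while (starts) { starts &= starts - 1; ++n; }  // clear lowest set bit
    return n;
}

// A chessboard row such as 0xAAAAAAAA yields 16 runs per 32-bit word, the
// maximum, while a solid row 0xFFFFFFFF yields just 1.
```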

Shamis
  • To answer the first question, my algorithm is definitely linear. I doubt there has been any CCL algorithm that is faster than linear. On the second question, I never tried checking whether the algorithm is faster without bit-scanning, but I'll answer as straightforwardly as I can: it would definitely be slower than the two-pass algorithm without bit-scanning. Most of its benefit comes from the fact that it *can* be optimized using bit operations. The two-pass algorithm requires checking a pixel and its upper pixel together; that is impossible to optimize using bit operations. – Antel Jan 21 '21 at 08:51
  • The same thing applies to the other CCL algorithm using bit-scanning that I found. It *uses* the fact that hardware can compute the bitwise OR very fast and efficiently. Is it possible to deal with performance problems without taking hardware into account - instructions, caches and so on? When we talk about algorithms from a purely mathematical point of view, these are indeed irrelevant. However, I think they should be taken into account when we're dealing with performance. – Antel Jan 21 '21 at 08:58
  • Chess-board patterns are where the bit-scanning based algorithms show their worst performance; they are much slower than two-pass in that case. (Although my algorithms seem to be faster in most cases, that does not mean they are always faster.) The alternative algorithm I mentioned (you can find it here under the name BMS, https://github.com/attltb/connected_component_labeling), however, handles them well. – Antel Jan 21 '21 at 09:03
  • I haven't checked all CCL algorithms, but mine is different from any algorithm I know, even without the bit operations. People might have come up with the same algorithm, but they would have simply discarded the idea since it's much slower than the two-pass one. I don't think the algorithm itself can be a competitive CCL algorithm without taking into account the fact that it can be optimized better. – Antel Jan 21 '21 at 09:22
  • That sounds reasonable. What about the case of lines in the format 11101110111011101, 10111011101110111 (every line shifted by two), which would cause a lot of merges? (for 4-connectivity) – Shamis Jan 21 '21 at 09:36
  • It actually doesn't cause that many merges. Merges happen only when processing the second line. (This is because of my label implementation: in all of my algorithms, when two neighbors have different labels, it does not just choose the smaller of the two; it copies the new root of the unified tree of connected labels. I found that this makes the algorithms a little faster.) However, it does check whether things have to be merged a lot, and my algorithm gets slower again because of all that checking. – Antel Jan 21 '21 at 10:11
  • Also the more I look at the basis of your algorithm, it seems like a **smart** reordering of the two pass, that is likely more cache and cpu friendly. Actually it seems that you are doing three passes: rows, merging rows and then re-labeling the unions. However the passes are likely faster then in the usual ordering - especially the rowwise one. (For the 4-connectivity. Not sure about the 8-connectivity part. That one I'm still a bit baffled with ^^). – Shamis Jan 21 '21 at 10:14