Algorithm: use union find to count number of islands

Question

Suppose you need to count the number of islands in a matrix:

                    {1, 1, 0, 0, 0},
                    {0, 1, 0, 0, 1},
                    {1, 0, 0, 1, 1},
                    {0, 0, 0, 0, 0},
                    {1, 0, 1, 0, 1}

We could simply use DFS or BFS when the input matrix fits into memory.

However, what do we do if the input matrix is too large to fit into memory?

I could split the input matrix into smaller files and read them one at a time.

But how do I merge them?

I am stuck on the merge step. My idea is that merging the chunks requires reading some overlapping portion of them, but what is a concrete way to do so?

Trying to understand Matt's solution.

When I drew the sample below on the whiteboard and processed it row by row, merging left and then merging top, it did not seem to work.

In Matt's solution, I am not sure what topidx and botidx mean:

            int topidx = col * 2;
            int botidx = topidx + 1;

How is an "island" defined? – גלעד ברקן Mar 21 '19 at 19:13
Connected 1's are considered one island. – newBike Mar 21 '19 at 20:07

Matt Timmermans · Accepted Answer · 2019-03-21T22:26:00.600

Using union-find, the basic algorithm (without worrying about memory) is:

1. Create a set for every 1.
2. Merge the sets for every pair of adjacent 1s. It doesn't matter what order you find them in, so reading order is usually fine.
3. Count the number of root sets -- there will be one for every island.
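
As a rough illustration of these three steps, here is a minimal in-memory sketch. The class name `BasicIslands`, the rectangular `int[][]` grid of 0/1 values, and the flat `row * cols + col` numbering are assumptions made for this example and are not taken from the implementation further down. To stay short it uses path halving without union-by-size, and it only looks at the up and left neighbours of each land cell.

```java
class BasicIslands {
    // Find the root of s, shortening the path as we go (path halving).
    static int find(int[] parent, int s) {
        while (parent[s] != s) {
            parent[s] = parent[parent[s]];
            s = parent[s];
        }
        return s;
    }

    // Link the root of a to the root of b (no union-by-size in this sketch).
    static void union(int[] parent, int a, int b) {
        parent[find(parent, a)] = find(parent, b);
    }

    // Assumes a rectangular grid whose cells are 0 (water) or 1 (land).
    static int countIslands(int[][] grid) {
        if (grid.length == 0) return 0;
        int rows = grid.length, cols = grid[0].length;
        int[] parent = new int[rows * cols];

        // Steps 1 and 2: create a set per land cell, merge with up and left.
        for (int r = 0; r < rows; r++) {
            for (int c = 0; c < cols; c++) {
                if (grid[r][c] != 1) continue;
                int id = r * cols + c;
                parent[id] = id;
                if (r > 0 && grid[r - 1][c] == 1) union(parent, id, id - cols);
                if (c > 0 && grid[r][c - 1] == 1) union(parent, id, id - 1);
            }
        }

        // Step 3: every remaining root is one island.
        int count = 0;
        for (int r = 0; r < rows; r++) {
            for (int c = 0; c < cols; c++) {
                int id = r * cols + c;
                if (grid[r][c] == 1 && parent[id] == id) count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        int[][] grid = {
            {1, 1, 0, 0, 0},
            {0, 1, 0, 0, 1},
            {1, 0, 0, 1, 1},
            {0, 0, 0, 0, 0},
            {1, 0, 1, 0, 1}
        };
        System.out.println(countIslands(grid)); // prints 6
    }
}
```

The grid in `main` is the one from the question. Down and right neighbours need no separate check because they are covered when those cells look up and left.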

Easy, and with a little care, you can do this using sequential access to the matrix and only 2 rows' worth of memory:

Initialize the island count to 0
Read the first row, create a set for each 1, and merge sets in adjacent columns.
For each additional row:
1. Read the row, create a set for each 1, and merge sets in adjacent columns;
2. Merge sets in the new row with adjacent sets in the previous row. ALWAYS POINT THE LINKS DOWNWARD, so that you never end up with a set in the new row linked to a parent in the old row.
3. Count the remaining root sets in the previous row, and add the number to your island count. These will never be able to merge with anything else.
4. Discard all the sets in the previous row -- you're never going to need them again, because you already counted them and nothing links to them.
Finally, count the root sets in the last row and add them to your island count.

The key to this, of course, is always pointing the links downward whenever you link sets in different rows. This will not hurt the complexity of the algorithm, and if you're using your own union-find, then it is easy to accomplish. If you're using a library data structure then you can use it just for each row, and keep track of the links between root sets in different rows yourself.
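
For readers who want to see the two-row procedure without low-level tricks, the following is a simplified sketch under stated assumptions. The class name `TwoRowIslands`, the input as strings of '1' and '0', the fixed width parameter `w`, and the layout (indexes 0..w-1 for the previous row, w..2w-1 for the current row) are all choices made for this example. It uses path halving only, with no union-by-size, so it shows the downward-link idea rather than the tuned version.

```java
import java.util.Arrays;

class TwoRowIslands {
    // Find the root of s with path halving. Only called on land cells.
    static int find(int[] p, int s) {
        while (p[s] != s) {
            p[s] = p[p[s]];
            s = p[s];
        }
        return s;
    }

    // Count sets in the previous row (indexes 0..w-1) that are still roots.
    static int countRoots(int[] p, int w) {
        int n = 0;
        for (int c = 0; c < w; c++) {
            if (p[c] == c) n++;
        }
        return n;
    }

    // rows are consumed strictly in order; w is the assumed maximum row width.
    static int countIslands(String[] rows, int w) {
        int[] p = new int[2 * w];
        Arrays.fill(p, -1); // -1 means "no set in this cell"
        int count = 0;
        for (String row : rows) {
            // 1. Create sets in the current row and merge with the left neighbour.
            for (int c = 0; c < w; c++) {
                int cur = w + c;
                boolean land = c < row.length() && row.charAt(c) == '1';
                p[cur] = land ? cur : -1;
                if (land && c > 0 && p[cur - 1] != -1) {
                    p[cur] = find(p, cur - 1);
                }
            }
            // 2. Merge with the previous row. bot is always a current-row root,
            //    so every link created here points downward (or sideways).
            for (int c = 0; c < w; c++) {
                if (p[c] != -1 && p[w + c] != -1) {
                    int top = find(p, c);
                    int bot = find(p, w + c);
                    if (top != bot) p[top] = bot;
                }
            }
            // 3. Roots left in the previous row are finished islands.
            count += countRoots(p, w);
            // 4. Discard the previous row and move the current row up.
            //    Current-row links only target current-row cells, so
            //    subtracting w keeps them valid.
            for (int c = 0; c < w; c++) {
                p[c] = (p[w + c] == -1) ? -1 : p[w + c] - w;
                p[w + c] = -1;
            }
        }
        // Finally, count the roots of the last row.
        return count + countRoots(p, w);
    }

    public static void main(String[] args) {
        String[] rows = {"11000", "01001", "10011", "00000", "10101"};
        System.out.println(countIslands(rows, 5)); // prints 6
    }
}
```

Because a set in the current row never links to a parent in the previous row, step 4 can throw the previous row away without breaking any path that is still in use.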

Since this is actually one of my favorite algorithms, here is an implementation in Java. This is not the most readable implementation since it involves some low-level tricks, but is super-efficient and short -- the kind of thing I'd write where performance is very important:

import java.util.Arrays;

public class Islands
{
    private static final String[] matrix=new String[] {
        "  #############  ###   ",
        "  #      #####   ##    ",
        "  #  ##  ##   #   #    ",
        "    ###      ##   #  # ",
        "  #   #########  ## ## ",
        "          ##       ##  ",
        "          ##########   ",
    };

    // find with path compression.
    // If sets[s] < 0 then it is a link to ~sets[s].  Otherwise it is size of set
    static int find(int[] sets, int s)
    {
        int parent = ~sets[s];
        if (parent>=0)
        {
            int root = find(sets, parent);
            if (root != parent)
            {
                sets[s] = ~root;
            }
            return root;
        }
        return s;
    }

    // union-by-size
    // If sets[s] < 0 then it is a link to ~sets[s].  Otherwise it is size of set
    static boolean union(int[] sets, int x, int y)
    {
        x = find(sets,x);
        y = find(sets,y);
        if (x!=y)
        {
            if ((sets[x] < sets[y]))
            {
                sets[y] += sets[x];
                sets[x] = ~y;
            }
            else
            {
                sets[x] += sets[y];
                sets[y] = ~x;
            }
            return true;
        }
        return false;
    }

    // Count islands in matrix

    public static void main(String[] args)
    {
        // two rows of union-find sets.
        // top row is at even indexes, bottom row is at odd indexes.  This arrangement is chosen just
        // to make resizing this array easier.
        // For each value x:
        // x==0 => no set. x>0 => root set of size x. x<0 => link to ~x
        int cols=4;
        int[] setrows= new int[cols*2];

        int islandCount = 0;

        for (String s : matrix)
        {
            System.out.println(s);
            //Make sure our rows are big enough
            if (s.length() > cols)
            {
                cols=s.length();
                if (setrows.length < cols*2)
                {
                    int newlen = Math.max(cols,setrows.length)*2;
                    setrows = Arrays.copyOf(setrows, newlen);
                }
            }
            //Create sets for land in bottom row, merging left
            for (int col=0; col<s.length(); ++col)
            {
                if (!Character.isWhitespace(s.charAt(col)))
                {
                    int idx = col*2+1;
                    setrows[idx]=1; //set of size 1
                    if (idx>=2 && setrows[idx-2]!=0)
                    {
                        union(setrows, idx, idx-2);
                    }
                }
            }
            //merge up
            for (int col=0; col<cols; ++col)
            {
                int topidx = col*2;
                int botidx = topidx+1;
                if (setrows[topidx]!=0 && setrows[botidx]!=0)
                {
                    int toproot=find(setrows,topidx);
                    if ((toproot&1)!=0)
                    {
                        //top set is already linked down
                        union(setrows, toproot, botidx);
                    }
                    else
                    {
                        //link top root down.  It does not matter that we aren't counting its size, since
                        //we will shortly throw it away
                        setrows[toproot] = ~botidx;
                    }
                }
            }
            //count root sets, discard top row, and move bottom row up while fixing links
            for (int col=0; col<cols; ++col)
            {
                int topidx = col * 2;
                int botidx = topidx + 1;
                if (setrows[topidx]>0)
                {
                    ++islandCount;
                }
                int v = setrows[botidx];
                setrows[topidx] = (v>=0 ? v : v|1); //fix up link if necessary
                setrows[botidx] = 0;
            }
        }

        //count remaining root sets in top row
        for (int col=0; col<cols; ++col)
        {
            if (setrows[col*2]>0)
            {
                ++islandCount;
            }
        }

        System.out.println("\nThere are "+islandCount+" islands there");
    }

}
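
On the question of what topidx and botidx mean: per the comments in the code above, the two rows share one array, with the previous (top) row at even indexes and the new (bottom) row at odd indexes, and a link to index t is stored as ~t. The small sketch below illustrates that layout and the `v|1` fix-up used when the bottom row is moved up. The class and helper names (`InterleavedLayoutDemo`, `topIndex`, `bottomIndex`, `moveLinkUp`) are hypothetical and exist only for this illustration.

```java
class InterleavedLayoutDemo {
    // Column col of the top (previous) row lives at an even index.
    static int topIndex(int col) {
        return col * 2;
    }

    // Column col of the bottom (new) row lives at the next odd index.
    static int bottomIndex(int col) {
        return col * 2 + 1;
    }

    // A non-negative value is a set size (or 0 for no set) and is copied as is.
    // A negative value v is a link to the odd index ~v. Setting the low bit of v
    // clears the low bit of ~v, so the link now targets the even index of the
    // same column.
    static int moveLinkUp(int v) {
        return v >= 0 ? v : v | 1;
    }

    public static void main(String[] args) {
        int target = bottomIndex(3);       // 7
        int link = ~target;                // -8, "link to index 7"
        int moved = moveLinkUp(link);      // -7, "link to index 6"
        System.out.println(link);          // prints -8
        System.out.println(moved);         // prints -7
        System.out.println(~moved);        // prints 6
        System.out.println(~moved == topIndex(3)); // prints true
    }
}
```

This is why the last loop of each iteration can copy the bottom row into the top row in a single pass: sizes are copied unchanged and links are retargeted with one bit operation.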

Hey @matt thanks for your reply, what do you mean by links downward? I believe your idea is reading two rows at a time. For a grid cell `G[i][j]` it has to check its surroundings at i-1, i, i+1 and j-1, j, j+1. It is not clear to me how your idea turns into a code implementation. — newBike, Mar 21 '19 at 20:10
First, you only have to check two directions per grid point -- for example up and left -- because the down and right directions will be covered by checking up and left from other cells. Re. links downward: To merge two sets in a union-find structure, you call find() on each one to get their root sets, and then link one root set to the other. Pointing the links downward means that if you need to link together two sets in different rows, always point the upper set to the lower one, instead of the other way around — Matt Timmermans, Mar 21 '19 at 20:29
@MattTimmermans I know this question is old, but may I ask why pointing the upper row downwards does not affect the time complexity? If I understood correctly, your code assigns setrows[toproot] = ~botidx; for that special case, and only uses union-by-size for the other cases. Doesn't it affect the time complexity obtained by using union-by-size and path compression for every case? I can't get my head around it. — Manu Mackwar, Jun 22 '22 at 18:26
@ManuMackwar The reason is that, since the upper rows are simply discarded, no set that you will *actually use* has more than one downward link in its path. To analyze the time complexity, it's easier to think of each row as a disjoint set structure of its own, and then just consider the different cases of unions and finds that could happen due to downward links when you do the vertical merges. — Matt Timmermans, Jun 22 '22 at 19:02
