
I am solving a problem on LeetCode:

Given an unsorted array of integers nums, return the length of the longest consecutive elements sequence. You must write an algorithm that runs in O(n) time. So for nums = [100,4,200,1,3,2], the output is 4.

The Union Find solution to solve this is as below:

#include <numeric>
#include <unordered_map>
#include <vector>
using namespace std;

class Solution {
public:
    vector<int> parent, sz;

    // Find with path compression: re-link each visited node to the root.
    int find(int i) {
        if(parent[i]==i) return i;
        return parent[i]=find(parent[i]);
    }

    // Union by size: attach the smaller tree under the larger tree's root.
    void merge(int i, int j) {
        int p1=find(i);
        int p2=find(j);

        if(p1==p2) return;
        if(sz[p1]>sz[p2]) {
            sz[p1]+=sz[p2];
            parent[p2]=p1;
        } else {
            sz[p2]+=sz[p1];
            parent[p1]=p2;
        }
    }

    int longestConsecutive(vector<int>& nums) {
        sz.resize(nums.size(),1);
        parent.resize(nums.size(),0);

        iota(begin(parent),end(parent),0); // every index starts as its own root

        unordered_map<int, int> m; // value -> index of its first occurrence

        for(int i=0; i<nums.size(); i++) {
            int n=nums[i];
            if(m.count(n)) continue;             // skip duplicates
            if(m.count(n-1)) merge(i,m[n-1]);    // join with predecessor's set
            if(m.count(n+1)) merge(i,m[n+1]);    // join with successor's set
            m[n]=i;
        }

        // The answer is the size of the largest component, stored at its root.
        int res=0;
        for(int i=0; i<parent.size(); i++) {
            if(parent[i]==i && sz[i]>res) {
                res=sz[i];
            }
        }

        return res;
    }
};

This gets accepted by the OJ (Runtime: 80 ms, faster than 76.03% of C++ online submissions for Longest Consecutive Sequence), but is this really O(n), as claimed by many answers, such as this one? My understanding is that Union Find is an O(N log N) algorithm.

Are they right? Or, am I missing something?

Someone
    https://en.wikipedia.org/wiki/Disjoint-set_data_structure – Matt Timmermans Mar 13 '22 at 01:32
  • Where do you see anything with logarithmic complexity being done? In other words, what are your results? BTW: It's customary to provide a [mcve], not just a solution that depends on some online judge boilerplate as environment. – Ulrich Eckhardt Mar 14 '22 at 07:33

1 Answer


They are right. A properly implemented Union Find with path compression and union by size (or rank) has near-linear run time as a whole, while any individual operation has an amortized, effectively constant run time. The exact complexity of m operations of any type is O(m * alpha(n)), where alpha is the inverse Ackermann function. For any n possible in the physical world, the inverse Ackermann function doesn't exceed 4. Thus, for practical purposes we can treat individual operations as constant and the algorithm as a whole as linear.

The key part for path compression in your code is here:

return parent[i] = find(parent[i]);

vs the following that doesn't employ path compression:

return find(parent[i]);

This flattens the node hierarchy by linking every node on the traversed path directly to its final root. Only the first call to find traverses the whole chain; subsequent calls get a direct hit, since each node's parent has been reset to its ultimate root. Note that the second snippet also works correctly; it just does redundant work when you are interested only in the final root and not in the path itself.

Union by size (your code tracks subtree sizes rather than ranks) is evident here:

if(sz[p1]>sz[p2]) {...

It ensures that the root of the larger tree becomes the root of the smaller one. Therefore, fewer nodes need to be reassigned a new parent, hence less work.

Note: The above was updated and corrected based on feedback from @Matt-Timmermans and @kcsquared.

user1984
    Thanks for your answer. So if I understand it, if we use path compression (and union by size as well, in my case), then Union Find is an `O(N)` algorithm. Is that correct? – Someone Mar 12 '22 at 22:29
  • Yes, Union Find with path compression is an amortized `O(N)`, if you want to be exact. – user1984 Mar 12 '22 at 22:31
    Perfect, thanks! – Someone Mar 12 '22 at 22:33
    This answer is incorrect – Matt Timmermans Mar 13 '22 at 01:44
    Normally this detail isn't worth correcting, but, since this question is solely about time complexity: Union find with path compression (and union by size) still isn't linear, but is [barely superlinear](https://en.wikipedia.org/wiki/Disjoint-set_data_structure#Time_complexity) by a multiplicative factor that grows (albeit extremely, extremely slowly) without bound. Linear-time union find algorithms are much more complex than standard ones, and only apply in restricted circumstances. – kcsquared Mar 13 '22 at 02:40
  • @kcsquared is correct, but I note that the code won't work for arbitrary n, since `int` is limited in size. The fudging necessary to apply the theoretical complexity to the real code is much more significant than fudging the difference between inverse ackermann and 1. – Paul Hankin Mar 13 '22 at 12:50
Thanks for the correction @kcsquared . If I understand the linked article correctly, `m` operations on a union find structure with path compression and union by rank is `O(m * alpha(n))` where `alpha(n)` is the inverse Ackermann function and doesn't exceed 4 for any `n` in the physical world, hence any operation is considered an amortized constant one. I see that path compression and union by rank are implemented but don't know about the "splitting, or halving" part. Is my understanding correct? Can you clarify, please? I'd like to update the answer. Thanks :D – user1984 Mar 13 '22 at 15:32
    That’s exactly right. To get the amortized time per operation down to inverse Ackermann(n), it is sufficient to have both of 1. Path compression or path halving in the `find` function and 2. Union by rank or union by size. – kcsquared Mar 13 '22 at 15:50
    Thanks a lot. Will update the answer later :D – user1984 Mar 13 '22 at 15:58
  • Good on you if you're going to fix the answer. At the moment, though, it's incorrect in multiple ways. The amortized time per operation is O(alpha(n)) -- nearly constant. The whole algorithm is nearly linear -- not amortized. Amortization applies only on a per-operation basis. Both path compression and union by size/rank are required to achieve this bound. – Matt Timmermans Mar 13 '22 at 18:08
  • @MattTimmermans so, to be exact, I need to point out that single operations are of amortized constant complexity and the whole algorithm is nearly linear? Isn't this the same as saying it is an amortized linear algorithm? Of course, provided path compression and union by rank are both implemented. – user1984 Mar 13 '22 at 19:33
  • It’s not technically incorrect to talk about the amortized complexity of the whole algorithm, just strange/confusing. It’s exactly like saying ‘the town has 100 houses and each house has 2 people on *average*, so the *average population* of the town is 200. You’d just say ‘population’, since it’s not an average/amortization over anything, you know the exact amount. – kcsquared Mar 13 '22 at 21:29
  • Ohh, that makes sense. Thanks. I was here to answer a question but ended up probably learning more than OP :D – user1984 Mar 13 '22 at 21:39
@user1984, well, I learned too. :) So all in all, the TC is `O(N*alpha(n))` where `alpha(n)` is the inverse Ackermann function. Is this correct? – Someone Mar 13 '22 at 22:08
  • :D yes, and the inverse Ackerman function won't be greater than 4 for any possible n in the physical universe, so we can consider it just a constant. – user1984 Mar 13 '22 at 22:24
    Hello, all. I've updated the answer based on your feedback. Please have a look when you have time and let me know if I could further improve it. Thanks. – user1984 Mar 14 '22 at 07:32
    @user1984, thanks. Nit: you call it union by rank, while I think it is union by size. – Someone Mar 15 '22 at 14:52