4

I know there were similar questions, but none of this specificity.

Input: an n-element array of unsorted elements with values from 1 to (n-1); exactly one value is duplicated (e.g. n=5, tab[n] = {3,4,2,4,1}).

Task: find the duplicate with the best complexity.

I wrote this algorithm:

int tab[] = { 1,6,7,8,9,4,2,2,3,5 };
int arrSize = sizeof(tab)/sizeof(tab[0]);

// Mark value v by adding arrSize to tab[v % arrSize];
// the duplicate's slot gets bumped twice
for (int i = 0; i < arrSize; i++) {
    tab[tab[i] % arrSize] = tab[tab[i] % arrSize] + arrSize;
}

for (int i = 0; i < arrSize; i++) {
    if (tab[i] >= arrSize * 2) {
        std::cout << i;
        break;
    }
}
but I don't think it has the best possible complexity. Do you know a better method/algorithm? I can use any C++ library, but I don't have any ideas.

Is it possible to get better complexity than O(n)?

hero
  • 41
  • 3
  • what makes you think it can be done in less than `O(N)`? Even if you pass the array once and stop once you've found the duplicate, it's on average `O(N/2)`, which is the same as `O(N)` – 463035818_is_not_an_ai Aug 08 '22 at 19:53
  • Why do you think it's not "best complexity"? (Also, best in terms of what? Memory? Time? Big O notation?) – amit Aug 08 '22 at 19:54
  • @amit Time Complexity. – hero Aug 08 '22 at 19:57
  • @463035818_is_not_a_number I don't know if it can be done faster than O(n), but I want to find the fastest algorithm. The complexity here would stay the same even if there were more than one duplicate, which is why I thought it could be done better – hero Aug 08 '22 at 20:00
  • Side note: As of C++17, `int arrSize = sizeof(tab)/sizeof(tab[0]);` -> `int arrSize = std::size(tab);`. Also worth noting that before C++17 `std::size` [is trivial to implement](https://en.cppreference.com/w/cpp/iterator/size#Possible_implementation) – user4581301 Aug 08 '22 at 20:01
  • 4
    Don't get too hung up on big O notation. It is about asymptotic complexity. "fastest algorithm" is not necessarily "algorithm with best asymptotic complexity". Though, it's not possible to do something that requires you to inspect N elements in less than `O(N)` – 463035818_is_not_an_ai Aug 08 '22 at 20:02
  • And very often *fast enough* is much easier to code and test than *fastest*. Economically, you're usually being paid for solutions that hit performance targets, not theoretical maximums, so the faster you can provide solutions, the more you get paid. – user4581301 Aug 08 '22 at 20:05
  • Looking at the code, it doesn't seem to match the question. – Paul Sanders Aug 08 '22 at 20:09
  • @PaulSanders Why do you think so? The task was: find the best possible algorithm (smallest time complexity), and the code is what I tried – hero Aug 08 '22 at 20:12
  • @PaulSanders I think it does. I had to read it twice to understand what it does and why it works. It uses `+size` as a sentinel; that changes not-yet-checked elements, but when they are checked this is undone modulo `size` – 463035818_is_not_an_ai Aug 08 '22 at 20:17
  • I'm really surprised no one suggested the classic xor algorithm (xor all elements in array and 1 to n - 1, this will get you the element in O(n) time and O(1) space complexity). – maraca Aug 08 '22 at 20:24
  • @maraca could you explain what exactly you mean? – hero Aug 08 '22 at 20:28
  • @KrzysztofMarchewka added another answer, although in a real world application you would probably just put everything in a set and could then throw an exception if there is more than one duplicate or a number is >= n. – maraca Aug 08 '22 at 20:29
  • @maraca The duplicate element will cancel itself with xor and you are left with the xor of all others. Take `1, 2, 2, 3, 4` =xor=> `6`. But we know the sum of `1,2,3,...,n-1` is `n * (n - 1) / 2`. So `std::accumulate(arr) - n * (n - 1) / 2` == duplicate. Although you might have integer overflow to consider. But I see in your answer you fix the "xor" by also xoring in each index. That works then. – Goswin von Brederlow Aug 11 '22 at 18:18

3 Answers

6

In terms of big-O notation, you cannot beat O(n) (same as your solution here). But you can have better constants and simpler algorithm, by using the property that the sum of elements 1,...,n-1 is well known.

int sum = 0;
for (int x : tab) {
  sum += x;
}

int duplicate = sum - n * (n - 1) / 2;  // n = number of elements

The constants here will be significantly better, as each array index is accessed exactly once, which is much more cache friendly and efficient on modern architectures.

(Note, this solution ignores integer overflow, but it's easy to account for it by giving `sum` twice as many bits as the array's elements.)

amit
  • 175,853
  • 27
  • 231
  • 333
  • 2
    @Vishrant There are none. All elements appear once, except one that appears twice. (There are `n` elements, the range is `n-1`, and only one appears twice.) – amit Aug 08 '22 at 20:02
5

Adding the classic answer because it was requested. It is based on the idea that if you xor a number with itself you get 0. So if you xor all numbers from 1 to n - 1 and all numbers in the array you will end up with the duplicate.

int duplicate = arr[0];
for (int i = 1; i < n; i++) {  // n = number of elements
    duplicate = duplicate ^ arr[i] ^ i;
}
maraca
  • 8,468
  • 3
  • 23
  • 45
  • 1
    Note for those who come later: This doesn't detect duplicates. It only correctly identifies the one-and-only-one duplicate. It cannot give the location of the duplicate. – user4581301 Aug 08 '22 at 20:46
  • 1
    Yes, if the input is "wrong" the answer will be wrong too. Although you can scan the array for the duplicate afterwards if you want the position, and it is still O(n). In my opinion the cleanest solution is just putting the numbers in a hash set, with some other checks for validity of the input. – maraca Aug 08 '22 at 20:49
  • Btw a similar problem is pretty crazy: numbers in [1, n] with an array of size n + 1, find one (of the possibly multiple) duplicates in O(1) space and O(n) time complexity without modifying the input: https://stackoverflow.com/questions/48841439/find-a-duplicate-in-array-of-integers/48841544#48841544 – maraca Aug 08 '22 at 20:57
  • 1
    You can save half the work here since the xor of the numbers from 1 to k is either k, 1, k+1, or 0, depending on whether k mod 4 is 0, 1, 2, or 3. – Dave Aug 08 '22 at 20:59
  • @Dave thanks for the info, I was actually wondering if we can save some xor operations. The only drawback is that the code gets slightly more complicated. – maraca Aug 08 '22 at 21:02
2

Don't focus too much on asymptotic complexity. In practice the fastest algorithm is not necessarily the one with the lowest asymptotic complexity, because constants are not taken into account: O( huge_constant * N) == O(N) == O( tiny_constant * N).

You cannot inspect N values in less than O(N). Though you do not need a full pass through the array: you can stop once you've found the duplicate:

#include <iostream>
#include <vector>

int main() {
    std::vector<int> vals{1,2,4,6,5,3,2};
    std::vector<bool> present(vals.size());
    for (const auto& e : vals) {
        if (present[e]) {
            std::cout << "duplicate is " << e << "\n";
            break;
        }
        present[e] = true;
    }
}

In a lucky case the duplicate appears early and the loop stops after a few elements; in the worst case the whole vector has to be scanned. On average it is again O(N) time complexity. Further, it uses O(N) additional memory, while yours uses no additional memory. Again: complexity alone cannot tell you which algorithm is faster (especially not for a fixed input size).

No matter how hard you try, you won't beat O(N): no matter in what order you traverse the elements (remembering those already seen), the duplicate may turn up among the first two elements you inspect or only at the very last one, and on average it will be O(N).

Andreas Wenzel
  • 22,760
  • 4
  • 24
  • 39
463035818_is_not_an_ai
  • 109,796
  • 11
  • 89
  • 185
  • Standard caveat about `std::vector`: It's optimized for space and can get a bit weird. [Weird enough that it gets its own documentation page](https://en.cppreference.com/w/cpp/container/vector_bool). The odder aspects shouldn't hurt here, but the time spent on the bit packing needs to be considered. – user4581301 Aug 08 '22 at 20:28