-4

I am trying to solve the programming problem firstDuplicate on codesignal. The problem is "Given an array a that contains only numbers in the range 1 to a.length, find the first duplicate number for which the second occurrence has minimal index".

Example: For a = [2, 1, 3, 5, 3, 2] the output should be firstDuplicate(a) = 3

There are 2 duplicates: numbers 2 and 3. The second occurrence of 3 has a smaller index than the second occurrence of 2 does, so the answer is 3.

With this code I pass 21/23 tests, but then it tells me that the program exceeded the execution time limit on test 22. How would I go about making it faster so that it passes the remaining two tests?

#include <algorithm>

int firstDuplicate(vector<int> a) {
    vector<int> seen;
    
    for (size_t i = 0; i < a.size(); ++i){
        if (std::find(seen.begin(), seen.end(), a[i]) != seen.end()){
            return a[i];
        }else{
            seen.push_back(a[i]);
        }
    }
    
    if (seen == a){
        return -1;
    }
    
}

thedro0-0
  • 1
  • 3
  • 1
    Use a hashmap instead of a vector – Mick Dec 29 '20 at 06:02
  • Read a good [C++ programming book](https://www.stroustrup.com/programming.html) and the documentation of your compiler (e.g. [GCC](http://gcc.gnu.org/)...) and debugger (e.g. [GDB](https://www.gnu.org/software/gdb/)). See [this C++ reference](https://en.cppreference.com/w/cpp). Use [the Clang static analyzer](http://clang-analyzer.llvm.org/) and [gprof(1)](https://man7.org/linux/man-pages/man1/gprof.1.html). See [time(7)](https://man7.org/linux/man-pages/man7/time.7.html) – Basile Starynkevitch Dec 29 '20 at 06:04
  • Take inspiration from *existing* open source C++ software, e.g. [fish](https://fishshell.com/), [RefPerSys](http://refpersys.org/), [Qt](https://qt.io/), [FLTK](https://fltk.org/). Consider using some [C++ container](https://en.cppreference.com/w/cpp/container) – Basile Starynkevitch Dec 29 '20 at 06:07
  • The answer is the same regardless of language. In Java, I had to use `HashMap` and `HashSet` instead of iterating over the array multiple times. – ADTC Jul 29 '21 at 07:57

3 Answers3

1

Anytime you get asked a question about "find the duplicate", "find the missing element", or "find the thing that should be there", your first instinct should be use a hash table. In C++, there are the unordered_map and unordered_set classes that are for such types of coding exercises. The unordered_set is effectively a map of keys to bools.

Also, pass you vector by reference, not value. Passing by value incurs the overhead of copying the entire vector.

Also, that comparison seems costly and unnecessary at the end.

This is probably closer to what you want:

#include <unordered_set>


int firstDuplicate(const vector<int>& a) {
    std::unordered_set<int> seen;
    
    for (int i : a) {
       auto result_pair = seen.insert(i);
       bool duplicate = (result_pair.second == false);
       if (duplicate) {
          return (i);
       }
    }

    return -1;
   
}
selbie
  • 100,020
  • 15
  • 103
  • 173
0

std::find is linear time complexity in terms of distance between first and last element (or until the number is found) in the container, thus having a worst-case complexity of O(N), so your algorithm would be O(N^2).

Instead of storing your numbers in a vector and searching for it every time, Yyu should do something like hashing with std::map to store the numbers encountered and return a number if while iterating, it is already present in the map.

std::map<int, int> hash;
for(const auto &i: a) {
    if(hash[i])
        return i;
    else
        hash[i] = 1;
}

Edit: std::unordered_map is even more efficient if the order of keys doesn't matter, since insertion time complexity is constant in average case as compared to logarithmic insertion complexity for std::map.

Jarvis
  • 8,494
  • 3
  • 27
  • 58
  • `std::unordered_map` is more efficient than `std::map` (hash table vs tree insertion). `std::unordered_set` accomplishes the same thing as a a map of boolean values. – selbie Dec 29 '20 at 06:26
  • Yes, you are right, insertion time is constant average case as compared to log(n) for `std::map`. – Jarvis Dec 29 '20 at 06:28
0

It's probably an unnecessary optimization, but I think I'd try to take slightly better advantage of the specification. A hash table is intended primarily for cases where you have a fairly sparse conversion from possible keys to actual keys--that is, only a small percentage of possible keys are ever used. For example, if your keys are strings of length up to 20 characters, the theoretical maximum number of keys is 25620. With that many possible keys, it's clear no practical program is going to store any more than a minuscule percentage, so a hash table makes sense.

In this case, however, we're told that the input is: "an array a that contains only numbers in the range 1 to a.length". So, even if half the numbers are duplicates, we're using 50% of the possible keys.

Under the circumstances, instead of a hash table, even though it's often maligned, I'd use an std::vector<bool>, and expect to get considerably better performance in the vast majority of cases.

int firstDuplicate(std::vector<int> const &input) { 
    std::vector<bool> seen(input.size()+1);

    for (auto i : input) { 
        if (seen[i])
            return i;
        seen[i] = true;
    }
    return -1;
}

The advantage here is fairly simple: at least in a typical case, std::vector<bool> uses a specialization to store bools in only one bit apiece. This way we're storing only one bit for each number of input, which increases storage density, so we can expect excellent use of the cache. In particular, as long as the number of bytes in the cache is at least a little more than 1/8th the number of elements in the input array, we can expect all of seen to be in the cache most of the time.

Now make no mistake: if you look around, you'll find quite a few articles pointing out that vector<bool> has problems--and for some cases, that's entirely true. There are places and times that vector<bool> should be avoided. But none of its limitations applies to the way we're using it here--and it really does give an advantage in storage density that can be quite useful, especially for cases like this one.

We could also write some custom code to implement a bitmap that would give still faster code than vector<bool>. But using vector<bool> is easy, and writing our own replacement that's more efficient is quite a bit of extra work...

Jerry Coffin
  • 476,176
  • 80
  • 629
  • 1,111