1

I have a std::vector<int> with duplicate values. I can find the unique values using std::unique() and std::vector::erase(), but how can I efficiently find the vector of indices and construct the original vector given the vector of unique values, through an inverse mapping vector. Allow me to illustrate this using an example:

std::vector<int> vec  = {3, 2, 3, 3, 6, 5, 5, 6, 2, 6};
std::vector<int> uvec = {3, 2, 6, 5}; // vector of unique values
std::vector<int> idx_vec = {0, 1, 4, 5}; // vector of indices
std::vector<int> inv_vec = {0, 1, 0, 0, 2, 3, 3, 2, 1, 2}; // inverse mapping

The inverse mapping vector is such that with its indices one can construct the original vector using the unique vector i.e.

std::vector<int> orig_vec(ivec.size()); // construct the original vector
std::for_each(ivec.begin(), ivec.end(), 
    [&uvec,&inv_vec,&orig_vec](int idx) {orig_vec[idx] = uvec[inv_vec[idx]];});

And the indices vector is simply a vector indices of first occurrence of unique values in the original vector.

My rudimentary solution is far from efficient. It does not use STL algorithms and is O(n^2) at worst.

template <typename T> 
inline std::tuple<std::vector<T>,std::vector<int>,vector<int>>
unique_idx_inv(const std::vector<T> &a) {
    auto size_a = size(a);
    std::vector<T> uniques;
    std::vector<int> idx; // vector of indices
    vector<int> inv(size_a); // vector of inverse mapping

    for (auto i=0; i<size_a; ++i) {
        auto counter = 0;
        for (auto j=0; j<uniques.size(); ++j) {
            if (uniques[j]==a[i]) {
                counter +=1;
                break;
            }
        }
        if (counter==0) {
            uniques.push_back(a[i]);
            idx.push_back(i);
        }
    }

    for (auto i=0; i<size_a; ++i) {
        for (auto j=0; j<uniques.size(); ++j) {
            if (uniques[j]==a[i]) {
                inv[i] = j;
                break;
            }
        }
    }

    return std::make_tuple(uniques,idx,inv);
}

Comparing this with the typical std::sort+std::erase+std::unique approach (which by the way only computes unique values and not indices or inverse), I get the following timing on my laptop with g++ -O3 [for a vector of size=10000 with only one duplicate value]

Find uniques+indices+inverse:                       145ms
Find only uniques using STL's sort+erase+unique     0.48ms

Of course the two approaches are not exactly identical, as the latter one sorts the indices, but still I believe the solution I have posted above can be optimised considerably. Any thoughts how on I can achieve this?

Community
  • 1
  • 1
Tasam Farkie
  • 319
  • 3
  • 14

2 Answers2

6

If I'm not wrong, the following solution should be O(n log(n))

(I've changed the indexes in std::size_t values)

template <typename T> 
inline std::tuple<std::vector<T>,
                  std::vector<std::size_t>,
                  std::vector<std::size_t>>
unique_idx_inv(const std::vector<T> &a)
 {
   std::size_t               ind;
   std::map<T, std::size_t>  m;
   std::vector<T>            uniques;
   std::vector<std::size_t>  idx;
   std::vector<std::size_t>  inv;

   inv.reserve(a.size());

   ind = 0U;

   for ( std::size_t i = 0U ; i < a.size() ; ++i )
    {
      auto e = m.insert(std::make_pair(a[i], ind));

      if ( e.second )
       {
         uniques.push_back(a[i]);
         idx.push_back(i);
         ++ind;
       }

      inv.push_back(e.first->second);
    }

    return std::make_tuple(uniques,idx,inv);
}
max66
  • 65,235
  • 10
  • 71
  • 111
1

The O(n^2) arises from your approach to identify duplicates with nested loops over vectors. However, to find out if an element has already been read, a sorted vector or - imho better - an unordered map is more appropriate. So, without writing the code here, I'd suggest to use an unordered map of the form

unordered_map<int,int>, which can hold both the unique values and the indices. I'm not sure if you still need the vectors for this information, but you can easily derive these vectors from the map.

The complexity should reduce to O(n log(n)).

Stephan Lechner
  • 34,891
  • 4
  • 35
  • 58