I'm having trouble converting a string vector of ~ 1.0000.0000 elements into an associative vector of integers.

Input:

std::vector<std::string> s {"a","b","a","a","c","d","a"};

Desired output:

std::vector<int> i {0,1,0,0,2,3,0};

I was thinking of a std::unordered_multiset, as mentioned in Associative Array with Vector in C++, but I can't get it running.

The goal is to reduce the time it takes to convert C++ strings to R strings, which is much faster if I just use numbers.

Thank you for your help!

Edit:

This is how I tried to populate the set:

for (size_t i = 0; i < s.size(); i++)
{
        set.insert(s[i]);
}
schlumpel
  • What did you try doing with `unordered_multiset`? What specific problem did you have with it? – Useless Oct 25 '21 at 15:15
  • If I populate the set as shown in the edit, I only get the first element when I iterate through the set with `ii->data()`. So I think I don't need the values, just the keys. – schlumpel Oct 25 '21 at 15:21
  • We need more code :) What's `set`? And `ii`? Where's the map? Please add this to the question instead of the comments! – Ivan Oct 25 '21 at 15:26
  • Does the input need to be ordered? Why not use std::transform? What do you need a std::unordered_multiset for? Simple code like this could do the trick: `std::vector<std::string> in{"a", "b", "a", "a", "c", "d", "a"}; std::vector<int> out(in.size()); std::transform(in.cbegin(), in.cend(), out.begin(), [] (const std::string &s) { return s[0] - 'a'; });` – Guilherme Ferreira Oct 25 '21 at 15:38

2 Answers

If you need just the keys, why don't you use just a vector?

#include <iostream>
#include <string>
#include <vector>
#include <algorithm>


int main()
{
    std::vector<std::string> s {"a","b","a","a","c","d","a"};
    std::vector<int> out(s.size());
    std::transform(s.begin(), s.end(), out.begin(),[](auto& x)
    {
        return x[0] - 'a';
    });
    for(auto& i : out) std::cout << i << " ";
    std::cout << std::endl;
    return 0;
}

linuxfever

This code produces the desired output for the given input, and it processes 1,000,000 strings of length 3 in 0.4 s, so I think std::unordered_map is a viable choice.

#include <string>
#include <vector>
#include <iostream>
#include <unordered_map>
#include <chrono>
#include <random>

// generator function for creating a large number of strings.
std::vector<std::string> generate_strings(const std::size_t size, const std::size_t string_length)
{
    static std::random_device rd{};
    static std::default_random_engine generator{ rd() };
    static std::uniform_int_distribution<int> distribution{ 'a', 'z' };

    std::vector<std::string> strings;
    std::string s(string_length, ' ');

    for (std::size_t n = 0; n < size; n++)
    {
        for (std::size_t m = 0; m < string_length; ++m)
        {
            s[m] = static_cast<char>(distribution(generator));
        }

        strings.emplace_back(s);
    }

    return strings;
}

int main() 
{
    std::vector<std::string> strings = generate_strings(1000000, 3);
    //std::vector<std::string> strings{ "a","b","a","a","c","d","a" };

    std::unordered_map<std::string, int> map;
    std::vector<int> output;

    // speed optimization, allocate enough room for answer
    // so output doesn't have to reallocate when growing.
    output.reserve(strings.size());

    auto start = std::chrono::high_resolution_clock::now();

    int id = 0;
    for (const auto& string : strings)
    {
        if (map.find(string) == map.end())
        {
            output.push_back(id);
            map.insert({ string, id });
            id++;
        }
        else
        {
            output.push_back(map.at(string));
        }
    }

    auto duration = std::chrono::high_resolution_clock::now() - start;
    auto nanoseconds = std::chrono::duration_cast<std::chrono::nanoseconds>(duration).count();

    auto seconds = static_cast<double>(nanoseconds) / 1.0e9;
    std::cout << "elapsed: " << seconds << " s\n";

    /*
    for (const auto& value : output)
    {
        std::cout << value << " ";
    }
    */

}
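
A possible variation on the loop above, offered as a sketch and not as part of the timed code: with C++17, `try_emplace` does the lookup and the insert in one step, so each string is hashed once instead of up to three times (`find`, `insert`, `at`). The sketch also keeps the distinct strings in order of first appearance, on the assumption that the integers need to be mapped back to strings later. The names `Encoded`, `encode`, `codes` and `levels` are made up for illustration.

```cpp
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical helper type: integer codes plus the distinct strings
// they refer to, stored in order of first appearance.
struct Encoded
{
    std::vector<int> codes;           // one code per input string
    std::vector<std::string> levels;  // levels[code] is the original string
};

// Hypothetical helper: assigns 0, 1, 2, ... to strings in order of
// first appearance and records each distinct string once.
Encoded encode(const std::vector<std::string>& strings)
{
    Encoded result;
    result.codes.reserve(strings.size());

    std::unordered_map<std::string, int> ids;
    for (const auto& s : strings)
    {
        // try_emplace (C++17) inserts {s, next_id} only if s is a new key
        // and returns an iterator to the entry either way.
        const auto next_id = static_cast<int>(result.levels.size());
        const auto [it, inserted] = ids.try_emplace(s, next_id);
        if (inserted)
        {
            result.levels.push_back(s);
        }
        result.codes.push_back(it->second);
    }
    return result;
}
```

For the input from the question, `encode(s).codes` holds the desired integer vector and `encode(s).levels` holds each distinct string once. Whether this is measurably faster than the loop above depends on the data and the standard library, so it is worth timing both.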
Pepijn Kramer
  • Thank you for this awesome solution! On my machine it's even faster (0.2 s), and that's with strings of length ~25! I must say that the same strings are repeated very often, though! – schlumpel Oct 25 '21 at 16:58
  • You may thank the builders of the standard library :) But happy it works for you – Pepijn Kramer Oct 25 '21 at 17:20