
I am implementing MinHash and LSH for similarity search over string elements in C++11. The MinHash sketch in my implementation is a vector of 200 64-bit integers, i.e. vector<uint64_t> MinHashSketch. I have more than 2 million entries. Sketch generation does not take much time, but the bucketing stage takes a long time. I am looking for suggestions to make it faster. The following is my bucketing stage using LSH.

I take consecutive elements of the sketch and hash them to form a bucket id. If bsize = 5, then elements 1-5, 6-10, 11-15, ..., 196-200 of MinHashSketch[i] (the sketch of the ith element) each form one bucket id. The following piece of code does that.

for (int p = 0; p < 200; p += bsize) {  //bsize = 5
  string s = ""; 
  for(int x = p; x < (p+bsize); x++){
    s = s + to_string(MinHashSketch[i].at(x)); // ith element 
  }       
  uint64_t hash1 = 0;  // bucket id
  Hash_function ((uint8_t*)s.c_str(), s.length(), (uint8_t *)&hash1, 0);
  ........
  ........
}
SBDK8219
  • What type is `MinHashSketch`? If it's `vector` the expression `MinHashSketch[i].at(x)` is ill-formed. This looks like a vector of vectors? – Timo Dec 18 '19 at 22:24
  • Do you really need to convert your numbers to strings? Can't you just hash raw bytes? As in `Hash_function(&MinHashSketch[i][p], sizeof(uint64_t)*bsize, ...)` and drop the inner loop. – Igor Tandetnik Dec 18 '19 at 22:25
  • @Timo Yes, it is a vector of vectors. For every element, it is a vector of 200 64-bit integers. – SBDK8219 Dec 18 '19 at 22:43
  • @IgorTandetnik Thanks!! I will try that. I was doing it differently, but had a syntax error. – SBDK8219 Dec 18 '19 at 22:47
  • The second parameter of the hash function is `long unsigned int`. This is the error: `error: no matching function for call to ‘MetroHash64::Hash(__gnu_cxx::__alloc_traits >::value_type*, long unsigned int, uint8_t*, int)’` – SBDK8219 Dec 18 '19 at 22:53
  • Cast the first parameter to `uint8_t*`, same as you are doing now. – Igor Tandetnik Dec 18 '19 at 23:02
  • Thanks @IgorTandetnik !! It worked fine now. It did improve a little bit. It seems I need to work on the other parts of the code. – SBDK8219 Dec 18 '19 at 23:18
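
The suggestion in the comments (hash the raw bytes of each band instead of building a string) could look roughly like the sketch below. It is a minimal illustration, not the asker's actual code: `hash_bytes` is a hypothetical stand-in (a simple FNV-1a) for the `Hash_function` / `MetroHash64::Hash` call from the thread, used only so the example compiles on its own, and `band_bucket_ids` is a made-up helper name. It assumes each sketch is a contiguous `std::vector<uint64_t>`.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical stand-in for the asker's Hash_function; FNV-1a (64-bit) is
// used here only to keep the example self-contained.
static void hash_bytes(const uint8_t* data, std::size_t len,
                       uint8_t* out, uint64_t seed) {
    uint64_t h = 14695981039346656037ULL ^ seed;  // FNV offset basis
    for (std::size_t k = 0; k < len; ++k) {
        h ^= data[k];
        h *= 1099511628211ULL;                    // FNV prime
    }
    std::memcpy(out, &h, sizeof(h));
}

// Computes one bucket id per band of `bsize` consecutive sketch values by
// hashing the band's raw bytes directly (no to_string, no temporary string).
std::vector<uint64_t> band_bucket_ids(const std::vector<uint64_t>& sketch,
                                      std::size_t bsize) {
    std::vector<uint64_t> ids;
    if (bsize == 0) return ids;
    ids.reserve(sketch.size() / bsize);
    for (std::size_t p = 0; p + bsize <= sketch.size(); p += bsize) {
        uint64_t id = 0;
        // The cast to uint8_t* is what resolved the compile error in the thread.
        hash_bytes(reinterpret_cast<const uint8_t*>(&sketch[p]),
                   bsize * sizeof(uint64_t),
                   reinterpret_cast<uint8_t*>(&id), 0);
        ids.push_back(id);
    }
    return ids;
}
```

Called as `band_bucket_ids(MinHashSketch[i], 5)`, this would yield 40 bucket ids for a 200-value sketch. Besides avoiding the repeated string allocations, hashing fixed-width bytes also sidesteps an ambiguity in the string version: without a separator, the concatenated digits of different bands of numbers can coincide (for example "1" + "23" and "12" + "3").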

0 Answers