
I am trying to calculate the Hamming distance between a binary vector and a matrix of binary vectors. The fastest approach I could find uses unsigned 8-bit integers with Numpy:

import numpy as np

data = np.random.randint(0, 2, size=(1000, 64), dtype=np.uint8)  # example data: binary row vectors
np.count_nonzero(data[0] != data, axis=1)  # Hamming distance from row 0 to every row

However, the problem with this approach is that it first finds all the bits that are different and then sums the number of differences. Wouldn't this be a bit of a waste? I tried implementing a basic version in C++ that keeps a running count of the bits that differ, so that a sum isn't required at the end, but it is way slower, probably because Numpy uses SIMD instructions.

So my question is: is there a way to directly calculate the Hamming distance using SIMD instructions in Numpy / Python / Cython?

asked by Wouter; edited by Peter Cordes

1 Answer


Ideally, what you actually want the CPU to be doing is sum += popcount(a[i] ^ b[i]) with chunks as large as possible. E.g. on x86, use AVX2 to XOR 32 bytes at a time with one instruction, then a few more instructions (including vpshufb and vpaddq) to accumulate counts into SIMD vectors of per-element counts, which you horizontally sum at the end.

This is easy with C++ intrinsics for a specific ISA, like x86-64.
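
As a concrete illustration, here is a minimal sketch of that idea with AVX2 intrinsics (my own example, not code from anywhere else): XOR 32 bytes per iteration, popcount them in-register with the vpshufb nibble-LUT trick, then widen the per-byte counts into 64-bit lanes with vpsadbw and accumulate with vpaddq. It assumes n is a multiple of 32 and compilation with -mavx2:

#include <immintrin.h>
#include <cstdint>
#include <cstddef>

// Hamming distance between two byte arrays; n must be a multiple of 32.
uint64_t hamming_avx2(const uint8_t* a, const uint8_t* b, size_t n)
{
    // Popcount of every possible 4-bit nibble, duplicated into both 128-bit lanes.
    const __m256i lut = _mm256_setr_epi8(
        0,1,1,2,1,2,2,3,1,2,2,3,2,3,3,4,
        0,1,1,2,1,2,2,3,1,2,2,3,2,3,3,4);
    const __m256i low_mask = _mm256_set1_epi8(0x0f);
    __m256i acc = _mm256_setzero_si256();       // four 64-bit accumulators

    for (size_t i = 0; i < n; i += 32) {
        __m256i x = _mm256_xor_si256(           // differing bits, 32 bytes at a time
            _mm256_loadu_si256((const __m256i*)(a + i)),
            _mm256_loadu_si256((const __m256i*)(b + i)));
        __m256i lo = _mm256_and_si256(x, low_mask);
        __m256i hi = _mm256_and_si256(_mm256_srli_epi16(x, 4), low_mask);
        __m256i cnt = _mm256_add_epi8(          // per-byte popcounts via vpshufb LUT
            _mm256_shuffle_epi8(lut, lo),
            _mm256_shuffle_epi8(lut, hi));
        // vpsadbw sums each 8-byte group into a 64-bit lane; vpaddq accumulates.
        acc = _mm256_add_epi64(acc, _mm256_sad_epu8(cnt, _mm256_setzero_si256()));
    }

    uint64_t lanes[4];                          // horizontal sum only at the end
    _mm256_storeu_si256((__m256i*)lanes, acc);
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}

A tuned version would defer the vpsadbw step, adding per-byte counts across several iterations (up to 31 before a byte could overflow, since each byte's count is at most 8) before widening; the sketch widens every iteration for clarity.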

You can come close with portable code, using std::bitset<64> to XOR 64-bit chunks together, with .count() as a portable API to an efficient popcount. Clang can auto-vectorize scalar popcount into AVX2, but GCC can't.

To construct the bitset safely without violating strict aliasing, you might need to memcpy from arbitrary data of another type into an unsigned long long, as in the sketch below.
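
A minimal sketch of that portable version, assuming n is a multiple of 8 (the function name and loop structure are mine):

#include <bitset>
#include <cstdint>
#include <cstring>
#include <cstddef>

// Hamming distance between two byte arrays; n must be a multiple of 8.
size_t hamming_portable(const unsigned char* a, const unsigned char* b, size_t n)
{
    size_t sum = 0;
    for (size_t i = 0; i < n; i += 8) {
        unsigned long long wa, wb;
        std::memcpy(&wa, a + i, 8);   // safe type-pun: no strict-aliasing violation
        std::memcpy(&wb, b + i, 8);
        sum += std::bitset<64>(wa ^ wb).count();  // popcount of the differing bits
    }
    return sum;
}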


I don't know if Numpy has a loop for this compiled in; otherwise you might need to XOR in one pass and then popcount in another, which would suck for computational intensity. You'd definitely want to cache-block it into small chunks that stay hot in L1d cache before you get back to re-read them.

answered by Peter Cordes
  • What an outstanding answer, thank you very much for the clear response. – Wouter Jan 14 '21 at 14:07
  • [Clang has a bit of a missed optimization](https://godbolt.org/z/Gbnrrf): it appears to horizontal-sum each iteration. Also, I would guess that if you are using `AVX2` (no `vpopcnt` instruction) you would want to use the smallest bit-width possible such that no accumulator in a sum vector would overflow, so the manual popcount would have as few steps as possible? Clang seems to be doing something entirely separate, and I haven't tested it though. – Noah Jan 15 '21 at 17:28
  • @Noah: It has no problem with `__builtin_popcountll()` directly over a uint64_t array. https://godbolt.org/z/K95hh9 Either the bitset stuff is a problem, or the 32-bit integer width. I'd suggest reporting this missed-optimization bug. – Peter Cordes Jan 15 '21 at 17:38
  • It's kind of weird. If the sum type has less width than the popcount's return type, then it will reduce to the sum width each iteration as opposed to at the end. I.e. `uint32_t` sum with `__builtin_popcount` is good, `uint64_t` sum with `__builtin_popcountll` is good, but `uint32_t` sum with `__builtin_popcountll` will reduce. I guess `__builtin_popcountll` returns a `long long` despite never being able to reach above 64. (If you try to accumulate through a larger temporary like `uint64_t tmp = __builtin_popcountll()`, then the u32 sum [vectorization fails](https://godbolt.org/z/TMhrjW).) – Noah Jan 15 '21 at 17:56
  • I really would have thought the narrowest width possible would be best, because it's just fewer steps for a manual popcount. But I think it might not be an issue with clang, but actually an issue with count returning the wrong type. With `__builtin_popcount` + `uint32_t`, clang does fine. `bitset<32>.count()` + `uint32_t` [adds an extra step](https://godbolt.org/z/oW1evb). If I had to guess, [line 198 of the source code](https://gcc.gnu.org/onlinedocs/libstdc++/libstdc++-html-USERS-3.4/bitset-source.html) using `__builtin_popcountl` is what breaks this. – Noah Jan 15 '21 at 17:59