Use Spark to find nearest neighbors in binary Hamming distance

Asked Mar 01 '23 at 18:51

Active Mar 01 '23 at 18:51

Viewed 73 times

I have a Hive table X of 10M binary vectors of dimension 256, and another Hive table Y of 1B binary vectors, also of dimension 256. How do I write a Spark/Hive job that finds, for each row in X, all rows in Y within a Hamming distance of, say, 16?

Naively, I could do something like

select X.*, Y.* from X join Y on hamming(X.vector, Y.vector) < 16

where hamming is some User Defined Function (UDF).
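
For concreteness, here is a minimal sketch of what such a UDF body could look like in Python. It assumes each 256-bit vector is stored as a 32-byte binary value; the storage format is not stated in the question, so that is an assumption.

```python
def hamming(a: bytes, b: bytes) -> int:
    """Number of bit positions in which two equal-length byte strings differ.

    Assumption: each vector is a 32-byte (256-bit) binary value.
    """
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    # XOR leaves a 1 bit exactly where the two inputs differ.
    diff = int.from_bytes(a, "big") ^ int.from_bytes(b, "big")
    # Count the 1 bits in the XOR result.
    return bin(diff).count("1")
```

In PySpark a function like this could be registered with something along the lines of spark.udf.register("hamming", hamming, IntegerType()) so that the query above can call it. A Python UDF is evaluated row by row, so this is only meant to illustrate the logic, not to be fast.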

But this requires a full cross join between the two tables. Are there more efficient ways, assuming X.vector and Y.vector are independent and uniformly distributed?
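
One direction that may be worth exploring is a pigeonhole argument. With the condition hamming(...) < 16, at most 15 bits differ, so if each 256-bit vector is split into 16 chunks of 16 bits, at least one chunk must be identical in the two vectors. Candidate pairs can then be found with an equi-join on (chunk index, chunk value) and verified with the exact distance afterwards. The sketch below shows the idea in plain Python on in-memory lists. The 32-byte storage format and the names chunk_keys and near_pairs are assumptions made for illustration, not anything from an existing library.

```python
from collections import defaultdict

CHUNK_BYTES = 2  # 16-bit chunks, so a 32-byte vector has 16 chunks


def hamming_distance(a: bytes, b: bytes) -> int:
    """Exact Hamming distance between two equal-length byte strings."""
    diff = int.from_bytes(a, "big") ^ int.from_bytes(b, "big")
    return bin(diff).count("1")


def chunk_keys(v: bytes):
    """Return one (chunk index, chunk value) key per 16-bit chunk of v."""
    n = len(v) // CHUNK_BYTES
    return [(i, bytes(v[i * CHUNK_BYTES:(i + 1) * CHUNK_BYTES])) for i in range(n)]


def near_pairs(xs, ys, threshold=16):
    """Return index pairs (i, j) with hamming_distance(xs[i], ys[j]) < threshold.

    Pigeonhole: fewer than `threshold` differing bits cannot touch every chunk
    when there are at least `threshold` chunks, so one chunk matches exactly.
    """
    if any(len(v) // CHUNK_BYTES < threshold for v in list(xs) + list(ys)):
        raise ValueError("need at least `threshold` chunks per vector")
    # Index ys by chunk key; this plays the role of the equi-join build side.
    index = defaultdict(list)
    for j, y in enumerate(ys):
        for key in chunk_keys(y):
            index[key].append(j)
    result = set()
    for i, x in enumerate(xs):
        # Candidates share at least one identical chunk with x.
        candidates = set()
        for key in chunk_keys(x):
            candidates.update(index.get(key, ()))
        # Verify each candidate with the exact distance.
        for j in candidates:
            if hamming_distance(x, ys[j]) < threshold:
                result.add((i, j))
    return result
```

In Spark the same idea would correspond to exploding each row into its 16 keys, joining X and Y on the key, de-duplicating the candidate pairs, and then filtering with the exact distance. Under the uniform, independent assumption, two random vectors agree on a given 16-bit chunk with probability 2^-16, so the expected number of candidate pairs is roughly 16/65536 of the full cross join. Whether that is cheap enough at 10M by 1B rows would have to be measured.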

asked Mar 01 '23 at 18:51

John Jiang

0 Answers