I have a Hive table X of 10M binary vectors of dimension 256, and another Hive table Y of 1B binary vectors, also of dimension 256. How do I write a Spark/Hive job that finds, for each row in X, all rows in Y within a Hamming distance of, say, 16?

Naively, I could do something like

select X.*, Y.* from X join Y on hamming(X.vector, Y.vector) < 16

where hamming is some User Defined Function (UDF).
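
For concreteness, here is a minimal sketch of what the core of such a UDF might look like. It assumes each 256-bit vector is stored as a 32-byte binary value; that storage layout and the function body are assumptions made for illustration, not something fixed by the tables above.

```python
def hamming(a: bytes, b: bytes) -> int:
    """Count the bit positions at which two equal-length byte strings differ.

    Assumption: each 256-bit vector is packed into 32 bytes.
    """
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    # XOR leaves a 1 bit wherever the two inputs differ; count those bits.
    diff = int.from_bytes(a, "big") ^ int.from_bytes(b, "big")
    return bin(diff).count("1")
```

In PySpark, a function like this could presumably be registered with `spark.udf.register("hamming", hamming, IntegerType())` so that the query above can call it, though a Python UDF evaluated over every pair of rows would be very slow at this scale.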

But this requires a full cross join between the two tables. Are there more efficient ways, assuming the values of X.vector and Y.vector are independent and uniformly distributed?

John Jiang
