I have a Hive table X of 10 million binary vectors of dimension 256, and another Hive table Y of 1 billion binary vectors, also of dimension 256. How do I write a Spark/Hive job that finds, for each row in X, all rows in Y that are within a Hamming distance of, say, 16?
Naively, I could do something like:
select X.*, Y.* from X join Y on hamming(X.vector, Y.vector) < 16
where hamming is some user-defined function (UDF).
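To make the UDF concrete, here is a minimal sketch in Python of what hamming might compute. It assumes each 256-bit vector is stored as a 32-byte binary value; that storage format is an assumption for illustration, not something fixed by the tables above.

```python
def hamming(a: bytes, b: bytes) -> int:
    """Number of bit positions in which two equal-length byte strings differ."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    # XOR leaves a 1 bit exactly where the two inputs disagree.
    diff = int.from_bytes(a, "big") ^ int.from_bytes(b, "big")
    # Count the 1 bits in the XOR result.
    return bin(diff).count("1")


if __name__ == "__main__":
    zero = bytes(32)           # 256 bits, all 0 (assumed 32-byte encoding)
    ones = b"\xff" * 32        # 256 bits, all 1
    print(hamming(zero, zero))
    print(hamming(zero, ones))
```

In PySpark, a function like this could be wrapped with pyspark.sql.functions.udf and an integer return type before being used in the join condition.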
But this requires a full cross join between the two tables, i.e. on the order of 10^16 UDF evaluations (10 million x 1 billion pairs). Are there more efficient ways if we assume the values of X.vector and Y.vector are independent and uniformly distributed?