
I have two Spark RDDs without a common key that I need to join.

The first RDD comes from a Cassandra table and contains the reference set of items (id, item_name, item_type, item_size), for example: (1, 'item 1', 'type_a', 20). The second RDD is imported each night from another system and contains roughly the same data, but without an id and in raw form (raw_item_name, raw_type, raw_item_size), for example ('item 1.', 'type a', 20).

Now I need to join those two RDDs based on the similarity of the data. Right now the size of the RDDs is about 10,000 rows each, but it will grow in the future.

My current solution is: a cartesian join of both RDDs, then calculating the distance between the reference and raw attributes for each row, then grouping by id and selecting the best match.
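A minimal sketch of what I mean, in Scala; the case classes, field names and the Levenshtein-based distance function are simplified placeholders for illustration, not my real code:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical record types standing in for the two data sets described above.
case class RefItem(id: Long, name: String, itemType: String, size: Int)
case class RawItem(name: String, itemType: String, size: Int)

object CartesianMatch {
  // Plain Levenshtein edit distance, used as a stand-in string similarity.
  def levenshtein(a: String, b: String): Int = {
    val dp = Array.tabulate(a.length + 1, b.length + 1)((i, j) =>
      if (j == 0) i else if (i == 0) j else 0)
    for (i <- 1 to a.length; j <- 1 to b.length) {
      val cost = if (a(i - 1) == b(j - 1)) 0 else 1
      dp(i)(j) = math.min(math.min(dp(i - 1)(j) + 1, dp(i)(j - 1) + 1),
                          dp(i - 1)(j - 1) + cost)
    }
    dp(a.length)(b.length)
  }

  // Toy distance: name edit distance plus penalties for type and size mismatches.
  def distance(ref: RefItem, raw: RawItem): Double = {
    val typePenalty = if (ref.itemType.replace('_', ' ') == raw.itemType) 0 else 5
    levenshtein(ref.name, raw.name) + typePenalty + math.abs(ref.size - raw.size)
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("cartesian-match").setMaster("local[*]"))

    val refRdd = sc.parallelize(Seq(RefItem(1, "item 1", "type_a", 20)))
    val rawRdd = sc.parallelize(Seq(RawItem("item 1.", "type a", 20)))

    // Cartesian product, score every (ref, raw) pair, keep the best raw row per id.
    val best = refRdd.cartesian(rawRdd)
      .map { case (ref, raw) => (ref.id, (raw, distance(ref, raw))) }
      .reduceByKey((a, b) => if (a._2 <= b._2) a else b)

    best.collect().foreach(println)
    sc.stop()
  }
}
```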

At this size of RDDs the solution works, but I'm afraid that in the future the cartesian join might simply become too big.

What would be a better solution? I tried to look at Spark MLlib but didn't know where to start, which algorithm to use, etc. Any advice will be greatly appreciated.

Mike P.
  • Can't you preprocess the raw columns into the "non-raw" form, i.e. "item 1." -> "item 1"? If that's not feasible (for instance if the raw type is completely random) then maybe some kind of locality sensitive hashing would work: the idea is to hash your values so that similar values get put in the same buckets (you actually WANT collisions); a rough sketch of that bucketing idea follows these comments. – Mateusz Dymczyk Mar 20 '16 at 13:56
  • I do preprocess each raw field; in many cases it's not enough and an exact match is not found. – Mike P. Mar 20 '16 at 14:42
  • As much as it is an interesting problem, it is really not a good fit for SO. There is no practical programming problem here, and especially not one that can be solved with a few paragraphs and a couple of lines of code. My advice is to do some research first. – zero323 Mar 20 '16 at 17:05
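
A rough sketch of the bucketing idea from the comment above, continuing the earlier sketch (it reuses refRdd, rawRdd and distance from that code). The blocking key, a normalized 4-character name prefix, is only an illustrative assumption:

```scala
// Crude "bucket" key: lower-case, strip non-alphanumerics, keep a short prefix.
def blockingKey(name: String): String =
  name.toLowerCase.replaceAll("[^a-z0-9]", "").take(4)

val refByKey = refRdd.map(r => (blockingKey(r.name), r))
val rawByKey = rawRdd.map(r => (blockingKey(r.name), r))

// Only rows that land in the same bucket are compared, so the number of
// scored pairs stays far below the full cartesian product.
val bestPerId = refByKey.join(rawByKey).values
  .map { case (ref, raw) => (ref.id, (raw, distance(ref, raw))) }
  .reduceByKey((a, b) => if (a._2 <= b._2) a else b)
```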

0 Answers