I have two Spark RDDs without a common key that I need to join.
The first RDD comes from a Cassandra table and contains the reference set of items (id, item_name, item_type, item_size), for example: (1, 'item 1', 'type_a', 20). The second RDD is imported each night from another system and contains roughly the same data, but without an id and in raw form (raw_item_name, raw_type, raw_item_size), for example: ('item 1.', 'type a', 20).
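For concreteness, here is a minimal sketch of the two RDDs (the case classes and field names are hypothetical, my own choice; in reality the first RDD would be read from Cassandra, e.g. with the spark-cassandra-connector):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical record types, just to make the shapes of the two RDDs explicit.
case class RefItem(id: Int, itemName: String, itemType: String, itemSize: Int)
case class RawItem(rawItemName: String, rawType: String, rawItemSize: Int)

val sc = new SparkContext(new SparkConf().setAppName("fuzzy-join").setMaster("local[*]"))

// Reference data, as read from the Cassandra table.
val refRDD = sc.parallelize(Seq(
  RefItem(1, "item 1", "type_a", 20),
  RefItem(2, "item 2", "type_b", 30)
))

// Nightly import from the other system: no id, noisy strings.
val rawRDD = sc.parallelize(Seq(
  RawItem("item 1.", "type a", 20),
  RawItem("item 2", "type b", 30)
))
```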
Now I need to join those two RDDs based on the similarity of the data. Right now each RDD has about 10,000 rows, but they will grow in the future.
My current solution is: a cartesian join of both RDDs, then calculating the distance between the reference and raw attributes for each pair, then grouping by id and selecting the best match, roughly like the sketch below.
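A minimal sketch of that approach, building on the RDDs above (the Levenshtein distance over the concatenated attributes is just a placeholder for whatever string metric is actually used; I also used reduceByKey here, which does the same "group by id and pick the best" without materializing whole groups):

```scala
// Placeholder metric: Levenshtein (edit) distance between two strings.
def levenshtein(a: String, b: String): Int = {
  val dp = Array.tabulate(a.length + 1, b.length + 1) { (i, j) =>
    if (i == 0) j else if (j == 0) i else 0
  }
  for (i <- 1 to a.length; j <- 1 to b.length)
    dp(i)(j) = math.min(
      math.min(dp(i - 1)(j) + 1, dp(i)(j - 1) + 1),
      dp(i - 1)(j - 1) + (if (a(i - 1) == b(j - 1)) 0 else 1))
  dp(a.length)(b.length)
}

// Distance between a reference row and a raw row over the concatenated attributes.
def distance(ref: RefItem, raw: RawItem): Int =
  levenshtein(s"${ref.itemName} ${ref.itemType} ${ref.itemSize}",
              s"${raw.rawItemName} ${raw.rawType} ${raw.rawItemSize}")

// Score every (ref, raw) pair, then keep the closest raw row per reference id.
val bestMatches = refRDD.cartesian(rawRDD)
  .map { case (ref, raw) => (ref.id, (raw, distance(ref, raw))) }
  .reduceByKey((a, b) => if (a._2 <= b._2) a else b) // minimum distance wins
```

This scores n * m pairs, which is exactly the scaling worry below.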
At this size the solution works, but I'm afraid that in the future the cartesian join might just be too big.
What would be a better solution? I tried looking at Spark MLlib but didn't know where to start or which algorithm to use. Any advice would be greatly appreciated.