
I'm running Apache Spark on a Hadoop cluster with YARN. I have a large dataset of roughly 160 million records, on which I need to perform a self join. The join condition is an exact match on one column (c1), an overlap between date ranges, and a match on at least one of two further columns (say, c3 or c4).

I read the data from HBase into an RDD, converted that RDD to a Dataset, and then performed the join. So my questions are:

1) Would it help to partition the RDD on c1 (which must always match) before doing the join, so that Spark only joins within partitions instead of shuffling everything around?
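On the RDD side, this idea can be sketched roughly as follows. This is a sketch, not a tested answer: `recordsByC1` and the partition count are my own illustrative names, assuming an `RDD[(String, Record)]` keyed by c1.

```scala
import org.apache.spark.HashPartitioner

// Illustrative: an RDD of (c1, record) pairs.
val partitioner = new HashPartitioner(200) // partition count is a tuning choice

// Partition by c1 once and cache, so both sides of the self join
// see the same co-located partitioning.
val left = recordsByC1.partitionBy(partitioner).cache()

// Joining an RDD with itself when it already has a partitioner
// avoids re-shuffling that input for the join.
val joined = left.join(left)
```

The join still has to compare every pair of records that share a c1 value, but the shuffle of the input data happens only once.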

2) I also tried building composite keys, e.g. c1+c3 and c1+c4, and joining on those keys, but then I have to filter all the results by the date overlap afterwards. I thought that including the date overlap in the join condition itself would result in fewer records being generated.
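With the Dataset API the date overlap can indeed go directly into the join condition, so non-overlapping pairs are never materialized as output rows. A rough sketch, assuming columns named c1, c3, c4, start, and end (the column names are mine):

```scala
import org.apache.spark.sql.functions.col

// Alias the same Dataset twice for the self join.
val a = ds.as("a")
val b = ds.as("b")

val joined = a.join(b,
  col("a.c1") === col("b.c1") &&                    // exact match on c1
  col("a.start") <= col("b.end") &&                 // date ranges overlap
  col("b.start") <= col("a.end") &&
  (col("a.c3") === col("b.c3") ||                   // match on at least
   col("a.c4") === col("b.c4"))                     // one of c3 / c4
)
```

Note that this does not necessarily shuffle less data than joining first and filtering afterwards: the equality on c1 is what drives the shuffle, and the other predicates are evaluated on the candidate pairs either way. It does, however, keep the intermediate result small.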

3) Is there an efficient way to do a self join where I match on an exact column value but also perform comparisons between other columns?
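As far as I understand Spark's planner, a mixed condition like the one above is split automatically: the equality on c1 becomes the shuffle/join key, and the non-equi predicates (date overlap, c3/c4) are applied while merging matching rows. If each record carries a unique id (an assumption on my part), an ordering on it also avoids emitting every pair twice and pairing a row with itself:

```scala
import org.apache.spark.sql.functions.col

val pairs = a.join(b,
  col("a.c1") === col("b.c1") &&                    // equi part: drives the shuffle
  col("a.id") < col("b.id") &&                      // each pair only once, no self-pairs
  col("a.start") <= col("b.end") &&
  col("b.start") <= col("a.end") &&
  (col("a.c3") === col("b.c3") || col("a.c4") === col("b.c4"))
)

// Inspect the physical plan to confirm a SortMergeJoin on c1
// with the remaining predicates attached as the join's condition.
pairs.explain()
```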

Sorin
