I want to join a large (1TB) RDD with a medium-sized (10GB) RDD. An earlier processing job on the large data alone completed in 8 hours. I then joined in the medium-sized data to add one piece of information to the output: it is a simple join that takes the value of the second column and appends it to the processed output of the large data. But this job now runs for more than a day. How do I optimize it? I tried to follow some solutions like Refer, but those are for Spark DataFrames. How do I optimize it for RDDs?
Large dataset
1,large_blob_of_info
2,large_blob_of_info
3,large_blob_of_info
4,large_blob_of_info
5,large_blob_of_info
6,large_blob_of_info
Medium-sized data
3,23
2,45
1,67
4,89
Code that I have:
// Parse the large dataset into (key, blob) pairs
val rdd1 = input
  .map { line =>
    val row = line.split(",")
    (row(0), row(1))
  }
  .filter(x => x._1 != null && !x._1.isEmpty)

// Join against the medium dataset and keep the joined value
val result = rdd1.join(rdd2)
  .map { case (key, (blob, value)) => (key, blob, value) }
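
Since the medium side is only ~10GB (and just (key, value) pairs once parsed), I was wondering whether the RDD equivalent of a DataFrame broadcast join would help: collect the medium RDD as a map, broadcast it, and do a map-side lookup instead of a shuffle join. A rough sketch of what I mean, assuming the collected map fits in driver and executor memory and `sc` is the SparkContext:

// Broadcast the medium dataset as a key -> value lookup map
val lookup = sc.broadcast(rdd2.collectAsMap())

val result = rdd1.mapPartitions { iter =>
  val m = lookup.value
  iter.flatMap { case (key, blob) =>
    m.get(key).map(value => (key, blob, value))  // keeps inner-join semantics
  }
}

As I understand it, this would avoid shuffling the 1TB side entirely, at the cost of holding the lookup map on every executor. Is this the right approach here, or is there a better way?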