
I want to join a large (1 TB) RDD with a medium-sized (10 GB) RDD. An earlier job that processed only the large data completed in about 8 hours. I then added a join against the medium-sized data to pick up one extra value (it is a simple join that takes the value of the second column and appends it to the final output alongside the processed large-data output). With the join, the job has been running for more than a day. How do I optimize it? I tried to follow solutions like Refer, but those are for Spark DataFrames. How do I optimize this for RDDs?

Large dataset

1,large_blob_of_info
2,large_blob_of_info
3,large_blob_of_info
4,large_blob_of_info
5,large_blob_of_info
6,large_blob_of_info

Medium-sized data

3,23
2,45
1,67
4,89

Code that I have:

rdd1.join(rdd2).map(a => a.x)

val result = input
          .map(x => {
            val row = x.split(",")
            (row(0), row(2))
          }).filter(x => x._1 != null && !x._1.isEmpty)
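
One idea I am considering, based on the DataFrame answers I found, is a map-side (broadcast) join. The sketch below is only a rough attempt; it assumes sc is the SparkContext, that rdd1 and rdd2 hold the "id,value" lines shown above, and that the medium data still fits in memory once collected as a map:

// Rough sketch of a map-side (broadcast) join on RDDs.
// Assumes rdd1/rdd2 are RDD[String] in "id,value" form and the medium side
// is small enough to collect and broadcast.
val mediumMap = rdd2
  .map { line =>
    val cols = line.split(",")
    (cols(0), cols(1))                  // (id, value) from the medium data
  }
  .collectAsMap()                       // pull the small side to the driver

val mediumBc = sc.broadcast(mediumMap)  // shipped once to every executor

val joined = rdd1
  .map { line =>
    val cols = line.split(",")
    (cols(0), cols(1))                  // (id, large_blob_of_info)
  }
  .flatMap { case (id, blob) =>
    // look up the id locally instead of shuffling the 1 TB side
    mediumBc.value.get(id).map(value => s"$id,$blob,$value")
  }

If broadcasting 10 GB is not realistic, would pre-partitioning both RDDs with the same partitioner (and persisting the large one) before the join be the better RDD-level option?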
user0712
  • I suppose that "input" is the result of the join operation. Update your question to include the cluster resources; if you have access to the application console, you can analyze how much time the different tasks, GC, etc. take. – Emiliano Martinez Sep 04 '22 at 08:21
  • RDDs require manual optimization. – thebluephantom Sep 04 '22 at 11:19
  • If you have enough executor memory, I would apply broadcast join https://stackoverflow.com/questions/51507991/java-spark-broadcast-and-join-two-rdds – emesday Sep 04 '22 at 12:50
  • Can we use a broadcast join on 10 GB of data? I have read that broadcast supports a max of 8 GB. – user0712 Sep 04 '22 at 15:46
