
I have a query regarding groupByKey on my RDD. Below is the code I'm trying:

rdd3.map{ case(HandleMaxTuple(col1, col2, col3, col4, col5, col6, col7, col8, col9, col10, col11, col12, col13, col14, col15, col16, col17, col18, col19, col20, col21, col22, col23, col24, col25)) => (HandleMaxTuple(col1,col2,col3, col4, col5),(col6, col7, col8, col9, col10, col11, col12, col13, col14, col15, col16, col17, col18, col19, col20, col21, col22, col23, col24, col25))}.reduceByKey(_+_)
  .map{ case(HandleMaxTuple(col1, col2, col3, col4, col5),(col6, col7, col8, col9, col10, col11, col12, col13, col14, col15, col16, col17, col18, col19, col20, col21, col22, col23, col24, col25))}.groupByKey

I've defined the HandleMaxTuple case class to work around Scala's limit of 22 fields per tuple in a row. Previous question explained here: number of tuples limit in RDD; reading RDD throws arrayIndexOutOfBoundsException
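For reference, HandleMaxTuple is just a plain case class over the 25 columns, roughly along these lines (the Int field types are only for illustration):

// Sketch only: the actual field names/types may differ.
case class HandleMaxTuple(
  col1: Int, col2: Int, col3: Int, col4: Int, col5: Int,
  col6: Int, col7: Int, col8: Int, col9: Int, col10: Int,
  col11: Int, col12: Int, col13: Int, col14: Int, col15: Int,
  col16: Int, col17: Int, col18: Int, col19: Int, col20: Int,
  col21: Int, col22: Int, col23: Int, col24: Int, col25: Int
)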

I want to group by the first 5 columns (which I'm trying to turn into the key) and then apply groupByKey. Can someone help me figure out what's wrong with my groupByKey approach above?

My goal is to group by the first 5 columns and then aggregate to get the sum of the 6th, 7th, and 8th columns.

  • It's not very clear what you are trying to achieve here. Do you want to do a `reduceByKey` or `groupByKey`? As I understand your question, you want to do them after each other? (since the keys are the same the `groupByKey` won't do anything in that case, the data has already been reduced). – Shaido Mar 28 '18 at 02:17
  • @Shaido: Well, prior to using the `HandleMaxTuple` method, I was trying to groupBy the rdd with first 5 columns and sum of 6th, 7th & 8th which is my current problem as well. But, after introducing `HandleMaxTuple`, it has become a bit difficult for me to use aggregation. In the above question, I tried to convert the first 5 into keys and then using groupBy but that doesn't work out. – knowone Mar 28 '18 at 02:46
  • I see, having the first 5 columns as keys and reducing to get the sum of the 6th, 7th and 8th columns should be doable, I will add an answer in a minute. – Shaido Mar 28 '18 at 03:00

1 Answer


When doing the aggregation, if you only want the result for some of the columns, it's best to select only those in the map. If there are fewer of them than Scala's tuple-length limit (22) you can simply use a tuple; otherwise you would need to create a new case class with a different length than the one you currently have, i.e. a case class for all columns (or those to keep) except the first 5 that are used as the key.
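A minimal sketch of that idea, assuming the columns are Ints (the name ValueColumns and its + method are made up here for illustration, not taken from the question):

// Hypothetical value class holding the 20 non-key columns (col6..col25),
// so the value side of the pair RDD is not a tuple at all.
case class ValueColumns(
  col6: Int, col7: Int, col8: Int, col9: Int, col10: Int,
  col11: Int, col12: Int, col13: Int, col14: Int, col15: Int,
  col16: Int, col17: Int, col18: Int, col19: Int, col20: Int,
  col21: Int, col22: Int, col23: Int, col24: Int, col25: Int
) {
  // Element-wise sum, so reduceByKey(_ + _) works on this class directly.
  def +(that: ValueColumns): ValueColumns = ValueColumns(
    col6 + that.col6, col7 + that.col7, col8 + that.col8, col9 + that.col9,
    col10 + that.col10, col11 + that.col11, col12 + that.col12, col13 + that.col13,
    col14 + that.col14, col15 + that.col15, col16 + that.col16, col17 + that.col17,
    col18 + that.col18, col19 + that.col19, col20 + that.col20, col21 + that.col21,
    col22 + that.col22, col23 + that.col23, col24 + that.col24, col25 + that.col25
  )
}

With something like this, the map emits a ValueColumns instance as the value and the aggregation is again a plain reduceByKey(_ + _).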

Using the first 5 columns as the key and aggregating to get the sums of the 6th, 7th and 8th columns can be done as follows: first map to select the columns of interest, then do the aggregation.

rdd3.map{ case HandleMaxTuple(col1, col2, col3, col4, col5, col6, col7, col8, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _) => 
    ((col1, col2, col3, col4, col5), (col6, col7, col8))
}.reduceByKey((x,y) => (x._1 + y._1, x._2 + y._2, x._3 + y._3))

This gives a separate sum for each of the 6th, 7th and 8th columns.

A small example with an RDD containing the following rows as input (shown with a shortened, 12-field HandleMaxTuple for brevity):

HandleMaxTuple(1,2,3,4,5,6,7,8,9,10,11,12)
HandleMaxTuple(13,2,3,4,5,6,7,8,9,10,11,12)
HandleMaxTuple(1,2,3,4,5,65,7,8,9,10,11,12)

Gives:

((13,2,3,4,5),(6,7,8))
((1,2,3,4,5),(71,14,16))
  • It worked for tuples `<=22`; irony again. Even though I've specified `HandleMaxTuple` for handling this limit of `22`, in the reduceByKey I have to specify the remaining columns as well so as not to exclude them from the resulting rdd. And that again gives me `:1: error: too many elements for tuple: 25, allowed: 22`. But your answer explained what I was missing in my approach. – knowone Mar 28 '18 at 03:26
  • @knowone Yes, I guess you could create another case class with 5 fewer columns and use that instead of the `(col6, col7, col8)` tuple. Still, I would probably recommend using a dataframe which wouldn't have this constraint (I know you said using RDD was a requirement before so I didn't include that in the answer). – Shaido Mar 28 '18 at 03:30
  • Thanks Shaido. With a DF I've already done it. I'm just trying the same functionality with an RDD, hoping to get some optimization in terms of processing speed. – knowone Mar 28 '18 at 03:51
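For reference, the DataFrame route mentioned in the comments would look roughly like this (the DataFrame name df and the column names col1..col8 are assumptions, not from the question):

import org.apache.spark.sql.functions.sum

// Rough sketch only: assumes a DataFrame df whose columns are named col1..col25.
val aggregated = df
  .groupBy("col1", "col2", "col3", "col4", "col5")
  .agg(sum("col6"), sum("col7"), sum("col8"))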