I have a question about using groupByKey on my RDD. Below is the code I'm trying:
rdd3.map{ case(HandleMaxTuple(col1, col2, col3, col4, col5, col6, col7, col8, col9, col10,
    col11, col12, col13, col14, col15, col16, col17, col18, col19, col20,
    col21, col22, col23, col24, col25)) =>
  (HandleMaxTuple(col1,col2,col3, col4, col5),
   (col6, col7, col8, col9, col10, col11, col12, col13, col14, col15,
    col16, col17, col18, col19, col20, col21, col22, col23, col24, col25))
}.reduceByKey(_+_)
.map{ case(HandleMaxTuple(col1, col2, col3, col4, col5),
    (col6, col7, col8, col9, col10, col11, col12, col13, col14, col15,
     col16, col17, col18, col19, col20, col21, col22, col23, col24, col25))
}.groupByKey
HandleMaxTuple is a case class I defined to work around Scala's limit of 22 elements per tuple. My previous question explains this: number of tuples limit in RDD; reading RDD throws arrayIndexOutOfBoundsException
I want to group by the first 5 columns, so I'm trying to reduce the rows into key/value pairs keyed on those columns and then call groupByKey. Can someone help me understand what is wrong with my groupByKey approach above?
My goal is to group by the first 5 columns and then aggregate to get the sum of the 6th, 7th, and 8th columns.
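
To make the goal concrete, here is a minimal sketch of the result I'm after, written against plain Scala collections rather than Spark so it runs on its own. The `Record` type, its field names, the column types (String keys, Long values), and `sumByKey` are all made up for illustration; my real rows have 25 columns.

```scala
// Hypothetical, shortened row: five key columns plus the three numeric
// columns I want to sum (the real data has 25 columns).
case class Record(k1: String, k2: String, k3: String, k4: String, k5: String,
                  v6: Long, v7: Long, v8: Long)

object GroupSumSketch {
  type Key  = (String, String, String, String, String)
  type Sums = (Long, Long, Long)

  // Group rows by the first five columns and sum columns 6, 7 and 8 per group.
  def sumByKey(rows: Seq[Record]): Map[Key, Sums] =
    rows
      .map(r => ((r.k1, r.k2, r.k3, r.k4, r.k5), (r.v6, r.v7, r.v8)))
      .groupBy(_._1)
      .map { case (key, pairs) =>
        // Element-wise sum of the value tuples that share this key.
        key -> pairs.map(_._2).reduce((a, b) => (a._1 + b._1, a._2 + b._2, a._3 + b._3))
      }

  def main(args: Array[String]): Unit = {
    val rows = Seq(
      Record("a", "b", "c", "d", "e", 1L, 2L, 3L),
      Record("a", "b", "c", "d", "e", 10L, 20L, 30L),
      Record("x", "b", "c", "d", "e", 5L, 5L, 5L)
    )
    sumByKey(rows).foreach(println)
  }
}
```

On an RDD of such key/value pairs, I assume the same element-wise summing function could be passed to reduceByKey in place of `_+_`, since `+` is not defined between two tuples.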