I have a process that, given a new input, retrieves related information from our Kudu database and then does some computation.
The problem lies in the data retrieval: the table has 1,201,524,092 rows, and any computation takes forever to start processing the rows it needs because the reader hands the whole table to Spark.
To read from Kudu we do:
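import scala.util.Try
import org.apache.spark.sql.DataFrame

// kuduContext and SQLContext are defined elsewhere in our code
def read(tableName: String): Try[DataFrame] = Try {
  val kuduOptions: Map[String, String] = Map(
    "kudu.table" -> tableName,
    "kudu.master" -> kuduContext.kuduMaster)
  SQLContext.read.options(kuduOptions).format("kudu").load
}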
And then:
val newInputs = ??? // Dataframe with the new inputs
val currentInputs = read("inputsTable").get // This takes too much time!!!!
val relatedCurrent = currentInputs.join(newInputs.select("commonId"), Seq("commonId"), "inner")
doThings(newInputs, relatedCurrent)
For example, suppose we only want to introduce a single new input. Spark still has to scan the full table to find the related currentInputs, which produces a Shuffle Write of 81.6 GB / 1201524092 rows.
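To make the goal concrete, here is a minimal sketch of one possible direction: collect the new ids on the driver and turn the join into a plain `isin` filter on the Kudu-backed DataFrame, so that the data source has a chance to skip unrelated rows instead of shipping the whole table into a shuffle. The helper name `relatedRows` is hypothetical, and the sketch assumes that `commonId` is a String column, that `newInputs` is small enough to collect, and that the Kudu connector actually pushes this kind of filter down (not verified here).

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Hypothetical helper, not part of our current code.
// Assumes commonId is a String column and newInputs is small.
def relatedRows(currentInputs: DataFrame, newInputs: DataFrame): DataFrame = {
  // Bring the distinct new ids to the driver.
  val ids: Seq[String] = newInputs
    .select("commonId")
    .distinct()
    .collect()
    .map(_.getString(0))
    .toSeq

  // A filter on a plain column with literal values, which a data source
  // may be able to evaluate itself instead of returning every row.
  currentInputs.filter(col("commonId").isin(ids: _*))
}
```

With this, `relatedRows(currentInputs, newInputs)` would take the place of the inner join when the batch of new inputs is small; for large batches the collected id list would grow too big and the join would still be needed.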
How can I improve this?
Thanks,