
I have a process that, given a new input, retrieves related information from our Kudu database and then does some computation.

The problem lies in the data retrieval: we have 1,201,524,092 rows, and for any computation it takes forever to start processing the rows we actually need, because the reader first has to hand the whole table to Spark.

To read from Kudu we do:

import scala.util.Try
import org.apache.spark.sql.DataFrame

def read(tableName: String): Try[DataFrame] = Try {
  // Kudu needs the table name plus the master address from the KuduContext
  val kuduOptions: Map[String, String] = Map(
    "kudu.table" -> tableName,
    "kudu.master" -> kuduContext.kuduMaster)

  // Read through the SparkSession's reader (assumes a session named spark in scope;
  // calling .read on the SQLContext type itself does not compile)
  spark.read.options(kuduOptions).format("kudu").load()
}
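For completeness, a minimal usage sketch of this helper (the Success/Failure handling matches the Try wrapper above):

import scala.util.{Failure, Success}

// Unwrap the Try explicitly instead of calling .get
read("inputsTable") match {
  case Success(df) => df.printSchema()
  case Failure(e)  => e.printStackTrace()
}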

And then:

val newInputs: DataFrame = ??? // DataFrame with the new inputs
val currentInputs = read("inputsTable").get // This takes too much time!!!!

// Keep only the rows of the big table whose commonId appears in the new inputs
val relatedCurrent = currentInputs.join(newInputs.select("commonId"), Seq("commonId"), "inner")

doThings(newInputs, relatedCurrent)

For example, say we only want to introduce a single new input. Spark still has to scan the full table to build currentInputs, which produces a shuffle write of 81.6 GB / 1,201,524,092 rows.
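For what it's worth, the behaviour is visible in the physical plan; this is just a diagnostic sketch on the DataFrames above:

// Print the plan: with the code above the Kudu scan carries no filter,
// so Spark reads every row and only drops them during the join.
relatedCurrent.explain(true)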

How can I improve this?

Thanks,

Shelen

1 Answer


You can collect the new input IDs and then use them in a where clause. Going this way you can easily hit an OOM on the driver if there are many new IDs, but it can make your query very fast because the filter benefits from predicate pushdown into Kudu:

import spark.implicits._ // for the $"commonId" column syntax

val collectedIds = newInputs.select("commonId").collect.map(_.get(0)) // OOM risk on the driver
val filteredCurrentInputs = currentInputs.where($"commonId".isin(collectedIds: _*)) // In filter, pushed down to the Kudu scan
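Alternatively, if the set of new IDs is too large to collect safely into an isin list, one possible sketch is a broadcast join: it avoids shuffling the big table, though unlike the isin filter it does not push any predicate into Kudu, so the full scan still happens.

import org.apache.spark.sql.functions.broadcast

// Broadcast the small side so the ~1.2B-row table is not shuffled for the join
val relatedViaBroadcast = currentInputs.join(broadcast(newInputs.select("commonId")), Seq("commonId"), "inner")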
M. Alexandru