I have a UDF in Spark (running on EMR), written in Scala, that parses the device from a user agent string using the uaparser library for Scala (uap-scala). It works fine on small sets (5,000 rows), but on larger sets (2M rows) it runs very slowly. I also tried collecting the DataFrame to a list and looping over it on the driver, and that was very slow too, which makes me believe that the UDF runs on the driver and not on the workers.
- How can I establish this? Does anyone have another theory?
- If that is the case, why would this happen?
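A minimal sketch of one way to check where a UDF is evaluated: a second UDF that ignores its input and returns the host name of the JVM that runs it, so the result can be compared with the driver's host name. The names UdfHostCheck, hostOfUdf, udfHosts and driverHost are hypothetical and not part of the job above, and the sketch assumes the DataFrame has a string column called userAgent. In local mode, or if the driver shares a node with an executor, the host names will coincide and the check says nothing.

```scala
import java.net.InetAddress
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.expressions.UserDefinedFunction
import org.apache.spark.sql.functions.{col, udf}

object UdfHostCheck {
  // Hypothetical helper: ignores its input and returns the host name
  // of the JVM that evaluates the UDF.
  val hostOfUdf: UserDefinedFunction =
    udf((_: String) => InetAddress.getLocalHost.getHostName)

  // Counts rows per evaluating host.
  // Assumes df has a string column named "userAgent".
  def udfHosts(df: DataFrame): DataFrame =
    df.withColumn("udfHost", hostOfUdf(col("userAgent")))
      .groupBy("udfHost")
      .count()

  // Host name of the JVM that calls this method; call it from the driver program.
  def driverHost: String = InetAddress.getLocalHost.getHostName
}
```

Calling UdfHostCheck.udfHosts(df).show(false) next to println(UdfHostCheck.driverHost) should list several worker host names if the UDF is distributed, and only the driver's host name if it is not.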
This is the UDF code:
def calcDevice(userAgent: String): String = {
  // Treat a null user agent as an empty string
  val userAgentVal = Option(userAgent).getOrElse("")
  Parser.get.parse(userAgentVal).device.family
}

val calcDeviceValUDF: UserDefinedFunction = udf(calcDevice _)
Usage:
.withColumn("agentDevice", udfDefinitions.calcDeviceValUDF($"userAgent"))
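Since looping on the driver was also slow, another theory is that each call to calcDevice is expensive in itself, regardless of where it runs. A rough sketch for measuring that outside Spark follows; TimePerCall, avgNanos and projectedSeconds are hypothetical names, and the numbers are only indicative because JIT warm-up is not accounted for.

```scala
object TimePerCall {
  // Runs f over every input and returns the average wall-clock nanoseconds per call.
  def avgNanos[A, B](inputs: Seq[A])(f: A => B): Double = {
    require(inputs.nonEmpty, "inputs must not be empty")
    val start = System.nanoTime()
    inputs.foreach(x => f(x))
    (System.nanoTime() - start).toDouble / inputs.size
  }

  // Extrapolates an average per-call cost to a row count, in seconds on a single thread.
  def projectedSeconds(avgNanosPerCall: Double, rows: Long): Double =
    avgNanosPerCall * rows / 1e9
}
```

For example, TimePerCall.avgNanos(sampleAgents)(calcDevice), with sampleAgents being a few thousand user agent strings taken from the data, gives a per-call cost. If TimePerCall.projectedSeconds of that cost for 2,000,000 rows is close to the observed runtime divided across the available cores, the slowness is in the function itself rather than in where the UDF runs.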
Thanks, Nir