
I have a UDF in Spark (running on EMR), written in Scala, that parses the device from a user agent string using the uap-scala library. It works fine on small sets (5,000 rows), but on larger sets (2M rows) it is very slow. I also tried collecting the DataFrame to a list and looping over it on the driver, and that was also very slow, which makes me believe the UDF runs on the driver and not on the workers.

  1. How can I establish this? Does anyone have another theory? (See the diagnostic sketch after this list.)
  2. If that is the case, why can this happen?
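
One way to establish question 1 is to make a UDF report where it actually executes. This is a minimal diagnostic sketch (the whereAmI name is just illustrative, not from the question): if the UDF runs on the workers, the reported hostnames will differ from the driver's.

import org.apache.spark.TaskContext
import org.apache.spark.sql.functions.udf

// Reports the hostname and partition id of the JVM evaluating each row.
// TaskContext.get() returns null on the driver, hence the Option guard.
val whereAmI = udf { (_: String) =>
  val host = java.net.InetAddress.getLocalHost.getHostName
  val partition = Option(TaskContext.get()).map(_.partitionId).getOrElse(-1)
  s"$host, partition $partition"
}

The Spark UI is another check: if the stage running the withColumn shows tasks spread across executors, the UDF is not running on the driver.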

This is the UDF code:

import org.apache.spark.sql.expressions.UserDefinedFunction
import org.apache.spark.sql.functions.udf

def calcDevice(userAgent: String): String = {
  // Treat null user agents as empty strings before parsing
  val userAgentVal = Option(userAgent).getOrElse("")
  Parser.get.parse(userAgentVal).device.family
}

val calcDeviceValUDF: UserDefinedFunction = udf(calcDevice _)

Usage:

.withColumn("agentDevice", udfDefinitions.calcDeviceValUDF($"userAgent"))

Thanks, Nir


3 Answers

The problem was instantiating the parser inside the UDF itself, so it was rebuilt for every row. The solution is to create the object outside the UDF and use it at row level:

// Create the parser once, outside the UDF, so it is not rebuilt per row
val userAgentAnalyzerUAParser = Parser.get

def calcDevice(userAgent: String): String = {
  val userAgentVal = Option(userAgent).getOrElse("")
  userAgentAnalyzerUAParser.parse(userAgentVal).device.family
}

val calcDeviceValUDF: UserDefinedFunction = udf(calcDevice _)
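
Usage stays the same as in the question; a minimal sketch, assuming a DataFrame df with a userAgent column:

val withDevice = df.withColumn("agentDevice", calcDeviceValUDF($"userAgent"))

Because the parser is now created once and shipped with the task closure instead of being rebuilt for every row, the expensive initialization no longer dominates the runtime.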
Nir Ben Yaacov

We ran into the same issue, with Spark jobs hanging. One additional thing we did was use a broadcast variable for the parser. Even after all these changes this UDF is still quite slow, so your mileage may vary. One other caveat is acquiring the SparkSession: we run on Databricks, and if the SparkSession isn't available the job will crash; if you need the job to continue, you have to handle that failure case.

object UDFs extends Serializable {
  val uaParser = SparkSession.getActiveSession.map(_.sparkContext.broadcast(CachingParser.default(100000)))

  val parseUserAgent = udf { (userAgent: String) =>
    // We will simply return an empty map if uaParser is None because that would mean
    // there is no active spark session to broadcast the parser.
    //
    // Also if you wrap the potentially null value in an Option and use flatMap and map to
    // add type safety it becomes slower.
    if (userAgent == null || uaParser.isEmpty) {
      Map[String, Map[String, String]]()
    } else {
      val parsed = uaParser.get.value.parse(userAgent)
      Map(
        "browser" -> Map(
          "family"      -> parsed.userAgent.family,
          "major"       -> parsed.userAgent.major.getOrElse(""),
          "minor"       -> parsed.userAgent.minor.getOrElse(""),
          "patch"       -> parsed.userAgent.patch.getOrElse("")
        ),
        "os" -> Map(
          "family"      -> parsed.os.family,
          "major"       -> parsed.os.major.getOrElse(""),
          "minor"       -> parsed.os.minor.getOrElse(""),
          "patch"       -> parsed.os.patch.getOrElse(""),
          "patch-minor" -> parsed.os.patchMinor.getOrElse("")
        ),
        "device" -> Map(
          "family"      -> parsed.device.family,
          "brand"       -> parsed.device.brand.getOrElse(""),
          "model"       -> parsed.device.model.getOrElse("")
        )
      )
    }
  }
}
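
A usage sketch (assuming a DataFrame df with a userAgent column, as in the question):

val enriched = df.withColumn("uaInfo", UDFs.parseUserAgent($"userAgent"))

Broadcasting the CachingParser means it is shipped to each executor once, rather than being serialized with every task closure.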

You might also want to experiment with the cache size passed to CachingParser.
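
For example (the sizes here are purely illustrative):

// The argument to CachingParser.default is the maximum number of distinct
// user-agent strings to cache; larger values help when many rows repeat
// the same user agent, at the cost of memory.
val smallCache = CachingParser.default(10000)
val largeCache = CachingParser.default(500000)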

Hackmad

Since the implementation behind Parser.get.parse is not shown in the question, only the UDF part can be judged.

For performance, you can remove the Option allocation:

def calcDevice(userAgent: String): String = {
  val userAgentVal = if(userAgent == null) "" else userAgent
  Parser.get.parse(userAgentVal).device.family
}
Alper t. Turker