
In Scala/Spark, I am trying to do the following:

val portCalls_Ports = 
  portCalls.join(ports, portCalls("port_id") === ports("id"), "inner")

However, I am getting the following error:

Exception in thread "main" org.apache.spark.sql.AnalysisException: 
     binary type expression port_id cannot be used in join conditions;

It's true that this is a binary type:

root
 |-- id: binary (nullable = false)
 |-- port_id: binary (nullable = false)
     .
     .
     .

+--------------------+--------------------+
|                  id|             port_id|
+--------------------+--------------------+
|[FB 89 A0 FF AA 0...|[B2 B2 84 B9 52 2...|

as is ports("id").

I am using the following libraries:

scalaVersion := "2.11.11"
libraryDependencies ++= Seq(
  // Spark dependencies
  "org.apache.spark" %% "spark-hive" % "1.6.2",
  "org.apache.spark" %% "spark-mllib" % "1.6.2",
  // Third-party libraries
  "postgresql" % "postgresql" % "9.1-901-1.jdbc4",
  "net.sf.jopt-simple" % "jopt-simple" % "5.0.3"
)

Note that I am using JDBC to read database tables.

What is the best way to fix this problem?


1 Answer


Prior to Spark 2.1.0, the best workaround I know of is to use the base64 function to convert the binary columns into Strings, and compare those:

import org.apache.spark.sql.functions._

val portCalls_Ports =
  portCalls.join(ports, base64(portCalls("port_id")) === base64(ports("id")), "inner")
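This workaround is safe because Base64 encoding is injective: two byte arrays encode to the same string if and only if they are byte-for-byte equal, so joining on the encoded strings matches exactly the rows a direct binary comparison would. A minimal plain-Scala sketch of that property (using `java.util.Base64` for illustration, not Spark's `base64` function, though both produce standard Base64):

```scala
import java.util.Base64

object Base64JoinKeyCheck extends App {
  val enc = Base64.getEncoder

  // Sample byte arrays standing in for binary join keys.
  val a = Array[Byte](0xFB.toByte, 0x89.toByte, 0xA0.toByte)
  val b = Array[Byte](0xFB.toByte, 0x89.toByte, 0xA0.toByte) // same bytes as a
  val c = Array[Byte](0xB2.toByte, 0xB2.toByte, 0x84.toByte) // different bytes

  // Equal byte arrays yield equal encodings...
  assert(enc.encodeToString(a) == enc.encodeToString(b))
  // ...and different byte arrays yield different encodings,
  // so a string equality join over base64(col) is equivalent
  // to an equality join over the binary column itself.
  assert(enc.encodeToString(a) != enc.encodeToString(c))

  println("base64 key comparison behaves like binary equality")
}
```

The cost is the extra encode per row on both sides of the join, which is usually negligible compared to the shuffle itself.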
  • Sorry - edited the post to include the import; I recommend making a habit of adding this import to every DataFrame-related piece of code ;) – Tzach Zohar Jun 09 '17 at 16:05