
I am facing a strange problem with Apache Spark (using the Scala API). There are two DataFrame objects; let's call them beans and relation.

  1. The beans DataFrame consists of two columns, named id and data. Consider that all ids are unique and that data holds a text representation of either an action or the target of an action.
  2. The relation DataFrame defines the relationships between actions and their targets. It consists of two columns: actionId and targetId.

(see the code snippet below for a table representation of the DataFrame objects)

Basically, I am trying to alias beans as two new DataFrame objects, actions and targets, and then join them via the relation DataFrame.

Here's some code to illustrate what is going on:

// define the SQL context
import org.apache.spark.sql.SQLContext
val sqlContext = new SQLContext(sparkContext)

// ...

// Produce the following DataFrame objects:
// beans:                   relation:
// +--------+--------+      +----------+----------+
// |   id   |  data  |      | actionId | targetId |
// +--------+--------+      +----------+----------+
// |   a    |  save  |      |    a     |     1    |
// +--------+--------+      +----------+----------+
// |   b    | delete |      |    b     |     2    |
// +--------+--------+      +----------+----------+
// |   c    |  read  |      |    c     |     3    |
// +--------+--------+      +----------+----------+
// |   1    |  file  |
// +--------+--------+
// |   2    |   os   |
// +--------+--------+
// |   3    |  book  |
// +--------+--------+
case class Bean(id: String, data: String)
case class Relation(actionId: String, targetId: String)
val beans = sqlContext.createDataFrame(
    Bean("a", "save") :: Bean("b", "delete") :: Bean("c", "read") ::
    Bean("1", "file") :: Bean("2", "os") :: Bean("3", "book") :: Nil
  )
val relation = sqlContext.createDataFrame(
    Relation("a", "1") :: Relation("b", "2") :: Relation("c", "3") :: Nil
  )


// alias beans as "actions" and "targets" to avoid ambiguity
val actions = beans as "actions"
val targets = beans as "targets"
// join actions and targets via relation
actions.join(relation, actions("id") === relation("actionId"))
        .join(targets, targets("id") === relation("targetId"))
        .select(actions("id") as "actionId", targets("id") as "targetId",
                 actions("data") as "action", targets("data") as "target")
        .show()

The desired output of this snippet is:

// desired output
// +----------+----------+--------+--------+
// | actionId | targetId | action | target |
// +----------+----------+--------+--------+
// |    a     |    1     |  save  |  file  |
// +----------+----------+--------+--------+
// |    b     |    2     | delete |   os   |
// +----------+----------+--------+--------+
// |    c     |    3     |  read  |  book  |
// +----------+----------+--------+--------+

However, the real (and strange) output is an empty DataFrame:

+--------+--------+------+------+
|actionId|targetId|action|target|
+--------+--------+------+------+
+--------+--------+------+------+

I suspected there might be an issue with joining a DataFrame to itself, but the example in Usage of spark DataFrame “as” method proved this suspicion wrong.

I am working with Spark 1.4.1 and Scala 2.10.4, but I got the same result with Spark 1.5.1 and Scala 2.11.7.

Changing the schema of the DataFrame objects is not an option. Any suggestions?

Solution

Refer to zero323's response. If you are getting an error message like this:

error: value $ is not a member of StringContext
              actions.join(relation, $"actions.id" === $"actionId")
                                     ^

be sure to add the following statement:

import sqlContext.implicits._
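
For completeness, here is a minimal sketch of the full fix, assuming the actions, targets and relation values defined in the snippet above:

import sqlContext.implicits._ // brings the $"..." column syntax into scope

// join through the alias-qualified names so "actions" and "targets" resolve
actions.join(relation, $"actions.id" === $"actionId")
  .join(targets, $"targets.id" === $"targetId")
  .select($"actions.id" as "actionId", $"targets.id" as "targetId",
          $"actions.data" as "action", $"targets.data" as "target")
  .show()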
shadjiiski

2 Answers


There is a subtle difference between what you do here and the example you've linked. In the linked answer I use Column objects directly; here you use the apply method on a DataFrame. To see the difference, just type both in a REPL:

scala> actions("actions.id")
res59: org.apache.spark.sql.Column = id

scala> col("actions.id")
res60: org.apache.spark.sql.Column = actions.id

For an alias to be properly recognized, you have to use Column objects directly; otherwise the alias is simply stripped. That means you need a query like this:

actions.join(relation, $"actions.id" === $"actionId")
  .join(targets, $"targets.id" === $"targetId")

or

import org.apache.spark.sql.functions.col 

actions.join(relation, col("actions.id") === col("actionId"))
  .join(targets, col("targets.id") === col("targetId"))

to make it work. Of course, using col on the RHS is strictly optional here; you could have used apply as before.
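
For example, this mixed form should behave the same (a sketch reusing the actions, targets and relation values from the question, plus the col import above):

// Column objects on the self-joined side, apply on the relation side
actions.join(relation, col("actions.id") === relation("actionId"))
  .join(targets, col("targets.id") === relation("targetId"))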

If you prefer to use apply, you can rename the join columns instead:

val targets = beans.withColumnRenamed("id", "_targetId")
val actions = beans.withColumnRenamed("id", "_actionId")

actions.join(relation, actions("_actionId") === relation("actionId"))
  .join(targets, targets("_targetId") === relation("targetId"))
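
To finish that variant with an unambiguous select, one option (a sketch; the extra data renames are my addition, not part of the answer above) is to rename the data columns on each side as well:

val targets = beans.withColumnRenamed("id", "_targetId")
  .withColumnRenamed("data", "target")
val actions = beans.withColumnRenamed("id", "_actionId")
  .withColumnRenamed("data", "action")

actions.join(relation, actions("_actionId") === relation("actionId"))
  .join(targets, targets("_targetId") === relation("targetId"))
  .select("actionId", "targetId", "action", "target") // plain names are unambiguous after the renames
  .show()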
zero323

Solution

I would split it into two phases:

val beans = sqlContext.createDataFrame(
  Bean("a", "save") :: 
  Bean("b", "delete") :: 
  Bean("c", "read") ::
  Bean("1", "file") :: 
  Bean("2", "os") :: 
  Bean("3", "book") :: 
  Nil
)
val relation = sqlContext.createDataFrame(
  Relation("a", "1") :: 
  Relation("b", "2") :: 
  Relation("c", "3") :: 
  Nil
)

// "add" action
val step1 = beans.join(relation, beans("id") === relation("actionId"))
  .select(
    relation("actionId"), 
    relation("targetId"), 
    beans("data").as("action")
  )

// "add" target column
val result = step1.join(beans, beans("id") === relation("targetId"))
  .select(
    step1("actionId"),
    step1("targetId"),
    step1("action"),
    beans("data").as("target")
  )

result.show

Remark

Still, it seems unusual and smelly to keep the action beans ("a", "b", "c") in the same table as the target beans ("1", "2", "3").

Martin Senne