I am facing a strange problem with Apache Spark (using the Scala API). There are two DataFrame objects; let's call them beans and relation.
- The beans DataFrame consists of two columns, named id and data. Assume that all ids are unique and that data holds a text representation of either an action or the target of an action.
- The relation DataFrame defines the relationships between the actions and their targets. It consists of two columns: actionId and targetId.
(See the code snippet below for a table representation of the DataFrame objects.)
Basically, I am trying to alias beans as two new DataFrame objects, actions and targets, and then join them via the relation DataFrame.
Here's some code to illustrate what is going on:
import org.apache.spark.sql.SQLContext

// define the SQL context
val sqlContext = new SQLContext(sparkContext)
// ...
// Produce the following DataFrame objects:
// beans: relation:
// +--------+--------+ +----------+----------+
// | id | data | | actionId | targetId |
// +--------+--------+ +----------+----------+
// | a | save | | a | 1 |
// +--------+--------+ +----------+----------+
// | b | delete | | b | 2 |
// +--------+--------+ +----------+----------+
// | c | read | | c | 3 |
// +--------+--------+ +----------+----------+
// | 1 | file |
// +--------+--------+
// | 2 | os |
// +--------+--------+
// | 3 | book |
// +--------+--------+
case class Bean(id: String, data: String)
case class Relation(actionId: String, targetId: String)
val beans = sqlContext.createDataFrame(
Bean("a", "save") :: Bean("b", "delete") :: Bean("c", "read") ::
Bean("1", "file") :: Bean("2", "os") :: Bean("3", "book") :: Nil
)
val relation = sqlContext.createDataFrame(
Relation("a", "1") :: Relation("b", "2") :: Relation("c", "3") :: Nil
)
// alias beans as "actions" and "targets" to avoid ambiguity
val actions = beans as "actions"
val targets = beans as "targets"
// join actions and targets via relation
actions.join(relation, actions("id") === relation("actionId"))
.join(targets, targets("id") === relation("targetId"))
.select(actions("id") as "actionId", targets("id") as "targetId",
actions("data") as "action", targets("data") as "target")
.show()
The desired output of this snippet is:
// desired output
// +----------+----------+--------+--------+
// | actionId | targetId | action | target |
// +----------+----------+--------+--------+
// | a | 1 | save | file |
// +----------+----------+--------+--------+
// | b | 2 | delete | os |
// +----------+----------+--------+--------+
// | c | 3 | read | book |
// +----------+----------+--------+--------+
However, the actual (and strange) output is an empty DataFrame:
+--------+--------+------+------+
|actionId|targetId|action|target|
+--------+--------+------+------+
+--------+--------+------+------+
I suspected that there is an issue with joining a DataFrame to itself, but the example in “Usage of spark DataFrame ‘as’ method” proves this suspicion wrong.
I am working with Spark 1.4.1 and Scala 2.10.4, but I got the same result with Spark 1.5.1 and Scala 2.11.7.
Changing the schema of the DataFrame objects is not an option. Any suggestions?
Solution
Refer to zero323's response. If you get an error message like this
error: value $ is not a member of StringContext
actions.join(relation, $"actions.id" === $"actionId")
^
be sure to add the following import:
import sqlContext.implicits._
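With that import in place, the $ string interpolator becomes available, and the join can be rewritten to resolve columns through the alias names, so Spark can tell the two sides of the self-join apart. A minimal sketch along the lines of zero323's approach, using the same beans and relation DataFrames as above:

import sqlContext.implicits._

// alias beans so the two sides of the self-join carry distinct names
val actions = beans.as("actions")
val targets = beans.as("targets")

// resolve columns through the aliases ($"actions.id", $"targets.id")
// instead of through the original DataFrame references
actions.join(relation, $"actions.id" === $"actionId")
.join(targets, $"targets.id" === $"targetId")
.select($"actions.id".as("actionId"), $"targets.id".as("targetId"),
$"actions.data".as("action"), $"targets.data".as("target"))
.show()

This should produce the desired output shown above.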