
I have a requirement to cbind (as it happens in R) two DataFrames in Spark using Scala, which do not have an ID column. Any pointers to a readily available function for this, or some other workaround?

Example:

DF1:

    Name Age
    ABC  10
    BCD  11

DF2:

    Marks
    75
    85

Result needed:

    DF3:
    Name Age Marks
    ABC  10  75
    BCD  11  85
  • To bind columns, you don't need any prior knowledge of column names. The only thing that needs to match is the number of rows. You may be looking for `merge` or its SQL equivalent, joins. – Roman Luštrik Jul 07 '17 at 07:18
  • Yes, but I don't have any key columns to join on. I am actually looking for Spark's equivalent of R's cbind(). – Sourajyoti Datta Jul 07 '17 at 07:26
  • Have you seen this? https://forums.databricks.com/questions/8180/how-to-merge-two-data-frames-column-wise-in-apache.html – Roman Luštrik Jul 07 '17 at 07:29
  • This is a workaround. I was hoping something as basic as this might already be implemented as a core function in Spark! – Sourajyoti Datta Jul 07 '17 at 07:32
  • Comments aren't really appropriate for pasting much code. Either edit your question or if you found a solution, post it as an answer. – Roman Luštrik Jul 07 '17 at 07:34

2 Answers


This works perfectly as a workaround:

    import org.apache.spark.sql.functions.monotonically_increasing_id

    // Add a matching synthetic id to both DataFrames, join on it, then drop it.
    val df1Indexed = df1.withColumn("id", monotonically_increasing_id())
    val df2Indexed = df2.withColumn("id", monotonically_increasing_id())
    val df3 = df2Indexed.join(df1Indexed, Seq("id"), "outer").drop("id")

For Spark 1.6.x, the last line needs to be:

    val df3 = df2Indexed.join(df1Indexed, df1Indexed("id") === df2Indexed("id"), "outer").drop("id")
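
If strict row alignment matters (see the comments below), here is a minimal order-safe sketch using RDD `zipWithIndex` instead, which does guarantee consecutive 0-based indices; the helper name `withRowIndex` is hypothetical:

    import org.apache.spark.sql.{DataFrame, Row}
    import org.apache.spark.sql.types.{LongType, StructField}

    // Hypothetical helper: appends a consecutive 0-based row index column.
    def withRowIndex(df: DataFrame): DataFrame = {
      val schema = df.schema.add(StructField("row_idx", LongType, nullable = false))
      val indexed = df.rdd.zipWithIndex.map { case (row, idx) => Row.fromSeq(row.toSeq :+ idx) }
      df.sparkSession.createDataFrame(indexed, schema)
    }

    val df3 = withRowIndex(df1)
      .join(withRowIndex(df2), Seq("row_idx"))
      .drop("row_idx")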
  • Since monotonically_increasing_id() doesn't guarantee a consecutive sequence, doubling your `cbind` could cause trouble. – Curycu Feb 01 '18 at 05:25
  • The correct form of the third line (for Spark 1.4) is: df3 = df2.join(df1, Seq("id"), "outer").drop("id") – Carlos AG Jul 14 '20 at 10:00
  • @Curycu Yes, it is not consecutive when multiple partitions are involved. In such a case, the upper 31 bits of the 64-bit generated ids contain the partition id, and the lower 33 bits contain the record numbers. That is why it works, even when the data is distributed across clusters. – Sourajyoti Datta Aug 10 '22 at 08:49
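
A small illustration of the id layout described in the last comment (an assumption based on the Spark documentation for `monotonically_increasing_id`):

    // Upper 31 bits hold the partition id, lower 33 bits the record number.
    val id = (1L << 33) + 1                    // example id: partition 1, record 1
    val partitionId = id >> 33                 // 1
    val recordNumber = id & ((1L << 33) - 1)   // 1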

Try this function:

    import org.apache.spark.sql.{DataFrame, Row}
    import org.apache.spark.sql.functions.col
    import org.apache.spark.sql.types.{LongType, StructField}
    import spark.implicits._

    val cbind: (DataFrame, DataFrame) => DataFrame =
      (df, df2) => {
        // Tag each row of the left DataFrame with a unique id; keep a copy
        // of the id so the original row order can be restored after the join.
        val x =
          spark
            .createDataFrame(
              df.rdd.zipWithUniqueId.map(tu => Row(tu._1.toSeq :+ tu._2: _*)),
              df.schema.add(StructField("primaryKeyForCbind", LongType, false)))
            .withColumn("orderKeyForCbind", $"primaryKeyForCbind")
            .as("df")
        // Tag the right DataFrame with the same generated ids.
        val y =
          spark
            .createDataFrame(
              df2.rdd.zipWithUniqueId.map(tu => Row(tu._1.toSeq :+ tu._2: _*)),
              df2.schema.add(StructField("primaryKeyForCbind", LongType, false)))
            .as("df2")
        // Join on the generated key, restore row order, and drop the helper columns.
        x.join(y, col("df.primaryKeyForCbind") === col("df2.primaryKeyForCbind"))
          .sort("orderKeyForCbind")
          .drop("primaryKeyForCbind", "orderKeyForCbind")
      }

    val df =
      Seq(
        ("curycu", "2018-01-01"),
        ("curycu", "2018-01-07"),
        ("curycu", "2018-01-12"),
        ("curycu", "2018-01-15"),
        ("tester", "2018-01-01"),
        ("tester", "2018-01-11"),
        ("tester", "2018-01-18"))
        .toDF("id", "date")

    val df2 = Seq(0, 6, 5, 3, 0, 10, 7).toDF("daygap")

    val res = cbind(cbind(cbind(df, df2), df2), df)
    res.printSchema
    res.show

    root
     |-- id: string (nullable = true)
     |-- date: string (nullable = true)
     |-- daygap: integer (nullable = false)
     |-- daygap: integer (nullable = false)
     |-- id: string (nullable = true)
     |-- date: string (nullable = true)

    +------+----------+------+------+------+----------+
    |    id|      date|daygap|daygap|    id|      date|
    +------+----------+------+------+------+----------+
    |curycu|2018-01-01|     0|     0|curycu|2018-01-01|
    |curycu|2018-01-07|     6|     6|curycu|2018-01-07|
    |curycu|2018-01-12|     5|     5|curycu|2018-01-12|
    |curycu|2018-01-15|     3|     3|curycu|2018-01-15|
    |tester|2018-01-01|     0|     0|tester|2018-01-01|
    |tester|2018-01-11|    10|    10|tester|2018-01-11|
    |tester|2018-01-18|     7|     7|tester|2018-01-18|
    +------+----------+------+------+------+----------+
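
Note that, as with R's cbind, binding the same DataFrame twice leaves duplicate column names in the result (two `id`, `date`, and `daygap` columns above); rename columns before binding if you need to reference them unambiguously afterwards.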