
I have a requirement to cbind (as it happens in R) two DataFrames in Spark using Scala, which do not have an ID column. Any pointers to a readily available function for this, or some other workaround?

Example:

DF1:

    Name Age
    ABC  10
    BCD  11

DF2:

    Marks
    75
    85

Result needed:

    DF3:
    Name Age Marks
    ABC  10  75
    BCD  11  85
  • To bind columns, you don't need any prior knowledge of column names. The only thing that needs to match is the number of rows. You may be looking for `merge` or its SQL equivalent, joins. – Roman Luštrik Jul 07 '17 at 07:18
  • Yes, but I don't have any key columns to join on. I am actually looking for Spark's equivalent of R's cbind(). – Sourajyoti Datta Jul 07 '17 at 07:26
  • Have you seen this? https://forums.databricks.com/questions/8180/how-to-merge-two-data-frames-column-wise-in-apache.html – Roman Luštrik Jul 07 '17 at 07:29
  • This is a workaround. I was hoping something as basic as this might already be implemented as a core function in Spark! – Sourajyoti Datta Jul 07 '17 at 07:32
  • Comments aren't really appropriate for pasting much code. Either edit your question or if you found a solution, post it as an answer. – Roman Luštrik Jul 07 '17 at 07:34

2 Answers


This works perfectly as a workaround:

    import org.apache.spark.sql.functions.monotonically_increasing_id

    // Add a matching synthetic id to both DataFrames, join on it, then drop it.
    val df1Indexed = df1.withColumn("id", monotonically_increasing_id())
    val df2Indexed = df2.withColumn("id", monotonically_increasing_id())
    val df3 = df2Indexed.join(df1Indexed, Seq("id"), "outer").drop("id")

For Spark 1.6.x, the last line needs to be:

    val df3 = df2Indexed.join(df1Indexed, df1Indexed("id") === df2Indexed("id"), "outer").drop("id")
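
If strict row alignment matters (see the comments below), here is a minimal order-safe sketch using RDD `zipWithIndex` instead, which does guarantee consecutive 0-based indices; the helper name `withRowIndex` is hypothetical:

    import org.apache.spark.sql.{DataFrame, Row}
    import org.apache.spark.sql.types.{LongType, StructField}

    // Hypothetical helper: appends a consecutive 0-based row index column.
    def withRowIndex(df: DataFrame): DataFrame = {
      val schema = df.schema.add(StructField("row_idx", LongType, nullable = false))
      val indexed = df.rdd.zipWithIndex.map { case (row, idx) => Row.fromSeq(row.toSeq :+ idx) }
      df.sparkSession.createDataFrame(indexed, schema)
    }

    val df3 = withRowIndex(df1)
      .join(withRowIndex(df2), Seq("row_idx"))
      .drop("row_idx")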
  • Since monotonically_increasing_id() doesn't guarantee a consecutive sequence, doubling your `cbind` could cause trouble. – Curycu Feb 01 '18 at 05:25
  • The correct form of the third line (for Spark 1.4) is: df3 = df2.join(df1, Seq("id"), "outer").drop("id") – Carlos AG Jul 14 '20 at 10:00
  • @Curycu Yes, it is not consecutive when multiple partitions are involved. In such a case, the upper 31 bits of the 64-bit generated ids contain the partition id, and the lower 33 bits contain the record numbers. That is why it works, even when the data is distributed across clusters. – Sourajyoti Datta Aug 10 '22 at 08:49
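
A small illustration of the id layout described in the last comment (an assumption based on the Spark documentation for `monotonically_increasing_id`):

    // Upper 31 bits hold the partition id, lower 33 bits the record number.
    val id = (1L << 33) + 1                    // example id: partition 1, record 1
    val partitionId = id >> 33                 // 1
    val recordNumber = id & ((1L << 33) - 1)   // 1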

Try this function:

    import org.apache.spark.sql.{DataFrame, Row}
    import org.apache.spark.sql.functions.col
    import org.apache.spark.sql.types.{LongType, StructField}
    import spark.implicits._

    val cbind: (DataFrame, DataFrame) => DataFrame =
      (df, df2) => {
        // Tag each row of the left DataFrame with a unique id; keep a copy
        // of the id so the original row order can be restored after the join.
        val x =
          spark
            .createDataFrame(
              df.rdd.zipWithUniqueId.map(tu => Row(tu._1.toSeq :+ tu._2: _*)),
              df.schema.add(StructField("primaryKeyForCbind", LongType, false)))
            .withColumn("orderKeyForCbind", $"primaryKeyForCbind")
            .as("df")
        // Tag the right DataFrame with the same generated ids.
        val y =
          spark
            .createDataFrame(
              df2.rdd.zipWithUniqueId.map(tu => Row(tu._1.toSeq :+ tu._2: _*)),
              df2.schema.add(StructField("primaryKeyForCbind", LongType, false)))
            .as("df2")
        // Join on the generated key, restore row order, and drop the helper columns.
        x.join(y, col("df.primaryKeyForCbind") === col("df2.primaryKeyForCbind"))
          .sort("orderKeyForCbind")
          .drop("primaryKeyForCbind", "orderKeyForCbind")
      }

    val df =
      Seq(
        ("curycu", "2018-01-01"),
        ("curycu", "2018-01-07"),
        ("curycu", "2018-01-12"),
        ("curycu", "2018-01-15"),
        ("tester", "2018-01-01"),
        ("tester", "2018-01-11"),
        ("tester", "2018-01-18"))
        .toDF("id", "date")

    val df2 = Seq(0, 6, 5, 3, 0, 10, 7).toDF("daygap")

    val res = cbind(cbind(cbind(df, df2), df2), df)
    res.printSchema
    res.show

    root
     |-- id: string (nullable = true)
     |-- date: string (nullable = true)
     |-- daygap: integer (nullable = false)
     |-- daygap: integer (nullable = false)
     |-- id: string (nullable = true)
     |-- date: string (nullable = true)

    +------+----------+------+------+------+----------+
    |    id|      date|daygap|daygap|    id|      date|
    +------+----------+------+------+------+----------+
    |curycu|2018-01-01|     0|     0|curycu|2018-01-01|
    |curycu|2018-01-07|     6|     6|curycu|2018-01-07|
    |curycu|2018-01-12|     5|     5|curycu|2018-01-12|
    |curycu|2018-01-15|     3|     3|curycu|2018-01-15|
    |tester|2018-01-01|     0|     0|tester|2018-01-01|
    |tester|2018-01-11|    10|    10|tester|2018-01-11|
    |tester|2018-01-18|     7|     7|tester|2018-01-18|
    +------+----------+------+------+------+----------+
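
Note that, as with R's cbind, binding the same DataFrame twice leaves duplicate column names in the result (two `id`, `date`, and `daygap` columns above); rename columns before binding if you need to reference them unambiguously afterwards.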