
I have a Spark DataFrame df with five columns. I want to add another column whose values are the tuple of the first and second columns. When I use the withColumn() method, I get a type-mismatch error, because the input is not of type Column but (Column, Column). Is there a solution besides running a for loop over the rows?

var dfCol = (col1: Column, col2: Column) => (col1, col2)
val vv = df.withColumn("NewColumn", dfCol(df(df.schema.fieldNames(1)), df(df.schema.fieldNames(2))))
– TNM

4 Answers


You can use the struct function, which creates a struct (tuple) of the provided columns:

import org.apache.spark.sql.functions.struct
import spark.implicits._ // for toDF; assumes an active SparkSession named spark

val df = Seq((1, 2), (3, 4), (5, 3)).toDF("a", "b")
df.withColumn("NewColumn", struct(df("a"), df("b"))).show(false)

+---+---+---------+
|a  |b  |NewColumn|
+---+---+---------+
|1  |2  |[1,2]    |
|3  |4  |[3,4]    |
|5  |3  |[5,3]    |
+---+---+---------+
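
The struct's fields keep the names of the source columns, so individual values can be read back with dot notation. A minimal sketch, assuming the df from above:

df.withColumn("NewColumn", struct(df("a"), df("b")))
  .select($"NewColumn.a", $"NewColumn.b")
  .show(false)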
– Tautvydas

You can use a user-defined function (udf) to achieve what you want.

UDF definition

object TupleUDFs {
  import org.apache.spark.sql.functions.udf      
  // type tag is required, as we have a generic udf
  import scala.reflect.runtime.universe.{TypeTag, typeTag}

  def toTuple2[S: TypeTag, T: TypeTag] = 
    udf[(S, T), S, T]((x: S, y: T) => (x, y))
}

Usage

df.withColumn(
  "tuple_col", TupleUDFs.toTuple2[Int, Int].apply(df("a"), df("b"))
)

assuming "a" and "b" are the columns of type Int you want to put in a tuple.

– Martin Senne
  • @TNM: Your edit got rejected, unfortunately, as the edit comment did not clearly state that the import I used was wrong. It is corrected now. – Martin Senne Sep 27 '15 at 09:05

You can merge multiple DataFrame columns into one using array. Note that this produces an array column rather than a tuple (struct), so the input columns must share a compatible type.

// $"*" will capture all existing columns
df.select($"*", array($"col1", $"col2").as("newCol")) 
– Abu Shoeb

If you want to merge two DataFrame columns into one column, just:

import org.apache.spark.sql.functions.array
df.withColumn("NewColumn", array("columnA", "columnB"))
– superDuck