
I have a Spark DataFrame df with five columns. I want to add another column whose values are the tuple of the first and second columns. When I use the withColumn() method, I get a type-mismatch error, because the input is not of type Column but (Column, Column). Is there a solution besides running a for loop over the rows?

var dfCol = (col1: Column, col2: Column) => (col1, col2)
val vv = df.withColumn("NewColumn", dfCol(df(df.schema.fieldNames(1)), df(df.schema.fieldNames(2))))
– TNM

4 Answers


You can use the struct function, which creates a struct (tuple) of the provided columns:

import org.apache.spark.sql.functions.struct
import spark.implicits._ // for toDF; assumes an active SparkSession named spark

val df = Seq((1, 2), (3, 4), (5, 3)).toDF("a", "b")
df.withColumn("NewColumn", struct(df("a"), df("b"))).show(false)

+---+---+---------+
|a  |b  |NewColumn|
+---+---+---------+
|1  |2  |[1,2]    |
|3  |4  |[3,4]    |
|5  |3  |[5,3]    |
+---+---+---------+
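
The struct's fields keep the names of the source columns, so individual values can be read back with dot notation. A minimal sketch, assuming the df from above:

df.withColumn("NewColumn", struct(df("a"), df("b")))
  .select($"NewColumn.a", $"NewColumn.b")
  .show(false)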
– Tautvydas

You can use a user-defined function (udf) to achieve what you want.

UDF definition

object TupleUDFs {
  import org.apache.spark.sql.functions.udf      
  // type tag is required, as we have a generic udf
  import scala.reflect.runtime.universe.{TypeTag, typeTag}

  def toTuple2[S: TypeTag, T: TypeTag] = 
    udf[(S, T), S, T]((x: S, y: T) => (x, y))
}

Usage

df.withColumn(
  "tuple_col", TupleUDFs.toTuple2[Int, Int].apply(df("a"), df("b"))
)

assuming "a" and "b" are the columns of type Int you want to put in a tuple.

– Martin Senne
  • @TNM: Your edit got rejected, unfortunately, as the edit comment did not clearly state that the import I used was wrong. It is corrected now. – Martin Senne Sep 27 '15 at 09:05

You can merge multiple DataFrame columns into one using array. Note that this produces an array column rather than a tuple (struct), so the input columns must share a compatible type.

// $"*" will capture all existing columns
df.select($"*", array($"col1", $"col2").as("newCol")) 
– Abu Shoeb

If you want to merge two DataFrame columns into one column, just:

import org.apache.spark.sql.functions.array
df.withColumn("NewColumn", array("columnA", "columnB"))
– superDuck