
Suppose I have the following DataFrame:

scala> val df1 = Seq("a", "b").toDF("id").withColumn("nums", array(lit(1)))
df1: org.apache.spark.sql.DataFrame = [id: string, nums: array<int>]

scala> df1.show()
+---+----+
| id|nums|
+---+----+
|  a| [1]|
|  b| [1]|
+---+----+

And I want to add elements to the array in the nums column, so that I get something like the following:

+---+-------+
| id|nums   |
+---+-------+
|  a| [1,5] |
|  b| [1,5] |
+---+-------+

Is there a way to do this using the .withColumn() method of the DataFrame? E.g.

val df2 = df1.withColumn("nums", append(col("nums"), lit(5))) 

I've looked through the API documentation for Spark, but can't find anything that would allow me to do this.

Shafique Jamal

5 Answers

13
import org.apache.spark.sql.functions.{lit, array, array_union}
import spark.implicits._ // for toDF and the $ column syntax

val df1 = Seq("a", "b").toDF("id").withColumn("nums", array(lit(1)))
val df2 = df1.withColumn("nums", array_union($"nums", lit(Array(5))))
df2.show

+---+------+
| id|  nums|
+---+------+
|  a|[1, 5]|
|  b|[1, 5]|
+---+------+

array_union() was added in the Spark 2.4.0 release on 11/2/2018, seven months after you asked the question :) See https://spark.apache.org/news/index.html

Dorren Chen
5

You can do it with a udf function:

import org.apache.spark.sql.functions.{udf, col}

def addValue = udf((array: Seq[Int]) => array ++ Array(5))

df1.withColumn("nums", addValue(col("nums")))
  .show(false)

and you should get

+---+------+
|id |nums  |
+---+------+
|a  |[1, 5]|
|b  |[1, 5]|
+---+------+

Update: an alternative is to use the Dataset API with map:

df1.map(row => add(row.getAs[String]("id"), row.getAs[Seq[Int]]("nums") ++ Seq(5)))
  .show(false)

where add is a case class (define it outside the enclosing method so Spark can derive an encoder for it):

case class add(id: String, nums: Seq[Int])

I hope the answer is helpful

Ramesh Maharjan
  • Is there a more efficient way for big arrays? Since by default `Seq` is `List`, and appending to a `List` is slow – Minh Thai Oct 03 '18 at 08:50
  • @Maharjan thanks, but in terms of performance it is the same, so it is not related to what I'm saying – Minh Thai Oct 04 '18 at 03:38
  • Thanks for the upvote and for letting me know that the performance is the same. You can try the RDD way to check the performance. @MinhThai – Ramesh Maharjan Oct 04 '18 at 03:47
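
Following up on the performance discussion above: since Spark 2.4 the append can also be done with built-in functions, which avoids UDF serialization overhead entirely. concat works on array columns and, unlike array_union, keeps duplicates. A minimal sketch against the question's df1:

import org.apache.spark.sql.functions.{array, concat, lit}

// concat([1], [5]) appends without deduplicating
df1.withColumn("nums", concat($"nums", array(lit(5)))).show()
// +---+------+
// | id|  nums|
// +---+------+
// |  a|[1, 5]|
// |  b|[1, 5]|
// +---+------+
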
3

If you are, like me, searching for how to do this in a Spark SQL statement, here's how:

%sql

select array_union(array("value 1"), array("value 2"))

You can use array_union to merge two arrays. To use it for appending a single value, you first have to turn that value into an array; do this with the array() function.

You can enter a value like array("a string") or array(yourColumn).
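
Applied to the question's DataFrame, the same statement works from Scala through a temp view (a sketch; the view name df1_view is made up):

df1.createOrReplaceTempView("df1_view")
spark.sql("select id, array_union(nums, array(5)) as nums from df1_view").show()
// +---+------+
// | id|  nums|
// +---+------+
// |  a|[1, 5]|
// |  b|[1, 5]|
// +---+------+
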

0

Be careful with Spark's array_union: it removes duplicates, so you will not get the expected result if your array contains duplicated entries. It also costs at least O(N), so when I used it inside an array aggregate it became an O(N^2) operation and took forever for some large arrays.
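
A minimal sketch of that deduplication (assuming spark.implicits._ is in scope, as in the shell):

import org.apache.spark.sql.functions.{array, array_union, lit}

val df = Seq(Seq(1, 1, 2)).toDF("nums")
// array_union([1, 1, 2], [2]) collapses both the duplicate 1 and the existing 2
df.select(array_union($"nums", array(lit(2))).as("nums")).show()
// +------+
// |  nums|
// +------+
// |[1, 2]|
// +------+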

  • How did you do it then? Did you use a UDF? – mani_nz Sep 27 '22 at 18:14
  • I ended up writing a UDF that is specific to my case without using array append. – Chi Zhang Oct 21 '22 at 23:28
0

Spark 3.4+

array_append($"nums", 5)

Full example:

import org.apache.spark.sql.functions.{array, array_append, lit}

val df1 = Seq("a", "b").toDF("id").withColumn("nums", array(lit(1)))
df1.show()
// +---+----+
// | id|nums|
// +---+----+
// |  a| [1]|
// |  b| [1]|
// +---+----+

val df2 = df1.withColumn("nums", array_append($"nums", 5))
df2.show()
// +---+------+
// | id|  nums|
// +---+------+
// |  a|[1, 5]|
// |  b|[1, 5]|
// +---+------+
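
For reference, array_append is also exposed in Spark SQL from 3.4; a one-line sketch:

spark.sql("select array_append(array(1), 5) as nums").show()
// +------+
// |  nums|
// +------+
// |[1, 5]|
// +------+
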
ZygD