
Let's say I have a numpy array a that contains the numbers 1-10:
[1 2 3 4 5 6 7 8 9 10]

I also have a Spark dataframe to which I want to add my numpy array a. I figure that a column of literals will do the job. This doesn't work:

df = df.withColumn("NewColumn", F.lit(a))

Unsupported literal type class java.util.ArrayList

But this works:

df = df.withColumn("NewColumn", F.lit(a[0]))

How do I do it?

Example DF before:

+--------------------+
|col1                |
+--------------------+
|a b c d e f g h i j |
+--------------------+

Expected result:

+--------------------+-------------------------------+
|col1                |NewColumn                      |
+--------------------+-------------------------------+
|a b c d e f g h i j |[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]|
+--------------------+-------------------------------+
– A. R.

2 Answers


List comprehension inside Spark's array

from pyspark.sql import functions as F

a = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
df = spark.createDataFrame([['a b c d e f g h i j '],], ['col1'])  # spark: an existing SparkSession
df = df.withColumn("NewColumn", F.array([F.lit(x) for x in a]))

df.show(truncate=False)
df.printSchema()
#  +--------------------+-------------------------------+
#  |col1                |NewColumn                      |
#  +--------------------+-------------------------------+
#  |a b c d e f g h i j |[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]|
#  +--------------------+-------------------------------+
#  root
#   |-- col1: string (nullable = true)
#   |-- NewColumn: array (nullable = false)
#   |    |-- element: integer (containsNull = false)
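
If `a` really is a numpy array rather than a plain Python list, it is safest to convert it first, since `F.lit` has historically accepted only plain Python scalars (a minimal sketch, assuming numpy is imported as np):

import numpy as np
from pyspark.sql import functions as F

a = np.arange(1, 11)  # the numpy array [ 1  2 ... 10] from the question
# .tolist() turns numpy int64 values into plain Python ints for F.lit
df = df.withColumn("NewColumn", F.array([F.lit(x) for x in a.tolist()]))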

@pault commented (Python 2.7):

You can hide the loop using map:
df.withColumn("NewColumn", F.array(map(F.lit, a)))

@abegehr added the Python 3 version:

df.withColumn("NewColumn", F.array(*map(F.lit, a)))
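
All three spellings build the same array column; `F.array` accepts either a single list of Columns or the Columns as separate arguments, which is why the map iterator needs the `*` unpacking in Python 3 (a quick sketch):

from pyspark.sql import functions as F

a = [1, 2, 3]
c1 = F.array([F.lit(x) for x in a])   # a single list of Columns
c2 = F.array(*[F.lit(x) for x in a])  # Columns as separate arguments
c3 = F.array(*map(F.lit, a))          # map iterator, unpacked with *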

Spark's udf

from pyspark.sql import functions as F
from pyspark.sql import types as T

# Defining the UDF; it captures the list `a` from the enclosing scope
def arrayUdf():
    return a
callArrayUdf = F.udf(arrayUdf, T.ArrayType(T.IntegerType()))

# Calling the UDF
df = df.withColumn("NewColumn", callArrayUdf())

Output is the same.
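
One small caveat (assuming Spark's default behavior): a udf result is always marked nullable, so the schema differs slightly from the array-literal version even though the data matches:

df.printSchema()
#  root
#   |-- col1: string (nullable = true)
#   |-- NewColumn: array (nullable = true)
#   |    |-- element: integer (containsNull = true)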

– Ramesh Maharjan
  • I tried this and it works. Thank you for the answer and I will keep it this way for now. However, in reality, my "a" array has tens of thousands of entries, and because of the for loop, it is not quite efficient. Is there a way to do it without loops? – A. R. Apr 06 '18 at 04:01
  • @A.R. I have updated my answer using a udf function, which doesn't require a for loop. If the answer is helpful you can accept it and upvote. – Ramesh Maharjan Apr 06 '18 at 04:10
  • You can hide the loop using `map`: `df.withColumn("NewColumn", F.array(map(F.lit, a)))` – pault Apr 06 '18 at 16:22
  • @pault Isn't map an RDD function? Also, the output of map is neither a string nor a Column, so `withColumn` would throw an error. – Ani Menon Oct 17 '20 at 10:56
  • No, that `map` is referring to the built-in Python function. – pault Oct 17 '20 at 12:24
  • @pault, I think this should be `F.array(*map(F.lit, a))` with the (star) spread operator, since F.array cannot handle a map object. – abegehr Feb 02 '21 at 21:52
  • Makes sense. I probably tested using Python 2.7. – pault Feb 02 '21 at 22:59

In the Scala API, we can use the `typedLit` function to add Array or Map values to a column.

// Ref : https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.functions$

Here is the sample code to add an Array or Map as a column value.

import org.apache.spark.sql.functions.typedLit
import spark.implicits._  // for .toDF; already in scope in spark-shell

val df1 = Seq((1, 0), (2, 3)).toDF("a", "b")

df1.withColumn("seq", typedLit(Seq(1, 2, 3)))
    .withColumn("map", typedLit(Map(1 -> 2)))
    .show(truncate = false)

// Output

+---+---+---------+--------+
|a  |b  |seq      |map     |
+---+---+---------+--------+
|1  |0  |[1, 2, 3]|[1 -> 2]|
|2  |3  |[1, 2, 3]|[1 -> 2]|
+---+---+---------+--------+
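
PySpark has no typedLit, but for completeness, here is a rough Python equivalent of this snippet (a sketch, reusing F.array from the accepted answer plus F.create_map for the map column):

from pyspark.sql import functions as F

df1 = spark.createDataFrame([(1, 0), (2, 3)], ['a', 'b'])
df1 = (df1
       .withColumn("seq", F.array([F.lit(x) for x in [1, 2, 3]]))
       .withColumn("map", F.create_map(F.lit(1), F.lit(2))))
df1.show(truncate=False)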

I hope this helps.

– Neeraj Bhadani