
Let's say I have a numpy array a that contains the numbers 1-10:
[1 2 3 4 5 6 7 8 9 10]

I also have a Spark dataframe to which I want to add my numpy array a. I figure that a column of literals will do the job. This doesn't work:

df = df.withColumn("NewColumn", F.lit(a))

Unsupported literal type class java.util.ArrayList

But this works:

df = df.withColumn("NewColumn", F.lit(a[0]))

How do I do it?

Example DF before:

+--------------------+
|col1                |
+--------------------+
|a b c d e f g h i j |
+--------------------+

Expected result:

+--------------------+-------------------------------+
|col1                |NewColumn                      |
+--------------------+-------------------------------+
|a b c d e f g h i j |[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]|
+--------------------+-------------------------------+
– A. R.

2 Answers


List comprehension inside Spark's array

from pyspark.sql import functions as F

a = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
df = spark.createDataFrame([['a b c d e f g h i j '],], ['col1'])  # spark: an existing SparkSession
df = df.withColumn("NewColumn", F.array([F.lit(x) for x in a]))

df.show(truncate=False)
df.printSchema()
#  +--------------------+-------------------------------+
#  |col1                |NewColumn                      |
#  +--------------------+-------------------------------+
#  |a b c d e f g h i j |[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]|
#  +--------------------+-------------------------------+
#  root
#   |-- col1: string (nullable = true)
#   |-- NewColumn: array (nullable = false)
#   |    |-- element: integer (containsNull = false)
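
If `a` really is a numpy array rather than a plain Python list, it is safest to convert it first, since `F.lit` has historically accepted only plain Python scalars (a minimal sketch, assuming numpy is imported as np):

import numpy as np
from pyspark.sql import functions as F

a = np.arange(1, 11)  # the numpy array [ 1  2 ... 10] from the question
# .tolist() turns numpy int64 values into plain Python ints for F.lit
df = df.withColumn("NewColumn", F.array([F.lit(x) for x in a.tolist()]))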

@pault commented (Python 2.7):

You can hide the loop using map:
df.withColumn("NewColumn", F.array(map(F.lit, a)))

@abegehr added the Python 3 version:

df.withColumn("NewColumn", F.array(*map(F.lit, a)))
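
All three spellings build the same array column; `F.array` accepts either a single list of Columns or the Columns as separate arguments, which is why the map iterator needs the `*` unpacking in Python 3 (a quick sketch):

from pyspark.sql import functions as F

a = [1, 2, 3]
c1 = F.array([F.lit(x) for x in a])   # a single list of Columns
c2 = F.array(*[F.lit(x) for x in a])  # Columns as separate arguments
c3 = F.array(*map(F.lit, a))          # map iterator, unpacked with *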

Spark's udf

from pyspark.sql import functions as F
from pyspark.sql import types as T

# Defining the UDF; it captures the list `a` from the enclosing scope
def arrayUdf():
    return a
callArrayUdf = F.udf(arrayUdf, T.ArrayType(T.IntegerType()))

# Calling the UDF
df = df.withColumn("NewColumn", callArrayUdf())

Output is the same.
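
One small caveat (assuming Spark's default behavior): a udf result is always marked nullable, so the schema differs slightly from the array-literal version even though the data matches:

df.printSchema()
#  root
#   |-- col1: string (nullable = true)
#   |-- NewColumn: array (nullable = true)
#   |    |-- element: integer (containsNull = true)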

– Ramesh Maharjan
  • I tried this and it works. Thank you for the answer and I will keep it this way for now. However, in reality, my "a" array has tens of thousands of entries, and because of the for loop, it is not quite efficient. Is there a way to do it without loops? – A. R. Apr 06 '18 at 04:01
  • @A.R. I have updated my answer using a udf function, which doesn't require a for loop. If the answer is helpful you can accept it and upvote. – Ramesh Maharjan Apr 06 '18 at 04:10
  • You can hide the loop using `map`: `df.withColumn("NewColumn", F.array(map(F.lit, a)))` – pault Apr 06 '18 at 16:22
  • @pault Isn't map an RDD function? Also, the output of map is neither a string nor a Column, so `withColumn` would throw an error. – Ani Menon Oct 17 '20 at 10:56
  • No, that `map` is referring to the built-in Python function. – pault Oct 17 '20 at 12:24
  • @pault, I think this should be `F.array(*map(F.lit, a))` with the (star) spread operator, since F.array cannot handle a map object. – abegehr Feb 02 '21 at 21:52
  • Makes sense. I probably tested using Python 2.7. – pault Feb 02 '21 at 22:59

In the Scala API, we can use the `typedLit` function to add Array or Map values to a column.

// Ref : https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.functions$

Here is the sample code to add an Array or Map as a column value.

import org.apache.spark.sql.functions.typedLit
import spark.implicits._  // for .toDF; already in scope in spark-shell

val df1 = Seq((1, 0), (2, 3)).toDF("a", "b")

df1.withColumn("seq", typedLit(Seq(1, 2, 3)))
    .withColumn("map", typedLit(Map(1 -> 2)))
    .show(truncate = false)

// Output

+---+---+---------+--------+
|a  |b  |seq      |map     |
+---+---+---------+--------+
|1  |0  |[1, 2, 3]|[1 -> 2]|
|2  |3  |[1, 2, 3]|[1 -> 2]|
+---+---+---------+--------+
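
PySpark has no typedLit, but for completeness, here is a rough Python equivalent of this snippet (a sketch, reusing F.array from the accepted answer plus F.create_map for the map column):

from pyspark.sql import functions as F

df1 = spark.createDataFrame([(1, 0), (2, 3)], ['a', 'b'])
df1 = (df1
       .withColumn("seq", F.array([F.lit(x) for x in [1, 2, 3]]))
       .withColumn("map", F.create_map(F.lit(1), F.lit(2))))
df1.show(truncate=False)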

I hope this helps.

– Neeraj Bhadani