
In PySpark I have a dataframe composed of two columns:

+-----------+----------------------+
| str1      | array_of_str         |
+-----------+----------------------+
| John      | [mango, apple, ...   |
| Tom       | [mango, orange, ...  |
| Matteo    | [apple, banana, ...  | 

I want to add a column concat_result that contains the concatenation of each element of array_of_str with the string in the str1 column.

+-----------+----------------------+----------------------------------+
| str1      | array_of_str         | concat_result                    |
+-----------+----------------------+----------------------------------+
| John      | [mango, apple, ...   | [mangoJohn, appleJohn, ...       |
| Tom       | [mango, orange, ...  | [mangoTom, orangeTom, ...        |
| Matteo    | [apple, banana, ...  | [appleMatteo, bananaMatteo, ...  |
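
For reference, a minimal dataframe with the same shape can be built like this (the values are shortened placeholders, not the real data):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# shortened sample rows matching the columns above
df = spark.createDataFrame(
    [
        ("John",   ["mango", "apple"]),
        ("Tom",    ["mango", "orange"]),
        ("Matteo", ["apple", "banana"]),
    ],
    ["str1", "array_of_str"],
)
df.show()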

I'm trying to use map to iterate over the array:

from pyspark.sql import functions as F
from pyspark.sql.types import StringType, ArrayType

# START EXTRACT OF CODE
ret = (df
  .select(['str1', 'array_of_str'])
  .withColumn('concat_result', F.udf(
     map(lambda x: x + F.col('str1'), F.col('array_of_str')), ArrayType(StringType))
  )
)

return ret
# END EXTRACT OF CODE

but I get the following error:

TypeError: argument 2 to map() must support iteration
Matteo Guarnerio
  • Possible duplicate of [TypeError: Column is not iterable - How to iterate over ArrayType()?](https://stackoverflow.com/questions/48993439/typeerror-column-is-not-iterable-how-to-iterate-over-arraytype) – pault Jun 20 '19 at 15:21
  • I tried that solution, it does not work. If you can write one that works will be appreciated. – Matteo Guarnerio Jun 20 '19 at 15:42
  • You need to define a `udf` with 2 arguments - (perhaps unless you're in spark 2.4+) – pault Jun 20 '19 at 15:44
  • Possible duplicate of [Convert PySpark dataframe column from list to string](https://stackoverflow.com/a/45108533) – user10938362 Jun 23 '19 at 13:57

1 Answer


You only need small tweaks to make this work:

from pyspark.sql.types import StringType, ArrayType
from pyspark.sql.functions import udf, col

# udf that appends the str1 value to every element of the array
concat_udf = udf(lambda con_str, arr: [x + con_str for x in arr],
                 ArrayType(StringType()))

# apply the udf column-wise to build concat_result
ret = df \
  .select(['str1', 'array_of_str']) \
  .withColumn('concat_result', concat_udf(col("str1"), col("array_of_str")))

ret.show()

You don't need to use map; a standard list comprehension is sufficient.
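
A note not in the original answer: as pault's comment on the question hints, on Spark 2.4+ you can skip the Python udf entirely with the transform higher-order SQL function (a sketch, assuming the same column names; expr is used because functions.transform only exists as a Python API from Spark 3.1):

from pyspark.sql import functions as F

# transform applies a SQL lambda to every array element, staying in the JVM
# (no Python udf round-trip)
ret_no_udf = df.withColumn(
    'concat_result',
    F.expr("transform(array_of_str, x -> concat(x, str1))")
)
ret_no_udf.show()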

pault
Richard Nemeth
  • Only caveat is that this will break if any of the `str1` or `array_of_str` values are `null`. You'd have to add explicit error checking in your `udf`. – pault Jun 20 '19 at 15:51
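
Building on that caveat, a null-safe variant of the udf could look like the sketch below (an illustration under the same column-name assumptions, not part of the original answer):

from pyspark.sql.types import StringType, ArrayType
from pyspark.sql.functions import udf, col

# return None when either input is null, and pass through null array elements,
# instead of raising a TypeError inside the udf
def safe_concat(con_str, arr):
    if con_str is None or arr is None:
        return None
    return [x + con_str if x is not None else None for x in arr]

safe_concat_udf = udf(safe_concat, ArrayType(StringType()))

ret = df.withColumn('concat_result',
                    safe_concat_udf(col("str1"), col("array_of_str")))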