In PySpark I have a DataFrame with two columns:
+--------+---------------------+
| str1   | array_of_str        |
+--------+---------------------+
| John   | [mango, apple, ...  |
| Tom    | [mango, orange, ... |
| Matteo | [apple, banana, ... |
+--------+---------------------+
I want to add a column concat_result that contains each element of array_of_str concatenated with the string in the str1 column:
+--------+---------------------+---------------------------------+
| str1   | array_of_str        | concat_result                   |
+--------+---------------------+---------------------------------+
| John   | [mango, apple, ...  | [mangoJohn, appleJohn, ...      |
| Tom    | [mango, orange, ... | [mangoTom, orangeTom, ...       |
| Matteo | [apple, banana, ... | [appleMatteo, bananaMatteo, ... |
+--------+---------------------+---------------------------------+
I'm trying to use map to iterate over the array:
from pyspark.sql import functions as F
from pyspark.sql.types import StringType, ArrayType

# START EXTRACT OF CODE
ret = (df
    .select(['str1', 'array_of_str'])
    .withColumn('concat_result', F.udf(
        map(lambda x: x + F.col('str1'), F.col('array_of_str')), ArrayType(StringType))
    )
)
return ret
# END EXTRACT OF CODE
but I get this error:
TypeError: argument 2 to map() must support iteration
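
A note on the likely cause, offered as a sketch rather than a definitive fix: in the code above, map() runs immediately on F.col('array_of_str'), which is a Column expression and not a Python iterable, and F.udf is given the result of that call instead of a function. The usual pattern is to pass F.udf a plain Python function that receives the per-row values, then call the resulting UDF with the columns. The names concat_each and add_concat_result below are hypothetical helpers introduced only for illustration, and the sketch assumes str1 holds strings and array_of_str holds arrays of strings.

```python
def concat_each(name, items):
    """Append `name` to every string in `items`.

    Receives plain Python values for one row (a str and a list),
    not Column objects. Returns None if either input is null.
    """
    if name is None or items is None:
        return None
    return [item + name for item in items]


def add_concat_result(df):
    """Return `df` with an extra `concat_result` column (hypothetical helper)."""
    # Imported here so concat_each can be tried without Spark installed.
    from pyspark.sql import functions as F
    from pyspark.sql.types import ArrayType, StringType

    # Note the instantiated StringType(): ArrayType expects a type instance.
    concat_udf = F.udf(concat_each, ArrayType(StringType()))
    return df.withColumn(
        "concat_result",
        concat_udf(F.col("str1"), F.col("array_of_str")),
    )


# Quick check of the per-row logic in plain Python.
print(concat_each("John", ["mango", "apple"]))
```

With a DataFrame shaped like the one above, add_concat_result(df) should yield the extra column. Keeping the per-row logic in an ordinary function also makes it easy to test without a Spark session.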