
In PySpark I have a dataframe composed of two columns:

+-----------+----------------------+
| str1      | array_of_str         |
+-----------+----------------------+
| John      | [mango, apple, ...   |
| Tom       | [mango, orange, ...  |
| Matteo    | [apple, banana, ...  | 

I want to add a column concat_result that contains the concatenation of each element of array_of_str with the string in the str1 column.

+-----------+----------------------+----------------------------------+
| str1      | array_of_str         | concat_result                    |
+-----------+----------------------+----------------------------------+
| John      | [mango, apple, ...   | [mangoJohn, appleJohn, ...       |
| Tom       | [mango, orange, ...  | [mangoTom, orangeTom, ...        |
| Matteo    | [apple, banana, ...  | [appleMatteo, bananaMatteo, ...  |
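
For reference, a minimal dataframe with the same shape can be built like this (the values are shortened placeholders, not the real data):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# shortened sample rows matching the columns above
df = spark.createDataFrame(
    [
        ("John",   ["mango", "apple"]),
        ("Tom",    ["mango", "orange"]),
        ("Matteo", ["apple", "banana"]),
    ],
    ["str1", "array_of_str"],
)
df.show()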

I'm trying to use map to iterate over the array:

from pyspark.sql import functions as F
from pyspark.sql.types import StringType, ArrayType

# START EXTRACT OF CODE
ret = (df
  .select(['str1', 'array_of_str'])
  .withColumn('concat_result', F.udf(
     map(lambda x: x + F.col('str1'), F.col('array_of_str')), ArrayType(StringType))
  )
)

return ret
# END EXTRACT OF CODE

but I get the following error:

TypeError: argument 2 to map() must support iteration
Matteo Guarnerio
  • Possible duplicate of [TypeError: Column is not iterable - How to iterate over ArrayType()?](https://stackoverflow.com/questions/48993439/typeerror-column-is-not-iterable-how-to-iterate-over-arraytype) – pault Jun 20 '19 at 15:21
  • I tried that solution, it does not work. If you can write one that works will be appreciated. – Matteo Guarnerio Jun 20 '19 at 15:42
  • You need to define a `udf` with 2 arguments - (perhaps unless you're in spark 2.4+) – pault Jun 20 '19 at 15:44
  • Possible duplicate of [Convert PySpark dataframe column from list to string](https://stackoverflow.com/a/45108533) – user10938362 Jun 23 '19 at 13:57

1 Answer


You only need small tweaks to make this work:

from pyspark.sql.types import StringType, ArrayType
from pyspark.sql.functions import udf, col

# udf that appends the str1 value to every element of the array
concat_udf = udf(lambda con_str, arr: [x + con_str for x in arr],
                 ArrayType(StringType()))

# apply the udf column-wise to build concat_result
ret = df \
  .select(['str1', 'array_of_str']) \
  .withColumn('concat_result', concat_udf(col("str1"), col("array_of_str")))

ret.show()

You don't need to use map; a standard list comprehension is sufficient.
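
A note not in the original answer: as pault's comment on the question hints, on Spark 2.4+ you can skip the Python udf entirely with the transform higher-order SQL function (a sketch, assuming the same column names; expr is used because functions.transform only exists as a Python API from Spark 3.1):

from pyspark.sql import functions as F

# transform applies a SQL lambda to every array element, staying in the JVM
# (no Python udf round-trip)
ret_no_udf = df.withColumn(
    'concat_result',
    F.expr("transform(array_of_str, x -> concat(x, str1))")
)
ret_no_udf.show()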

pault
Richard Nemeth
  • Only caveat is that this will break if any of the `str1` or `array_of_str` values are `null`. You'd have to add explicit error checking in your `udf`. – pault Jun 20 '19 at 15:51
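
Building on that caveat, a null-safe variant of the udf could look like the sketch below (an illustration under the same column-name assumptions, not part of the original answer):

from pyspark.sql.types import StringType, ArrayType
from pyspark.sql.functions import udf, col

# return None when either input is null, and pass through null array elements,
# instead of raising a TypeError inside the udf
def safe_concat(con_str, arr):
    if con_str is None or arr is None:
        return None
    return [x + con_str if x is not None else None for x in arr]

safe_concat_udf = udf(safe_concat, ArrayType(StringType()))

ret = df.withColumn('concat_result',
                    safe_concat_udf(col("str1"), col("array_of_str")))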