I have a DataFrame as below, where ev is of type string.

>>> df2.show()
+---+--------------+
| id|            ev|
+---+--------------+
|  1| 200, 201, 202|
|  1|23, 24, 34, 45|
|  1|          null|
|  2|            32|
|  2|          null|
+---+--------------+

Is there a way to cast ev to ArrayType without using a UDF, or is a UDF the only option?


1 Answer

You can use the built-in split function:

from pyspark.sql.functions import col, split

df = sc.parallelize([
    (1, "200, 201, 202"), (1, "23, 24, 34, 45"), (1, None),
    (2, "32"), (2, None)]).toDF(["id", "ev"])

# Split on a comma followed by optional whitespace; the raw string
# avoids Python's invalid-escape-sequence warning for \s
df.select(col("id"), split(col("ev"), r",\s*").alias("ev"))
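
Calling .show() on that selection should produce roughly the following (split returns null for null input, so the null rows are preserved):

+---+----------------+
| id|              ev|
+---+----------------+
|  1| [200, 201, 202]|
|  1|[23, 24, 34, 45]|
|  1|            null|
|  2|            [32]|
|  2|            null|
+---+----------------+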

If you want to convert the data to a numeric type, you can cast the result as follows:

df.withColumn(
    "ev",
    # alias is unnecessary here: withColumn already names the column
    split(col("ev"), r",\s*").cast("array<int>")
)

or

from pyspark.sql.types import ArrayType, IntegerType

df.withColumn(
    "ev",
    split(col("ev"), r",\s*").cast(ArrayType(IntegerType()))
)
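
Both forms are equivalent. As a rough sketch of what to expect, printSchema() on the result should show the element type as integer; any token that cannot be parsed as an integer becomes a null element instead of raising an error:

root
 |-- id: long (nullable = true)
 |-- ev: array (nullable = true)
 |    |-- element: integer (containsNull = true)

Once ev is an array, you can index into it or flatten it. A small usage sketch, reusing df, col, and split from above (explode emits one output row per array element, and null arrays produce no rows):

from pyspark.sql.functions import explode

result = df.withColumn("ev", split(col("ev"), r",\s*").cast("array<int>"))

# First element of each array; null when the array itself is null
result.select(col("id"), col("ev")[0].alias("first_ev")).show()

# One row per element
result.select(col("id"), explode(col("ev")).alias("ev")).show()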

  • Thanks for the information; after a bit of research I had just found the split function and was about to post the answer :). – Swadeep Jul 04 '16 at 16:49
  • One more question, which I searched for but couldn't find an answer to: the resulting array is an array of strings; can we have it as an array of integers? – Swadeep Jul 04 '16 at 17:18
  • Yes, you can cast types afterwards. – zero323 Jul 04 '16 at 17:22