
I have a Spark dataframe:

id objects
1 [sun, solar system, mars, milky way]
2 [moon, cosmic rays, orion nebula]

I need to replace spaces with underscores in the array elements.

Expected result:

id objects concat_obj
1 [sun, solar system, mars, milky way] [sun, solar_system, mars, milky_way]
2 [moon, cosmic rays, orion nebula] [moon, cosmic_rays, orion_nebula]

I tried using regexp_replace:

df = df.withColumn('concat_obj', regexp_replace('objects', ' ', '_'))

but that changed all spaces to underscores while I need to replace spaces only inside array elements.
So, how can this be done in PySpark?

lemon
red_quark
  • See if this helps: [transform](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.functions.transform.html) and, along similar lines, https://stackoverflow.com/questions/51706383/pyspark-removing-special-numeric-strings-from-array-of-string – teedak8s Jun 05 '22 at 17:14

2 Answers


Use higher-order functions to replace the whitespace through `regexp_replace`:

schema

root
 |-- id: long (nullable = true)
 |-- objects: array (nullable = true)
 |    |-- element: string (containsNull = true)

solution

from pyspark.sql.functions import expr

df.withColumn('concat_obj', expr("transform(objects, x -> regexp_replace(x, ' ', '_'))")).show(truncate=False)

+---+------------------------------------+------------------------------------+
|id |objects                             |concat_obj                          |
+---+------------------------------------+------------------------------------+
|1  |[sun, solar system, mars, milky way]|[sun, solar_system, mars, milky_way]|
|2  |[moon, cosmic rays, orion nebula]   |[moon, cosmic_rays, orion_nebula]   |
+---+------------------------------------+------------------------------------+
wwnde

You could use the following regex:

`(?<=[A-Za-z]) `

The only difference from your code is that this pattern only matches a space that is preceded by an alphabetical character.
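To see what the lookbehind does, here is a quick plain-Python check with the `re` module (Python's lookbehind behaves the same way as Java's regex engine for this pattern):

```python
import re

pattern = r"(?<=[A-Za-z]) "  # a space preceded by a letter

# A space inside a multi-word element gets replaced...
print(re.sub(pattern, "_", "solar system"))   # solar_system
# ...but a leading space has no letter before it, so it is kept
print(re.sub(pattern, "_", " milky way"))     # " milky_way" (leading space intact)
```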


lemon
  • I got the following error: `ERROR CodeGenerator: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 246, Column 1: Assignment conversion not possible from type "org.apache.spark.sql.catalyst.util.ArrayData" to type "org.apache.spark.unsafe.types.UTF8String"` – red_quark Jun 05 '22 at 18:14
  • If you can provide a debugging environment with your code, I may help you further. I have no possibility of playing with pyspark at the moment. @red_quark – lemon Jun 05 '22 at 18:25
  • At the moment, I solved the problem in a different way by converting the array to a string and applying `regexp_replace`. But for the future, I'm still interested in how to get the desired result without first converting the array to a string. – red_quark Jun 05 '22 at 19:04
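For reference, the string round-trip described in the last comment can be modeled in plain Python. Each step below mirrors a Spark function (`concat_ws`, `regexp_replace`, `split`); the exact Spark code red_quark used isn't shown, so this is only a sketch of the approach:

```python
import re

def via_string_roundtrip(objects, sep=","):
    """Plain-Python model of the array -> string -> array workaround."""
    # concat_ws(sep, objects): flatten the array into one string
    joined = sep.join(objects)
    # regexp_replace(col, ' ', '_'): every remaining space is inside an element
    replaced = re.sub(" ", "_", joined)
    # split(col, sep): rebuild the array
    return replaced.split(sep)

print(via_string_roundtrip(["moon", "cosmic rays", "orion nebula"]))
# ['moon', 'cosmic_rays', 'orion_nebula']
```

Note this round-trip silently breaks if an element ever contains the separator character, which is one reason the `transform`-based answers above are preferable.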