
I have a Spark dataframe:

id objects
1 [sun, solar system, mars, milky way]
2 [moon, cosmic rays, orion nebula]

I need to replace spaces with underscores in the array elements.

Expected result:

id objects concat_obj
1 [sun, solar system, mars, milky way] [sun, solar_system, mars, milky_way]
2 [moon, cosmic rays, orion nebula] [moon, cosmic_rays, orion_nebula]

I tried using regexp_replace:

df = df.withColumn('concat_obj', regexp_replace('objects', ' ', '_'))

but that changed all spaces to underscores while I need to replace spaces only inside array elements.
So, how can this be done in PySpark?

lemon
red_quark
  • See if this helps: [transform](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.functions.transform.html) and, along similar lines, https://stackoverflow.com/questions/51706383/pyspark-removing-special-numeric-strings-from-array-of-string – teedak8s Jun 05 '22 at 17:14

2 Answers


Use higher-order functions to replace the whitespace through `regexp_replace`:

schema

root
 |-- id: long (nullable = true)
 |-- objects: array (nullable = true)
 |    |-- element: string (containsNull = true)

solution

from pyspark.sql.functions import expr

df.withColumn('concat_obj', expr("transform(objects, x -> regexp_replace(x, ' ', '_'))")).show(truncate=False)

+---+------------------------------------+------------------------------------+
|id |objects                             |concat_obj                          |
+---+------------------------------------+------------------------------------+
|1  |[sun, solar system, mars, milky way]|[sun, solar_system, mars, milky_way]|
|2  |[moon, cosmic rays, orion nebula]   |[moon, cosmic_rays, orion_nebula]   |
+---+------------------------------------+------------------------------------+
wwnde

You could use the following regex:

`(?<=[A-Za-z]) `

The only difference from your code is that this pattern only matches a space that is preceded by an alphabetical character.
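To see what the lookbehind does, here is a quick plain-Python check with the `re` module (Python's lookbehind behaves the same way as Java's regex engine for this pattern):

```python
import re

pattern = r"(?<=[A-Za-z]) "  # a space preceded by a letter

# A space inside a multi-word element gets replaced...
print(re.sub(pattern, "_", "solar system"))   # solar_system
# ...but a leading space has no letter before it, so it is kept
print(re.sub(pattern, "_", " milky way"))     # " milky_way" (leading space intact)
```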


lemon
  • I got the following error: `ERROR CodeGenerator: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 246, Column 1: Assignment conversion not possible from type "org.apache.spark.sql.catalyst.util.ArrayData" to type "org.apache.spark.unsafe.types.UTF8String"` – red_quark Jun 05 '22 at 18:14
  • If you can provide a debugging environment with your code, I may help you further. I have no possibility of playing with pyspark at the moment. @red_quark – lemon Jun 05 '22 at 18:25
  • At the moment, I solved the problem in a different way by converting the array to a string and applying `regexp_replace`. But for the future, I'm still interested in how to get the desired result without first converting the array to a string. – red_quark Jun 05 '22 at 19:04
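For reference, the string round-trip described in the last comment can be modeled in plain Python. Each step below mirrors a Spark function (`concat_ws`, `regexp_replace`, `split`); the exact Spark code red_quark used isn't shown, so this is only a sketch of the approach:

```python
import re

def via_string_roundtrip(objects, sep=","):
    """Plain-Python model of the array -> string -> array workaround."""
    # concat_ws(sep, objects): flatten the array into one string
    joined = sep.join(objects)
    # regexp_replace(col, ' ', '_'): every remaining space is inside an element
    replaced = re.sub(" ", "_", joined)
    # split(col, sep): rebuild the array
    return replaced.split(sep)

print(via_string_roundtrip(["moon", "cosmic rays", "orion nebula"]))
# ['moon', 'cosmic_rays', 'orion_nebula']
```

Note this round-trip silently breaks if an element ever contains the separator character, which is one reason the `transform`-based answers above are preferable.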