
To keep it simple I have a df with the following schema:

root
 |-- Event_Time: string (nullable = true)
 |-- tokens: array (nullable = true)
 |    |-- element: string (containsNull = true)

Some of the elements of "tokens" have numbers and special characters, for example:

 "431883", "r2b2", "@refe98"

Is there any way I can remove all those and keep only actual words? I want to do an LDA later and want to clean my data beforehand. I tried regexp_replace, explode, and str.replace with no success; maybe I didn't use them correctly. Thanks
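
For reference, here is a small reproducible sketch of the schema above (the Event_Time value and the token list are made-up sample data):

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, ArrayType

spark = SparkSession.builder.getOrCreate()

schema = StructType([
    StructField("Event_Time", StringType(), True),
    StructField("tokens", ArrayType(StringType(), True), True),
])

# made-up sample row mixing real words with numeric/special tokens
df_1 = spark.createDataFrame(
    [("2018-08-06 10:00:00", ["topic", "modeling", "431883", "r2b2", "@refe98"])],
    schema=schema)
df_1.printSchema()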

edit2:

from pyspark.sql.functions import explode, regexp_replace

df_2 = (df_1.select(explode(df_1.tokens).alias('elements'))
            .select(regexp_replace('elements', '\\w*\\d\\w*', "").alias('elements'))
       )

This works only if the column is of string type; with the explode method I can explode the array into strings, but they are no longer in the same row... Can anyone improve on this?

ichafai
  • After the explode, you need to `groupBy` and aggregate with `collect_list` to get the values back into a single row. Assuming `Event_Time` is a unique key: `df2 = df_1.select("Event_Time", regexp_replace(explode("tokens"), "").alias("elements")).groupBy("Event_Time").agg(collect_list("elements").alias("tokens"))` – pault Aug 06 '18 at 13:38
  • Thanks, actually that's what I'm doing right now; will see how it goes. I'm not sure Event_Time is a unique key, but I have other columns. It looks like what you wrote. Will edit if the solution works; I'm still wondering if there is an elegant solution to this. – ichafai Aug 06 '18 at 13:46
  • Unfortunately, there is currently no way to [iterate over an array](https://stackoverflow.com/questions/48993439/typeerror-column-is-not-iterable-how-to-iterate-over-arraytype) in pyspark without using an `udf` or `rdd`. – pault Aug 06 '18 at 13:47
  • Ah, very good to know, because otherwise I would probably have kept thinking about it for some time. Thank you very much – ichafai Aug 06 '18 at 13:49

3 Answers

from pyspark.sql.functions import *

df = spark.createDataFrame([(["@a", "b", "c"],), ([],)], ['data'])
# join the array into a single comma-separated string
df_1 = df.withColumn('data_1', concat_ws(',', 'data'))
# strip the unwanted characters from that string
df_1 = df_1.withColumn("data_2", regexp_replace('data_1', "['{@]", ""))
#df_1.printSchema()
df_1.show()

+----------+------+------+
|      data|data_1|data_2|
+----------+------+------+
|[@a, b, c]|@a,b,c| a,b,c|
|        []|      |      |
+----------+------+------+
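
If the result needs to stay an array (as in the question), a possible follow-up, sketched under the assumption that tokens never contain commas, is to split the cleaned string back into an array (note that the empty row then becomes [''] rather than []):

from pyspark.sql.functions import split

# hypothetical follow-up: turn the cleaned comma-separated string back into an array
df_1 = df_1.withColumn("data_3", split("data_2", ","))
df_1.select("data", "data_3").show()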
Arun Gunalan
  • Yeah, this is what I tried at first, but there is a type mismatch error: str1 is not a string but an array (of strings), which is why I thought about using explode, but with no success either. Thanks – ichafai Aug 06 '18 at 11:34
  • Nope, sorry, it still doesn't work. I'm slow to reply because even though I downsample the data, it still takes a lot of time to run commands – ichafai Aug 06 '18 at 11:44
  • I have posted with a sample; if your data doesn't look like this, then post a sample using .show() so I can get a better understanding – Arun Gunalan Aug 06 '18 at 12:06
  • It sure is helpful, but still not a solution for my problem :( (I voted up anyway ;) not sure if it was counted since I'm new here) – ichafai Aug 06 '18 at 12:19

The solution I found (as also stated by pault in the comment section):

After exploding tokens, I groupBy and aggregate with collect_list to get the tokens back in the format I want them.

Here is pault's comment: after the explode, you need to groupBy and aggregate with collect_list to get the values back into a single row. Assuming Event_Time is a unique key:

from pyspark.sql.functions import explode, regexp_replace, collect_list

df2 = (df_1
    .select("Event_Time", explode("tokens").alias("elements"))
    .select("Event_Time",
            regexp_replace("elements", "<your regex here>", "").alias("elements"))
    .groupBy("Event_Time")
    .agg(collect_list("elements").alias("tokens")))

Also, as stated by pault (which I didn't know), there is currently no way to iterate over an array in PySpark without using a udf or rdd.
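
For completeness, a udf-based sketch of that alternative (the keep_words helper and its letters-only regex are illustrative assumptions, not part of the original solution):

import re

from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StringType

# hypothetical helper: keep only tokens made entirely of letters
keep_words = udf(
    lambda tokens: [t for t in (tokens or []) if re.fullmatch(r"[a-zA-Z]+", t)],
    ArrayType(StringType()))

df2 = df_1.withColumn("tokens", keep_words("tokens"))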

ichafai

The transform() function was added in PySpark 3.1.0, which helped me accomplish this task a little more easily. The example in the question would now look like this:

from pyspark.sql import functions as F

# raw string so the SQL parser unescapes '\\w*\\d\\w*' into the Java regex \w*\d\w*
df_2 = df_1.withColumn("tokens",
                F.expr(r"transform(tokens, x -> regexp_replace(x, '\\w*\\d\\w*', ''))"))
Adil B