
To keep it simple I have a df with the following schema:

root
 |-- Event_Time: string (nullable = true)
 |-- tokens: array (nullable = true)
 |    |-- element: string (containsNull = true)

Some of the elements of "tokens" have numbers and special characters, for example:

 "431883", "r2b2", "@refe98"

Is there any way I can remove all those and keep only actual words? I want to do an LDA later and want to clean my data beforehand. I tried regexp_replace, explode, and str.replace with no success; maybe I didn't use them correctly. Thanks
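
For reference, here is a small reproducible sketch of the schema above (the Event_Time value and the token list are made-up sample data):

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, ArrayType

spark = SparkSession.builder.getOrCreate()

schema = StructType([
    StructField("Event_Time", StringType(), True),
    StructField("tokens", ArrayType(StringType(), True), True),
])

# made-up sample row mixing real words with numeric/special tokens
df_1 = spark.createDataFrame(
    [("2018-08-06 10:00:00", ["topic", "modeling", "431883", "r2b2", "@refe98"])],
    schema=schema)
df_1.printSchema()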

edit2:

from pyspark.sql.functions import explode, regexp_replace

df_2 = (df_1.select(explode(df_1.tokens).alias('elements'))
            .select(regexp_replace('elements', '\\w*\\d\\w*', "").alias('elements'))
       )

This works only if the column is of string type; with the explode method I can explode the array into strings, but they are no longer in the same row... Can anyone improve on this?

ichafai
  • After the explode, you need to `groupBy` and aggregate with `collect_list` to get the values back into a single row. Assuming `Event_Time` is a unique key: `df2 = df_1.select("Event_Time", regexp_replace(explode("tokens"), "").alias("elements")).groupBy("Event_Time").agg(collect_list("elements").alias("tokens"))` – pault Aug 06 '18 at 13:38
  • Thanks, actually that's what I'm doing right now; will see how it goes. I'm not sure Event_Time is a unique key, but I have other columns. It looks like what you wrote. Will edit if the solution works; I'm still wondering if there is an elegant solution to this. – ichafai Aug 06 '18 at 13:46
  • Unfortunately, there is currently no way to [iterate over an array](https://stackoverflow.com/questions/48993439/typeerror-column-is-not-iterable-how-to-iterate-over-arraytype) in pyspark without using an `udf` or `rdd`. – pault Aug 06 '18 at 13:47
  • Ah, very good to know, because otherwise I would probably have kept thinking about it for some time. Thank you very much – ichafai Aug 06 '18 at 13:49

3 Answers

from pyspark.sql.functions import *

df = spark.createDataFrame([(["@a", "b", "c"],), ([],)], ['data'])
# join the array into a single comma-separated string
df_1 = df.withColumn('data_1', concat_ws(',', 'data'))
# strip the unwanted characters from that string
df_1 = df_1.withColumn("data_2", regexp_replace('data_1', "['{@]", ""))
#df_1.printSchema()
df_1.show()

+----------+------+------+
|      data|data_1|data_2|
+----------+------+------+
|[@a, b, c]|@a,b,c| a,b,c|
|        []|      |      |
+----------+------+------+
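
If the result needs to stay an array (as in the question), a possible follow-up, sketched under the assumption that tokens never contain commas, is to split the cleaned string back into an array (note that the empty row then becomes [''] rather than []):

from pyspark.sql.functions import split

# hypothetical follow-up: turn the cleaned comma-separated string back into an array
df_1 = df_1.withColumn("data_3", split("data_2", ","))
df_1.select("data", "data_3").show()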
Arun Gunalan
  • Yeah, this is what I tried at first, but there is a type mismatch error: str1 is not a string but an array (of strings), which is why I thought about using explode, but with no success either. Thanks – ichafai Aug 06 '18 at 11:34
  • Nope, sorry, it still doesn't work. I'm slow to reply because even though I downsample the data, it still takes a lot of time to run commands – ichafai Aug 06 '18 at 11:44
  • I have posted with a sample; if your data doesn't look like this, then post a sample using .show() so I can get a better understanding – Arun Gunalan Aug 06 '18 at 12:06
  • It sure is helpful, but still not a solution for my problem :( (I voted up anyway ;) not sure if it was counted since I'm new here) – ichafai Aug 06 '18 at 12:19

The solution I found (as also stated by pault in the comment section):

After exploding tokens, I groupBy and aggregate with collect_list to get the tokens back in the format I want them.

Here is pault's comment: after the explode, you need to groupBy and aggregate with collect_list to get the values back into a single row. Assuming Event_Time is a unique key:

from pyspark.sql.functions import explode, regexp_replace, collect_list

df2 = (df_1
    .select("Event_Time", explode("tokens").alias("elements"))
    .select("Event_Time",
            regexp_replace("elements", "<your regex here>", "").alias("elements"))
    .groupBy("Event_Time")
    .agg(collect_list("elements").alias("tokens")))

Also, as stated by pault (which I didn't know), there is currently no way to iterate over an array in PySpark without using a udf or rdd.
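
For completeness, a udf-based sketch of that alternative (the keep_words helper and its letters-only regex are illustrative assumptions, not part of the original solution):

import re

from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StringType

# hypothetical helper: keep only tokens made entirely of letters
keep_words = udf(
    lambda tokens: [t for t in (tokens or []) if re.fullmatch(r"[a-zA-Z]+", t)],
    ArrayType(StringType()))

df2 = df_1.withColumn("tokens", keep_words("tokens"))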

ichafai

The transform() function was added in PySpark 3.1.0, which helped me accomplish this task a little more easily. The example in the question would now look like this:

from pyspark.sql import functions as F

# raw string so the SQL parser unescapes '\\w*\\d\\w*' into the Java regex \w*\d\w*
df_2 = df_1.withColumn("tokens",
                F.expr(r"transform(tokens, x -> regexp_replace(x, '\\w*\\d\\w*', ''))"))
Adil B