To keep it simple I have a df with the following schema:
root
|-- Event_Time: string (nullable = true)
|-- tokens: array (nullable = true)
| |-- element: string (containsNull = true)
Some of the elements of "tokens" contain numbers and special characters, for example:
"431883", "r2b2", "@refe98"
Is there any way I can remove all of those and keep only actual words? I want to run LDA later and want to clean my data first.
I tried regexp_replace, explode, and str.replace with no success; maybe I didn't use them correctly.
Thanks
edit2:
from pyspark.sql.functions import explode, regexp_replace
df_2 = (df_1.select(explode(df_1.tokens).alias('elements'))
        .select(regexp_replace('elements', '\\w*\\d\\w*', ""))
       )
This works only if the column is a string type. With explode I can turn the array into strings, but then the tokens are no longer in the same row... Can anyone improve on this?
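To make the goal concrete, here is a minimal sketch in plain Python of the per-row behaviour I'm after. The helper name keep_words and the "letters only" definition of a word are my own assumptions for illustration, not something Spark provides:

```python
import re

# Assumption: an "actual word" is a token made only of ASCII letters.
_WORD = re.compile(r"[A-Za-z]+")

def keep_words(tokens):
    """Return only the purely alphabetic tokens, keeping them in one list (one row)."""
    if tokens is None:  # the tokens column is nullable
        return None
    # Skip null elements and anything containing digits or special characters.
    return [t for t in tokens if t is not None and _WORD.fullmatch(t)]

print(keep_words(["hello", "431883", "r2b2", "@refe98", "world"]))  # ['hello', 'world']
```

Presumably something like this could be wrapped with pyspark.sql.functions.udf and a return type of ArrayType(StringType()) so the array stays in its original row, but I have not confirmed that this is the best way to do it.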