Since you're using Spark version 2.3.2, transform
is not available to you. Thus as explained in the post you linked, the best approach in general is to use udf
.
However, in this specific case you may be able to avoid a udf
using some hacky regular expression replacements.
from pyspark.sql.functions import col, concat_ws, regexp_replace, split, trim
df.withColumn(
"newcolname",
regexp_replace(concat_ws(",", col("colname")), "((?<=_)[^_,]+(?=_))", " $1 ")
).withColumn(
"newcolname",
regexp_replace(col("newcolname"), "(_[^_ ]+_)", "")
).withColumn(
"newcolname",
regexp_replace(col("newcolname"), "([^_ ]+_)", "")
).withColumn(
"newcolname",
regexp_replace(col("newcolname"), "_([^_ ]+)", "")
).withColumn(
"newcolname",
split(trim(col("newcolname")), "\s+")
).show(truncate=False)
#+------------------------------------+------------+
#|colname |newcolname |
#+------------------------------------+------------+
#|[foo_XX_foo, bar_YY_bar] |[XX, YY] |
#|[qwe_ZZ_rty, asd_AA_fgh, zxc_BB_vbn]|[ZZ, AA, BB]|
#+------------------------------------+------------+
Explanation
First we take the ArrayType(StringType())
column and concatenate the elements together to form one string. I used the comma as the separator, which only works if the comma does not appear in your data.
Next we perform a series of regexp_replace
calls.
The first pattern ((?<=_)[^_,]+(?=_))
identifies the content that you actually want to extract: the text bracketed by the underscore. Then the matching groups are replaced the match group surrounded by spaces " $1 "
. As before with the comma separator, this assumes that spaces do not appear in your data.
For example:
df.select(
regexp_replace(
concat_ws(",", col("colname")),
"((?<=_)[^_,]+(?=_))",
" $1 "
).alias("pattern1")
).show(truncate=False)
#+--------------------------------------+
#|pattern1 |
#+--------------------------------------+
#|foo_ XX _foo,bar_ YY _bar |
#|qwe_ ZZ _rty,asd_ AA _fgh,zxc_ BB _vbn|
#+--------------------------------------+
The next 3 calls to regexp_replace
selectively remove the parts of this string that are not desired.
Finally at the end, only the desired content remains. The string is trimmed to removed trailing/leading whitespace and split on whitespace to get the final result.