
I have a DataFrame with a column of ArrayType(StringType):

+------------------------------------+
|colname                             |
+------------------------------------+
|[foo_XX_foo, bar_YY_bar]            |
|[qwe_ZZ_rty, asd_AA_fgh, zxc_BB_vbn]|
+------------------------------------+

I would now like to extract the string between the first and the second _, i.e. the expected output is:

+------------+
|newcolname  |
+------------+
|[XX, YY]    |
|[ZZ, AA, BB]|
+------------+

Following this answer, I tried to use expr() with transform, but I couldn't get it working. Even the example from that answer, which converts all strings to uppercase, fails for me with the following error:

pyspark.sql.utils.ParseException: u"\nextraneous input '>' expecting {'(', 'SELECT', ...

How can I modify all elements of an ArrayType column? I would like to avoid using a udf.

pfnuesel

2 Answers

6

A little bit unsafe, but try something like this:

df = spark.sparkContext.parallelize([
  [["foo_XX_foo", "bar_YY_bar"]],
  [["qwe_ZZ_rty", "asd_AA_fgh", "zxc_BB_vbn"]]
]).toDF(['colname'])

# transform (Spark 2.4+) applies the lambda to each array element:
# split on "_" and keep the second token
df.selectExpr('transform(colname, x -> split(x, "_")[1]) as newcolname').show()

which results in:

+------------+
|  newcolname|
+------------+
|    [XX, YY]|
|[ZZ, AA, BB]|
+------------+
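The "unsafe" part: split(x, "_")[1] quietly returns null for any element that does not contain at least two underscores. If that matters, one alternative sketch (still Spark 2.4+, since it relies on transform) extracts with a regular expression instead; regexp_extract returns an empty string on a non-match rather than null:

df.selectExpr(
    "transform(colname, x -> regexp_extract(x, '_([^_]+)_', 1)) as newcolname"
).show()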

4

Since you're using Spark version 2.3.2, transform is not available to you (it was added in Spark 2.4). Thus, as explained in the post you linked, the best approach in general is to use a udf.
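For reference, that udf would look something like this (a minimal sketch; the helper name is mine, and it assumes no null elements):

from pyspark.sql.functions import col, udf
from pyspark.sql.types import ArrayType, StringType

# Hypothetical helper: keep the token between the first two underscores.
extract_between = udf(lambda arr: [s.split("_")[1] for s in arr], ArrayType(StringType()))
df.withColumn("newcolname", extract_between(col("colname"))).show(truncate=False)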

However, in this specific case you may be able to avoid a udf using some hacky regular expression replacements.

from pyspark.sql.functions import col, concat_ws, regexp_replace, split, trim

df.withColumn(
    "newcolname",
    regexp_replace(concat_ws(",", col("colname")), "((?<=_)[^_,]+(?=_))", " $1 ")
).withColumn(
    "newcolname",
    regexp_replace(col("newcolname"), "(_[^_ ]+_)", "")
).withColumn(
    "newcolname",
    regexp_replace(col("newcolname"), "([^_ ]+_)", "")
).withColumn(
    "newcolname",
    regexp_replace(col("newcolname"), "_([^_ ]+)", "")
).withColumn(
    "newcolname",
    split(trim(col("newcolname")), r"\s+")
).show(truncate=False)
#+------------------------------------+------------+
#|colname                             |newcolname  |
#+------------------------------------+------------+
#|[foo_XX_foo, bar_YY_bar]            |[XX, YY]    |
#|[qwe_ZZ_rty, asd_AA_fgh, zxc_BB_vbn]|[ZZ, AA, BB]|
#+------------------------------------+------------+

Explanation

First we take the ArrayType(StringType()) column and concatenate the elements together to form one string. I used the comma as the separator, which only works if the comma does not appear in your data.
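In isolation, the concatenation step produces one comma-joined string per row (output shown for the example data):

df.select(
    concat_ws(",", col("colname")).alias("joined")
).show(truncate=False)
#+--------------------------------+
#|joined                          |
#+--------------------------------+
#|foo_XX_foo,bar_YY_bar           |
#|qwe_ZZ_rty,asd_AA_fgh,zxc_BB_vbn|
#+--------------------------------+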

Next we perform a series of regexp_replace calls.

The first pattern, ((?<=_)[^_,]+(?=_)), identifies the content that you actually want to extract: the text bracketed by underscores. Each match is then replaced with the captured group surrounded by spaces, " $1 ". As with the comma separator, this assumes that spaces do not appear in your data.

For example:

df.select(
    regexp_replace(
        concat_ws(",", col("colname")), 
        "((?<=_)[^_,]+(?=_))", 
        " $1 "
    ).alias("pattern1")
).show(truncate=False)
#+--------------------------------------+
#|pattern1                              |
#+--------------------------------------+
#|foo_ XX _foo,bar_ YY _bar             |
#|qwe_ ZZ _rty,asd_ AA _fgh,zxc_ BB _vbn|
#+--------------------------------------+

The next 3 calls to regexp_replace selectively remove the parts of this string that are not desired.

Finally, only the desired content remains. The string is trimmed to remove leading/trailing whitespace and then split on whitespace to get the final result.
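To see what the three removal passes do, here is the first row's intermediate string traced with plain Python re (illustration only; Spark's Java regexes behave the same way on these inputs):

import re

s = "foo_ XX _foo,bar_ YY _bar"  # output of the first pattern
s = re.sub(r"_[^_ ]+_", "", s)   # drops "_foo,bar_"        -> "foo_ XX  YY _bar"
s = re.sub(r"[^_ ]+_", "", s)    # drops the leading "foo_" -> " XX  YY _bar"
s = re.sub(r"_[^_ ]+", "", s)    # drops the trailing "_bar" -> " XX  YY "
print(s.split())                 # ['XX', 'YY']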

pault