
I have a dataframe and I want to check if one of its columns contains at least one of a list of keywords:

from pyspark.sql import types as T
import pyspark.sql.functions as fn
key_labels = ["COMMISSION", "COM", "PRET", "LOAN"]

def containsAny(string, array):
    if len(string) == 0:
        return False
    else:
        return (any(word in string for word in array))

contains_udf = fn.udf(containsAny, T.BooleanType())

df = spark.createDataFrame([("COMMISSION", "1"), ("CAMMISSION", "2")], ("original", "id"))
df.withColumn("keyword_match", contains_udf(fn.col("original"), key_labels)).show()

When I run this code, I get the following error:

Py4JError: An error occurred while calling z:org.apache.spark.sql.functions.col.
Trace: py4j.Py4JException:
Method col([class java.util.ArrayList]) does not exist

What am I doing wrong?

Grigory Sharkov
  • Does this answer your question? [Filter pyspark dataframe if contains a list of strings](https://stackoverflow.com/questions/58186366/filter-pyspark-dataframe-if-contains-a-list-of-strings) or probably this https://stackoverflow.com/questions/48869922/how-to-efficiently-check-if-a-list-of-words-is-contained-in-a-spark-dataframe – ggagliano May 06 '20 at 13:46
  • it does, however I wanted to know why passing a list to udf does not work in this case? – Grigory Sharkov May 06 '20 at 14:01
  • in your case udf expects that key_labels is a column or the name of a column, instead it is not. This answers almost your other question: https://stackoverflow.com/a/47912902/7306659 – ggagliano May 06 '20 at 14:31

3 Answers


To make your function work, you should pass an array column to compare against:

df.select(fn.array([fn.lit(i) for i in key_labels])).show(truncate=False)

+----------------------------------+
|array(COMMISSION, COM, PRET, LOAN)|
+----------------------------------+
|[COMMISSION, COM, PRET, LOAN]     |
|[COMMISSION, COM, PRET, LOAN]     |
+----------------------------------+

So your code would look like this:

def containsAny(string, array):
    if len(string) == 0:
        return False
    else:
        return (any(word in string for word in array))

contains_udf = fn.udf(containsAny, T.BooleanType())
(df.withColumn("keyword_match", contains_udf(fn.col("original"),
 fn.array([fn.lit(i) for i in key_labels])))).show()

Outputs:

+----------+---+-------------+
|  original| id|keyword_match|
+----------+---+-------------+
|COMMISSION|  1|         true|
|CAMMISSION|  2|        false|
+----------+---+-------------+

However, you could also use `isin`:

df.withColumn('keyword_match',df['original'].isin(key_labels)).show()

+----------+---+-------------+
|  original| id|keyword_match|
+----------+---+-------------+
|COMMISSION|  1|         true|
|CAMMISSION|  2|        false|
+----------+---+-------------+
anky
    exactly, no need for udf, `isin` is the way to go – murtihash May 06 '20 at 18:33
  • 2
    The reason I am not using `isin` is that `original` contains other symbols, so to apply that solution I would first need to split the string into words and then cycle through the array. Also, sometimes the string I am searching will contain several words at once, so substring `contains` is the right choice... – Grigory Sharkov May 24 '20 at 18:50
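
The distinction raised in that comment can be shown in plain Python (the value below is a hypothetical multi-word string, not from the question's data): `isin` corresponds to exact membership in the list, while the udf does substring matching.

```python
key_labels = ["COMMISSION", "COM", "PRET", "LOAN"]

# Hypothetical value containing a keyword plus extra words
value = "COMMISSION DE GESTION"

# Exact membership (what isin does): the full string must equal a keyword
print(value in key_labels)                  # False

# Substring matching (what the udf does): any keyword inside the string
print(any(k in value for k in key_labels))  # True
```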

Another solution that works as well is the `rlike` function. In fact, it runs much faster than a udf.

regex = "|".join(r"(" + x + r")" for x in key_labels)
df = spark.createDataFrame([("COMMISSION", "1"), ("CAMMISSION", "2")], ("original", "id"))
df.select("original","id",fn.col("original").rlike(regex).alias("keyword_match")).show()
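
One caveat, as a sketch: if a keyword could ever contain regex metacharacters (a `.`, `+`, etc.), escape it with the standard library's `re.escape` before joining; the per-keyword grouping parentheses are not needed for a plain alternation.

```python
import re

key_labels = ["COMMISSION", "COM", "PRET", "LOAN"]

# Escape each keyword so it is matched literally, then join with "|"
regex = "|".join(re.escape(x) for x in key_labels)
print(regex)  # COMMISSION|COM|PRET|LOAN
```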
Peter Csala
Grigory Sharkov

The solution that worked for me was passing the list as a default argument value, but I still do not understand why the other way did not work:

def containsAny(string, array=key_labels):
    if len(string) == 0:
        return False
    else:
        return (any(word in string for word in array))

contains_udf = fn.udf(containsAny, T.BooleanType())

df = spark.createDataFrame([("COMMISSION", "1"), ("CAMMISSION", "2")], ("original", "id"))
df.withColumn("keyword_match", contains_udf(fn.col("original"))).show()
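
Why this works, as a sketch: Python evaluates default argument values once, at function definition time, so `key_labels` is baked into the function itself and Spark never sees it as a (non-existent) column argument. A closure achieves the same thing without relying on defaults; the `make_contains_any` factory below is a hypothetical name for illustration.

```python
key_labels = ["COMMISSION", "COM", "PRET", "LOAN"]

def make_contains_any(keywords):
    # keywords is captured in the closure, so the returned function
    # takes only the string argument Spark will supply per row
    def contains_any(string):
        return bool(string) and any(word in string for word in keywords)
    return contains_any

contains_any = make_contains_any(key_labels)
print(contains_any("COMMISSION"))  # True
print(contains_any("CAMMISSION"))  # False
```

In Spark this would be wrapped the same way as in the question, e.g. `fn.udf(make_contains_any(key_labels), T.BooleanType())`, and called with the single column argument.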
Grigory Sharkov