
I have a dataframe and I want to check if one of its columns contains at least one of a list of keywords:

from pyspark.sql import types as T
import pyspark.sql.functions as fn
key_labels = ["COMMISSION", "COM", "PRET", "LOAN"]

def containsAny(string, array):
    if len(string) == 0:
        return False
    else:
        return (any(word in string for word in array))

contains_udf = fn.udf(containsAny, T.BooleanType())

df = spark.createDataFrame([("COMMISSION", "1"), ("CAMMISSION", "2")], ("original", "id"))
df.withColumn("keyword_match", contains_udf(fn.col("original"), key_labels)).show()

When I run this code, I get the following error:

Py4JError: An error occurred while calling z:org.apache.spark.sql.functions.col.
Trace: py4j.Py4JException:
Method col([class java.util.ArrayList]) does not exist

What am I doing wrong?

Grigory Sharkov
  • Does this answer your question? [Filter pyspark dataframe if contains a list of strings](https://stackoverflow.com/questions/58186366/filter-pyspark-dataframe-if-contains-a-list-of-strings) or probably this https://stackoverflow.com/questions/48869922/how-to-efficiently-check-if-a-list-of-words-is-contained-in-a-spark-dataframe – ggagliano May 06 '20 at 13:46
  • it does, however I wanted to know why passing a list to udf does not work in this case? – Grigory Sharkov May 06 '20 at 14:01
  • in your case udf expects that key_labels is a column or the name of a column, instead it is not. This answers almost your other question: https://stackoverflow.com/a/47912902/7306659 – ggagliano May 06 '20 at 14:31

3 Answers


To make your function work, you should pass an array column to compare against:

df.select(fn.array([fn.lit(i) for i in key_labels])).show(truncate=False)

+----------------------------------+
|array(COMMISSION, COM, PRET, LOAN)|
+----------------------------------+
|[COMMISSION, COM, PRET, LOAN]     |
|[COMMISSION, COM, PRET, LOAN]     |
+----------------------------------+

So your code would look like this:

def containsAny(string, array):
    if len(string) == 0:
        return False
    else:
        return (any(word in string for word in array))

contains_udf = fn.udf(containsAny, T.BooleanType())
(df.withColumn("keyword_match", contains_udf(fn.col("original"),
 fn.array([fn.lit(i) for i in key_labels])))).show()

Outputs:

+----------+---+-------------+
|  original| id|keyword_match|
+----------+---+-------------+
|COMMISSION|  1|         true|
|CAMMISSION|  2|        false|
+----------+---+-------------+

However, you could also use `isin`:

df.withColumn('keyword_match',df['original'].isin(key_labels)).show()

+----------+---+-------------+
|  original| id|keyword_match|
+----------+---+-------------+
|COMMISSION|  1|         true|
|CAMMISSION|  2|        false|
+----------+---+-------------+
anky
    exactly, no need for udf, `isin` is the way to go – murtihash May 06 '20 at 18:33
  • 2
    The reason I am not using `isin` is that `original` contains other symbols, so to apply that solution I would first need to split the string into words and then cycle through the array. Also, sometimes the string I am searching will contain several words at once, so substring `contains` is the right choice... – Grigory Sharkov May 24 '20 at 18:50
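
The distinction raised in that comment can be shown in plain Python (the value below is a hypothetical multi-word string, not from the question's data): `isin` corresponds to exact membership in the list, while the udf does substring matching.

```python
key_labels = ["COMMISSION", "COM", "PRET", "LOAN"]

# Hypothetical value containing a keyword plus extra words
value = "COMMISSION DE GESTION"

# Exact membership (what isin does): the full string must equal a keyword
print(value in key_labels)                  # False

# Substring matching (what the udf does): any keyword inside the string
print(any(k in value for k in key_labels))  # True
```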

Another solution that works as well is the `rlike` function. In fact, it runs much faster than a udf.

regex = "|".join(r"(" + x + r")" for x in key_labels)
df = spark.createDataFrame([("COMMISSION", "1"), ("CAMMISSION", "2")], ("original", "id"))
df.select("original","id",fn.col("original").rlike(regex).alias("keyword_match")).show()
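
One caveat, as a sketch: if a keyword could ever contain regex metacharacters (a `.`, `+`, etc.), escape it with the standard library's `re.escape` before joining; the per-keyword grouping parentheses are not needed for a plain alternation.

```python
import re

key_labels = ["COMMISSION", "COM", "PRET", "LOAN"]

# Escape each keyword so it is matched literally, then join with "|"
regex = "|".join(re.escape(x) for x in key_labels)
print(regex)  # COMMISSION|COM|PRET|LOAN
```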
Peter Csala
Grigory Sharkov

The solution that worked for me was passing the list as a default argument value, but I still do not understand why the other way did not work:

def containsAny(string, array=key_labels):
    if len(string) == 0:
        return False
    else:
        return (any(word in string for word in array))

contains_udf = fn.udf(containsAny, T.BooleanType())

df = spark.createDataFrame([("COMMISSION", "1"), ("CAMMISSION", "2")], ("original", "id"))
df.withColumn("keyword_match", contains_udf(fn.col("original"))).show()
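
Why this works, as a sketch: Python evaluates default argument values once, at function definition time, so `key_labels` is baked into the function itself and Spark never sees it as a (non-existent) column argument. A closure achieves the same thing without relying on defaults; the `make_contains_any` factory below is a hypothetical name for illustration.

```python
key_labels = ["COMMISSION", "COM", "PRET", "LOAN"]

def make_contains_any(keywords):
    # keywords is captured in the closure, so the returned function
    # takes only the string argument Spark will supply per row
    def contains_any(string):
        return bool(string) and any(word in string for word in keywords)
    return contains_any

contains_any = make_contains_any(key_labels)
print(contains_any("COMMISSION"))  # True
print(contains_any("CAMMISSION"))  # False
```

In Spark this would be wrapped the same way as in the question, e.g. `fn.udf(make_contains_any(key_labels), T.BooleanType())`, and called with the single column argument.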
Grigory Sharkov