PySpark PandasUDF with 2 different argument data types

Question

I have a dataframe A with a column containing a string. I want to compare this string with another dataframe B, which has a single column containing a list of tuples of strings. What I have done so far: I converted B into a list and iterate over it in a UDF (applied via withColumn). The UDF takes the string column from A as its argument and eventually returns a string (not shown here):

search_list = [('string1', 'string2', 'string3'), ('string4', 'string5', 'string6')...]
def text_search(free_search_text):
    if free_search_text != '':
        free_search_text = free_search_text
    else:
        return 'no hit'
    for pattern in search_list:
        for word in pattern:
            if word in free_search_text:
                ...

This is obviously very slow (especially since I have more UDFs with a similar structure), so I was thinking about using pandas UDFs to vectorize it. However, I am not sure whether this logic can actually be run as a pandas UDF. If I want to use the pandas UDF as part of A.withColumn(), I would have to pass col("free_search_text") as one argument and B.toPandas() as the second. My main source is this article. As the return value of the UDF is a string, I am not sure whether the SCALAR_ITER function type would work, as it should return primitive data types, not strings. Besides, it seems impossible to pass two arguments when one of them is not a pd.Series:

search_list = [('string1', 'string2', 'string3'), ('string4', 'string5', 'string6')...]
def text_search(free_search_text: StringType(), iterator: Iterator[pd.Series]) -> Iterator[pd.Series]:
    if free_search_text != '':
        free_search_text = free_search_text
    else: 
        return 'no hit'
    for pattern in iterator:
        for word in pattern:
            if word in free_search_text:
               ...
A = A.withColumn('search_result', text_search(col('free_search_text'), B.search_pattern))

B is a Pandas DataFrame. The function throws an error:

ValueError: Cannot convert column into bool: please use '&' for 'and', '|'
for 'or', '~' for 'not' when building DataFrame boolean expressions.

Any idea how I can make it work?
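
For reference, here is a minimal sketch of one direction that might work. It assumes Spark 3.x with type-hinted pandas UDFs. The search list is not passed as a second argument at all. Instead, the function captures it from the enclosing scope, so the UDF only receives the column from A. The helper names first_hit and as_pandas_udf are hypothetical, and since the real return value is elided above, the sketch simply returns the matched word. Returning 'no hit' when nothing matches is also an assumption. The error above probably arises because the function is called as plain Python on a Column object (so the `if` tries to turn a Column into a bool) rather than being registered as a UDF first.

```python
import pandas as pd

# Placeholder patterns; in the question this list is built from B.
search_list = [("string1", "string2", "string3"), ("string4", "string5", "string6")]


def first_hit(text):
    """Return the first word from search_list contained in text, else 'no hit'."""
    # Assumption: the real return value is elided in the question, so the
    # matched word itself is returned here purely for illustration.
    if not isinstance(text, str) or text == "":  # empty string or null value
        return "no hit"
    for pattern in search_list:
        for word in pattern:
            if word in text:
                return word
    return "no hit"  # assumption: no match is treated like an empty input


def text_search(free_search_text: pd.Series) -> pd.Series:
    # Receives one batch of column values as a pandas Series; search_list is
    # captured from the enclosing scope rather than passed as an argument.
    return free_search_text.map(first_hit)


def as_pandas_udf():
    # Imported lazily so the matching logic above can be tried without Spark.
    from pyspark.sql.functions import pandas_udf
    from pyspark.sql.types import StringType

    # A scalar pandas UDF may return strings; StringType is a valid return type.
    return pandas_udf(text_search, returnType=StringType())
```

With a Spark session available, this would be applied as A.withColumn('search_result', as_pandas_udf()(col('free_search_text'))). Note that the inner matching is still a Python loop per row, so any speedup would come mainly from the Arrow-based batch transfer, not from true vectorization of the search itself.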
