I have a dataframe A with a column containing a string. I want to compare this string with another dataframe B, which only has one column that contains a list of tuples of strings. What I did so far: I transformed B into a list, through which I iterated via a UDF (in withColumn). The UDF takes the string-column from A as argument and eventually returns a string (not displayed here):
search_list = [('string1', 'string2', 'string3'), ('string4', 'string5', 'string6')...]
def text_search(free_search_text):
    if free_search_text != '':
        free_search_text = free_search_text
    else:
        return 'no hit'
    for pattern in search_list:
        for word in pattern:
            if word in free_search_text:
                ...
This is obviously very slow (and I have more UDFs with a similar structure), so I was thinking about using pandas UDFs to vectorize it. However, I am not sure whether this logic can actually be run as a pandas UDF. If I want to use the pandas UDF as part of A.withColumn(), I would have to pass col("free_search_text") as one argument and B.toPandas() as a second argument. My main source is this article. As the return value of the UDF is a string, I am not sure whether the SCALAR_ITER function type would work, as it should return primitive data types, not strings. Besides, it seems impossible to pass two arguments when one of them is not a pd.Series:
search_list = [('string1', 'string2', 'string3'), ('string4', 'string5', 'string6')...]
def text_search(free_search_text: StringType(), iterator: Iterator[pd.Series]) -> Iterator[pd.Series]:
    if free_search_text != '':
        free_search_text = free_search_text
    else:
        return 'no hit'
    for pattern in iterator:
        for word in pattern:
            if word in free_search_text:
                ...
A = A.withColumn('search_result', text_search(col('free_search_text'), B.search_pattern))
B is a Pandas DataFrame. The function throws an error:
"ValueError: Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions."
Any idea how I can make it work?
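For reference, here is a minimal plain-Python sketch of the per-batch logic I am trying to get to. It rests on several assumptions: the search list is non-empty, only a hit/no-hit answer is needed, and 'hit' is a placeholder because the real return value is not shown above. The helper names build_matcher and text_search_batch are made up for illustration. The idea is that the search list is captured from the enclosing scope instead of being passed as a second argument, so the function only receives the batch of strings, which is roughly what a Series-to-Series pandas UDF would get:

```python
import re

# Placeholder patterns taken from the question; the real list is longer.
search_list = [('string1', 'string2', 'string3'), ('string4', 'string5', 'string6')]


def build_matcher(patterns):
    """Compile every word from every tuple into one alternation regex.

    Assumes `patterns` is non-empty; an empty alternation would match any text.
    """
    words = [re.escape(word) for pattern in patterns for word in pattern]
    return re.compile("|".join(words))


def text_search_batch(texts, matcher):
    """Return one result string per input text.

    'hit' is a placeholder: the value returned on a match is elided in the
    question, and returning 'no hit' when nothing matches is an assumption.
    """
    results = []
    for text in texts:
        if not text:
            results.append('no hit')   # empty string or None
        elif matcher.search(text):
            results.append('hit')      # at least one word occurs in the text
        else:
            results.append('no hit')
    return results


matcher = build_matcher(search_list)
batch_result = text_search_batch(['has string5 inside', '', 'nothing here'], matcher)
```

Compiling all words into a single regex replaces the two nested loops with one scan per string. It only answers whether any word matched, so it would need to change if the result depends on which tuple produced the hit.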