df = sqlContext.createDataFrame(pd.DataFrame([('id1', ['a', 'b', 'c']),
                                              ('id2', ['a', 'b', 'd']),
                                              ('id3', ['c', 'e', 'f'])],
                                             columns=['id', 'items']))
from pyspark.sql.functions import udf, col, when
from pyspark.sql.types import ArrayType, StringType, IntegerType
filter_array_udf = udf(lambda arr: [1 if (x =='a' and x=='b') else 0 for x in arr], "int")
df2 = df.withColumn("listed1", filter_array_udf(col("items")))
df2.show(20,0)
I am trying to flag a row if its `items` array contains the string 'a' or 'b', but the UDF above returns null. I am fairly new to UDFs. What do I have to change in the UDF to get the desired result?
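In case it helps narrow things down, my understanding (which may be wrong) is that the list comprehension returns a list of ints per row while the UDF return type is declared as `"int"`, which produces null, and that `x == 'a' and x == 'b'` can never be true. A plain-Python sketch of the per-row flag I'm after (`flag_row` is a hypothetical name; in Spark it would be wrapped as `udf(flag_row, IntegerType())`):

```python
def flag_row(arr):
    # Return a single int per row: 1 if the array contains 'a' or 'b', else 0.
    return 1 if any(x in ('a', 'b') for x in arr) else 0

print(flag_row(['a', 'b', 'c']))  # 1
print(flag_row(['c', 'e', 'f']))  # 0
```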
df.filter(F.array_contains(F.col('items'),'a')).show()
This works for a single string, but if I pass `['a', 'b']` as an array it throws an error:
Unsupported literal type class java.util.ArrayList [a, b]
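For what it's worth, my reading of the error is that `array_contains` expects a single literal value, not a Python list. In Spark 2.4+ something like `F.arrays_overlap(F.col('items'), F.array(F.lit('a'), F.lit('b')))` may express "contains any of", though I haven't verified that here. Per row, the semantics would amount to a set intersection, sketched in plain Python:

```python
def overlaps(arr, targets=('a', 'b')):
    # True if arr shares at least one element with targets --
    # the per-row check that an arrays_overlap-style expression would compute.
    return bool(set(arr) & set(targets))

print(overlaps(['a', 'b', 'd']))  # True
print(overlaps(['c', 'e', 'f']))  # False
```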