df = sqlContext.createDataFrame(pd.DataFrame([('id1', ['a', 'b', 'c']),
                                              ('id2', ['a', 'b', 'd']),
                                              ('id3', ['c', 'e', 'f'])],
                                             columns=['id', 'items']))
from pyspark.sql.functions import udf, col, when
from pyspark.sql.types import ArrayType, StringType, IntegerType
filter_array_udf = udf(lambda arr: [1 if (x =='a' and x=='b') else 0 for x in arr], "int")
df2 = df.withColumn("listed1", filter_array_udf(col("items")))
df2.show(20,0)
I am trying to flag a row if its `items` array contains the string 'a' or 'b', but the UDF above returns null. I am fairly new to UDFs. What do I have to change in the UDF to get the desired result?
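In case it helps narrow things down, my understanding (which may be wrong) is that the list comprehension returns a list of ints per row while the UDF return type is declared as `"int"`, which produces null, and that `x == 'a' and x == 'b'` can never be true. A plain-Python sketch of the per-row flag I'm after (`flag_row` is a hypothetical name; in Spark it would be wrapped as `udf(flag_row, IntegerType())`):

```python
def flag_row(arr):
    # Return a single int per row: 1 if the array contains 'a' or 'b', else 0.
    return 1 if any(x in ('a', 'b') for x in arr) else 0

print(flag_row(['a', 'b', 'c']))  # 1
print(flag_row(['c', 'e', 'f']))  # 0
```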
df.filter(F.array_contains(F.col('items'),'a')).show()
This works for a single string, but if I pass `['a', 'b']` as an array it throws an error:
Unsupported literal type class java.util.ArrayList [a, b]
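For what it's worth, my reading of the error is that `array_contains` expects a single literal value, not a Python list. In Spark 2.4+ something like `F.arrays_overlap(F.col('items'), F.array(F.lit('a'), F.lit('b')))` may express "contains any of", though I haven't verified that here. Per row, the semantics would amount to a set intersection, sketched in plain Python:

```python
def overlaps(arr, targets=('a', 'b')):
    # True if arr shares at least one element with targets --
    # the per-row check that an arrays_overlap-style expression would compute.
    return bool(set(arr) & set(targets))

print(overlaps(['a', 'b', 'd']))  # True
print(overlaps(['c', 'e', 'f']))  # False
```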