
I am trying to filter with isin() through df.filter(), but the columns have to be chosen dynamically based on the user arguments that were passed. I append the individual conditions to a list and join them later, which converts everything into a string; since the isin() calls are stringified too, filter() throws an error (illustrated after the snippet below). How do I keep isin() as an expression and not as a str?

tmp = []
count = 0
if bl is not None:
    tmp.append(f"df.b.isin({args.bl})")
    count += 1
if cl is not None:
    tmp.append(f"df.cl.isin({args.cl})")
    count += 1
if cl_g is not None:
    tmp.append(f"df.cl_g.isin({args.cl_g})")
    count += 1
if pl is not None:
    tmp.append(f"df.pl.isin({args.pl})")
    count += 1
if pl_b is not None:
    tmp.append(f"df.pl_b.isin({args.pl_b})")
    count += 1

if count > 1:
    return df.filter(" and ".join(tmp))
elif count > 0:
    return df.filter("".join(tmp))

1 Answer

from functools import reduce
from operator import and_  
from pyspark.sql import functions as F

col_list = ["bl", "cl", ...]
args_dict = {}
for col in col_list:
    val = getattr(args, col, None)  # look up the user-supplied values for this column
    if val is not None:
        args_dict[col] = val
if args_dict:
    df = df.where(
        reduce(and_, (F.col(key).isin(vals) for key, vals in args_dict.items()), F.lit(True))
    )

Something like this should work if I am understanding your question correctly. The reduce call chains the and_ operator across the isin() conditions built from each key/value pair in the dictionary, so the filter stays a Column expression instead of a string.
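
A quick usage sketch, assuming args carries an optional list of values per column (the DataFrame, column values, and args_dict contents here are illustrative, not from the question):

from functools import reduce
from operator import and_
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, "x"), (2, "y"), (3, "x")], ["bl", "cl"])

# As if the user passed values for bl and cl and omitted the other columns.
args_dict = {"bl": [1, 2], "cl": ["x"]}

# Chain the per-column isin() conditions with & and apply them as one Column.
cond = reduce(and_, (F.col(k).isin(v) for k, v in args_dict.items()), F.lit(True))
df.where(cond).show()  # keeps only the row (1, "x")

Seeding reduce with F.lit(True) also keeps the expression valid when only one condition (or none) is present.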

Jeremy