Pyspark Distinct records form the string column by considering Null values in groupby

Question

I have a dataframe like the following:

rdd = sc.parallelize([(22,'fl1.variant,fl2.variant,fl3.control','xxx','yyy','zzz'),(22,'fl1.variant, fl2.neither,fl3.control','xxx','yyy','NULL'),
    (22,'fl1.variant, fl2.neither,fl4.control','xxx','yyy','NULL').(22,'fl4.variant,fl2.variant,fl4.variant','xxx1','yyy1','zzz1'),(22,'fl3.control,fl3.control,fl3.variant','xxx1','yyy1','zzz1'),
    (22,'NULL','xxx1','yyy1','zzz1')])

df = rdd.toDF(['Date','Type','Data1','Data2','Data3'])

Sample data :

Date	Type	Data1	Data2	Data3
22	fl1.variant,fl2.variant,fl3.control	xxx	yyy	zzz
22	fl1.variant,fl2.neither,fl3.control	xxx	yyy	null
22	fl1.variant,fl2.neither,fl4.control	xxx	yyy	null
22	fl4.variant,fl2.variant,fl4.variant	xxx1	yyy1	zzz1
22	fl3.control,fl3.control,fl3.variant	xxx1	yyy1	zzz1
22		xxx1	yyy1	zzz1

I need to identify the distinct values of Type column data based on Data1, Data2 and Data3 columns. Type column's datatype is string separated by ,.

Based on Data1, Data2, Data3, I need to combine all the data and identify the unique values of Type column.

expected output :

data1	data2	data3	type_list
xxx	yyy	zzz	[fl1.variant, fl2.variant,fl2.neither, fl3.control,fl4.control]
xxx1	yyy1	zzz1	[fl4.variant, fl2.variant, fl3.control, fl3.variant]

I tried like below but could not get the expected distinct values

data_df. \
        withColumn('type_split', func.split('type', ',')). \
        groupBy('data1','data2','data3'). \
     agg(func.array_distinct(func.flatten(func.collect_list('type_split'))).alias('type_list')). \
        show(truncate=False)

|data1|data2|data3|type_list|
|:----|:----|:----|:------------|
|xxx|yyy|zzz|[fl1.variant, fl2.variant,fl3.control]|
|xxx|yyy|NULL|[fl1.variant,  fl2.neither,fl3.control, fl4.control]|
|xxx1|yyy1|zzz1|[fl4.variant, fl2.variant,fl3.control,fl3.variant, ]|

In the above data second record is also get flatten irrespective of NULL value.

In the third record of Type_list there is a empty value in the array.

Your requirement leads to a conflict. How will you map the NULL value for `data3` if say all `data1`="xxx" and `data2`="yyy"? Now you have 3 `data3` values: "zzz", NULL and "zzz1". Where will you map the NULL? — Azhar Khan, Nov 14 '22 at 09:03
let us say in group by have data3 values: "zzz", NULL . so we can consider zzz and flatten records — shaa, Nov 14 '22 at 09:20
Still the same issue - you have these values for same date: "zzz", NULL, "zzz1". There is no order here - it can also be NULL, "zzz1", "zzz". — Azhar Khan, Nov 14 '22 at 09:24

Pyspark Distinct records form the string column by considering Null values in groupby

0 Answers0