I have a dataframe like the following:
rdd = sc.parallelize([(22,'fl1.variant,fl2.variant,fl3.control','xxx','yyy','zzz'),(22,'fl1.variant, fl2.neither,fl3.control','xxx','yyy','NULL'),
(22,'fl1.variant, fl2.neither,fl4.control','xxx','yyy','NULL').(22,'fl4.variant,fl2.variant,fl4.variant','xxx1','yyy1','zzz1'),(22,'fl3.control,fl3.control,fl3.variant','xxx1','yyy1','zzz1'),
(22,'NULL','xxx1','yyy1','zzz1')])
df = rdd.toDF(['Date','Type','Data1','Data2','Data3'])
Sample data :
Date | Type | Data1 | Data2 | Data3 |
---|---|---|---|---|
22 | fl1.variant,fl2.variant,fl3.control | xxx | yyy | zzz |
22 | fl1.variant,fl2.neither,fl3.control | xxx | yyy | null |
22 | fl1.variant,fl2.neither,fl4.control | xxx | yyy | null |
22 | fl4.variant,fl2.variant,fl4.variant | xxx1 | yyy1 | zzz1 |
22 | fl3.control,fl3.control,fl3.variant | xxx1 | yyy1 | zzz1 |
22 | xxx1 | yyy1 | zzz1 |
I need to identify the distinct values of Type column data based on Data1, Data2 and Data3 columns. Type column's datatype is string separated by ,.
Based on Data1, Data2, Data3, I need to combine all the data and identify the unique values of Type column.
expected output :
data1 | data2 | data3 | type_list |
---|---|---|---|
xxx | yyy | zzz | [fl1.variant, fl2.variant,fl2.neither, fl3.control,fl4.control] |
xxx1 | yyy1 | zzz1 | [fl4.variant, fl2.variant, fl3.control, fl3.variant] |
I tried like below but could not get the expected distinct values
data_df. \
withColumn('type_split', func.split('type', ',')). \
groupBy('data1','data2','data3'). \
agg(func.array_distinct(func.flatten(func.collect_list('type_split'))).alias('type_list')). \
show(truncate=False)
|data1|data2|data3|type_list|
|:----|:----|:----|:------------|
|xxx|yyy|zzz|[fl1.variant, fl2.variant,fl3.control]|
|xxx|yyy|NULL|[fl1.variant, fl2.neither,fl3.control, fl4.control]|
|xxx1|yyy1|zzz1|[fl4.variant, fl2.variant,fl3.control,fl3.variant, ]|
In the above data second record is also get flatten irrespective of NULL value.
In the third record of Type_list there is a empty value in the array.