0

I have a dataframe like the following:

rdd = sc.parallelize([(22,'fl1.variant,fl2.variant,fl3.control','xxx','yyy','zzz'),(22,'fl1.variant, fl2.neither,fl3.control','xxx','yyy','NULL'),
    (22,'fl1.variant, fl2.neither,fl4.control','xxx','yyy','NULL').(22,'fl4.variant,fl2.variant,fl4.variant','xxx1','yyy1','zzz1'),(22,'fl3.control,fl3.control,fl3.variant','xxx1','yyy1','zzz1'),
    (22,'NULL','xxx1','yyy1','zzz1')])

df = rdd.toDF(['Date','Type','Data1','Data2','Data3'])

Sample data :

Date Type Data1 Data2 Data3
22 fl1.variant,fl2.variant,fl3.control xxx yyy zzz
22 fl1.variant,fl2.neither,fl3.control xxx yyy null
22 fl1.variant,fl2.neither,fl4.control xxx yyy null
22 fl4.variant,fl2.variant,fl4.variant xxx1 yyy1 zzz1
22 fl3.control,fl3.control,fl3.variant xxx1 yyy1 zzz1
22 xxx1 yyy1 zzz1

I need to identify the distinct values of Type column data based on Data1, Data2 and Data3 columns. Type column's datatype is string separated by ,.

Based on Data1, Data2, Data3, I need to combine all the data and identify the unique values of Type column.

expected output :

data1 data2 data3 type_list
xxx yyy zzz [fl1.variant, fl2.variant,fl2.neither, fl3.control,fl4.control]
xxx1 yyy1 zzz1 [fl4.variant, fl2.variant, fl3.control, fl3.variant]

I tried like below but could not get the expected distinct values

data_df. \
        withColumn('type_split', func.split('type', ',')). \
        groupBy('data1','data2','data3'). \
     agg(func.array_distinct(func.flatten(func.collect_list('type_split'))).alias('type_list')). \
        show(truncate=False)
|data1|data2|data3|type_list|
|:----|:----|:----|:------------|
|xxx|yyy|zzz|[fl1.variant, fl2.variant,fl3.control]|
|xxx|yyy|NULL|[fl1.variant,  fl2.neither,fl3.control, fl4.control]|
|xxx1|yyy1|zzz1|[fl4.variant, fl2.variant,fl3.control,fl3.variant, ]|

In the above data second record is also get flatten irrespective of NULL value.

In the third record of Type_list there is a empty value in the array.

shaa
  • 17
  • 6
  • Your requirement leads to a conflict. How will you map the NULL value for `data3` if say all `data1`="xxx" and `data2`="yyy"? Now you have 3 `data3` values: "zzz", NULL and "zzz1". Where will you map the NULL? – Azhar Khan Nov 14 '22 at 09:03
  • You need some rule to fill the missing data. – Azhar Khan Nov 14 '22 at 09:04
  • can we consider date column and flatten the records ? – shaa Nov 14 '22 at 09:18
  • let us say in group by have data3 values: "zzz", NULL . so we can consider zzz and flatten records – shaa Nov 14 '22 at 09:20
  • Still the same issue - you have these values for same date: "zzz", NULL, "zzz1". There is no order here - it can also be NULL, "zzz1", "zzz". – Azhar Khan Nov 14 '22 at 09:24

0 Answers0