Iterate through row numbers of a partition and compare values to create new columns in PySpark SQL (spark 2.4.0)

Question

I want to loop over the row numbers of each partition in a dataframe, check conditions, and create extra columns depending on the result for the current row_number.

So, I have a dataframe representing contacts with a master id <-> raw id association (a master can have multiple raws). I partitioned this dataframe with a window function, using row_number() over each master's raws ordered from the most recently created:

from pyspark.sql import Window
from pyspark.sql import functions as F

# Create a window specification that partitions the rows by master,
# ordered from the most recently created raw to the oldest
window_spec = Window \
    .partitionBy(raw_df.master_id) \
    .orderBy(raw_df.created.desc())

# Create a dataframe with a row number within each partition, so that raws
# are grouped with their master from the most recent to the least recent
partionned_df = raw_df \
    .select(
        fields + [
            F.row_number().over(window_spec).alias('row_nb')
        ]
    ) \
    .orderBy(['master_id', 'row_nb'])

This is my dataframe :

partionned_df.show()  
+------+---+---------------+---------+------------+-------------+------+
|master|raw|        created|last_name|mobile_phone|norm_civility|row_nb|
+------+---+---------------+---------+------------+-------------+------+
|     1|  3|02-05 11:22:...|     null|   641203047|         null|     1|
|     1|  5|02-05 11:22:...|     null|        null|     monsieur|     2|
|     1|  6|02-05 11:22:...|     null|        null|         null|     3|
|     1|  1|02-05 10:29:...|    name1|        null|           mr|     4|
|     3| 11|02-05 11:26:...|     null|        null|     monsieur|     1|
|     3|  4|02-05 10:30:...|     null|   630254175|         null|     2|
|     5| 15|02-05 11:35:...|     null|        null|         null|     1|
|     5| 13|02-05 11:35:...|     null|        null|         null|     2|
|     5| 14|02-05 11:33:...| name3   |        null|     monsieur|     3|
|     5|  7|02-05 11:22:...|     null|        null|         null|     4|
+------+---+---------------+---------+------------+-------------+------+

What I want is a new dataframe with only distinct masters and new columns, built by iterating over the row_number of each partition and checking some conditions (for example, whether first_name is not null). If the conditions are not met on the first row_nb of the partition, check the next one, and so on, until every row number of the partition has been examined.

The final dataframe I want is something like:

+------+---------+------------+-------------+
|master|last_name|mobile_phone|norm_civility|
+------+---------+------------+-------------+
|     1|    name1|   641203047|    monsieur |
|     3|     null|   630254175|    monsieur |
|     5|    name3|        null|    monsieur |
+------+---------+------------+-------------+
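
Read as plain Python, the requirement above seems to amount to "for each master, take the first non-null value of each column in row_nb order". The sketch below assumes that interpretation. It uses the sample rows from the table (with created omitted and the name values stripped of padding), and the helper name first_non_null_per_master is hypothetical. It only illustrates the logic and is not meant to run on Spark.

```python
# Sample rows from the table above:
# (master, row_nb, last_name, mobile_phone, norm_civility)
rows = [
    (1, 1, None, 641203047, None),
    (1, 2, None, None, "monsieur"),
    (1, 3, None, None, None),
    (1, 4, "name1", None, "mr"),
    (3, 1, None, None, "monsieur"),
    (3, 2, None, 630254175, None),
    (5, 1, None, None, None),
    (5, 2, None, None, None),
    (5, 3, "name3", None, "monsieur"),
    (5, 4, None, None, None),
]


def first_non_null_per_master(rows):
    """For each master, keep the first non-null value of every column,
    scanning the rows in row_nb order (hypothetical helper)."""
    result = {}
    for master, _row_nb, *values in sorted(rows, key=lambda r: (r[0], r[1])):
        current = result.setdefault(master, [None] * len(values))
        for i, value in enumerate(values):
            # Only fill a slot that is still empty, so earlier rows win.
            if current[i] is None and value is not None:
                current[i] = value
    return result


merged = first_non_null_per_master(rows)
for master, values in merged.items():
    print(master, values)
```

For master 1, norm_civility resolves to monsieur rather than mr because row_nb 2 comes before row_nb 4.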

Thank you very much in advance for your help.

In norm_civility for master=1, you have both monsieur and mr. How should one choose between the two? If you are grouping by master, then both need to be aggregated somehow. Should they be concatenated? What's the aggregation logic? - cph_sto, Feb 08 '19 at 16:14
Since monsieur appears first, you use it instead of mr, right? - cph_sto, Feb 08 '19 at 16:24

cph_sto · Answer 1 · 2019-02-08T20:02:22.060

Aggregate this dataframe. The idea is to use the ignorenulls option of the first() function, so that each group returns its first non-null value per column.

from pyspark.sql.functions import col, first
# For each master, keep the first non-null value of each column
df = df.select('master', 'last_name', 'mobile_phone', 'norm_civility') \
    .groupBy(col('master')) \
    .agg(first(col('last_name'), ignorenulls=True).alias('last_name'),
         first(col('mobile_phone'), ignorenulls=True).alias('mobile_phone'),
         first(col('norm_civility'), ignorenulls=True).alias('norm_civility'))
df.show()
+------+---------+------------+-------------+
|master|last_name|mobile_phone|norm_civility|
+------+---------+------------+-------------+
|     5|    name3|        null|     monsieur|
|     1|    name1|   641203047|     monsieur|
|     3|     null|   630254175|     monsieur|
+------+---------+------------+-------------+

You can explore a similar question here.
