
I have a PySpark DataFrame whose column names are unique IDs generated by the UUID library, so I cannot query the columns by name. Each row in this DataFrame has exactly one non-null value. How do I create a new column that holds this one non-null value? I have shared a sample below, where "new_column" is the column I would like to create. Any help is appreciated. Thanks in advance.

col1    col2    col3    col4    new_column
null    null    xyz     null    xyz

I tried looking through the DataFrame operations but was unable to find a relevant solution.

pscodes

1 Answer

Let's do a `coalesce` over all the columns:

from pyspark.sql import functions as F

# coalesce returns the first non-null value among the given columns
df = df.withColumn('new_column', F.coalesce(*df.columns))

+----+----+----+----+----------+
|col1|col2|col3|col4|new_column|
+----+----+----+----+----------+
|null|null| xyz|null|       xyz|
| pqr|null|null|null|       pqr|
+----+----+----+----+----------+
Shubham Sharma
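For completeness, here is a minimal, self-contained sketch of the same approach; the column names and sample rows below are made-up stand-ins for the actual UUID-generated names:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Stand-in data: each row holds exactly one non-null value
df = spark.createDataFrame(
    [(None, None, 'xyz', None), ('pqr', None, None, None)],
    'col1 string, col2 string, col3 string, col4 string',
)

# coalesce picks the first non-null value across the listed columns,
# so with one non-null value per row it extracts exactly that value
df = df.withColumn('new_column', F.coalesce(*df.columns))
df.show()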
  • Thank you so much! This really helps. I forgot to add one detail, however: there is an "id" column which is a unique identifier. I want to ignore that column alone and coalesce the remaining columns. How do I do that? – pscodes May 25 '23 at 22:53
  • That's easy, just do: `df = df.withColumn('new_column', F.coalesce(*df.drop('id').columns))` – Shubham Sharma May 26 '23 at 03:11
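Expanding that last comment into a runnable sketch (assuming the identifier column is literally named 'id'): `df.drop('id')` returns a new DataFrame without that column, so only the remaining names are passed to `coalesce`, while the `id` column itself stays in the result.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(1, None, 'xyz'), (2, 'pqr', None)],
    'id int, col1 string, col2 string',
)

# Drop 'id' only from the list of column names fed to coalesce;
# the column itself remains in the output DataFrame
df = df.withColumn('new_column', F.coalesce(*df.drop('id').columns))
df.show()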