
I have a PySpark DataFrame whose column names are unique IDs generated by the UUID library, so I cannot query the columns by name. Each row in this DataFrame has exactly one non-null value. How do I create a new column that holds this one non-null value? I have shared a sample below, where "new_column" is the column I would like to create. Any help is appreciated. Thanks in advance.

col1    col2    col3    col4    new_column
null    null    xyz     null    xyz

I tried looking through the DataFrame operations but was unable to find a relevant solution.

pscodes

1 Answer

Let's do a `coalesce` over all the columns:

from pyspark.sql import functions as F

# coalesce returns the first non-null value among the given columns
df = df.withColumn('new_column', F.coalesce(*df.columns))

+----+----+----+----+----------+
|col1|col2|col3|col4|new_column|
+----+----+----+----+----------+
|null|null| xyz|null|       xyz|
| pqr|null|null|null|       pqr|
+----+----+----+----+----------+
Shubham Sharma
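For completeness, here is a minimal, self-contained sketch of the same approach; the column names and sample rows below are made-up stand-ins for the actual UUID-generated names:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Stand-in data: each row holds exactly one non-null value
df = spark.createDataFrame(
    [(None, None, 'xyz', None), ('pqr', None, None, None)],
    'col1 string, col2 string, col3 string, col4 string',
)

# coalesce picks the first non-null value across the listed columns,
# so with one non-null value per row it extracts exactly that value
df = df.withColumn('new_column', F.coalesce(*df.columns))
df.show()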
  • Thank you so much! This really helps. I forgot to add one detail, however: there is an "id" column which is a unique identifier. I want to ignore that column alone and coalesce the remaining columns. How do I do that? – pscodes May 25 '23 at 22:53
  • That's easy, just do: `df = df.withColumn('new_column', F.coalesce(*df.drop('id').columns))` – Shubham Sharma May 26 '23 at 03:11
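Expanding that last comment into a runnable sketch (assuming the identifier column is literally named 'id'): `df.drop('id')` returns a new DataFrame without that column, so only the remaining names are passed to `coalesce`, while the `id` column itself stays in the result.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(1, None, 'xyz'), (2, 'pqr', None)],
    'id int, col1 string, col2 string',
)

# Drop 'id' only from the list of column names fed to coalesce;
# the column itself remains in the output DataFrame
df = df.withColumn('new_column', F.coalesce(*df.drop('id').columns))
df.show()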