
Firstly, I am so lost I don't even know how to phrase or title my question.

My first dataframe was key-value pairs, which I pivoted with some simple code, giving me the dataframe you can see in image 1:

df = df.groupBy("ID", "SUBID").pivot("SOURCE", fields).agg(first(F.col("VALUE")))

(image 1: the pivoted dataframe)
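In case it helps to see what that pivot step does without Spark, here is a plain-Python sketch of the same idea. The sample rows and the SOURCE names (Name, Surname, Town) are made up for illustration; the real column names come from the question's dataframe.

```python
# Plain-Python analogue of groupBy("ID", "SUBID").pivot("SOURCE").agg(first("VALUE")):
# each (ID, SUBID) group becomes one row, each SOURCE becomes a column,
# and the first VALUE seen per (group, SOURCE) wins.
rows = [
    {"ID": 1, "SUBID": "a", "SOURCE": "Name",    "VALUE": "John"},
    {"ID": 1, "SUBID": "a", "SOURCE": "Surname", "VALUE": "Smith"},
    {"ID": 1, "SUBID": "b", "SOURCE": "Town",    "VALUE": "Leeds"},
]

pivoted = {}
for r in rows:
    key = (r["ID"], r["SUBID"])
    # setdefault mimics agg(first(...)): keep the first VALUE per SOURCE
    pivoted.setdefault(key, {}).setdefault(r["SOURCE"], r["VALUE"])

print(pivoted)
# {(1, 'a'): {'Name': 'John', 'Surname': 'Smith'}, (1, 'b'): {'Town': 'Leeds'}}
```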

It sort of has what I want, but I think I need to group by again, or merge, or something to get my desired result (image 2).

I have tried a few things, but none seem to work. The only approach that makes sense to me is either another group by or a self-join of some sort.

The result I would like (as I know it's all for the same person [ID]) would be something like the below: the same Name applied across the board, same Surname, same Town, etc. Cells that already contain values would retain them.

(image 2: the desired result)

Looking for any advice, or pointers to duplicate questions, please.

Fizor
  • check this, https://stackoverflow.com/questions/61932391/pyspark-forward-and-backward-fill-within-column-level – Suresh Oct 01 '21 at 13:16
  • Do you want to fill with default values, or with values from another line? What happens when a column already has several values within a group? Which one do you choose? – Steven Oct 01 '21 at 13:21

1 Answer


Create a window over the id column, then, for every other column, check whether the window contains exactly one distinct non-null value. If so, replace the nulls with that value; otherwise replace them with ###.

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

#creating some test data
data = [[1, 'abc', None, 'def'],
        [1, None, None, 'ghi'],
        [1, None, 'xyz', 'jkl'],
        [1, None, 'XYZ', 'mno'],
        [2, '123', '456', None],
        [2, None, None, '789']]
df = spark.createDataFrame(data, ["id", "col1", "col2", "col3"])
#+---+----+----+----+
#| id|col1|col2|col3|
#+---+----+----+----+
#|  1| abc|null| def|
#|  1|null|null| ghi|
#|  1|null| xyz| jkl|
#|  1|null| XYZ| mno|
#|  2| 123| 456|null|
#|  2|null|null| 789|
#+---+----+----+----+

w = Window.partitionBy("id")

#for each column except id: keep existing values, and fill nulls with the
#group's single distinct non-null value if there is exactly one, else "###"
filled_cols = [F.col("id")] + \
    [F.coalesce(c,
                F.when(F.size(F.collect_set(c).over(w)) == 1,
                       F.collect_set(c).over(w)[0]).otherwise("###")).alias(c)
     for c in df.columns if c != "id"]

df.select(filled_cols).show()
#+---+----+----+----+
#| id|col1|col2|col3|
#+---+----+----+----+
#|  1| abc| ###| def|
#|  1| abc| ###| ghi|
#|  1| abc| xyz| jkl|
#|  1| abc| XYZ| mno|
#|  2| 123| 456| 789|
#|  2| 123| 456| 789|
#+---+----+----+----+
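The fill rule above can be restated in plain Python, which may make the collect_set/size check easier to follow. This is only a sketch of the per-group logic, not part of the Spark answer; fill_group is a hypothetical helper operating on one column of one id group.

```python
# Pure-Python restatement of the rule: within each id group, a null is
# replaced by the group's single distinct non-null value if there is
# exactly one, otherwise by "###".
def fill_group(values):
    distinct = {v for v in values if v is not None}
    fill = next(iter(distinct)) if len(distinct) == 1 else "###"
    return [v if v is not None else fill for v in values]

print(fill_group(["abc", None, None, None]))   # ['abc', 'abc', 'abc', 'abc']
print(fill_group([None, None, "xyz", "XYZ"]))  # ['###', '###', 'xyz', 'XYZ']
```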
werner