I have a PySpark dataframe df:
+---------+------------------+
|ceil_temp| test2|
+---------+------------------+
| -1|[6397024, 6425417]|
| 0|[6397024, 6425417]|
| 0|[6397024, 6425417]|
| 0|[6469640, 6531963]|
| 0|[6469640, 6531963]|
| 1|[6469640, 6531963]|
+---------+------------------+
I eventually want to add a new column(final) to this dataframe whose values are elements of list in test2 column based on the index of ceil_temp column. For example: if ceil_temp column has <0 or 0 value in it, final column has the element in the 0th index of test2 column.Something like this:
+---------+------------------+--------
|ceil_temp| test2|final |
+---------+------------------+--------
| -1|[6397024, 6425417]|6397024|
| 0|[6397024, 6425417]|6397024|
| 0|[6397024, 6425417]|6397024|
| 0|[6469640, 6531963]|6469640|
| 0|[6469640, 6531963]|6469640|
| 1|[6469640, 6531963]|6531963|
+---------+------------------+--------
To achieve this, I tried to extract ceil_temp and test2 as lists using flatMap:
m =df.select("ceil_temp").rdd.flatMap(lambda x: x).collect()
q= df.select("test2").rdd.flatMap(lambda x: x).collect()
l=[]
for i in range(len(num)):
if m[i]<0:
m[i]=0
else:
pass
l.append(q[i][m[i]])
Then converting this list l to a new df and joining it with original dataframe based on row index column that i add based on window function:
w = Window().orderBy()
df=df.withColumn("columnindex", rowNumber().over(w)).
However, the order of the lists extracted by flatMap doesn't seem to remain the same as that of parent dataframe df. I get the following:
m=[-1,0,0,0,0,1]
q=[[6469640, 6531963],[6469640, 6531963],[6469640, 6531963],[6397024, 6425417],[6397024, 6425417],[6397024, 6425417]]
Expected result:
m=[-1,0,0,0,0,1]
q=[[6397024, 6425417],[6397024, 6425417],[6397024, 6425417],[6469640, 6531963],[6469640, 6531963],[6469640, 6531963]]
Please advise on how to achieve the "final" column.