
I have a PySpark dataframe df:

+---------+------------------+
|ceil_temp|             test2|
+---------+------------------+
|       -1|[6397024, 6425417]|
|        0|[6397024, 6425417]|
|        0|[6397024, 6425417]|
|        0|[6469640, 6531963]|
|        0|[6469640, 6531963]|
|        1|[6469640, 6531963]|
+---------+------------------+

I eventually want to add a new column (final) to this dataframe, whose values are elements of the list in the test2 column, selected by the index given in the ceil_temp column. For example: if the ceil_temp column holds 0 or a negative value, the final column should get the element at index 0 of test2. Something like this:

+---------+------------------+-------+
|ceil_temp|             test2|  final|
+---------+------------------+-------+
|       -1|[6397024, 6425417]|6397024|
|        0|[6397024, 6425417]|6397024|
|        0|[6397024, 6425417]|6397024|
|        0|[6469640, 6531963]|6469640|
|        0|[6469640, 6531963]|6469640|
|        1|[6469640, 6531963]|6531963|
+---------+------------------+-------+

To achieve this, I tried to extract ceil_temp and test2 as lists using flatMap:

m = df.select("ceil_temp").rdd.flatMap(lambda x: x).collect()
q = df.select("test2").rdd.flatMap(lambda x: x).collect()

l = []
for i in range(len(m)):
    if m[i] < 0:
        m[i] = 0  # treat negative indices as index 0
    l.append(q[i][m[i]])

Then I convert this list l to a new dataframe and join it to the original dataframe on a row index column that I add using a window function:

from pyspark.sql.window import Window
from pyspark.sql.functions import row_number, lit

w = Window.orderBy(lit(1))  # no natural ordering column, so order by a constant
df = df.withColumn("columnindex", row_number().over(w))
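
For completeness, a rough sketch of the join-back step I had in mind (names here are just illustrative, and it assumes a SparkSession called spark):

from pyspark.sql import Row

# build a small dataframe from the computed list, with a matching 1-based index
final_df = spark.createDataFrame(
    [Row(columnindex=i + 1, final=v) for i, v in enumerate(l)]
)
df = df.join(final_df, on="columnindex", how="inner")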

However, the order of the lists extracted by flatMap doesn't seem to stay the same as in the parent dataframe df. I get the following:

m=[-1,0,0,0,0,1]
q=[[6469640, 6531963],[6469640, 6531963],[6469640, 6531963],[6397024, 6425417],[6397024, 6425417],[6397024, 6425417]]

Expected result:

m=[-1,0,0,0,0,1]
q=[[6397024, 6425417],[6397024, 6425417],[6397024, 6425417],[6469640, 6531963],[6469640, 6531963],[6469640, 6531963]]

Please advise on how to achieve the "final" column.


2 Answers


I think you could achieve your desired outcome by using a UDF on the rows of your dataframe.

You can then call withColumn with the result of your UDF.

import spark.implicits._

val df = spark.sparkContext.parallelize(List(
  (-1, List(6397024, 6425417)),
  (0, List(6397024, 6425417)),
  (0, List(6397024, 6425417)),
  (0, List(6469640, 6531963)),
  (0, List(6469640, 6531963)),
  (1, List(6469640, 6531963)))).toDF("ceil_temp", "test2")

import org.apache.spark.sql.functions.udf
val selectRightElement = udf {
  (ceilTemp: Int, test2: Seq[Int]) => {
    // dummy code for the example
    if (ceilTemp <= 0) test2(0) else test2(1)
  }
}

df.withColumn("final", selectRightElement(df("ceil_temp"), df("test2"))).show

Doing it that way prevents shuffling of your row order.
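
If you'd rather stay in Python, a rough PySpark sketch of the same idea (untested, the function names are just illustrative) would be:

from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

# pick index 0 of test2 when ceil_temp <= 0, otherwise index 1
def select_right_element(ceil_temp, test2):
    return test2[0] if ceil_temp <= 0 else test2[1]

select_right_element_udf = udf(select_right_element, IntegerType())

df = df.withColumn("final", select_right_element_udf("ceil_temp", "test2"))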

  • Thanks! I'm not too familiar with Scala syntax, but how do I iterate through a column which contains an array (of integers) and return a particular element in Python? (This part: if (ceilTemp <= 0) test2(0) else test2(1)) – Mia21 Aug 10 '17 at 22:14
  • Any good reference on Python should help you with that. It falls outside the scope of the original question about Spark dataframes. – Michel Lemay Aug 14 '17 at 11:20

I solved the above issue by:

df = df.withColumn("final", df.test2.getItem(df.ceil_temp))
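
One caveat: getItem with a negative index yields null, so a row with ceil_temp = -1 would not map to index 0 on its own. A small sketch (my addition, not part of the one-liner above) that clamps negative values to 0 first:

from pyspark.sql.functions import when

# clamp negative indices to 0 so ceil_temp = -1 also picks the first element
idx = when(df.ceil_temp < 0, 0).otherwise(df.ceil_temp)
df = df.withColumn("final", df.test2.getItem(idx))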