
I have a PySpark DataFrame:

valuesCol = [('Sweden',31),('Norway',62),('Iceland',13),('Finland',24),('Denmark',52)]
df = sqlContext.createDataFrame(valuesCol,['name','id'])
+-------+---+
|   name| id|
+-------+---+
| Sweden| 31|
| Norway| 62|
|Iceland| 13|
|Finland| 24|
|Denmark| 52|
+-------+---+

I want to add a column to this DataFrame containing the row number (serial number) of each row.

My final output should be:

+-------+---+--------+
|   name| id|row_num |
+-------+---+--------+
| Sweden| 31|       1|
| Norway| 62|       2|
|Iceland| 13|       3|
|Finland| 24|       4|
|Denmark| 52|       5|
+-------+---+--------+

My Spark version is 2.2.

I am trying this code, but it doesn't work:

from pyspark.sql.functions import row_number
from pyspark.sql.window import Window
w = Window().orderBy()
df = df.withColumn("row_num", row_number().over(w))
df.show()

I am getting an error:

AnalysisException: 'Window function row_number() requires window to be ordered, please add ORDER BY clause. For example SELECT row_number()(value_expr) OVER (PARTITION BY window_partition ORDER BY window_ordering) from table;'

If I understand correctly, I need to order by some column, but I don't want something like w = Window().orderBy('id'), because that will reorder the entire DataFrame.

Can anyone suggest how to achieve the above-mentioned output using the row_number() function?

cph_sto
  • @cph_sto - you may also like this: https://stackoverflow.com/questions/41313488/avoid-performance-impact-of-a-single-partition-mode-in-spark-window-functions – vikrant rana Jun 25 '19 at 00:20

2 Answers


You need to define a column for the ORDER BY clause. If you don't need to order the values, then order by a dummy value. Try the below:

from pyspark.sql.functions import row_number, lit
from pyspark.sql.window import Window

# row_number() requires an ORDER BY; a constant literal satisfies it
w = Window().orderBy(lit('A'))
df = df.withColumn("row_num", row_number().over(w))
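For reference, a quick check on the example data from the question (note that ordering by a constant gives no guaranteed row order in general, though it reproduces the input order here):

df.show()
+-------+---+-------+
|   name| id|row_num|
+-------+---+-------+
| Sweden| 31|      1|
| Norway| 62|      2|
|Iceland| 13|      3|
|Finland| 24|      4|
|Denmark| 52|      5|
+-------+---+-------+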
Ali Yesilli
  • Thank you Sir. It works perfectly. Just a small question - I was missing `lit('A')`. Can you kindly explain what this part of the code is doing? What is 'A' here, as it doesn't appear in the final output anyway. I will accept it as the answer anyway, because it yields the expected output. – cph_sto Oct 29 '18 at 09:37
  • It is a dummy value. It means nothing; you can write anything instead of 'A'. – Ali Yesilli Oct 29 '18 at 09:42
  • Understood, thanks :) Just one last question - I have seen that row_number() is often used along with partitionBy(), so if I load data from HDFS and add a column of row numbers, like above, will there be a reshuffle of the partitions? I know that Spark will only trigger execution when an action is called, and that Catalyst will rearrange operations to yield an optimal plan. My query: I think there will be no repartitioning of the data by using row_number() after we load data from HDFS (and before we invoke any action), but I just wanted to seek your perspective! – cph_sto Oct 29 '18 at 10:01
  • I think it will work. If you don't need to group data and obtain row numbers for each group, there is no need to use the partitionBy clause. – Ali Yesilli Oct 29 '18 at 10:11
  • Perfect solution! – Induraj PR Mar 29 '21 at 22:51
  • I guess this is still sorting, and so has a performance impact, right? – steco Jul 13 '21 at 16:36
  • It definitely has an impact, since it's sorting. I added this and I can notice a delay (1 TB of data). – Equinox Nov 15 '21 at 14:08
  • `WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.` I get the above warning. Is there any better way to implement this without getting the warning, if I do not need grouping or partitioning? I just want a row number. – lunbox Nov 16 '21 at 15:58
  • @lunbox I get the same warning. Were you able to find a resolution for this? – mang4521 Mar 11 '22 at 16:27
  • I was getting the same warning, then I changed my code to this: `w = Window.partitionBy('seller_id').orderBy('seller_id')` and used it this way: `df.withColumn('row_number', row_number().over(w)).withColumn('total_earned', sum(col('price')).over(w)).where(col('row_number') == 1).select('seller_name', 'total_earned').show()` – Gregory Iyama May 14 '22 at 03:20

I had a similar problem, but in my case @Ali Yesilli's solution failed, because I was reading multiple input files separately and ultimately unioning them all into a single DataFrame. In this case, the order within a window ordered by a dummy value proved to be unpredictable.

So to achieve more robust ordering, I used monotonically_increasing_id:

from pyspark.sql.functions import monotonically_increasing_id, row_number
from pyspark.sql.window import Window
# capture the current row order before applying the window
df = df.withColumn('original_order', monotonically_increasing_id())
df = df.withColumn('row_num', row_number().over(Window.orderBy('original_order')))
df = df.drop('original_order')
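A note on why ordering by monotonically_increasing_id preserves the original order: per the Spark documentation, the generated IDs put the partition index in the upper 31 bits and the per-partition record number in the lower 33 bits, so they are monotonically increasing and unique but not consecutive. A small illustration of what the intermediate column can look like on a multi-partition DataFrame (example values, not from the original answer):

df.withColumn('original_order', monotonically_increasing_id()).show()
# values jump between partitions, e.g. 0, 1, 2, 8589934592, 8589934593, ...
# (8589934592 = 2**33, i.e. the first record of partition 1)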
Waiski
  • Wouldn't this approach cause: `WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.`? – dataviews Aug 07 '23 at 16:41
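For readers hitting that warning: an un-partitioned window does indeed move all rows to a single partition. One way to get a row number without a window at all is RDD.zipWithIndex, which assigns a 0-based index while preserving partition order and without collecting everything onto one partition. A minimal sketch, assuming the example DataFrame from the question (this is not from either answer above):

# zipWithIndex pairs each row with its index; it only runs a small extra
# job to count partition sizes rather than shuffling all rows together
df_indexed = (df.rdd
              .zipWithIndex()
              .map(lambda pair: pair[0] + (pair[1] + 1,))  # Row is a tuple; append a 1-based index
              .toDF(df.columns + ['row_num']))
df_indexed.show()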