I'd like to run a window function in Spark SQL. I have Zeppelin sitting on top of a Spark cluster with Hadoop.

I'd like to add a row number to a table and group it by a combination of two IDs.

This is my data:

food      aisle      item    date_added
pear      1          1234    '2020-01-01 10:12'
banana    2          1233    '2020-01-02 10:12'
banana    2          1211    '2020-01-03 10:12'
banana    2          1412    '2020-01-04 10:12'
apple     1          1452    '2020-01-05 10:12'
apple     1          1334    '2020-01-06 10:12'

I'd like to turn the data into this:

food      aisle      item    date_added             rn
pear      1          1234    '2020-01-01 10:12'     1
banana    2          1233    '2020-01-02 10:12'     3
banana    2          1211    '2020-01-03 10:12'     2
banana    2          1412    '2020-01-04 10:12'     1
apple     1          1452    '2020-01-05 10:12'     2
apple     1          1334    '2020-01-06 10:12'     1

This is my query:

%sql

select
    food,
    aisle,
    item,
    row_number() over (order by date_added desc 
                            partition by food, aisle 
                            rows between unbounded preceeding and current row) as rn
from fruits

This is the error:

mismatched input 'partition' expecting {')', ',', 'RANGE', 'ROWS'}(line 5, pos 28)

How do I solve this error with Spark SQL?

Cauder

1 Answer

The correct syntax is:

row_number() over (partition by food, aisle order by date_added desc) as rn 

Inside the over() clause, partition by must come before order by, which is what the parser is complaining about. A window frame specification is also not needed for the ranking functions (row_number(), rank(), and dense_rank()).
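As a runnable sketch of the full corrected query, the snippet below uses Python's sqlite3 (SQLite 3.25+) in place of a Spark cluster; for this query the window-function syntax is the same as Spark SQL. The table name fruits and the rows are taken from the question:

```python
import sqlite3

# Build an in-memory table mirroring the question's sample data.
conn = sqlite3.connect(":memory:")
conn.execute("create table fruits (food text, aisle int, item int, date_added text)")
conn.executemany(
    "insert into fruits values (?, ?, ?, ?)",
    [
        ("pear",   1, 1234, "2020-01-01 10:12"),
        ("banana", 2, 1233, "2020-01-02 10:12"),
        ("banana", 2, 1211, "2020-01-03 10:12"),
        ("banana", 2, 1412, "2020-01-04 10:12"),
        ("apple",  1, 1452, "2020-01-05 10:12"),
        ("apple",  1, 1334, "2020-01-06 10:12"),
    ],
)

# Corrected query: partition by comes before order by, no frame clause.
rows = conn.execute("""
    select food, aisle, item, date_added,
           row_number() over (partition by food, aisle
                              order by date_added desc) as rn
    from fruits
""").fetchall()
for r in rows:
    print(r)
```

The newest row within each (food, aisle) group gets rn = 1, matching the desired output in the question.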

Gordon Linoff