
I have a DataFrame with columns "item_id" and "ds". The "ds" column holds dates in yyyy-mm-dd format. Here is an example:

+---------+----------+
|  item_id|        ds|
+---------+----------+
| 25867869|2018-05-01|
| 17190474|2018-01-02|
| 19870756|2018-01-02|
|172248680|2018-07-29|
| 41148162|2018-03-01|
+---------+----------+

I want to create a new column in which each date is associated with an integer starting from 1, such that the smallest (earliest) date gets 1, the next (second earliest) date gets 2, and so on.

I want my DataFrame to look like this:

+---------+----------+---------+
|  item_id|        ds|   number|
+---------+----------+---------+
| 25867869|2018-05-01|        3|
| 17190474|2018-01-02|        1|
| 19870756|2018-01-02|        1|
|172248680|2018-07-29|        4|
| 41148162|2018-03-01|        2|
+---------+----------+---------+

Explanation:

2018-01-02 is the earliest date, hence its number is 1. Since there are two rows with that date, 1 appears twice. After 2018-01-02 the next date is 2018-03-01, so its number is 2, and so on. How can I create such a column?

Alper t. Turker
romal j
2 Answers


This can be achieved with dense_rank over a Window:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Order the whole DataFrame by the parsed date; dense_rank assigns
// consecutive numbers, with equal dates sharing the same number.
val win = Window.orderBy(to_date(col("ds"), "yyyy-MM-dd").asc)
val df1 = df.withColumn("number", dense_rank() over win)

df1 will have the number column as required.

Note: to_date(col("ds"),"yyyy-MM-dd") is mandatory; otherwise the column is treated as a plain string rather than a date, which defeats the purpose.
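For intuition, here is a minimal plain-Scala sketch (not Spark) that reproduces the dense-rank numbering on the example dates; string sorting happens to work here only because yyyy-MM-dd strings sort chronologically:

```scala
// Dense rank by hand: sort the distinct date strings, map each date to its
// 1-based position, then look each row's date up in that map.
val ds = Seq("2018-05-01", "2018-01-02", "2018-01-02", "2018-07-29", "2018-03-01")
val rank = ds.distinct.sorted.zipWithIndex.map { case (d, i) => d -> (i + 1) }.toMap
val numbered = ds.map(d => d -> rank(d))
// numbered: List((2018-05-01,3), (2018-01-02,1), (2018-01-02,1), (2018-07-29,4), (2018-03-01,2))
```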

Praveen L
  • It is working very slowly, because my data is partitioned on ds. While running the code it also shows a warning: "WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation." Can you help me solve this? – romal j Jun 26 '18 at 05:38
  • As your requirement is to have integers in increasing order based on the date across the entire table, the entire data needs to be treated as a single partition. If you just need unique numbers (instead of consecutive numbers), there are faster alternatives. – Praveen L Jun 26 '18 at 10:46
  • How about if we find the oldest date and compute the day difference using the "datediff" function? This won't create any problem, as my date data is very dense and almost all dates are included in it. Can you please help me code my idea in Scala? – romal j Jun 28 '18 at 07:51
  • I feel this is well worth a try. Even though you may not get consecutive numbers, I hope it satisfies your requirement. – Praveen L Jun 28 '18 at 07:54
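The datediff idea from the comments could be sketched in Spark Scala like this (a sketch, assuming df has the string column "ds" in yyyy-MM-dd format; "number" becomes the day offset from the earliest date plus 1, so numbers are consecutive only when every date is present):

```scala
import org.apache.spark.sql.functions._

// Sketch of the comment's idea: number = days since the earliest date + 1.
// Avoids the single-partition window, at the cost of gaps when dates are missing.
val withDate = df.withColumn("d", to_date(col("ds"), "yyyy-MM-dd"))
val minDate  = withDate.agg(min("d")).first().getDate(0)   // earliest date
val df2 = withDate
  .withColumn("number", datediff(col("d"), lit(minDate)) + 1)
  .drop("d")
```

On the example data this gives 1 for both 2018-01-02 rows, 59 for 2018-03-01, and so on: unique and increasing, but not consecutive.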

You could write a query to get the oldest row without a number, something like:

SELECT * FROM tablename WHERE number IS NULL ORDER BY ds ASC

Then make another query to get the greatest number:

SELECT * FROM tablename ORDER BY number DESC

Then, if both queries return the same date, update the table with the same number:

UPDATE tablename SET number = 'greatest number from first query' WHERE ds = 'the date from first query'

or, if the dates are different, do the same but add 1 to the number:

UPDATE tablename SET number= 'greatest number from first query' + 1 WHERE ds = 'the date from first query'

To make this work you should first assign the number 1 to the oldest entry. Do this in a loop until the first query (which checks whether any number is still unset) returns nothing. The first query supposes that the empty column is all NULL; if that is not the case, change the WHERE condition to match however the column is empty.
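As a sanity check on the loop described above, here is an in-memory plain-Scala sketch of the same procedure (hypothetical, not SQL): repeatedly take the oldest un-numbered date and assign it the current greatest number plus 1; rows sharing a date share a number because the map is keyed by date:

```scala
// In-memory sketch of the iterative numbering described above.
val ds = Seq("2018-05-01", "2018-01-02", "2018-01-02", "2018-07-29", "2018-03-01")
var numbers = Map.empty[String, Int]     // date -> assigned number
var remaining = ds.distinct.sorted       // oldest un-numbered date first
while (remaining.nonEmpty) {
  val oldest = remaining.head
  val next = if (numbers.isEmpty) 1 else numbers.values.max + 1
  numbers += oldest -> next
  remaining = remaining.tail
}
```

This produces the same date-to-number mapping as dense_rank, which is why the window-function answer is usually preferable to a row-by-row loop against the database.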

Nawrez
sgaleta