Questions tagged [spark-window-function]

31 questions
0
votes
1 answer

I want ntile(3) within ntile(3), as in a subdivision within a division by ntile()

I want to create an ntile(3) within an ntile(3). I have the following table: Customer Total_amt Digital_amt 1 100 45 2 200 150 3 150 23 4 300 100 5 350 350 6 112 10 7 312 15 8 260 160 9 232 150 10 190 132 I want to have…
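A minimal PySpark sketch of one way to do this, assuming a DataFrame df with the Customer, Total_amt and Digital_amt columns from the excerpt; the descending orderings are an assumption:

from pyspark.sql import Window
from pyspark.sql import functions as F

# First tercile over all customers by Total_amt (the "division").
w_outer = Window.orderBy(F.col("Total_amt").desc())
df = df.withColumn("division", F.ntile(3).over(w_outer))

# Second tercile computed separately inside each division (the "subdivision").
w_inner = Window.partitionBy("division").orderBy(F.col("Digital_amt").desc())
df = df.withColumn("subdivision", F.ntile(3).over(w_inner))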
0
votes
0 answers

Can I get a specific column value based on an aggregated column?

I am aggregating a table in AWS Glue (Spark SQL) to get the min of created_on and the sum of total. However, I want to group by id, id2, id3 and get the corresponding engment_partn_text value for the earliest created_on date, so that the first 3…
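One common way to keep the value belonging to the earliest created_on is to rank the rows per group with a window before aggregating. A hedged PySpark sketch, assuming a DataFrame df with the columns named in the excerpt:

from pyspark.sql import Window
from pyspark.sql import functions as F

w = Window.partitionBy("id", "id2", "id3").orderBy("created_on")

result = (df.withColumn("rn", F.row_number().over(w))
            .groupBy("id", "id2", "id3")
            .agg(F.min("created_on").alias("min_created_on"),
                 F.sum("total").alias("sum_total"),
                 # value of engment_partn_text on the earliest created_on row
                 F.max(F.when(F.col("rn") == 1, F.col("engment_partn_text")))
                  .alias("engment_partn_text")))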
0
votes
0 answers

Default window frame in Spark SQL

I recently got bitten by the behavior highlighted in this question. My question is why the default behavior is the way it is. Why does merely adding an order by to a window specification change the frame type to range and the end of the range to…
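For context, the behavior being asked about: with an ORDER BY and no explicit frame, Spark uses RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, so the aggregate becomes a running one. A small PySpark illustration with assumed column names grp, ts and value:

from pyspark.sql import Window
from pyspark.sql import functions as F

# Default frame once orderBy is present: running sum up to the current row.
w_default = Window.partitionBy("grp").orderBy("ts")

# Explicit frame restoring the whole-partition aggregate.
w_full = (Window.partitionBy("grp").orderBy("ts")
                .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing))

df = (df.withColumn("running_sum", F.sum("value").over(w_default))
        .withColumn("total_sum", F.sum("value").over(w_full)))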
0
votes
1 answer

Group consecutive rows using Spark Scala with rows repeating

| space_id | template | frequency | day | timestamp …
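The excerpt cuts off before the expected output, but a common approach to grouping consecutive rows is the gaps-and-islands pattern. A PySpark sketch (the question asks for Scala, where the same Window/lag/sum calls apply), using the space_id, template and timestamp columns visible in the header:

from pyspark.sql import Window
from pyspark.sql import functions as F

# Mark rows where template changes from the previous row, then take a running
# sum of those markers so each run of identical consecutive rows shares an id.
w = Window.partitionBy("space_id").orderBy("timestamp")
df = (df.withColumn("changed",
                    F.when(F.lag("template").over(w) == F.col("template"), 0).otherwise(1))
        .withColumn("block_id", F.sum("changed").over(w)))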
0
votes
0 answers

Get a cumulative sum with time interval condition

I have a dataframe with group, value, date_start, and date_end. I want to take the cumulative sum of all values, partitioned by group, that end before the current date_start. Here is what the data looks like: +-----+-----+----------+----------+ |group…
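Because the cutoff compares date_end of earlier rows with date_start of the current row, a plain window frame does not express it directly; one option is a non-equi self-join. A hedged sketch with the column names from the excerpt (it assumes the grouping columns identify each row uniquely enough for the final groupBy):

from pyspark.sql import functions as F

# Rows eligible to be counted for some other row: same group, earlier date_end.
prior = df.select(F.col("group").alias("g"),
                  F.col("value").alias("prior_value"),
                  F.col("date_end").alias("prior_end"))

result = (df.join(prior, (df["group"] == prior["g"]) &
                          (prior["prior_end"] < df["date_start"]), "left")
            .groupBy("group", "value", "date_start", "date_end")
            .agg(F.coalesce(F.sum("prior_value"), F.lit(0)).alias("cum_value")))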
0
votes
0 answers

Spark - Map UDF to windows in a Spark dataframe

Problem Statement: Have to group InputDf based on multiple columns (accountGuid, appID, deviceGuid, deviceMake) and order each group by time. Need to check if the test Df exists in the exact sequence in each window. If it exists, create a new…
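A rough sketch only, assuming the "window" here is each (accountGuid, appID, deviceGuid, deviceMake) group ordered by time, and that both frames share hypothetical time and event columns: collect the ordered events per group and test for the reference sequence with a UDF.

from pyspark.sql import functions as F
from pyspark.sql.types import BooleanType

# Reference sequence, collected to the driver from the (small) test DataFrame.
test_seq = [r["event"] for r in testDf.orderBy("time").collect()]

@F.udf(returnType=BooleanType())
def contains_sequence(events):
    # True if test_seq appears as a contiguous run inside the ordered events.
    n = len(test_seq)
    return any(events[i:i + n] == test_seq for i in range(len(events) - n + 1))

grouped = (inputDf
           .groupBy("accountGuid", "appID", "deviceGuid", "deviceMake")
           .agg(F.sort_array(F.collect_list(F.struct("time", "event"))).alias("rows"))
           .withColumn("events", F.col("rows.event"))
           .withColumn("has_sequence", contains_sequence("events")))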
0
votes
1 answer

Transition matrix from a PySpark dataframe

I have two columns (such as): from to 1 2 1 3 2 4 4 2 4 2 4 3 3 3 And I want to create a transition matrix (where the sum of rows in a column adds up to 1): 1. 2. 3. 4. 1. 0 0 0 0 2. …
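A minimal PySpark sketch, assuming a row-normalized matrix is wanted (flip the normalization if columns should sum to 1 instead); the from/to column names follow the excerpt:

from pyspark.sql import functions as F

# Count transitions, pivot the target states into columns, then divide each
# row by its total so the probabilities in a row add up to 1.
counts = df.groupBy("from").pivot("to").count().na.fill(0)
to_cols = [c for c in counts.columns if c != "from"]
row_total = sum(F.col(c) for c in to_cols)
transition = counts.select("from", *[(F.col(c) / row_total).alias(c) for c in to_cols])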
0
votes
1 answer

Return value from prior year in Spark SQL

I have the following table: YearMonth value 202101 5 202102 7 .... 10 202201 4 I need to create a third column with the value from the previous year. When I have 202201 there should be valuePriorYear equal to 5 (value from…
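A hedged sketch of one way to do this, assuming YearMonth is an integer like 202101 so the same month of the previous year is simply YearMonth - 100 (with gapless monthly data, lag(value, 12) over an ordered window would also work):

from pyspark.sql import functions as F

# Shift each row forward by one year and join it back onto the original rows.
prior = df.select((F.col("YearMonth") + 100).alias("YearMonth"),
                  F.col("value").alias("valuePriorYear"))
result = df.join(prior, "YearMonth", "left")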
0
votes
1 answer

Window function sum, multiplied by condition

I am reviewing some code and would love a bit more clarity. Here is my PySpark Dataframe: YEAR_A YEAR_B AMOUNT 2000 2001 5 2000 2000 4 2000 2001 3 I initiate a window function: window = Window.partitionBy('YEAR_A') Then I would…
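A guess at the pattern being reviewed, since the excerpt stops mid-sentence: a conditional sum over the window, where only rows meeting a condition contribute to the aggregate (the YEAR_A == YEAR_B condition is an assumption):

from pyspark.sql import Window
from pyspark.sql import functions as F

window = Window.partitionBy('YEAR_A')

# Sum AMOUNT over the partition, but only for rows where the condition holds.
df = df.withColumn(
    "amt_same_year",
    F.sum(F.when(F.col("YEAR_A") == F.col("YEAR_B"), F.col("AMOUNT")).otherwise(0)).over(window))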
0
votes
1 answer

Window function based on a condition

I have the following DF: |-----------------------| |Date | Val | Cond| |-----------------------| |2022-01-08 | 2 | 0 | |2022-01-09 | 4 | 1 | |2022-01-10 | 6 | 1 | |2022-01-11 | 8 | 0 | |2022-01-12 | 2 | 1 | |2022-01-13 | 5 …
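The excerpt ends before the expected output, so this is only a guess at the shape of the answer: a window aggregate that counts only the rows satisfying the condition, using the Date, Val and Cond columns shown:

from pyspark.sql import Window
from pyspark.sql import functions as F

w = Window.orderBy("Date")

# Running sum of Val restricted to rows where Cond == 1.
df = df.withColumn(
    "val_when_cond",
    F.sum(F.when(F.col("Cond") == 1, F.col("Val")).otherwise(0)).over(w))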
0
votes
1 answer

To find distinct rows even when the column values are interchanged in SQL

I have a table like this: strt end no of km ------------------------------------ California India 9000 Singapore India 3000 India Frankfurt 6700 Frankfurt Singapore 4500 India …
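A sketch of the usual trick, shown in PySpark for consistency with the rest of the tag (the same least/greatest idea works in plain SQL): build a canonical key from the sorted pair of endpoints so (California, India) and (India, California) collapse together, then keep one row per key.

from pyspark.sql import functions as F

deduped = (df.withColumn("k1", F.least("strt", "end"))
             .withColumn("k2", F.greatest("strt", "end"))
             .dropDuplicates(["k1", "k2"])
             .drop("k1", "k2"))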
0
votes
2 answers

Sum() Window Function in PySpark without defining window spec

I am trying to add a new column "grand total" to my table on each row. E.g.: first_name Order_id price John 1 2.5 Ali 2 2 Abdul 3 3.5 What I want…
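Two hedged ways to attach a grand total to every row, assuming a DataFrame df with the price column from the excerpt: an empty window spec (simple, but it pulls all rows into a single partition) or a cross join against the aggregated total.

from pyspark.sql import Window
from pyspark.sql import functions as F

with_total = df.withColumn("grand_total", F.sum("price").over(Window.partitionBy()))

# Alternative without a window at all:
# with_total = df.crossJoin(df.agg(F.sum("price").alias("grand_total")))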
0
votes
1 answer

Adding a flag based on the occurrence of a repetitive pattern in column categories using PySpark

I have a pyspark dataframe like this: port#| log_date |code 1111 |2022-05-16 08:07:23|AAA 1111 |2022-05-16 08:08:23|XXX 1111 |2022-05-16 08:09:23|BBB 1111 |2022-05-16 08:10:23|CCC 1111 |2022-05-16 08:11:23|YYY 1111 |2022-05-16…
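The excerpt cuts off before the repetition rule is defined, so this is only the usual building block for such flags rather than a full answer: a window per port# ordered by log_date, with lag() exposing the previous code so the flag can be derived from whatever pattern applies.

from pyspark.sql import Window
from pyspark.sql import functions as F

w = Window.partitionBy("port#").orderBy("log_date")
df = (df.withColumn("prev_code", F.lag("code").over(w))
        .withColumn("flag", F.when(F.col("code") == F.col("prev_code"), 1).otherwise(0)))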
0
votes
0 answers

java.lang.IllegalStateException: Error reading delta file, stateful spark structured streaming with kafka

We are running a stateful structured streaming job which reads from Kafka and writes to HDFS, and we are hitting this exception: 17/12/08 05:20:12 ERROR FileFormatWriter: Aborting job null. org.apache.spark.SparkException: Job aborted due to stage…
0
votes
4 answers

Find the top n unique values of a column based on the ranking of another column within groups in PySpark

I have a dataframe like below: df = pd.DataFrame({ 'region': [1,1,1,1,1,1,2,2,2,3], 'store': ['A', 'A', 'C', 'C', 'D', 'B', 'F', 'F', 'E', 'G'], 'call_date': ['2022-03-10', '2022-03-09', '2022-03-08', '2022-03-07',…
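A minimal sketch, assuming "top n" means the n distinct stores with the most recent call_date per region (n = 2 purely as an illustration), and a Spark DataFrame rather than the pandas one in the excerpt:

from pyspark.sql import Window
from pyspark.sql import functions as F

n = 2
# Latest call per (region, store), then rank stores within each region by it.
per_store = df.groupBy("region", "store").agg(F.max("call_date").alias("last_call"))
w = Window.partitionBy("region").orderBy(F.col("last_call").desc())
top_n = per_store.withColumn("rnk", F.dense_rank().over(w)).filter(F.col("rnk") <= n)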