Questions tagged [spark-window-function]
31 questions
0 votes, 1 answer
I want ntile(3) within ntile(3) as in subdivision within division by ntile()
I want to create an ntile(3) within an ntile(3).
I have the following table:
Customer | Total_amt | Digital_amt
1        | 100       | 45
2        | 200       | 150
3        | 150       | 23
4        | 300       | 100
5        | 350       | 350
6        | 112       | 10
7        | 312       | 15
8        | 260       | 160
9        | 232       | 150
10       | 190       | 132
I want to have…

sp22
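
A minimal PySpark sketch of one way to nest ntile(3) inside ntile(3), assuming the outer tiles are driven by Total_amt and the inner ones by Digital_amt (the question is truncated, so the ordering columns are guesses):

    from pyspark.sql import functions as F, Window

    # Outer ntile(3) over all customers, ordered by Total_amt (assumed ordering).
    outer_w = Window.orderBy("Total_amt")
    df = df.withColumn("outer_tile", F.ntile(3).over(outer_w))

    # Inner ntile(3) computed separately inside each outer tile,
    # ordered by Digital_amt (assumed ordering).
    inner_w = Window.partitionBy("outer_tile").orderBy("Digital_amt")
    df = df.withColumn("inner_tile", F.ntile(3).over(inner_w))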
0 votes, 0 answers
Can I get a specific column value based on an aggregated column?
I am aggregating a table in AWS Glue (Spark SQL) to get the min of created_on and the sum of total. However, I want to group by id, id2, and id3 and get the corresponding engment_partn_text value for the earliest created_on date. So that the first 3…

Aleksei Díaz
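
A sketch of one way to do this in PySpark (column names id, id2, id3, created_on, total, and engment_partn_text are taken from the excerpt): rank the rows inside each group by created_on, then pick the text value from the top-ranked row while aggregating.

    from pyspark.sql import functions as F, Window

    # Rank rows within each (id, id2, id3) group by created_on, earliest first.
    w = Window.partitionBy("id", "id2", "id3").orderBy("created_on")

    result = (
        df.withColumn("rn", F.row_number().over(w))
          .groupBy("id", "id2", "id3")
          .agg(
              F.min("created_on").alias("min_created_on"),
              F.sum("total").alias("sum_total"),
              # text value from the earliest created_on row (rn == 1)
              F.max(F.when(F.col("rn") == 1, F.col("engment_partn_text")))
               .alias("engment_partn_text"),
          )
    )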
0 votes, 0 answers
Default window frame in spark sql
I recently got bitten by the behavior highlighted in this question. My question is: why is the default behavior the way it is? Why does merely adding an order by to the window specification change the frame type to range and the end of the range to…

Alok Lal
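
For context, the behavior being asked about can be reproduced and worked around by spelling the frame out explicitly; a small sketch with assumed column names grp, ts, and x:

    from pyspark.sql import functions as F, Window

    # Without ORDER BY the default frame covers the whole partition,
    # so sum() yields the same total on every row.
    w_no_order = Window.partitionBy("grp")

    # Adding ORDER BY silently changes the default frame to
    # RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, i.e. a running sum.
    w_ordered = Window.partitionBy("grp").orderBy("ts")

    # Spelling out the frame restores whole-partition behavior despite the ORDER BY.
    w_explicit = (Window.partitionBy("grp").orderBy("ts")
                        .rowsBetween(Window.unboundedPreceding,
                                     Window.unboundedFollowing))

    df = (df.withColumn("total", F.sum("x").over(w_no_order))
            .withColumn("running", F.sum("x").over(w_ordered))
            .withColumn("total_again", F.sum("x").over(w_explicit)))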
0 votes, 1 answer
Group consecutive rows using spark scala with rows repeating
| space_id | template | frequency | day | timestamp …

Rohit M
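
The question itself is in Scala and only part of the schema is visible, but the usual gaps-and-islands trick for grouping consecutive repeated rows looks roughly like this in PySpark (space_id, template, and timestamp are taken from the visible header):

    from pyspark.sql import functions as F, Window

    # Gaps-and-islands: consecutive rows with the same template get the same
    # grp_key because the two row_numbers drift apart by a constant.
    w_all = Window.partitionBy("space_id").orderBy("timestamp")
    w_tpl = Window.partitionBy("space_id", "template").orderBy("timestamp")

    grouped = (
        df.withColumn("grp_key",
                      F.row_number().over(w_all) - F.row_number().over(w_tpl))
          .groupBy("space_id", "template", "grp_key")
          .agg(F.min("timestamp").alias("start_ts"),
               F.max("timestamp").alias("end_ts"),
               F.count("*").alias("rows_in_group"))
    )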
0 votes, 0 answers
Get a cumulative sum with time interval condition
I have a dataframe with group, value, date_start, and date_end. I want to take the cumulative sum of all values:
partition by group
that end before the current date_start
Here is what the data looks like:
+-----+-----+----------+----------+
|group…

Hau Phan
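
Because the condition mixes the date_end of earlier rows with the date_start of the current row, one hedged way to express it is a non-equi self-join rather than a window frame; a sketch using the column names from the excerpt:

    from pyspark.sql import functions as F

    # Non-equi self-join: for every current row, pick up all rows of the same
    # group whose date_end is strictly before the current date_start.
    joined = df.alias("cur").join(
        df.alias("prev"),
        (F.col("cur.group") == F.col("prev.group"))
        & (F.col("prev.date_end") < F.col("cur.date_start")),
        "left",
    )

    result = (
        joined.groupBy("cur.group", "cur.value", "cur.date_start", "cur.date_end")
              .agg(F.coalesce(F.sum("prev.value"), F.lit(0)).alias("cum_value"))
    )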
0 votes, 0 answers
Spark - Map udf to windows in spark dataframe
Problem Statement:
Have to group InputDf based on multiple columns (accountGuid, appID, deviceGuid, deviceMake) and order each group by time
Need to check if the test Df exists in the exact sequence in each window
If it exists, create a new…

sujoy majumder
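
A rough PySpark sketch of the described check: collect the ordered events per (accountGuid, appID, deviceGuid, deviceMake) group and test whether an assumed target sequence occurs contiguously (the event and time column names and the sequence itself are placeholders, since the question is truncated):

    from pyspark.sql import functions as F, Window

    # Ordered events per account/device group; eventTime and event are
    # placeholder column names, since the real schema is not shown.
    w = (Window.partitionBy("accountGuid", "appID", "deviceGuid", "deviceMake")
               .orderBy("eventTime")
               .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing))

    target = ["A", "B", "C"]  # placeholder for the sequence held in the test Df

    def contains_sequence(events):
        # True when the target list appears as a contiguous run inside events.
        n = len(target)
        return any(events[i:i + n] == target for i in range(len(events) - n + 1))

    contains_udf = F.udf(contains_sequence, "boolean")

    flagged = (input_df
               .withColumn("events", F.collect_list("event").over(w))
               .withColumn("has_sequence", contains_udf("events")))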
0 votes, 1 answer
transition matrix from pyspark dataframe
I have two columns (such as):
from | to
1    | 2
1    | 3
2    | 4
4    | 2
4    | 2
4    | 3
3    | 3
And I want to create a transition matrix (where the sum of the rows in a column adds up to 1):
     1.   2.   3.   4.
1.   0    0    0    0
2.   …

LaC
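
A sketch of one way to build such a transition matrix in PySpark: count the (from, to) pairs, normalise within each from value, then pivot:

    from pyspark.sql import functions as F, Window

    # Count each (from, to) transition, turn counts into per-row probabilities,
    # then pivot the "to" values into columns.
    counts = df.groupBy("from", "to").count()

    row_total = Window.partitionBy("from")
    probs = counts.withColumn("prob",
                              F.col("count") / F.sum("count").over(row_total))

    matrix = (probs.groupBy("from")
                   .pivot("to")
                   .agg(F.first("prob"))
                   .na.fill(0.0)
                   .orderBy("from"))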
0 votes, 1 answer
Return value from prior year Spark sql
I have the following table:
YearMonth | value
202101    | 5
202102    | 7
....      | 10
202201    | 4
I need to create a third column with the value from the previous year.
When I have 202201 there should be valuePriorYear equal to 5 (value from…

inspiredd
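
Assuming YearMonth is stored as an integer in YYYYMM form, one simple sketch is a self-join on YearMonth shifted back one year:

    from pyspark.sql import functions as F

    # Shift each row's YearMonth forward by one year (YYYYMM + 100) and
    # join it back, so every row can see last year's value.
    prior = df.select(
        (F.col("YearMonth") + 100).alias("YearMonth"),
        F.col("value").alias("valuePriorYear"),
    )

    result = df.join(prior, on="YearMonth", how="left")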
0 votes, 1 answer
Window function sum, multiplied by condition
I am reviewing some code and would love to have a bit more clarity.
Here is my PySpark Dataframe:
YEAR_A | YEAR_B | AMOUNT
2000   | 2001   | 5
2000   | 2000   | 4
2000   | 2001   | 3
I initiate a window function:
window = Window.partitionBy('YEAR_A')
Then I would…

lejcestlesang
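
A hedged guess at what the reviewed code goes on to do, based on the title: a conditional sum over the YEAR_A partition where only rows with YEAR_A equal to YEAR_B contribute:

    from pyspark.sql import functions as F, Window

    window = Window.partitionBy('YEAR_A')

    # Conditional sum over the partition: only rows where YEAR_A == YEAR_B
    # contribute to the total (an assumption based on the title).
    df = df.withColumn(
        "conditional_sum",
        F.sum(F.when(F.col("YEAR_A") == F.col("YEAR_B"), F.col("AMOUNT"))
               .otherwise(0)).over(window),
    )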
0 votes, 1 answer
Window function based on a condition
I have the following DF:
|-----------------------|
|Date | Val | Cond|
|-----------------------|
|2022-01-08 | 2 | 0 |
|2022-01-09 | 4 | 1 |
|2022-01-10 | 6 | 1 |
|2022-01-11 | 8 | 0 |
|2022-01-12 | 2 | 1 |
|2022-01-13 | 5 …

OdiumPura
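
The excerpt is truncated, so purely as an illustration of a condition-dependent window: a running sum over Date where only rows with Cond == 1 contribute:

    from pyspark.sql import functions as F, Window

    # Running total over Date where only rows with Cond == 1 contribute.
    w = Window.orderBy("Date").rowsBetween(Window.unboundedPreceding,
                                           Window.currentRow)

    df = df.withColumn(
        "val_if_cond",
        F.sum(F.when(F.col("Cond") == 1, F.col("Val")).otherwise(0)).over(w),
    )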
0 votes, 1 answer
To find distinct rows even when the column values are interchanged in SQL
I have a table like this:
strt end no of km
------------------------------------
California India 9000
Singapore India 3000
India Frankfurt 6700
Frankfurt Singapore 4500
India …

sri123
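
A PySpark sketch of the common trick for treating (strt, end) and (end, strt) as the same row: build an order-independent key with least/greatest and deduplicate on it:

    from pyspark.sql import functions as F

    # Build an order-independent key from the two location columns, then keep
    # a single row per key.
    deduped = (
        df.withColumn("place_a", F.least("strt", "end"))
          .withColumn("place_b", F.greatest("strt", "end"))
          .dropDuplicates(["place_a", "place_b"])
          .drop("place_a", "place_b")
    )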
0 votes, 2 answers
Sum() Window Function in PySpark without defining window spec
I am trying to add a new column "grand total" to each row of my table.
E.g.:
first_name | Order_id | price
John       | 1        | 2.5
Ali        | 2        | 2
Abdul      | 3        | 3.5
What I want…

Hassaan Anwar
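
A minimal sketch: a window with no partitioning or ordering spans the whole DataFrame, so a sum over it repeats the grand total on every row (note that this pulls all rows into a single partition):

    from pyspark.sql import functions as F, Window

    # An empty window spec covers the entire DataFrame, so the sum is the
    # grand total, repeated on every row. Spark will warn that this moves
    # all data to a single partition.
    df = df.withColumn("grand_total", F.sum("price").over(Window.partitionBy()))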
0 votes, 1 answer
adding flag based on occurrence of repetitive pattern in column categories using pyspark
I have a pyspark dataframe like this:
port#| log_date |code
1111 |2022-05-16 08:07:23|AAA
1111 |2022-05-16 08:08:23|XXX
1111 |2022-05-16 08:09:23|BBB
1111 |2022-05-16 08:10:23|CCC
1111 |2022-05-16 08:11:23|YYY
1111 |2022-05-16…

user1537434
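
The repeating pattern itself is cut off in the excerpt, so only as an illustration of the window machinery involved: flag rows whose code repeats the previous code on the same port, using lag:

    from pyspark.sql import functions as F, Window

    # Compare each code with the previous one on the same port; 1 marks a repeat.
    w = Window.partitionBy("port#").orderBy("log_date")

    df = df.withColumn(
        "is_repeat",
        (F.col("code") == F.lag("code").over(w)).cast("int"),
    )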
0 votes, 0 answers
java.lang.IllegalStateException: Error reading delta file, stateful spark structured streaming with kafka
We are running a stateful structured streaming job which reads from Kafka and writes to HDFS, and we are hitting this exception:
17/12/08 05:20:12 ERROR FileFormatWriter: Aborting job null. org.apache.spark.SparkException: Job aborted due to stage…

Sushmita
0 votes, 4 answers
find the top n unique values of a column based on ranking of another column within groups in pyspark
I have a dataframe like below:
df = pd.DataFrame({'region': [1,1,1,1,1,1,2,2,2,3],
                   'store': ['A', 'A', 'C', 'C', 'D', 'B', 'F', 'F', 'E', 'G'],
                   'call_date': ['2022-03-10', '2022-03-09', '2022-03-08', '2022-03-07',…

zesla
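
The excerpt shows a pandas frame, but in PySpark one sketch of "top n stores per region by most recent call_date" looks like this (sdf stands in for the Spark version of that DataFrame, and n is assumed):

    from pyspark.sql import functions as F, Window

    n = 2  # assumed value of "top n"

    # Most recent call_date per (region, store), then rank stores within each
    # region by that date and keep the top n.
    latest = sdf.groupBy("region", "store").agg(F.max("call_date").alias("last_call"))

    w = Window.partitionBy("region").orderBy(F.col("last_call").desc())
    top_stores = (latest.withColumn("rnk", F.row_number().over(w))
                        .filter(F.col("rnk") <= n))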