Questions tagged [partition-by]
266 questions
3 votes · 2 answers
Reset Row Number on value change, but with repeat values in partition
I'm having trouble with something very similar to this question
T-sql Reset Row number on Field Change
The solution to that question works perfectly, except that it breaks down when I try it with multiple other 'custno' values.
What I mean by that:
custno…

hello
- 33
- 1
- 3
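The usual fix for this is the gaps-and-islands pattern with the partition key added to both ROW_NUMBER() calls. A minimal sketch in SQLite (the `seq` and `code` columns and all sample rows are invented, since the question only names 'custno'):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t (custno INTEGER, seq INTEGER, code TEXT);
INSERT INTO t VALUES
  (1, 1, 'A'), (1, 2, 'A'), (1, 3, 'B'), (1, 4, 'A'),
  (2, 1, 'A'), (2, 2, 'B'), (2, 3, 'B');
""")

# The difference of two ROW_NUMBER()s is constant within each run of equal
# 'code' values; partitioning BOTH by custno keeps the numbering independent
# per customer, which is the part that breaks with multiple custno values.
rows = conn.execute("""
WITH islands AS (
  SELECT custno, seq, code,
         ROW_NUMBER() OVER (PARTITION BY custno ORDER BY seq)
       - ROW_NUMBER() OVER (PARTITION BY custno, code ORDER BY seq) AS grp
  FROM t
)
SELECT custno, seq, code,
       ROW_NUMBER() OVER (PARTITION BY custno, code, grp ORDER BY seq) AS rn
FROM islands
ORDER BY custno, seq
""").fetchall()
print(rows)
```

The row number resets on each change of `code`, separately for each `custno`.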
3 votes · 1 answer
spark: save ordered data to parquet
I have 30 TB of data partitioned by date and hour, split into 300 files per hour. I do some data conversion and then want the data sorted and saved in sorted order for easy ingestion by a C++ program. I understand that when you serialize, the…

user1978816
- 812
- 1
- 8
- 19
3 votes · 2 answers
How to find the number of active users for, say, 1 day, 2 days, 3 days… (PostgreSQL)
A distribution of # days active within a week: I am trying to find how many members are active for 1 day, 2 days, 3 days, … 7 days during a specific week (3/1-3/7).
Is there any way to use an aggregate function on top of PARTITION BY?
If not, what can be used…

KK44
- 31
- 1
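This one usually doesn't need PARTITION BY at all: aggregate twice, first collapsing to distinct active days per member, then counting members per bucket. A sketch in SQLite with an invented activity table (the question gives no schema):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE activity (member TEXT, day TEXT);
INSERT INTO activity VALUES
  ('a','2021-03-01'), ('a','2021-03-02'), ('a','2021-03-02'),
  ('b','2021-03-03'),
  ('c','2021-03-01'), ('c','2021-03-04'), ('c','2021-03-05');
""")

# Inner query: distinct active days per member within the week.
# Outer query: how many members fall into each days_active bucket.
rows = conn.execute("""
SELECT days_active, COUNT(*) AS members
FROM (
  SELECT member, COUNT(DISTINCT day) AS days_active
  FROM activity
  WHERE day BETWEEN '2021-03-01' AND '2021-03-07'
  GROUP BY member
)
GROUP BY days_active
ORDER BY days_active
""").fetchall()
print(rows)   # a was active 2 days, b 1 day, c 3 days
```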
3 votes · 1 answer
groupby and join vs window in pyspark
I have a data frame in pyspark which has hundreds of millions of rows (here is a dummy sample of it):
import datetime
import pyspark.sql.functions as F
from pyspark.sql import Window,Row
from pyspark.sql.functions import col
from…

Dan R
- 71
- 7
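The trade-off the question asks about is between aggregating and joining the result back versus computing the same value with a window function. The two produce identical rows; which is faster in Spark depends on shuffle and sort costs. The comparison can be sketched in any SQL engine; the table and columns below are invented:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE events (user_id INTEGER, amount INTEGER);
INSERT INTO events VALUES (1, 10), (1, 30), (2, 5), (2, 7);
""")

# Variant 1: aggregate per key, then join the aggregate back onto every row.
joined = conn.execute("""
SELECT e.user_id, e.amount, m.max_amount
FROM events e
JOIN (SELECT user_id, MAX(amount) AS max_amount
      FROM events GROUP BY user_id) m
  ON e.user_id = m.user_id
ORDER BY e.user_id, e.amount
""").fetchall()

# Variant 2: a window function computes the same per-row value in one pass.
windowed = conn.execute("""
SELECT user_id, amount,
       MAX(amount) OVER (PARTITION BY user_id) AS max_amount
FROM events
ORDER BY user_id, amount
""").fetchall()

print(joined == windowed)   # same rows, different physical plans
```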
3 votes · 2 answers
Scala: collect_list() over Window with keeping null values
I have a data frame like the below:
+----+----+----+
|colA|colB|colC|
+----+----+----+
|   1|   1|  23|
|   1|   2|  63|
|   1|   3|null|
|   1|   4|  32|
|   2|   2|  56|
+----+----+----+
I apply the below instructions such that I create a…

WhoAmI
- 31
- 1
- 2
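Spark's collect_list silently drops nulls; the common workaround is to collect a struct of (colB, colC) instead, since a struct containing a null is not itself null. A pure-Python analogue of that idea, using the data frame from the question with None standing in for null:

```python
from itertools import groupby

# The rows shown in the question, as (colA, colB, colC) tuples.
rows = [
    (1, 1, 23), (1, 2, 63), (1, 3, None), (1, 4, 32),
    (2, 2, 56),
]

# Collecting (colB, colC) pairs per colA keeps every position, nulls
# included - the equivalent of collect_list(struct(colB, colC)).
collected = {
    key: [(b, c) for _, b, c in grp]
    for key, grp in groupby(sorted(rows), key=lambda r: r[0])
}
print(collected[1])   # [(1, 23), (2, 63), (3, None), (4, 32)]
```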
3 votes · 0 answers
Spark partition by key
What is the difference between the two types of partitioning in Spark?
For example: I load a text file toto.csv from disk into a Spark cluster:
val text = sc.textFile("toto.csv", 100)
=> It splits my file into 100 fragments without any "rules".
After that, if I do
val…

minh-hieu.pham
- 1,029
- 2
- 12
- 21
3 votes · 3 answers
clojure: partition a seq based on a seq of values
I would like to partition a seq based on a seq of values:
(partition-by-seq [3 5] [1 2 3 4 5 6])
((1 2 3)(4 5)(6))
The first input is a seq of split points.
The second input is the seq I would like to partition,
so that the first list will be…

Thomas Deutsch
- 2,344
- 2
- 27
- 36
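The question is Clojure, but the requested behaviour is easy to pin down from the example: close the current chunk whenever the next split point is seen. A sketch of that semantics in Python (the function name is taken from the question; everything else is an assumption):

```python
def partition_by_seq(points, xs):
    # Split xs into chunks, closing a chunk each time the next
    # split-point value from `points` is encountered.
    out, cur, pending = [], [], list(points)
    for x in xs:
        cur.append(x)
        if pending and x == pending[0]:
            out.append(cur)
            cur = []
            pending.pop(0)
    if cur:                      # trailing elements after the last split point
        out.append(cur)
    return out

print(partition_by_seq([3, 5], [1, 2, 3, 4, 5, 6]))
# [[1, 2, 3], [4, 5], [6]]
```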
2 votes · 2 answers
SQL get last column in a group based on order by and value of another column
From the table below, I am trying to create another column (last_k2X) that shows the last key2 where type was set to 'X' over time (ts).
In case X was set at the same time (ts) for multiple key2 (in the same key1 partition), the new column last_k2X…

Alg_D
- 2,242
- 6
- 31
- 63
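One way to compute "last key2 where type = 'X' so far" is a correlated subquery ordered by ts (a window over a CASE expression would also work). A sketch in SQLite; the sample rows are invented, since the question's table is only partially shown, and the tie case it mentions is not handled here:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t (key1 TEXT, key2 TEXT, type TEXT, ts INTEGER);
INSERT INTO t VALUES
  ('k1','a','X',1), ('k1','b','Y',2), ('k1','c','X',3), ('k1','d','Y',4);
""")

# For each row, look back within the same key1 partition for the most
# recent row (ts <= current ts) whose type is 'X' and take its key2.
rows = conn.execute("""
SELECT key1, key2, type, ts,
       (SELECT t2.key2 FROM t t2
        WHERE t2.key1 = t.key1 AND t2.type = 'X' AND t2.ts <= t.ts
        ORDER BY t2.ts DESC LIMIT 1) AS last_k2X
FROM t
ORDER BY ts
""").fetchall()
print(rows)
```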
2 votes · 1 answer
Rank date column based on dates in another table's date column
I have the following tables
df_obs.show()
Item No | Date_Observed
Item 1  | 2021-09-20
Item 1  | 2022-12-05
Item 2  | 2022-10-27
Item 1  | 2022-09-20
Item 2  | 2023-02-20
Item 2  | 2023-03-20
Item 1  | 2023-01-20
df_purchase.withColumn("rank",…

imadari
- 23
- 5
2 votes · 1 answer
Extracted Date within a Partition By Function
I am trying to run the following SQL query but BigQuery shows an error: PARTITION BY expression references column date which is neither grouped nor aggregated
SELECT
EXTRACT(MONTH FROM date) month,
country_name,
SUM(installs),
DENSE_RANK()…

Ihsan Mercan
- 23
- 3
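The BigQuery error means the raw `date` column cannot appear inside PARTITION BY once the query is grouped; computing the monthly aggregate first and ranking over the grouped month fixes it. Sketched here in SQLite, with strftime standing in for EXTRACT and invented sample data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE stats (date TEXT, country_name TEXT, installs INTEGER);
INSERT INTO stats VALUES
  ('2021-01-05','FR',10), ('2021-01-20','FR',5),
  ('2021-01-11','DE',30), ('2021-02-02','FR',7);
""")

# Aggregate to (month, country) first, then rank within each month; the
# window now references only grouped/aggregated expressions.
rows = conn.execute("""
WITH monthly AS (
  SELECT CAST(strftime('%m', date) AS INTEGER) AS month,
         country_name,
         SUM(installs) AS installs
  FROM stats
  GROUP BY month, country_name
)
SELECT month, country_name, installs,
       DENSE_RANK() OVER (PARTITION BY month ORDER BY installs DESC) AS rnk
FROM monthly
ORDER BY month, rnk
""").fetchall()
print(rows)
```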
2 votes · 1 answer
PARTITION BY with date between 2 dates
I work on an Azure SQL Database (SQL Server).
In SQL, I am trying to get a table broken out by day, but the day column is not in the table.
I explain it with the example below:
TABLE STARTER: (Format Date: YYYY-MM-DD)
Date begin
Date…

Ouakrat
- 67
- 1
- 9
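When the table only stores a begin and an end date, a recursive CTE can expand each range into one row per day, after which a per-day PARTITION BY or join becomes possible. A sketch in SQLite; table and column names are guesses based on the excerpt:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE starter (id INTEGER, date_begin TEXT, date_end TEXT);
INSERT INTO starter VALUES (1, '2023-01-01', '2023-01-03'),
                           (2, '2023-01-02', '2023-01-02');
""")

# Seed with each row's begin date, then keep adding one day until the
# end date is reached - yielding one row per (id, day).
rows = conn.execute("""
WITH RECURSIVE days(id, day, date_end) AS (
  SELECT id, date_begin, date_end FROM starter
  UNION ALL
  SELECT id, date(day, '+1 day'), date_end
  FROM days WHERE day < date_end
)
SELECT id, day FROM days ORDER BY id, day
""").fetchall()
print(rows)
```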
2 votes · 1 answer
Write dataframe without column names as part of the file path
I have to write a Spark dataframe in the path of the format: base_path/{year}/{month}/{day}/{hour}/
If I do something like below:
pc = ["year", "month", "day", "hour"]
df.write.partitionBy(*pc).parquet("base_path/", mode = 'append')
It creates the…

seou1
- 446
- 1
- 5
- 21
2 votes · 3 answers
nth group labelling (count only graduates with new group)
Say, I have a df:
library(data.table)
df <- data.table(user = c('a', 'a', 'b')); df
user
a
a
b
I am looking to create a new column, group_id, at user level:
user group_id
a 1
a 1
b 2
Is there a fast / scalable way? I can…

Sweepy Dodo
- 1,761
- 9
- 15
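In data.table this is typically `df[, group_id := .GRP, by = user]` (or `rleid(user)` for run-based ids). The same first-appearance numbering can be sketched in plain Python to show the idea:

```python
def group_ids(users):
    # Assign each distinct user the next integer id, in order of
    # first appearance; repeats reuse the id already assigned.
    seen = {}
    return [seen.setdefault(u, len(seen) + 1) for u in users]

print(group_ids(['a', 'a', 'b']))   # [1, 1, 2]
```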
2 votes · 2 answers
How to run SUM() OVER PARTITION BY for COUNT DISTINCT
I'm trying to get the number of distinct users for each event at a daily level while maintaining a running sum for every hour.
I'm using Athena/Presto as the query engine.
I tried the following query:
SELECT
eventname,
…

nish
- 6,952
- 18
- 74
- 128
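Presto/Athena rejects COUNT(DISTINCT …) OVER (…). One well-known workaround is the double DENSE_RANK() trick: ranking the same column ascending and descending and summing the two ranks minus one yields the distinct count as a window value (it breaks if the column contains nulls). Demonstrated in SQLite with an invented schema based on the excerpt:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE events (eventname TEXT, day TEXT, userid TEXT);
INSERT INTO events VALUES
  ('click','d1','u1'), ('click','d1','u1'),
  ('click','d1','u2'), ('click','d2','u3');
""")

# For each (eventname, day) partition, rank userid ascending and
# descending: their sum minus one equals COUNT(DISTINCT userid).
rows = conn.execute("""
SELECT DISTINCT eventname, day,
       DENSE_RANK() OVER (PARTITION BY eventname, day ORDER BY userid)
     + DENSE_RANK() OVER (PARTITION BY eventname, day ORDER BY userid DESC)
     - 1 AS distinct_users
FROM events
ORDER BY day
""").fetchall()
print(rows)   # [('click', 'd1', 2), ('click', 'd2', 1)]
```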
2 votes · 1 answer
PySpark: PartitionBy leaves the same value in column by which partitioned multiple times
I need to partitionBy in order to get distinct values in the time and match_instatid columns, but it only produces distinct values about half the time
window_match_time_priority =…

Jakub Bares
- 159
- 1
- 11