Questions tagged [partition-by]

266 questions
3 votes, 2 answers

Reset Row Number on value change, but with repeat values in partition

I'm having trouble with something very similar to the question "T-SQL Reset Row Number on Field Change". The solution to that question is perfect and works fine, except when I try it with multiple other 'custno' values, it breaks down. What I mean by that: custno…
hello
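The per-custno reset the asker describes is the classic "gaps and islands" problem, usually solved by subtracting two ROW_NUMBERs. A minimal sketch with SQLite's window functions (requires SQLite ≥ 3.25); the table and column names (`t`, `custno`, `seq`, `flag`) are invented for illustration:

```python
import sqlite3

# Sample data: row numbers should reset each time `flag` changes,
# independently within each custno.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (custno TEXT, seq INTEGER, flag TEXT)")
con.executemany("INSERT INTO t VALUES (?, ?, ?)", [
    ("A", 1, "x"), ("A", 2, "x"), ("A", 3, "y"), ("A", 4, "x"),
    ("B", 1, "x"), ("B", 2, "y"), ("B", 3, "y"),
])

# Subtracting a per-(custno, flag) row number from a per-custno row number
# yields a constant within each unbroken run of the same flag. Partitioning
# the final ROW_NUMBER by (custno, flag, grp) restarts the count on every
# change, even when the same flag value reappears later.
result = con.execute("""
    SELECT custno, seq, flag,
           ROW_NUMBER() OVER (PARTITION BY custno, flag, grp
                              ORDER BY seq) AS rn
    FROM (
        SELECT custno, seq, flag,
               ROW_NUMBER() OVER (PARTITION BY custno ORDER BY seq)
             - ROW_NUMBER() OVER (PARTITION BY custno, flag ORDER BY seq) AS grp
        FROM t
    )
    ORDER BY custno, seq
""").fetchall()
```

Note that partitioning the outer ROW_NUMBER by `grp` alone is not enough: different flag values can collide on the same difference, which is exactly the breakage the asker sees with repeated values.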
3 votes, 1 answer

spark: save ordered data to parquet

I have 30 TB of data partitioned by date and hour, split into 300 files each hour. I do some data conversion and then want the data sorted and saved in sorted order for easy ingestion by a C++ program. I understand that when you serialize, the…
user1978816
3 votes, 2 answers

How to find the number of active users for, say, 1 day, 2 days, 3 days... (PostgreSQL)

A distribution of the number of days active within a week: I am trying to find how many members are active for 1 day, 2 days, 3 days, ... 7 days during a specific week (3/1-3/7). Is there any way to use an aggregate function on top of PARTITION BY? If not, what can be used…
KK44
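For this distribution no window function is needed at all: it is an aggregate of an aggregate, expressed with a subquery. A sketch in SQLite with an invented `activity(member, day)` table:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE activity (member TEXT, day TEXT)")
con.executemany("INSERT INTO activity VALUES (?, ?)", [
    ("a", "2024-03-01"), ("a", "2024-03-02"), ("a", "2024-03-03"),
    ("a", "2024-03-01"),                       # duplicate log, same day
    ("b", "2024-03-05"),
    ("c", "2024-03-01"), ("c", "2024-03-07"),
])

# Inner query: distinct active days per member within the week.
# Outer query: how many members had each day-count.
dist = con.execute("""
    SELECT days_active, COUNT(*) AS members
    FROM (
        SELECT member, COUNT(DISTINCT day) AS days_active
        FROM activity
        WHERE day BETWEEN '2024-03-01' AND '2024-03-07'
        GROUP BY member
    )
    GROUP BY days_active
    ORDER BY days_active
""").fetchall()
```

The same two-level GROUP BY works unchanged in PostgreSQL.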
3 votes, 1 answer

groupby and join vs window in pyspark

I have a data frame in pyspark which has hundreds of millions of rows (here is a dummy sample of it):

import datetime
import pyspark.sql.functions as F
from pyspark.sql import Window, Row
from pyspark.sql.functions import col
from…
Dan R
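The pyspark question boils down to "window function vs. aggregate-then-join", and the two produce identical rows. A plain-SQL sketch of the equivalence (names invented); the trade-off in Spark is that the window needs the whole partition co-located, while the aggregated side of the join can often be broadcast:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (grp TEXT, val INTEGER)")
con.executemany("INSERT INTO t VALUES (?, ?)", [("a", 1), ("a", 5), ("b", 2)])

# Window version: single pass, every row of a partition processed together.
w = con.execute("""
    SELECT grp, val, MAX(val) OVER (PARTITION BY grp) AS mx
    FROM t ORDER BY grp, val
""").fetchall()

# groupBy + join version: a small aggregated table joined back to the rows.
j = con.execute("""
    SELECT t.grp, t.val, m.mx
    FROM t
    JOIN (SELECT grp, MAX(val) AS mx FROM t GROUP BY grp) m
      ON t.grp = m.grp
    ORDER BY t.grp, t.val
""").fetchall()
```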
3 votes, 2 answers

Scala: collect_list() over Window with keeping null values

I have a data frame like the below:

+----+----+----+
|colA|colB|colC|
+----+----+----+
|1   |1   |23  |
|1   |2   |63  |
|1   |3   |null|
|1   |4   |32  |
|2   |2   |56  |
+----+----+----+

I apply the below instructions such that I create a…
WhoAmI
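Spark's `collect_list` silently drops nulls; the usual workaround is to collect a struct (e.g. of (colB, colC)) so the null survives inside the struct, then unwrap it. A plain-Python sketch of the output such a workaround is aiming for, using the question's sample rows with `None` standing in for null:

```python
from collections import defaultdict

# (colA, colB, colC) rows from the question.
rows = [(1, 1, 23), (1, 2, 63), (1, 3, None), (1, 4, 32), (2, 2, 56)]

# Group by colA, order by colB, and collect colC while KEEPING the nulls,
# which is the behaviour collect_list alone does not give you.
collected = defaultdict(list)
for colA, colB, colC in sorted(rows, key=lambda r: (r[0], r[1])):
    collected[colA].append(colC)
result = dict(collected)
```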
3 votes, 0 answers

Spark partition by key

What is the difference between the two types of partitioning in Spark? For example: I load a text file toto.csv from disk into a Spark cluster:

val text = sc.textFile("toto.csv", 100)

This splits my file into 100 fragments with no "rules". After that, if I do val…
minh-hieu.pham
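The distinction: `textFile(..., 100)` splits by byte ranges and ignores the content, while partitioning by key routes every record through `hash(key) mod numPartitions`, so equal keys always land in the same partition. A toy Python model of what Spark's `HashPartitioner` does:

```python
def hash_partition(pairs, num_partitions):
    # Route each (key, value) pair to partition hash(key) % num_partitions,
    # mirroring Spark's HashPartitioner: same key, same partition, always.
    parts = [[] for _ in range(num_partitions)]
    for key, value in pairs:
        parts[hash(key) % num_partitions].append((key, value))
    return parts

# Both "a" records end up in the same partition, whichever one that is.
parts = hash_partition([("a", 1), ("b", 2), ("a", 3)], num_partitions=4)
```

This key-based placement is what enables later per-key operations (joins, reduceByKey) to run without a further shuffle.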
3 votes, 3 answers

clojure: partition a seq based on a seq of values

I would like to partition a seq based on a seq of values:

(partition-by-seq [3 5] [1 2 3 4 5 6])
((1 2 3) (4 5) (6))

The first input is a seq of split points. The second input is the seq I would like to partition, so that the first list will be…
Thomas Deutsch
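The requested behaviour, close a sublist after each element that matches the next split point, is easy to state imperatively. A Python transcription of the example above (the Clojure version would typically use `reduce` or lazy recursion):

```python
def partition_by_seq(points, xs):
    """Split xs into sublists, closing a sublist after each split point.

    partition_by_seq([3, 5], [1, 2, 3, 4, 5, 6]) -> [[1, 2, 3], [4, 5], [6]]
    """
    out, current, remaining = [], [], list(points)
    for x in xs:
        current.append(x)
        if remaining and x == remaining[0]:
            out.append(current)      # close the sublist at a split point
            current = []
            remaining.pop(0)
    if current:                      # trailing elements after the last point
        out.append(current)
    return out
```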
2 votes, 2 answers

SQL get last column in a group based on order by and value of another column

From the table below, I am trying to create another column (last_k2X) that will show the last key2 where type was set to 'X' over time (ts). In case X was set at the same time (ts) for multiple key2 (in the same key1 partition), the new column last_k2X…
Alg_D
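One way to carry forward "the last key2 whose type was 'X'" is a correlated subquery that looks back in time within the same key1 partition; here same-ts ties are broken by taking the larger key2, which is just one possible tie-break rule. Schema and data are invented to match the question's description; sketched in SQLite:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (key1 TEXT, ts INTEGER, key2 INTEGER, type TEXT)")
con.executemany("INSERT INTO t VALUES (?, ?, ?, ?)", [
    ("k", 1, 10, "X"),
    ("k", 2, 11, "-"),
    ("k", 3, 12, "X"),
    ("k", 3, 13, "X"),   # same ts as the row above: the tie case
    ("k", 4, 14, "-"),
])

# For each row, find the most recent 'X' row at or before its ts,
# within the same key1; ORDER BY ts DESC, key2 DESC resolves ties.
result = con.execute("""
    SELECT key1, ts, key2,
           (SELECT t2.key2 FROM t t2
            WHERE t2.key1 = t.key1 AND t2.type = 'X' AND t2.ts <= t.ts
            ORDER BY t2.ts DESC, t2.key2 DESC
            LIMIT 1) AS last_k2X
    FROM t
    ORDER BY key1, ts, key2
""").fetchall()
```

On engines that support `LAST_VALUE(...) IGNORE NULLS` (SQL Server does not; Oracle and BigQuery do), a window over `CASE WHEN type = 'X' THEN key2 END` achieves the same thing in one pass.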
2 votes, 1 answer

Rank date column based on dates in another table date column

I have the following tables:

df_obs.show()

Item No    Date_Observed
Item 1     2021-09-20
Item 1     2022-12-05
Item 2     2022-10-27
Item 1     2022-09-20
Item 2     2023-02-20
Item 2     2023-03-20
Item 1     2023-01-20

df_purchase.withColumn("rank",…
2 votes, 1 answer

Extracted Date within a Partition By Function

I am trying to run the following SQL query, but BigQuery shows the error "PARTITION BY expression references column date which is neither grouped nor aggregated":

SELECT EXTRACT(MONTH FROM date) month, country_name, SUM(installs), DENSE_RANK()…
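The error arises because, once a query has a GROUP BY, a window's PARTITION BY may only use grouped or aggregated expressions, and the raw `date` column is neither. The fix is to partition by the extracted month and group by it too. Sketched in SQLite, where `strftime('%m', d)` plays the role of `EXTRACT(MONTH FROM date)`; table and data are invented:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE installs_t (d TEXT, country_name TEXT, installs INTEGER)")
con.executemany("INSERT INTO installs_t VALUES (?, ?, ?)", [
    ("2024-01-01", "US", 10), ("2024-01-15", "US", 5),
    ("2024-01-20", "FR", 7),  ("2024-02-01", "US", 3),
])

# Both the GROUP BY and the window PARTITION BY use the extracted month,
# and the window's ORDER BY uses the aggregate, so everything the window
# references is grouped or aggregated.
result = con.execute("""
    SELECT strftime('%m', d) AS month, country_name,
           SUM(installs) AS total,
           DENSE_RANK() OVER (PARTITION BY strftime('%m', d)
                              ORDER BY SUM(installs) DESC) AS rnk
    FROM installs_t
    GROUP BY strftime('%m', d), country_name
    ORDER BY month, rnk
""").fetchall()
```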
2 votes, 1 answer

PARTITION BY with date between 2 dates

I work on an Azure SQL Database (SQL Server). I am trying to get a table with one row per day, but the day is not in the table. I will explain with the example below: TABLE STARTER (date format: YYYY-MM-DD): Date begin Date…
Ouakrat
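A common route when only begin/end dates are stored is to expand each range into one row per day with a recursive CTE; window functions can then partition by the generated day. A sketch in SQLite (SQL Server's recursive CTE syntax is nearly identical, with `DATEADD` in place of `date(d, '+1 day')`); the `starter` table and its columns are invented:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE starter (id INTEGER, date_begin TEXT, date_end TEXT)")
con.executemany("INSERT INTO starter VALUES (?, ?, ?)", [
    (1, "2024-01-01", "2024-01-03"),
    (2, "2024-02-10", "2024-02-10"),
])

# Anchor: each range's first day.  Recursive step: add one day until the
# end date is reached, carrying the id and end date along.
days = con.execute("""
    WITH RECURSIVE days(id, d, date_end) AS (
        SELECT id, date_begin, date_end FROM starter
        UNION ALL
        SELECT id, date(d, '+1 day'), date_end FROM days WHERE d < date_end
    )
    SELECT id, d FROM days ORDER BY id, d
""").fetchall()
```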
2 votes, 1 answer

Write dataframe without column names as part of the file path

I have to write a Spark dataframe to a path of the format: base_path/{year}/{month}/{day}/{hour}/. If I do something like the below:

pc = ["year", "month", "day", "hour"]
df.write.partitionBy(*pc).parquet("base_path/", mode = 'append')

It creates the…
seou1
2 votes, 3 answers

nth group labelling (count only graduates with new group)

Say I have a df:

df <- data.table(user = c('a', 'a', 'b')); df

user
a
a
b

I am looking to create a new column, group_id, at user level:

user  group_id
a     1
a     1
b     2

Is there a fast / scalable way? I can…
Sweepy Dodo
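In data.table itself the idiomatic answers are `df[, group_id := .GRP, by = user]` (ids by group) or `rleid(user)` (ids by consecutive runs). The logic the asker wants, consecutive ids in order of first appearance, can be sketched in a few lines of Python:

```python
def group_ids(users):
    # Assign consecutive ids in order of first appearance, i.e. what
    # data.table's `df[, group_id := .GRP, by = user]` produces.
    seen, out = {}, []
    for u in users:
        if u not in seen:
            seen[u] = len(seen) + 1   # next unused id
        out.append(seen[u])
    return out
```

Note the difference from `rleid`: for input a, a, b, a this yields 1, 1, 2, 1, whereas `rleid` restarts on every new run and yields 1, 1, 2, 3.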
2 votes, 2 answers

How to run SUM() OVER PARTITION BY for COUNT DISTINCT

I'm trying to get the number of distinct users for each event at a daily level while maintaining a running sum for every hour. I'm using Athena/Presto as the query engine. I tried the following query:

SELECT eventname, …
nish
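`COUNT(DISTINCT ...)` is not supported as a window function in Presto/Athena (nor SQLite), so the usual trick is: flag each user's first occurrence per event with ROW_NUMBER, aggregate the flags per hour, then run a cumulative SUM over the hourly totals. A sketch in SQLite with an invented `events` table:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE events (eventname TEXT, hour INTEGER, userid TEXT)")
con.executemany("INSERT INTO events VALUES (?, ?, ?)", [
    ("click", 1, "u1"), ("click", 1, "u2"),
    ("click", 2, "u2"), ("click", 2, "u3"),   # u2 is a repeat, u3 is new
    ("view",  1, "u1"),
])

# Inner: first_seen = 1 only on a user's earliest hour for that event.
# Outer: hourly totals of new users, cumulatively summed per event
# (the default window frame with ORDER BY is cumulative).
result = con.execute("""
    SELECT eventname, hour,
           SUM(SUM(first_seen)) OVER (PARTITION BY eventname
                                      ORDER BY hour) AS running_users
    FROM (
        SELECT eventname, hour, userid,
               CASE WHEN ROW_NUMBER() OVER (PARTITION BY eventname, userid
                                            ORDER BY hour) = 1
                    THEN 1 ELSE 0 END AS first_seen
        FROM events
    )
    GROUP BY eventname, hour
    ORDER BY eventname, hour
""").fetchall()
```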
2 votes, 1 answer

PySpark: PartitionBy leaves the same value in column by which partitioned multiple times

I need to partitionBy in order to get distinct values in the time and match_instatid columns, but it only produces distinct values about half the time. window_match_time_priority =…
Jakub Bares