Data partitioning is the division of a collection of data into smaller collections, for the purposes of faster processing, easier statistics gathering, and a smaller memory or persistence footprint.
Questions tagged [data-partitioning]
337 questions
2 votes · 2 answers
C++ Partition a vector of vectors using…
Suppose you have a 2D vector defined as follows:
std::vector<std::vector<int>> v;
and which represents a matrix:
1 1 0 1 3
0 4 6 0 1
5 0 0 3 0
6 3 0 2 5
I want to stable-partition (say with predicate el != 0) this matrix, but in all directions. This…
— asked by Desperados
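The question above asks for a stable partition applied along every direction of a matrix. A minimal sketch of the idea (in Python rather than C++; the helper names are mine, not the asker's):

```python
def stable_partition(seq, pred):
    # Stable partition: elements satisfying pred come first,
    # and relative order is preserved within each group.
    return [x for x in seq if pred(x)] + [x for x in seq if not pred(x)]

def partition_rows(matrix, pred):
    # Partition each row independently (the left-to-right direction).
    return [stable_partition(row, pred) for row in matrix]

def partition_cols(matrix, pred):
    # Transpose, partition each column, transpose back (top-to-bottom).
    cols = [stable_partition(col, pred) for col in zip(*matrix)]
    return [list(row) for row in zip(*cols)]

m = [[1, 1, 0, 1, 3],
     [0, 4, 6, 0, 1],
     [5, 0, 0, 3, 0],
     [6, 3, 0, 2, 5]]
nonzero = lambda x: x != 0
print(partition_rows(m, nonzero)[0])  # [1, 1, 1, 3, 0]
```

The C++ equivalent would apply `std::stable_partition` to each row, and to column views (or a transposed copy) for the vertical direction.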
2 votes · 3 answers
How to create an average per partition containing a maximum of 5 time-dependent members?
My goal is to select an average of exactly 5 records, but only if they meet the left-join criteria against another table.
Let's say we have table one (left) with records:
RECNUM | ID  | DATE       | JOB
1      | cat | 2019.01.01 | meow
2      | dog | …
— asked by wounky
2 votes · 2 answers
How Kafka handles keyed messages in relation to partitions
Can anyone explain how Kafka actually stores keyed messages? Is a partition assigned to only a single key, or is it possible for one partition to store messages with multiple keys? And if it is, what happens when the number of keys is more…
— asked by panoet
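The short answer is that a partition usually holds messages for many keys: Kafka's default partitioner hashes the key (murmur2 in the Java client) modulo the partition count, so a key always maps to the same partition, but partitions are shared. A sketch of that mapping (md5 stands in for the real hash here, purely for illustration):

```python
import hashlib

def partition_for(key: bytes, num_partitions: int) -> int:
    # Hash the key and take it modulo the partition count. One key always
    # lands in the same partition; one partition typically holds many keys.
    digest = hashlib.md5(key).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

# With more keys than partitions, several keys must share a partition:
shared = {partition_for(f"user-{i}".encode(), 3) for i in range(100)}
```

This is why per-key ordering is guaranteed (same key, same partition) even though partitions are not key-exclusive.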
2 votes · 1 answer
Determining the partitioning key in range-based partitioning of a MySQL table
I've been researching database partitioning in MySQL for a while. Since I have one ever-growing table in my DB, I thought of using partitioning as an effective tool to optimize it. I'm only interested in retaining recent data (say the last 6…
— asked by Haagenti
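For an ever-growing table where only recent data matters, the usual choice is to range-partition on the date column and expire data by dropping whole partitions. A sketch of that scheme (the monthly granularity and partition naming are my assumptions, not the asker's):

```python
from datetime import date

def month_partition(d: date) -> str:
    # Name of the monthly range partition a row falls into.
    return f"p{d.year}{d.month:02d}"

def partitions_to_drop(existing, keep_months):
    # Retention becomes a cheap metadata operation: drop whole partitions
    # older than the newest `keep_months`, instead of DELETEing rows.
    keep = set(sorted(existing)[-keep_months:])
    return [p for p in sorted(existing) if p not in keep]

parts = [f"p2019{m:02d}" for m in range(1, 8)]  # p201901 .. p201907
```

In MySQL itself this corresponds to `PARTITION BY RANGE` on the date column plus periodic `ALTER TABLE … DROP PARTITION`.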
2 votes · 1 answer
How to partition to spread values?
I have a table Customers with the columns Sequence, ID, and many other columns (not important).
Sample data:
Sequence  ID
------------
214906  2613
214906  2614
214906  2615
214907  2613
214907  2614
214907  2615
214908  2613
214908  2614
214908  2615
214000  2613
213004…
— asked by John
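One way to spread rows that share an ID across buckets is the window-function trick: number each row within its ID group, then bucket on that number modulo the bucket count. A sketch under that interpretation of the question:

```python
from collections import defaultdict

def spread(rows, key, num_buckets):
    # Number each row within its key group (like ROW_NUMBER() OVER
    # (PARTITION BY ID ORDER BY Sequence)) and assign it to bucket
    # row_number % num_buckets, so rows sharing an ID are spread out
    # as evenly as possible.
    counters, buckets = defaultdict(int), defaultdict(list)
    for row in rows:
        k = key(row)
        buckets[counters[k] % num_buckets].append(row)
        counters[k] += 1
    return dict(buckets)

rows = [(214906, 2613), (214907, 2613), (214908, 2613), (214906, 2614)]
buckets = spread(rows, key=lambda r: r[1], num_buckets=2)
```

In SQL the same effect comes from `NTILE(n)` or `ROW_NUMBER() ... % n` over a window partitioned by ID.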
2 votes · 1 answer
Spark: partition a dataset by column value
(I am new to Spark.) I need to store a large number of rows of data, and then handle updates to those rows. We have unique IDs (DB PKs) for those rows, and we would like to shard the data set by uniqueID % numShards, to make equal-sized, addressable…
— asked by radumanolescu
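The uniqueID % numShards scheme itself is simple to sketch; with roughly uniform IDs it produces equal-sized, directly addressable shards (in Spark this would typically become a derived shard column used for repartitioning):

```python
def shard_of(unique_id: int, num_shards: int) -> int:
    # Modulo sharding: each ID maps to exactly one shard, and a lookup
    # can compute its shard without any directory or index.
    return unique_id % num_shards

counts = {}
for uid in range(1000):  # stand-in for the DB primary keys
    s = shard_of(uid, 8)
    counts[s] = counts.get(s, 0) + 1
```

Sequential IDs divide perfectly; skewed or clustered IDs may need a hash before the modulo to keep shard sizes even.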
2 votes · 1 answer
(Spark) What is the best way to partition data on which multiple filters are applied?
I am working in Spark (on Azure Databricks) with a 15-billion-row file that looks like this:
+---------+---------------+----------------+-------------+--------+------+
|client_id|transaction_key|transaction_date| …
— asked by RobL
2 votes · 1 answer
Change two bytes in a GUID
I'm using a partitioned Cosmos DB, but I don't know the value of the partition key each time I want to get a resource by its id. Now, using the id as the partition key is not a solution for me, since it would take too long and take up too much space (I…
— asked by Carmen
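When the partition key must be recoverable from the id alone, one common workaround (a sketch of the general idea, not necessarily what the accepted answer proposes) is to derive the key deterministically from the id, e.g. a few bytes of its hash, giving a bounded number of logical partitions with no lookup table:

```python
import hashlib
import uuid

def synthetic_partition_key(item_id: str, buckets: int = 256) -> str:
    # Deterministic key derived from the id: a reader can recompute it at
    # lookup time, so point reads never need a cross-partition query.
    # The bucket count (256) is an arbitrary choice for illustration.
    h = hashlib.sha256(item_id.encode()).digest()
    return f"{int.from_bytes(h[:2], 'big') % buckets:03d}"

item_id = str(uuid.uuid4())
key = synthetic_partition_key(item_id)
```

This keeps the key short (unlike using the full id as the key) while still being computable from the id, which is exactly the property the asker is after.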
2 votes · 1 answer
Repartition a Dask DataFrame with a custom index
I have a huge Dask DataFrame similar to this:
| Ind | C1   | C2 | .... | Cn   |
|-----|------|----|------|------|
| 1   | val1 | AE | .... | time |
| 2   | val2 | FB | .... | time |
| ... | .... | .. | .... | …
— asked by pichlbaer
2 votes · 0 answers
Spark: repartition to one output file per customer
Assume I have a dataframe like:
client_id,report_date,date,value_1,value_2
1,2019-01-01,2019-01-01,1,2
1,2019-01-01,2019-01-02,3,4
1,2019-01-01,2019-01-03,5,6
2,2019-01-01,2019-01-01,1,2
2,2019-01-01,2019-01-02,3,4
2,2019-01-01,2019-01-03,5,6
My…
— asked by Georg Heiler
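In Spark, `df.write.partitionBy("client_id")` produces one output directory per client value (one file per directory only if the data is first repartitioned so each client sits in a single partition). The grouping itself can be sketched in plain Python, writing to in-memory buffers instead of files:

```python
import io
from itertools import groupby

# Rows as (client_id, report_date, date, value_1, value_2), as in the sample.
rows = [
    (1, "2019-01-01", "2019-01-01", 1, 2),
    (1, "2019-01-01", "2019-01-02", 3, 4),
    (2, "2019-01-01", "2019-01-01", 1, 2),
    (2, "2019-01-01", "2019-01-02", 3, 4),
]

# Sort by client, group, and emit one "file" (an in-memory buffer here)
# per client, mirroring the directory-per-value layout partitionBy creates.
files = {}
for client, group in groupby(sorted(rows), key=lambda r: r[0]):
    buf = io.StringIO()
    for r in group:
        buf.write(",".join(map(str, r)) + "\n")
    files[f"client_id={client}.csv"] = buf.getvalue()
```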
2 votes · 1 answer
Creating data partitions over a selected range of data to be fed into caret::train for cross-validation
I want to create jack-knife data partitions for the data frame below, with the partitions to be used in caret::train (like those caret::groupKFold() produces). However, the catch is that I want to restrict the test points to, say, greater than 16 days,…
— asked by André.B
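The leave-one-group-out structure underlying such partitions is easy to sketch (in Python rather than R; the index lists correspond to what caret expects in its `index`/`indexOut` arguments):

```python
def leave_one_group_out(groups):
    # One fold per distinct group, as caret::groupKFold gives when the
    # number of folds equals the number of groups: each fold trains on
    # all other groups and holds the chosen group out for testing. The
    # asker's extra constraint (test points beyond 16 days) would be an
    # additional filter applied to the held-out indices.
    folds = []
    for g in sorted(set(groups)):
        train = [i for i, x in enumerate(groups) if x != g]
        held_out = [i for i, x in enumerate(groups) if x == g]
        folds.append((train, held_out))
    return folds

folds = leave_one_group_out(["a", "a", "b", "c"])
```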
2 votes · 2 answers
How to partition an image into 64 blocks in MATLAB
I want to compute the Color Layout Descriptor (CLD) for each image. This algorithm includes four stages. In the first stage I must partition each image into 64 blocks (8×8), in order to compute a single representative color from each block. I try to…
— asked by zenab
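The stage-one tiling can be sketched language-independently (Python here rather than MATLAB, for a single greyscale channel; the CLD does this per color channel):

```python
def block_means(img, blocks=8):
    # Partition an image into a blocks x blocks grid of tiles and return
    # each tile's mean value -- the "single representative color" the
    # CLD's first stage requires. Assumes the image dimensions divide
    # evenly by the block count.
    h, w = len(img), len(img[0])
    bh, bw = h // blocks, w // blocks
    means = []
    for by in range(blocks):
        row = []
        for bx in range(blocks):
            tile = [img[y][x]
                    for y in range(by * bh, (by + 1) * bh)
                    for x in range(bx * bw, (bx + 1) * bw)]
            row.append(sum(tile) / len(tile))
        means.append(row)
    return means

img = [[x for x in range(16)] for _ in range(16)]  # 16x16 horizontal gradient
means = block_means(img)
```

In MATLAB the equivalent is typically done with `mat2cell` or `blockproc` using block sizes of rows/8 by cols/8.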
2 votes · 2 answers
How Apache Spark partitions the data of a big file
Let's say I have a cluster of 4 nodes, each having 1 core. I have a 600-petabyte file which I want to process through Spark. The file could be stored in HDFS.
I think the way to determine the number of partitions is file size / total number of cores in…
— asked by Anand
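The premise is worth correcting: for a file read from HDFS, Spark's initial partition count is driven by the input split/block size (128 MB by default), not by the core count; the 4 cores only bound how many partitions execute concurrently. A sketch of the arithmetic:

```python
import math

def initial_partitions(file_size_bytes, split_size_bytes=128 * 1024 * 1024):
    # One input partition per HDFS split: ceil(file size / split size).
    # Cores don't change this count; they only cap parallelism.
    return math.ceil(file_size_bytes / split_size_bytes)

one_tb = 1024 ** 4
```

At 600 PB and 128 MB splits, that is on the order of five billion partitions, which is why such a job is dominated by scheduling and I/O rather than by the 4 available cores.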
2 votes · 1 answer
Incorrect splitting of data using sample.split in R, and an issue with logistic regression
I have 2 issues. When I try to split my data into test and train sets using sample.split as below, the sampling is done rather unclearly. What I mean is that the data d has a length of 392, so a 4:1 division should give 0.8*392 = 313.6, i.e. 313…
— asked by Akshayanti
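The arithmetic itself explains part of the confusion: 0.8 × 392 = 313.6 is not an integer, so any splitter must round, and a train set of 313 or 314 rows is expected rather than a bug. A minimal sketch of such a split (Python rather than R; `round` is my choice of rounding, sample.split's exact rule may differ):

```python
import random

def split_indices(n, ratio=0.8, seed=42):
    # Shuffle the indices and cut at round(n * ratio); with n=392 and
    # ratio=0.8 the target 313.6 rounds to a 314/78 split.
    rng = random.Random(seed)
    idx = list(range(n))
    rng.shuffle(idx)
    cut = round(n * ratio)
    return sorted(idx[:cut]), sorted(idx[cut:])

train, test = split_indices(392)
```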
2 votes · 1 answer
Obtain a KeyedStream from custom partitioning in Flink
I know that Flink comes with custom partitioning APIs. However, the problem is that, after invoking partitionCustom on a DataStream, you get a DataStream back and not a KeyedStream. On the other hand, you cannot override the partitioning strategy for…
— asked by affo