Questions tagged [data-partitioning]

Data partitioning is the division of a collection of data into smaller collections for the purpose of faster processing, easier statistics gathering, and a smaller memory/persistence footprint.
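
As a rough illustration of the idea (not taken from any of the questions listed here), the following is a minimal Python sketch. It assumes each record carries an integer key and that a plain modulo scheme is good enough; `partition_by_key`, `rows`, and `shards` are hypothetical names chosen for the example.

```python
from collections import defaultdict


def partition_by_key(records, num_partitions, key):
    """Split records into buckets, using key(record) % num_partitions as the bucket id."""
    buckets = defaultdict(list)
    for record in records:
        # Records whose keys share a remainder land in the same partition.
        buckets[key(record) % num_partitions].append(record)
    return dict(buckets)


# Hypothetical sample data: ten rows with integer ids 0..9.
rows = [{"id": i, "value": i * i} for i in range(10)]
shards = partition_by_key(rows, num_partitions=3, key=lambda r: r["id"])
```

Each partition can then be processed or stored independently. Real systems usually hash the key first so that skewed or non-integer keys still spread evenly.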

337 questions
2
votes
2 answers

C++ Partition a vector of vectors using

Suppose you have a 2D vector defined as follows: std::vector<std::vector<int>> v, which represents a matrix: 1 1 0 1 3 0 4 6 0 1 5 0 0 3 0 6 3 0 2 5 I want to stable-partition (say with predicate el != 0) this matrix, but in all directions. This…
Desperados
  • 434
  • 5
  • 13
2
votes
3 answers

How to create an average per partition containing a maximum of 5 time-dependent members?

My goal is to select an average of exactly 5 records only if they meet the left join criteria to another table. Let's say we have table one (left) with records: RECNUM ID DATE JOB 1 | cat | 2019.01.01 | meow 2 | dog |…
wounky
  • 97
  • 1
  • 12
2
votes
2 answers

How Kafka Handles Keyed Message Related to Partition

Can anyone explain how Kafka actually stores keyed messages? Is a partition assigned to only one key? I mean, is it possible for a partition to store messages with multiple keys? If the answer to the first question is yes, then what if the number of keys is more…
panoet
  • 3,608
  • 1
  • 16
  • 27
2
votes
1 answer

Determining partitioning key in range based partitioning of a MySQL Table

I've been researching for a while regarding database partitioning in MySQL. Since I have one ever-growing table in my DB, I thought of using partitioning as an effective tool to optimize it. I'm only interested in retaining recent data (say last 6…
Haagenti
  • 5,602
  • 1
  • 9
  • 17
2
votes
1 answer

How to partition to spread values?

I have a table with data: Customers Sequence ID many other columns (not important) Sample data: Sequence ID ----------- 214906 2613 214906 2614 214906 2615 214907 2613 214907 2614 214907 2615 214908 2613 214908 2614 214908 2615 214000 2613 213004…
John
  • 218
  • 4
  • 8
2
votes
1 answer

Spark Partition Dataset By Column Value

(I am new to Spark) I need to store a large number of rows of data, and then handle updates to those data. We have unique IDs (DB PKs) for those rows, and we would like to shard the data set by uniqueID % numShards, to make equal sized, addressable…
radumanolescu
  • 4,059
  • 2
  • 31
  • 44
2
votes
1 answer

(SPARK) What is the best way to partition data on which multiple filters are applied?

I am working in Spark (on azure databricks) with a 15 billion rows file that looks like this : +---------+---------------+----------------+-------------+--------+------+ |client_id|transaction_key|transaction_date| …
2
votes
1 answer

Change two bytes in a GUID

I'm using a partitioned CosmosDb, but I don't know the value of the partition key each time I want to get a resource by its id. Now using the id as partition key is not a solution for me, since it would take too long and take up too much space (I…
Carmen
  • 55
  • 3
2
votes
1 answer

Repartition Dask Dataframe with custom index

I have a huge Dask Dataframe similar to this |Ind| C1 | C2 |....| Cn | |-----------------------| | 1 |val1| AE |....|time| |-----------------------| | 2 |val2| FB |....|time| |-----------------------| |...|....| .. |....|…
pichlbaer
  • 923
  • 1
  • 10
  • 18
2
votes
0 answers

spark repartition to one output file per customer

Assume I have a dataframe like: client_id,report_date,date,value_1,value_2 1,2019-01-01,2019-01-01,1,2 1,2019-01-01,2019-01-02,3,4 1,2019-01-01,2019-01-03,5,6 2,2019-01-01,2019-01-01,1,2 2,2019-01-01,2019-01-02,3,4 2,2019-01-01,2019-01-03,5,6 My…
Georg Heiler
  • 16,916
  • 36
  • 162
  • 292
2
votes
1 answer

Creating data partitions over a selected range of data to be fed into caret::train function for cross-validation

I want to create jack-knife data partitions for the data frame below, with the partitions to be used in caret::train (like the caret::groupKFold() produces). However, the catch is that I want to restrict the test points to say greater than 16 days,…
André.B
  • 617
  • 8
  • 17
2
votes
2 answers

How to partition an image into 64 blocks in MATLAB

I want to compute the Color Layout Descriptor (CLD) for each image. This algorithm includes four stages. In the first stage I must partition each image into 64 blocks (8×8) in order to compute a single representative color from each block. I try to…
zenab
  • 229
  • 3
  • 9
  • 20
2
votes
2 answers

How Apache Spark partitions data of a big file

Let's say I have a cluster of 4 nodes, each having 1 core. I have a big file, 600 petabytes in size, which I want to process through Spark. The file could be stored in HDFS. I think the way to determine the number of partitions is file size / total number of cores in…
Anand
  • 20,708
  • 48
  • 131
  • 198
2
votes
1 answer

Incorrect splitting of data using sample.split in R and issue with logistic regression

I have 2 issues. When I try to split my data into test and train sets using sample.split as below, the sampling is done rather unclearly. What I mean is that the data d has a length of 392, so a 4:1 division should show 0.8*392 = 313.6, i.e. 313…
Akshayanti
  • 354
  • 3
  • 15
2
votes
1 answer

Obtain KeyedStream from custom partitioning in Flink

I know that Flink comes with custom partitioning APIs. However, the problem is that, after invoking partitionCustom on a DataStream you get a DataStream back and not a KeyedStream. On the other hand, you cannot override the partitioning strategy for…
affo
  • 453
  • 3
  • 15