Questions tagged [partitioning]

Partitioning is a performance strategy whereby you divide possibly very large groups of data into some number of smaller groups of data.

The expectation is that, for algorithms whose cost grows faster than linearly in N, the total time it takes to process the smaller groups and combine the results is still less than the time it would take to process the one larger set of data.
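
As a rough illustration of that expectation, the sketch below sorts a list with a quadratic-time insertion sort, once as a whole and once in ten partitions that are combined with a k-way merge. The function names, the choice of insertion sort, and the use of element shifts as a cost measure are assumptions made for this example; the cost of the merge step is not counted in the shift totals.

```python
import heapq


def insertion_sort(items):
    """Quadratic-time sort; returns (sorted_list, number_of_element_shifts)."""
    out = list(items)
    shifts = 0
    for i in range(1, len(out)):
        value = out[i]
        j = i - 1
        # Shift larger elements one slot to the right.
        while j >= 0 and out[j] > value:
            out[j + 1] = out[j]
            j -= 1
            shifts += 1
        out[j + 1] = value
    return out, shifts


def partitioned_sort(items, parts):
    """Sort each partition independently, then combine with a k-way merge."""
    size = max(1, -(-len(items) // parts))  # ceiling division, at least 1
    chunks = [items[i:i + size] for i in range(0, len(items), size)]
    sorted_chunks = []
    total_shifts = 0
    for chunk in chunks:
        sorted_chunk, shifts = insertion_sort(chunk)
        sorted_chunks.append(sorted_chunk)
        total_shifts += shifts
    return list(heapq.merge(*sorted_chunks)), total_shifts


data = list(range(1000, 0, -1))  # reverse-sorted input, the worst case here
whole, whole_shifts = insertion_sort(data)
split, split_shifts = partitioned_sort(data, parts=10)
print(whole == split, whole_shifts, split_shifts)
```

With a cost that grows roughly as N squared, splitting into k equal partitions cuts the per-partition work to about (N/k) squared each, so the summed work is about one k-th of the original, before adding the (much cheaper) combine step.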

List partitioning is similar to range partitioning in many ways. As in partitioning by RANGE, each partition must be explicitly defined.
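
To make "explicitly defined" concrete, here is a minimal Python sketch of how a value might be routed to a partition under each scheme. It is not any database's implementation; the partition names, bounds, and value sets are hypothetical and chosen only for illustration.

```python
import bisect

# Hypothetical RANGE partitions: each is defined by an exclusive upper bound.
RANGE_BOUNDS = [100, 200, 300]
RANGE_NAMES = ["p0", "p1", "p2"]

# Hypothetical LIST partitions: each is defined by an explicit set of values.
LIST_PARTITIONS = {
    "p_north": {3, 5, 6},
    "p_south": {1, 2, 10},
}


def range_partition(value):
    """Return the partition whose range (lower <= value < upper) holds value."""
    idx = bisect.bisect_right(RANGE_BOUNDS, value)
    if idx == len(RANGE_BOUNDS):
        raise ValueError(f"no range partition defined for {value!r}")
    return RANGE_NAMES[idx]


def list_partition(value):
    """Return the partition whose explicit value list contains value."""
    for name, values in LIST_PARTITIONS.items():
        if value in values:
            return name
    raise ValueError(f"no list partition defined for {value!r}")


print(range_partition(150), list_partition(10))
```

In both cases a value that matches no defined partition is rejected, which is the practical consequence of every partition having to be spelled out in advance.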

3138 questions
12
votes
2 answers

In Oracle SQL, can I query a partition of a table instead of an entire table to make it run faster?

I would like to query a table with a million records for customers named 'FooBar' that have records dated on 7-24-2016. The table has 10 days of data in it. select * from table where customer = 'FooBar' and insert_date between to_date('2016-07-24…
Cale Sweeney
  • 1,014
  • 1
  • 15
  • 37
12
votes
1 answer

Write Spark dataframe as CSV with partitions

I'm trying to write a dataframe in spark to an HDFS location and I expect that if I'm adding the partitionBy notation Spark will create partition (similar to writing in Parquet format) folder in form of partition_column_name=partition_value ( i.e…
Lior Baber
  • 852
  • 3
  • 11
  • 25
12
votes
3 answers

Postgresql Table Partitioning Django Project

I have a Django 1.7 project that uses Postgres 9.3. I have a table that will have rather high volume. The table will have anywhere from 13million to 40million new rows a month. I would like to know what the best way to incorporate Postgres table…
arcane
  • 457
  • 6
  • 10
11
votes
5 answers

Need an algorithm to split a series of numbers

After a few busy nights my head isn't working so well, but this needs to be fixed yesterday, so I'm asking the more refreshed community of SO. I've got a series of numbers. For example: 1, 5, 7, 13, 3, 3, 4, 1, 8, 6, 6, 6 I need to split this…
Vilx-
  • 104,512
  • 87
  • 279
  • 422
11
votes
2 answers

Spark: Order of column arguments in repartition vs partitionBy

Methods taken into consideration (Spark 2.2.1): DataFrame.repartition (the two implementations that take partitionExprs: Column* parameters) DataFrameWriter.partitionBy Note: This question doesn't ask the difference between these methods From docs…
y2k-shubham
  • 10,183
  • 11
  • 55
  • 131
11
votes
2 answers

Why does sortBy transformation trigger a Spark job?

As per Spark documentation only RDD actions can trigger a Spark job and the transformations are lazily evaluated when an action is called on it. I see the sortBy transformation function is applied immediately and it is shown as a job trigger in the…
Prabu Soundar Rajan
  • 799
  • 1
  • 8
  • 14
11
votes
6 answers

Quicksort - Hoare's partitioning with duplicate values

I have implemented the classic Hoare's partitioning algorithm for Quicksort. It works with any list of unique numbers [3, 5, 231, 43]. The only problem is when I have a list with duplicates [1, 57, 1, 34]. If I get duplicate values I enter an…
valdi.k
  • 331
  • 1
  • 3
  • 7
11
votes
2 answers

pyspark partitioning data using partitionby

I understand that partitionBy function partitions my data. If I use rdd.partitionBy(100) it will partition my data by key into 100 parts. i.e. data associated with similar keys will be grouped together Is my understanding correct? Is it advisable…
user2543622
  • 5,760
  • 25
  • 91
  • 159
11
votes
2 answers

Hive doesn't read partitioned parquet files generated by Spark

I'm having a problem to read partitioned parquet files generated by Spark in Hive. I'm able to create the external table in hive but when I try to select a few lines, hive returns only an "OK" message with no rows. I'm able to read the partitioned…
ALunz
  • 311
  • 2
  • 8
11
votes
4 answers

Is it possible to partially refresh a materialized view in Oracle?

I have a very complex Oracle view based on other materialized views, regular views as well as some tables (I can't "fast refresh" it). Most of the time, existing records in this view are based on a date and are "stable", with new record sets having…
Galghamon
  • 2,012
  • 18
  • 27
11
votes
2 answers

Database sharding on Heroku

At some point in the next few months our app will be at the size where we need to shard our DB. We are using Heroku for hosting, Node.js/PostgreSQL stack. Conceptually, it makes sense for our app to have each logical shard represent one user and all…
raviparikh
  • 295
  • 1
  • 4
  • 11
11
votes
4 answers

How to script sfdisk or parted for multiple partitions?

For QA purposes I need to be able to partition a drive via a bash script up to 30 or more partitions for both RHEL and SLES. I have attempted to do this in BASH with fdisk via a "here document" which works but as you can guess blows up in various…
LabRat
  • 237
  • 3
  • 4
  • 14
10
votes
2 answers

How does one Azure table storage table with many partition keys compare to many tables with fewer partition keys?

I have a Windows Azure application in which all read queries of TableA are executed on single partitions for a range of rowkeys. The Partition Keys that facilitate this storage scheme are actually flattened names of objects in a hierarchy, such that…
user483679
  • 665
  • 1
  • 7
  • 21
10
votes
2 answers

Sharded load balancing for stateful services in Kubernetes

I am currently switching from Service Fabric to Kubernetes and was wondering how to do custom and more complex load balancing. So far I already read about Kubernetes offering "Services" which do load balancing for pods hidden behind them, but this…
10
votes
1 answer

how to add new column to partitioned tables in postgres

I have created a new master table with multiple partitions on basis of a column value using declarative partitioning of postgres 10. How can i add new columns to the tables?
Shreya Batra
  • 730
  • 1
  • 6
  • 15