Questions tagged [partitioning]

Partitioning is a performance strategy whereby you divide possibly very large groups of data into some number of smaller groups of data.

The expectation is that, for algorithms whose cost grows faster than linearly in N (for example O(N²)), the total time to process the smaller groups and combine the results is still less than the time it would take to process the one larger set of data.
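The arithmetic behind that expectation can be sketched as follows. This is only an illustration under stated assumptions: the work is modelled as exactly n² steps, the groups are near-equal in size, and combining the results costs one step per item. The function names `quadratic_cost` and `partitioned_cost` are hypothetical, and not every superlinear algorithm can be split and recombined this cheaply.

```python
def quadratic_cost(n: int) -> int:
    """Steps taken by a hypothetical algorithm that does n * n work on n items."""
    return n * n


def partitioned_cost(n: int, k: int, combine_cost_per_item: int = 1) -> int:
    """Steps to split n items into k near-equal groups, process each group
    with the quadratic algorithm, then combine the results.

    Assumption: combining is linear in n (combine_cost_per_item steps per item).
    """
    base, extra = divmod(n, k)
    # 'extra' groups get one additional item so the sizes sum to n.
    sizes = [base + 1] * extra + [base] * (k - extra)
    return sum(quadratic_cost(s) for s in sizes) + combine_cost_per_item * n


if __name__ == "__main__":
    n = 1_000_000
    whole = quadratic_cost(n)
    split = partitioned_cost(n, k=100)
    print(f"one large set:  {whole:,} steps")
    print(f"100 partitions: {split:,} steps")
```

With k equal groups the processing cost falls from n² to roughly n²/k, so the saving holds as long as the combine step stays cheap relative to that.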

In MySQL, LIST partitioning is similar to RANGE partitioning in many ways. As in partitioning by RANGE, each partition must be explicitly defined.

3138 questions
17
votes
4 answers

Clustering, Sharding or simple Partition / Replication

We have created a Facebook application and it got a lot of virality. The problem is that our database started getting REALLY FULL (some tables have more than 25 million rows now). It got to the point that the app just stopped working because there…
albertosh
  • 2,416
  • 7
  • 25
  • 32
17
votes
4 answers

How to get the number of elements in partition?

Is there any way to get the number of elements in a Spark RDD partition, given the partition ID, without scanning the entire partition? Something like this: Rdd.partitions().get(index).size() Except I don't see such an API for Spark. Any ideas?…
Geo
  • 173
  • 1
  • 1
  • 5
17
votes
2 answers

How to see table partition size in MySQL ( is it even possible? )

I've partitioned my table horizontally and I'd like to see how the rows are currently distributed. Searching the web didn't bring any relevant results. Could anyone tell me if this is possible?
16
votes
2 answers

Partition data for efficient joining for Spark dataframe/dataset

I need to join many DataFrames together based on some shared key columns. For a key-value RDD, one can specify a partitioner so that data points with same key are shuffled to same executor so joining is more efficient (if one has shuffle related…
16
votes
3 answers

Efficient querying of multi-partition Postgres table

I've just restructured my database to use partitioning in Postgres 8.2. Now I have a problem with query performance: SELECT * FROM my_table WHERE time_stamp >= '2010-02-10' and time_stamp < '2010-02-11' ORDER BY id DESC LIMIT 100; There are 45…
Adrian Pronk
  • 13,486
  • 7
  • 36
  • 60
16
votes
1 answer

Best way to manage row expiration in mysql

An application does the following: it writes a row to a table that has a unique ID; it reads the table, finds the unique ID, and outputs the other variables (among them the timestamp). The question is: the application needs to read only the non-expired…
smartcity
  • 191
  • 1
  • 2
  • 6
16
votes
8 answers

How to select rows from partition in MySQL

I partitioned my 300MB table and am trying to run a SELECT query against partition p0 with this command: SELECT * FROM employees PARTITION (p0); But I am getting the following error: ERROR 1064 (42000): You have an error in your SQL syntax; check the manual…
Kad
  • 542
  • 1
  • 5
  • 18
16
votes
1 answer

Partitions and UPDATE

I'm diving deeper and deeper into MySQL features, and the next one I'm trying out is table partitions. There's basically only one question about them where I couldn't find a clear answer yet: if you UPDATE a row, will the row be moved to another…
Katai
  • 2,773
  • 3
  • 31
  • 45
15
votes
5 answers

Clojure partition by filter

In Scala, the partition method splits a sequence into two separate sequences -- those for which the predicate is true and those for which it is false: scala> List(1, 5, 2, 4, 6, 3, 7, 9, 0, 8).partition(_ % 2 == 0) res1: (List[Int], List[Int]) =…
Ralph
  • 31,584
  • 38
  • 145
  • 282
15
votes
3 answers

How to optimize partitioning when migrating data from JDBC source?

I am trying to move data from a PostgreSQL table to a Hive table on HDFS. To do that, I came up with the following code: val conf = new…
Metadata
  • 2,127
  • 9
  • 56
  • 127
15
votes
1 answer

Partitioning in spark while reading from RDBMS via JDBC

I am running Spark in cluster mode and reading data from an RDBMS via JDBC. As per the Spark docs, these partitioning parameters describe how to partition the table when reading in parallel from multiple…
Dev
  • 13,492
  • 19
  • 81
  • 174
15
votes
1 answer

Understanding shuffle managers in Spark

Let me help clarify shuffle in depth and how Spark uses shuffle managers. I report some very helpful…
Giorgio
  • 1,073
  • 3
  • 15
  • 33
15
votes
3 answers

spark parquet write gets slow as partitions grow

I have a spark streaming application that writes parquet data from stream. sqlContext.sql( """ |select |to_date(from_utc_timestamp(from_unixtime(at), 'US/Pacific')) as event_date, …
Gaurav Shah
  • 5,223
  • 7
  • 43
  • 71
15
votes
1 answer

How to Partition a Table by Month ("Both" YEAR & MONTH) and create monthly partitions automatically?

I'm trying to partition a table by both year and month. The column I'll partition on is a datetime-type column in ISO format ('20150110', '20150202', etc.). For example, I have sales data for 2010, 2011, 2012. I'd like the data to be…
Amr Tharwat
  • 251
  • 1
  • 2
  • 10
15
votes
2 answers

Undo Table Partitioning

I have a table 'X' and did the following CREATE PARTITION FUNCTION PF1(INT) AS RANGE LEFT FOR VALUES (1, 2, 3, 4) CREATE PARTITION SCHEME PS1 AS PARTITION PF1 ALL TO ([PRIMARY]) CREATE CLUSTERED INDEX CIDX_X ON X(col1) ON PS1(col1) this 3 steps…
Storm
  • 4,307
  • 11
  • 40
  • 57