Questions tagged [partitioning]

Partitioning is a performance strategy whereby you divide possibly very large groups of data into some number of smaller groups of data.

The expectation is that, for algorithms whose running time grows faster than linearly in N, the total time it takes to process the smaller groups and combine the results is still less than the time it would take to process the one larger set of data.
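As a rough illustration of that expectation, the sketch below counts the work done by a hypothetical all-pairs (quadratic) algorithm on one large group versus the same data split into near-equal groups. The function names are made up for this example, and the cost of combining the partial results is deliberately left out, so it only shows the best case.

```python
def pairwise_comparisons(n: int) -> int:
    """Comparisons made by an all-pairs (quadratic) algorithm on n items."""
    return n * (n - 1) // 2


def partitioned_comparisons(n: int, k: int) -> int:
    """Comparisons when n items are split into k near-equal groups
    and each group is processed independently (combine cost ignored)."""
    base, extra = divmod(n, k)
    sizes = [base + 1] * extra + [base] * (k - extra)
    return sum(pairwise_comparisons(size) for size in sizes)


whole = pairwise_comparisons(1_000_000)
split = partitioned_comparisons(1_000_000, 100)
print(f"one group: {whole:,} comparisons")
print(f"100 groups: {split:,} comparisons")
```

For a quadratic algorithm, splitting into k groups cuts the per-group work by roughly a factor of k in total. Whether that saving survives in practice depends on how expensive the combine step is.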

List partitioning is similar to range partitioning in many ways. As in partitioning by RANGE, each partition must be explicitly defined.

3138 questions

5 answers

How to perform one operation on each executor once in Spark

I have a Weka model stored in S3 which is around 400 MB in size. Now, I have a set of records on which I want to run the model and perform prediction. To perform the prediction, what I have tried is: download and load the model on the driver as a…

asked Oct 13 '16 at 08:20

Neha

3 answers

Database - Designing an "Events" Table

After reading the tips from this great Nettuts+ article I've come up with a table schema that would separate highly volatile data from other tables subjected to heavy reads and at the same time lower the number of tables needed in the whole database…

mysql database database-design relational partitioning

asked Apr 20 '10 at 02:35

Alix Axel

3 answers

How to update partition metadata in Hive when partition data is manually deleted from HDFS

What is the way to automatically update the metadata of Hive partitioned tables? If new partition data was added to HDFS (without executing the alter table add partition command), then we can sync up the metadata by executing the command 'msck…

hive partitioning

asked Jan 14 '14 at 07:43

vinu.m.19

5 answers

What is table partitioning?

In which cases should we use table partitioning?

sql partitioning database-partitioning database-table

asked Nov 30 '09 at 18:06

P Sharma

2 answers

In Apache Spark, why does RDD.union not preserve the partitioner?

As everyone knows, partitioners in Spark have a huge performance impact on any "wide" operations, so they are usually customized in operations. I was experimenting with the following code: val rdd1 = sc.parallelize(1 to 50).keyBy(_ % 10) …

apache-spark partitioning hadoop-partitioning

asked Apr 30 '15 at 20:49

tribbloid

2 answers

Spark lists all leaf node even in partitioned data

I have parquet data partitioned by date & hour, folder structure: events_v3 -- event_date=2015-01-01 -- event_hour=2015-01-1 -- part10000.parquet.gz -- event_date=2015-01-02 -- event_hour=5 -- part10000.parquet.gz I have…

apache-spark amazon-s3 apache-spark-sql partitioning parquet

asked Sep 15 '16 at 14:19

Gaurav Shah

3 answers

How to control partition size in Spark SQL

I have a requirement to load data from a Hive table using Spark SQL HiveContext and load it into HDFS. By default, the DataFrame from the SQL output has 2 partitions. To get more parallelism I need more partitions out of the SQL. There is no…

apache-spark hive apache-spark-sql partitioning

asked Jul 07 '16 at 15:34

nagendra

1 answer

Is it possible to create a Kafka topic with dynamic partition count?

I am using Kafka to stream the events of page visits by the website users to an analytics service. Each event will contain the following details for the consumer: user id, IP address of the user. I need very high throughput, so I decided to…

apache-kafka partitioning kafka-consumer-api

asked Sep 24 '15 at 12:40

vivek_jonam

4 answers

Partitioning a large skewed dataset in S3 with Spark's partitionBy method

I am trying to write out a large partitioned dataset to disk with Spark and the partitionBy algorithm is struggling with both of the approaches I've tried. The partitions are heavily skewed - some of the partitions are massive and others are…

apache-spark apache-spark-sql partitioning

asked Oct 28 '18 at 23:52

Powers

6 answers

Apache Spark: Get number of records per partition

I want to know how to get information about each partition, such as the total number of records in each partition, on the driver side when a Spark job is submitted in YARN cluster deploy mode, in order to log it or print it to the console.

scala apache-spark hadoop apache-spark-sql partitioning

asked Sep 04 '17 at 07:34

nilesh1212

5 answers

Java 8 partition list

Is it possible to partition a List in pure JDK 8 into equal chunks (sublists)? I know it is possible using the Guava Lists class, but can we do it with pure JDK? I don't want to add new jars to my project just for one use case. SOLUTIONS: The best…

java java-8 partitioning

asked Jun 23 '15 at 06:52

Beri

8 answers

Split a list of numbers into n chunks such that the chunks have (close to) equal sums and keep the original order

This is not the standard partitioning problem, as I need to maintain the order of elements in the list. So for example if I have a list [1, 6, 2, 3, 4, 1, 7, 6, 4] and I want two chunks, then the split should give [[1, 6, 2, 3, 4, 1], [7, 6, 4]]…
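One possible reading of "close to equal sums" is to minimise the largest chunk sum while keeping the chunks contiguous. The sketch below implements that reading with a small dynamic program; the function name `split_ordered` is made up for this example, and other definitions of "close to equal" (for instance, minimising variance) could give different splits.

```python
from itertools import accumulate


def split_ordered(values, n_chunks):
    """Split values into n_chunks contiguous chunks, minimising the largest chunk sum."""
    size = len(values)
    if not 1 <= n_chunks <= size:
        raise ValueError("n_chunks must be between 1 and len(values)")

    prefix = [0, *accumulate(values)]  # prefix[i] = sum of the first i items
    inf = float("inf")
    # best[k][i]: smallest possible largest-chunk sum for the first i items in k chunks
    best = [[inf] * (size + 1) for _ in range(n_chunks + 1)]
    # cut[k][i]: start index of the last chunk in that best arrangement
    cut = [[0] * (size + 1) for _ in range(n_chunks + 1)]
    best[0][0] = float("-inf")

    for k in range(1, n_chunks + 1):
        for i in range(k, size + 1):
            for j in range(k - 1, i):
                cost = max(best[k - 1][j], prefix[i] - prefix[j])
                if cost < best[k][i]:
                    best[k][i] = cost
                    cut[k][i] = j

    # Walk the recorded cut points backwards to rebuild the chunks.
    chunks = []
    end = size
    for k in range(n_chunks, 0, -1):
        start = cut[k][end]
        chunks.append(list(values[start:end]))
        end = start
    return chunks[::-1]


print(split_ordered([1, 6, 2, 3, 4, 1, 7, 6, 4], 2))
```

This runs in O(n_chunks × len(values)²) time, which is fine for short lists; a binary search over the target maximum sum would scale better for long inputs.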

python algorithm partitioning

asked Feb 19 '16 at 23:35

Ng Oon-Ee

1 answer

How does partitioning work in Spark?

I'm trying to understand how partitioning is done in Apache Spark. Can you guys help please? Here is the scenario: a master and two nodes with 1 core each, and a file count.txt of 10 MB in size. How many partitions does the following create? rdd =…

apache-spark partitioning

asked Oct 14 '14 at 19:02

abhishek kurasala

4 answers

Missing STOPKEY per partition in Oracle plan for paging by local index

I have the following partitioned table: CREATE TABLE "ERMB_LOG_TEST_BF"."OUT_SMS"( "TRX_ID" NUMBER(19,0) NOT NULL ENABLE, "CREATE_TS" TIMESTAMP (3) DEFAULT systimestamp NOT NULL ENABLE, /* other fields... */ ) PCTFREE 10 PCTUSED 40 INITRANS 1…

sql oracle prepared-statement partitioning sql-execution-plan

asked Mar 12 '13 at 09:32

Tsimon Dorakh

8 answers

How to find all partitions of a set

I have a set of distinct values. I am looking for a way to generate all partitions of this set, i.e. all possible ways of dividing the set into subsets. For instance, the set {1, 2, 3} has the following partitions: { {1}, {2}, {3} }, { {1, 2}, {3}…
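A common recursive approach is to take one element and either add it to each block of a partition of the remaining elements or give it a block of its own. The sketch below shows that idea in Python rather than the C# the question is tagged with; `set_partitions` is a name chosen for this example.

```python
def set_partitions(items):
    """Yield every partition of items as a list of blocks (each block is a list)."""
    items = list(items)
    if not items:
        yield []  # the empty set has exactly one partition: no blocks
        return
    first, rest = items[0], items[1:]
    for partition in set_partitions(rest):
        # Option 1: place `first` into each existing block in turn.
        for i, block in enumerate(partition):
            yield partition[:i] + [[first] + block] + partition[i + 1:]
        # Option 2: place `first` in a new block of its own.
        yield [[first]] + partition


for partition in set_partitions([1, 2, 3]):
    print(partition)
```

The number of partitions grows as the Bell numbers, so enumerating them all is only practical for small sets.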

c# algorithm set partitioning

asked Dec 11 '13 at 21:19

Daniel Wolf
