Questions tagged [partitioning]

Partitioning is a performance strategy whereby you divide possibly very large groups of data into some number of smaller groups of data.

The expectation is that, for algorithms whose cost grows faster than linearly with the input size N, the total time to process the smaller groups and combine the results is still less than the time it would take to process the one larger set of data.
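As a rough illustration of that expectation, the sketch below counts the comparisons made by a hypothetical all-pairs (quadratic) algorithm, once on the whole data set and once on near-equal partitions. The function names are made up for this example, and the cost of combining partial results is deliberately left out, so treat the numbers as an upper bound on the benefit rather than a measurement.

```python
def pairwise_cost(n: int) -> int:
    """Number of unordered pairs an all-pairs (quadratic) algorithm compares."""
    return n * (n - 1) // 2


def partitioned_cost(n: int, k: int) -> int:
    """Total pairs compared when n items are split into k near-equal groups.

    Assumption: the algorithm only needs to compare items within a group,
    and merging the per-group results is not counted here.
    """
    base, extra = divmod(n, k)
    # 'extra' groups get one more item so that the sizes add up to n.
    sizes = [base + 1] * extra + [base] * (k - extra)
    return sum(pairwise_cost(size) for size in sizes)


if __name__ == "__main__":
    n, k = 1_000_000, 100
    whole = pairwise_cost(n)
    split = partitioned_cost(n, k)
    print(f"unpartitioned: {whole:,} comparisons")
    print(f"{k} partitions: {split:,} comparisons")
    print(f"reduction factor: {whole / split:.1f}x")
```

For a quadratic cost, splitting into k groups cuts the work by roughly a factor of k; the saving shrinks as the combine step gets more expensive.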

List partitioning is similar to range partitioning in many ways. As in partitioning by RANGE, each partition must be explicitly defined.

3138 questions
31
votes
5 answers

How to perform one operation on each executor once in Spark

I have a Weka model stored in S3 that is around 400 MB in size. Now I have a set of records on which I want to run the model and perform prediction. To perform the prediction, what I have tried is: download and load the model on the driver as a…
Neha
  • 537
  • 2
  • 7
  • 15
30
votes
3 answers

Database - Designing an "Events" Table

After reading the tips from this great Nettuts+ article I've come up with a table schema that would separate highly volatile data from other tables subjected to heavy reads and at the same time lower the number of tables needed in the whole database…
Alix Axel
  • 151,645
  • 95
  • 393
  • 500
30
votes
3 answers

How to update partition metadata in Hive when partition data is manually deleted from HDFS

What is the way to automatically update the metadata of Hive partitioned tables? If new partition data was added to HDFS (without executing an alter table add partition command), then we can sync up the metadata by executing the command 'msck…
vinu.m.19
  • 495
  • 2
  • 8
  • 16
28
votes
5 answers

What is table partitioning?

In which cases should we use table partitioning?
P Sharma
  • 2,638
  • 11
  • 31
  • 35
27
votes
2 answers

In Apache Spark, why does RDD.union not preserve the partitioner?

As everyone knows partitioners in Spark have a huge performance impact on any "wide" operations, so it's usually customized in operations. I was experimenting with the following code: val rdd1 = sc.parallelize(1 to 50).keyBy(_ % 10) …
tribbloid
  • 4,026
  • 14
  • 64
  • 103
26
votes
2 answers

Spark lists all leaf node even in partitioned data

I have parquet data partitioned by date & hour, folder structure: events_v3 -- event_date=2015-01-01 -- event_hour=2015-01-1 -- part10000.parquet.gz -- event_date=2015-01-02 -- event_hour=5 -- part10000.parquet.gz I have…
Gaurav Shah
  • 5,223
  • 7
  • 43
  • 71
26
votes
3 answers

How to control partition size in Spark SQL

I have a requirement to load data from a Hive table using Spark SQL HiveContext and load it into HDFS. By default, the DataFrame from the SQL output has 2 partitions. To get more parallelism I need more partitions out of the SQL. There is no…
nagendra
  • 1,885
  • 3
  • 17
  • 27
26
votes
1 answer

Is it possible to create a Kafka topic with dynamic partition count?

I am using Kafka to stream the events of page visits by the website users to an analytics service. Each event will contain the following details for the consumer: user id, IP address of the user. I need very high throughput, so I decided to…
vivek_jonam
  • 3,237
  • 8
  • 32
  • 44
25
votes
4 answers

Partitioning a large skewed dataset in S3 with Spark's partitionBy method

I am trying to write out a large partitioned dataset to disk with Spark and the partitionBy algorithm is struggling with both of the approaches I've tried. The partitions are heavily skewed - some of the partitions are massive and others are…
Powers
  • 18,150
  • 10
  • 103
  • 108
25
votes
6 answers

Apache Spark: Get number of records per partition

I want to find out how to get information about each partition, such as the total number of records in each partition, on the driver side when a Spark job is submitted with yarn-cluster deploy mode, in order to log it or print it on the console.
nilesh1212
  • 1,561
  • 2
  • 26
  • 60
25
votes
5 answers

Java 8 partition list

Is it possible to partition a List in pure JDK 8 into equal chunks (sublists)? I know it is possible using the Guava Lists class, but can we do it with the pure JDK? I don't want to add new jars to my project just for one use case. SOLUTIONS: The best…
Beri
  • 11,470
  • 4
  • 35
  • 57
24
votes
8 answers

Split a list of numbers into n chunks such that the chunks have (close to) equal sums and keep the original order

This is not the standard partitioning problem, as I need to maintain the order of elements in the list. So for example if I have a list [1, 6, 2, 3, 4, 1, 7, 6, 4] and I want two chunks, then the split should give [[1, 6, 2, 3, 4, 1], [7, 6, 4]]…
Ng Oon-Ee
  • 1,193
  • 1
  • 10
  • 26
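One way to read the order-preserving split described in the question above is as the classic "linear partition" problem. The sketch below is not taken from the question or its answers; it assumes that "close to equal sums" means minimizing the largest chunk sum, and `split_ordered` is a name chosen for this example.

```python
from itertools import accumulate


def split_ordered(nums, n):
    """Split nums into n contiguous, non-empty chunks, keeping the order.

    Assumption: the goal is to minimize the largest chunk sum.
    Dynamic programming over (number of chunks, prefix length), O(n * m^2).
    """
    m = len(nums)
    if not 1 <= n <= m:
        raise ValueError("n must be between 1 and len(nums)")
    prefix = [0, *accumulate(nums)]  # prefix[i] = sum of nums[:i]
    inf = float("inf")
    # best[k][i]: smallest possible largest-chunk-sum for nums[:i] in k chunks
    best = [[inf] * (m + 1) for _ in range(n + 1)]
    # cut[k][i]: start index of the last chunk in that optimal split
    cut = [[0] * (m + 1) for _ in range(n + 1)]
    best[0][0] = float("-inf")  # zero chunks over zero items costs nothing
    for k in range(1, n + 1):
        for i in range(k, m + 1):
            for j in range(k - 1, i):
                cost = max(best[k - 1][j], prefix[i] - prefix[j])
                if cost < best[k][i]:
                    best[k][i] = cost
                    cut[k][i] = j
    # Walk the recorded cut points backwards to rebuild the chunks.
    chunks, i = [], m
    for k in range(n, 0, -1):
        j = cut[k][i]
        chunks.append(nums[j:i])
        i = j
    return chunks[::-1]


if __name__ == "__main__":
    print(split_ordered([1, 6, 2, 3, 4, 1, 7, 6, 4], 2))
```

On the list from the question with two chunks, this objective reproduces the split quoted there; other definitions of "close to equal" (for example, minimizing variance) could give different results on other inputs.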
24
votes
1 answer

How does partitioning work in Spark?

I'm trying to understand how partitioning is done in Apache Spark. Can you guys help please? Here is the scenario: a master and two nodes with 1 core each; a file count.txt of 10 MB in size. How many partitions does the following create? rdd =…
abhishek kurasala
  • 295
  • 1
  • 3
  • 6
24
votes
4 answers

Missing STOPKEY per partition in Oracle plan for paging by local index

There is the following partitioned table: CREATE TABLE "ERMB_LOG_TEST_BF"."OUT_SMS"( "TRX_ID" NUMBER(19,0) NOT NULL ENABLE, "CREATE_TS" TIMESTAMP (3) DEFAULT systimestamp NOT NULL ENABLE, /* other fields... */ ) PCTFREE 10 PCTUSED 40 INITRANS 1…
23
votes
8 answers

How to find all partitions of a set

I have a set of distinct values. I am looking for a way to generate all partitions of this set, i.e. all possible ways of dividing the set into subsets. For instance, the set {1, 2, 3} has the following partitions: { {1}, {2}, {3} }, { {1, 2}, {3}…
Daniel Wolf
  • 12,855
  • 13
  • 54
  • 80
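The idea of enumerating every partition of a set, as described in the question above, can be sketched recursively: take one element out, partition the rest, then either add the element to one of the existing blocks or give it a block of its own. This is a minimal illustration, not the approach from any particular answer; `set_partitions` is a name chosen for this example, and the input is assumed to contain distinct values.

```python
def set_partitions(items):
    """Yield every partition of items as a list of blocks (lists).

    Assumption: items are distinct. The number of results is the Bell
    number of len(items), which grows very quickly.
    """
    items = list(items)
    if not items:
        yield []  # the empty set has exactly one partition: no blocks
        return
    first, rest = items[0], items[1:]
    for partition in set_partitions(rest):
        # Option 1: add 'first' to each existing block in turn.
        for i, block in enumerate(partition):
            yield partition[:i] + [[first] + block] + partition[i + 1:]
        # Option 2: put 'first' into a new block of its own.
        yield [[first]] + partition


if __name__ == "__main__":
    for p in set_partitions([1, 2, 3]):
        print(p)
```

Because the function is a generator, partitions are produced one at a time, which matters once the set has more than a handful of elements.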