Questions tagged [partitioner]

A partitioner is a software component that divides a possibly very large set of data into a number of smaller groups, ideally of roughly equal size.

This is a performance technique that reduces the amount of time spent processing the entire set of data, particularly with algorithms whose cost grows steeply (for example, exponentially) with input size.
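As a rough illustration of the idea described above, here is a minimal sketch of a hash-based partitioner. The function name `partition` and its parameters are hypothetical, chosen only for this example, and the sketch makes no attempt to guarantee equally sized groups.

```python
from zlib import crc32


def partition(records, num_partitions, key=lambda record: record):
    """Group records into num_partitions buckets by a stable hash of their key."""
    buckets = [[] for _ in range(num_partitions)]
    for record in records:
        # crc32 is deterministic across runs, unlike Python's built-in hash() for str.
        index = crc32(str(key(record)).encode("utf-8")) % num_partitions
        buckets[index].append(record)
    return buckets


if __name__ == "__main__":
    groups = partition(range(1000), 4)
    print([len(group) for group in groups])
```

Records with the same key always land in the same bucket, which is the property most of the frameworks in the questions below rely on.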

59 questions
2
votes
1 answer

Springbatch dynamic multiple xml File writer

I have to write a batch that reads some data from a DB (each row is an item, this is fine), then does some processing to add some more data (more data is always better ;) ). Then here is my problem: I have to write each item to an XML file whose name depends…
bodtx
  • 590
  • 9
  • 29
2
votes
2 answers

Why is the Partitioner invoked even with a single reducer

If we have an MR job configured to run with only a single reducer, it seems logical that a Partitioner need not be invoked. However, I just gave this a shot and it looks like the Partitioner is invoked even if the job is configured with a single…
Sudarshan
  • 8,574
  • 11
  • 52
  • 74
2
votes
2 answers

Custom Counter inside the Hadoop Partitioner

I would like to capture some information on keys and their values inside a custom Partitioner (or even the default HashPartitioner). I can use custom counters inside both mappers and reducers by accessing the "context" variable. However, inside the…
1
vote
2 answers

Hadoop order of operations

According to the attached image found on Yahoo's Hadoop tutorial, the order of operations is map > combine > partition, which should be followed by reduce. Here is an example key emitted by the map operation: LongValueSum:geo_US|1311722400|E …
Premal Shah
  • 181
  • 4
  • 13
1
vote
1 answer

Kafka RoundRobin partitioner not distributing messages to all the partitions

I am trying to use Kafka's RoundRobinPartitioner class for distributing messages evenly across all the partitions. My Kafka topic configuration is as follows: name: multischemakafkatopicodd number of partitions: 16 replication factor: 2 Say, if I am…
1
vote
0 answers

Is Optaplanner Strength Comparator compatible with Partitioning?

Has anyone tried Optaplanner's Partitioned Search feature at the same time as the strength comparator class? Firstly, I created a custom partitioner that splits the planning entities and assigns the planning values (it does not split the planning…
pineapplw
  • 71
  • 3
1
vote
2 answers

Technique for joining with spark dataframe w/ custom partitioner works w/ python, but not scala?

I recently read an article that described how to custom partition a dataframe [ https://dataninjago.com/2019/06/01/create-custom-partitioner-for-spark-dataframe/ ] in which the author illustrated the technique in Python. I use Scala, and the…
Chris Bedford
  • 2,560
  • 3
  • 28
  • 60
1
vote
2 answers

How to properly apply HashPartitioner before a join in Spark?

To reduce shuffling during the joining of two RDDs, I decided to partition them using HashPartitioner first. Here is how I do it. Am I doing it correctly, or is there a better way to do this? val rddA = ... val rddB = ... val numOfPartitions =…
MetallicPriest
  • 29,191
  • 52
  • 200
  • 356
1
vote
0 answers

Spark even data distribution

I am trying to solve a skewed data problem in a dataframe. I have introduced a new column based on a bin packing algorithm, which should evenly distribute data among the bins (partitions in my case). My count for the bin is 500,000 rows. I have…
Waqar Ahmed
  • 5,005
  • 2
  • 23
  • 45
1
vote
1 answer

Customize Partitioner to balance inputs to reducers

Suppose my mappers output N keys (all distinct), and I have K reducers. How do I write a custom Partitioner so that each reducer receives approximately N/K keys? Which keys go to which reducer is not important. Example: Suppose my mappers…
cdt
  • 85
  • 10
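One possible way to picture the balancing goal in the question above is a lookup table built ahead of time. This is only a sketch under a strong assumption that the question does not state: the full set of distinct keys is known before partitioning. The name `build_balanced_assignment` is hypothetical, and in a real job the table would still have to be made available to every task.

```python
def build_balanced_assignment(keys, num_reducers):
    """Map each distinct key to a reducer index in round-robin order.

    Assumes the complete key set is known in advance (an assumption of this sketch).
    """
    distinct_keys = sorted(set(keys))
    return {key: i % num_reducers for i, key in enumerate(distinct_keys)}


if __name__ == "__main__":
    assignment = build_balanced_assignment([f"key{i}" for i in range(10)], 3)
    per_reducer = [list(assignment.values()).count(r) for r in range(3)]
    print(per_reducer)
```

With N distinct keys and K reducers, the per-reducer key counts differ by at most one. A plain hash partitioner only approaches N/K on average.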
1
vote
1 answer

type HashPartitioner is not a member of org.apache.spark.sql.SparkSession

I was using spark-shell to experiment with Spark's HashPartitioner. The error is shown as follows: scala> val data = sc.parallelize(List((1, 3), (2, 4), (3, 6), (3, 7))) data: org.apache.spark.rdd.RDD[(Int, Int)] = ParallelCollectionRDD[0] at…
Xiangyu
  • 824
  • 9
  • 34
1
vote
0 answers

Why does hadoop partitioner do a binary AND?

I'm completely new to Hadoop and fairly new to Map/Reduce, so bear with me if this is a very simple question. In Hadoop's hash partitioner, why does it do a hash(key) & Integer.MAX_VALUE before doing a modulo with the number of reducers? What is the…
Kevin
  • 3,209
  • 9
  • 39
  • 53
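The masking step asked about above can be illustrated outside Hadoop. The sketch below emulates Java's signed 32-bit `int` and its `%` operator in Python, since Python's own `%` behaves differently for negative numbers. All helper names (`to_int32`, `java_string_hash`, `java_rem`, `masked_partition`, `unmasked_partition`) are made up for this example and are not Hadoop APIs.

```python
INT_MAX = 0x7FFFFFFF  # Java's Integer.MAX_VALUE


def to_int32(n):
    """Wrap a Python int to a signed 32-bit value, as Java's int arithmetic does."""
    n &= 0xFFFFFFFF
    return n - 0x100000000 if n >= 0x80000000 else n


def java_string_hash(s):
    """Emulate String.hashCode() for characters in the Basic Multilingual Plane."""
    h = 0
    for ch in s:
        h = to_int32(31 * h + ord(ch))
    return h


def java_rem(a, b):
    """Emulate Java's % operator: the result takes the sign of the dividend."""
    r = abs(a) % abs(b)
    return -r if a < 0 else r


def unmasked_partition(hash_code, num_reducers):
    # Without the mask, a negative hash code yields a negative partition index.
    return java_rem(hash_code, num_reducers)


def masked_partition(hash_code, num_reducers):
    # Clearing the sign bit first keeps the result in [0, num_reducers).
    return java_rem(hash_code & INT_MAX, num_reducers)


if __name__ == "__main__":
    print(unmasked_partition(-7, 3), masked_partition(-7, 3))
```

Hash codes in Java can be negative, and a negative partition index is not a valid reducer number. The mask also handles `Integer.MIN_VALUE`, for which `Math.abs` would still return a negative value.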
1
vote
1 answer

How to use Distributed cache in partitioner hadoop?

I am new to Hadoop and MapReduce partitioners. I want to write my own partitioner, and I need to read a file in the partitioner. I have searched many times and found that I should use the distributed cache. My question is: how can I use the distributed…
Saeed Nasehi
  • 940
  • 1
  • 11
  • 27
1
vote
2 answers

Does the default hash partitioner still work if a custom partitioner is defined in Hadoop Map Reduce?

As I am new to Hadoop, I tried out the sample code from http://www.tutorialspoint.com/map_reduce/map_reduce_partitioner.htm I found that the program uses 3 different partitions based on age group, and 3 reducers are also used, which is expected. But…
1
vote
2 answers

What if a custom partitioner is made to select different partitions for records having the same key?

While learning Hadoop MapReduce, I came across how to create a custom Partitioner class. I understand that we need to define the abstract getPartition method in our class. This method is supposed to return the Partition number (an integer) for the…
Ankit Khettry
  • 997
  • 1
  • 13
  • 33
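To make the scenario in the last question concrete, here is a small simulation of what could happen when a partitioner sends records with the same key to different partitions. It is a toy model, not Hadoop code: `run_reducers`, `by_key`, and `by_value` are hypothetical names, and the "reducer" simply sums values per key.

```python
from collections import Counter


def run_reducers(records, num_reducers, get_partition):
    """Route (key, value) records to reducers, then sum values per key in each one."""
    reducers = [Counter() for _ in range(num_reducers)]
    for key, value in records:
        reducers[get_partition(key, value, num_reducers)][key] += value
    return [dict(reducer) for reducer in reducers]


def by_key(key, value, num_partitions):
    # Depends only on the key, so all records for a key share one reducer.
    return sum(map(ord, key)) % num_partitions


def by_value(key, value, num_partitions):
    # Depends on the value, so one key can be spread over several reducers.
    return value % num_partitions


if __name__ == "__main__":
    records = [("a", 1), ("a", 2), ("b", 3), ("a", 4)]
    print(run_reducers(records, 2, by_key))
    print(run_reducers(records, 2, by_value))
```

In this model, the key-based partitioner produces one total per key, while the value-based one leaves key `"a"` with a separate partial sum on each reducer.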