Hadoop partitioning deals with questions about how Hadoop decides which key/value pairs are sent to which reducer (partition).
Questions tagged [hadoop-partitioning]
339 questions
1
vote
1 answer
Creating view in HIVE
I want to create a view on a Hive table which is partitioned. My view definition is as below:
create view schema.V1 as select t1.* from scehma.tab1 as t1 inner join (select record_key ,max(last_update) as last_update from scehma.tab1 group by…

jayanta layak
- 11
- 4
1
vote
1 answer
TotalOrderPartitioner and mrjob
How does one specify the TotalOrderPartitioner when using mrjob? Is this the default, or must it be specified explicitly? I've seen inconsistent behavior on different data sets.

vy32
- 28,461
- 37
- 122
- 246
1
vote
1 answer
Set date function as variable and use in beeline and hql file (hive)
Could anyone please explain to me how to solve this issue?
I want to use from_unixtime(unix_timestamp() - 86400, 'yyyyMMdd') as the value of a variable and use it in the WHERE clause of a query that is stored in an .hql file. I have tried:
beeline…

smastika
- 137
- 2
- 12
1
vote
1 answer
Facing an error when using TotalOrderPartitioner MapReduce
I have written the below program.
I have run it without using TotalOrderPartitioner and it ran fine, so I don't think there is any issue with the Mapper or Reducer classes as such.
But when I include the code for TotalOrderPartitioner, i.e. write…

Don Sam
- 525
- 5
- 20
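The failing code is not shown above, so the following is only a generic sketch of how TotalOrderPartitioner is usually wired up in a driver, for comparison. The input/output paths, partition-file location, key types, reducer count and sampler settings are all assumptions, not taken from the question.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.partition.InputSampler;
import org.apache.hadoop.mapreduce.lib.partition.TotalOrderPartitioner;

public class TotalOrderDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "total order sort");
        job.setJarByClass(TotalOrderDriver.class);

        // The sampler reads input keys directly, so the input key type must
        // match the map output key type (Text here).
        job.setInputFormatClass(SequenceFileInputFormat.class);
        job.setMapOutputKeyClass(Text.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Set the reducer count before writing the partition file: the file
        // must hold exactly numReduceTasks - 1 boundary keys.
        job.setNumReduceTasks(4);
        job.setPartitionerClass(TotalOrderPartitioner.class);
        TotalOrderPartitioner.setPartitionFile(job.getConfiguration(),
                new Path("/tmp/partitions.lst"));

        // Sample roughly 10% of the records, at most 10,000, from up to 10 splits.
        InputSampler.writePartitionFile(job,
                new InputSampler.RandomSampler<Text, Text>(0.1, 10000, 10));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}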
1
vote
1 answer
Different keys go into 1 file even when using a Hadoop custom Partitioner
I am running into a minor issue.
I am trying to get a different file for each key from the Reducer.
Partitioner
public class customPartitioner extends Partitioner implements Configurable {
    private Configuration…

USB
- 6,019
- 15
- 62
- 93
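The snippet above is cut off, so the following is not the asker's class; it is only a generic sketch of the pieces that must all be in place for a custom partitioner to produce separate output files: a getPartition() that spreads keys, the partitioner registered on the job, and more than one reduce task. With a single reducer every key lands in one part file no matter what the partitioner returns. The key/value types and reducer count below are assumptions.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class CustomPartitionerSketch extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        // Spread keys over the available reducers; each reducer then writes
        // its own part-r-xxxxx file.
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}

// In the driver, both of these are required to get more than one output file:
//   job.setPartitionerClass(CustomPartitionerSketch.class);
//   job.setNumReduceTasks(4);   // numPartitions above is this value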
1
vote
1 answer
Spark-SQL DataFrame partitions
I need to load a Hive table using Spark SQL and then run some machine-learning algorithms on it. I do that by writing:
val dataSet = sqlContext.sql(" select * from table")
It works well, but if I wanted to increase the number of partitions of the dataSet…

Edge07
- 13
- 3
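The question uses the Scala API; purely as an illustration, the same idea in Spark's Java API (the table name and partition counts below are made up): repartition() shuffles an existing DataFrame into the requested number of partitions, and spark.sql.shuffle.partitions controls how many partitions SQL shuffles produce in the first place.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class RepartitionExample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("repartition-example")
                .config("spark.sql.shuffle.partitions", "200") // partitions created by SQL shuffles
                .enableHiveSupport()
                .getOrCreate();

        // Hypothetical table name.
        Dataset<Row> dataSet = spark.sql("select * from some_db.some_table");

        // Explicitly shuffle the data into 64 partitions before the
        // downstream algorithm runs.
        Dataset<Row> wider = dataSet.repartition(64);
        System.out.println(wider.rdd().getNumPartitions());

        spark.stop();
    }
}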
1
vote
0 answers
How to do a secondary sort on filenames with numbers in hadoop streaming?
I'm trying to sort file names such as
cat1.pdf, cat2.pdf, ... cat10.pdf ...
I'm utilizing a sort right now with the following parameters:
-D…

user110977
- 21
- 2
1
vote
0 answers
How to select top rows in hadoop?
I am reading a 138 MB file from Hadoop and trying to assign a sequence number to each record. Below is the approach I followed.
I read the entire file using Cascading and assigned the current slice number and current record counter to each record. This was…

Abhishek Korpe
- 11
- 1
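The Cascading code is not shown above; as a plain-MapReduce alternative only, a minimal sketch: for a file of this size, forcing a single reduce task lets one reducer hand out a global, gap-free sequence number. The key/value types and counter start value are assumptions.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Run with job.setNumReduceTasks(1) so a single reducer sees every record.
public class SequenceNumberReducer extends Reducer<Text, Text, LongWritable, Text> {
    private long sequence = 0;

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        for (Text record : values) {
            // Assign the next sequence number to each record as it is written.
            context.write(new LongWritable(++sequence), record);
        }
    }
}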
1
vote
3 answers
hadoop mapreduce unordered tuple as map key
Based on the wordcount example from Hadoop - The Definitive Guide, I've developed a MapReduce job to count the occurrence of unordered tuples of Strings. The input looks like this (just larger):
a b
c c
d d
b a
a …

user3365
- 31
- 2
- 7
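One common way to handle this (not necessarily the accepted answer above): canonicalize each tuple in the mapper so that "a b" and "b a" emit the same key; the default hash partitioning then sends both to the same reducer, where a wordcount-style sum works unchanged. The input format and delimiter handling below are assumptions.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class UnorderedPairMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text pair = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String[] tokens = line.toString().trim().split("\\s+");
        if (tokens.length != 2) {
            return; // skip malformed lines
        }
        // Sort the two tokens so "a b" and "b a" produce the identical key.
        String key = tokens[0].compareTo(tokens[1]) <= 0
                ? tokens[0] + "\t" + tokens[1]
                : tokens[1] + "\t" + tokens[0];
        pair.set(key);
        context.write(pair, ONE);
    }
}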
1
vote
1 answer
Using Hadoop Partitioner and Comparator Class Together
I have a file that has two columns, id and timestamp. I'm counting the number of sessions each value has, determined by inactivity of more than 30 minutes. However, I'm having trouble with the streaming commands. A few example rows are as…

cloud36
- 1,026
- 6
- 21
- 35
1
vote
2 answers
How to get the most uniform partition results?
I don't know if there is any algorithm to get the optimal partition for a key-based data partition (it needs to ensure that records with the same key end up in the same result data set).
For example: I have a data set that needs to be divided into two parts:
key …

Tim
- 659
- 1
- 7
- 16
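No particular algorithm is given in the question; as one simple heuristic, a sketch of a greedy assignment that keeps all records for a key together while roughly balancing total record counts across the two partitions. The class and method names are made up, and the input is assumed to be a per-key record count.

import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class GreedyKeyPartitioner {
    // Returns, for each key, the index (0 or 1) of the partition it goes to.
    public static Map<String, Integer> assign(Map<String, Long> recordsPerKey) {
        List<Map.Entry<String, Long>> keys = new ArrayList<>(recordsPerKey.entrySet());
        // Process keys from heaviest to lightest.
        keys.sort((a, b) -> Long.compare(b.getValue(), a.getValue()));

        long[] load = new long[2];
        Map<String, Integer> assignment = new LinkedHashMap<>();
        for (Map.Entry<String, Long> entry : keys) {
            // Always add the next key's records to the currently lighter partition.
            int target = load[0] <= load[1] ? 0 : 1;
            assignment.put(entry.getKey(), target);
            load[target] += entry.getValue();
        }
        return assignment;
    }
}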
1
vote
1 answer
How to override shuffle/sort in map/reduce, or else how can I get the sorted list in map/reduce from the last element to the partitioner
Assuming only one reducer.
My scenario is to get the list of the top N scorers in the university. The data is in key/value format. The MapReduce framework, by default, sorts the data in ascending order, but I want the list in descending order, or at least if…

Jack Daniel
- 2,527
- 3
- 31
- 52
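A minimal sketch of the usual approach with a single reducer, not taken from the answer above: register a comparator that reverses the natural key order via job.setSortComparatorClass(), so the reducer receives the highest scores first. The key type (IntWritable) is an assumption.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

// Reverses the natural (ascending) ordering of IntWritable keys so the
// reducer sees the largest scores first.
public class DescendingIntComparator extends WritableComparator {
    public DescendingIntComparator() {
        super(IntWritable.class, true);
    }

    @SuppressWarnings({"rawtypes", "unchecked"})
    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        return -a.compareTo(b); // negate the natural order
    }
}

// Driver:
//   job.setSortComparatorClass(DescendingIntComparator.class);
//   job.setNumReduceTasks(1);   // one reducer, as assumed in the question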
1
vote
0 answers
How to split a log file based on the timestamp/date
I have to analyze a huge log file for management reporting purposes.
The format of the log file is as below:
[2014-08-28 08:49:40 GMT][Level:DEBUG] Connection from UGUBUKBBBHJGJ.mt.site (123.131.21.20) , user : 12345678 for compositeId :…
user3548788
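One way this is often done (not taken from the question) is a map-only job with MultipleOutputs, writing each line to a file named after its date. The sketch below assumes the date sits in the leading [yyyy-MM-dd ...] block as in the sample line; the driver setup (output format, LazyOutputFormat to avoid empty part files) is omitted.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

public class LogSplitMapper extends Mapper<LongWritable, Text, NullWritable, Text> {
    private MultipleOutputs<NullWritable, Text> outputs;

    @Override
    protected void setup(Context context) {
        outputs = new MultipleOutputs<>(context);
    }

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String text = line.toString();
        int start = text.indexOf('[');
        if (start < 0 || text.length() < start + 11) {
            return; // skip lines without a leading [yyyy-MM-dd ...] timestamp
        }
        // "[2014-08-28 08:49:40 GMT]..." -> "2014-08-28"
        String date = text.substring(start + 1, start + 11);
        // Use the date as the base output path, e.g. 2014_08_28-m-00000.
        outputs.write(NullWritable.get(), line, date.replace('-', '_'));
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        outputs.close();
    }
}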
1
vote
1 answer
isSplitable in CombineFileInputFormat does not work
I have thousands of small files, and I want to process them with CombineFileInputFormat.
With CombineFileInputFormat, multiple small files go to one mapper, and each individual file should not be split.
A snippet of one of the small input files looks like…

alec.tu
- 1,647
- 2
- 20
- 41
1
vote
1 answer
Hadoop getting time difference between dates
I am struggling with something like this in Hadoop.
I get the following as the output of my mapper:
KeyValue1, 2014-02-01 20:42:00
KeyValue1, 2014-02-01 20:45:12
KeyValue1, 2014-05-01 10:35:02
KeyValue2, 2014-03-01 01:45:12
KeyValue2, 2014-03-01…

Bedi Egilmez
- 1,494
- 1
- 18
- 26
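Not the asker's code; just a sketch of one way the reducer side could compute the gaps, assuming the mapper emits the key and the timestamp string shown above as a Text value in yyyy-MM-dd HH:mm:ss format (Java 8 time API).

import java.io.IOException;
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class TimeDiffReducer extends Reducer<Text, Text, Text, LongWritable> {
    private static final DateTimeFormatter FMT =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        // Collect and sort the timestamps for this key; the shuffle groups
        // values by key but does not order them.
        List<LocalDateTime> times = new ArrayList<>();
        for (Text value : values) {
            times.add(LocalDateTime.parse(value.toString().trim(), FMT));
        }
        Collections.sort(times);

        // Emit the gap in seconds between each pair of consecutive timestamps.
        for (int i = 1; i < times.size(); i++) {
            long seconds = Duration.between(times.get(i - 1), times.get(i)).getSeconds();
            context.write(key, new LongWritable(seconds));
        }
    }
}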