Questions tagged [hadoop-partitioning]

Hadoop partitioning deals with questions about how Hadoop decides which key/value pairs are sent to which reducer (partition).

339 questions
1 vote, 1 answer

Creating view in HIVE

I want to create a view on a Hive table which is partitioned. My view definition is as below: create view schema.V1 as select t1.* from scehma.tab1 as t1 inner join (select record_key, max(last_update) as last_update from scehma.tab1 group by…
1 vote, 1 answer

TotalOrderPartitioner and mrjob

How does one specify the TotalOrderPartitioner when using mrjob? Is this the default, or must it be specified explicitly? I've seen inconsistent behavior on different data sets.
vy32
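For reference on the second part of the question: in plain Hadoop MapReduce, total ordering is never the default (HashPartitioner is), so TotalOrderPartitioner always has to be requested explicitly on the job; how mrjob exposes that is not shown here. A minimal Java sketch of the configuration, with an illustrative partition-file path:

    // Sketch in the plain Java MapReduce API, not mrjob: TotalOrderPartitioner
    // must be set on the job explicitly, together with the partition file that
    // holds the key split points (one split point fewer than there are reducers).
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.partition.TotalOrderPartitioner;

    public class TotalOrderSetup {
        public static void configure(Job job, Path partitionFile) {
            job.setPartitionerClass(TotalOrderPartitioner.class);
            TotalOrderPartitioner.setPartitionFile(job.getConfiguration(), partitionFile);
        }
    }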
1 vote, 1 answer

Set date function as variable and use in beeline and hql file (hive)

Could anyone please explain to me how to solve this issue? I want to use from_unixtime(unix_timestamp() - 86400, 'yyyyMMdd') as the value for a variable and use it in a query's where clause that is stored in an hql file. I have tried: beeline…
smastika
1 vote, 1 answer

Facing an error when using TotalOrderPartitioner MapReduce

I have written the program below. I ran it without using the TotalOrderPartitioner and it ran fine, so I don't think there are any issues with the Mapper or Reducer classes as such. But when I include the code for the TotalOrderPartitioner, i.e. write…
Don Sam
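A common source of errors with TotalOrderPartitioner (impossible to confirm without the truncated code) is the partition file: it must be generated before the job runs, and the sampled keys must match the map output key type. A hedged sketch using InputSampler, with arbitrary sampling parameters:

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.partition.InputSampler;

    public class PartitionFileSetup {
        // Assumes the job already uses TotalOrderPartitioner and the partition file
        // location has been registered via TotalOrderPartitioner.setPartitionFile().
        public static void writePartitions(Job job) throws Exception {
            // Randomly sample the input to choose split points; the sampled keys come
            // from the job's InputFormat, so their type must match the map output key
            // type (Text here) or the partitioner fails at runtime.
            InputSampler.Sampler<Text, Text> sampler =
                    new InputSampler.RandomSampler<>(0.1, 10000, 10);
            InputSampler.writePartitionFile(job, sampler);
        }
    }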
1 vote, 1 answer

Different keys go into 1 file even when using a Hadoop custom Partitioner

I am running into a minor issue. I am trying to get a different file for each key from the Reducer. Partitioner: public class customPartitioner extends Partitioner implements Configurable { private Configuration…
USB
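The usual culprit when every key lands in one file is that the job still runs with a single reducer: each output file corresponds to one reducer, and the partitioner's result is only meaningful relative to the reducer count. A minimal sketch of the contract (this is not the asker's customPartitioner, and the key/value types are assumptions):

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    // Each distinct value returned by getPartition corresponds to one reducer and
    // therefore one part-r-NNNNN file; with the default of one reducer, every key
    // still ends up in part-r-00000 no matter what the partitioner returns.
    public class KeyHashPartitioner extends Partitioner<Text, IntWritable> {
        @Override
        public int getPartition(Text key, IntWritable value, int numReduceTasks) {
            return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
        }
    }

The driver also needs job.setPartitionerClass(KeyHashPartitioner.class) and job.setNumReduceTasks(n) with n greater than one before keys can spread across files.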
1 vote, 1 answer

Spark-SQL DataFrame partitions

I need to load a Hive table using spark-sql and then run some machine-learning algorithm on it. I do that by writing: val dataSet = sqlContext.sql(" select * from table") It works well, but if I wanted to increase the number of partitions of the dataSet…
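The asker's snippet uses the Spark 1.x sqlContext in Scala; as a hedged illustration of the same idea in the Java API (assuming Spark 2.x or later with Hive support available), repartition() reshuffles the loaded table into a chosen number of partitions:

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class RepartitionSketch {
        public static void main(String[] args) {
            // Hive support lets spark.sql() resolve tables in the Hive metastore.
            SparkSession spark = SparkSession.builder()
                    .appName("repartition-sketch")
                    .enableHiveSupport()
                    .getOrCreate();

            // "my_table" is a placeholder for the asker's Hive table.
            Dataset<Row> dataSet = spark.sql("select * from my_table");
            // 200 is arbitrary; a full shuffle spreads the rows over that many partitions.
            Dataset<Row> repartitioned = dataSet.repartition(200);
            System.out.println(repartitioned.rdd().getNumPartitions());
        }
    }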
1 vote, 0 answers

How to do a secondary sort on filenames with numbers in hadoop streaming?

I'm trying to sort file names such as cat1.pdf, cat2.pdf, ... cat10.pdf ... I'm utilizing a sort right now with the following parameters: -D…
1 vote, 0 answers

How to select top rows in hadoop?

I am reading a 138MB file from Hadoop and trying to assign sequence numbers to each record. Below is the approach I followed: I read the entire file using Cascading and assigned the current slice number and a current record counter to each record. This was…
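One hedged pattern for this (a sketch, not the asker's Cascading flow): derive a unique sequence tag in each map task from the task id plus a local counter, which avoids funnelling the whole file through a single reducer. OFFSET is an assumed upper bound on records per task:

    import java.io.IOException;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Tags every record with taskId * OFFSET + localCounter, which is unique across
    // the job as long as no task emits more than OFFSET records.
    public class SequenceTagMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
        private static final long OFFSET = 1_000_000_000L; // assumed max records per task
        private long counter = 0;
        private long taskId;

        @Override
        protected void setup(Context context) {
            taskId = context.getTaskAttemptID().getTaskID().getId();
        }

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            long seq = taskId * OFFSET + counter++;
            context.write(new Text(seq + "\t" + value.toString()), NullWritable.get());
        }
    }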
1 vote, 3 answers

hadoop mapreduce unordered tuple as map key

Based on the wordcount example from Hadoop - The Definitive Guide, I've developed a mapreduce job to count the occurrence of unordered tuples of Strings. The input looks like this (just larger): a b c c d d b a a …
user3365
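The standard trick for unordered tuples is to canonicalize the pair before it becomes the map output key, so that (a, b) and (b, a) hash to the same partition and are counted together. A sketch of such a mapper, with tokenization simplified to whitespace-separated pairs as in the sample input:

    import java.io.IOException;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class UnorderedPairMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] tokens = value.toString().trim().split("\\s+");
            if (tokens.length < 2) {
                return; // skip malformed lines
            }
            // Order the two strings lexicographically so that "a b" and "b a"
            // produce the same key and therefore reach the same reducer.
            String a = tokens[0];
            String b = tokens[1];
            String pair = a.compareTo(b) <= 0 ? a + "\t" + b : b + "\t" + a;
            context.write(new Text(pair), ONE);
        }
    }

A plain summing reducer, as in the wordcount example, then yields the per-pair counts.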
1 vote, 1 answer

Using Hadoop Partitioner and Comparator Class Together

I have a file that has two columns, id and timestamp. I'm counting the number of sessions each value has, determined by inactivity of more than 30 minutes. However, I'm having trouble with the streaming commands. An example of a few rows is as…
cloud36
1 vote, 2 answers

How to get the most uniform partition results?

I don't know if there is any algorithm to get the optimal partition for a key-based data partition (I need to ensure that records with the same key end up in the same result data set). For example: I have a data set that needs to be divided into two parts: key …
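Splitting whole key groups into two sets of equal total size is essentially the NP-hard partition problem, but a greedy heuristic (place each key group, largest first, on whichever side is currently lighter) usually comes close. A plain-Java sketch over per-key record counts:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;

    public class GreedyKeySplit {
        /** Returns the keys assigned to side A; all remaining keys form side B. */
        public static List<String> splitIntoTwo(Map<String, Long> recordsPerKey) {
            List<Map.Entry<String, Long>> groups = new ArrayList<>(recordsPerKey.entrySet());
            // Biggest key groups first, so the small ones can even out the totals later.
            groups.sort((x, y) -> Long.compare(y.getValue(), x.getValue()));

            List<String> sideA = new ArrayList<>();
            long sizeA = 0;
            long sizeB = 0;
            for (Map.Entry<String, Long> group : groups) {
                if (sizeA <= sizeB) {
                    sideA.add(group.getKey());
                    sizeA += group.getValue();
                } else {
                    sizeB += group.getValue();
                }
            }
            return sideA;
        }
    }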
1 vote, 1 answer

How to override shuffle/sort in map/reduce, or else, how can I get the sorted list in map/reduce from the last element to the partitioner

Assuming only one reducer, my scenario is to get the list of the top N scorers in the university. The data is in format. The Map/Reduce framework, by default, sorts the data in ascending order. But I want the list in descending order, or at least if…
Jack Daniel
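One answer-shaped sketch: the shuffle's sort order can be flipped by registering a sort comparator that negates the natural comparison, so the single reducer receives keys from highest to lowest. IntWritable score keys are an assumption about the asker's job:

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.WritableComparable;
    import org.apache.hadoop.io.WritableComparator;

    // Registered with job.setSortComparatorClass(DescendingIntComparator.class);
    // the framework then presents keys to the reducer in descending order, so the
    // first N keys it sees are the top N scores.
    public class DescendingIntComparator extends WritableComparator {
        public DescendingIntComparator() {
            super(IntWritable.class, true); // true = instantiate keys for comparing
        }

        @SuppressWarnings("rawtypes")
        @Override
        public int compare(WritableComparable a, WritableComparable b) {
            return -super.compare(a, b); // flip ascending order into descending
        }
    }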
1 vote, 0 answers

How to split a log file based on the timestamp/date

I have to analyze a huge log file for management reporting purposes. The format of the log file is as below: [2014-08-28 08:49:40 GMT][Level:DEBUG] Connection from UGUBUKBBBHJGJ.mt.site (123.131.21.20) , user : 12345678 for compositeId :…
user3548788
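A hedged MapReduce approach: have the mapper pull the date out of each line and emit it as the key, so the shuffle groups a day's lines together; a reducer using MultipleOutputs can then write one file per date. The regex below is an assumption based on the sample line:

    import java.io.IOException;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Emits (date, full log line); the date key drives grouping and, with a custom
    // partitioner, also which reducer (and hence which output file) a day goes to.
    public class LogDateMapper extends Mapper<LongWritable, Text, Text, Text> {
        // Matches the leading "[2014-08-28 08:49:40 GMT]" timestamp from the sample.
        private static final Pattern DATE = Pattern.compile("^\\[(\\d{4}-\\d{2}-\\d{2}) ");

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            Matcher m = DATE.matcher(value.toString());
            if (m.find()) {
                context.write(new Text(m.group(1)), value);
            }
            // Continuation lines without a timestamp are dropped here; a real job
            // might instead attach them to the preceding entry.
        }
    }

On the reduce side, org.apache.hadoop.mapreduce.lib.output.MultipleOutputs lets each group be written under a base path named after the date, e.g. mos.write(key, line, key.toString()).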
1 vote, 1 answer

isSplitable in combineFileInputFormat does not work

I have thousands of small files, and I want to process them with CombineFileInputFormat. With CombineFileInputFormat, multiple small files go to one mapper and each file will not be split. The snippet of one of the small input files looks like…
alec.tu
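For reference, the override is usually written as a thin subclass of CombineTextInputFormat, as sketched below; this shows the usual shape of the override for comparison with the asker's code, not an explanation of why it has no effect in their setup.

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.JobContext;
    import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;

    // Asks the framework never to split an individual file; the files are still
    // packed together into combined splits, one mapper per combined split.
    public class WholeFileCombineTextInputFormat extends CombineTextInputFormat {
        @Override
        protected boolean isSplitable(JobContext context, Path file) {
            return false;
        }
    }

The job would then use job.setInputFormatClass(WholeFileCombineTextInputFormat.class).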
1 vote, 1 answer

Hadoop getting time difference between dates

I am struggling with something like this in Hadoop. I get the following as the result of my mapper: KeyValue1, 2014-02-01 20:42:00 KeyValue1, 2014-02-01 20:45:12 KeyValue1, 2014-05-01 10:35:02 KeyValue2, 2014-03-01 01:45:12 KeyValue2, 2014-03-01…
Bedi Egilmez
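A hedged sketch of the reducer side for this shape of data: buffer the timestamps of each key, sort them, and emit the gap between consecutive events. The timestamp pattern is taken from the sample mapper output:

    import java.io.IOException;
    import java.text.ParseException;
    import java.text.SimpleDateFormat;
    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    // Emits, for every key, the difference in seconds between consecutive events.
    // All values of a key are buffered in memory; a secondary sort on the timestamp
    // would avoid that for very large groups.
    public class TimeDiffReducer extends Reducer<Text, Text, Text, LongWritable> {
        private static final SimpleDateFormat FORMAT =
                new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");

        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            List<Long> times = new ArrayList<>();
            for (Text value : values) {
                try {
                    times.add(FORMAT.parse(value.toString().trim()).getTime());
                } catch (ParseException e) {
                    // skip malformed timestamps instead of failing the job
                }
            }
            Collections.sort(times);
            for (int i = 1; i < times.size(); i++) {
                long diffSeconds = (times.get(i) - times.get(i - 1)) / 1000L;
                context.write(key, new LongWritable(diffSeconds));
            }
        }
    }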