Questions tagged [hadoop-partitioning]

Questions about how Hadoop decides which key/value pairs are sent to which reducer (partition).
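As a rough illustration of what "deciding the partition" means, here is a minimal sketch of the arithmetic used by Hadoop's default `HashPartitioner`: clear the sign bit of the key's hash code, then take the remainder modulo the number of reduce tasks. The class name `HashPartitionSketch` and the sample keys are made up for this example, and plain Java `String` keys stand in for Hadoop `Writable` types (whose `hashCode()` may differ), so the exact partition numbers are only illustrative.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class HashPartitionSketch {
    // Same arithmetic as the default hash partitioner: non-negative hash
    // modulo the number of reduce tasks. Equal keys always land together.
    public static int getPartition(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        String[] keys = {"key1", "key2", "key3"}; // hypothetical sample keys
        int numReduceTasks = 2;

        // Group the keys by the partition they would be sent to.
        Map<Integer, List<String>> byPartition = new TreeMap<>();
        for (String key : keys) {
            int partition = getPartition(key, numReduceTasks);
            byPartition.computeIfAbsent(partition, p -> new ArrayList<>()).add(key);
        }
        System.out.println(byPartition);
    }
}
```

The mask with `Integer.MAX_VALUE` matters because `hashCode()` can be negative, and a negative remainder would not be a valid partition index.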

339 questions
2
votes
2 answers

map reduce with two input files, with one file processed based on another

I need to write a MapReduce job that takes two input files. The first input file looks like this: key1, 25 key1, 35 key1, 60 key2, 30 key3, 45 key3, 65 The second input file is as follows: key1, -10 key2, -20 key3, -15 and I need to get an…
user2715182
  • 653
  • 2
  • 10
  • 23
2
votes
3 answers

TotalOrderPartitioner giving wrong key class Error

I am trying my hand at Hadoop's TotalOrderPartitioner. While doing so I get the following error: "wrong key class". Driver code: import org.apache.hadoop.fs.Path; import org.apache.hadoop.io.IntWritable; import…
2
votes
2 answers

How does SparkContext.textFile work under the covers?

I am trying to understand the textFile method deeply, but I think my lack of Hadoop knowledge is holding me back here. Let me lay out my understanding, and maybe you can correct anything that is wrong. When sc.textFile(path) is called, then…
Justin Pihony
  • 66,056
  • 18
  • 147
  • 180
2
votes
4 answers

Map reduce and hash partitioning

While learning about MapReduce, I encountered this question: A given MapReduce program has the Map phase generate 100 key-value pairs with 10 unique keys. How many Reduce tasks can this program have when at least one Reduce task will certainly be…
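One fact that bears on this kind of question: every key maps to exactly one partition, so 10 unique keys can populate at most 10 reduce tasks, however many pairs the map phase emits. The sketch below counts non-empty partitions for a few reducer counts. It assumes default-style hash partitioning; the class name `NonEmptyPartitions` and the keys `key0` to `key9` are invented for the example.

```java
import java.util.HashSet;
import java.util.Set;

public class NonEmptyPartitions {
    // Default-style hash partitioning: non-negative hash modulo reducer count.
    public static int getPartition(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    // Counts how many reduce tasks receive at least one of the given keys.
    public static int countNonEmpty(Object[] uniqueKeys, int numReduceTasks) {
        Set<Integer> used = new HashSet<>();
        for (Object key : uniqueKeys) {
            used.add(getPartition(key, numReduceTasks));
        }
        return used.size();
    }

    public static void main(String[] args) {
        Object[] keys = new Object[10];
        for (int i = 0; i < keys.length; i++) {
            keys[i] = "key" + i; // 10 hypothetical unique keys
        }
        for (int reducers : new int[] {5, 10, 20}) {
            System.out.println(reducers + " reducers -> "
                    + countNonEmpty(keys, reducers) + " non-empty");
        }
    }
}
```

The count can never exceed the number of unique keys or the number of reducers, and hash collisions can push it lower than both.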
2
votes
0 answers

Custom hash function for Hive buckets

I need to implement total ordering of output results in Hive with several reducers (e.g. 4). As I found via the link, Hive uses the expression: hash_function(bucketing column) mod num_buckets. As a result, for the input set of numbers (41,42,43,51,52,53)…
Speise
  • 789
  • 1
  • 12
  • 28
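To make the bucketing expression quoted in the Hive question above concrete, here is a small sketch that applies `hash_function(bucketing column) mod num_buckets` to the numbers from the excerpt with 4 buckets. It assumes that the hash of an integer column is the integer value itself; that is an assumption for illustration and should be checked against the Hive version in use. The class name `HiveBucketSketch` is invented.

```java
public class HiveBucketSketch {
    // ASSUMPTION: for an int column the hash is the value itself, so the
    // bucket is the non-negative value modulo the number of buckets.
    public static int bucketFor(int value, int numBuckets) {
        return (value & Integer.MAX_VALUE) % numBuckets;
    }

    public static void main(String[] args) {
        int[] values = {41, 42, 43, 51, 52, 53}; // numbers from the question
        int numBuckets = 4;                      // "several reducers (e.g. 4)"
        for (int value : values) {
            System.out.println(value + " -> bucket " + bucketFor(value, numBuckets));
        }
    }
}
```

Under that assumption 52 lands in bucket 0 while 41 and 53 share bucket 1, so reading the buckets in order does not give a totally ordered result. A range-based assignment would be needed for that.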
2
votes
1 answer

Issue while installing hadoop-2.2.0 in linux 64 bit machine

Using this link, I tried installing Hadoop version 2.2.0 (single-node cluster) on Ubuntu 12.04 (64-bit machine): http://bigdatahandler.com/hadoop-hdfs/installing-single-node-hadoop-2-2-0-on-ubuntu/ While formatting the HDFS file system via the namenode…
2
votes
2 answers

Hadoop in action Patent example explanation

I was going through the examples for patent data in Hadoop in Action. Could you please explain in detail the data sets being used? The patent citation data set: this data set contains two columns, citing and cited patents. The citing column refers…
2
votes
1 answer

understanding custom partitioner in hadoop

I am learning the partitioner concept now. Can anyone explain the piece of code below? It is hard for me to understand. public class TaggedJoiningPartitioner extends Partitioner { @Override public int getPartition(TaggedKey…
user1585111
  • 1,019
  • 6
  • 19
  • 35
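The code in the custom partitioner question above is cut off, so its body is unknown. What follows is only a sketch of a pattern often used for reduce-side joins, not the asker's code: a composite key carries a join key plus a tag identifying the source dataset, and the partitioner hashes the join key alone so that records from both datasets reach the same reducer. The `TaggedKey` fields `joinKey` and `tag` are hypothetical, and the sketch avoids the Hadoop `Partitioner` base class to stay self-contained.

```java
public class TaggedPartitionSketch {
    // Hypothetical composite key: the join key plus a tag naming the source.
    public static final class TaggedKey {
        public final String joinKey;
        public final int tag;

        public TaggedKey(String joinKey, int tag) {
            this.joinKey = joinKey;
            this.tag = tag;
        }
    }

    // Partition on the join key only. The tag is deliberately ignored so
    // that keys differing only in tag go to the same partition.
    public static int getPartition(TaggedKey key, int numPartitions) {
        return (key.joinKey.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        TaggedKey fromFirst = new TaggedKey("key1", 0);  // record from dataset 0
        TaggedKey fromSecond = new TaggedKey("key1", 1); // record from dataset 1
        int numPartitions = 4;
        System.out.println(getPartition(fromFirst, numPartitions)
                == getPartition(fromSecond, numPartitions));
    }
}
```

In such a design the tag typically matters for sort order within the reducer, not for routing, which is why it is left out of the hash.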
2
votes
2 answers

Failed to get system directory - hadoop

Using a Hadoop multinode setup (1 master, 1 slave). After running start-mapred.sh on the master, I found the error below in the TT logs on the slave: org.apache.hadoop.mapred.TaskTracker: Failed to get system directory. Can someone help me understand what can be…
Surya
  • 3,408
  • 5
  • 27
  • 35
2
votes
0 answers

creating new table with dynamic partitions from existing non-partitioned table in Hive

I have an existing table in Hive with various fields, e.g. (a string, b string, tstamp string, c string), including one tstamp field. I need to create a new partitioned table (table_partitioned) from the existing table (original_table) but…
hitrix
  • 133
  • 3
  • 11
2
votes
1 answer

hadoop file splitting using KeyFieldBasedPartitioner

I have a big file that is formatted as follows: sample name \t index \t score. I'm trying to split this file by sample name using Hadoop Streaming. I know ahead of time how many samples there are, so I can specify how many reducers I…
mortonjt
  • 650
  • 1
  • 5
  • 23
2
votes
1 answer

Can already partitioned input data improve the hadoop processing?

I know that during the intermediate steps between mapper and reducer, hadoop will sort and partition the data on its way to the reducer. Since I am dealing with already partitioned data in my input to the mapper, is there a way to take advantage of…
2
votes
1 answer

Hadoop reducers receiving wrong data

I have a load of JobControls running at the same time, all with the same set of ControlledJobs. Each JobControl is dealing with a different set of input / output files, by date range, but they are all of the type. The problem that I am observing is…
Ben Smith
  • 1,554
  • 1
  • 15
  • 26
1
vote
1 answer

Gitolite ACL partition activation with fstab ?

I don't understand ACLs with gitolite, and I can't find any information about them. Initially I wanted to install gitosis, which requires installing the ACL package on Debian (apt-get install) and activating acl in the fstab file. With gitolite, a…
reyman64
  • 523
  • 4
  • 34
  • 73
1
vote
0 answers

Spark sc.binaryFiles() partitioning small files and YARN

Using the sc.binaryFiles() function in Spark 2.3.0 on a Hortonworks 2.6.5 server, I noticed behavior that I cannot explain regarding the default partitioning in a YARN-managed cluster. Please see the sample code below: import…
uhlik
  • 105
  • 9