Hadoop partitioning covers questions about how Hadoop decides which key/value pairs are sent to which reducer (partition).
Questions tagged [hadoop-partitioning]
339 questions
0
votes
1 answer
How to use value of IntWritable as condition to partition data?
I want to use the value of an IntWritable as the condition to partition data. But it seems the Partitioner cannot get the value.
public static class GroupMapper extends Mapper
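In fact, a custom Hadoop Partitioner is passed the map output value: `getPartition(key, value, numPartitions)` receives all three arguments, so the value can be used directly. A minimal sketch of such routing logic in plain Python (the threshold of 100 and the hash fallback are illustrative assumptions, not part of the question):

```python
# Sketch of what a value-based Partitioner.getPartition could do:
# route records with small values to partition 0, hash the rest
# across the remaining partitions.
def get_partition(key, value, num_partitions):
    if num_partitions == 1:
        return 0
    if value < 100:          # illustrative threshold on the IntWritable value
        return 0
    # Spread everything else over partitions 1..num_partitions-1 by key.
    return 1 + hash(key) % (num_partitions - 1)
```

The same branching would go inside the Java `getPartition` body of a `Partitioner<Text, IntWritable>` subclass.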

yyyyyyrc
- 21
- 3
0
votes
1 answer
Is it possible to virtually divide a Hadoop cluster into smaller clusters
We are working to build a big cluster of 100 nodes with 300 TB of storage. We then have to serve it to different users (clients) with restricted resource limits, i.e., we do not want to expose the complete cluster to each user. Is it possible? If it is not…
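One common approach keeps a single physical cluster and carves out logical shares with the YARN Capacity Scheduler, so each tenant only gets its queue's resources. A minimal sketch of `capacity-scheduler.xml`; the queue names `dev`/`prod` and the 30/70 split are illustrative assumptions:

```xml
<!-- Sketch: two tenant queues under root with capped capacities. -->
<configuration>
  <property>
    <name>yarn.scheduler.capacity.root.queues</name>
    <value>dev,prod</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.dev.capacity</name>
    <value>30</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.prod.capacity</name>
    <value>70</value>
  </property>
  <property>
    <!-- Cap dev so it cannot take over the whole cluster. -->
    <name>yarn.scheduler.capacity.root.dev.maximum-capacity</name>
    <value>40</value>
  </property>
</configuration>
```

HDFS-side isolation (quotas, permissions) would be configured separately; the scheduler only limits compute.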

Hafiz Muhammad Shafiq
- 8,168
- 12
- 63
- 121
0
votes
0 answers
Dynamic partitioning inserting null value for the second column of partition
I'm trying to create dynamic partitioning based on two columns, and load data from a file present in the HDFS location.
But while loading data into the dynamically partitioned table from the staging table, the second column of the partitioning is…
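When a dynamic-partition INSERT leaves the second partition column NULL, the usual cause is column ordering: Hive assigns partition values positionally from the last columns of the SELECT, in the order the partition columns are declared. A sketch with hypothetical table and column names:

```sql
-- Hypothetical target table partitioned by (country, state).
-- The SELECT list must END with the partition columns, in declaration order.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

INSERT OVERWRITE TABLE target PARTITION (country, state)
SELECT id, name, country, state   -- country then state, matching PARTITION(...)
FROM staging;
```

If the last two SELECT columns are swapped or missing, the second partition column silently gets the wrong value or NULL.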

learner
- 155
- 3
- 18
0
votes
0 answers
Broadcast join to join two dataframes in SPARK efficiently
I have a DataFrame df1 with about 2 million rows. I have already repartitioned it on the basis of a key called ID, since the data is ID-based -
df=df.repartition(num_of_partitions,'ID')
Now, I wish to join this df to a relatively small…
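For a join between a large and a small DataFrame, Spark can replicate the small side to every executor instead of shuffling the large one; in the DataFrame API this is `broadcast(df2)` from `pyspark.sql.functions`. The equivalent Spark SQL hint, with hypothetical view names, looks like:

```sql
-- "big" and "small" are hypothetical registered temp views;
-- the hint asks Spark to ship "small" to all executors.
SELECT /*+ BROADCAST(small) */ big.ID, big.payload, small.attr
FROM big
JOIN small
  ON big.ID = small.ID;
```

This avoids re-shuffling the already-partitioned large side, which is the point of a broadcast (map-side) join.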

cph_sto
- 7,189
- 12
- 42
- 78
0
votes
1 answer
Hive query not reading partition field
I created a partitioned Hive table using the following query
CREATE EXTERNAL TABLE `customer`(
`cid` string COMMENT '',
`member` string COMMENT '',
`account` string COMMENT '')
PARTITIONED BY…
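With an external partitioned table, partitions whose directories were created directly in HDFS are invisible until registered in the metastore, which makes queries on the partition column return nothing. A sketch of the two usual fixes; the partition column name `ds` and the path are illustrative, since the question's PARTITIONED BY clause is truncated:

```sql
-- Register all partition directories found under the table location.
MSCK REPAIR TABLE customer;

-- Or register one partition explicitly (column name and path hypothetical).
ALTER TABLE customer ADD IF NOT EXISTS PARTITION (ds='2018-05-05')
  LOCATION '/data/customer/ds=2018-05-05';

SHOW PARTITIONS customer;
```

If SHOW PARTITIONS comes back empty before the repair, unregistered partitions are the likely culprit.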

user2316771
- 111
- 1
- 1
- 11
0
votes
2 answers
How to run a Spark program in Java in parallel
So I have a Java application with Spark Maven dependencies, and on running it, it launches a Spark server on the host where it's run. The server instance has 36 cores. I am specifying a SparkSession instance where I am mentioning the number of cores…

Atihska
- 4,803
- 10
- 56
- 98
0
votes
0 answers
Writing MapReduce and YARN application together
I want to run a MapReduce application using Hadoop 2.6.5 (on my own native cluster), and I want to update some things in YARN; thus, I have seen that I can write my own YARN application…

Or Raz
- 39
- 2
- 11
0
votes
0 answers
Why does a partition need to be sorted prior to being reduced?
From here:
As per the Hadoop definitive guide: "Within each partition, the background
thread performs an in-memory sort by key, and if there is a combiner
function, it is run on the output of the sort"
I thought a partition corresponds to one key,…
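A partition generally holds many keys (all keys that hash to the same reducer), and the reducer must see all values for a key contiguously in a single streaming pass. Sorting the partition by key is what makes that possible, exactly like `itertools.groupby`, which only groups adjacent equal keys. A small self-contained illustration:

```python
from itertools import groupby

# One partition's records: several keys, interleaved.
records = [("b", 2), ("a", 1), ("b", 3), ("a", 4)]

# Without sorting, groupby (like a streaming reducer) sees each key
# in multiple fragments.
unsorted_groups = [k for k, _ in groupby(records, key=lambda kv: kv[0])]

# After sorting, each key's values are contiguous and one pass suffices.
sorted_groups = {k: [v for _, v in g]
                 for k, g in groupby(sorted(records), key=lambda kv: kv[0])}
```

This is why the map side sorts each partition before the reduce phase consumes it.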

Mario Ishac
- 5,060
- 3
- 21
- 52
0
votes
1 answer
Map-Reduce job failing to deliver expected partitioned files
In a Map-Reduce job, I am using five different files, where my dataset contains values under two categories, P and I. When I-specific values are found, I pass them into the I-part-r-00000 file, and likewise for P. I am using…

Mohit Sudhera
- 341
- 1
- 4
- 16
0
votes
1 answer
How does the AM select the node for each reduce task?
I am running two word-count example jobs on the same cluster (I run Hadoop 2.6.5 locally with a multi-node cluster), where my code runs the two jobs one after the other.
Both of the jobs share the same mapper, reducer, etc., but each one of them…

Or Raz
- 39
- 2
- 11
0
votes
2 answers
Combine Multiple Hive Tables as single table in Hadoop
Hi, I have multiple Hive tables, around 15-20. All the tables share a common schema. I need to combine all the tables into a single table. The single table should be queryable from a reporting tool, so performance also needs care.
I tried…
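If the tables really share one schema, a view over UNION ALL gives the reporting tool a single queryable name without copying data; materializing it into one table is the alternative when query speed matters more than freshness. A sketch with hypothetical table names:

```sql
-- Hypothetical source tables t1..t3 with identical schemas.
CREATE VIEW all_data AS
SELECT * FROM t1
UNION ALL
SELECT * FROM t2
UNION ALL
SELECT * FROM t3;

-- Or materialize once for faster repeated reporting queries:
CREATE TABLE all_data_mat STORED AS ORC AS
SELECT * FROM all_data;
```

The view stays current as the source tables change; the materialized table must be refreshed but scans faster.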

Teju Priya
- 595
- 3
- 8
- 18
0
votes
1 answer
Convert value while inserting into HIVE table
I have created a bucketed table called emp_bucket with 4 buckets, clustered on the salary column. The structure of the table is as below:
hive> describe Consultant_Table_Bucket;
OK
id int
age …
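Values can be converted inline in the SELECT that feeds the bucketed table; there is no need to pre-transform the staging data. A sketch, assuming a hypothetical staging table and that the conversion wanted is a CAST (the question's full column list is not shown):

```sql
-- Hypothetical staging table; transform columns on the way in.
SET hive.enforce.bucketing=true;

INSERT INTO TABLE emp_bucket
SELECT id,
       CAST(age AS INT),
       CAST(salary AS DECIMAL(10,2))   -- bucketing column, converted inline
FROM emp_staging;
```

Any Hive expression (CASE, concat, arithmetic) can replace the CASTs in the SELECT list.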

Sunil
- 553
- 1
- 12
- 30
0
votes
1 answer
Hadoop Spark - Store in one Large File instead of Many Small ones and Index
On a daily basis I calculate some stats and store them in a file (about 40 rows of data). df below is calculated daily. The issue is that when I store it, each day becomes a new file, and I do not want this, as Hadoop doesn't deal well…
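A common pattern is to append each day's ~40 rows into one Hive table instead of writing standalone files, then periodically compact the accumulated small files. A sketch with hypothetical table names; the CONCATENATE merge applies to ORC tables:

```sql
-- Append the daily stats (df written first to a staging table / temp view).
INSERT INTO TABLE daily_stats
SELECT * FROM daily_stats_staging;

-- Periodically merge the accumulated small files in place (ORC only).
ALTER TABLE daily_stats CONCATENATE;
```

Querying one table also sidesteps indexing individual files by date: a date column in the table serves as the index.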

SecretAgent
- 97
- 10
0
votes
1 answer
Record count for Hive partitioned table
I have a table called "transaction" in Hive which is partitioned on a column called "DS" which will have data like "2018-05-05", "2018-05-09", "2018-05-10" and so on
This table is populated overnight for the day that has just completed. At any point,…
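Per-partition counts can be read either by grouping on the partition column, or, without scanning the data, from metastore statistics after an ANALYZE:

```sql
-- Count rows in each daily partition by scanning.
SELECT ds, COUNT(*) AS row_cnt
FROM transaction
GROUP BY ds;

-- Or compute stats once and read counts from the metastore afterwards.
ANALYZE TABLE transaction PARTITION (ds) COMPUTE STATISTICS;
```

After the ANALYZE, `DESCRIBE FORMATTED transaction PARTITION (ds='2018-05-05')` shows the stored numRows for that partition.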

Prashanth G B
- 1
- 1
- 1
0
votes
1 answer
Hadoop MapReduce - How to create dynamic partition
How do I create a dynamic partition using Java MapReduce, like SQL's GROUP BY on a country column? For example, I have a country-based dataset and need to separate the records based on country (partition). We can't limit the countries, since every day we will get…
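With an unbounded set of countries, the usual MapReduce answer is `MultipleOutputs`: in the reducer, call `multipleOutputs.write(key, value, country + "/part")` so each country gets its own output path without declaring the countries up front. The routing idea itself is just grouping by a dynamic key, sketched here in plain Python:

```python
from collections import defaultdict

# Sketch of the MultipleOutputs idea: route each record to an output
# named after its country, discovering the countries as they appear.
def partition_by_country(records):
    outputs = defaultdict(list)   # one "output file" per country
    for country, row in records:
        outputs[country].append(row)
    return dict(outputs)
```

In the real job, each dict bucket corresponds to a separate named output file under the job's output directory.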

Learn Hadoop
- 2,760
- 8
- 28
- 60