Hadoop partitioning deals with how Hadoop decides which key/value pairs are sent to which reducer (partition).
Questions tagged [hadoop-partitioning]
339 questions
2 votes · 2 answers
How to prevent bucket creation if it does not exist in Spark on EMR
I'm running a Spark step on an EMR cluster. It gathers all the small files and accumulates them into one big file.
I receive a list of buckets to process, but before processing a bucket I want to check whether it exists and whether it contains any files. For that…

Yehor · 67 · 6
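A minimal sketch of one way to do that check from Spark, via the Hadoop FileSystem API (the helper name and bucket path are hypothetical; this assumes the cluster's S3 filesystem connector is already configured):

    import org.apache.hadoop.fs.Path
    import org.apache.spark.sql.SparkSession

    // Hypothetical helper: true only if the path exists and holds at least one non-empty file.
    def bucketHasFiles(spark: SparkSession, bucketPath: String): Boolean = {
      val path = new Path(bucketPath)
      val fs = path.getFileSystem(spark.sparkContext.hadoopConfiguration)
      fs.exists(path) && fs.listStatus(path).exists(_.getLen > 0)
    }

Short-circuiting on exists() avoids the FileNotFoundException that listStatus would otherwise throw for a missing path.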
2 votes · 1 answer
Passing multiple dates as parameters to a Hive query
I am trying to pass a list of dates as a parameter to my Hive query.
#!/bin/bash
echo "Executing the hive query - Get distinct dates"
var=`hive -S -e "select distinct substr(Transaction_date,0,10) from test_dev_db.TransactionUpdateTable;"`
echo…

vikrant rana · 4,509 · 6 · 32 · 72
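One common approach is to build an IN list from the dates and splice it into the query text. A hedged sketch of the same idea in Scala, assuming a Hive-enabled SparkSession named spark (the date values are made up; the table name is from the question):

    val dates = Seq("2019-01-01", "2019-01-02", "2019-01-03")
    val inList = dates.map(d => s"'$d'").mkString(", ")
    val df = spark.sql(
      s"SELECT DISTINCT substr(Transaction_date, 0, 10) AS txn_date " +
      s"FROM test_dev_db.TransactionUpdateTable " +
      s"WHERE substr(Transaction_date, 0, 10) IN ($inList)")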
2 votes · 0 answers
PySpark: Partitioning and hashing multiple dataframes, then joining
Background: I am working with clinical data spread across a lot of different .csv/.txt files. All of these files are patientID-based, but with different fields. I am importing these files into DataFrames, which I will join at a later stage after first…

cph_sto · 7,189 · 12 · 42 · 78
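For joins like this, repartitioning every DataFrame on the shared key with the same partition count makes the later joins co-partitioned. A sketch under stated assumptions (labsDf, medsDf, and the partition count are invented; only patientID comes from the question):

    import org.apache.spark.sql.functions.col

    val n = 200  // pick one partition count and reuse it for every DataFrame
    val labs = labsDf.repartition(n, col("patientID"))
    val meds = medsDf.repartition(n, col("patientID"))
    val joined = labs.join(meds, Seq("patientID"))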
2 votes · 0 answers
Spark Clustered By/Bucket by dataset not using memory
I recently came across Spark bucketBy/clusteredBy here.
I tried to mimic this for a 1.1 TB source file from S3 (already in Parquet). The plan is to completely avoid shuffle, as most of the datasets are always joined on the "id" column. Here is what I am…

androboy · 817 · 1 · 12 · 24
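For reference, the writer-side wiring looks roughly like this; a shuffle-free join needs both sides saved as tables, bucketed on the join column with the same bucket count (table names and the bucket count are illustrative):

    df.write
      .bucketBy(512, "id")
      .sortBy("id")  // sorted buckets also let sort-merge joins skip the sort step
      .saveAsTable("events_bucketed")

    // Two tables bucketed identically on "id" can join without an exchange:
    val joined = spark.table("events_bucketed")
      .join(spark.table("users_bucketed"), "id")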
2 votes · 1 answer
Maximum number of partitions in hive
I have 1500 partitions in my Hive table, but queries are taking more time than expected.
What is the maximum number of partitions that can be created in a Hive table?
user6546116
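Hive has no meaningful hard cap on partitions per table; what usually bites first are the dynamic-partition limits, which are just configuration. A sketch of raising them from a Hive-enabled Spark session (the values are arbitrary examples):

    spark.sql("SET hive.exec.max.dynamic.partitions = 10000")
    spark.sql("SET hive.exec.max.dynamic.partitions.pernode = 1000")

With only 1500 partitions, slow queries are more likely due to metastore lookups and small files than to the partition count itself.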
2 votes · 1 answer
Hive insert overwrite truncates the table in a few cases
I was working on a solution and found that in some particular cases, Hive INSERT OVERWRITE truncates the table, but in a few cases it doesn't. Would someone please explain why it behaves like that?
To explain this, I am taking two tables,…

Gaurang Shah · 11,764 · 9 · 74 · 137
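The distinction usually comes down to partitioning: INSERT OVERWRITE on a non-partitioned table replaces all of its data, while on a partitioned table it replaces only the partitions being written. A minimal sketch, assuming a Hive-enabled SparkSession and made-up table names:

    spark.sql("CREATE TABLE IF NOT EXISTS t_part (v STRING) PARTITIONED BY (dt STRING)")
    spark.sql("INSERT OVERWRITE TABLE t_part PARTITION (dt = '2019-01-01') VALUES ('a')")
    // Only the dt='2019-01-01' partition is rewritten; any other partition is left intact.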
2 votes · 1 answer
Delete/Update partition with sparklyr
I'm using the spark_write_table function from sparklyr to write tables into HDFS, using the partition_by parameter to define how to store them:
R> my_table %>%
  spark_write_table(.,
    path = "mytable",
    mode = "append",
    …

dalloliogm · 8,718 · 6 · 45 · 55
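Spark 2.3+ can overwrite just the partitions present in the incoming data (dynamic partition overwrite). Sketched below is the Scala equivalent of what such a write would need; the updates DataFrame is an assumption, the table name is from the question:

    spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

    updates.write
      .mode("overwrite")
      .insertInto("mytable")  // rewrites only the partitions that `updates` touches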
2 votes · 0 answers
What is the correct way to know where a file is located on HDFS cluster?
I need to develop my own job executor (it is not homework), leveraging datanode locality.
I have a Hadoop 2.7.1 cluster with 2 datanodes.
(see http://jugsi.blogspot.it/2017/08/configuring-hadoop-271-on-windows-w-ssh.html)
My code:
public static…

venergiac · 7,469 · 2 · 48 · 70
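The HDFS client API exposes block placement directly. A minimal sketch using FileSystem.getFileBlockLocations (the path is a placeholder; this assumes the cluster configuration is on the classpath):

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    val fs = FileSystem.get(new Configuration())
    val status = fs.getFileStatus(new Path("/data/input.txt"))
    val blocks = fs.getFileBlockLocations(status, 0, status.getLen)
    blocks.foreach { b =>
      // Each BlockLocation lists the datanodes holding a replica of that block.
      println(s"offset=${b.getOffset} length=${b.getLength} hosts=${b.getHosts.mkString(",")}")
    }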
2 votes · 1 answer
How to deal with hive partitioning for performance versus over-partitioning
We have a very large Hadoop dataset with more than a decade of historical transaction data: 6.5B rows and counting. It is partitioned on year and month.
Performance is poor for a number of reasons. Nearly all of our queries can be further…

Tom Harrison · 13,533 · 3 · 49 · 77
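A common compromise is to keep the coarse year/month partitions and bucket on the high-cardinality filter column instead of adding more partition levels. A sketch, where transactions and account_id are assumptions about the data:

    transactions.write
      .partitionBy("year", "month")
      .bucketBy(64, "account_id")  // buckets narrow scans without exploding the partition count
      .sortBy("account_id")
      .saveAsTable("transactions_bucketed")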
2 votes · 0 answers
Spark repartition operation duplicated
I've got an RDD in Spark which I've cached. Before caching it, I repartition it. This works, and I can see in the Storage tab of the Spark UI that it has the expected number of partitions.
This is what the stages look like on subsequent runs:
It's…

yarrichar · 423 · 5 · 17
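One pattern that keeps the repartition from re-running in later jobs is to cache the repartitioned RDD and materialize it once with an action before anything else uses it; a minimal sketch:

    val repartitioned = rdd.repartition(100).cache()
    repartitioned.count()  // forces evaluation so later jobs read the cached partitions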
2 votes · 1 answer
AggregateByKey Partitioning?
I have:
A_RDD = anRDD.map()
B_RDD = A_RDD.aggregateByKey()
My question is: if I put partitionBy(new HashPartitioner) after A_RDD, like:
A_RDD = anRDD.map().partitionBy(new HashPartitioner(2))
B_RDD = A_RDD.aggregateByKey()
1) Will this…

Spar · 463 · 1 · 5 · 23
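For what the question describes: if the pair RDD is pre-partitioned and cached, aggregateByKey reuses the existing partitioner, so no second shuffle occurs. Sketched with made-up data (assumes a SparkContext named sc):

    import org.apache.spark.HashPartitioner

    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
    val partitioned = pairs.partitionBy(new HashPartitioner(2)).cache()

    // Same partitioner as the input, so this aggregation runs without another shuffle.
    val sums = partitioned.aggregateByKey(0)(_ + _, _ + _)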
2 votes · 2 answers
Basics of Hadoop and MapReduce functioning
I have just started to learn Hadoop and MapReduce concepts, and have a few questions that I would like to clear up before moving forward:
From what I understand:
Hadoop is specifically used when there is a huge amount of data involved.…

theimpatientcoder · 1,184 · 3 · 19 · 32
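The routing question behind this tag (which reducer a given key lands on) is easiest to see in a word count: each word is hashed, and all occurrences of it are shuffled to the same partition. A sketch in Spark terms rather than the raw MapReduce API (assumes a SparkContext named sc and a placeholder input path):

    val counts = sc.textFile("input.txt")
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)  // hash-partitions by key, so each word's counts meet on one partition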
2 votes · 1 answer
TotalOrderPartitioner and Partition file
I am learning Hadoop MapReduce, and I am working with the Java API. I learned about the TotalOrderPartitioner, used to 'globally' sort the output by key across the cluster, and that it needs a partition file (generated using…

Ankit Khettry · 997 · 1 · 13 · 33
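The usual wiring, sketched from the mapreduce-API classes (paths, key types, and sampling rates are placeholders, and the job's input/output formats are assumed to be configured elsewhere): InputSampler samples keys to write the cut points, and TotalOrderPartitioner reads them back to route keys to reducers in sorted ranges.

    import org.apache.hadoop.fs.Path
    import org.apache.hadoop.io.Text
    import org.apache.hadoop.mapreduce.Job
    import org.apache.hadoop.mapreduce.lib.partition.{InputSampler, TotalOrderPartitioner}

    val job = Job.getInstance()
    job.setNumReduceTasks(4)
    job.setPartitionerClass(classOf[TotalOrderPartitioner[Text, Text]])
    TotalOrderPartitioner.setPartitionFile(job.getConfiguration, new Path("_partitions"))

    // Sample ~10% of input keys (capped at 10000) to pick numReduceTasks - 1 cut points.
    InputSampler.writePartitionFile(job, new InputSampler.RandomSampler[Text, Text](0.1, 10000))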
2 votes · 3 answers
How to create a key/value pair in a MapReduce program if values are stored across block boundaries?
The input file that I need to process has data classified by headers and their respective records. My 200 MB file has 3 such headers, with their records split across 4 blocks (3×64 MB and 1×8 MB).
The data would be in below format
HEADER 1
Record…

Vamsinag R · 51 · 4
2 votes · 1 answer
Can't put file from local directory to HDFS
I have created a file named "file.txt" in the local directory; now I want to put it into HDFS using:
]$ hadoop fs -put file.txt abcd
I am getting a response like:
put: 'abcd': no such file or directory
I have never worked on Linux. Please…

lamiaheart · 29 · 1 · 3
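This error usually means the user's HDFS home directory (/user/<name>), against which the relative destination is resolved, does not exist yet, so creating it first fixes the put. The same steps through the FileSystem API, as a sketch with placeholder paths:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    val fs = FileSystem.get(new Configuration())
    fs.mkdirs(new Path("/user/myname"))  // placeholder user; relative paths resolve here
    fs.copyFromLocalFile(new Path("file.txt"), new Path("/user/myname/abcd"))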