Hadoop partitioning covers how Hadoop decides which key/value pairs are sent to which reducer (partition).
Questions tagged [hadoop-partitioning]
339 questions
0 votes · 1 answer
Hand selecting parquet partitions vs filtering them in pyspark
This might be a dumb question, but is there any difference between manually specifying the partition columns when loading a parquet file, as opposed to loading it and then filtering on them?
For example:
I have a parquet file that is partitioned by DATE. If I…

thentangler
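For questions like this one, the usual answer is partition pruning: when the filter is on a partition column, Spark resolves it against directory names and reads only the matching partitions, which is broadly equivalent to hand-selecting the path (note that hand-selecting a partition path drops the DATE column from the schema unless a `basePath` option is supplied). A plain-Python sketch of the pruning idea, with made-up paths:

```python
# Made-up partition directories for a DATE-partitioned parquet table.
partitions = [
    "s3://my_bucket/table/DATE=2020-01-01",
    "s3://my_bucket/table/DATE=2020-01-02",
    "s3://my_bucket/table/DATE=2020-01-03",
]

def prune(paths, date):
    """Keep only the partition directories whose DATE matches the filter."""
    return [p for p in paths if p.endswith(f"DATE={date}")]

# Filtering on the partition column and hand-selecting the path both
# reduce to the same directory set:
print(prune(partitions, "2020-01-02"))
# ['s3://my_bucket/table/DATE=2020-01-02']
```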
0 votes · 2 answers
Moving files from one parquet partition to another
I have a very large amount of data in my S3 bucket, partitioned by two columns, MODULE and DATE,
such that the file structure of my parquet files is:
s3://my_bucket/path/file.parquet/MODULE='XYZ'/DATE=2020-01-01
I have 7 MODULE values and the DATE ranges from…

thentangler
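S3 has no rename operation, so "moving" a partition means copying each object to a rewritten key and deleting the original (e.g. `copy_object` plus `delete_object` in boto3). A plain-Python sketch of the key rewrite; the paths follow the question's layout and the helper name is made up:

```python
def move_key(key, new_module):
    """Rewrite the MODULE=... component of a partition key; an S3 'move'
    is then a copy to this new key plus a delete of the old one."""
    parts = key.split("/")
    return "/".join(
        f"MODULE='{new_module}'" if p.startswith("MODULE=") else p
        for p in parts
    )

src = "path/file.parquet/MODULE='XYZ'/DATE=2020-01-01/part-0000.parquet"
print(move_key(src, "ABC"))
# path/file.parquet/MODULE='ABC'/DATE=2020-01-01/part-0000.parquet
```

One caveat: if MODULE is also stored inside the parquet files, moving the directory alone leaves the column value inconsistent with the path.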
0 votes · 1 answer
partitioning and re-partitioning parquet files using pyspark
I have a parquet partitioning issue that I am trying to solve. I have read a lot of material on partitioning on this site and on the web, but still couldn't solve my problem.
Step 1: I have a large dataset (~2TB) that has MODULE and DATE columns…

thentangler
0 votes · 1 answer
can you overlap partitions when writing parquet files
I have a very large dataframe, around 2TB in size.
There are 2 columns by which I can partition it: MODULE and DATE.
If I partition by MODULE, each module can have the same dates; for example, MODULE A might have dates 2020-07-01, 2020-07-02 and…

thentangler
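Partitioning by both columns addresses the overlap concern: `partitionBy("MODULE", "DATE")` produces nested directories, so the same DATE value can recur under every MODULE without collision. A small plain-Python sketch of the resulting layout (values made up):

```python
from itertools import product

modules = ["A", "B"]
dates = ["2020-07-01", "2020-07-02"]

# partitionBy("MODULE", "DATE") nests the directories, so the same DATE
# repeats under each MODULE without any conflict:
paths = [f"MODULE={m}/DATE={d}" for m, d in product(modules, dates)]
for p in paths:
    print(p)
```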
0 votes · 1 answer
Hive Partitioned by Date -- Processing multiple dates at a time?
I might have a gap in my understanding of Hive partitioning. I have an external table that is partitioned by date. I'm generating the parquet files via a query on a managed Hive table. I currently run a bash script to process incrementally by date…

Jammy
0 votes · 1 answer
hdfs partitioned data backup when overwriting a hive table
I have an external table, partitioned on 3 columns and stored in…

Rocky1989
0 votes · 2 answers
How to fetch the latest date from a hive table partitioned on a date column?
E.g. if my date column is load_date, using the max(load_date) operator will scan every data file in Hive, making it a costly operation. Is there instead an optimal way to get the latest load_date from the table?

AbhishekB
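The usual trick for this is `SHOW PARTITIONS`, which is served from the metastore without scanning data files; because ISO dates sort lexicographically, the maximum partition string gives the latest date. A sketch over a made-up partition list:

```python
# Hypothetical output of `SHOW PARTITIONS my_table`; this comes from the
# metastore only, so no data files are scanned:
partitions = [
    "load_date=2021-01-01",
    "load_date=2021-01-03",
    "load_date=2021-01-02",
]

# ISO dates sort lexicographically, so max() on the partition strings
# already yields the latest load_date:
latest = max(partitions).split("=")[1]
print(latest)  # 2021-01-03
```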
0 votes · 1 answer
Can I create a directory on a remote cluster in hadoop by doing a -mkdir?
We are moving data between clusters on a partition-by-partition basis, and we have a requirement to use the
-update -skipcrccheck options only for this. Running distcp on a partition-by-partition basis with these options requires the partition directory…

Kireet Bhat
0 votes · 1 answer
How to insert into hive table, partitioned by date reading from temp table?
I have a Hive temp table, without any partitions, which has the data required. I want to select this data and insert it into another table partitioned by date. I tried the following techniques with no luck.
Source table schema
CREATE TABLE…

Saawan
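The standard answer to this kind of question is Hive dynamic partitioning (with `hive.exec.dynamic.partition.mode=nonstrict`): an `INSERT ... PARTITION (dt) SELECT ..., dt FROM temp` routes each row to the partition named by its trailing column. A plain-Python sketch of that routing, with made-up rows:

```python
from collections import defaultdict

# With dynamic partitioning, the trailing SELECT column picks the target
# partition for each row. Rows below are made up.
rows = [("a", "2021-01-01"), ("b", "2021-01-02"), ("c", "2021-01-01")]

partitions = defaultdict(list)
for value, dt in rows:
    partitions[f"dt={dt}"].append(value)  # the dt value picks the partition

print(dict(partitions))
# {'dt=2021-01-01': ['a', 'c'], 'dt=2021-01-02': ['b']}
```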
0 votes · 0 answers
Simple way to deal with poor folder structure for partitions in Apache Spark
Oftentimes, data is available with a folder structure like,
2000-01-01/john/smith
rather than the Hive partition spec,
date=2000-01-01/first_name=john/last_name=smith
Spark (and pyspark) can read partitioned data easily when using the Hive folder…
user3002273
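One workaround, sketched here in plain Python, is to rewrite (or rename) the directory layout into the Hive spec before reading; the helper name and column list are made up:

```python
def to_hive_layout(path, columns):
    """Pair each path component with its column name to produce the
    Hive partition spec (helper and column names are illustrative)."""
    return "/".join(
        f"{col}={part}" for col, part in zip(columns, path.split("/"))
    )

print(to_hive_layout("2000-01-01/john/smith",
                     ["date", "first_name", "last_name"]))
# date=2000-01-01/first_name=john/last_name=smith
```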
0 votes · 2 answers
Hive Bucketing: number of distinct column values is greater than the number of buckets
In hive, say I have an employee table with 1000 records and I am bucketing by the subject column.
The total number of distinct values of the subject column is 20, but my total number of buckets is 6.
How does the shuffling happen?
While understanding bucketing I…

Krishna kumar
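Hive assigns each row to bucket `hash(subject) % num_buckets`, so with 20 distinct subjects and only 6 buckets, several subjects must share a bucket (pigeonhole). A plain-Python simulation; Python's `hash()` stands in for Hive's hash function, so the exact assignment differs from Hive's (and varies between Python runs):

```python
# Bucket assignment is hash(subject) % num_buckets. Subject names are
# made up; Python's hash() is only a stand-in for Hive's hash function.
subjects = [f"subject_{i}" for i in range(20)]
num_buckets = 6

buckets = {}
for s in subjects:
    buckets.setdefault(hash(s) % num_buckets, []).append(s)

# 20 distinct values into at most 6 buckets: some bucket necessarily
# holds at least 4 subjects.
for b in sorted(buckets):
    print(b, len(buckets[b]))
```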
0 votes · 1 answer
How to delete the most recently created files in multiple HDFS directories?
I made a mistake and have added a few hundred part files to a table partitioned by date. I am able to see which files are new (these are the ones I want to remove). Most cases I've seen on here relate to deleting files older than a certain date, but…

phenderbender
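The selection step can be sketched independently of HDFS: given paths with modification times (as listed by `hdfs dfs -ls`), keep the files newer than a cutoff and feed those to the delete. Timestamps and paths here are made up:

```python
# (path, modification time) pairs as printed by `hdfs dfs -ls`; values
# are made up. ISO timestamps compare correctly as plain strings.
files = [
    ("/table/dt=2021-01-01/part-0", "2021-01-01 10:00"),
    ("/table/dt=2021-01-01/part-9", "2021-03-15 09:30"),
    ("/table/dt=2021-01-02/part-0", "2021-01-02 10:00"),
]

cutoff = "2021-03-01 00:00"
to_delete = [path for path, ts in files if ts > cutoff]
print(to_delete)  # ['/table/dt=2021-01-01/part-9']
```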
0 votes · 1 answer
pyspark write overwrite is partitioned but is still overwriting the previous load
I am running a pyspark script where I save some data to an S3 bucket each time the script is run, and I have this code:
data.repartition(1).write.mode("overwrite").format("parquet").partitionBy('time_key').save("s3://path/to/directory")
It…

Cards14
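The usual cause here is Spark's default static overwrite, which truncates the whole output path before writing; setting `spark.sql.sources.partitionOverwriteMode` to `dynamic` (Spark 2.3+) replaces only the partitions present in the new data. A plain-Python sketch of the difference:

```python
# "static" (the default) truncates the whole output path before writing;
# "dynamic" (spark.sql.sources.partitionOverwriteMode=dynamic) replaces
# only the partitions present in the incoming data. Values are made up.
existing = {"time_key=1": "old", "time_key=2": "old"}
incoming = {"time_key=2": "new"}

def overwrite(table, data, mode):
    table = {} if mode == "static" else dict(table)
    table.update(data)
    return table

print(overwrite(existing, incoming, "static"))   # {'time_key=2': 'new'}
print(overwrite(existing, incoming, "dynamic"))  # time_key=1 survives
```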
0 votes · 1 answer
Process each partition and each row in each partition, one at a time
Question:
I have the below 2 dataframes stored in an array. The data is already partitioned by SECURITY_ID.
Dataframe 1 (DF1):
+-------------+----------+----------+--------+---------+---------+
|…

Voila
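The pattern behind this question is Spark's `foreachPartition`: the executor receives each partition as an iterator and can walk its rows one at a time. A plain-Python sketch with made-up SECURITY_ID rows:

```python
# Partitions arrive as iterables of rows; the (SECURITY_ID, value)
# pairs below are made-up stand-ins for the dataframes in the question.
dataframes = [
    [("SEC1", 100), ("SEC1", 200)],  # partition for SECURITY_ID SEC1
    [("SEC2", 300)],                 # partition for SECURITY_ID SEC2
]

def process_row(row):
    security_id, value = row
    return f"{security_id}:{value}"

results = []
for partition in dataframes:   # one partition at a time
    for row in partition:      # one row at a time within the partition
        results.append(process_row(row))

print(results)  # ['SEC1:100', 'SEC1:200', 'SEC2:300']
```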
0 votes · 1 answer
In a MapReduce word count program, how to fetch the files where the words exist
I am reading multiple input files for a word count problem.
Example file names:
file1.txt
file2.txt
file3.txt
I am able to get the word count, but what should be added if I also want to get the file names along with the count where the words exist?
for…

Rakesh R
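In MapReduce the mapper can read the input file name from its input split, so the fix is to make the file name part of the key: emit ((word, filename), 1) instead of (word, 1). A plain-Python sketch with made-up file contents:

```python
from collections import defaultdict

# Made-up file contents; in a real job the mapper would take the file
# name from its input split rather than from a dict key.
files = {
    "file1.txt": "hadoop spark hadoop",
    "file2.txt": "spark",
}

counts = defaultdict(int)
for name, text in files.items():
    for word in text.split():      # map: emit ((word, file name), 1)
        counts[(word, name)] += 1  # reduce: sum per composite key

print(dict(counts))
# {('hadoop', 'file1.txt'): 2, ('spark', 'file1.txt'): 1, ('spark', 'file2.txt'): 1}
```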