Questions tagged [hadoop-partitioning]

Hadoop partitioning deals with questions about how Hadoop decides which key/value pairs are sent to which reducer (partition).
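In classic MapReduce the default HashPartitioner routes a record to reducer hash(key) mod numReduceTasks. A minimal PySpark RDD sketch of the same idea, using an illustrative custom partition function:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partitioning-demo").getOrCreate()
sc = spark.sparkContext

pairs = sc.parallelize([("apple", 1), ("banana", 1), ("apple", 1), ("cherry", 1)])

# Default behaviour mirrors Hadoop's HashPartitioner: partition = hash(key) % numPartitions
by_hash = pairs.partitionBy(3)

# A custom partition function decides the target partition per key (illustrative rule)
def by_first_letter(key):
    return ord(key[0]) % 3

custom = pairs.partitionBy(3, by_first_letter)
print(custom.glom().collect())  # inspect which key/value pairs ended up in which partition
```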

339 questions
0 votes · 1 answer

Hand-selecting parquet partitions vs filtering them in pyspark

This might be a dumb question, but is there any difference between manually specifying the partition columns in a parquet file and loading it and then filtering them? For example: I have a parquet file that is partitioned by DATE. If I…

thentangler · 1,048
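For the question above, a minimal sketch of the two approaches (bucket path and DATE value are made up). With a Hive-style layout Spark prunes partitions, so the filtered read usually touches the same files as the hand-picked path; the main visible difference is that a directory-level read no longer infers the DATE column unless basePath is set:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

# Option 1: hand-select the partition directory (illustrative path)
manual = spark.read.parquet("s3://my_bucket/data.parquet/DATE=2020-01-01")

# Option 2: load the dataset and filter; partition pruning skips the other DATE directories
filtered = (spark.read.parquet("s3://my_bucket/data.parquet")
                 .filter(col("DATE") == "2020-01-01"))

# Keeping the DATE column while reading a single partition requires basePath
manual_with_col = (spark.read.option("basePath", "s3://my_bucket/data.parquet")
                        .parquet("s3://my_bucket/data.parquet/DATE=2020-01-01"))
```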
0 votes · 2 answers

Moving files from one parquet partition to another

I have a very large amount of data in my S3 bucket partitioned by two columns, MODULE and DATE, such that the file structure of my parquet data is: s3://my_bucket/path/file.parquet/MODULE='XYZ'/DATE=2020-01-01 I have 7 MODULEs and the DATE ranges from…

thentangler · 1,048
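One hedged way to "move" a partition from PySpark is to rewrite it rather than move the S3 objects: read the source partition, relabel the partition columns, append under the new values, then delete the old prefix separately (paths and values below are placeholders):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit

spark = SparkSession.builder.getOrCreate()
base = "s3://my_bucket/path/file.parquet"

# Read only the partition that should move (placeholder MODULE/DATE values)
src = spark.read.parquet(f"{base}/MODULE='XYZ'/DATE=2020-01-01")

# Re-attach the partition columns with the target values and append them back
(src.withColumn("MODULE", lit("XYZ"))
    .withColumn("DATE", lit("2020-02-01"))
    .write.mode("append")
    .partitionBy("MODULE", "DATE")
    .parquet(base))

# The old MODULE='XYZ'/DATE=2020-01-01 prefix still has to be removed out-of-band
# (e.g. aws s3 rm --recursive); Spark does not delete the source files.
```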
0 votes · 1 answer

Partitioning and re-partitioning parquet files using pyspark

I have a parquet partitioning issue that I am trying to solve. I have read a lot of material on partitioning on this site and on the web but still couldn't solve my problem. Step 1: I have a large dataset (~2TB) that has MODULE and DATE columns…

thentangler · 1,048
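A common pattern for a large write like this (a sketch with placeholder paths) is to repartition on the same columns used in partitionBy, so each output directory is written by a bounded number of tasks instead of thousands of small files:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.read.parquet("s3://my_bucket/raw/")   # ~2TB source, illustrative path

(df.repartition("MODULE", "DATE")                # co-locate each MODULE/DATE's rows
   .write.mode("overwrite")
   .partitionBy("MODULE", "DATE")                # one directory per MODULE/DATE pair
   .parquet("s3://my_bucket/partitioned/"))
```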
0 votes · 1 answer

Can you overlap partitions when writing parquet files?

I have a very large dataframe, around 2TB in size. There are 2 columns by which I can partition it: MODULE and DATE. If I partition by MODULE, each module can have the same dates; for example, MODULE A might have dates 2020-07-01, 2020-07-02 and…

thentangler · 1,048
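Partitioning by both columns nests the directories, so the same DATE can recur under every MODULE without any conflict; a small sketch (paths are illustrative):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("s3://my_bucket/raw/")   # illustrative source

(df.write.mode("overwrite")
   .partitionBy("MODULE", "DATE")
   .parquet("s3://my_bucket/out/"))

# Resulting layout -- the same DATE appears under each MODULE:
#   s3://my_bucket/out/MODULE=A/DATE=2020-07-01/part-...
#   s3://my_bucket/out/MODULE=B/DATE=2020-07-01/part-...
```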
0 votes · 1 answer

Hive Partitioned by Date -- Processing multiple dates at a time?

I might have a gap in understanding hive partitioning. I have an external table that is partitioned by date. I'm generating the parquet files via a query on a managed hive table. I currently run a bash script to process incrementally by date…

Jammy · 413
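Instead of looping over dates in bash, a dynamic-partition insert can cover a whole date range in one statement; a sketch via spark.sql, with placeholder database, table, and column names:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Allow Hive-style dynamic partitioning for the insert below
spark.sql("SET hive.exec.dynamic.partition=true")
spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")

# One statement processes every date in the range; the partition column goes last
spark.sql("""
    INSERT OVERWRITE TABLE target_db.events_parquet PARTITION (load_date)
    SELECT col_a, col_b, load_date
    FROM   source_db.events_managed
    WHERE  load_date BETWEEN '2020-07-01' AND '2020-07-31'
""")
```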
0 votes · 1 answer

HDFS partitioned data backup when overwriting a hive table

I have an external table that is partitioned on 3 columns and stored in…

Rocky1989 · 369
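A simple safeguard before overwriting is to copy the partition directories aside first; a sketch driving the hdfs CLI from Python, with made-up warehouse and backup paths:

```python
import os
import subprocess

src_root = "/warehouse/mydb.db/mytable"                       # placeholder table location
bak_root = "/backup/mydb.db/mytable"
partitions = ["year=2020/month=07/day=01", "year=2020/month=07/day=02"]  # example values

for part in partitions:
    dst = f"{bak_root}/{part}"
    # Make sure the parent directory exists, then copy the whole partition directory
    subprocess.run(["hdfs", "dfs", "-mkdir", "-p", os.path.dirname(dst)], check=True)
    subprocess.run(["hdfs", "dfs", "-cp", f"{src_root}/{part}", dst], check=True)
```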
0 votes · 2 answers

How to fetch the latest date from a hive table partitioned on a date column?

E.g., if my date column is load_date, using the max(load_date) operator will scan every data file in Hive, making it a costly operation. Is there instead a more optimal way to get the latest load_date from the table?

AbhishekB · 31
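Because partition values live in the metastore, SHOW PARTITIONS answers this without scanning data files; a sketch via spark.sql (the table name is a placeholder, and the lexicographic max assumes yyyy-MM-dd dates):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# The partition list comes from the metastore, not from the data files
rows = spark.sql("SHOW PARTITIONS mydb.mytable").collect()

# Each row looks like 'load_date=2021-03-15'; take the max of the values
latest = max(r[0].split("=", 1)[1] for r in rows)
print(latest)
```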
0 votes · 1 answer

Can I create a directory on a remote cluster in hadoop by doing a -mkdir?

We are moving data between clusters on a partition-by-partition basis, and we have a requirement to use only the -update -skipcrccheck options for this. Running distcp on a partition-by-partition basis with these options requires the partition directory…

Kireet Bhat · 77
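hdfs dfs accepts a fully qualified URI, so the target directory can be created on the remote namenode before distcp runs; a sketch with placeholder cluster names and paths:

```python
import subprocess

src = "hdfs://cluster-a/warehouse/mytable/dt=2020-01-01"   # placeholder partition paths
dst = "hdfs://cluster-b/warehouse/mytable/dt=2020-01-01"

# -mkdir works against the remote cluster when given a fully qualified path
subprocess.run(["hdfs", "dfs", "-mkdir", "-p", dst], check=True)

# Copy a single partition with the options mentioned in the question
subprocess.run(["hadoop", "distcp", "-update", "-skipcrccheck", src, dst], check=True)
```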
0 votes · 1 answer

How to insert into a hive table partitioned by date, reading from a temp table?

I have a Hive temp table without any partitions which has the required data. I want to select this data and insert it into another table partitioned by date. I tried the following techniques with no luck. Source table schema: CREATE TABLE…

Saawan · 363
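Besides the dynamic-partition SQL form, the DataFrame API can load from the temp table and insert directly, as long as the partition column is selected last (all names below are placeholders):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()
spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")

src = spark.table("tmp_db.staging_events")

# insertInto matches columns by position, so the partition column (load_date) goes last
(src.select("col_a", "col_b", "load_date")
    .write.mode("append")
    .insertInto("target_db.events_by_date"))
```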
0 votes · 0 answers

Simple way to deal with poor folder structure for partitions in Apache Spark

Oftentimes, data is available with a folder structure like 2000-01-01/john/smith rather than the Hive partition spec date=2000-01-01/first_name=john/last_name=smith. Spark (and pyspark) can read partitioned data easily when using the Hive folder…

user3002273
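When folders are not in key=value form, one workaround is to read with a glob and recover the "partition" values from the file path; a sketch using input_file_name, mirroring the layout in the question:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import input_file_name, regexp_extract

spark = SparkSession.builder.getOrCreate()

# Folders look like /data/2000-01-01/john/smith/part-*.parquet rather than key=value
df = spark.read.parquet("/data/*/*/*")

df = (df.withColumn("path", input_file_name())
        .withColumn("date", regexp_extract("path", r"/(\d{4}-\d{2}-\d{2})/", 1))
        .withColumn("first_name", regexp_extract("path", r"/\d{4}-\d{2}-\d{2}/([^/]+)/", 1))
        .withColumn("last_name", regexp_extract("path", r"/\d{4}-\d{2}-\d{2}/[^/]+/([^/]+)/", 1))
        .drop("path"))
```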
0 votes · 2 answers

Hive bucketing: the number of distinct column values is greater than the number of buckets

In Hive, say I have a table employee with 1000 records and I am bucketing on the subject column. The total number of distinct values of the subject column is 20, but my total number of buckets is 6. How does the shuffling happen? While understanding the bucketing I…
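Hive routes each row to bucket hash(subject) mod 6, so with 20 distinct subjects several values necessarily share a bucket; a tiny Python sketch of that mapping (Python's hash merely stands in for Hive's own hash function):

```python
from collections import defaultdict

subjects = [f"subject_{i}" for i in range(20)]   # 20 distinct column values
num_buckets = 6

buckets = defaultdict(list)
for s in subjects:
    buckets[hash(s) % num_buckets].append(s)     # stand-in for Hive's bucketing hash

for b in sorted(buckets):
    print(b, len(buckets[b]), buckets[b])        # several subjects land in each bucket
```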
0 votes · 1 answer

How to delete the most recently created files in multiple HDFS directories?

I made a mistake and have added a few hundred part files to a table partitioned by date. I am able to see which files are new (these are the ones I want to remove). Most cases I've seen on here relate to deleting files older than a certain date, but…

phenderbender · 625
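One hedged approach is to list files recursively with their modification timestamps and remove only those newer than a cutoff; a sketch that parses hdfs dfs -ls output (the table path and cutoff are placeholders, and the delete is left commented out for a dry run):

```python
import subprocess

table_root = "/warehouse/mydb.db/mytable"   # placeholder partitioned-table location
cutoff = "2021-03-15 00:00"                 # files modified after this are candidates

listing = subprocess.run(["hdfs", "dfs", "-ls", "-R", table_root],
                         capture_output=True, text=True, check=True).stdout

for line in listing.splitlines():
    parts = line.split(None, 7)
    if len(parts) != 8 or line.startswith("d"):
        continue                             # skip directories and summary lines
    _, _, _, _, _, date, time, path = parts
    if f"{date} {time}" > cutoff:
        print("would delete:", path)
        # subprocess.run(["hdfs", "dfs", "-rm", path], check=True)
```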
0 votes · 1 answer

pyspark write overwrite is partitioned but is still overwriting the previous load

I am running a pyspark script where I'm saving off some data to an S3 bucket each time the script is run, and I have this code: data.repartition(1).write.mode("overwrite").format("parquet").partitionBy('time_key').save("s3://path/to/directory") It…

Cards14 · 99
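By default an overwrite replaces the whole target directory, not just the partitions being written; the partition-scoped behaviour comes from spark.sql.sources.partitionOverwriteMode=dynamic (Spark 2.3+). A sketch built around the code in the question, with a stand-in DataFrame:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit

spark = SparkSession.builder.getOrCreate()

# Stand-in for the `data` DataFrame from the question
data = spark.range(10).withColumn("time_key", lit("2021-03-15"))

# Only the time_key partitions present in `data` are rewritten; earlier loads survive
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

(data.repartition(1)
     .write.mode("overwrite")
     .format("parquet")
     .partitionBy("time_key")
     .save("s3://path/to/directory"))
```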
0 votes · 1 answer

Process each partition and each row in each partition, one at a time

Question: I have the 2 dataframes below stored in an array. The data is already partitioned by SECURITY_ID. Dataframe 1 (DF1): +-------------+----------+----------+--------+---------+---------+ |…

Voila · 85
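If the goal is per-partition, then per-row processing, foreachPartition hands each partition's rows to a function as an iterator; a small sketch with placeholder columns:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [("SEC1", 100.0), ("SEC1", 101.0), ("SEC2", 55.0)],
    ["SECURITY_ID", "PRICE"],
)

def handle_partition(rows):
    # `rows` iterates over the rows of a single partition, one at a time
    for row in rows:
        print(row["SECURITY_ID"], row["PRICE"])   # per-row work goes here

# Group each SECURITY_ID into its own partition, then process partition by partition
df.repartition("SECURITY_ID").foreachPartition(handle_partition)
```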
0 votes · 1 answer

In a MapReduce word count program, how to fetch the files where the words exist

I am reading multiple input files for a word count problem. Example file names: file1.txt file2.txt file3.txt I am able to get the word count, but what should be added if I also want to get the file names along with the count where the words exist? For…

Rakesh R · 11
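One way to keep the source file alongside the count is to carry the file name through the aggregation. The sketch below does this in PySpark with input_file_name rather than in a Java MapReduce job; the input path is a placeholder:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import input_file_name, explode, split, col

spark = SparkSession.builder.getOrCreate()

lines = (spark.read.text("hdfs:///input/file*.txt")       # file1.txt, file2.txt, ...
              .withColumn("file", input_file_name()))

counts = (lines.withColumn("word", explode(split(col("value"), r"\s+")))
               .filter(col("word") != "")
               .groupBy("word", "file")
               .count())                                   # (word, file) -> count

counts.show(truncate=False)
```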