Questions tagged [hadoop-partitioning]

Hadoop partitioning deals with questions about how Hadoop decides which key/value pairs are sent to which reducer (partition).
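In classic MapReduce the default HashPartitioner routes a record to reducer hash(key) mod numReduceTasks. A minimal PySpark RDD sketch of the same idea, using an illustrative custom partition function:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partitioning-demo").getOrCreate()
sc = spark.sparkContext

pairs = sc.parallelize([("apple", 1), ("banana", 1), ("apple", 1), ("cherry", 1)])

# Default behaviour mirrors Hadoop's HashPartitioner: partition = hash(key) % numPartitions
by_hash = pairs.partitionBy(3)

# A custom partition function decides the target partition per key (illustrative rule)
def by_first_letter(key):
    return ord(key[0]) % 3

custom = pairs.partitionBy(3, by_first_letter)
print(custom.glom().collect())  # inspect which key/value pairs ended up in which partition
```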

339 questions
0 votes · 1 answer

Hand-selecting parquet partitions vs filtering them in pyspark

This might be a dumb question, but is there any difference between manually specifying the partition columns in a parquet file and loading it and then filtering them? For example: I have a parquet file that is partitioned by DATE. If I…

thentangler · 1,048
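For the question above, a minimal sketch of the two approaches (bucket path and DATE value are made up). With a Hive-style layout Spark prunes partitions, so the filtered read usually touches the same files as the hand-picked path; the main visible difference is that a directory-level read no longer infers the DATE column unless basePath is set:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

# Option 1: hand-select the partition directory (illustrative path)
manual = spark.read.parquet("s3://my_bucket/data.parquet/DATE=2020-01-01")

# Option 2: load the dataset and filter; partition pruning skips the other DATE directories
filtered = (spark.read.parquet("s3://my_bucket/data.parquet")
                 .filter(col("DATE") == "2020-01-01"))

# Keeping the DATE column while reading a single partition requires basePath
manual_with_col = (spark.read.option("basePath", "s3://my_bucket/data.parquet")
                        .parquet("s3://my_bucket/data.parquet/DATE=2020-01-01"))
```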
0 votes · 2 answers

Moving files from one parquet partition to another

I have a very large amount of data in my S3 bucket partitioned by two columns, MODULE and DATE, such that the file structure of my parquet data is: s3://my_bucket/path/file.parquet/MODULE='XYZ'/DATE=2020-01-01 I have 7 MODULEs and the DATE ranges from…

thentangler · 1,048
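One hedged way to "move" a partition from PySpark is to rewrite it rather than move the S3 objects: read the source partition, relabel the partition columns, append under the new values, then delete the old prefix separately (paths and values below are placeholders):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit

spark = SparkSession.builder.getOrCreate()
base = "s3://my_bucket/path/file.parquet"

# Read only the partition that should move (placeholder MODULE/DATE values)
src = spark.read.parquet(f"{base}/MODULE='XYZ'/DATE=2020-01-01")

# Re-attach the partition columns with the target values and append them back
(src.withColumn("MODULE", lit("XYZ"))
    .withColumn("DATE", lit("2020-02-01"))
    .write.mode("append")
    .partitionBy("MODULE", "DATE")
    .parquet(base))

# The old MODULE='XYZ'/DATE=2020-01-01 prefix still has to be removed out-of-band
# (e.g. aws s3 rm --recursive); Spark does not delete the source files.
```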
0 votes · 1 answer

Partitioning and re-partitioning parquet files using pyspark

I have a parquet partitioning issue that I am trying to solve. I have read a lot of material on partitioning on this site and on the web but still couldn't solve my problem. Step 1: I have a large dataset (~2TB) that has MODULE and DATE columns…

thentangler · 1,048
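A common pattern for a large write like this (a sketch with placeholder paths) is to repartition on the same columns used in partitionBy, so each output directory is written by a bounded number of tasks instead of thousands of small files:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.read.parquet("s3://my_bucket/raw/")   # ~2TB source, illustrative path

(df.repartition("MODULE", "DATE")                # co-locate each MODULE/DATE's rows
   .write.mode("overwrite")
   .partitionBy("MODULE", "DATE")                # one directory per MODULE/DATE pair
   .parquet("s3://my_bucket/partitioned/"))
```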
0 votes · 1 answer

Can you overlap partitions when writing parquet files?

I have a very large dataframe, around 2TB in size. There are 2 columns by which I can partition it: MODULE and DATE. If I partition by MODULE, each module can have the same dates; for example, MODULE A might have dates 2020-07-01, 2020-07-02 and…

thentangler · 1,048
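Partitioning by both columns nests the directories, so the same DATE can recur under every MODULE without any conflict; a small sketch (paths are illustrative):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("s3://my_bucket/raw/")   # illustrative source

(df.write.mode("overwrite")
   .partitionBy("MODULE", "DATE")
   .parquet("s3://my_bucket/out/"))

# Resulting layout -- the same DATE appears under each MODULE:
#   s3://my_bucket/out/MODULE=A/DATE=2020-07-01/part-...
#   s3://my_bucket/out/MODULE=B/DATE=2020-07-01/part-...
```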
0 votes · 1 answer

Hive Partitioned by Date -- Processing multiple dates at a time?

I might have a gap in understanding hive partitioning. I have an external table that is partitioned by date. I'm generating the parquet files via a query on a managed hive table. I currently run a bash script to process incrementally by date…

Jammy · 413
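Instead of looping over dates in bash, a dynamic-partition insert can cover a whole date range in one statement; a sketch via spark.sql, with placeholder database, table, and column names:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Allow Hive-style dynamic partitioning for the insert below
spark.sql("SET hive.exec.dynamic.partition=true")
spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")

# One statement processes every date in the range; the partition column goes last
spark.sql("""
    INSERT OVERWRITE TABLE target_db.events_parquet PARTITION (load_date)
    SELECT col_a, col_b, load_date
    FROM   source_db.events_managed
    WHERE  load_date BETWEEN '2020-07-01' AND '2020-07-31'
""")
```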
0 votes · 1 answer

HDFS partitioned data backup when overwriting a hive table

I have an external table that is partitioned on 3 columns and stored in…

Rocky1989 · 369
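A simple safeguard before overwriting is to copy the partition directories aside first; a sketch driving the hdfs CLI from Python, with made-up warehouse and backup paths:

```python
import os
import subprocess

src_root = "/warehouse/mydb.db/mytable"                       # placeholder table location
bak_root = "/backup/mydb.db/mytable"
partitions = ["year=2020/month=07/day=01", "year=2020/month=07/day=02"]  # example values

for part in partitions:
    dst = f"{bak_root}/{part}"
    # Make sure the parent directory exists, then copy the whole partition directory
    subprocess.run(["hdfs", "dfs", "-mkdir", "-p", os.path.dirname(dst)], check=True)
    subprocess.run(["hdfs", "dfs", "-cp", f"{src_root}/{part}", dst], check=True)
```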
0 votes · 2 answers

How to fetch the latest date from a hive table partitioned on a date column?

E.g., if my date column is load_date, using the max(load_date) operator will scan every data file in Hive, making it a costly operation. Is there instead a more optimal way to get the latest load_date from the table?

AbhishekB · 31
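Because partition values live in the metastore, SHOW PARTITIONS answers this without scanning data files; a sketch via spark.sql (the table name is a placeholder, and the lexicographic max assumes yyyy-MM-dd dates):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# The partition list comes from the metastore, not from the data files
rows = spark.sql("SHOW PARTITIONS mydb.mytable").collect()

# Each row looks like 'load_date=2021-03-15'; take the max of the values
latest = max(r[0].split("=", 1)[1] for r in rows)
print(latest)
```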
0 votes · 1 answer

Can I create a directory on a remote cluster in hadoop by doing a -mkdir?

We are moving data between clusters on a partition-by-partition basis, and we have a requirement to use only the -update -skipcrccheck options for this. Running distcp on a partition-by-partition basis with these options requires the partition directory…

Kireet Bhat · 77
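hdfs dfs accepts a fully qualified URI, so the target directory can be created on the remote namenode before distcp runs; a sketch with placeholder cluster names and paths:

```python
import subprocess

src = "hdfs://cluster-a/warehouse/mytable/dt=2020-01-01"   # placeholder partition paths
dst = "hdfs://cluster-b/warehouse/mytable/dt=2020-01-01"

# -mkdir works against the remote cluster when given a fully qualified path
subprocess.run(["hdfs", "dfs", "-mkdir", "-p", dst], check=True)

# Copy a single partition with the options mentioned in the question
subprocess.run(["hadoop", "distcp", "-update", "-skipcrccheck", src, dst], check=True)
```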
0 votes · 1 answer

How to insert into a hive table partitioned by date, reading from a temp table?

I have a Hive temp table without any partitions which has the required data. I want to select this data and insert it into another table partitioned by date. I tried the following techniques with no luck. Source table schema: CREATE TABLE…

Saawan · 363
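Besides the dynamic-partition SQL form, the DataFrame API can load from the temp table and insert directly, as long as the partition column is selected last (all names below are placeholders):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()
spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")

src = spark.table("tmp_db.staging_events")

# insertInto matches columns by position, so the partition column (load_date) goes last
(src.select("col_a", "col_b", "load_date")
    .write.mode("append")
    .insertInto("target_db.events_by_date"))
```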
0 votes · 0 answers

Simple way to deal with poor folder structure for partitions in Apache Spark

Oftentimes, data is available with a folder structure like 2000-01-01/john/smith rather than the Hive partition spec date=2000-01-01/first_name=john/last_name=smith. Spark (and pyspark) can read partitioned data easily when using the Hive folder…

user3002273
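When folders are not in key=value form, one workaround is to read with a glob and recover the "partition" values from the file path; a sketch using input_file_name, mirroring the layout in the question:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import input_file_name, regexp_extract

spark = SparkSession.builder.getOrCreate()

# Folders look like /data/2000-01-01/john/smith/part-*.parquet rather than key=value
df = spark.read.parquet("/data/*/*/*")

df = (df.withColumn("path", input_file_name())
        .withColumn("date", regexp_extract("path", r"/(\d{4}-\d{2}-\d{2})/", 1))
        .withColumn("first_name", regexp_extract("path", r"/\d{4}-\d{2}-\d{2}/([^/]+)/", 1))
        .withColumn("last_name", regexp_extract("path", r"/\d{4}-\d{2}-\d{2}/[^/]+/([^/]+)/", 1))
        .drop("path"))
```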
0 votes · 2 answers

Hive bucketing: the number of distinct column values is greater than the number of buckets

In Hive, say I have a table employee with 1000 records and I am bucketing on the subject column. The total number of distinct values of the subject column is 20, but my total number of buckets is 6. How does the shuffling happen? While understanding the bucketing I…
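Hive routes each row to bucket hash(subject) mod 6, so with 20 distinct subjects several values necessarily share a bucket; a tiny Python sketch of that mapping (Python's hash merely stands in for Hive's own hash function):

```python
from collections import defaultdict

subjects = [f"subject_{i}" for i in range(20)]   # 20 distinct column values
num_buckets = 6

buckets = defaultdict(list)
for s in subjects:
    buckets[hash(s) % num_buckets].append(s)     # stand-in for Hive's bucketing hash

for b in sorted(buckets):
    print(b, len(buckets[b]), buckets[b])        # several subjects land in each bucket
```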
0 votes · 1 answer

How to delete the most recently created files in multiple HDFS directories?

I made a mistake and have added a few hundred part files to a table partitioned by date. I am able to see which files are new (these are the ones I want to remove). Most cases I've seen on here relate to deleting files older than a certain date, but…

phenderbender · 625
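One hedged approach is to list files recursively with their modification timestamps and remove only those newer than a cutoff; a sketch that parses hdfs dfs -ls output (the table path and cutoff are placeholders, and the delete is left commented out for a dry run):

```python
import subprocess

table_root = "/warehouse/mydb.db/mytable"   # placeholder partitioned-table location
cutoff = "2021-03-15 00:00"                 # files modified after this are candidates

listing = subprocess.run(["hdfs", "dfs", "-ls", "-R", table_root],
                         capture_output=True, text=True, check=True).stdout

for line in listing.splitlines():
    parts = line.split(None, 7)
    if len(parts) != 8 or line.startswith("d"):
        continue                             # skip directories and summary lines
    _, _, _, _, _, date, time, path = parts
    if f"{date} {time}" > cutoff:
        print("would delete:", path)
        # subprocess.run(["hdfs", "dfs", "-rm", path], check=True)
```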
0 votes · 1 answer

pyspark write overwrite is partitioned but is still overwriting the previous load

I am running a pyspark script where I'm saving off some data to an S3 bucket each time the script is run, and I have this code: data.repartition(1).write.mode("overwrite").format("parquet").partitionBy('time_key').save("s3://path/to/directory") It…

Cards14 · 99
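By default an overwrite replaces the whole target directory, not just the partitions being written; the partition-scoped behaviour comes from spark.sql.sources.partitionOverwriteMode=dynamic (Spark 2.3+). A sketch built around the code in the question, with a stand-in DataFrame:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit

spark = SparkSession.builder.getOrCreate()

# Stand-in for the `data` DataFrame from the question
data = spark.range(10).withColumn("time_key", lit("2021-03-15"))

# Only the time_key partitions present in `data` are rewritten; earlier loads survive
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

(data.repartition(1)
     .write.mode("overwrite")
     .format("parquet")
     .partitionBy("time_key")
     .save("s3://path/to/directory"))
```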
0 votes · 1 answer

Process each partition and each row in each partition, one at a time

Question: I have the 2 dataframes below stored in an array. The data is already partitioned by SECURITY_ID. Dataframe 1 (DF1): +-------------+----------+----------+--------+---------+---------+ |…

Voila · 85
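If the goal is per-partition, then per-row processing, foreachPartition hands each partition's rows to a function as an iterator; a small sketch with placeholder columns:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [("SEC1", 100.0), ("SEC1", 101.0), ("SEC2", 55.0)],
    ["SECURITY_ID", "PRICE"],
)

def handle_partition(rows):
    # `rows` iterates over the rows of a single partition, one at a time
    for row in rows:
        print(row["SECURITY_ID"], row["PRICE"])   # per-row work goes here

# Group each SECURITY_ID into its own partition, then process partition by partition
df.repartition("SECURITY_ID").foreachPartition(handle_partition)
```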
0 votes · 1 answer

In a MapReduce word count program, how to fetch the files where the words exist

I am reading multiple input files for a word count problem. Example file names: file1.txt file2.txt file3.txt I am able to get the word count, but what should be added if I also want to get the file names along with the count where the words exist? For…

Rakesh R · 11
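One way to keep the source file alongside the count is to carry the file name through the aggregation. The sketch below does this in PySpark with input_file_name rather than in a Java MapReduce job; the input path is a placeholder:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import input_file_name, explode, split, col

spark = SparkSession.builder.getOrCreate()

lines = (spark.read.text("hdfs:///input/file*.txt")       # file1.txt, file2.txt, ...
              .withColumn("file", input_file_name()))

counts = (lines.withColumn("word", explode(split(col("value"), r"\s+")))
               .filter(col("word") != "")
               .groupBy("word", "file")
               .count())                                   # (word, file) -> count

counts.show(truncate=False)
```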