Questions tagged [hive-partitions]

Use this tag for questions about partitions in Hive.

Partitioning is a way of dividing a table into related parts based on the values of partition columns such as date, city, and department. Because each part is stored separately, a query can read only the portion of the data it needs.

Partitions are essentially horizontal slices of data which allow larger sets of data to be separated into more manageable chunks. In Hive, partitioning is supported for both managed and external tables and is declared in the table definition with a PARTITIONED BY clause.
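As a rough illustration of the idea, here is a minimal Python sketch (not Hive code) that mimics how Hive lays out a partitioned table as `key=value` directories and how a filter on a partition column lets a query skip whole directories. The function names `partition_path`, `write_partitioned`, and `prune`, the column names, and the sample rows are all hypothetical and exist only for this example.

```python
from collections import defaultdict


def partition_path(row, partition_cols):
    """Build a Hive-style relative directory such as 'dt=2019-12-22/country=UK'."""
    return "/".join(f"{col}={row[col]}" for col in partition_cols)


def write_partitioned(rows, partition_cols):
    """Group rows by partition directory.

    The partition values are kept in the directory name only,
    so they are dropped from the stored row.
    """
    layout = defaultdict(list)
    for row in rows:
        data = {k: v for k, v in row.items() if k not in partition_cols}
        layout[partition_path(row, partition_cols)].append(data)
    return dict(layout)


def prune(layout, **wanted):
    """Keep only the directories whose key=value segments match the filter."""
    selected = {}
    for path, data in layout.items():
        values = dict(segment.split("=", 1) for segment in path.split("/"))
        if all(values.get(col) == val for col, val in wanted.items()):
            selected[path] = data
    return selected


# Hypothetical sample data, partitioned by date and country.
rows = [
    {"id": 1, "amount": 10, "dt": "2019-12-22", "country": "UK"},
    {"id": 2, "amount": 20, "dt": "2019-12-22", "country": "USA"},
    {"id": 3, "amount": 30, "dt": "2019-12-23", "country": "UK"},
]

layout = write_partitioned(rows, ["dt", "country"])
uk = prune(layout, country="UK")

for path in sorted(uk):
    print(path, uk[path])
```

Filtering on `country="UK"` touches two of the three directories and never reads the USA one, which is the effect partition pruning aims for in a real Hive query.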

144 questions
1
vote
1 answer

What hashing algorithm does Hive use for partitioning?

I need to understand the algorithm used by Hive to hash partition data. For example, Spark uses Murmur Hashing. Any ideas or resources?
1
vote
2 answers

Hive | Create partition on a date

I need to create an external Hive table on top of a CSV file. The CSV has col1, col2, col3 and col4. My external Hive table should be partitioned by month, but the CSV file doesn't have a month field; col1 is a date field. How can I do this?
user13516187
1
vote
1 answer

How can we drop a HIVE table with its underlying file structure, without corrupting another table under the same path?

Assume we have 2 Hive tables created under the same HDFS file path. I want to be able to drop a table WITH its HDFS files, without corrupting the other table that's in the same shared path. By doing the following: drop table…
GeoSal
1
vote
1 answer

msck repair on a big table takes a very long time

I have a daily ingestion of data into HDFS. From the data in HDFS I generate Hive tables partitioned by date and another column. One day has 130G of data. After generating the data, I run msck repair. Now every msck run takes more than 2 hours. In my mind,…
Gary Wang
1
vote
1 answer

Hive partition column

We have an Avro partitioned table in Hive. When we query the table, the partition column is displayed at the end. Is there any way to display the partition column first? Eg: select * from tablea Output: Col1 col2 partition_column Expected…
user11069271
1
vote
1 answer

Hive table deduplication across multiple partitions

I am trying to deduplicate a table that may have duplicates across partitions. For example id device_id os country unix_time app_id dt 2 2 3a UK 7 5 2019-12-22 1 2 3a USA 4 5 …
1
vote
0 answers

How to create partitioned and bucketed external table in hive with delta directories?

I created a partitioned and bucketed table in Hive by merging many files. For some reason, that table cannot be accessed from Hive, maybe its metadata is lost, though the data is there along with partitions, delta directories and buckets. I have…
Ayaz49
1
vote
2 answers

get number of partitions in pyspark

I select all from a table and create a dataframe (df) out of it using PySpark. The table is partitioned as: partitionBy('date', 't', 's', 'p'). Now I want to get the number of partitions using df.rdd.getNumPartitions(), but it returns a much…
Alan
1
vote
0 answers

Pyspark: insert dataframe into partitioned hive table

Apologies if I'm being really basic here but I need a little Pyspark help trying to dynamically overwrite partitions in a hive table. Tables are drastically simplified, but the issue I'm struggling with is (I hope) clear. I'm pretty new to PySpark…
Amit
1
vote
1 answer

bash - grabbing the partitions of a hive table using grep and regex

I am trying to get the partition column names of a hive table in bash using grep and regex. I am trying this: hive -e 'show create table employees' | grep -E 'PARTITIONED BY (.*)' This is giving me the result like: PARTITIONED BY ( How do I have…
Hemanth
1
vote
1 answer

Unable to create Hive unique partitions

I am unable to create unique partitions. When I upload data, it creates all the dates as partitions again and again, even when the dates are the same. create table product_order1(id int,user_id int,amount int,product string, city string, txn_date…
Priyanka
1
vote
1 answer

Performance of Group By on Partition Column in Hive

I have a table with 4 columns with col4 as the partition column in Hive. This is a huge table with ~9M rows inserted every 5 hours. I have a restriction that I cannot change the design of this table as it is used for other reports as well. CREATE…
underwood
1
vote
2 answers

Deletion of Partitions

I am not able to drop a partition in a Hive table. ALTER TABLE db.table drop if exists partition(dt="****-**-**/id=**********"); OK Time taken: 0.564 seconds But the partitions are not getting deleted. Below is what I get when I check the partitions of my…
1
vote
2 answers

INSERT OVERWRITE PARTITION () checks if partition exists

I want to check whether a certain partition already exists before running "insert overwrite" on it. I only need to insert when that partition does not exist. How to modify this query? INSERT OVERWRITE TABLE myname.mytable PARTITION (ds='2019-07-19')
daydayup
1
vote
1 answer

How to undo ALTER TABLE ... ADD PARTITION without deleting data

Let's suppose I have two hive tables, table_1 and table_2. I use: ALTER TABLE table_2 ADD PARTITION (col=val) LOCATION [table_1_location] Now, table_2 will have the data in table_1 at the partition where col = val. What I want to do is reverse this…
allen kim