1

Request:- How can I insert partition key pair into each parquet file while inserting data into Hive/Impala table.

Hive Table DDL [ create external table db.tbl_name ( col1 string, col2 string) Partitioned BY (date_col string) STORED AS parquet LOCATION 'hdfs_path/db/tbl_name' ]

Let's insert data into this hive table.

INSERT INTO db.tbl_name PARTITION (date_col=2020-07-26) VALUES ('test1_col1','test1_col2')

Once records get inserted, let's view data into parquet file using parquet-tools or any other tool.

parquet-tool cat hdfs_path/db/tbl_name/date_col=2020-07-26/parquet_file.parquet

Below would be the view.

**********************
col1 = test1_col1 
col2 = test1_col2
**********************

However, if I fire following HQL query on Hive/Impala, then it will read partition value from metadata.

**Query**- select * from db.tbl_name
**Result** -
col1        col2        date_col
test1_col1  test1_col2  2020-07-26

Question- Is there any way, where we can view partition columnn name and value in parquet file like below.


col1 = test1_col1 
col2 = test1_col2 
date_col = 2020-07-26

Peace_Dude
  • 11
  • 3
  • Hi @HappyCoder, You should show us the `CREATE TABLE` statement and a `DATA SAMPLE` to figure out what is happening. – Chema Jul 23 '20 at 11:16
  • 1
    Partition columns are not supposed to be **inside** parquet files, they are persisted externally as HDFS directories, e.g. `hdfs://...//=/part-xxx-file.parquet`. – mazaneicha Jul 25 '20 at 13:23
  • Hi @chema - Edited the question and explained each step. – Peace_Dude Jul 26 '20 at 08:20
  • 2
    Hi @mazaneicha - I want them inside parquet file as well. How can I achieve this ? Is there any property behind this ? – Peace_Dude Jul 26 '20 at 08:20
  • 1
    You'll need to add `data_col_2` as a non-partition column, and have it contain the same date as `data_col`. There is no other way afaik. – mazaneicha Jul 26 '20 at 12:41
  • Hi, I would do as @mazaneicha suggest. – Chema Jul 27 '20 at 10:48
  • @mazaneicha and Chema - I had this idea, but it's not feasible in our environment. However, I feel like, there should be a way supported by Hive/Impala. Anyways, Thanks for your help and support. – Peace_Dude Jul 27 '20 at 17:20

1 Answers1

0

Please use this -
INSERT INTO db.tbl_name PARTITION (date_col) VALUES ('test1_col1','test1_col2','2020-07-26');

Always mention partition name inside brackets() like above. And then in the values/select clause, order the partition column in the end.
Thats all you need to insert into hive/impala partitioned table.

Koushik Roy
  • 6,868
  • 2
  • 12
  • 33
  • Insert into Hive is not an issue. However, insert or display partition column name and value into parquet (data) file is a challenge. – Peace_Dude Aug 11 '20 at 15:20
  • I'm sorry, how Hive table definition would help? – Peace_Dude Aug 12 '20 at 08:53
  • When you insert data into Hive/Impala partitioned table, partitioned column name and value store in Hive Metastore and HDFS directory gets created, and underlying data (parquet) files doesn't store those partition column and value like other column. Challenge is to store them in parquet (data) file. – Peace_Dude Aug 12 '20 at 08:55
  • i was trying to answer your question - " Is there any way, where we can view partition columnn name and value in parquet file like below." – Koushik Roy Aug 12 '20 at 14:25
  • Okay! Table definition would only give us partition column name and datatype, but it will never keep their values (row). – Peace_Dude Aug 13 '20 at 10:09