I understand that when you create ORC tables, it will improve the speed dramatically. However, can we improve it further by partitioning and bucketing an ORC table? If so, how to do partitioning and bucketing in an existing ORC table?
2 Answers
You can bucket and partition an ORC table.
Partitions are directly mapped to directories in HDFS. You can ALTER TABLE and add partition. You'd have to do partition recovery after thou. Everything is well explained here: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-AlterPartition.
Personally I'd create new table wih dynamic partitioning and copy the data to new table.

- 83
- 7
-
could you elaborate the way you do dynamic partitioning? – Seen Dec 29 '16 at 15:37
Partitioning and Bucketing are features offered to help improve query performance. In Hive, as explained by Karol, Partitioning is mapped to a hdfs directory structure and the way to partition is totally driven by the query needs and pattern. For example
customer_purchases table stores all the transactions over the past 2 - 3yrs (around 1-2 PB of data). An analyst is trying to answer "How much of sales happened during the first quarter of 2017 month-wise?".
WITHOUT PARTITION
customer_purchases table schema
transaction_id | cust_id | price_per_unit | units_purchased | invoiceDate
Sample dataset
1,CustomerId-32,3.24,91,2017-10-19
2,CustomerId-16,3.24,88,2017-10-14
3,CustomerId-3,1.96,99,2017-10-14
4,CustomerId-95,1.96,38,2017-10-17
5,CustomerId-51,1.32,39,2017-10-18
6,CustomerId-29,1.32,14,2017-10-14
7,CustomerId-15,3.88,66,2017-10-19
8,CustomerId-74,1.32,44,2017-10-17
9,CustomerId-43,3.88,22,2017-10-18
Stored as csvs in hdfs://your-nn/your-path/data*.csv
SELECT month(invoiceDate), count(*) FROM customer_purchases WHERE YEAR(invoiceDate) = '2017' AND MONTH(invoiceDate) BETWEEN 1 AND 3 GROUP BY MONTH(invoiceDate)
The above statement does an entire table scan to perform the filter (where) and aggregation (group by). This is inefficient as we are only need a fraction of the dataset.
WITH PARTITION
We could infer that the partition is more time-series based as there is a date range. In order to avoid the full-table scan, we could create a partition which is more month based. Following are the changes
customer_purchases table schema (partition column 'yr' and 'mon')
transaction_id | cust_id | price_per_unit | units_purchased | invoiceDate | mon
The same data is stored in hdfs as hdfs://your-nn/your-path///data*.csv where is yyyy format year and is any value between 1 and 12 (Jan through Dec).
with the new hdfs structure and hive table schema structure. The query would be
SELECT mon, count(*) FROM customer_purchases WHERE yr='2017' AND mon BETWEEN 1 AND 3 GROUP BY mon
The explain plan on the above query would now scan only files under yr=2017 directory and mon=1, mon=2 and mon=3 subdirectories. This is a small dataset and you will have the results returned faster.
As per ORC file format, there is nothing that would change except for the files in hdfs location would be .orc instead of .csv.
BUCKETING adds grouping the transactions in specific files.
Does that answer your question?
DYNAMIC PATITIONING helps in performing the partition automatically based on input transaction data in a table.

- 600
- 1
- 8
- 23