1

I understand that when you create ORC tables, it will improve the speed dramatically. However, can we improve it further by partitioning and bucketing an ORC table? If so, how to do partitioning and bucketing in an existing ORC table?

iPhoneJavaDev
  • 821
  • 5
  • 33
  • 78

2 Answers2

1

You can bucket and partition an ORC table.

Partitions are directly mapped to directories in HDFS. You can ALTER TABLE and add partition. You'd have to do partition recovery after thou. Everything is well explained here: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-AlterPartition.

Personally I'd create new table wih dynamic partitioning and copy the data to new table.

1

Partitioning and Bucketing are features offered to help improve query performance. In Hive, as explained by Karol, Partitioning is mapped to a hdfs directory structure and the way to partition is totally driven by the query needs and pattern. For example

customer_purchases table stores all the transactions over the past 2 - 3yrs (around 1-2 PB of data). An analyst is trying to answer "How much of sales happened during the first quarter of 2017 month-wise?".

WITHOUT PARTITION

customer_purchases table schema

transaction_id | cust_id | price_per_unit | units_purchased | invoiceDate

Sample dataset

1,CustomerId-32,3.24,91,2017-10-19
2,CustomerId-16,3.24,88,2017-10-14
3,CustomerId-3,1.96,99,2017-10-14
4,CustomerId-95,1.96,38,2017-10-17
5,CustomerId-51,1.32,39,2017-10-18
6,CustomerId-29,1.32,14,2017-10-14
7,CustomerId-15,3.88,66,2017-10-19
8,CustomerId-74,1.32,44,2017-10-17
9,CustomerId-43,3.88,22,2017-10-18

Stored as csvs in hdfs://your-nn/your-path/data*.csv

SELECT month(invoiceDate), count(*) FROM customer_purchases WHERE YEAR(invoiceDate) = '2017' AND MONTH(invoiceDate) BETWEEN 1 AND 3 GROUP BY MONTH(invoiceDate)

The above statement does an entire table scan to perform the filter (where) and aggregation (group by). This is inefficient as we are only need a fraction of the dataset.

WITH PARTITION

We could infer that the partition is more time-series based as there is a date range. In order to avoid the full-table scan, we could create a partition which is more month based. Following are the changes

customer_purchases table schema (partition column 'yr' and 'mon')

transaction_id | cust_id | price_per_unit | units_purchased | invoiceDate | mon

The same data is stored in hdfs as hdfs://your-nn/your-path///data*.csv where is yyyy format year and is any value between 1 and 12 (Jan through Dec).

with the new hdfs structure and hive table schema structure. The query would be

SELECT mon, count(*) FROM customer_purchases WHERE yr='2017' AND mon BETWEEN 1 AND 3 GROUP BY mon

The explain plan on the above query would now scan only files under yr=2017 directory and mon=1, mon=2 and mon=3 subdirectories. This is a small dataset and you will have the results returned faster.

As per ORC file format, there is nothing that would change except for the files in hdfs location would be .orc instead of .csv.

BUCKETING adds grouping the transactions in specific files.

Does that answer your question?

DYNAMIC PATITIONING helps in performing the partition automatically based on input transaction data in a table.

venBigData
  • 600
  • 1
  • 8
  • 23