We have a very large Hadoop dataset containing more than a decade of historical transaction data: 6.5B rows and counting. It is currently partitioned on year and month.
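For concreteness, the layout is roughly this (column names are simplified/made up; the real table is much wider):

    -- Simplified sketch of the current table; the real column list is much longer
    CREATE TABLE transactions (
      txn_id       BIGINT,
      customer_id  BIGINT,
      txn_ts       TIMESTAMP,
      amount       DECIMAL(18,2)
    )
    PARTITIONED BY (year INT, month INT)
    STORED AS PARQUET;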
Performance is poor for a number of reasons. Nearly all of our queries could be further qualified by customer_id as well, but we have 500 customers and that number is growing quickly. Even when we narrow a query to a given month, we still have to scan every record in that month just to find the rows for one customer. The data is already stored as Parquet, so the main performance problem is not about reading every column of every record.
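A representative query (again with made-up names) looks like this; the year/month predicates prune partitions, but the customer_id predicate is only applied row by row across everything that survives pruning:

    -- year/month prune partitions; customer_id is filtered only during the scan
    SELECT txn_id, txn_ts, amount
    FROM transactions
    WHERE year = 2016
      AND month = 6
      AND customer_id = 12345;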
We hesitated to add a partition on customer because, with roughly 120 year-month partitions and 500 customers in each, that would produce about 60K partitions, which is more than the Hive metastore can handle effectively. We also hesitated to partition only on customer_id because some customers are huge and others tiny, so we have a natural data skew.
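To be explicit, the option we backed away from was adding customer_id as a third partition column, roughly:

    -- Rejected: ~120 year/month partitions x ~500 customers ~= 60K partitions
    CREATE TABLE transactions_by_customer (
      txn_id  BIGINT,
      txn_ts  TIMESTAMP,
      amount  DECIMAL(18,2)
    )
    PARTITIONED BY (year INT, month INT, customer_id BIGINT)
    STORED AS PARQUET;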
Ideally, we would partition historical data, which is queried far less frequently, by one rule (perhaps year + customer_id) and current data by another (such as year/month + customer_id). We have considered splitting this into multiple datasets, but managing that over time seems like more work, more moving parts, and so on.
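The multiple-dataset variant we sketched would look something like the following (illustrative only, table and view names made up; this is exactly the kind of ongoing management we would rather not take on):

    -- Coarse partitions for history, finer ones for current data, stitched with a view
    CREATE TABLE transactions_history (
      txn_id  BIGINT,
      txn_ts  TIMESTAMP,
      amount  DECIMAL(18,2)
    )
    PARTITIONED BY (year INT, customer_id BIGINT)
    STORED AS PARQUET;

    CREATE TABLE transactions_current (
      txn_id  BIGINT,
      txn_ts  TIMESTAMP,
      amount  DECIMAL(18,2)
    )
    PARTITIONED BY (year INT, month INT, customer_id BIGINT)
    STORED AS PARQUET;

    CREATE VIEW transactions_all AS
    SELECT txn_id, txn_ts, amount, year, CAST(NULL AS INT) AS month, customer_id
    FROM transactions_history
    UNION ALL
    SELECT txn_id, txn_ts, amount, year, month, customer_id
    FROM transactions_current;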
Are there strategies or capabilities in Hive that can handle a case like this, where we "want" lots of partitions for performance but are limited by the metastore?
I am also confused about the benefits of bucketing. Bucketing on customer_id, for example, would seem to help in a similar way to partitioning. Yet Hortonworks "strongly recommends against" buckets (with no explanation why). Several other pages suggest bucketing is mainly useful for sampling, and another good discussion of bucketing from Hortonworks indicates that Hive cannot prune buckets the way it prunes partitions.
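For reference, the bucketed layout I have in mind is something like this (the bucket count is a guess, not a tuned number):

    -- Rows are hash-distributed on customer_id into a fixed number of files per
    -- partition; as I understand it, a customer_id predicate does not get
    -- partition-style pruning out of this.
    -- On Hive < 2.0, inserts also need: SET hive.enforce.bucketing = true;
    CREATE TABLE transactions_bucketed (
      txn_id       BIGINT,
      customer_id  BIGINT,
      txn_ts       TIMESTAMP,
      amount       DECIMAL(18,2)
    )
    PARTITIONED BY (year INT, month INT)
    CLUSTERED BY (customer_id) INTO 64 BUCKETS
    STORED AS PARQUET;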
We're on a recent version of Hive/Hadoop (moving from CDH 5.7 to AWS EMR).