We have a very large Hadoop dataset containing more than a decade of historical transaction data: 6.5B rows and counting. It is currently partitioned on year and month.
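For concreteness, the layout is roughly this (column names are simplified/made up; the real table is much wider):

    -- Simplified sketch of the current table; the real column list is much longer
    CREATE TABLE transactions (
      txn_id       BIGINT,
      customer_id  BIGINT,
      txn_ts       TIMESTAMP,
      amount       DECIMAL(18,2)
    )
    PARTITIONED BY (year INT, month INT)
    STORED AS PARQUET;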
Performance is poor for a number of reasons. Nearly all of our queries could be further qualified by customer_id as well, but we have 500 customers and that number is growing quickly. Even when we narrow a query to a given month, we still have to scan every record in that month just to find the rows for one customer. The data is already stored as Parquet, so the main performance problem is not about reading every column of every record.
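A representative query (again with made-up names) looks like this; the year/month predicates prune partitions, but the customer_id predicate is only applied row by row across everything that survives pruning:

    -- year/month prune partitions; customer_id is filtered only during the scan
    SELECT txn_id, txn_ts, amount
    FROM transactions
    WHERE year = 2016
      AND month = 6
      AND customer_id = 12345;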
We hesitated to add a partition on customer because, with roughly 120 year-month partitions and 500 customers in each, that would produce about 60K partitions, which is more than the Hive metastore can handle effectively. We also hesitated to partition only on customer_id because some customers are huge and others tiny, so we have a natural data skew.
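To be explicit, the option we backed away from was adding customer_id as a third partition column, roughly:

    -- Rejected: ~120 year/month partitions x ~500 customers ~= 60K partitions
    CREATE TABLE transactions_by_customer (
      txn_id  BIGINT,
      txn_ts  TIMESTAMP,
      amount  DECIMAL(18,2)
    )
    PARTITIONED BY (year INT, month INT, customer_id BIGINT)
    STORED AS PARQUET;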
Ideally, we would partition historical data, which is queried far less frequently, by one rule (perhaps year + customer_id) and current data by another (such as year/month + customer_id). We have considered splitting this into multiple datasets, but managing that over time seems like more work, more moving parts, and so on.
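The multiple-dataset variant we sketched would look something like the following (illustrative only, table and view names made up; this is exactly the kind of ongoing management we would rather not take on):

    -- Coarse partitions for history, finer ones for current data, stitched with a view
    CREATE TABLE transactions_history (
      txn_id  BIGINT,
      txn_ts  TIMESTAMP,
      amount  DECIMAL(18,2)
    )
    PARTITIONED BY (year INT, customer_id BIGINT)
    STORED AS PARQUET;

    CREATE TABLE transactions_current (
      txn_id  BIGINT,
      txn_ts  TIMESTAMP,
      amount  DECIMAL(18,2)
    )
    PARTITIONED BY (year INT, month INT, customer_id BIGINT)
    STORED AS PARQUET;

    CREATE VIEW transactions_all AS
    SELECT txn_id, txn_ts, amount, year, CAST(NULL AS INT) AS month, customer_id
    FROM transactions_history
    UNION ALL
    SELECT txn_id, txn_ts, amount, year, month, customer_id
    FROM transactions_current;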
Are there strategies or capabilities in Hive that can handle a case like this, where we "want" lots of partitions for performance but are limited by the metastore?
I am also confused about the benefits of bucketing. Bucketing on customer_id, for example, would seem to help in a similar way to partitioning. Yet Hortonworks "strongly recommends against" buckets (with no explanation why). Several other pages suggest bucketing is mainly useful for sampling, and another good discussion of bucketing from Hortonworks indicates that Hive cannot prune buckets the way it prunes partitions.
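For reference, the bucketed layout I have in mind is something like this (the bucket count is a guess, not a tuned number):

    -- Rows are hash-distributed on customer_id into a fixed number of files per
    -- partition; as I understand it, a customer_id predicate does not get
    -- partition-style pruning out of this.
    -- On Hive < 2.0, inserts also need: SET hive.enforce.bucketing = true;
    CREATE TABLE transactions_bucketed (
      txn_id       BIGINT,
      customer_id  BIGINT,
      txn_ts       TIMESTAMP,
      amount       DECIMAL(18,2)
    )
    PARTITIONED BY (year INT, month INT)
    CLUSTERED BY (customer_id) INTO 64 BUCKETS
    STORED AS PARQUET;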
We're on a recent version of Hive/Hadoop (moving from CDH 5.7 to AWS EMR).