I have a dataset that I want to partition by a particular key (clientID), but some clients produce far, far more data than others. Hive has a feature called "ListBucketing", invoked by "skewed by", that exists specifically to deal with this situation.
However, I cannot find any indication that Spark supports this feature or, if it does, how to make use of it.
Is there an equivalent Spark feature? Or does Spark have some other set of features by which this behavior can be replicated?
(As a bonus - and a requirement for my actual use case - does your suggested method work with Amazon Athena?)