
I have a dataset that I want to partition by a particular key (clientID), but some clients produce far, far more data than others. There's a feature in Hive called "list bucketing", invoked with the "SKEWED BY" clause, that exists specifically to deal with this situation.

However, I cannot find any indication that Spark supports this feature, or how (if it does support it) to make use of it.

Is there a Spark feature that is the equivalent? Or, does Spark have some other set of features by which this behavior can be replicated?

(As a bonus - and a requirement for my actual use case - does your suggested method work with Amazon Athena?)

Narfanator
  • Did you look at https://stackoverflow.com/questions/40373577/skewed-dataset-join-in-spark? – tk421 Mar 27 '19 at 21:12

1 Answer


As far as I know, there is no such out-of-the-box tool in Spark. With skewed data, a very common approach is to add an artificial column to further bucketize the data.

Let's say you want to partition by column "y", but the data is very skewed, as in this toy example (one partition with 5 rows, the others with only one row each):

import org.apache.spark.sql.functions._  // for when, floor and rand (spark-shell already provides spark.implicits._ for the 'id syntax)

val df = spark.range(8).withColumn("y", when('id < 5, 0).otherwise('id))
df.show()
+---+---+
| id|  y|
+---+---+
|  0|  0|
|  1|  0|
|  2|  0|
|  3|  0|
|  4|  0|
|  5|  5|
|  6|  6|
|  7|  7|
+---+---+

Now let's add an artificial random column and write the dataframe.

val nbOfBuckets = 3
val part_df = df.withColumn("r", floor(rand() * nbOfBuckets))
part_df.show
+---+---+---+
| id|  y|  r|
+---+---+---+
|  0|  0|  2|
|  1|  0|  2|
|  2|  0|  0|
|  3|  0|  0|
|  4|  0|  1|
|  5|  5|  2|
|  6|  6|  2|
|  7|  7|  1|
+---+---+---+

// and write: the partition with 5 elements has been divided into 3 sub-partitions
part_df.write.partitionBy("y", "r").csv("...")
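
When reading back, Spark's partition discovery recovers `y` and `r` from the directory names; a minimal sketch, assuming the same "..." base path as the write above and default CSV options (no header):

val read_back = spark.read.csv("...")   // y and r are recovered as partition columns
read_back.printSchema()                 // id comes back as the untyped string column _c0 (CSV carries no schema)
read_back.drop("r").show()              // the artificial bucketing column can simply be dropped downstream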
Oli
  • That would bucket the data (which is already doable; Spark has that as a feature). But you could just replace the column expression with something more sophisticated, e.g. `(val in LIST) ? val : "others"`, and that would do the trick. – Narfanator Mar 27 '19 at 21:38
  • Should also work in Athena, although I'd need to use, say, `clientPartition` and keep `clientID` as a regular column to fake the `skewed by` functionality (see the sketch after these comments). – Narfanator Mar 27 '19 at 21:39
  • Impressive solution @Oli. What about the randomness of rand()? Would parallel execution increase the possibility of collisions, i.e. many identical values, compared with a single-node (coalesce(1)) execution? – abiratsis Mar 28 '19 at 16:23
  • SparkSQL functions are meant to be called on distributed datasets and, according to the documentation, the samples are i.i.d. Therefore, although I do not know the details of the implementation (so I could be wrong), I would say that it works fine even without `coalesce(1)`. – Oli Mar 28 '19 at 16:29
  • I mean this [issue](https://stackoverflow.com/questions/4253500/uniform-distribution-with-random), but for a distributed system like Spark. How can we guarantee a uniform distribution across different nodes? I guess we can't at the moment, since different instances of Random cannot guarantee that. Built-in functionality might solve it somehow, but performance would probably be a blocking factor for such a feature, I guess. – abiratsis Mar 29 '19 at 09:31
  • You're right, I understand what you mean. Actually, if you use one instance of Random per partition, or worse, per row, the samples won't be i.i.d. The `SparkSQL` function, however, is meant exactly for that purpose and will give you satisfactory results. – Oli Mar 29 '19 at 09:34
  • OK @Oli :) for some reason I missed that; just checked and you are right, it already has that. – abiratsis Mar 30 '19 at 19:12
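
Putting the comments above together, a minimal sketch of that idea (the `bigClients` list, the `clientPartition` column name and the input dataframe `clientDf` with a `clientID` column are all hypothetical here): heavy clients get their own partition, everyone else is grouped under "others", and `clientID` stays available as a regular column.

import org.apache.spark.sql.functions._

// hypothetical list of the clients known to produce far more data than the rest
val bigClients = Seq("clientA", "clientB")

// derive a dedicated partition column while keeping clientID as a normal column,
// so engines like Athena can still filter on clientID itself
val out = clientDf.withColumn(
  "clientPartition",
  when(col("clientID").isin(bigClients: _*), col("clientID")).otherwise(lit("others")))

out.write.partitionBy("clientPartition").csv("...")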