I have bucketized a dataframe with bucketBy and saveAsTable.
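For reference, the write was done roughly like this (the bucket count, column name, and table name here are illustrative, not my exact code):

// Write bucketed by column "a" into the metastore table "tab1".
// bucketBy requires saveAsTable, since the bucketing spec is stored in the metastore.
df.write
  .bucketBy(4, "a")
  .sortBy("a") // optional: keeps rows sorted within each bucket
  .format("parquet")
  .saveAsTable("tab1")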
If I load it with spark.read.parquet, I don't benefit from the bucketing optimization (i.e. the shuffle is not avoided):
scala> spark.read.parquet("${spark-warehouse}/tab1").groupBy("a").count.explain(true)
== Physical Plan ==
*HashAggregate(keys=[a#35117], functions=[count(1)], output=[a#35117, count#35126L])
+- Exchange hashpartitioning(a#35117, 200)
+- *HashAggregate(keys=[a#35117], functions=[partial_count(1)], output=[a#35117, count#35132L])
+- *FileScan parquet [a#35117] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/Users/yann.moisan/projects/teads/data/spark-warehouse/tab1], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<a:int>
I need to load it with spark.table to benefit from the optimization:
scala> spark.table("tab1").groupBy("a").count().explain(true)
== Physical Plan ==
*HashAggregate(keys=[a#149], functions=[count(1)], output=[a#149, count#35140L])
+- *HashAggregate(keys=[a#149], functions=[partial_count(1)], output=[a#149, count#35146L])
+- *FileScan parquet default.tab1[a#149] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/Users/yann.moisan/projects/teads/data/spark-warehouse/tab1], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<a:int>
I don't understand why Spark does not automatically detect the bucketing in the first case, for example by using the filename, which is slightly different in this case: part-00007-ca117fc2-2552-4693-b6f7-6b27c7c4bca7_00001.snappy.parquet (presumably the _00001 suffix is the bucket id)?
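For what it's worth, the bucketing spec seems to live only in the metastore metadata rather than in the parquet files themselves; it can be inspected with something like:

// Show the table's metastore metadata; the bucketing spec appears
// under "Num Buckets" and "Bucket Columns" in the output.
spark.sql("DESCRIBE FORMATTED tab1").show(100, truncate = false)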