4

I have bucketed a DataFrame, i.e. written it out with bucketBy and saveAsTable.
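For reference, the table was presumably created along these lines (a minimal sketch; df and the bucket count of 4 are assumptions, only the column a comes from the plans below):

// Sketch of how tab1 was presumably written; the bucket count (4) is an
// assumption, "a" is the bucketing/grouping column seen in the plans below.
df.write
  .bucketBy(4, "a")
  .saveAsTable("tab1")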

If I load it with spark.read.parquet, I don't benefit from the optimization (the shuffle is not avoided):

scala> spark.read.parquet("${spark-warehouse}/tab1").groupBy("a").count.explain(true)
== Physical Plan ==
*HashAggregate(keys=[a#35117], functions=[count(1)], output=[a#35117, count#35126L])
+- Exchange hashpartitioning(a#35117, 200)
   +- *HashAggregate(keys=[a#35117], functions=[partial_count(1)], output=[a#35117, count#35132L])
      +- *FileScan parquet [a#35117] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/Users/yann.moisan/projects/teads/data/spark-warehouse/tab1], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<a:int>

I have to load it with spark.table to benefit from the optimization:

scala> spark.table("tab1").groupBy("a").count().explain(true)
== Physical Plan ==
*HashAggregate(keys=[a#149], functions=[count(1)], output=[a#149, count#35140L])
+- *HashAggregate(keys=[a#149], functions=[partial_count(1)], output=[a#149, count#35146L])
   +- *FileScan parquet default.tab1[a#149] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/Users/yann.moisan/projects/teads/data/spark-warehouse/tab1], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<a:int>

I don't understand why Spark does not automatically detect the bucketing in the first case, for example by using the filename, which is slightly different in this case: part-00007-ca117fc2-2552-4693-b6f7-6b27c7c4bca7_00001.snappy.parquet?

Jacek Laskowski
Yann Moisan

1 Answer

3

I don't understand why Spark does not automatically detect the bucketing in the first case

Simple. There is no support for bucketing on DataFrames that are not loaded as bucketed tables using spark.table.
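The bucket spec (number of buckets and bucketing columns) is stored as table metadata in the catalog/metastore, not in the Parquet files themselves, so spark.read.parquet has nothing to pick it up from. One way to confirm the metadata is attached to the table (a minimal sketch, reusing the tab1 table from the question):

// The bucket spec lives in the catalog, not in the Parquet files.
// "Num Buckets" and "Bucket Columns" appear in the detailed description.
spark.sql("DESCRIBE EXTENDED tab1").show(numRows = 100, truncate = false)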

Jacek Laskowski
  • 1
    It is never too late for an answer ;) – T. Gawęda Apr 18 '18 at 21:32
  • 2
    It's only now when I found time and courage to explore bucketing support in Spark SQL in more details --> https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-bucketing.html – Jacek Laskowski Apr 18 '18 at 22:16
  • @androboy Recommend asking a separate question to get the best "exposure". You could link the question here. – Jacek Laskowski Nov 17 '18 at 18:21
  • Thanks Jacek Laskowski! Here is my complete question - https://stackoverflow.com/questions/53398930/spark-clustered-by-bucket-by-dataset-not-using-memory – androboy Nov 26 '18 at 07:31