Spark SQL partition awareness querying hive table

Question

Given a Hive table partitioned by some_field (of int type), with data stored as Avro files, I want to query the table using Spark SQL so that the returned DataFrame is already partitioned by some_field (the column used for partitioning).

The query is simply:

SELECT * FROM some_table

By default Spark doesn't do that: the returned data_frame.rdd.partitioner is None.

One way to get this result is to repartition explicitly after querying, but there is probably a better solution.

HDP 2.6, Spark 2.

Thanks.

I think you are talking about two separate things, Hive partitioning and Dataset partitioning, and the two are completely independent. Follow this [link](https://stackoverflow.com/questions/44222307/spark-rdd-default-number-of-partitions) to read about RDD/Dataset partitioning. — Rahul Sharma, Nov 08 '17 at 16:45
Of course they are independent, but as long as the execution engine cannot utilize the underlying storage partitioning, the latter is useless. Thanks for the link. — Valentin P., Nov 08 '17 at 17:23

zero323 · Answer 1 · 2017-11-08T16:51:00.730

First of all, you have to distinguish between the partitioning of a Dataset and the partitioning of the converted RDD[Row]. Whatever the execution plan of the former, the latter won't have a Partitioner:

scala> val df = spark.range(100).repartition(10, $"id")
df: org.apache.spark.sql.Dataset[Long] = [id: bigint]

scala> df.rdd.partitioner
res1: Option[org.apache.spark.Partitioner] = None

However, the internal RDD might have a Partitioner:

scala> df.queryExecution.toRdd.partitioner
res2: Option[org.apache.spark.Partitioner] = Some(org.apache.spark.sql.execution.CoalescedPartitioner@5a05e0f3)

This, however, is unlikely to help you here because, as of today (Spark 2.2), the Data Source API is not aware of physical storage information (with the exception of simple partition pruning). This should change in the upcoming Data Source API. Please refer to the JIRA ticket (SPARK-15689) and the design document for details.
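
Until then, the explicit repartitioning mentioned in the question is probably the practical fallback. A minimal sketch, not a definitive recipe: the in-memory DataFrame below is only a stand-in for `SELECT * FROM some_table`, and the local master, the app name, the `byField` name and the partition count of 8 are all assumptions made for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Assumption: a local session for illustration; in spark-shell this
// simply returns the already existing `spark` session.
val spark = SparkSession.builder()
  .appName("repartition-sketch")
  .master("local[*]")
  .getOrCreate()

// Hypothetical stand-in for spark.sql("SELECT * FROM some_table"):
// an int column named some_field plays the role of the Hive partition column.
val df = spark.range(100)
  .withColumn("some_field", (col("id") % 5).cast("int"))

// Explicit hash repartitioning by the column; 8 is an arbitrary choice.
val byField = df.repartition(8, col("some_field"))

// The converted RDD[Row] still reports no Partitioner, as shown above,
// but rows with equal some_field values now share a partition.
println(byField.rdd.partitioner)
println(byField.rdd.getNumPartitions)

// The internal RDD is the place where a Partitioner may be visible.
println(byField.queryExecution.toRdd.partitioner)
```

Note that this costs a full shuffle, which is exactly the overhead the question hopes to avoid; it does not make Spark reuse the Hive layout.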
