I was going through the blog post published on the Databricks website about the Cost Based Optimizer (CBO) that was introduced in Spark 2.2.
It mentions that the cost of a query plan is computed based on the formula:
cost = weight * cardinality + (1.0 - weight) * size
My assumption is that cardinality is based on the joins and size is the total number of rows returned.
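As code, my reading of the formula would be something like the sketch below. I am assuming that cardinality and size correspond to the rowCount and sizeInBytes values reported in the plan statistics, and that weight is the value of spark.sql.cbo.joinReorder.card.weight (default 0.7, as far as I know); both of these are assumptions on my part, not something the blog confirms.

// Hypothetical per-node cost, assuming cardinality = rowCount and size = sizeInBytes.
// The weight default of 0.7 is assumed from spark.sql.cbo.joinReorder.card.weight.
def nodeCost(rowCount: BigInt, sizeInBytes: BigInt, weight: Double = 0.7): Double =
  weight * rowCount.toDouble + (1.0 - weight) * sizeInBytes.toDouble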
For example, if I run the following query in Spark:
import org.apache.spark.sql.execution.QueryExecution
import org.apache.spark.sql.catalyst.plans.logical.Statistics

val queryStmt = "select * from maha a, maha b where a.county=b.county and a.county='KINGS'"
val exec: QueryExecution = session.sql(queryStmt).queryExecution
val stats: Statistics = exec.optimizedPlan.stats   // statistics of the root (Join) node
println(exec.stringWithStats)                      // plans annotated with per-node statistics
OUTPUT:
== Optimized Logical Plan ==
Join Inner, (county#7469 = county#7474), Statistics(sizeInBytes=420.2 MB, rowCount=3.50E+6, hints=none)
:- Filter (isnotnull(county#7469) && (county#7469 = KINGS)), Statistics(sizeInBytes=122.4 KB, rowCount=1.87E+3, hints=none)
:  +- Relation[Year#7467,FirstName#7468,County#7469,Sex#7470,Count#7471] parquet, Statistics(sizeInBytes=15.0 MB, rowCount=2.36E+5, hints=none)
+- Filter ((county#7474 = KINGS) && isnotnull(county#7474)), Statistics(sizeInBytes=122.4 KB, rowCount=1.87E+3, hints=none)
   +- Relation[Year#7472,FirstName#7473,County#7474,Sex#7475,Count#7476] parquet, Statistics(sizeInBytes=15.0 MB, rowCount=2.36E+5, hints=none)
== Physical Plan ==
*(2) BroadcastHashJoin [county#7469], [county#7474], Inner, BuildRight
:- *(2) Project [Year#7467, FirstName#7468, County#7469, Sex#7470, Count#7471]
:  +- *(2) Filter (isnotnull(county#7469) && (county#7469 = KINGS))
:     +- *(2) FileScan parquet default.maha[Year#7467,FirstName#7468,County#7469,Sex#7470,Count#7471] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/C:/dev/query-analyzer/spark-warehouse/maha], PartitionFilters: [], PushedFilters: [IsNotNull(County), EqualTo(County,KINGS)], ReadSchema: struct<Year:int,FirstName:string,County:string,Sex:string,Count:int>
+- BroadcastExchange HashedRelationBroadcastMode(List(input[2, string, true]))
   +- *(1) Project [Year#7472, FirstName#7473, County#7474, Sex#7475, Count#7476]
      +- *(1) Filter ((county#7474 = KINGS) && isnotnull(county#7474))
         +- *(1) FileScan parquet default.maha[Year#7472,FirstName#7473,County#7474,Sex#7475,Count#7476] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/C:/dev/query-analyzer/spark-warehouse/maha], PartitionFilters: [], PushedFilters: [EqualTo(County,KINGS), IsNotNull(County)], ReadSchema: struct<Year:int,FirstName:string,County:string,Sex:string,Count:int>
How should we calculate the cost of the query plan? Should we sum up the stats of the intermediate steps, or just take the stats from the final step and calculate the cost from those?
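To make the two alternatives concrete, here is a sketch of what I mean, reusing the hypothetical nodeCost helper from above (purely illustrative, not something I know to match Spark's internal computation):

val root = exec.optimizedPlan

// Alternative 1: cost from the final (root) node only
val rootCost = nodeCost(root.stats.rowCount.getOrElse(BigInt(0)), root.stats.sizeInBytes)

// Alternative 2: sum the per-node costs over all intermediate steps in the plan
val summedCost = root.collect {
  case plan => nodeCost(plan.stats.rowCount.getOrElse(BigInt(0)), plan.stats.sizeInBytes)
}.sum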
Spark version: 2.3.0