
We are facing slow performance with Power BI DirectQuery reading data from Delta Lake in Azure Databricks.

We are trying to refresh some Power BI dashboards that read data from Delta tables. The models are created in Power BI Desktop, and DirectQuery is used on the Power BI side so that the desired partitions are queried automatically. The underlying Delta tables are partitioned on date keys, and the generated plan shows broadcast hash joins and proper use of partitions.
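For reference, the layout is roughly as follows (a simplified sketch, not our actual DDL; most columns omitted):

-- Simplified sketch of the layout described above: a Delta fact table
-- partitioned on an integer yyyyMMdd date key. Column list is illustrative.
CREATE TABLE silver.factqueuemanagementperuser (
  UserId               STRING,
  MeasuredTimeDateTime TIMESTAMP,
  dimMeasureDatekey    BIGINT,  -- yyyyMMdd date key, also the partition column
  dimLocationKey       BIGINT,
  PersonCount          INT
)
USING DELTA
PARTITIONED BY (dimMeasureDatekey);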

The JDBC/ODBC tab in the Spark UI shows sessions fired from Power BI when a visual is refreshed, and after two or three consecutive refreshes the performance improves.

But once there is a gap of 5+ minutes, performance degrades again, and the session executions show steps like "Listing Cross References".

It seems some sort of plan is purged from memory, which makes the next query slower.

Can someone provide insights into what is happening here?

EDIT 1: Added a screenshot from the Spark UI (JDBC/ODBC tab).


EDIT 2: Added the query plan.


== Parsed Logical Plan ==
Project [col_0#407547 AS User ID#407618, tohivestring(col_1)#407567 AS Measured Time Timestamp#407619, col_2#407549L AS Measured Date Key#407620L, col_3#407550L AS Measured Time Key#407621L, col_4#407551L AS Location Key#407622L, col_5#407552L AS Measured Time Banding Key#407623L, col_6#407565 AS Measured Time in Secs#407624, col_7#407566 AS Measured Time Duration In Secs#407625, col_8#407555 AS Person Count#407626]
+- Project [col_0#407547, tohivestring(col_1#407548, Some(Etc/UTC)) AS tohivestring(col_1)#407567, col_2#407549L, col_3#407550L, col_4#407551L, col_5#407552L, cast(col_6#407553 as string) AS col_6#407565, cast(col_7#407554 as string) AS col_7#407566, col_8#407555]
+- Project [User ID#407422 AS col_0#407547, Measured Time Timestamp#407423 AS col_1#407548, Measured Date Key#407424L AS col_2#407549L, Measured Time Key#407425L AS col_3#407550L, Location Key#407426L AS col_4#407551L, Measured Time Banding Key#407427L AS col_5#407552L, Measured Time in Secs#407428 AS col_6#407553, Measured Time Duration In Secs#407429 AS col_7#407554, Person Count#407430 AS col_8#407555]
+- Project [User ID#407422, Measured Time Timestamp#407423, Measured Date Key#407424L, Measured Time Key#407425L, Location Key#407426L, Measured Time Banding Key#407427L, Measured Time in Secs#407428, Measured Time Duration In Secs#407429, Person Count#407430]
+- Filter ((cast(Measured Date Key#407424L as double) >= 2.0230822E7) AND (cast(Measured Date Key#407424L as double) < 2.0230823E7))
+- SubqueryAlias spark_catalog.silver.viewqueuemanagementperuserarriving
+- View (hive_metastore.silver.viewqueuemanagementperuserarriving, [User ID#407422,Measured Time Timestamp#407423,Measured Date Key#407424L,Measured Time Key#407425L,Location Key#407426L,Measured Time Banding Key#407427L,Measured Time in Secs#407428,Measured Time Duration In Secs#407429,Person Count#407430])
+- Project [cast(User ID#407413 as string) AS User ID#407422, cast(Measured Time Timestamp#407414 as timestamp) AS Measured Time Timestamp#407423, cast(Measured Date Key#407415L as bigint) AS Measured Date Key#407424L, cast(Measured Time Key#407416L as bigint) AS Measured Time Key#407425L, cast(Location Key#407417L as bigint) AS Location Key#407426L, cast(Measured Time Banding Key#407418L as bigint) AS Measured Time Banding Key#407427L, cast(Measured Time in Secs#407419 as decimal(10,0)) AS Measured Time in Secs#407428, cast(Measured Time Duration In Secs#407420 as decimal(10,0)) AS Measured Time Duration In Secs#407429, cast(Person Count#407421 as int) AS Person Count#407430]
+- Project [UserId#407488 AS User ID#407413, cast(from_unixtime(unix_timestamp(MeasuredTimeDateTime#407489, yyyy-MM-dd HH:mm:ss.SSS, Some(Etc/UTC), false), yyyy-MM-dd HH:mm, Some(Etc/UTC)) as timestamp) AS Measured Time Timestamp#407414, dimMeasureDatekey#407490L AS Measured Date Key#407415L, dimMeasureTimekey#407491L AS Measured Time Key#407416L, dimLocationKey#407493L AS Location Key#407417L, dimMeasureTimeBandingKey#407494L AS Measured Time Banding Key#407418L, MeasureTimeInSecs#407495 AS Measured Time in Secs#407419, MeasureTimeDurationInSecs#407497 AS Measured Time Duration In Secs#407420, PersonCount#407500 AS Person Count#407421]
+- Filter (Function#407513 = Immigration)
+- Join Inner, (dimLocationKey#407493L = dimLocationKey#407508L)
   :- SubqueryAlias Q
   :  +- SubqueryAlias spark_catalog.silver.factqueuemanagementperuser
   :     +- Relation spark_catalog.silver.factqueuemanagementperuser[DateTimeCreated#407480,CreatedByUser#407481,DateTimeLastModified#407482,ModifiedByUser#407483,dimAirportKey#407484L,factQPerUserKey#407485L,QAreaId#407486L,QAreaName#407487,UserId#407488,MeasuredTimeDateTime#407489,dimMeasureDatekey#407490L,dimMeasureTimekey#407491L,RouteStartDateTime#407492,dimLocationKey#407493L,dimMeasureTimeBandingKey#407494L,MeasureTimeInSecs#407495,MeasureTimeInMins#407496,MeasureTimeDurationInSecs#407497,MeasureTimeNoFilterInSecs#407498,MeasureTimeNoFilterInMins#407499,PersonCount#407500,Outcome#407501,dimTerminalKey#407502L,BatchId#407503L] parquet
   +- SubqueryAlias L
      +- SubqueryAlias spark_catalog.silver.dimlocation
         +- Relation spark_catalog.silver.dimlocation[DateTimeCreated#407504,CreatedByUser#407505,DateTimeLastModified#407506,ModifiedByUser#407507,dimLocationKey#407508L,Airport#407509,Terminal#407510,Side#407511,A/D#407512,Function#407513,SubArea#407514,LocationName#407515,ProcessName#407516,SystemName#407517,SystemDeviceId#407518,BatchId#407519L] parquet

== Analyzed Logical Plan ==
User ID: string, Measured Time Timestamp: string, Measured Date Key: bigint, Measured Time Key: bigint, Location Key: bigint, Measured Time Banding Key: bigint, Measured Time in Secs: string, Measured Time Duration In Secs: string, Person Count: int
Project [col_0#407547 AS User ID#407618, tohivestring(col_1)#407567 AS Measured Time Timestamp#407619, col_2#407549L AS Measured Date Key#407620L, col_3#407550L AS Measured Time Key#407621L, col_4#407551L AS Location Key#407622L, col_5#407552L AS Measured Time Banding Key#407623L, col_6#407565 AS Measured Time in Secs#407624, col_7#407566 AS Measured Time Duration In Secs#407625, col_8#407555 AS Person Count#407626]
+- Project [col_0#407547, tohivestring(col_1#407548, Some(Etc/UTC)) AS tohivestring(col_1)#407567, col_2#407549L, col_3#407550L, col_4#407551L, col_5#407552L, cast(col_6#407553 as string) AS col_6#407565, cast(col_7#407554 as string) AS col_7#407566, col_8#407555]
+- Project [User ID#407422 AS col_0#407547, Measured Time Timestamp#407423 AS col_1#407548, Measured Date Key#407424L AS col_2#407549L, Measured Time Key#407425L AS col_3#407550L, Location Key#407426L AS col_4#407551L, Measured Time Banding Key#407427L AS col_5#407552L, Measured Time in Secs#407428 AS col_6#407553, Measured Time Duration In Secs#407429 AS col_7#407554, Person Count#407430 AS col_8#407555]
+- Project [User ID#407422, Measured Time Timestamp#407423, Measured Date Key#407424L, Measured Time Key#407425L, Location Key#407426L, Measured Time Banding Key#407427L, Measured Time in Secs#407428, Measured Time Duration In Secs#407429, Person Count#407430]
+- Filter ((cast(Measured Date Key#407424L as double) >= 2.0230822E7) AND (cast(Measured Date Key#407424L as double) < 2.0230823E7))
+- SubqueryAlias spark_catalog.silver.viewqueuemanagementperuserarriving
+- View (hive_metastore.silver.viewqueuemanagementperuserarriving, [User ID#407422,Measured Time Timestamp#407423,Measured Date Key#407424L,Measured Time Key#407425L,Location Key#407426L,Measured Time Banding Key#407427L,Measured Time in Secs#407428,Measured Time Duration In Secs#407429,Person Count#407430])
+- Project [cast(User ID#407413 as string) AS User ID#407422, cast(Measured Time Timestamp#407414 as timestamp) AS Measured Time Timestamp#407423, cast(Measured Date Key#407415L as bigint) AS Measured Date Key#407424L, cast(Measured Time Key#407416L as bigint) AS Measured Time Key#407425L, cast(Location Key#407417L as bigint) AS Location Key#407426L, cast(Measured Time Banding Key#407418L as bigint) AS Measured Time Banding Key#407427L, cast(Measured Time in Secs#407419 as decimal(10,0)) AS Measured Time in Secs#407428, cast(Measured Time Duration In Secs#407420 as decimal(10,0)) AS Measured Time Duration In Secs#407429, cast(Person Count#407421 as int) AS Person Count#407430]
+- Project [UserId#407488 AS User ID#407413, cast(from_unixtime(unix_timestamp(MeasuredTimeDateTime#407489, yyyy-MM-dd HH:mm:ss.SSS, Some(Etc/UTC), false), yyyy-MM-dd HH:mm, Some(Etc/UTC)) as timestamp) AS Measured Time Timestamp#407414, dimMeasureDatekey#407490L AS Measured Date Key#407415L, dimMeasureTimekey#407491L AS Measured Time Key#407416L, dimLocationKey#407493L AS Location Key#407417L, dimMeasureTimeBandingKey#407494L AS Measured Time Banding Key#407418L, MeasureTimeInSecs#407495 AS Measured Time in Secs#407419, MeasureTimeDurationInSecs#407497 AS Measured Time Duration In Secs#407420, PersonCount#407500 AS Person Count#407421]
+- Filter (Function#407513 = Immigration)
+- Join Inner, (dimLocationKey#407493L = dimLocationKey#407508L)
   :- SubqueryAlias Q
   :  +- SubqueryAlias spark_catalog.silver.factqueuemanagementperuser
   :     +- Relation spark_catalog.silver.factqueuemanagementperuser[DateTimeCreated#407480,CreatedByUser#407481,DateTimeLastModified#407482,ModifiedByUser#407483,dimAirportKey#407484L,factQPerUserKey#407485L,QAreaId#407486L,QAreaName#407487,UserId#407488,MeasuredTimeDateTime#407489,dimMeasureDatekey#407490L,dimMeasureTimekey#407491L,RouteStartDateTime#407492,dimLocationKey#407493L,dimMeasureTimeBandingKey#407494L,MeasureTimeInSecs#407495,MeasureTimeInMins#407496,MeasureTimeDurationInSecs#407497,MeasureTimeNoFilterInSecs#407498,MeasureTimeNoFilterInMins#407499,PersonCount#407500,Outcome#407501,dimTerminalKey#407502L,BatchId#407503L] parquet
   +- SubqueryAlias L
      +- SubqueryAlias spark_catalog.silver.dimlocation
         +- Relation spark_catalog.silver.dimlocation[DateTimeCreated#407504,CreatedByUser#407505,DateTimeLastModified#407506,ModifiedByUser#407507,dimLocationKey#407508L,Airport#407509,Terminal#407510,Side#407511,A/D#407512,Function#407513,SubArea#407514,LocationName#407515,ProcessName#407516,SystemName#407517,SystemDeviceId#407518,BatchId#407519L] parquet

== Optimized Logical Plan ==
Project [UserId#407488 AS User ID#407618, tohivestring(cast(from_unixtime(unix_timestamp(MeasuredTimeDateTime#407489, yyyy-MM-dd HH:mm:ss.SSS, Some(Etc/UTC), false), yyyy-MM-dd HH:mm, Some(Etc/UTC)) as timestamp), Some(Etc/UTC)) AS Measured Time Timestamp#407619, dimMeasureDatekey#407490L AS Measured Date Key#407620L, dimMeasureTimekey#407491L AS Measured Time Key#407621L, dimLocationKey#407493L AS Location Key#407622L, dimMeasureTimeBandingKey#407494L AS Measured Time Banding Key#407623L, cast(MeasureTimeInSecs#407495 as string) AS Measured Time in Secs#407624, cast(MeasureTimeDurationInSecs#407497 as string) AS Measured Time Duration In Secs#407625, PersonCount#407500 AS Person Count#407626]
+- Join Inner, (dimLocationKey#407493L = dimLocationKey#407508L)
   :- Project [UserId#407488, MeasuredTimeDateTime#407489, dimMeasureDatekey#407490L, dimMeasureTimekey#407491L, dimLocationKey#407493L, dimMeasureTimeBandingKey#407494L, MeasureTimeInSecs#407495, MeasureTimeDurationInSecs#407497, PersonCount#407500]
   :  +- Filter ((isnotnull(dimMeasureDatekey#407490L) AND ((cast(dimMeasureDatekey#407490L as double) >= 2.0230822E7) AND (cast(dimMeasureDatekey#407490L as double) < 2.0230823E7))) AND isnotnull(dimLocationKey#407493L))
   :     +- Relation spark_catalog.silver.factqueuemanagementperuser[DateTimeCreated#407480,CreatedByUser#407481,DateTimeLastModified#407482,ModifiedByUser#407483,dimAirportKey#407484L,factQPerUserKey#407485L,QAreaId#407486L,QAreaName#407487,UserId#407488,MeasuredTimeDateTime#407489,dimMeasureDatekey#407490L,dimMeasureTimekey#407491L,RouteStartDateTime#407492,dimLocationKey#407493L,dimMeasureTimeBandingKey#407494L,MeasureTimeInSecs#407495,MeasureTimeInMins#407496,MeasureTimeDurationInSecs#407497,MeasureTimeNoFilterInSecs#407498,MeasureTimeNoFilterInMins#407499,PersonCount#407500,Outcome#407501,dimTerminalKey#407502L,BatchId#407503L] parquet
   +- Project [dimLocationKey#407508L]
      +- Filter ((isnotnull(Function#407513) AND (Function#407513 = Immigration)) AND isnotnull(dimLocationKey#407508L))
         +- Relation spark_catalog.silver.dimlocation[DateTimeCreated#407504,CreatedByUser#407505,DateTimeLastModified#407506,ModifiedByUser#407507,dimLocationKey#407508L,Airport#407509,Terminal#407510,Side#407511,A/D#407512,Function#407513,SubArea#407514,LocationName#407515,ProcessName#407516,SystemName#407517,SystemDeviceId#407518,BatchId#407519L] parquet

== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- Project [UserId#407488 AS User ID#407618, tohivestring(cast(from_unixtime(unix_timestamp(MeasuredTimeDateTime#407489, yyyy-MM-dd HH:mm:ss.SSS, Some(Etc/UTC), false), yyyy-MM-dd HH:mm, Some(Etc/UTC)) as timestamp), Some(Etc/UTC)) AS Measured Time Timestamp#407619, dimMeasureDatekey#407490L AS Measured Date Key#407620L, dimMeasureTimekey#407491L AS Measured Time Key#407621L, dimLocationKey#407493L AS Location Key#407622L, dimMeasureTimeBandingKey#407494L AS Measured Time Banding Key#407623L, cast(MeasureTimeInSecs#407495 as string) AS Measured Time in Secs#407624, cast(MeasureTimeDurationInSecs#407497 as string) AS Measured Time Duration In Secs#407625, PersonCount#407500 AS Person Count#407626]
+- BroadcastHashJoin [dimLocationKey#407493L], [dimLocationKey#407508L], Inner, BuildRight, false
   :- Filter (((isnotnull(dimMeasureDatekey#407490L) AND (cast(dimMeasureDatekey#407490L as double) >= 2.0230822E7)) AND (cast(dimMeasureDatekey#407490L as double) < 2.0230823E7)) AND isnotnull(dimLocationKey#407493L))
   :  +- FileScan parquet spark_catalog.silver.factqueuemanagementperuser[UserId#407488,MeasuredTimeDateTime#407489,dimMeasureDatekey#407490L,dimMeasureTimekey#407491L,dimLocationKey#407493L,dimMeasureTimeBandingKey#407494L,MeasureTimeInSecs#407495,MeasureTimeDurationInSecs#407497,PersonCount#407500] Batched: true, DataFilters: [isnotnull(dimMeasureDatekey#407490L), (cast(dimMeasureDatekey#407490L as double) >= 2.0230822E7)..., Format: Parquet, Location: PreparedDeltaFileIndex(1 paths)[abfss://silver@laketest001.dfs.core.windows.net/queue_mgmt/fact..., PartitionFilters: [], PushedFilters: [IsNotNull(dimMeasureDatekey), IsNotNull(dimLocationKey)], ReadSchema: struct<UserId:string,MeasuredTimeDateTime:timestamp,dimMeasureDatekey:bigint,dimMeasureTimekey:bi...
   +- Exchange SinglePartition, EXECUTOR_BROADCAST, [plan_id=223127]
      +- Project [dimLocationKey#407508L]
         +- Filter ((isnotnull(Function#407513) AND (Function#407513 = Immigration)) AND isnotnull(dimLocationKey#407508L))
            +- FileScan parquet spark_catalog.silver.dimlocation[dimLocationKey#407508L,Function#407513] Batched: true, DataFilters: [isnotnull(Function#407513), (Function#407513 = Immigration), isnotnull(dimLocationKey#407508L)], Format: Parquet, Location: PreparedDeltaFileIndex(1 paths)[abfss://silver@laketest001.dfs.core.windows.net/cdm/dimLocation], PartitionFilters: [], PushedFilters: [IsNotNull(Function), EqualTo(Function,Immigration), IsNotNull(dimLocationKey)], ReadSchema: struct<dimLocationKey:bigint,Function:string>

azuresnowflake1
  • Can you please provide plans, the DBR version, whether you are using Photon or not, and how you are doing auth? – Chris Aug 21 '23 at 15:34
  • Hi Chris - we are using DBR 13, Spark 3.4.0. We were using Photon but disabled it after the plan displayed a message stating that some features were not supported by Photon. What do you suggest? – azuresnowflake1 Aug 22 '23 at 01:19
  • For auth: we are using an AD email account that is embedded in the model. – azuresnowflake1 Aug 22 '23 at 01:22
  • @Chris - added the query plan and a screenshot from the Spark UI. – azuresnowflake1 Aug 22 '23 at 07:57
  • Re Photon: even if it says some features are disabled, it typically still works faster. YMMV. – Chris Aug 22 '23 at 08:09
  • I added an answer, but it would be good to know if you are using doubles on purpose (and why), and if not, to figure out whether they are coming from Power BI. I have had the fun of supporting Power BI's lack of 64-bit long support in URLs in [Quality](https://sparkutils.github.io/quality/0.1.1/getting_started/#query-optimisations), and this has that feel. – Chris Aug 22 '23 at 08:24
  • It is coming from Power BI - trying to investigate why this is so. – azuresnowflake1 Aug 22 '23 at 08:36

1 Answer


You aren't getting pushdowns or partition filters on the fact table's Parquet scan - only the IsNotNull predicates are pushed down:

PartitionFilters: [], 
PushedFilters: [IsNotNull(dimMeasureDatekey), IsNotNull(dimLocationKey)],

but you do have pushdown on the scan of the table holding Function (dimlocation):

PartitionFilters: [], 
PushedFilters: [IsNotNull(Function), EqualTo(Function,Immigration), IsNotNull(dimLocationKey)],

On the filters side:

[isnotnull(dimMeasureDatekey#407490L), (cast(dimMeasureDatekey#407490L as double) >= 2.0230822E7)...,

and

Filter ((cast(Measured Date Key#407424L as double) >= 2.0230822E7) AND (cast(Measured Date Key#407424L as double) < 2.0230823E7))

you have casting on the column. Casts currently can't be optimised out, even when they are reversible. I created an issue and a half-solution for those that are - it still needs an easier way to disable / enable.
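To illustrate (a sketch, not the exact SQL Power BI generates), running EXPLAIN on the two filter shapes should show the difference in PushedFilters:

-- Cast on the column (the shape in the plan above): only the IsNotNull
-- predicates reach PushedFilters; the range comparison stays in the Filter.
EXPLAIN SELECT * FROM silver.factqueuemanagementperuser
WHERE CAST(dimMeasureDatekey AS DOUBLE) >= 2.0230822E7
  AND CAST(dimMeasureDatekey AS DOUBLE) <  2.0230823E7;

-- Column left uncast, bigint-compatible literals: the range comparison
-- itself can be pushed down to the Delta/Parquet scan.
EXPLAIN SELECT * FROM silver.factqueuemanagementperuser
WHERE dimMeasureDatekey >= 20230822
  AND dimMeasureDatekey <  20230823;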

It would seem the date key is a long, but it is cast to double before the comparison. That predicate has no chance of being pushed down, and even if you had partitions on dimMeasureDatekey they wouldn't be used.

You must pass in a long and not cast the column (or, where possible, move the cast to the other side) in order for pushdowns / predicates to work; see the sketch below.
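Either of these shapes keeps the column uncast (a sketch; the second relies on the cast constant-folding to a bigint literal at planning time):

-- (a) bigint literals, no cast anywhere
SELECT * FROM silver.factqueuemanagementperuser
WHERE dimMeasureDatekey >= 20230822L
  AND dimMeasureDatekey <  20230823L;

-- (b) if the client insists on double literals, cast the literal side
-- instead of the column
SELECT * FROM silver.factqueuemanagementperuser
WHERE dimMeasureDatekey >= CAST(2.0230822E7 AS BIGINT)
  AND dimMeasureDatekey <  CAST(2.0230823E7 AS BIGINT);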

Chris