I have defined a view in an Azure Synapse serverless SQL pool that fetches data with OPENROWSET. I also have data in a table in a dedicated SQL pool.
Can I write a single SQL query that joins these two tables?
Azure Synapse Analytics really has three separate engines (at least that's the way I see it), and your SQL query has to execute from one of them, which gives you three choices:
Execute from inside the dedicated SQL pool. You will have access to your dedicated table, and depending on the definition of your serverless SQL pool OPENROWSET query, you may be able to reproduce it within dedicated, e.g. as an external table. If you've created a view in serverless, there is no way to reference it directly from dedicated at this time. Dedicated does not support OPENROWSET (at this time) in the same way serverless does.
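For illustration, here is a rough sketch of what that could look like in dedicated, against the same public parquet file used in the serverless example further down. The data source, file format and table names are my own placeholders, and the column list would need to match the actual parquet schema:
-- External data source pointing at the public pandemic data lake
CREATE EXTERNAL DATA SOURCE PandemicDataLake
WITH (
    TYPE = HADOOP,
    LOCATION = 'wasbs://public@pandemicdatalake.blob.core.windows.net'
);

CREATE EXTERNAL FILE FORMAT ParquetFileFormat
WITH (FORMAT_TYPE = PARQUET);

-- Column list abbreviated for illustration
CREATE EXTERNAL TABLE dbo.ecdc_cases_ext (
    date_rep                  DATE,
    cases                     INT,
    deaths                    INT,
    countries_and_territories NVARCHAR(256)
)
WITH (
    LOCATION = '/curated/covid-19/ecdc_cases/latest/ecdc_cases.parquet',
    DATA_SOURCE = PandemicDataLake,
    FILE_FORMAT = ParquetFileFormat
);

-- Now it can be joined to ordinary dedicated-pool tables
SELECT TOP 10 * FROM dbo.ecdc_cases_ext;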
Execute from inside serverless. You will not have direct access to your dedicated table, but you could dump it out using CREATE EXTERNAL TABLE AS SELECT (CETAS) into your data lake, which would then allow you to read it from serverless with another OPENROWSET query.
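A sketch of that round trip, with placeholder names throughout (storage account, container, data source); the CETAS runs in the dedicated pool, the OPENROWSET in serverless:
-- In the dedicated SQL pool: export the table to the lake as parquet
CREATE EXTERNAL TABLE dbo.nation_export
WITH (
    LOCATION = '/export/nation/',
    DATA_SOURCE = MyDataLake,        -- an external data source you've defined over your lake
    FILE_FORMAT = ParquetFileFormat  -- a parquet external file format, as above
)
AS
SELECT * FROM dbo.nation;

-- In the serverless SQL pool: read the exported files back
SELECT *
FROM OPENROWSET(
    BULK 'https://mystorageaccount.dfs.core.windows.net/mycontainer/export/nation/*.parquet',
    FORMAT = 'PARQUET'
) AS rows;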
Execute from Apache Spark pools. Use the third engine to bring the other two together. Getting access to dedicated tables from within Azure Synapse Spark notebooks is as easy as running spark.read.synapsesql with a three-part name. Here is a simple example (in Scala):
// Read a dedicated SQL pool table using the three-part name
// <database>.<schema>.<table>
val dfNation = spark.read.synapsesql("tpch_ded.dbo.nation")
dfNation.show
// Register it as a temp view so it can be referenced from Spark SQL
dfNation.createOrReplaceTempView("nation")
Getting access to a serverless object is a bit more involved, but I got this Scala example to work, inspired by Jovan Popovich here:
import java.util.Properties

// Load the SQL Server JDBC driver
Class.forName("com.microsoft.sqlserver.jdbc.SQLServerDriver")

// The serverless (on-demand) SQL endpoint of the workspace
val jdbcHostname = "XXX-ondemand.sql.azuresynapse.net"
val jdbcPort = 1433
val jdbcDatabase = "master"

// Create the connection string
val jdbcUrl = s"jdbc:sqlserver://${jdbcHostname}:${jdbcPort};database=${jdbcDatabase};encrypt=true;trustServerCertificate=false;hostNameInCertificate=*.database.windows.net;loginTimeout=60;"

// Credentials for a SQL user on the serverless pool
val props = new Properties()
val jdbcUsername = "someSQLUser"
val jdbcPassword = "xxx"
props.put("user", s"${jdbcUsername}")
props.put("password", s"${jdbcPassword}")
val driverClass = "com.microsoft.sqlserver.jdbc.SQLServerDriver"
props.put("Driver", s"${driverClass}")

// The OPENROWSET query to run on the serverless pool
val sqlQuery = """select top 10 *
from openrowset(
    bulk 'https://pandemicdatalake.blob.core.windows.net/public/curated/covid-19/ecdc_cases/latest/ecdc_cases.parquet',
    format = 'parquet') AS rows"""
// This worked but isn't using the props object
val df = spark.read.format("jdbc").
  option("url", jdbcUrl).
  option("user", s"${jdbcUsername}").
  option("password", s"${jdbcPassword}").
  option("driver", s"${driverClass}").
  option("query", sqlQuery).
  load()

df.show(false)
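For completeness, a variant that does use the props object, via the DataFrameReader.jdbc overload; the query has to be wrapped as a derived table with an alias for the JDBC source:
// Same read, but passing the credentials through the Properties bag
val df2 = spark.read.jdbc(jdbcUrl, s"($sqlQuery) q", props)
df2.show(false)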
So you could bring the two dataframes together in a notebook using a join, for example:
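Here is a minimal sketch, reusing the nation temp view from the dedicated-pool read and the serverless dataframe df above; the join key is purely illustrative, since these two sample datasets don't really share one:
// Register the serverless result next to the dedicated-pool temp view
df.createOrReplaceTempView("ecdc_cases")

// Join data from the two engines in Spark SQL (join key is illustrative)
val joined = spark.sql("""
  select n.n_name, c.cases, c.deaths
  from nation n
  join ecdc_cases c
    on upper(c.countries_and_territories) = n.n_name
""")
joined.show(false)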
The Spark pools and serverless SQL pools also have a shared metadata model, so you can create managed/unmanaged tables in Spark and they appear in serverless. Better examples here.
We're looking at using pattern 2 at the moment, which has the downside of duplication and the upside of reducing cost on the dedicated SQL pool, i.e. use dedicated to import and process data, dump it out, then pause it.
It feels like it should be simpler, right? Maybe I'm missing something, or maybe this is set to improve on the roadmap.