I'm using Synapse in Azure and have data in the serverless SQL pool. I want to import that data into a dataframe in Databricks.

I am getting the following error:

Py4JJavaError: An error occurred while calling o568.load.
: java.lang.ClassNotFoundException: Failed to find data source: com.databricks.spark.sqldw. Please find packages at http://spark.apache.org/third-party-projects.html
    at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:656)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:195)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:168)
    at sun.reflect.GeneratedMethodAccessor102.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ClassNotFoundException: com.databricks.spark.sqldw.DefaultSource
...
...
...

The PySpark code I am using is:

spark.conf.set(
  "fs.azure.account.key.adlsAcct.blob.core.windows.net",
  "GVk3234fds2JX/fahOcjig3gNy198yasdhfkjasdyf87HWmDVlx1wLRmu7asdfaP3g==")



sc._jsc.hadoopConfiguration().set(
  "fs.azure.account.key.adlsAcct.blob.core.windows.net",
  "GVk3234fds2JX/fahOcjig3gNy198yasdhfkjasdyf87HWmDVlx1wLRmu7asdfaP3g==")


  
df = spark.read \
.format("com.databricks.spark.sqldw") \
.option("url","jdbc:sqlserver://synapse-myworkspace-ondemand.sql.azuresynapse.net:1433;database=myDB;user=myUser;password=userPass123;encrypt=false;trustServerCertificate=true;hostNameInCertificate=*.sql.azuresynapse.net;loginTimeout=30;") \
.option("tempdir", "wasbs://projects@adlsAcct.dfs.core.windows.net/Lakehouse/tempDir") \
.option("forwardSparkAzureStorageCredentials","true") \
.option("dbtble","tbl_sampledata") \
.load()

I can confirm:

  • The firewall setting to allow Azure services to connect is configured.
  • The user has access to the serverless SQL pool database.
  • I have tried integrated auth and get the same result.

To my eye, the error looks like Databricks cannot find the format com.databricks.spark.sqldw, but that could be a red herring.

I appreciate any advice and expertise.

wilson_smyth

1 Answer

One of the advantages of working with Azure Synapse Analytics is integration: the various components (storage, database, pipelines, notebooks, etc.) tend to work together a bit more easily than standalone components, e.g. a Databricks notebook, where you have to write code like yours, including the hadoopConfiguration settings.

One simple way of getting data from a dedicated SQL pool into a Synapse notebook is the synapsesql method. A simple example:

%%spark
// Get the table with synapsesql method and expose as temp view
val df = spark.read.synapsesql("dedi_pool.dbo.someTable")

df.createOrReplaceTempView("someTable")

Unfortunately this method is only implemented in Scala at the moment (as far as I am aware), but you can save the dataframe as a temp view, which exposes it to Spark SQL and Python:

%%sql
SELECT * FROM someTable;

And here's the Python to retrieve the temp view:

%%pyspark
## Get the table with synapsesql method and expose as temp view
df = spark.sql("select * from someTable")

df.show()

Check the main documentation for this technique here.

For serverless SQL pools, I was initially frustrated that this is not built in, but if you think about it, you would be using duplicate services, i.e. the serverless engine to query the underlying files and the Apache Spark pool to query that placeholder / external table, which effectively queries the same underlying files. So you might as well reference the files directly using spark.read and whatever the file format is, e.g. .csv. Example taken from the docs:

%%pyspark
df = spark.read.load('abfss://users@contosolake.dfs.core.windows.net/NYCTripSmall.parquet', format='parquet')
display(df.limit(10))
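
If the underlying files are CSV rather than parquet, the same direct-read approach works. The sketch below is an illustration only: the file path is hypothetical, and the header and inferSchema options are assumptions about the file layout that you would adjust to your own data:

%%pyspark
## Hypothetical CSV path for illustration only; adjust container, path and options to your data
df = spark.read.load(
    'abfss://users@contosolake.dfs.core.windows.net/TripsSmall.csv',
    format='csv',
    header=True,        ## assumes the first row holds column names
    inferSchema=True    ## let Spark infer column types by scanning the data
)
display(df.limit(10))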

This is what I was thinking: [diagram: parquet files queried via both the serverless SQL pool and the Apache Spark pool]

It is, however, technically possible, and I covered the technique here.
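
For completeness, here is a minimal sketch of that idea (not necessarily the exact technique in that link) using Spark's built-in generic JDBC reader against the serverless endpoint, so no com.databricks.spark.sqldw connector is needed. The server, database, table and credentials are the placeholders from your question, and it assumes the Microsoft SQL Server JDBC driver is available on the cluster:

## Minimal sketch only: read from the serverless SQL endpoint via Spark's generic JDBC source
jdbc_url = (
    "jdbc:sqlserver://synapse-myworkspace-ondemand.sql.azuresynapse.net:1433;"
    "database=myDB;encrypt=true;trustServerCertificate=false;"
    "hostNameInCertificate=*.sql.azuresynapse.net;loginTimeout=30;"
)

df = (
    spark.read
    .format("jdbc")                       ## generic JDBC source ships with Spark
    .option("url", jdbc_url)
    .option("dbtable", "dbo.tbl_sampledata")
    .option("user", "myUser")
    .option("password", "userPass123")
    .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")  ## assumes this driver is on the cluster
    .load()
)

df.show()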

wBob
  • Thanks for this, it's very helpful. The long and the short of it is that the Apache Spark connector to Synapse SQL pools does not work for serverless. The main use case I can see for this being useful is when I want to simply join tables or query a view that does the joining of the underlying data files for me. I'll have to live with importing the files directly for the moment. Thank you! – wilson_smyth Jul 29 '21 at 18:40