Background
I am converting a (Postgre)SQL team over to running on Spark with Delta Lake by providing a SQL runner in Scala that picks up their files and runs them in a conventional order of directories and files. My goal is to let them convert their SQL code into a Spark/Delta Lake dialect as easily as possible, without making it too strange or convoluted for them.
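For example, the conventional layout looks something like this (the path is taken from the failing test in the error below):

    src/test/resources/sql/
      test_table_changes/
        1-test_insert/
          1-test_insert.sql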
Goal
At first I was interpolating the Delta Lake root data path for them, but then I found a tip saying that if you create a simple select * from deltatable view, you can run Delta Lake table changes against that SQL view and they materialize in the underlying Delta table (very cool!). [The tip appears just above the "Schema Validation" section linked here: https://docs.delta.io/latest/delta-update.html#schema-validation.]
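Concretely, the pattern from the tip looks something like this when issued through spark.sql (a sketch; the view name and Delta path are placeholders, assuming a SparkSession in scope as spark):

    // Sketch of the docs tip, assuming a SparkSession in scope as `spark`.
    // The view name and Delta path are placeholders.
    spark.sql("""
      CREATE OR REPLACE VIEW test_table_changes__test_table AS
      SELECT * FROM delta.`/blah/blah/test_table_changes/test_table`
    """)

    // DML run against the view materializes in the underlying Delta table;
    // the `id` column matches the plan in the error below.
    spark.sql("UPDATE test_table_changes__test_table SET id = 2 WHERE id = 1")
    spark.sql("DELETE FROM test_table_changes__test_table WHERE id = 2")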
Hypothesis
I've tested this, and it works for update, delete, and merge, but not for insert. My guess is that's because an insert looks just like a plain SQL statement and I am passing it in as a SQL query; notice that the error below comes from the Spark SQL analyzer, whereas it normally comes from the Delta Lake analyzer if I mess up the merge syntax or something.
Question
So, is there a way to read a SQL file and force it to be interpreted by Delta Lake even if it looks superficially like a normal SQL statement?
Specifically, is there something I can do in the Scala code that runs the SQL file to indicate that I always want Delta Lake interpretation, or maybe something I can add to every insert statement, like a using delta clause?
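For illustration only, I mean something shaped like this in the SQL file (as far as I know, no such clause actually exists for insert):

    insert into test_table_changes__test_table using delta values 1,2;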
Stakes
The only way I know so far to force it to interpret this normal-looking table/view as a Delta operation is to use the Delta path to refer to the table, e.g.:

    delta.`/blah/blah/schema/table`

... which is pretty intimidating and error-prone for my Spark novitiates. (I actually gave them a handle that I grab and edit at runtime to upgrade a normal(ish) ${handle}schema.table reference into the full Delta path, but even that isn't ideal.)
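Roughly, that handle upgrade is a rewrite like this (a simplified sketch; the root path, handle token, and function name are illustrative):

    // Simplified sketch of the runtime handle upgrade; "/blah/blah" and
    // "${handle}" are placeholders, not the real values.
    val deltaRoot = "/blah/blah"

    def resolveHandles(sql: String): String =
      // ${handle}schema.table  ->  delta.`/blah/blah/schema/table`
      sql.replaceAll("""\$\{handle\}(\w+)\.(\w+)""", s"delta.`$deltaRoot/$$1/$$2`")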
Code (the SQL file mentioned at the top of the error below):
    insert into
    test_table_changes__test_table
    values 1,2
    ;
(I do a little cleaning of their SQL at run time and then pass the string into spark.sql() to create a DataFrame that I later write to a new Delta table, as appropriate.)
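For context, the per-file piece of the runner is essentially this (a simplified sketch; the real cleaning and error handling are elided):

    import org.apache.spark.sql.{DataFrame, SparkSession}
    import scala.io.Source

    // Essentially what the runner does per SQL file (simplified sketch).
    def runSqlFile(spark: SparkSession, path: String): DataFrame = {
      val source = Source.fromFile(path)
      val sql = try source.mkString finally source.close()
      val cleaned = sql.trim.stripSuffix(";") // stand-in for the "little cleaning"
      spark.sql(cleaned)                      // this is where the insert fails
    }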
Error (in an sbt test run):
[info] SparkSpec:
[info] As a spark operator,
[info] I want to be able to submit the spark job with one command,
[info] - So it's easy to remember,
[info] - And returns consistent results.
[info] Feature: update/delete/insert into tables
[info] Scenario: sql developer changes table data *** FAILED ***
[info] com.wayspring.spark.claims.Sandbox$Util$ExceptionWithPath: Error while processing: src/test/resources/sql/test_table_changes/1-test_insert/1-test_insert.sql
[info] ...
[info] Cause: org.apache.spark.sql.AnalysisException: Inserting into an RDD-based table is not allowed.;
[info] 'InsertIntoStatement Project [id#1183], false, false
[info] +- LocalRelation [col1#2608]
[info] at org.apache.spark.sql.execution.datasources.PreWriteCheck$.failAnalysis(rules.scala:493)
[info] at org.apache.spark.sql.execution.datasources.PreWriteCheck$.$anonfun$apply$6(rules.scala:524)
[info] at org.apache.spark.sql.execution.datasources.PreWriteCheck$.$anonfun$apply$6$adapted(rules.scala:496)
[info] at org.apache.spark.sql.catalyst.trees.TreeNode.foreach(TreeNode.scala:357)
[info] at org.apache.spark.sql.execution.datasources.PreWriteCheck$.apply(rules.scala:496)
[info] at org.apache.spark.sql.execution.datasources.PreWriteCheck$.apply(rules.scala:491)
[info] at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$47(CheckAnalysis.scala:617)
[info] at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$47$adapted(CheckAnalysis.scala:617)
[info] at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
[info] at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
[info] ...
[info] Given sql developer uses table_changes statement in sql
[info] When it finishes
[info] Then target table is correct
[info] Given sql developer uses insert statement in sql
[info] Run completed in 25 seconds, 625 milliseconds.
[info] Total number of tests run: 2
[info] Suites: completed 2, aborted 0
[info] Tests: succeeded 1, failed 1, canceled 0, ignored 0, pending 0
[info] *** 1 TEST FAILED ***
[error] Failed tests:
[error] SparkSpec
[error] (Test / test) sbt.TestsFailedException: Tests unsuccessful
[error] Total time: 29 s, completed Mar 17, 2023 4:17:20 PM