
Background

I am converting a (postgre)SQL team over to running on Spark with Delta Lake by providing a SQL runner in Scala that picks up and runs their files in a conventional order of directories and files. My goal is to let them convert their SQL code into a Spark/Delta Lake dialect as easily as possible, without making it too strange or convoluted for them.
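
For context, the runner is roughly this shape (a simplified sketch, not my actual code; the root path and helper name are made up):

import java.io.File
import scala.io.Source
import org.apache.spark.sql.SparkSession

// Sketch: walk the conventional directory layout, sort by name so 1-... runs
// before 2-..., and hand each file's contents (one statement per file) to spark.sql.
val spark = SparkSession.builder.getOrCreate()

def sqlFilesInOrder(dir: File): Seq[File] =
  dir.listFiles.toSeq.sortBy(_.getName).flatMap {
    case d if d.isDirectory              => sqlFilesInOrder(d)
    case f if f.getName.endsWith(".sql") => Seq(f)
    case _                               => Seq.empty
  }

sqlFilesInOrder(new File("src/main/resources/sql")).foreach { file =>  // made-up root
  spark.sql(Source.fromFile(file).mkString)
}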

Goal

At first I was interpolating the Delta Lake root data path for them, but then I found a tip saying that if you create a simple select * from <delta table> view, you can run Delta Lake table changes against that SQL view and they materialize in the Delta table underneath (very cool!). [The tip appears just above the "Schema Validation" section linked here: https://docs.delta.io/latest/delta-update.html#schema-validation.]
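
To make that concrete, the pattern I'm relying on looks roughly like this (the column name, path, and setup code here are placeholders, not my actual project):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.getOrCreate()
import spark.implicits._

// Placeholder setup: a Delta table living under the Delta Lake root data path.
Seq(1).toDF("id").write.format("delta").mode("overwrite").save("/blah/blah/schema/table")

// The plain-looking view the SQL developers actually write against.
spark.sql("create or replace view test_table_changes__test_table as select * from delta.`/blah/blah/schema/table`")

// These all work for me -- the changes materialize in the Delta table underneath:
spark.sql("update test_table_changes__test_table set id = 2 where id = 1")
spark.sql("delete from test_table_changes__test_table where id = 2")
spark.sql("""
  merge into test_table_changes__test_table t
  using (select 3 as id) s
  on t.id = s.id
  when not matched then insert (id) values (s.id)
""")

// ... but the equivalent insert is the one that fails (see the error below):
spark.sql("insert into test_table_changes__test_table values 1, 2")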

Hypothesis

I've tested this, and it works for update, delete, and merge, but not for insert. My guess is that an insert looks just like a plain SQL statement, and I am passing it in as a SQL query; notice that the error below comes from the Spark SQL analyzer, whereas it normally comes from the Delta Lake analyzer if I mess up the merge syntax or something similar.

Question

So, is there a way to read a SQL file and force it to be interpreted by Delta Lake even if it looks superficially like a normal SQL statement?

Specifically, is there something I can do in the Scala code that runs the SQL file to indicate that I always want Delta Lake interpretation, or maybe something I can add to every insert statement, like a using delta clause?

Stakes

The only way I know so far to force it to interpret this normal-looking table/view as a Delta operation is to use the delta path to refer to the table, e.g.:

delta.`/blah/blah/schema/table`

... which is pretty intimidating and error-prone for my Spark novitiates. (I actually gave them a handle that I grab and edit to upgrade a normal(ish) ${handle}schema.table reference into the full delta path at runtime, but even that isn't ideal; a rough sketch of that rewrite is below.)
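
That handle upgrade is essentially a string rewrite along these lines (the token name, root path, and regex are illustrative, not my exact implementation):

// Developers write ${handle}schema.table; the runner swaps it for the full
// delta.`...` path form before handing the statement to spark.sql.
val deltaRoot = "/blah/blah"                                   // Delta Lake root data path
val HandleRef = """\$\{handle\}(\w+)\.(\w+)""".r

def upgradeHandles(sql: String): String =
  HandleRef.replaceAllIn(sql, m => s"delta.`$deltaRoot/${m.group(1)}/${m.group(2)}`")

upgradeHandles("delete from ${handle}schema.table where id = 2")
// => delete from delta.`/blah/blah/schema/table` where id = 2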

Code (the SQL file mentioned at the top of the error below):

insert into
test_table_changes__test_table
values 1,2
;

(I do a little cleaning of their SQL at run time and then pass the string into spark.sql() to create a DataFrame that I later write to a new Delta table, as appropriate.)
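
That step is roughly the following (the cleaning and the target path are simplified placeholders):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.getOrCreate()

// Very rough stand-in for the cleaning: drop line comments and the trailing semicolon.
def clean(sql: String): String =
  sql.replaceAll("--.*", "").trim.stripSuffix(";").trim

// Run the cleaned statement; for query-style files the resulting DataFrame
// gets written out as a new Delta table.
val df = spark.sql(clean("select * from test_table_changes__test_table ;"))
df.write.format("delta").mode("overwrite").save("/blah/blah/schema/new_table")  // placeholder target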

Error (in an sbt test run)

[info] SparkSpec:                    
[info] As a spark operator,                                               
[info] I want to be able to submit the spark job with one command,                       
[info] - So it's easy to remember,                                        
[info] - And returns consistent results.                                                 
[info] Feature: update/delete/insert into tables                          
[info]   Scenario: sql developer changes table data *** FAILED ***        
[info]   com.wayspring.spark.claims.Sandbox$Util$ExceptionWithPath: Error while processing: src/test/resources/sql/test_table_changes/1-test_insert/1-test_insert.sql
[info]   ...                         
[info]   Cause: org.apache.spark.sql.AnalysisException: Inserting into an RDD-based table is not allowed.;
[info] 'InsertIntoStatement Project [id#1183], false, false        
[info] +- LocalRelation [col1#2608]                                       
[info]   at org.apache.spark.sql.execution.datasources.PreWriteCheck$.failAnalysis(rules.scala:493)
[info]   at org.apache.spark.sql.execution.datasources.PreWriteCheck$.$anonfun$apply$6(rules.scala:524)
[info]   at org.apache.spark.sql.execution.datasources.PreWriteCheck$.$anonfun$apply$6$adapted(rules.scala:496)
[info]   at org.apache.spark.sql.catalyst.trees.TreeNode.foreach(TreeNode.scala:357)           
[info]   at org.apache.spark.sql.execution.datasources.PreWriteCheck$.apply(rules.scala:496)  
[info]   at org.apache.spark.sql.execution.datasources.PreWriteCheck$.apply(rules.scala:491)  
[info]   at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$47(CheckAnalysis.scala:617)
[info]   at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$47$adapted(CheckAnalysis.scala:617)
[info]   at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)           
[info]   at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)          
[info]   ...                                                              
[info]     Given sql developer uses table_changes statement in sql        
[info]     When it finishes                                               
[info]     Then target table is correct                                   
[info]     Given sql developer uses insert statement in sql               
[info] Run completed in 25 seconds, 625 milliseconds.                     
[info] Total number of tests run: 2                                       
[info] Suites: completed 2, aborted 0                                     
[info] Tests: succeeded 1, failed 1, canceled 0, ignored 0, pending 0                    
[info] *** 1 TEST FAILED ***                                              
[error] Failed tests:    
[error]         SparkSpec                                                 
[error] (Test / test) sbt.TestsFailedException: Tests unsuccessful        
[error] Total time: 29 s, completed Mar 17, 2023 4:17:20 PM               
Wait, poking around the Delta Lake docs, I have a new hypothesis: maybe when I put the delta table path in, it works not because it switches to the delta SQL interpreter, but because it is a Delta table, which Spark SQL is allowed to insert into. That still doesn't explain why the special view works for simple deletes and updates, though ... maybe because they actually refer to something already in the table instead of simply appending ... ? – combinatorist Mar 17 '23 at 23:35

0 Answers