
I am trying to build a SQL parser that will parse complicated SQL queries.

For example, below is one such query:

select * from (select a.a, a.b from db.tmp_table1 a JOIN db.tmp_table2 b 
where a.c = b.c 
and a.d = 'interesting' 
and b.e = 234)

The output I want to produce is:

(schema = db, table = tmp_table1, cols used = a, b, c, d, filters = (d = 'interesting'))
(schema = db, table = tmp_table2, cols used = c, e, filters = (e = 234))

I tried running some experiments by doing this:

val logicalPlan = spark.sessionState.sqlParser.parsePlan(query)
logicalPlan: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan =
'Project [*]
+- 'SubqueryAlias `__auto_generated_subquery_name`
   +- 'Project ['a.a, 'a.b]
      +- 'Filter ((('a.c = 'b.c) && ('a.d = interesting)) && ('b.e = 234))
         +- 'Join Inner
            :- 'SubqueryAlias `a`
            :  +- 'UnresolvedRelation `db`.`tmp_table1`
            +- 'SubqueryAlias `b`
               +- 'UnresolvedRelation `db`.`tmp_table2`

However, the part I am stuck on is how to parse the logicalPlan to get my desired output. I checked some Stack Overflow answers related to this kind of parsing, like this one, but the requirements in those questions seem to be a lot simpler.

1 Answer


The best place to look for info is the Spark source code itself. All plans can be pattern matched; take the printed plan and search the source for examples using a given LogicalPlan. E.g. googling for "projection logicalplan spark github" turns up LogicalPlan, LogicalPlanVisitor and basicLogicalOperators, which includes Project, Subquery, SubqueryAlias, Filter, Join and others; that should be a good starting point for you.
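
As a minimal sketch of what that pattern matching can look like, assuming the parsed (unresolved) plan from your question and a Spark 2.x UnresolvedRelation; the identifier accessor differs on 3.x, see the comment:

import org.apache.spark.sql.catalyst.analysis.UnresolvedRelation
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan

// Walk the whole tree and collect every table reference.
// On Spark 2.x UnresolvedRelation carries a TableIdentifier; on 3.x it
// carries a Seq[String] (multipartIdentifier), so adjust the accessor.
def tablesOf(plan: LogicalPlan): Seq[(Option[String], String)] =
  plan.collect {
    case r: UnresolvedRelation =>
      (r.tableIdentifier.database, r.tableIdentifier.table)
  }

// For the plan above this should yield something like:
// Seq((Some(db), tmp_table1), (Some(db), tmp_table2))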

Note that Join conditions and Filter conditions aren't the same: "join on" conditions fall under the Join node and other conditions are under Filter (e.g. the optimiser rules in Quality).
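
To make the distinction concrete, a rough sketch; in your query the join has no ON clause, so everything landed in the Filter, but with an ON clause the predicate sits on the Join node:

import org.apache.spark.sql.catalyst.expressions.Expression
import org.apache.spark.sql.catalyst.plans.logical.{Filter, Join}

// "JOIN ... ON ..." predicates hang off the Join node (condition is an
// Option, empty here since the query has no ON clause); WHERE predicates
// become a Filter node above the Join.
val joinConditions: Seq[Expression] =
  logicalPlan.collect { case j: Join => j.condition }.flatten

val whereConditions: Seq[Expression] =
  logicalPlan.collect { case f: Filter => f.condition }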

In all cases you are pattern matching on trees of either LogicalPlan or the underlying Expressions; you can collect or transform them as appropriate. Also note that pushdown predicate Expressions may look like catalyst.expressions, but they are a more abstract, source-specific subset.
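
For the "cols used" part of your output, a sketch of collecting over the expression trees. Note that at parse time the attributes are unresolved, so "a.c" is just name parts; tracing the alias "a" back to db.tmp_table1 means walking the SubqueryAlias nodes, or analysing the plan first if the tables actually exist:

import org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute

// Every node's expressions (projections, filter predicates, join
// conditions) are trees too, so collect works on them the same way.
val columnsUsed: Seq[Seq[String]] =
  logicalPlan.flatMap(_.expressions).flatMap { expr =>
    expr.collect { case attr: UnresolvedAttribute => attr.nameParts }
  }.distinct

// roughly yields: Seq(Seq(a, a), Seq(a, b), Seq(a, c), Seq(b, c),
//                     Seq(a, d), Seq(b, e))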

Be aware that any code using internal Spark APIs can often be Spark version specific, particularly when it comes to pattern matching Expressions (Cast and Add have changed in the 3.x line, for example).
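
One defensive habit (a sketch, not the only way): match on the expression's type and read its stable accessors instead of using the case-class extractor, whose arity is what usually changes between versions:

import org.apache.spark.sql.catalyst.expressions.Cast

// Cast gained constructor parameters during 3.x, so an extractor pattern
// like `case Cast(child, dt, tz) => ...` only compiles against some
// versions; a type match with field access is stable across them.
val casts = logicalPlan.flatMap(_.expressions).flatMap {
  _.collect { case c: Cast => (c.child, c.dataType) }
}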

Lastly, I'd highly suggest using a debugger for the exploration.
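
Short of a debugger, the TreeNode helpers are handy for poking around in the shell (a small aside; apply(n) fetches the node printed at that number):

// Print the plan with a number per node, then pull out a node by number.
println(logicalPlan.numberedTreeString)
val node = logicalPlan(2)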

Chris
  • thanks @Chris. Sorry, I don't have enough reputation to upvote your answer. Let me look through it and see how to implement this. – Data Geek Aug 25 '23 at 21:02