1

We're building a data warehouse in BigQuery where we generate a large amount of data marts using standard sql statements. These can be quite large and complex. To track data lineage across a chain of dependencies, we'd like to automatically parse the SQL statements and get all the output columns, matched up with the input table.column(s).

Simple example:

SELECT t1.a, t2.b, t1.a + t2.b AS c FROM table1 t1 JOIN table2 t2 ON t1.a = t2.a

Should end up giving us:

Input Output table1.a a table2.b b table1.a c table1.b c

We've tried using this: https://www.npmjs.com/package/node-sql-parser, but it comes up short in some of our complex scenarios.

Is there any library available in any language which supports parsing a SQL statement and returning the AST for the full standard SQL grammar?

Bjoern
  • 433
  • 3
  • 16

1 Answers1

-1

You can use google/zetasql, which is what BigQuery uses for parsing StandardSQL.

Yun Zhang
  • 5,185
  • 2
  • 10
  • 29
  • Yes, I should have mentioned that I'm looking into that. It is however not in a very user-friendly state. I'm currently getting assistance from the very helpful contributors to get it up and running. – Bjoern Jan 09 '20 at 10:32
  • 2
    Sorry to hear that it is not in a user-friendly state. It is still your best bet than any other solution for its first-party nature. – Yun Zhang Jan 09 '20 at 21:50
  • Actually, let me rephrase that. It is in a perfectly fine state, but it requires some development to get to where I want, and the documentation is somewhat lacking at this point. – Bjoern Jan 12 '20 at 17:46
  • 1
    @Bjoern I have the same usecase, can you guide me in a proper direction with some rough steps to get this to work? – Msvstl Apr 06 '22 at 10:17