I am currently working on a lineage system that will be deployed in our google-cloud space, the goal is to extract and parse SQL queries executed from BigQuery using audit logs and create a lineage out of those. I explored a couple of existing products but eventually decided to build my own.
I came across the Google's POC and documentation (https://cloud.google.com/architecture/building-a-bigquery-data-lineage-solution) for data-lineage which led me to explore the usage of zetaSQL.
I was able to parse simple queries and extract referenced tables and output columns from it using Analyzer.extractTableNamesFromStatement(query)
and Analyzer.analyzeStatement(query, analyzerOptions, simpleCatalog
. However, I was wondering if zetaSQL is capable of achieving these things below, and if so, It would be really helpful if you can give me some examples or path to take.
- Obtain source columns from
ResolvedNodes
? - if the source table being referenced in the SQL is a derived table, can it also cascade the lineage parsing of it?
additionally, I was also exploring the usage of SimpleCatalog
and
I was wondering.
- Why does zetaSQL needs a registration of catalog of tables when the reference tables are already available from the SQL queries? I kind of have a clue it does a lot of checks in background but what are those checks?
- are those registered catalogs only available at runtime and then wiped out after the job terminates? does it need to register all tables referenced in a SQL or is there a way to cache it somehow?
Thank you.