0

I am currently working on a lineage system that will be deployed in our google-cloud space, the goal is to extract and parse SQL queries executed from BigQuery using audit logs and create a lineage out of those. I explored a couple of existing products but eventually decided to build my own.

I came across the Google's POC and documentation (https://cloud.google.com/architecture/building-a-bigquery-data-lineage-solution) for data-lineage which led me to explore the usage of zetaSQL.

I was able to parse simple queries and extract referenced tables and output columns from it using Analyzer.extractTableNamesFromStatement(query) and Analyzer.analyzeStatement(query, analyzerOptions, simpleCatalog. However, I was wondering if zetaSQL is capable of achieving these things below, and if so, It would be really helpful if you can give me some examples or path to take.

  • Obtain source columns from ResolvedNodes?
  • if the source table being referenced in the SQL is a derived table, can it also cascade the lineage parsing of it?

additionally, I was also exploring the usage of SimpleCatalog and I was wondering.

  • Why does zetaSQL needs a registration of catalog of tables when the reference tables are already available from the SQL queries? I kind of have a clue it does a lot of checks in background but what are those checks?
  • are those registered catalogs only available at runtime and then wiped out after the job terminates? does it need to register all tables referenced in a SQL or is there a way to cache it somehow?

Thank you.

Jay-r Bangit
  • 59
  • 1
  • 9
  • 1
    Your question was closed yesterday. Do not repost the same question. Edit the existing question to meet guidelines. https://stackoverflow.com/questions/73356720/zetasql-why-does-it-need-a-look-up-catalog-when-analyzing-statement – John Hanley Aug 16 '22 at 21:50
  • To help you create good questions, read this guide: https://stackoverflow.com/help/how-to-ask I also recommend that you ask **one** well defined question with a clear problem description. Your original question was closed for that reason (needs to be more focused). – John Hanley Aug 16 '22 at 21:53
  • Your assumption is wrong. Some of us have decades of database experience. Your question does not meet the question requirements. Read the link and the advice I have provided. Otherwise, your question will be closed again. – John Hanley Aug 17 '22 at 03:31

1 Answers1

0

I'll answer my own question for bullet number 1, it seems it is depending on the type of node. for example, if you are trying to obtain lineage on the functions applied on a column, We could use ResolvedFunctionCall to obtain transformations applied and from there you can do a visit on node ResolvedColumnRef which enables you to get source column for that expression.

Jay-r Bangit
  • 59
  • 1
  • 9