
I'm using a handy Terraform module published by Google Cloud Platform to spin up a BigQuery dataset complete with several tables and a number of views. This works really well, except that when the views depend on the tables, I have to retry my apply, because nothing tells Terraform to create one before the other. I assign properties to both the tables and the views based on the contents of an assets subdirectory, which gives me all the schemas and templated queries I want.

Is there a supported way to do this short of abandoning the nice module reuse pattern? I'm really happy with the pattern I have.

For completeness, my code to invoke the module is

locals {
  #
  # Tables
  #

  # Description attribute for each table. If absent, no description is set (null).
  table_to_description = {
# ...
  }

  # Values for this dataset that never change.
  TABLE_CONSTANTS = {
    time_partitioning = null
    expiration_time   = null
    clustering        = []
    labels = {
      terraform_managed = "true"
    }
  }

  TABLE_SCHEMA_SUFFIX = ".json"


  table_schema_filenames = fileset(pathexpand("${path.module}/assets/schemas"), "*.json")
  // fileset() doesn't have an option to output full paths, so we need to re-expand them
  table_schema_paths = [for file_name in local.table_schema_filenames : pathexpand("${path.module}/assets/schemas/${file_name}")]

  # Build a vector of objects, one for each table
  table_inputs = [for full_path in local.table_schema_paths : {
    schema = full_path
  # TODO(jaycarlton) I do not yet see a way around doing the replacement twice, as it's not possible
  #   to refer to other values in the same object while defining it. (A possible workaround is
  #   sketched in the commented-out block below.)
    table_id    = replace(basename(full_path), local.TABLE_SCHEMA_SUFFIX, "")
    description = lookup(local.table_to_description, replace(basename(full_path), local.TABLE_SCHEMA_SUFFIX, ""), null)
  }]
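
  # (Untested sketch, not what I'm actually running: one possible way around the
  #  duplicated replace() would be to build a map keyed by table_id first, then
  #  derive the per-table objects from that map. It would replace table_inputs above.)
  #
  # table_ids_to_schema_paths = {
  #   for full_path in local.table_schema_paths :
  #   replace(basename(full_path), local.TABLE_SCHEMA_SUFFIX, "") => full_path
  # }
  #
  # table_inputs = [for table_id, full_path in local.table_ids_to_schema_paths : {
  #   schema      = full_path
  #   table_id    = table_id
  #   description = lookup(local.table_to_description, table_id, null)
  # }]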

  # Merge calculated inputs with the ones we use every time.
  tables = [for table_input in local.table_inputs :
    merge(table_input, local.TABLE_CONSTANTS)
  ]

  #
  # Views
  #
  VIEW_CONSTANTS = {
    # Reporting Subsystem always uses Standard SQL Syntax
    use_legacy_sql = false,
    labels = {
      terraform_managed = "true"
    }
  }
  QUERY_TEMPLATE_SUFFIX = ".sql"
  # Local filenames for view templates. Returns something like ["latest_users.sql", "users_by_id.sql"]
  view_query_template_filenames = fileset("${path.module}/assets/views", "*.sql")
  # expanded to fully qualified path, e.g. ["/repos/workbench/terraform/modules/reporting/views/latest_users.sql", ...]
  //  view_query_template_paths = [for file_name in local.view_query_template_filenames : pathexpand("./reporting/views/${file_name}")]
  view_query_template_paths = [for file_name in local.view_query_template_filenames : pathexpand("${path.module}/assets/views/${file_name}")]

  # Create views for each .sql file in the views directory. There is no Terraform
  # dependency from the view to the table(s) it queries, and I don't believe the SQL is even checked
  # for accuracy prior to creation on the BQ side.
  views = [for view_query_template_path in local.view_query_template_paths :
    merge({
      view_id = replace(basename(view_query_template_path), local.QUERY_TEMPLATE_SUFFIX, ""),
      query = templatefile(view_query_template_path, {
        project = var.project_id
        dataset = var.reporting_dataset_id
      })
  }, local.VIEW_CONSTANTS)]

}

# All BigQuery assets for Reporting subsystem
module "main" {
  source     = "terraform-google-modules/bigquery/google"
  version    = "~> 4.3"
  dataset_id = var.reporting_dataset_id
  project_id = var.project_id
  location   = "US"

  # Note: friendly_name shows up in the plan and apply output, but can't be set here
  # directly. Maybe it's just not exposed by the dataset module even though the
  # underlying resources look for it?
  dataset_name = "Workbench ${title(var.aou_env)} Environment Reporting Data" # exposed as friendly_name in plan
  description  = "Daily output of relational tables and time series views for analysis. Views are provided for general ad-hoc analysis."

  tables = local.tables

  # Note that, when creating this module from the ground up, it's common to see an error like
  # `Error: googleapi: Error 404: Not found: Table my-project:my_dataset.my_table, notFound`. It seems
  # to be a momentary issue due to the dataset's existence not yet being observable to the table/view
  # create API. So far, it's always worked on a re-run.
  # TODO(jaycarlton) see if there's a way to put a retry on this. I'm not convinced that will work
  #   outside of a resource context (and inside a third-party module).
  views = local.views

}

The errors look like:

Error: googleapi: Error 404: Not found: Table <MY_PROJECT>:<MY_DATASET>.user, notFound

  on .terraform/modules/workbench.reporting.main/main.tf line 76, in resource "google_bigquery_table" "view":
  76: resource "google_bigquery_table" "view" {

The views are accepted by Terraform but rejected by BigQuery, because the tables they refer to either haven't been created yet or aren't yet available. It looks like depends_on goes in a resource block, but those are abstracted away inside the module, from what I understand. A retrier would also solve my problem (if less gracefully), since re-running terraform apply has worked every time so far.

Jay Carlton

2 Answers


If I'm understanding the necessary data flow here, I think one way to make this work would be to arrange for local.views to depend on one of the table-related output values in the module. Since your local.views expression includes a templatefile call, you could potentially achieve that by passing the table names into the template:

  views = [for view_query_template_path in local.view_query_template_paths :
    merge({
      view_id = replace(basename(view_query_template_path), local.QUERY_TEMPLATE_SUFFIX, ""),
      query = templatefile(view_query_template_path, {
        project     = var.project_id
        dataset     = var.reporting_dataset_id
        table_names = module.main.table_names
      })
  }, local.VIEW_CONSTANTS)]

Because the table_names output value depends on the table resources in the module, you can use it to declare an indirect dependency on the tables. The views argument to the module will then itself depend on local.views, and thus indirectly (now two levels removed) depend on the tables.

The above assumes that the configuration of the tables inside the module doesn't in any way depend on the views. If there were such a dependency then this would create a dependency cycle, but from a quick read of the module source code that doesn't seem like it will be a problem here.
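
For illustration, here's a hypothetical view template (the file name is taken from the example list in the question, and the "user" table name from the error message). The template body doesn't actually have to reference table_names at all, because Terraform derives the dependency from the module.main.table_names reference in the templatefile(...) arguments rather than from the rendered text:

-- assets/views/latest_users.sql (hypothetical contents)
SELECT *
FROM `${project}.${dataset}.user`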

A key thing this answer is depending on is that Terraform module input variables and output variables are each a separate node in the dependency graph, rather than the entire module being a node. Therefore one input variable for a module can potentially depend on an output value from that same module, as long as the dependencies inside the module don't cause that to create a dependency cycle.
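
To make that last point concrete, here is a contrived, self-contained sketch (the module name, input variable, and output are all made up) of a module input referring to an output of the same module instance:

# Contrived example: "greeting" is an input variable of the ./modules/names child
# module and "default_name" is one of its outputs. Terraform accepts this as long
# as default_name does not itself depend on greeting inside the module, which
# would create a cycle.
module "names" {
  source   = "./modules/names"
  greeting = "Hello, ${module.names.default_name}!"
}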

Martin Atkins
  • We initially implemented this approach using an implicit dependency, and then we realized that defining the policy I mentioned in my answer was a cleaner implementation for our overall infrastructure. – Gongora Jun 24 '21 at 02:27
  • We have found certain use cases where we face an issue similar to the one the OP describes here, but the cause is a dependency between multiple views under the same dataset rather than between views and tables. I would like to ask for your advice if possible, Martin. My first thought was to define, somehow, a dynamic dependency within the module (remember that with Google's module you define all views under a dataset through a single instance of the module; [they use](https://github.com/cbsi-dto/tf-mod-bq/blob/master/main.tf#L105) `for_each`). – Gongora Jun 24 '21 at 02:32
  • Even though a dynamic dependency sounds hacky (if it's even doable), in any case we are going to need to extend Google's module for that use case. – Gongora Jun 24 '21 at 02:39
  • I no longer have the mental context loaded about what we were discussing here, and it sounds like your situation is at least slightly different than what this question was about, so I'd suggest starting a new top-level question on Stack Overflow where you can share fuller details about what you're trying to achieve and what you tried so far. – Martin Atkins Jun 24 '21 at 19:09

We implemented our framework using a similar approach, but to mitigate this case we decided to define a policy in which a BigQuery dataset never hosts tables and views together, only one kind or the other. That way our CI always applies the code for the table datasets first and then the code for the view datasets.
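
A rough sketch of what that policy could look like (the dataset IDs are hypothetical, and each block would live in its own root configuration so CI can apply them in order):

# Hypothetical "tables" configuration, applied first by CI.
module "reporting_tables" {
  source     = "terraform-google-modules/bigquery/google"
  version    = "~> 4.3"
  project_id = var.project_id
  dataset_id = "reporting_tables" # tables only, no views
  location   = "US"
  tables     = local.tables
}

# Hypothetical "views" configuration, applied second, once the tables exist.
module "reporting_views" {
  source     = "terraform-google-modules/bigquery/google"
  version    = "~> 4.3"
  project_id = var.project_id
  dataset_id = "reporting_views" # views only, querying the tables dataset
  location   = "US"
  views      = local.views
}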

Gongora
  • In my case I definitely want to have the views be in the same dataset as the tables, or it would really complicate the story for my users. – Jay Carlton Jul 27 '21 at 21:41