1

I'm writing a bash script to parse a bunch (a dozen or more) of massive Terraform files that contain a large number of google_bigquery_dataset resources and their associated IAM access blocks. The script should take each dataset resource and copy it to another file, named for the dataset itself.

All of this is fine, except extracting the name of the dataset from the resource's "dataset_id" field. This would be easy enough, if not for the fact that some of these dataset resources have authorized view blocks that also contain "dataset_id" values.

Here is an example of such a resource:

resource "google_bigquery_dataset" "project-bigquery-dataset-RESOURCE_NAME" {
  access {
    role          = "WRITER"
    special_group = "projectWriters"
  }

  access {
    role          = "READER"
    special_group = "projectReaders"
  }

  access {
    role          = "WRITER"
    user_by_email = "user1@project.iam.gserviceaccount.com"
  }

  access {
    role          = "OWNER"
    special_group = "projectOwners"
  }

  access {
    view {
      dataset_id = "DO_NOT_WANT"
      project_id = "project"
      table_id   = "table1"
    }
  }

  access {
    view {
      dataset_id = "DO_NOT_WANT"
      project_id = "project"
      table_id   = "table2"
    }
  }

  access {
    view {
      dataset_id = "DO_NOT_WANT"
      project_id = "project"
      table_id   = "table3"
    }
  }

  dataset_id                      = "THIS_IS_WHAT_I_WANT"
  default_partition_expiration_ms = "0"
  delete_contents_on_destroy      = "false"

  labels = {
    application-name = "app-name"
  }

  location = "US"
  project  = "project"
}

Before I realized that the authorized view blocks also had a dataset_id field, I was using this to try to grab the value I wanted, assuming startIndex and endIndex are just the start and end line numbers representing a complete dataset resource block as above:

fileName=$( sed -n ${startIndex},${endIndex}p $bigFile | grep "dataset_id" | cut -d\" -f2)

This works only when there are no authorized view blocks containing other dataset_id values.

I then tried to use a Negative Lookbehind:

fileName=$( sed -n ${startIndex},${endIndex}p $bigFile | grep '(?<!view {\n)dataset_id' | cut -f1 -d\" )

That doesn't work. I'm not sure if it's because of the newline or because of the whitespace between the end of view { and the start of dataset_id = "DO_NOT_WANT".

I've tried variations on it, such as (?<!view\s{\s)\s*dataset_id without success.

Is there any way to capture only the dataset_id that isn't in a view block?

A couple notes:

  1. I can guarantee that view { will always precede the dataset_id in a block, without a line break.
  2. I cannot guarantee the order. It's possible the dataset_id I'm trying to capture could be present before the view blocks, after them, or even somewhere between them.
  3. Desired output for the above example would simply be THIS_IS_WHAT_I_WANT

Any help would be appreciated.
RavinderSingh13
dxh3707
  • 3
    I assume that `awk` is a better tool for your requirement. – Cyrus Oct 12 '22 at 19:40
  • 1
    Please add your desired output for that sample input to your question (no comment). – Cyrus Oct 12 '22 at 19:56
  • 1
    `grep` implements BRE, and optionally ERE. Standard `grep` has no lookahead or lookbehind capabilities at all -- that's PCRE syntax. Have you thought about using a perl oneliner? (perl is, after all, where PCRE comes from). – Charles Duffy Oct 12 '22 at 21:24
  • (also, grep is _not_ "bash regex" support; the regex support built into bash uses `[[ $string =~ $regex ]]` syntax - though that's ERE syntax, so it too doesn't support lookbehind. `grep` is a completely independent tool provided by your operating system vendor, not part of bash). – Charles Duffy Oct 12 '22 at 21:25
  • This might help: `grep '^ dataset_id' file` – Cyrus Oct 12 '22 at 22:21
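
The indentation idea in the last comment can be sketched against the pipeline from the question. This is only a sketch: it assumes the files are formatted the way terraform fmt would leave them, so that the top-level dataset_id is indented by exactly two spaces and anything inside a view block is indented deeper. The sample file and the startIndex/endIndex values are made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for one of the big Terraform files.
bigFile=$(mktemp)
cat > "$bigFile" <<'EOF'
resource "google_bigquery_dataset" "example" {
  access {
    view {
      dataset_id = "DO_NOT_WANT"
      project_id = "project"
      table_id   = "table1"
    }
  }

  dataset_id = "THIS_IS_WHAT_I_WANT"
  location   = "US"
}
EOF

# Assumed line range of one complete resource block.
startIndex=1
endIndex=12

# Keep only a dataset_id line indented by exactly two spaces (ERE, no
# lookbehind needed), then take the text between the first pair of quotes.
fileName=$(sed -n "${startIndex},${endIndex}p" "$bigFile" \
  | grep -E '^ {2}dataset_id[[:space:]]*=' \
  | cut -d'"' -f2)

echo "$fileName"
rm -f "$bigFile"
```

If the indentation is not consistent across files, this filter will miss or over-match, so it is worth checking a few resources by hand first.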

3 Answers

3

With your shown samples only, please try the following awk code. Written and tested in GNU awk.

awk -v RS= -v FS="\n" '
/^[[:space:]]+dataset_id[[:space:]]+/{
  split($1,arr,"\"")
  print arr[2]
}
'  Input_file

Explanation of the code above:

  • RS= (an empty record separator) puts awk into paragraph mode, so each run of lines delimited by blank lines becomes one record.
  • FS="\n" makes each line of a record a separate field.
  • The condition on the main block tests whether the record begins with one or more whitespace characters, then dataset_id, then one or more whitespace characters. It therefore matches only a paragraph whose first line is the top-level dataset_id, as in the sample; there, the dataset_id lines inside view blocks never start a paragraph.
  • When the condition is true, the split function breaks $1 (the first line of the record) on the " character into an array named arr, indexed 1, 2, 3 and so on, one element per piece.
  • arr[2], the text between the first pair of quotes, is then printed; this is the output the OP asked for.
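
If the blank-line layout cannot be relied on, one alternative is to track brace nesting instead of paragraphs. The following is a minimal sketch, not part of the answer above: it assumes braces only ever open and close blocks (no braces inside quoted strings or comments), and the sample file, the depth variable and the parts array are names chosen for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical sample standing in for one extracted resource block.
sample=$(mktemp)
cat > "$sample" <<'EOF'
resource "google_bigquery_dataset" "example" {
  access {
    view {
      dataset_id = "DO_NOT_WANT"
      project_id = "project"
      table_id   = "table1"
    }
  }
  dataset_id = "THIS_IS_WHAT_I_WANT"
  location   = "US"
}
EOF

# Each line is tested against the depth in effect before its own braces
# are counted, so depth 1 means "directly inside the resource block".
dataset_name=$(awk '
  depth == 1 && $1 == "dataset_id" {
    split($0, parts, "\"")     # parts[2] is the text between the first pair of quotes
    print parts[2]
  }
  {
    depth += gsub(/[{]/, "{")  # gsub returns the number of replacements made
    depth -= gsub(/[}]/, "}")
  }
' "$sample")

echo "$dataset_name"
rm -f "$sample"
```

Because the check depends on depth rather than position, it does not matter whether the wanted dataset_id comes before, after, or between the view blocks.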
RavinderSingh13
1

If your grep supports the -P (PCRE) option, please try the following. It was tested with your shown sample.

grep -Poz '(?:^|\n)(?:(?!view).)*\n\s*dataset_id\s*=\s*"\K[^"]+' input_file

Output:

THIS_IS_WHAT_I_WANT

Assumption

  • If view { precedes the dataset_id, the two appear on two consecutive lines.

Explanations

  • Because the pattern has to be matched across lines, the -z option is given to grep so that records are terminated by a NUL byte instead of a newline; a file with no NUL bytes is then read as a single record.
  • The regex (?:^|\n)(?:(?!view).)*\n\s*dataset_id\s*=\s*"\K[^"]+ matches a line that does not contain the word view, followed by a line containing dataset_id.
  • (?:^|\n) anchors the start of the line, as the multiline option (?m) does not work due to the -z option.
  • As the lookbehind assertion does not allow a variable-length match, we need to use (?:(?!view).)* as an alternative to (?<!view.*).
  • The following \n\s*dataset_id makes sure at least one newline exists between the line checked for view and dataset_id. Otherwise the regex would match within a single line that merely contains dataset_id, causing over-detection.
  • \K discards the text matched so far, so that it is excluded from the output.
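
One practical detail when using this inside a script: with -z, GNU grep also terminates each output match with a NUL byte rather than a newline, which bash command substitution cannot store. A hedged sketch of capturing the result, assuming GNU grep built with -P support; the sample file and the dataset_name variable are made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical sample standing in for one extracted resource block.
input_file=$(mktemp)
cat > "$input_file" <<'EOF'
resource "google_bigquery_dataset" "example" {
  access {
    view {
      dataset_id = "DO_NOT_WANT"
      project_id = "project"
      table_id   = "table1"
    }
  }

  dataset_id = "THIS_IS_WHAT_I_WANT"
  location   = "US"
}
EOF

# Same regex as above; tr turns the NUL terminators that -z produces into
# newlines so the value can be captured in a shell variable.
dataset_name=$(grep -Poz '(?:^|\n)(?:(?!view).)*\n\s*dataset_id\s*=\s*"\K[^"]+' "$input_file" \
  | tr '\0' '\n')

echo "$dataset_name"
rm -f "$input_file"
```

If a file contains several resources, each match ends up on its own line after the tr step.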
tshiono
0

Not guaranteed to work with your HCL, but you could try converting it to JSON first:

$ cat foo.tf | 
yj -c | 
jq  -r '.resource[].google_bigquery_dataset[][][].dataset_id'
THIS_IS_WHAT_I_WANT
THIS_IS_WHAT_I_WANT
0bel1sk