
I'm confused about how to use Terraform to connect Athena to my Glue Catalog database.

I use

resource "aws_glue_catalog_database" "catalog_database" {
    name = "${var.glue_db_name}"
}

resource "aws_glue_crawler" "datalake_crawler" {
    database_name = "${var.glue_db_name}"
    name          = "${var.crawler_name}"
    role          = "${aws_iam_role.crawler_iam_role.name}"
    description   = "${var.crawler_description}"
    table_prefix  = "${var.table_prefix}"
    schedule      = "${var.schedule}" 

    s3_target {
      path = "s3://${var.data_bucket_name[0]}"
  }
    s3_target {
      path = "s3://${var.data_bucket_name[1]}"
  }
 }

to create a Glue DB and a crawler to crawl S3 buckets (here only two), but I don't know how to link the Athena query service to the Glue DB. In the Terraform documentation for Athena, there doesn't appear to be a way to connect Athena to a Glue catalog, only to an S3 bucket. Clearly, however, Athena can be integrated with Glue.

How can I terraform an Athena database to use my Glue catalog as its data source rather than an S3 bucket?

  • Did you run the crawler? Did it create AWS Glue tables? If you do not define `aws_glue_catalog_table` resources with terraform that point to their respective S3 locations, the crawler will need to run at least once to create the tables. Once they are created, your Glue DB and the tables should become visible in Athena, even without defining a terraform `aws_athena_database` resource (see the sketch after these comments). – Martin Mar 28 '19 at 15:29
  • I've run the crawler. There are no tables created in the Glue db. To be honest, I don't even know if I've given the crawler the correct role/policies: I've created a gist that includes the code I'm using: https://gist.github.com/stevenranney/3f5545e1e736266807b1f337e0be58e0 – Steven Mar 28 '19 at 18:52
  • I suggest also adding the tag "terraform" (or "terraform-aws") to the question. – Martin Apr 04 '19 at 06:38
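
As Martin's comment suggests, the tables can also be declared directly in Terraform instead of waiting for a crawler run. A minimal sketch of such a table, reusing the question's database and bucket; the `example/` prefix, the columns, and the CSV format are illustrative assumptions, not from the question:

resource "aws_glue_catalog_table" "example_table" {
  name          = "example_table"
  database_name = "${aws_glue_catalog_database.catalog_database.name}"
  table_type    = "EXTERNAL_TABLE"

  # "classification" tells Glue and Athena how to interpret the files.
  parameters = {
    "classification" = "csv"
  }

  storage_descriptor {
    # Hypothetical prefix inside the first data bucket from the question.
    location      = "s3://${var.data_bucket_name[0]}/example/"
    input_format  = "org.apache.hadoop.mapred.TextInputFormat"
    output_format = "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat"

    ser_de_info {
      serialization_library = "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe"

      parameters = {
        "field.delim" = ","
      }
    }

    # Assumed schema; replace with the real columns.
    columns {
      name = "id"
      type = "bigint"
    }

    columns {
      name = "value"
      type = "string"
    }
  }
}

Once a table exists in the Glue catalog, whether created this way or by a crawler run, it appears in Athena under the Glue database without any Athena-specific resource.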

2 Answers


Our current basic setup for having Glue crawl one S3 bucket and create/update a table in a Glue DB, which can then be queried in Athena, looks like this:

Crawler role and role policy:

  • The assume_role_policy of the IAM role needs only Glue as principal
  • The IAM role policy allows actions for Glue, S3, and logs
  • The Glue actions and resources can probably be narrowed down to the ones really needed
  • The S3 actions are limited to those needed by the crawler
resource "aws_iam_role" "glue_crawler_role" {
  name = "analytics_glue_crawler_role"

  assume_role_policy = <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Action": "sts:AssumeRole",
      "Principal": {
        "Service": "glue.amazonaws.com"
      },
      "Effect": "Allow",
      "Sid": ""
    }
  ]
}
EOF
}

resource "aws_iam_role_policy" "glue_crawler_role_policy" {
  name = "analytics_glue_crawler_role_policy"
  role = "${aws_iam_role.glue_crawler_role.id}"
  policy = <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "glue:*",
      ],
      "Resource": [
        "*"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetBucketLocation",
        "s3:ListBucket",
        "s3:GetBucketAcl",
        "s3:GetObject",
        "s3:PutObject",
        "s3:DeleteObject"
      ],
      "Resource": [
        "arn:aws:s3:::analytics-product-data",
        "arn:aws:s3:::analytics-product-data/*",
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "logs:CreateLogGroup",
        "logs:CreateLogStream",
        "logs:PutLogEvents"
      ],
      "Resource": [
        "arn:aws:logs:*:*:/aws-glue/*"
      ]
    }
  ]
}
EOF
}

S3 Bucket, Glue Database and Crawler:

resource "aws_s3_bucket" "product_bucket" {
  bucket = "analytics-product-data"
  acl = "private"
}

resource "aws_glue_catalog_database" "analytics_db" {
  name = "inventory-analytics-db"
}

resource "aws_glue_crawler" "product_crawler" {
  database_name = "${aws_glue_catalog_database.analytics_db.name}"
  name = "analytics-product-crawler"
  role = "${aws_iam_role.glue_crawler_role.arn}"

  schedule = "cron(0 0 * * ? *)"

  configuration = "{\"Version\": 1.0, \"CrawlerOutput\": { \"Partitions\": { \"AddOrUpdateBehavior\": \"InheritFromTable\" }, \"Tables\": {\"AddOrUpdateBehavior\": \"MergeNewColumns\" } } }"

  schema_change_policy {
    delete_behavior = "DELETE_FROM_DATABASE"
  }

  s3_target {
    path = "s3://${aws_s3_bucket.product_bucket.bucket}/products"
  }
}
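
On the Athena side there is nothing to "connect": Athena reads the Glue catalog directly, so the database and table above simply show up there. The only Athena-specific pieces worth managing in Terraform are a bucket for query results and, optionally, a workgroup pointing at it. A minimal sketch, with bucket and workgroup names as assumptions:

resource "aws_s3_bucket" "athena_query_results" {
  bucket = "analytics-athena-query-results"
  acl    = "private"
}

resource "aws_athena_workgroup" "analytics" {
  name = "analytics"

  configuration {
    result_configuration {
      # Queries run in this workgroup write their results here.
      output_location = "s3://${aws_s3_bucket.athena_query_results.bucket}/output/"
    }
  }
}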
  • I find glue crawler useful for discovering partitions. It does a reasonable job of discovering a schema initially, but I find I spend a lot of time changing crawlers and correcting their discovered schemas, so I tend to just use them once for that purpose and then ongoing just for discovering partitions. – Davos Aug 06 '19 at 13:57
  • Thanks for the comment @Davos. That is definitely correct. We have indeed moved to defining our schemas using Terraform `aws_glue_catalog_table` resources. Adding partitions can also be done without a crawler by using Athena queries: `ALTER TABLE ADD IF NOT EXISTS PARTITION ( ) LOCATION ;` (see the sketch after this thread). For some simple cases (one partition per day) we just add a new partition for the next day every day shortly before midnight by running a Lambda executing such a query. – Martin Aug 07 '19 at 15:04
  • I read somewhere else the suggestion of just creating a lot of them ahead of time; Athena supports 20,000 partitions per table and you can create 100 at once. Lambda seems a good option too. – Davos Aug 07 '19 at 15:32
  • With the AWS Glue Catalog, the max number of partitions per table is actually 10 million, with a max of 20 million partitions in an AWS account (all tables in all Glue databases). 20,000 used to be the maximum with the Athena catalog. https://docs.aws.amazon.com/general/latest/gr/aws_service_limits.html#limits_glue – Martin Aug 07 '19 at 19:58
  • Oh, thanks for correcting that and the docs reference. 10 million seems like you'd want _proper big data_ to go to that extreme, not like the extra-mediumⒸ data I mostly see. On the other hand, BigQuery only allows 4,000, which is not enough: https://cloud.google.com/bigquery/quotas#partitioned_tables – Davos Aug 08 '19 at 00:48
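
To make the partition comment above concrete: the DDL can live next to the rest of the configuration as a named query. Everything here (table name, partition column, date, location) is a hypothetical placeholder, and something like Martin's scheduled Lambda would still have to execute it:

resource "aws_athena_named_query" "add_daily_partition" {
  name     = "add-daily-partition"
  database = "${aws_glue_catalog_database.analytics_db.name}"

  # A real setup would template the date instead of hardcoding it.
  query = <<EOF
ALTER TABLE products ADD IF NOT EXISTS
PARTITION (day = '2019-08-07')
LOCATION 's3://analytics-product-data/products/day=2019-08-07/'
EOF
}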

I had many things wrong in my Terraform code. To start with:

  1. The S3 bucket argument in the aws_athena_database code refers to the bucket for query output, not the data the table should be built from.
  2. I had set up my aws_glue_crawler to write to a Glue database rather than an Athena db. Indeed, as Martin suggested above, once correctly set up, Athena was able to see the tables in the Glue db.
  3. I did not have the correct policies attached to my crawler. Initially, the only policy attached to the crawler role was

    resource "aws_iam_role_policy_attachment" "crawler_attach" {
        policy_arn = "arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole"
        role = "${aws_iam_role.crawler_iam_role.name}"
    } 
    

    After creating a second policy that explicitly allowed all S3 access to all of the buckets I wanted to crawl, and attaching that policy to the same crawler role, the crawler ran and updated tables successfully.

The second policy:

resource "aws_iam_policy" "crawler_bucket_policy" {
    name = "crawler_bucket_policy"
    path = "/"
    description = "Gives crawler access to buckets"
    policy = <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "Stmt1553807998309",
      "Action": "*",
      "Effect": "Allow",
      "Resource": "*"
    },
    {
      "Sid": "Stmt1553808056033",
      "Action": "s3:*",
      "Effect": "Allow",
      "Resource": "arn:aws:s3:::bucket0"
    },
    {
      "Sid": "Stmt1553808078743",
      "Action": "s3:*",
      "Effect": "Allow",
      "Resource": "arn:aws:s3:::bucket1"
    },
    {
      "Sid": "Stmt1553808099644",
      "Action": "s3:*",
      "Effect": "Allow",
      "Resource": "arn:aws:s3:::bucket2"
    },
    {
      "Sid": "Stmt1553808114975",
      "Action": "s3:*",
      "Effect": "Allow",
      "Resource": "arn:aws:s3:::bucket3"
    },
    {
      "Sid": "Stmt1553808128211",
      "Action": "s3:*",
      "Effect": "Allow",
      "Resource": "arn:aws:s3:::bucket4"
    }
  ]
}
EOF
}

I'm confident that I can get away from hardcoding the bucket names in this policy, but I don't yet know how to do that.
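
One approach, suggested in Davos's comment below, is to build the document with the `aws_iam_policy_document` data source and feed it a list variable. A minimal sketch; the variable name and resource name are placeholders:

variable "crawler_bucket_names" {
  type    = "list"
  default = ["bucket0", "bucket1", "bucket2", "bucket3", "bucket4"]
}

data "aws_iam_policy_document" "crawler_buckets" {
  statement {
    effect  = "Allow"
    actions = ["s3:*"]

    # One bucket ARN per list entry; object-level access would also need
    # the "arn:aws:s3:::bucket/*" form.
    resources = ["${formatlist("arn:aws:s3:::%s", var.crawler_bucket_names)}"]
  }
}

resource "aws_iam_policy" "crawler_bucket_policy_from_list" {
  name   = "crawler_bucket_policy"
  path   = "/"
  policy = "${data.aws_iam_policy_document.crawler_buckets.json}"
}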

  • crawler_bucket_policy has one statement allowing action `"*"` on resource `"*"` - this seems very open; is it really necessary? Our current setup, which I'll list as an answer below, allows action `"glue:*"` on resource `"*"` in addition to the s3 actions. – Martin Mar 29 '19 at 23:42
  • I suggest upgrading to the Glue catalog; it automatically integrates with Athena, and the Athena catalog is deprecated. I concur with the comment from @Martin about the wide-open policy; it's not safe. Also, "Resource" can be a list of buckets, which will save you having so many statements in there. Generally for policies the data source `aws_iam_policy_document` works well, as it is HCL syntax, and then you can use the `json` output of it as the `policy` property in an `aws_iam_role_policy` resource. Then it is easy to pass in a list of buckets as a variable without jsonencoding and other weirdness. – Davos Aug 06 '19 at 13:53