I'm fairly new to AWS Firehose and Glue, and I'm flummoxed. I'm trying to get the data that comes through the Firehose converted properly to Parquet files in S3.
My question is this: How can I make any record format conversions fail whenever they don't have the exact list of column names and data types that I'm looking for? In other words, it should fail if there are more or fewer columns, columns with incorrect names, or values with incorrect data types.
Below is all the background information on what I've attempted thus far and the Terraform code I currently have.
Initially, I had it set up with the column names defined directly in the aws_glue_catalog_table resource in my Terraform file, but that allowed data that should not have been accepted to get converted and stored as Parquet files. It correctly handled the data that I want it to handle, but it also let a variety of bad data through.
My desired column names (and their types) are listed below, with an example record after the list:
- occurred_at: timestamp
- account_id: bigint
- form_id: string
- action: string
- url: string
- visitor_uuid: string
- properties: string
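For concreteness, a record that I would want to be accepted looks roughly like this (the values are made up for illustration):
{
  "occurred_at": "2016-05-18T22:15:29.419Z",
  "account_id": 1,
  "form_id": "some-form-id",
  "action": "submit",
  "url": "https://example.com/some-page",
  "visitor_uuid": "123e4567-e89b-12d3-a456-426614174000",
  "properties": "{}"
}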
By contrast, an example of an object that it happily accepted (but should not have) is a row from our accounts table; an excerpt of that table's definition is shown here:
Table "public.accounts"
Column | Type | Collation | Nullable | Default
--------------------------------+-----------------------------+-----------+----------+--------------------------------------
id | integer | | not null | nextval('accounts_id_seq'::regclass)
subscription_id | integer | | |
name | character varying(255) | | |
url | character varying(255) | | |
deleted_at | timestamp without time zone | | |
created_at | timestamp without time zone | | not null |
updated_at | timestamp without time zone | | not null |
default_from_email | character varying(255) | | |
default_postal_address | text | | |
default_from_name | character varying(255) | | |
primary_email | character varying(255) | | |
disable_js | boolean | | | false
status | character varying(255) | | |
is_onboarded | boolean | | |
hide_guided_setup | boolean | | |
is_js_installed | boolean | | |
domain | character varying(255) | | |
enable_third_party_cookies | boolean | | |
phone_number | character varying(255) | | |
disable_analytics | boolean | | | false
has_activated_campaign | boolean | | | false
has_enabled_goal | boolean | | | false
current_setup_id | integer | | |
default_template_id | integer | | |
default_html_footer | text | | |
default_text_footer | text | | |
enable_lead_scoring | boolean | | not null | false
raw_lead_settings | text | | |
regenerate_js | boolean | | | true
sending_status | character varying(255) | | |
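So a record along these lines (values invented for illustration), which only shares the url key with my target schema, still gets converted:
{
  "id": 42,
  "subscription_id": 7,
  "name": "Acme Inc.",
  "url": "https://acme.example.com",
  "created_at": "2016-05-18T22:15:29.420Z",
  "updated_at": "2016-05-18T22:15:29.420Z",
  "disable_js": false,
  "status": "active"
}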
The resulting Parquet files have the url field populated, but the remaining six columns, which don't have matching column names in the database table, are None (or NaT or NaN, depending on the column's data type).
In an effort to reject bad data, I attempted to use an AWS Glue Schema to handle the record format conversion. I have tried both Avro and Protobuf schemas, and both fail in different ways. I have not tried JSON, because it doesn't seem to support timestamps, which I need.
The Avro implementation yields failed-conversion files in S3 that look like this:
{
"attemptsMade":1,
"arrivalTimestamp":1659372890895,
"lastErrorCode":"DataFormatConversion.InvalidSchema",
"lastErrorMessage":"The schema is invalid. The specified table has no columns.",
"attemptEndingTimestamp":1659372956967,
"rawData":"eyJpZCI6MSwiYWNjb3VudF9pZCI6MSwiZm9ybV9pZCI6MSwidmlzaXRvcl91dWlkIjoiIiwiYWN0aW9uIjoic3VibWl0IiwiY3JlYXRlZF9hdCI6IjIwMTYtMDUtMThUMjI6MTU6MjkuNDIwWiIsInVwZGF0ZWRfYXQiOiIyMDE2LTA1LTE4VDIyOjE1OjI5LjQyMFoiLCJ1cmwiOiIiLCJtYW51YWxseV9vcGVuZWQiOm51bGwsIm9jY3VycmVkX2F0IjoiMjAxNi0wNS0xOFQyMjoxNToyOS40MTlaIn0=",
"sequenceNumber":"49631940267636444464128680289902809552022038273287258114",
"subSequenceNumber":null,
"dataCatalogTable": {
"catalogId":null,
"databaseName":"drip-staging-sn-form-events-glue-db",
"tableName":"drip-staging-sn-form-events-glue-table",
"region":"us-east-1",
"versionId":"LATEST",
"roleArn":"arn:aws:iam::[[my AWS account number]]:role/drip-staging-sn-form-event-firehose-role"
}
}
I have no idea why it says the table has no columns, because when I open the Glue table in the AWS console, it shows the columns I expect to see, and it says they were pulled from the Glue schema.
Next, I tried Protobuf, which failed when I applied the Terraform plan with the following error message:
aws_glue_catalog_table.sn_form_events: Modifying... [id=[[my AWS account number]]:drip-staging-sn-form-events-glue-db:drip-staging-sn-form-events-glue-table]
╷
│ Error: Error updating Glue Catalog Table: InvalidInputException: No SchemaDereferencingStrategy found for schema type: PROTOBUF
│
│ with aws_glue_catalog_table.sn_form_events,
│ on drip-firehose-streams.tf line 61, in resource "aws_glue_catalog_table" "sn_form_events":
│ 61: resource "aws_glue_catalog_table" "sn_form_events" {
Here's the Terraform file that I'm using to set things up. I've commented out the Protobuf code and the columns within the Glue table, but the commented-out versions are what I use when trying each of those alternatives. Note that var.resource_prefix evaluates to drip and var.resource_domain evaluates to staging.
data "aws_s3_bucket" "sn_form_event_destination" {
bucket = "drip-data-mart"
}
resource "aws_cloudwatch_log_group" "sn_form_events" {
name = "/aws/kinesisfirehose/${var.resource_prefix}-${var.resource_domain}-sn-form-events"
retention_in_days = 30
}
resource "aws_cloudwatch_log_stream" "sn_form_events" {
name = "DestinationDelivery"
log_group_name = aws_cloudwatch_log_group.sn_form_events.name
}
data "template_file" "firehose_service_policy_json" {
template = file("drip-role-firehose-service-policy.tpl")
vars = {
log_group_arn = aws_cloudwatch_log_group.sn_form_events.arn
s3_bucket_arn = data.aws_s3_bucket.sn_form_event_destination.arn
s3_prefix = local.s3_prefix
glue_database_name = aws_glue_catalog_database.sn_form_events.name
glue_table_name = aws_glue_catalog_table.sn_form_events.name
}
}
resource "aws_iam_role" "firehose_role" {
name = "${var.resource_prefix}-${var.resource_domain}-sn-form-event-firehose-role"
assume_role_policy = <<EOF
{
"Version": "2012-10-17",
"Statement": [
{
"Action": "sts:AssumeRole",
"Principal": {
"Service": "firehose.amazonaws.com"
},
"Effect": "Allow",
"Sid": ""
}
]
}
EOF
}
resource "aws_iam_policy" "firehose_service_policy" {
name = "${local.pond_cc_name}SNFormEventFirehoseServicePolicy"
policy = data.template_file.firehose_service_policy_json.rendered
}
resource "aws_iam_role_policy_attachment" "firehose-role-attach-service-policy" {
role = aws_iam_role.firehose_role.id
policy_arn = aws_iam_policy.firehose_service_policy.arn
}
resource "aws_glue_catalog_database" "sn_form_events" {
name = "${var.resource_prefix}-${var.resource_domain}-sn-form-events-glue-db"
}
resource "aws_glue_catalog_table" "sn_form_events" {
name = "${var.resource_prefix}-${var.resource_domain}-sn-form-events-glue-table"
database_name = aws_glue_catalog_database.sn_form_events.name
table_type = "EXTERNAL_TABLE"
parameters = {
EXTERNAL = "TRUE"
"parquet.compression" = "SNAPPY"
}
storage_descriptor {
location = "s3://${data.aws_s3_bucket.sn_form_event_destination.id}/${local.s3_prefix}sn-form-events"
input_format = "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat"
output_format = "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat"
ser_de_info {
name = "${var.resource_prefix}-${var.resource_domain}-sn-form-events"
serialization_library = "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"
parameters = {
"serialization.format" = 1
}
}
schema_reference {
schema_version_number = 1
schema_id {
schema_arn = "arn:aws:glue:us-east-1:[[my AWS account number]]:schema/sn-form-events-glue-schema-registry/sn-form-events-glue-schema"
}
}
# schema_reference {
# schema_version_number = 1
# schema_id {
# schema_arn = "arn:aws:glue:us-east-1:[[my AWS account number]]:schema/sn-form-events-glue-schema-registry/sn_form_events.proto"
# }
# }
# columns {
# name = "occurred_at"
# type = "timestamp"
# }
# columns {
# name = "account_id"
# type = "bigint"
# }
# columns {
# name = "form_id"
# type = "string"
# }
# columns {
# name = "action"
# type = "string"
# }
# columns {
# name = "url"
# type = "string"
# }
# columns {
# name = "visitor_uuid"
# type = "string"
# }
# columns {
# name = "properties"
# type = "string"
# }
}
depends_on = [
aws_glue_catalog_database.sn_form_events,
# aws_glue_schema.sn_form_events_glue_schema-protobuf
aws_glue_schema.sn_form_events_glue_schema
]
}
resource "aws_glue_registry" "sn_form_events_glue_schema_registry" {
registry_name = "sn-form-events-glue-schema-registry"
}
resource "aws_glue_schema" "sn_form_events_glue_schema" {
schema_name = "sn-form-events-glue-schema"
registry_arn = aws_glue_registry.sn_form_events_glue_schema_registry.arn
data_format = "AVRO"
compatibility = "NONE"
schema_definition = <<EOF
{
"type": "record",
"name": "sn_form_events_glue_schema",
"fields": [
{
"name": "occurred_at",
"type": "long",
"logicalType": "timestamp-millis"
},
{
"name": "account_id",
"type": "long"
},
{
"name": "form_id",
"type": "string"
},
{
"name": "action",
"type": "string"
},
{
"name": "url",
"type": "string"
},
{
"name": "visitor_uuid",
"type": "string"
},
{
"name": "properties",
"type": "string"
}]
}
EOF
}
# resource "aws_glue_schema" "sn_form_events_glue_schema-protobuf" {
# schema_name = "sn_form_events.proto"
# registry_arn = aws_glue_registry.sn_form_events_glue_schema_registry.arn
# data_format = "PROTOBUF"
# compatibility = "NONE"
# schema_definition = <<EOF
# syntax = "proto3";
# package sn-form-events;
# import "google/protobuf/timestamp.proto";
# message sn_form_events{
# google.protobuf.Timestamp occurred_at = 1;
# int64 account_id = 2;
# string form_id = 3;
# string action = 4;
# string url = 5;
# string visitor_uuid = 6;
# string properties = 7;
# }
# EOF
# }
resource "aws_kinesis_firehose_delivery_stream" "sn_form_events" {
destination = "extended_s3"
name = "${var.resource_prefix}-${var.resource_domain}-sn-form-events"
tags = {
Name = "${var.resource_prefix}-${var.resource_domain}-sn-form-events"
}
extended_s3_configuration {
bucket_arn = data.aws_s3_bucket.sn_form_event_destination.arn
prefix = "${local.s3_prefix}year=!{timestamp:yyyy}/month=!{timestamp:MM}/day=!{timestamp:dd}/"
buffer_interval = var.resource_domain == "production" ? 300 : 60 # 5 minutes for production, 1 minute for staging
buffer_size = 128 # 128 MB
compression_format = "UNCOMPRESSED"
role_arn = aws_iam_role.firehose_role.arn
error_output_prefix = "${local.s3_prefix}failed_conversions/year=!{timestamp:yyyy}/month=!{timestamp:MM}/day=!{timestamp:dd}/!{firehose:error-output-type}/"
cloudwatch_logging_options {
enabled = true
log_group_name = aws_cloudwatch_log_group.sn_form_events.name
log_stream_name = aws_cloudwatch_log_stream.sn_form_events.name
}
data_format_conversion_configuration {
enabled = true
input_format_configuration {
deserializer {
open_x_json_ser_de {
case_insensitive = true
column_to_json_key_mappings = {}
convert_dots_in_json_keys_to_underscores = false
}
}
}
output_format_configuration {
serializer {
parquet_ser_de {
block_size_bytes = 268435456 # 256MiB - default at time of writing
compression = "SNAPPY"
enable_dictionary_compression = false
max_padding_bytes = 0
page_size_bytes = 1048576 # 1MiB - default at time of writing
writer_version = "V1"
}
}
}
schema_configuration {
database_name = aws_glue_catalog_database.sn_form_events.name
region = "us-east-1"
role_arn = aws_iam_role.firehose_role.arn
table_name = aws_glue_catalog_table.sn_form_events.name
version_id = "LATEST"
}
}
processing_configuration {
enabled = false
}
}
server_side_encryption {
enabled = false
key_type = "AWS_OWNED_CMK"
}
depends_on = [
aws_glue_catalog_table.sn_form_events,
aws_iam_role.firehose_role,
aws_glue_schema.sn_form_events_glue_schema
]
}
Here's my Terraform version and AWS provider version:
$ cat tf-state.tf
terraform {
required_version = "~> 1.2.5"
backend "s3" {}
required_providers {
aws = {
source = "hashicorp/aws"
version = "~> 4.23.0"
}
template = {
source = "hashicorp/template"
version = "~> 2.2.0"
}
}
}
Lastly, I've seen this specific answer to a different but similar question, and it did not resolve my issues: https://stackoverflow.com/a/69440421/3991562
Thanks for any advice! :D