
I'm fairly new to AWS Firehose and Glue and everything, and I'm flummoxed. I'm trying to get the data that comes through the firehose to be converted properly to a Parquet file in S3.

My question is this: How can I make the record format conversion fail whenever an incoming record doesn't have the exact set of column names and data types that I'm looking for? In other words, conversion should fail if a record has more or fewer columns than expected, columns with incorrect names, or values with incorrect data types.

Below is all the background information on what I've attempted thus far and the Terraform code I currently have.

Initially, I had it set up with the column names defined directly in the aws_glue_catalog_table resource in my Terraform file, but that was allowing data that should not have been accepted to get converted and stored as a Parquet file. It was correctly handling the data that I want it to handle, but it was also letting a variety of bad data through.

My desired column names (and their types) are:

  • occurred_at: timestamp
  • account_id: bigint
  • form_id: string
  • action: string
  • url: string
  • visitor_uuid: string
  • properties: string
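
For concreteness, a record that should pass the conversion would look something like this (the field values here are made up purely for illustration):

{
  "occurred_at": "2016-05-18T22:15:29.419Z",
  "account_id": 1,
  "form_id": "form_123",
  "action": "submit",
  "url": "https://example.com/some-page",
  "visitor_uuid": "00000000-0000-0000-0000-000000000000",
  "properties": "{}"
}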

An example of data that it happily accepted (but shouldn't have) is a row from our accounts database table; an excerpt of that table's schema is shown here:

                                                  Table "public.accounts"
             Column             |            Type             | Collation | Nullable |               Default
--------------------------------+-----------------------------+-----------+----------+--------------------------------------
 id                             | integer                     |           | not null | nextval('accounts_id_seq'::regclass)
 subscription_id                | integer                     |           |          |
 name                           | character varying(255)      |           |          |
 url                            | character varying(255)      |           |          |
 deleted_at                     | timestamp without time zone |           |          |
 created_at                     | timestamp without time zone |           | not null |
 updated_at                     | timestamp without time zone |           | not null |
 default_from_email             | character varying(255)      |           |          |
 default_postal_address         | text                        |           |          |
 default_from_name              | character varying(255)      |           |          |
 primary_email                  | character varying(255)      |           |          |
 disable_js                     | boolean                     |           |          | false
 status                         | character varying(255)      |           |          |
 is_onboarded                   | boolean                     |           |          |
 hide_guided_setup              | boolean                     |           |          |
 is_js_installed                | boolean                     |           |          |
 domain                         | character varying(255)      |           |          |
 enable_third_party_cookies     | boolean                     |           |          |
 phone_number                   | character varying(255)      |           |          |
 disable_analytics              | boolean                     |           |          | false
 has_activated_campaign         | boolean                     |           |          | false
 has_enabled_goal               | boolean                     |           |          | false
 current_setup_id               | integer                     |           |          |
 default_template_id            | integer                     |           |          |
 default_html_footer            | text                        |           |          |
 default_text_footer            | text                        |           |          |
 enable_lead_scoring            | boolean                     |           | not null | false
 raw_lead_settings              | text                        |           |          |
 regenerate_js                  | boolean                     |           |          | true
 sending_status                 | character varying(255)      |           |          |

The resulting Parquet files have the url field populated (the one column name the two schemas share), but the remaining six columns, which don't have matching column names in the accounts table, come out as None (or NaT or NaN, depending on the column's data type).
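
For context, the null behavior is easy to see by loading one of the delivered files with pandas; a minimal sketch, assuming pandas and pyarrow are installed and one of the Parquet files has been downloaded locally (the file name is hypothetical):

import pandas as pd

# Load one of the Parquet files that Firehose delivered to S3.
df = pd.read_parquet("sn-form-events-sample.parquet")

# Only 'url' is populated; the other six expected columns show up as
# None/NaT/NaN because the incoming record had no matching keys.
print(df[["occurred_at", "account_id", "form_id", "action",
          "url", "visitor_uuid", "properties"]].head())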

In an effort to reject bad data, I attempted to use an AWS Glue schema to handle the record format conversion. I have tried both Avro and Protobuf, and both fail in different ways. I have not tried JSON, because it doesn't seem to support timestamps, which I need.

With the Avro schema, the failed-conversion files uploaded to S3 look like this:

{
  "attemptsMade":1,
  "arrivalTimestamp":1659372890895,
  "lastErrorCode":"DataFormatConversion.InvalidSchema",
  "lastErrorMessage":"The schema is invalid. The specified table has no columns.",
  "attemptEndingTimestamp":1659372956967,
  "rawData":"eyJpZCI6MSwiYWNjb3VudF9pZCI6MSwiZm9ybV9pZCI6MSwidmlzaXRvcl91dWlkIjoiIiwiYWN0aW9uIjoic3VibWl0IiwiY3JlYXRlZF9hdCI6IjIwMTYtMDUtMThUMjI6MTU6MjkuNDIwWiIsInVwZGF0ZWRfYXQiOiIyMDE2LTA1LTE4VDIyOjE1OjI5LjQyMFoiLCJ1cmwiOiIiLCJtYW51YWxseV9vcGVuZWQiOm51bGwsIm9jY3VycmVkX2F0IjoiMjAxNi0wNS0xOFQyMjoxNToyOS40MTlaIn0=",
  "sequenceNumber":"49631940267636444464128680289902809552022038273287258114",
  "subSequenceNumber":null,
  "dataCatalogTable": {
    "catalogId":null,
    "databaseName":"drip-staging-sn-form-events-glue-db",
    "tableName":"drip-staging-sn-form-events-glue-table",
    "region":"us-east-1",
    "versionId":"LATEST",
    "roleArn":"arn:aws:iam::[[my AWS account number]]:role/drip-staging-sn-form-event-firehose-role"
  }
}
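
The rawData field is just the original record, base64-encoded, so the record that failed conversion can be recovered with a couple of lines of Python (a quick sketch; the string is copied verbatim from the error file above):

import base64
import json

raw_data = "eyJpZCI6MSwiYWNjb3VudF9pZCI6MSwiZm9ybV9pZCI6MSwidmlzaXRvcl91dWlkIjoiIiwiYWN0aW9uIjoic3VibWl0IiwiY3JlYXRlZF9hdCI6IjIwMTYtMDUtMThUMjI6MTU6MjkuNDIwWiIsInVwZGF0ZWRfYXQiOiIyMDE2LTA1LTE4VDIyOjE1OjI5LjQyMFoiLCJ1cmwiOiIiLCJtYW51YWxseV9vcGVuZWQiOm51bGwsIm9jY3VycmVkX2F0IjoiMjAxNi0wNS0xOFQyMjoxNToyOS40MTlaIn0="

# Decode and pretty-print the original JSON record that Firehose rejected.
print(json.dumps(json.loads(base64.b64decode(raw_data)), indent=2))

Decoded, that record contains fields like id, created_at, updated_at, and manually_opened that aren't in my desired column list, alongside the ones that are.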

I have no idea why it says it can't find the columns, because when I open up the Glue table in the AWS console, it shows the columns I expect to see, and it says that it pulled the columns from a Glue schema.
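
For what it's worth, the table can also be inspected outside the console; here is a minimal boto3 sketch (assuming default AWS credentials and the database/table names created below) that prints whatever columns the catalog actually exposes:

import boto3

glue = boto3.client("glue", region_name="us-east-1")

# Fetch the catalog table that Firehose's schema_configuration points at.
table = glue.get_table(
    DatabaseName="drip-staging-sn-form-events-glue-db",
    Name="drip-staging-sn-form-events-glue-table",
)

# If Firehose's complaint is accurate, this may come back empty even though
# the console shows columns derived from the registered schema.
print(table["Table"].get("StorageDescriptor", {}).get("Columns", []))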

Next, I tried Protobuf, which failed when applying the Terraform plan with the following error message:

aws_glue_catalog_table.sn_form_events: Modifying... [id=[[my AWS account number]]:drip-staging-sn-form-events-glue-db:drip-staging-sn-form-events-glue-table]
╷
│ Error: Error updating Glue Catalog Table: InvalidInputException: No SchemaDereferencingStrategy found for schema type: PROTOBUF
│
│   with aws_glue_catalog_table.sn_form_events,
│   on drip-firehose-streams.tf line 61, in resource "aws_glue_catalog_table" "sn_form_events":
│   61: resource "aws_glue_catalog_table" "sn_form_events" {

Here's the Terraform file that I'm using to set things up. I've commented out the Protobuf code and the columns within the Glue table; the commented-out versions are what I swap in when trying each of those alternatives. Note that var.resource_prefix evaluates to drip and var.resource_domain evaluates to staging.

data "aws_s3_bucket" "sn_form_event_destination" {
  bucket = "drip-data-mart"
}

resource "aws_cloudwatch_log_group" "sn_form_events" {
  name              = "/aws/kinesisfirehose/${var.resource_prefix}-${var.resource_domain}-sn-form-events"
  retention_in_days = 30
}

resource "aws_cloudwatch_log_stream" "sn_form_events" {
  name           = "DestinationDelivery"
  log_group_name = aws_cloudwatch_log_group.sn_form_events.name
}

data "template_file" "firehose_service_policy_json" {
  template = file("drip-role-firehose-service-policy.tpl")

  vars = {
    log_group_arn      = aws_cloudwatch_log_group.sn_form_events.arn
    s3_bucket_arn      = data.aws_s3_bucket.sn_form_event_destination.arn
    s3_prefix          = local.s3_prefix
    glue_database_name = aws_glue_catalog_database.sn_form_events.name
    glue_table_name    = aws_glue_catalog_table.sn_form_events.name
  }
}

resource "aws_iam_role" "firehose_role" {
  name = "${var.resource_prefix}-${var.resource_domain}-sn-form-event-firehose-role"

  assume_role_policy = <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Action": "sts:AssumeRole",
      "Principal": {
        "Service": "firehose.amazonaws.com"
      },
      "Effect": "Allow",
      "Sid": ""
    }
  ]
}
EOF
}

resource "aws_iam_policy" "firehose_service_policy" {
  name   = "${local.pond_cc_name}SNFormEventFirehoseServicePolicy"
  policy = data.template_file.firehose_service_policy_json.rendered
}

resource "aws_iam_role_policy_attachment" "firehose-role-attach-service-policy" {
  role       = aws_iam_role.firehose_role.id
  policy_arn = aws_iam_policy.firehose_service_policy.arn
}

resource "aws_glue_catalog_database" "sn_form_events" {
  name = "${var.resource_prefix}-${var.resource_domain}-sn-form-events-glue-db"
}

resource "aws_glue_catalog_table" "sn_form_events" {
  name          = "${var.resource_prefix}-${var.resource_domain}-sn-form-events-glue-table"
  database_name = aws_glue_catalog_database.sn_form_events.name

  table_type = "EXTERNAL_TABLE"

  parameters = {
    EXTERNAL              = "TRUE"
    "parquet.compression" = "SNAPPY"
  }

  storage_descriptor {
    location      = "s3://${data.aws_s3_bucket.sn_form_event_destination.id}/${local.s3_prefix}sn-form-events"
    input_format  = "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat"
    output_format = "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat"

    ser_de_info {
      name                  = "${var.resource_prefix}-${var.resource_domain}-sn-form-events"
      serialization_library = "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"

      parameters = {
        "serialization.format" = 1
      }
    }

    schema_reference {
      schema_version_number = 1
      schema_id {
        schema_arn = "arn:aws:glue:us-east-1:[[my AWS account number]]:schema/sn-form-events-glue-schema-registry/sn-form-events-glue-schema"
      }
    }

    # schema_reference {
    #   schema_version_number = 1
    #   schema_id {
    #     schema_arn = "arn:aws:glue:us-east-1:[[my AWS account number]]:schema/sn-form-events-glue-schema-registry/sn_form_events.proto"
    #   }
    # }

    # columns {
    #   name = "occurred_at"
    #   type = "timestamp"
    # }

    # columns {
    #   name = "account_id"
    #   type = "bigint"
    # }

    # columns {
    #   name = "form_id"
    #   type = "string"
    # }

    # columns {
    #   name = "action"
    #   type = "string"
    # }

    # columns {
    #   name = "url"
    #   type = "string"
    # }

    # columns {
    #   name = "visitor_uuid"
    #   type = "string"
    # }

    # columns {
    #   name = "properties"
    #   type = "string"
    # }
  }

  depends_on = [
    aws_glue_catalog_database.sn_form_events,
    # aws_glue_schema.sn_form_events_glue_schema-protobuf
    aws_glue_schema.sn_form_events_glue_schema
  ]
}

resource "aws_glue_registry" "sn_form_events_glue_schema_registry" {
  registry_name = "sn-form-events-glue-schema-registry"
}

resource "aws_glue_schema" "sn_form_events_glue_schema" {
  schema_name       = "sn-form-events-glue-schema"
  registry_arn      = aws_glue_registry.sn_form_events_glue_schema_registry.arn
  data_format       = "AVRO"
  compatibility     = "NONE"
  schema_definition = <<EOF
  {
    "type": "record",
    "name": "sn_form_events_glue_schema",
    "fields": [
      {
        "name": "occurred_at",
        "type": "long",
        "logicalType": "timestamp-millis"
      },
      {
        "name": "account_id",
        "type": "long"
      },
      {
        "name": "form_id",
        "type": "string"
      },
      {
        "name": "action",
        "type": "string"
      },
      {
        "name": "url",
        "type": "string"
      },
      {
        "name": "visitor_uuid",
        "type": "string"
      },
      {
        "name": "properties",
        "type": "string"
      }
    ]
  }
EOF
}

# resource "aws_glue_schema" "sn_form_events_glue_schema-protobuf" {
#   schema_name       = "sn_form_events.proto"
#   registry_arn      = aws_glue_registry.sn_form_events_glue_schema_registry.arn
#   data_format       = "PROTOBUF"
#   compatibility     = "NONE"
#   schema_definition = <<EOF
#   syntax = "proto3";
#   package sn-form-events;

#   import "google/protobuf/timestamp.proto";

#   message sn_form_events{
#     google.protobuf.Timestamp occurred_at = 1;
#     int64 account_id = 2;
#     string form_id = 3;
#     string action = 4;
#     string url = 5;
#     string visitor_uuid = 6;
#     string properties = 7;
#   }
# EOF
# }

resource "aws_kinesis_firehose_delivery_stream" "sn_form_events" {
  destination = "extended_s3"
  name        = "${var.resource_prefix}-${var.resource_domain}-sn-form-events"

  tags = {
    Name = "${var.resource_prefix}-${var.resource_domain}-sn-form-events"
  }

  extended_s3_configuration {
    bucket_arn          = data.aws_s3_bucket.sn_form_event_destination.arn
    prefix              = "${local.s3_prefix}year=!{timestamp:yyyy}/month=!{timestamp:MM}/day=!{timestamp:dd}/"
    buffer_interval     = var.resource_domain == "production" ? 300 : 60 # 5 minutes for production, 1 minute for staging
    buffer_size         = 128                                            # 128 MB
    compression_format  = "UNCOMPRESSED"
    role_arn            = aws_iam_role.firehose_role.arn
    error_output_prefix = "${local.s3_prefix}failed_conversions/year=!{timestamp:yyyy}/month=!{timestamp:MM}/day=!{timestamp:dd}/!{firehose:error-output-type}/"

    cloudwatch_logging_options {
      enabled         = true
      log_group_name  = aws_cloudwatch_log_group.sn_form_events.name
      log_stream_name = aws_cloudwatch_log_stream.sn_form_events.name
    }

    data_format_conversion_configuration {
      enabled = true

      input_format_configuration {
        deserializer {

          open_x_json_ser_de {
            case_insensitive                         = true
            column_to_json_key_mappings              = {}
            convert_dots_in_json_keys_to_underscores = false
          }
        }
      }

      output_format_configuration {
        serializer {

          parquet_ser_de {
            block_size_bytes              = 268435456 # 256MiB - default at time of writing
            compression                   = "SNAPPY"
            enable_dictionary_compression = false
            max_padding_bytes             = 0
            page_size_bytes               = 1048576 # 1MiB - default at time of writing
            writer_version                = "V1"
          }
        }
      }

      schema_configuration {
        database_name = aws_glue_catalog_database.sn_form_events.name
        region        = "us-east-1"
        role_arn      = aws_iam_role.firehose_role.arn
        table_name    = aws_glue_catalog_table.sn_form_events.name
        version_id    = "LATEST"
      }
    }

    processing_configuration {
      enabled = false
    }
  }

  server_side_encryption {
    enabled  = false
    key_type = "AWS_OWNED_CMK"
  }

  depends_on = [
    aws_glue_catalog_table.sn_form_events,
    aws_iam_role.firehose_role,
    aws_glue_schema.sn_form_events_glue_schema
  ]
}

Here's my Terraform version and AWS provider version:

$ cat tf-state.tf
terraform {
  required_version = "~> 1.2.5"

  backend "s3" {}
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 4.23.0"
    }

    template = {
      source  = "hashicorp/template"
      version = "~> 2.2.0"
    }
  }
}

Lastly, I've seen this specific answer to a different but similar question, and it did not resolve my issues: https://stackoverflow.com/a/69440421/3991562

Thanks for any advice! :D

Ben Sandeen
  • And where have you defined the Glue schema resource? – Marko E Aug 02 '22 at 10:55
  • Have you considered configuring Data Transformation (via Lambda) for the Firehose stream? You can implement custom validation logic and explicitly fail processing or drop the event whenever validation fails. – Andrew Nguonly Aug 02 '22 at 13:58
  • @MarkoE The Glue schema resource is defined in the Terraform file I've pasted (the fourth gray text box). You should be able to find it if you CTRL+F (or Cmd+F) for `sn_form_events_glue_schema`. Sorry, I know this is a wall of text, thanks for taking the time to look through it! And @AndrewNguonly I have considered that, but from what I understand, this should be possible to do without a Lambda, right? Otherwise what's the point of a Glue schema? And for that matter, it feels like this should be the default behavior of the Firehose itself. I'll keep Lambdas in mind as a last resort – Ben Sandeen Aug 02 '22 at 14:45
  • Not sure if this is related: https://github.com/awslabs/aws-glue-schema-registry/issues/201#issuecomment-1199804620. And yeah, I did miss the schema definition. :) – Marko E Aug 02 '22 at 14:49
  • Interesting, thanks for the link! That's at least confirmation that using protobufs is not going to work. It's kind of annoying that AWS's own PR announcement seems to contradict this: https://aws.amazon.com/blogs/big-data/introducing-protocol-buffers-protobuf-schema-support-in-amazon-glue-schema-registry/ Maybe I'm not reading it closely enough, though (which is definitely possible) :P – Ben Sandeen Aug 02 '22 at 17:37
  • Update: I noticed some text that indicates that firehose record conversion can only be done on JSON data. I found this text when editing the "Transform and convert records" section, under the heading "Convert record format": ```Data in Apache Parquet or Apache ORC format is typically more efficient to query than JSON. Kinesis Data Firehose can convert your JSON-formatted source records using a schema from a table defined in AWS Glue . For records that aren't in JSON format, create a Lambda function that converts them to JSON in the Transform source records with AWS Lambda section above.``` – Ben Sandeen Aug 02 '22 at 22:23
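
For reference, here is a minimal sketch of the Lambda-based validation that Andrew Nguonly suggests in the comments above. It assumes the standard Firehose data-transformation contract (base64-encoded records in, records with a result of Ok/Dropped/ProcessingFailed out); the expected column names mirror the Glue schema, and the type checks are deliberately rough:

import base64
import json

# The exact set of columns (and rough JSON types) each record must have.
EXPECTED = {
    "occurred_at": str,
    "account_id": int,
    "form_id": str,
    "action": str,
    "url": str,
    "visitor_uuid": str,
    "properties": str,
}

def handler(event, context):
    output = []
    for record in event["records"]:
        try:
            data = json.loads(base64.b64decode(record["data"]))
            valid = (
                isinstance(data, dict)
                and set(data) == set(EXPECTED)
                and all(isinstance(data[key], expected_type)
                        for key, expected_type in EXPECTED.items())
            )
        except (ValueError, TypeError):
            valid = False

        output.append({
            "recordId": record["recordId"],
            # Records marked ProcessingFailed end up under the stream's
            # error_output_prefix instead of being converted to Parquet.
            "result": "Ok" if valid else "ProcessingFailed",
            "data": record["data"],
        })

    return {"records": output}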
