Hudi DeltaStreamer with AWS Glue Data Catalog syncs the database, but not the tables

Question

This is similar to being unable to sync with the AWS Glue Data Catalog when you run spark-submit with Hudi DeltaStreamer, except that here the database is synced and the tables are not.

E.g. you submit:

spark-submit \
  --conf spark.hadoop.hive.metastore.client.factory.class=com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory \
  --deploy-mode cluster \
  --jars /usr/lib/spark/external/lib/spark-avro.jar,/usr/lib/hudi/hudi-spark-bundle.jar,/usr/lib/hudi/hudi-utilities-bundle.jar,/usr/lib/hudi/cli/lib/aws-java-sdk-glue-1.12.397.jar,/usr/lib/hive/auxlib/aws-glue-datacatalog-hive3-client.jar,/usr/lib/hadoop/hadoop-aws.jar,/usr/lib/hadoop/hadoop-aws-3.3.3-amzn-2.jar \
  --conf spark.sql.catalogImplementation=hive \
  --conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog \
  --conf spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension \
  --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer /usr/lib/hudi/hudi-utilities-slim-bundle.jar \
  --table-type COPY_ON_WRITE \
  --source-class org.apache.hudi.utilities.sources.AvroDFSSource \
  --source-ordering-field id \
  --target-base-path s3a://my-bucket/data/my_database/my_target_table/ \
  --sync-tool-classes org.apache.hudi.aws.sync.AwsGlueCatalogSyncTool \
  --props file:///etc/hudi/conf/hudi-defaults.conf \
  --target-table my_target_table \
  --schemaprovider-class org.apache.hudi.utilities.schema.SchemaRegistryProvider \
  --enable-sync \
  --enable-hive-sync

And you see the database synced in Hive, but not the tables:

beeline -u jdbc:hive2://ip-1-1-1-1:10000
Connecting to jdbc:hive2://ip-1-1-1-1:10000

show databases;
+-----------------------------------+
|           database_name           |
+-----------------------------------+
| my_database                       |
+-----------------------------------+

show tables;
+----------------------------------------------------+
|                      tab_name                      |
+----------------------------------------------------+
|                                                    |
+----------------------------------------------------+

Will · Answer 1 · 2023-04-11T20:43:12.320

A couple of things to check:

Check your IAM permissions (if this is the problem, there is usually an error such as a 403), e.g.

        {
          Action = [
            "glue:CreateDatabase",
            "glue:UpdateDatabase",
            "glue:DeleteDatabase",
            "glue:GetDatabase",
            "glue:GetDatabases",
            "glue:CreateTable",
            "glue:UpdateTable",
            "glue:DeleteTable",
            "glue:GetTable",
            "glue:GetTables",
            "glue:GetTableVersions",
            "glue:CreatePartition",
            "glue:BatchCreatePartition",
            "glue:UpdatePartition",
            "glue:DeletePartition",
            "glue:BatchDeletePartition",
            "glue:GetPartition",
            "glue:GetPartitions",
            "glue:BatchGetPartition",
            "glue:CreateUserDefinedFunction",
            "glue:UpdateUserDefinedFunction",
            "glue:DeleteUserDefinedFunction",
            "glue:GetUserDefinedFunction",
            "glue:GetUserDefinedFunctions",
          ]
          Effect = "Allow"
          Resource = [
            "arn:aws:glue:us-west-2:1111111111:table/my_database/*",
            "arn:aws:glue:us-west-2:1111111111:database/my_database",
            "arn:aws:glue:us-west-2:1111111111:catalog",
          ]
        },

You can tell if you have permissions by connecting with beeline:

beeline -u jdbc:hive2://ip-1-1-1-1:10000/my_database

If you get a permission-denied error, find out which role you are running as so that you can modify that role's permissions:

aws sts get-caller-identity

The Arn field in the output shows your assumed role.
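
To know which role to edit, the role name has to be pulled out of that assumed-role ARN. A minimal sketch, assuming the ARN has the usual form arn:aws:sts::<account-id>:assumed-role/<role-name>/<session-name>; the function names role_name_from_arn and current_role_name are made up for this example:

```shell
# Extract the IAM role name from an STS assumed-role ARN.
# Expected input: arn:aws:sts::<account-id>:assumed-role/<role-name>/<session-name>
role_name_from_arn() {
  local arn="$1"
  local rest
  case "$arn" in
    *:assumed-role/*/*)
      rest="${arn#*:assumed-role/}"   # strip prefix -> <role-name>/<session-name>
      printf '%s\n' "${rest%%/*}"     # strip session -> <role-name>
      ;;
    *)
      echo "not an assumed-role ARN: $arn" >&2
      return 1
      ;;
  esac
}

# Hypothetical wrapper: ask STS for the caller ARN, then print the role name.
# Needs the AWS CLI and valid credentials on the machine where it runs.
current_role_name() {
  local arn
  arn="$(aws sts get-caller-identity --query Arn --output text)" || return 1
  role_name_from_arn "$arn"
}
```

Running current_role_name on the cluster node should print the role to look up in IAM; something like aws iam list-attached-role-policies --role-name <that name> can then show which managed policies are attached to it.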

For some reason, I also needed to grant access to the default database for the sync to work:

            "arn:aws:glue:us-west-2:1111111111:table/default/*",
            "arn:aws:glue:us-west-2:1111111111:database/default",
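
Since the same ARN pattern has to be repeated for every database (including default), it can help to generate the list rather than type it. A small sketch; glue_arns_for_database is a hypothetical helper, and the region and account id are the placeholder values used in this answer:

```shell
# Print the Glue resource ARNs a policy needs for one database.
# Arguments: region, account id, database name.
glue_arns_for_database() {
  local region="$1"
  local account="$2"
  local database="$3"
  printf 'arn:aws:glue:%s:%s:catalog\n' "$region" "$account"
  printf 'arn:aws:glue:%s:%s:database/%s\n' "$region" "$account" "$database"
  printf 'arn:aws:glue:%s:%s:table/%s/*\n' "$region" "$account" "$database"
}
```

Calling it once with my_database and once with default gives the full set of entries for the Resource list (the catalog ARN is printed both times and only needs to appear once in the policy).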

Check your /etc/hudi/conf/hudi-defaults.conf file. In my case, the partition settings were wrong (e.g. hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.TimestampBasedKeyGenerator didn't have the right accompanying settings). The .hoodie files appeared, but the table did not appear in the AWS Glue Data Catalog. To test, I changed the partition field to something simple but terrible for performance (e.g. id) and verified that the AWS Glue Data Catalog sync worked, which ruled out permission issues. Then I went back to adjusting my Hudi configuration.
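
For the "simple partition" test, the overrides might look roughly like the fragment below. This is only a sketch of the partition-related keys, not my actual file: write_debug_partition_props is a made-up helper, the choice of SimpleKeyGenerator and MultiPartKeysValueExtractor is an assumption, and the rest of your hudi-defaults.conf (record key, source paths, schema registry URL, and so on) is assumed to be set already.

```shell
# Write a throwaway properties fragment that partitions on one plain column.
# Arguments: output file, column name (e.g. id). Values are placeholders.
write_debug_partition_props() {
  local out_file="$1"
  local column="$2"
  cat > "$out_file" <<EOF
hoodie.datasource.write.partitionpath.field=${column}
hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.SimpleKeyGenerator
hoodie.datasource.hive_sync.partition_fields=${column}
hoodie.datasource.hive_sync.partition_extractor_class=org.apache.hudi.hive.MultiPartKeysValueExtractor
EOF
}
```

Merge the generated lines into a copy of the props file, point --props at that copy, and rerun the job. If aws glue get-tables --database-name my_database then lists the table, permissions are fine and the original key generator settings are the thing to fix.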
