The task was as follows: build a near-real-time pipeline on AWS infrastructure. Given NoSQL (DynamoDB) and SQL (RDS) sources (S3-stored data may be added in the future), we need to combine them into tables used by the analytics team. The problems we faced:
- The target table for analytics must handle constant updates
- DynamoDB and RDS CDC logs arrive in different formats
- The target table format should integrate easily with AWS Athena (used for analytics queries)
Our final ETL pipeline looks like this:
- DynamoDB tables with Kinesis Data Streams enabled
- DMS dumps CDC logs from the RDS databases to Kinesis Data Streams
- A Glue job does all the needed cleaning and transforms with the help of Spark Structured Streaming (see the sketch below)
- Hudi is used as the target table format, because it showed the most efficient results for constant updates (compared with Iceberg and Delta Lake)

Glue version: 3.0
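For reference, the core of the Glue job looks roughly like this. It's a minimal sketch, not our exact code: the stream name, schema, S3 paths, and database/table names are placeholders, the Kinesis source options follow the connector bundled with Glue, and it assumes the Hudi libraries are attached to the job (Glue 3.0 doesn't ship them natively):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StringType, StructField, StructType, TimestampType

spark = (SparkSession.builder
         .appName("cdc-to-hudi")
         # Hudi requires Kryo serialization
         .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
         .getOrCreate())

# Schema we expect inside each Kinesis record (hypothetical CDC payload)
schema = StructType([
    StructField("id", StringType()),
    StructField("name", StringType()),
    StructField("updated_at", TimestampType()),
])

# Read the raw stream; Glue provides a "kinesis" source for Structured Streaming
raw = (spark.readStream.format("kinesis")
       .option("streamName", "rds-cdc-stream")  # placeholder name
       .option("endpointUrl", "https://kinesis.us-east-1.amazonaws.com")
       .option("startingPosition", "TRIM_HORIZON")
       .load())

# Kinesis records arrive as binary in the `data` column
parsed = (raw.select(from_json(col("data").cast("string"), schema).alias("r"))
             .select("r.*"))

hudi_options = {
    "hoodie.table.name": "analytics_target",
    "hoodie.datasource.write.recordkey.field": "id",
    "hoodie.datasource.write.precombine.field": "updated_at",
    "hoodie.datasource.write.operation": "upsert",  # constant updates -> upsert
    "hoodie.datasource.write.table.type": "COPY_ON_WRITE",
    # Sync the table into the Glue Data Catalog so Athena can query it
    "hoodie.datasource.hive_sync.enable": "true",
    "hoodie.datasource.hive_sync.database": "analytics",
    "hoodie.datasource.hive_sync.table": "analytics_target",
}

def upsert_to_hudi(batch_df, batch_id):
    # Hudi is written per micro-batch via foreachBatch
    (batch_df.write.format("hudi")
     .options(**hudi_options)
     .mode("append")
     .save("s3://my-bucket/hudi/analytics_target/"))  # placeholder path

(parsed.writeStream
 .foreachBatch(upsert_to_hudi)
 .option("checkpointLocation", "s3://my-bucket/checkpoints/analytics_target/")
 .start()
 .awaitTermination())
```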
The final solution works fine, but there are some questionable design decisions, like:
- using custom Spark Structured Streaming code instead of Glue streaming (for now Glue only supports one streaming source per job, but we have several Kinesis streams)
- combining several streams is quite tricky in Spark Structured Streaming, e.g. performing joins and aggregations (see the join sketch below)
- also, Flink is considered to be faster than Spark, but in Glue we are bound to Spark transformations
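To illustrate the "tricky" part: a stream-stream join in Structured Streaming requires watermarks on both sides and a time-bounded join condition, roughly like this (column names are hypothetical; `dynamo_df` and `rds_df` stand for the parsed streams from the two Kinesis sources):

```python
from pyspark.sql.functions import expr

# Both sides of a stream-stream join need watermarks
dynamo = dynamo_df.withWatermark("event_time", "10 minutes").alias("d")
rds = rds_df.withWatermark("event_time", "10 minutes").alias("r")

# The join condition must be time-bounded, otherwise Spark keeps state forever
joined = dynamo.join(
    rds,
    expr("""
        d.user_id = r.user_id AND
        r.event_time BETWEEN d.event_time - interval 5 minutes
                         AND d.event_time + interval 5 minutes
    """),
    "inner",
)
```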
So I'm wondering whether Kinesis Data Analytics (Flink) writing to the target Hudi table could be a better solution - faster, easier, and more flexible (for example, when new data sources are added)?
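For comparison, here is roughly what I imagine the Flink version would look like, e.g. as a PyFlink application on Kinesis Data Analytics. This is only a sketch under the assumption that the Flink Kinesis and Hudi connector JARs are available to the application; stream names, region, and paths are placeholders:

```python
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Source: CDC events from a Kinesis stream (Flink Kinesis SQL connector)
t_env.execute_sql("""
    CREATE TABLE cdc_source (
        id STRING,
        name STRING,
        updated_at TIMESTAMP(3)
    ) WITH (
        'connector' = 'kinesis',
        'stream' = 'rds-cdc-stream',
        'aws.region' = 'us-east-1',
        'scan.stream.initpos' = 'TRIM_HORIZON',
        'format' = 'json'
    )
""")

# Sink: Hudi table on S3 (Hudi Flink connector); the primary key
# plays the role of the Hudi record key
t_env.execute_sql("""
    CREATE TABLE hudi_target (
        id STRING,
        name STRING,
        updated_at TIMESTAMP(3),
        PRIMARY KEY (id) NOT ENFORCED
    ) WITH (
        'connector' = 'hudi',
        'path' = 's3://my-bucket/hudi/analytics_target/',
        'table.type' = 'MERGE_ON_READ',
        'write.precombine.field' = 'updated_at'
    )
""")

# Continuous upsert from the stream into the Hudi table
t_env.execute_sql("INSERT INTO hudi_target SELECT * FROM cdc_source")
```

If that holds up in practice, adding a new data source would mostly mean one more `CREATE TABLE` plus an `INSERT`, which is part of why the Flink route looks attractive to me.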