Connect Redshift Spectrum / AWS EMR with Hudi directly or via AWS Glue Data Catalog

Question

I'm trying to understand how to properly connect Redshift Spectrum with Hudi data.

It looks like I can create a Redshift external table directly for data managed in Apache Hudi, as described in this documentation: https://docs.aws.amazon.com/redshift/latest/dg/c-spectrum-external-tables.html. The other way is to integrate Hudi with the AWS Glue Data Catalog, as mentioned here: https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hudi-how-it-works.html, and then access the Hudi tables from Redshift Spectrum via the AWS Glue Data Catalog.
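
To make the two options concrete, here is a minimal sketch of the statements each route would involve, written as Python string constants so they can be handed to any Redshift client. Every schema, database, table, column, bucket, and IAM role name is a made-up placeholder. The SerDe and input/output format class names are believed to match what the linked Redshift page gives for Hudi tables, so verify them there. Note that even the "direct" route places the table in an external schema, which is itself backed by a catalog.

```python
# Sketch only: all identifiers (schema, database, table, columns, bucket,
# IAM role ARN) are hypothetical placeholders.

# Needed by both routes: an external schema in Redshift that points at a
# database in the AWS Glue Data Catalog.
EXTERNAL_SCHEMA_DDL = """
CREATE EXTERNAL SCHEMA spectrum_schema
FROM DATA CATALOG
DATABASE 'hudi_db'
IAM_ROLE 'arn:aws:iam::123456789012:role/ExampleSpectrumRole';
""".strip()

# "Direct" route: declare the Hudi table by hand from Redshift.
# Class names should be checked against the Redshift documentation.
DIRECT_HUDI_DDL = """
CREATE EXTERNAL TABLE spectrum_schema.orders_hudi (
    order_id   BIGINT,
    status     VARCHAR(32),
    updated_at TIMESTAMP
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS
    INPUTFORMAT 'org.apache.hudi.hadoop.HoodieParquetInputFormat'
    OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3://example-bucket/hudi/orders/';
""".strip()


def statements_for(route):
    """Return the SQL statements to run in Redshift for the chosen route."""
    if route == "direct":
        # Create the schema, then define the table manually.
        return [EXTERNAL_SCHEMA_DDL, DIRECT_HUDI_DDL]
    if route == "glue":
        # Only the schema is needed; the table definitions are expected to
        # already exist in the Glue database (for example, synced by Hudi).
        return [EXTERNAL_SCHEMA_DDL]
    raise ValueError(f"unknown route: {route!r}")
```

In the "glue" route, the table is defined once in the catalog and becomes visible to every engine that reads that catalog, whereas the "direct" route means maintaining the table definition by hand.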

I have the same need for Apache Spark on AWS EMR. It looks like I can use Hudi directly from EMR or via the AWS Glue Data Catalog.

Right now, I don't understand which way to choose. Could you please advise what the benefit is of using Hudi via the AWS Glue Data Catalog, or should I use it directly from Redshift Spectrum and AWS EMR?

Missing the point about the catalog. Bugs could also force your hand. – thebluephantom Sep 12 '21 at 14:31
Could you add apache-spark as a tag, since Glue is by and large also Spark (serverless)? – thebluephantom Aug 01 '22 at 07:17

score 3 · Accepted Answer · answered Sep 12 '21 at 18:13

Spark on EMR needs a catalog (a Hive metastore, if you will), so using the AWS Glue Data Catalog is an option.

If you elect to use Glue as the metastore, then use it as the source for all data, unless errors are evident, in which case use the Hudi API for Spark.
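
As a rough illustration of the Glue route from Spark, the sketch below builds the Hudi write options that upsert a table and sync its definition to the metastore. It assumes the EMR cluster is already configured to use the AWS Glue Data Catalog as its Hive metastore, and a Hudi version that supports `hoodie.datasource.hive_sync.mode` (older releases used `hoodie.datasource.hive_sync.use_jdbc` set to `false` instead). The function name and all database, table, and column names are hypothetical.

```python
def hudi_glue_sync_options(database, table, record_key, precombine_field, partition_field):
    """Build Hudi write options for an upsert with metastore sync enabled.

    Hypothetical helper: the option keys are Hudi configuration names,
    the arguments are placeholders supplied by the caller.
    """
    return {
        # Core write settings.
        "hoodie.table.name": table,
        "hoodie.datasource.write.operation": "upsert",
        "hoodie.datasource.write.recordkey.field": record_key,
        "hoodie.datasource.write.precombine.field": precombine_field,
        "hoodie.datasource.write.partitionpath.field": partition_field,
        # Sync the table definition to the Hive-compatible metastore,
        # which is Glue when the cluster is configured that way (assumption).
        "hoodie.datasource.hive_sync.enable": "true",
        "hoodie.datasource.hive_sync.mode": "hms",
        "hoodie.datasource.hive_sync.database": database,
        "hoodie.datasource.hive_sync.table": table,
        "hoodie.datasource.hive_sync.partition_fields": partition_field,
    }


opts = hudi_glue_sync_options(
    database="hudi_db",
    table="orders",
    record_key="order_id",
    precombine_field="updated_at",
    partition_field="order_date",
)
```

With a Spark DataFrame `df`, these options would typically be passed as `df.write.format("hudi").options(**opts).mode("append").save("s3://example-bucket/hudi/orders/")`. That call needs a Spark session with the Hudi bundle available, so it is left out of the runnable sketch.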

thebluephantom


Thanks for your answer. What errors do you mean? – alexanoid Sep 12 '21 at 18:16
If you browse around, there are some issues to be seen. Either way, we are in Hudi or Databricks Delta territory. – thebluephantom Sep 12 '21 at 18:23
I see, thanks. Right now I'm doing R&D in order to unload cold data from our Redshift instance. This is why I'm looking for something like AWS S3 + Apache Hudi for upserts over S3. – alexanoid Sep 12 '21 at 18:26
https://awsfeed.com/whats-new/big-data/new-features-from-apache-hudi-available-in-amazon-emr They have invested a lot. – thebluephantom Sep 12 '21 at 18:40
You do not need Spark for that task. – thebluephantom Sep 12 '21 at 18:41
Unloading existing data from Redshift is only one side of the full picture. We have a set of ETL (Apache Spark) jobs which should also start using Hudi for old/new data. – alexanoid Sep 12 '21 at 18:57
Then use Glue, as I say. – thebluephantom Sep 12 '21 at 19:10
