
I have an ETL pipeline where data comes from Redshift, is read into (py)spark dataframes, calculations are performed, and the result is written back to a target table in Redshift. So the flow is: Redshift source schema --> Spark 3.0 --> Redshift target schema. This runs on EMR using the spark-redshift library provided by Databricks. But my tables have millions of records, and doing a full load every time is not a good option.
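For context, this is roughly what the current full load looks like (a minimal sketch; the connection URL, table names, and S3 temp dir below are placeholders, not my real values):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("redshift-etl").getOrCreate()

# Placeholder connection details
jdbc_url = "jdbc:redshift://my-cluster.xxxx.us-east-1.redshift.amazonaws.com:5439/dev?user=etl&password=***"
temp_dir = "s3://my-bucket/spark-redshift-temp/"

# Full read of the source table via the spark-redshift connector
df = (spark.read
      .format("com.databricks.spark.redshift")
      .option("url", jdbc_url)
      .option("dbtable", "source_schema.events")
      .option("tempdir", temp_dir)
      .option("forward_spark_s3_credentials", "true")
      .load())

# ... transformations / calculations on df ...
result = df  # placeholder for the computed result

# Full overwrite of the target table every run
(result.write
 .format("com.databricks.spark.redshift")
 .option("url", jdbc_url)
 .option("dbtable", "target_schema.events_agg")
 .option("tempdir", temp_dir)
 .option("forward_spark_s3_credentials", "true")
 .mode("overwrite")
 .save())
```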

How can I perform incremental loads/upserts with the spark-redshift library? The option I wanted to go with is Delta Lake (open source and ACID-compliant), but we cannot simply read and write Delta files to Redshift Spectrum using the Delta Lake integration. A sketch of the behaviour I am after is below.
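To make the goal concrete, this is roughly the incremental-read plus merge-on-key behaviour I would like to end up with. It is only a sketch: the `id` key, the `updated_at` watermark column, and the staging table name are hypothetical, and I am not sure whether combining the connector's `query` and `postactions` options like this is the right approach.

```python
# Hypothetical watermark: only pull rows newer than the last successful run
last_run_ts = "2021-01-01 00:00:00"
incremental_query = f"""
    SELECT * FROM source_schema.events
    WHERE updated_at > '{last_run_ts}'
"""

new_rows = (spark.read
            .format("com.databricks.spark.redshift")
            .option("url", jdbc_url)
            .option("query", incremental_query)
            .option("tempdir", temp_dir)
            .option("forward_spark_s3_credentials", "true")
            .load())

# ... calculations on new_rows ...

# Merge-on-key via a staging table: write the increment to a staging table,
# then let postactions delete matching keys from the target, insert the
# staged rows, and drop the staging table.
merge_sql = """
    DELETE FROM target_schema.events_agg
    USING target_schema.events_agg_staging s
    WHERE target_schema.events_agg.id = s.id;
    INSERT INTO target_schema.events_agg
    SELECT * FROM target_schema.events_agg_staging;
    DROP TABLE target_schema.events_agg_staging;
"""

(new_rows.write
 .format("com.databricks.spark.redshift")
 .option("url", jdbc_url)
 .option("dbtable", "target_schema.events_agg_staging")
 .option("tempdir", temp_dir)
 .option("forward_spark_s3_credentials", "true")
 .option("postactions", merge_sql)
 .mode("overwrite")
 .save())
```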

Please guide me on how I can achieve this, and let me know if there are any alternatives.
