
Background: We use Fivetran to ingest data from multiple sources into AWS Redshift. We run ETL scripts on top of these raw tables to create a second layer of more relevant tables, and we have final views/tables built on top of that second layer.

Each of these new tables can either be independent (reading only from the root tables) or rely on other derived tables.

This is where the problem arises: there are too many interconnections between views/tables.

Lastly, I've got a basic Flask app which uses RQ Scheduler to queue jobs and run the ETL scripts at set times.

Problem: As we grow, we are seeing a drop in database performance (increased execution times) and serialisation errors on different tables.

Question: What's the best way to structure our infrastructure?

AIViz

1 Answer


You need to step back and look at the dependencies in your updates. If you are getting serialization errors, you have some circular dependencies that haven't been thought through (very common). Let's define some phases of data update within Redshift for each update cycle. (Your phase definitions may vary, but what is important is to think through these dependencies.)

First, phase 0 is the data load for this cycle (N). Load data (COPY) into staging tables. This depends only on external information. Staging tables are phase 0 tables; phase 0 tables only depend on external data (and these may be temp tables in reality).
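A minimal sketch of a phase 0 load, run from Python the way the question's Flask/RQ jobs would; the table name, S3 path, and IAM role below are placeholders, not details from the question:

```python
import psycopg2

# Placeholder connection details -- replace with your own secrets handling.
conn = psycopg2.connect(host="my-cluster.example.com", port=5439,
                        dbname="analytics", user="etl_user", password="...")

PHASE0_LOAD = """
    TRUNCATE staging.orders_raw;
    COPY staging.orders_raw
    FROM 's3://my-bucket/exports/orders/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy-role'
    FORMAT AS JSON 'auto';
"""

with conn, conn.cursor() as cur:
    cur.execute(PHASE0_LOAD)   # phase 0 reads external data only
```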

Phase 1 - process & sanitize incoming data and apply this information to phase 1 tables. If any of these phase 1 tables need information from other phase 1, 2, 3 (etc.) tables, then that information needs to be defined as coming from the N-1 (previous) cycle of the data update.
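For instance, if a phase 1 table needs something computed in phase 2, it should read the previous cycle's snapshot of that table; a hedged sketch with made-up names:

```python
# Hypothetical phase 1 step: enrich incoming customers with lifetime spend
# that phase 2 computed in the PREVIOUS cycle (an N-1 snapshot), never the
# phase 2 table being rebuilt in this cycle -- that is what breaks the loop.
PHASE1_UPSERT = """
    INSERT INTO core.customers (customer_id, email, lifetime_spend, loaded_at)
    SELECT s.customer_id,
           s.email,
           COALESCE(f.lifetime_spend, 0),
           GETDATE()
    FROM staging.customers_raw AS s
    LEFT JOIN core.customer_facts_prev AS f   -- N-1 snapshot, read-only here
           ON f.customer_id = s.customer_id;
"""
```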

Phase 2 - the update of fact tables (phase 2 tables). These can only depend on phase 0 or 1 tables from this cycle; if these updates need phase 2 or 3 information, then that comes from the N-1 versions of the data.

Phase 3 - update of derived / summary tables (phase 3 tables). These can depend on any previous phase of this cycle.
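And a phase 3 summary can freely read anything built earlier in the same cycle; another made-up example:

```python
# Hypothetical phase 3 step: rebuild a daily revenue summary from the phase 2
# fact table produced earlier in THIS cycle (any earlier phase is fair game).
PHASE3_SUMMARY = """
    TRUNCATE mart.daily_revenue;
    INSERT INTO mart.daily_revenue (order_date, revenue)
    SELECT order_date, SUM(amount)
    FROM core.order_facts
    GROUP BY order_date;
"""
```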

Again, your definition of phases can be different. What is important is that you define a linear flow of information for each update cycle. No loops! If you need to use some later-phase information, it needs to be defined as N-1 (previous data update) information. If something cannot be defined as an N-1 version, then you may need to add a phase to the flow.
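One way to make the "no loops" rule enforceable is to keep the in-cycle dependencies as a small graph and refuse to schedule if a cycle sneaks in; a sketch with a made-up dependency map:

```python
from graphlib import TopologicalSorter, CycleError   # stdlib, Python 3.9+

# Made-up dependency map: each table -> the tables it reads IN THIS CYCLE.
# N-1 reads are deliberately left out -- they are not edges in the cycle check.
deps = {
    "staging.orders_raw": set(),
    "core.orders": {"staging.orders_raw"},
    "core.order_facts": {"core.orders"},
    "mart.daily_revenue": {"core.order_facts"},
}

try:
    run_order = list(TopologicalSorter(deps).static_order())
    print("Run order:", run_order)
except CycleError as exc:
    # A cycle here is exactly the kind of dependency that shows up later
    # as serialization errors once the updates run concurrently.
    print("Circular dependency:", exc.args[1])
```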

Removing the loops will remove the serialization errors, as long as each phase only runs AFTER the previous phase is complete (and committed). This will also address any performance issues arising from locks and stalls. Within each phase, every update SQL statement can run in parallel, since all of its input tables have already been updated (either the N or N-1 versions).
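A minimal orchestration sketch of that rule (the SQL strings and connection details are placeholders): each phase's statements run concurrently, and the next phase starts only after every statement in the current phase has committed.

```python
import psycopg2
from concurrent.futures import ThreadPoolExecutor

def get_connection():
    # Placeholder credentials -- swap in your own secrets handling.
    return psycopg2.connect(host="my-cluster.example.com", port=5439,
                            dbname="analytics", user="etl_user", password="...")

def run_statement(sql):
    # One connection per statement so each statement commits on its own.
    conn = get_connection()
    try:
        with conn, conn.cursor() as cur:   # commits on success, rolls back on error
            cur.execute(sql)
    finally:
        conn.close()

# Made-up phases: each inner list holds one phase's independent update statements.
PHASES = [
    ["COPY staging.orders_raw ...", "COPY staging.customers_raw ..."],   # phase 0
    ["INSERT INTO core.customers ..."],                                  # phase 1
    ["INSERT INTO core.order_facts ..."],                                # phase 2
    ["INSERT INTO mart.daily_revenue ..."],                              # phase 3
]

for statements in PHASES:
    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = [pool.submit(run_statement, s) for s in statements]
    # The `with` block waits for every statement; .result() re-raises any failure,
    # so a broken statement stops the run before the next phase starts.
    for f in futures:
        f.result()
```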

Once you have this linear flow defined, you can see which SQL statements are the long poles in the tent for each phase and attack any performance issues in those statements. As data grows, the impact of inefficiencies grows and new issues will arise.
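To find those long poles, Redshift's own system tables are usually enough; something along these lines pulls the slowest recent statements (the 24-hour window is arbitrary):

```python
# Hedged example: the 20 slowest statements over the last day, pulled from
# Redshift's stl_query system table.
SLOWEST_QUERIES = """
    SELECT query,
           DATEDIFF(seconds, starttime, endtime) AS run_seconds,
           TRIM(querytxt) AS sql_text
    FROM stl_query
    WHERE starttime > DATEADD(hour, -24, GETDATE())
    ORDER BY run_seconds DESC
    LIMIT 20;
"""
```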

Now, if your database has multiple independent data domains (dependency trees), you can break these apart and run them independently, but be careful to ensure that things really are independent. This can also be done for only the later phases if there is an advantage to that: common phase 0 and 1, but split phase 2 and 3.

So I'd say that, first, you have a data lineage issue, and this needs to be addressed before looking at infrastructure. Infrastructure/code may (will) need to be addressed as well, but first things first.

Bill Weiner
  • I've got all my loading procedures sorted in the right order now, and I've grouped them into 5 batches which I run as cron jobs. How can I ensure they run in the right order, i.e. Batch 1 > Batch 2 > Batch 3 > Batch 4 > Batch 5, without too much time delay between them? The other issue is that batch 2 doesn't know whether batch 1 is still running or has finished successfully. Having too much of a gap between jobs doesn't help. I did a poor job of explaining, but hope you get the gist. As always, thanks for your help Bill – AIViz Nov 24 '22 at 09:14
  • It sounds like you are running these steps from an EC2 instance as just chunks of SQL (possibly using the psql CLI). Is this right? If so, you could code up the interlocks you need or find a tool that does this. Another way to go (which I like to use) is AWS Step Functions. This is an in-the-cloud state machine that can be used to manage the steps, but it may be a bridge too far, as you will need to move to the Redshift Data API or some messaging from your EC2 to the state machine. A smaller lift is likely AWS Data Pipeline, as this can be set up to just run batches of SQL and handle errors. – Bill Weiner Nov 24 '22 at 16:38
  • I have built a basic Flask app which uses cron jobs to run Redshift SQL calls at fixed times. All of this is hosted on Kubernetes. Can you please explain what you mean by interlocks and how I can code them up? Looking at AWS Step Functions, it seems like a viable option. Essentially I've got 5 stored procedures and I want to run them linearly, i.e. run Stored Proc 1, and as soon as it finishes successfully, run Stored Proc 2... – AIViz Nov 24 '22 at 21:14
  • Basically just as you describe - "run them linearly, i.e. Run Stored Proc 1, as soon it finishes successfully, run Stored Proc 2". Then you may want to run some action if one step doesn't complete successfully - error recovery. – Bill Weiner Nov 25 '22 at 03:43
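For the comment thread above, the "interlock" can be as small as a single script that calls the five procedures in order, treats any failure as a stop signal, and is scheduled as one job instead of five; procedure names, connection details, and error handling below are illustrative only:

```python
import psycopg2

# Hypothetical stored procedure names, listed in run order.
BATCH_PROCS = [
    "etl.run_batch_1", "etl.run_batch_2", "etl.run_batch_3",
    "etl.run_batch_4", "etl.run_batch_5",
]

def run_batches():
    # Placeholder connection details -- swap in your own secrets handling.
    conn = psycopg2.connect(host="my-cluster.example.com", port=5439,
                            dbname="analytics", user="etl_user", password="...")
    conn.autocommit = True            # each CALL commits on its own
    try:
        with conn.cursor() as cur:
            for proc in BATCH_PROCS:
                cur.execute(f"CALL {proc}();")   # blocks until the proc finishes
    except Exception:
        # A failed batch stops the chain here -- later batches never start.
        # This is the place to log, alert, or kick off error recovery.
        raise
    finally:
        conn.close()

if __name__ == "__main__":
    run_batches()   # schedule this single entry point with RQ/cron instead of 5 jobs
```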