In this hypothetical example, we have a data flow across multiple engineering teams in an ecommerce company. These teams deliver services, produce data, and consume data at different points of the flow.
For example (there's a rough sketch of this flow after the list):
- 'Team Orders' maintains the Orders database and interfaces
- 'Team Traffic' generates web traffic data
- 'Team Warehouse' maintains the data warehouse
- 'Team Traffic' depends on the 'Team Orders' service to retrieve order data and associate it with web traffic
- 'Team Warehouse' depends on 'Team Traffic' data to build its DW tables
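To make the flow concrete, here's a minimal sketch that models these dependencies as a graph. The service/dataset names are invented for illustration:

```python
# Minimal sketch of the data flow above as a dependency graph.
# Names are hypothetical; each entry lists the upstream things a node depends on.
DATA_FLOW = {
    "team-orders/orders-service": [],                        # owns the Orders db
    "team-traffic/web-traffic-data": ["team-orders/orders-service"],
    "team-warehouse/dw-tables": ["team-traffic/web-traffic-data"],
}

def upstream_of(node: str, graph=DATA_FLOW) -> set[str]:
    """Return every upstream dependency of `node`, transitively."""
    seen: set[str] = set()
    stack = list(graph.get(node, []))
    while stack:
        dep = stack.pop()
        if dep not in seen:
            seen.add(dep)
            stack.extend(graph.get(dep, []))
    return seen

# upstream_of("team-warehouse/dw-tables")
# -> {"team-traffic/web-traffic-data", "team-orders/orders-service"}
```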
Imagine that 'Team Orders' hits a db issue (load, latency, whatever). Their monitoring system alerts an engineer, who starts investigating the db issue.
In the meantime, 'Team Traffic' has also been alerted, as they see a spike in bad responses. They start investigating, quickly realise the issue is with the 'Team Orders' service, and raise a ticket to 'Team Orders'.
Downstream from all of this, 'Team Warehouse' is receiving bad data. Their DW monitoring alerts them to the variance, so they start looking for the root cause.
The problem is that we now have at least three engineers investigating the same underlying issue, and they might not even be aware that the other teams are doing the same thing.
An important point is that all three teams use different monitoring and alerting systems: 'Team Orders' is monitoring for db server issues, 'Team Traffic' for bad responses from the orders service, and 'Team Warehouse' for variance in record counts.
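To illustrate that mismatch, here's a rough sketch of the three checks, with invented metric names and thresholds; each team is watching a different signal for what turns out to be the same root cause:

```python
# Hypothetical checks, one per team. Metric names and thresholds are made up.

def orders_db_check(p99_query_latency_ms: float) -> bool:
    # 'Team Orders': infrastructure-level signal on the database itself
    return p99_query_latency_ms > 500

def traffic_api_check(error_rate: float) -> bool:
    # 'Team Traffic': service-level signal on calls to the orders service
    return error_rate > 0.05

def warehouse_load_check(loaded_rows: int, expected_rows: int) -> bool:
    # 'Team Warehouse': data-quality signal on the DW load (record-count variance)
    return abs(loaded_rows - expected_rows) / expected_rows > 0.10

# During the incident all three fire, in three separate alerting systems,
# for what is really one root cause in the Orders db.
```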
There are other approaches: alerting only at the top of the pipeline (and suppressing downstream escalations), or alerting only at the bottom of the pipeline and escalating to the upstream systems.
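As a rough sketch of the first approach (alert at the top, suppress downstream): before paging a team, a hypothetical alert router could check whether anything upstream already has an open incident and, if so, attach to it instead of paging. The graph, node names, and incident store here are all invented for illustration:

```python
# Flattened upstream map (each node lists everything upstream of it).
UPSTREAM = {
    "orders-service": [],
    "web-traffic-data": ["orders-service"],
    "dw-tables": ["web-traffic-data", "orders-service"],
}

open_incidents: dict[str, str] = {}  # node -> open incident id

def route_alert(node: str, alert: str) -> str:
    # If anything upstream already has an open incident, the root cause is
    # probably there: link this alert to that incident instead of paging.
    for dep in UPSTREAM.get(node, []):
        if dep in open_incidents:
            return f"attached '{alert}' to incident {open_incidents[dep]} on {dep}"
    incident_id = f"INC-{len(open_incidents) + 1}"
    open_incidents[node] = incident_id
    return f"paged owner of {node}: {incident_id}"

# Example: the orders db alert opens INC-1; the later traffic and
# warehouse alerts attach to it rather than paging two more engineers.
print(route_alert("orders-service", "db p99 latency high"))
print(route_alert("web-traffic-data", "spike in bad responses from orders-service"))
print(route_alert("dw-tables", "record count variance"))
```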
Are there any best practices, white papers, or engineering solutions I can research to understand the different ways to alert and escalate data issues across multiple eng/support teams?