In this hypothetical example, we have a data flow across multiple engineering teams in an ecommerce company. These teams deliver services, produce data, and consume data at different points of the flow.
For example (there's a rough sketch of this flow after the list):
- 'Team Orders' maintains the Orders database and interfaces
- 'Team Traffic' generates web traffic data
- 'Team Warehouse' maintains the data warehouse
- 'Team Traffic' depends on the 'Team Orders' service to retrieve order data and associate it with web traffic
- 'Team Warehouse' depends on 'Team Traffic' data to build its DW tables
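To make the flow concrete, here's a minimal sketch that models these dependencies as a graph. The service/dataset names are invented for illustration:

```python
# Minimal sketch of the data flow above as a dependency graph.
# Names are hypothetical; each entry lists the upstream things a node depends on.
DATA_FLOW = {
    "team-orders/orders-service": [],                        # owns the Orders db
    "team-traffic/web-traffic-data": ["team-orders/orders-service"],
    "team-warehouse/dw-tables": ["team-traffic/web-traffic-data"],
}

def upstream_of(node: str, graph=DATA_FLOW) -> set[str]:
    """Return every upstream dependency of `node`, transitively."""
    seen: set[str] = set()
    stack = list(graph.get(node, []))
    while stack:
        dep = stack.pop()
        if dep not in seen:
            seen.add(dep)
            stack.extend(graph.get(dep, []))
    return seen

# upstream_of("team-warehouse/dw-tables")
# -> {"team-traffic/web-traffic-data", "team-orders/orders-service"}
```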
Imagine that 'Team Orders' hits a db issue (load, latency, whatever). Their monitoring system alerts an engineer, who starts investigating the db issue.
In the meantime, 'Team Traffic' has also been alerted, as they see a spike in bad responses. They start investigating, quickly realise the issue is with the 'Team Orders' service, and raise a ticket to 'Team Orders'.
Downstream from all of this, 'Team Warehouse' is receiving bad data. Their DW monitoring alerts them to the variance, so they start looking for the root cause.
The problem is that we now have at least three engineers investigating the same underlying issue, and they might not even be aware that the other teams are doing the same thing.
An important point is that all three teams use different monitoring and alerting systems: 'Team Orders' is monitoring for db server issues, 'Team Traffic' for bad responses from the orders service, and 'Team Warehouse' for variance in record counts.
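To illustrate that mismatch, here's a rough sketch of the three checks, with invented metric names and thresholds; each team is watching a different signal for what turns out to be the same root cause:

```python
# Hypothetical checks, one per team. Metric names and thresholds are made up.

def orders_db_check(p99_query_latency_ms: float) -> bool:
    # 'Team Orders': infrastructure-level signal on the database itself
    return p99_query_latency_ms > 500

def traffic_api_check(error_rate: float) -> bool:
    # 'Team Traffic': service-level signal on calls to the orders service
    return error_rate > 0.05

def warehouse_load_check(loaded_rows: int, expected_rows: int) -> bool:
    # 'Team Warehouse': data-quality signal on the DW load (record-count variance)
    return abs(loaded_rows - expected_rows) / expected_rows > 0.10

# During the incident all three fire, in three separate alerting systems,
# for what is really one root cause in the Orders db.
```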
There are other approaches: alerting only at the top of the pipeline (and suppressing downstream escalations), or alerting only at the bottom of the pipeline and escalating to the upstream systems.
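As a rough sketch of the first approach (alert at the top, suppress downstream): before paging a team, a hypothetical alert router could check whether anything upstream already has an open incident and, if so, attach to it instead of paging. The graph, node names, and incident store here are all invented for illustration:

```python
# Flattened upstream map (each node lists everything upstream of it).
UPSTREAM = {
    "orders-service": [],
    "web-traffic-data": ["orders-service"],
    "dw-tables": ["web-traffic-data", "orders-service"],
}

open_incidents: dict[str, str] = {}  # node -> open incident id

def route_alert(node: str, alert: str) -> str:
    # If anything upstream already has an open incident, the root cause is
    # probably there: link this alert to that incident instead of paging.
    for dep in UPSTREAM.get(node, []):
        if dep in open_incidents:
            return f"attached '{alert}' to incident {open_incidents[dep]} on {dep}"
    incident_id = f"INC-{len(open_incidents) + 1}"
    open_incidents[node] = incident_id
    return f"paged owner of {node}: {incident_id}"

# Example: the orders db alert opens INC-1; the later traffic and
# warehouse alerts attach to it rather than paging two more engineers.
print(route_alert("orders-service", "db p99 latency high"))
print(route_alert("web-traffic-data", "spike in bad responses from orders-service"))
print(route_alert("dw-tables", "record count variance"))
```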
Are there any best practices, white papers, or engineering solutions I can research to understand the different ways to alert and escalate data issues across multiple eng/support teams?