I'm just getting started with implementing Jaeger in our pipeline, using Python and the OpenTelemetry packages. A single task has, let's say, 5 steps. Step 1 can take 1-2 days, step 2 can take 3-4 hours, and steps 3-5 typically take 5-30 minutes each.
My issue is that the overall duration of each trace is useful for finding anomalies in step 1 but useless for the other steps, because the variation in step 1 swamps any variation in the time taken by the rest. This makes it hard for the teams focused on monitoring and improving steps 2-5 to see at a glance when their processes are getting slow.
Is there a way in the Jaeger UI to create separate dashboards or queries where the trace duration is calculated from a specified subset of spans? If each span is named appropriately, can I do something like:
- Dashboard 1: Show trace durations calculated from just span 1
- Dashboard 2: Show trace durations calculated from just span 2
- Dashboard 3: Show trace durations calculated from spans 3-5
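To make the calculation I have in mind concrete, here is a minimal sketch in plain Python. It does not use any Jaeger or OpenTelemetry API. The `Span` class, the `group_duration` helper, the span names `step1` to `step5`, and the timings are all made up for illustration. The idea is that each dashboard would report the wall-clock time from the earliest start to the latest end among only the spans it cares about:

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Set


@dataclass
class Span:
    """Hypothetical, simplified span: just a name and start/end times in seconds."""
    name: str
    start: float
    end: float


def group_duration(spans: Iterable[Span], names: Set[str]) -> Optional[float]:
    """Wall-clock seconds from the earliest start to the latest end
    among the spans whose name is in `names`; None if nothing matches."""
    selected = [s for s in spans if s.name in names]
    if not selected:
        return None
    return max(s.end for s in selected) - min(s.start for s in selected)


HOUR = 3600.0

# One made-up trace: step 1 takes 36 h, step 2 takes 3.5 h,
# and steps 3-5 take 15, 15, and 30 minutes.
trace = [
    Span("step1", 0.0 * HOUR, 36.0 * HOUR),
    Span("step2", 36.0 * HOUR, 39.5 * HOUR),
    Span("step3", 39.5 * HOUR, 39.75 * HOUR),
    Span("step4", 39.75 * HOUR, 40.0 * HOUR),
    Span("step5", 40.0 * HOUR, 40.5 * HOUR),
]

# Each "dashboard" is just a set of span names to include.
dashboards = {
    "Dashboard 1": {"step1"},
    "Dashboard 2": {"step2"},
    "Dashboard 3": {"step3", "step4", "step5"},
}

for label, names in dashboards.items():
    seconds = group_duration(trace, names)
    if seconds is not None:
        print(f"{label}: {seconds / HOUR:.2f} h")
```

With these numbers, a slowdown in steps 3-5 would show up clearly on Dashboard 3 even though it is tiny compared with the 40.5 hours the whole trace takes.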
If this isn't possible with Jaeger, is it possible with Grafana or some other frontend UI?
Note that I still need to record all spans, because I want to monitor every step, so removing spans from the data itself is not an option. Ideally, I want to do the filtering at the UI level to create separate dashboards that are useful to their respective teams.