1

Linkerd's docs explain how to track down failing requests using the tap command, but in some cases the success rate might be very high, with only a single failed request every hour or so. How is it possible to track down those requests that are considered "unsuccessful"? Perhaps a way to log them somewhere?

vmalloc
  • 820
  • 5
  • 8

1 Answers1

0

It sounds like you're looking for a way to configure Linkerd to trap requests that fail and dump the request data somewhere, which is not supported by Linkerd at the moment.

You do have a couple of options with the current functionality to derive some of the info that you're looking for. The Linkerd proxies record error rates as Prometheus metrics which are consumed by Grafana to render the dashboards. When you observe one of these infrequent errors, you can use the time window functionality in Grafana to find the precise time that the error occurred, then refer to the service log to see if there are any corresponding error messages there. If the error is coming from the service itself, then you can add as much logging info about the request that you need to in order to help solve the problem.

Another option, which I haven't tried myself is to integrate linkerd tap into your monitoring system to collect the request info and save the data for the requests that fail. There's a caveat here in that you will want to be careful about leaving a tap command running, because it will continuously collect data from the tap control plane component, which will add load to that service.

Perhaps a more straightforward approach would be to ensure that all the proxy logs and service logs are written to a long-term store like Splunk, an ELK (Elasticsearch, Logstash, and Kibana), or Loki. Then you can set up alerting (Prometheus alert-manager, for example) to send a notification when a request fails, then you can match the time of the failure with the logs that have been collected.

You could also look into adding distributed tracing to your environment. Depending on the implementation that you use (jaeger, zipkin, etc.) I think the interface will allow you to inspect the details of the request for each trace.

One final thought: since Linkerd is an open source project, I'd suggest opening a feature request with specifics on the behavior that you'd like to see and work with the community to get it implemented. I know the roadmap includes plans to be able to see the request bodies using linkerd tap and this sounds like a good use case for having those bodies.

cpretzer
  • 236
  • 2
  • 6