
I encountered an issue that more or less took Thanos down. I spun up one of my internal "observability stacks" at about 2 PM on Monday. I was feeding the stack with Grafana Agent running on each of a few "instrumented machines". At about 5:30 PM, I spun up a new Windows instance, and half an hour later it was the only machine writing metrics.

It turns out that this Windows machine had a clock skew of about four and a half hours into the future. Somehow, it was able to write to Thanos via Grafana Agent's Prometheus remote_write protocol. However, Thanos then refused to accept remote_write from anywhere else.

I was able to verify this was the issue by using Thanos Query to look forward in time, figure out that the skew was 4.5 hours, and then adjust the time-window parameter in Thanos to allow writing of data. As soon as I restarted Thanos, data with the current correct time started flowing from the sources that were previously broken.

A few questions:

  1. Why didn't Thanos serve as a sort of gatekeeper and keep this far-future data out of the database? Is there some configuration option for this?
  2. Is there some other way to either prevent or detect that this is happening, and alert on it? I do want to sit down and write some alerts using the PromQL absent() function to detect when data goes missing, so I suppose that might catch this issue.
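
To make question 2 concrete, here is a minimal sketch in Python of the kind of check that could flag a future-dated writer. It assumes the most recent sample timestamp for each instance (Unix epoch seconds) has already been collected somehow, for example by querying forward in time as described above. The function name `find_future_writers`, the instance names, and the 5-minute tolerance are all hypothetical and are not part of Thanos or Grafana Agent.

```python
import time
from typing import Dict, Optional


def find_future_writers(
    latest_ts: Dict[str, float],
    tolerance_s: float = 300.0,
    now: Optional[float] = None,
) -> Dict[str, float]:
    """Return {instance: skew_seconds} for instances whose newest sample
    is more than tolerance_s seconds ahead of the reference clock.

    latest_ts maps an instance name to the Unix timestamp of its most
    recent sample. How that mapping is collected is left open here.
    """
    if now is None:
        now = time.time()  # reference clock: the machine running this check

    skewed = {}
    for instance, ts in latest_ts.items():
        skew = ts - now  # positive means the sample is dated in the future
        if skew > tolerance_s:
            skewed[instance] = skew
    return skewed


if __name__ == "__main__":
    # Hypothetical data: one writer 4.5 hours ahead, one slightly behind.
    reference = 1_000_000.0
    samples = {
        "win-01": reference + 4.5 * 3600,
        "linux-01": reference - 10.0,
    }
    print(find_future_writers(samples, now=reference))
```

A check like this only reports the skew; it does not stop the future-dated samples from being accepted in the first place, which is what question 1 is asking about.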
