I encountered an issue that more or less took Thanos down. I spun up one of my internal "observability stacks" at about 2 PM on Monday, feeding it with Grafana Agent running on each of a few "instrumented machines". At about 5:30 PM, I spun up a new Windows instance, and half an hour later it was the only machine writing metrics.
It turns out that this Windows machine had a clock skew of about four and a half hours. Somehow, it was able to write to Thanos via Grafana Agent's Prometheus remote_write protocol. However, Thanos then refused to accept remote_write from anywhere else.
I was able to verify this was the issue by using Thanos Query to look forward in time, figure out that the skew was 4.5 hours, and then adjust the time-window parameter in Thanos to allow writing of data. As soon as I restarted Thanos, data with the current correct time started flowing from the sources that were previously broken.
A few questions:
- Why didn't Thanos serve as a sort of gatekeeper and keep this far-future data out of the database? Is there some configuration option for this?
- Is there some other way to either prevent or detect that this is happening, and alert on it? I do want to sit down and write some alerts using the absent() function to detect when data goes missing, so I suppose that might catch this issue.
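To make the detection question concrete, here is a minimal sketch of the check I have in mind. It assumes I can somehow get the newest sample timestamp per instance (for example with PromQL's timestamp() function through Thanos Query, evaluated far enough into the future to see the skewed data). The function name, the instance names, the date, and the 5-minute tolerance are all made up for illustration; nothing here is a Thanos or Grafana Agent API.

```python
from datetime import datetime, timedelta, timezone


def find_future_writers(latest_samples, now, tolerance=timedelta(minutes=5)):
    """Return {instance: skew} for instances whose newest sample lies
    more than `tolerance` ahead of `now`.

    latest_samples: dict mapping instance name -> datetime of its newest sample.
    How these timestamps are collected is left open (assumption: some query
    against the metrics store).
    """
    skewed = {}
    for instance, sample_time in latest_samples.items():
        skew = sample_time - now
        if skew > tolerance:
            skewed[instance] = skew
    return skewed


# Hypothetical data mirroring the incident: one machine about 4.5 hours ahead.
now = datetime(2024, 1, 1, 18, 0, tzinfo=timezone.utc)
latest = {
    "linux-01": now - timedelta(seconds=15),
    "windows-01": now + timedelta(hours=4, minutes=30),
}

result = find_future_writers(latest, now)
for instance, skew in result.items():
    print(f"{instance} is writing {skew} into the future")
```

Something like this would flag only the skewed machine, whereas an absent()-style alert would fire on all the healthy machines that went quiet and leave me to work out which one caused it.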