I have a tail input configured in Telegraf that reads a flask.log file using a grok pattern. The Flask app uses a custom Python script to fetch alerts from Alertmanager's API and forward them elsewhere. The last line of flask.log always contains the HTTP response code of the webhook request the script makes, and Telegraf sends it to Prometheus as a tail_ metric. So I expect to see a value of 200 when everything works and 500 when it doesn't (it's a pretty simple request and we haven't seen any other codes AFAIK, but that's not the point).
It mostly works as expected, but there are two problems.
First, Telegraf sometimes simply stops sending this metric to Prometheus (or stops reading it from the log file) for no apparent reason, and it doesn't resume until I restart Telegraf.
Second, when we had a problem with the app and the response code in the log was 500, Telegraf didn't pick it up at all.
We have many metrics in telegraf.conf, and this is the only one that stops randomly; the others work just fine.
My telegraf config:
[[inputs.tail]]
interval = "30s"
flush_interval = "60s"
files = ["<path>/flask.log"]
# ## Read file from beginning.
from_beginning = false
character_encoding = "utf-8"
data_format = "grok"
grok_patterns = <grok pattern>
The grok pattern itself works fine: we've tested it, and it should match both 200 and 500 without problems.
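To make the expected behaviour concrete, here is a minimal Python sketch of the extraction the grok pattern is supposed to perform. The log line format, the regex, and the function name `last_status_code` are all hypothetical (the real flask.log layout and grok pattern are not shown in this post); it only illustrates that a 200 line and a 500 line should be parsed the same way.

```python
import re

# Hypothetical line format: the real flask.log layout is not shown in this post.
STATUS_RE = re.compile(r"webhook response code: (?P<code>\d{3})\s*$")


def last_status_code(lines):
    """Return the HTTP status code from the last matching line, or None."""
    for line in reversed(lines):
        match = STATUS_RE.search(line)
        if match:
            return int(match.group("code"))
    return None


# Made-up sample lines, for illustration only.
sample_ok = [
    "INFO sending alert to webhook",
    "INFO webhook response code: 200",
]
sample_fail = [
    "INFO sending alert to webhook",
    "ERROR webhook response code: 500",
]

print(last_status_code(sample_ok))
print(last_status_code(sample_fail))
```

Running something like this against the real file (with the real format substituted) is one way to rule out the pattern itself before looking at how Telegraf follows the file.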
I tried changing interval and flush_interval, thinking there might be a timing issue between the log being updated and Telegraf reading it, but that didn't help.