
On one of my servers in GCP, something is wrong with google-cloud-ops-agent. Fluent Bit, which the agent uses for logging, is writing a huge number of error logs: in three days the log grew to 88 GB, and we had already cleaned it up once before. I can't work out what exactly these errors mean. Can somebody help with this?
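For reference, this is roughly how we checked the growth and freed the space (truncating rather than deleting keeps the file handle that Fluent Bit holds open valid; this assumes no logrotate rule manages this file, so adjust if your setup differs):

root@***:~# du -sh /var/log/google-cloud-ops-agent/subagents/logging-module.log
root@***:~# truncate -s 0 /var/log/google-cloud-ops-agent/subagents/logging-module.log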

root@***:/var/log/google-cloud-ops-agent/subagents# tail -50 logging-module.log
[2022/02/15 16:56:06] [error] [storage] [cio file] file is not mmap()ed: tail.1:29458-1644260316.150179737.flb
[2022/02/15 16:56:06] [error] [input chunk] error writing data from tail.1 instance
[2022/02/15 16:56:06] [error] [storage] format check failed: tail.1/29458-1644260316.150179737.flb
[2022/02/15 16:56:06] [error] [storage] format check failed: tail.1/29458-1644260316.150179737.flb
[2022/02/15 16:56:06] [error] [storage] [cio file] file is not mmap()ed: tail.1:29458-1644260316.150179737.flb
[2022/02/15 16:56:06] [error] [input chunk] error writing data from tail.1 instance
[2022/02/15 16:56:06] [error] [storage] format check failed: tail.1/29458-1644260316.150179737.flb
[2022/02/15 16:56:06] [error] [storage] format check failed: tail.1/29458-1644260316.150179737.flb
[2022/02/15 16:56:06] [error] [storage] [cio file] file is not mmap()ed: tail.1:29458-1644260316.150179737.flb
[2022/02/15 16:56:06] [error] [input chunk] error writing data from tail.1 instance
[2022/02/15 16:56:06] [error] [storage] format check failed: tail.1/29458-1644260316.150179737.flb
[2022/02/15 16:56:06] [error] [storage] format check failed: tail.1/29458-1644260316.150179737.flb
[2022/02/15 16:56:06] [error] [storage] [cio file] file is not mmap()ed: tail.1:29458-1644260316.150179737.flb
[2022/02/15 16:56:06] [error] [input chunk] error writing data from tail.1 instance

After restarting google-cloud-ops-agent-fluent-bit.service, the subagent went into an endless start/crash loop, repeating the following:

root@***:/var/log/google-cloud-ops-agent/subagents# tail -300 logging-module.log 
[2022/02/15 18:15:46] [ info] [output:stackdriver:stackdriver.1] metadata_server set to http://metadata.google.internal
[2022/02/15 18:15:46] [ warn] [output:stackdriver:stackdriver.1] client_email is not defined, using a default one
[2022/02/15 18:15:46] [ warn] [output:stackdriver:stackdriver.1] private_key is not defined, fetching it from metadata server
[2022/02/15 18:15:46] [ info] [output:stackdriver:stackdriver.0] worker #7 started

.....

[2022/02/15 18:15:46] [ info] [input:storage_backlog:storage_backlog.2] register tail.1/29458-1644238945.234513362.flb
[2022/02/15 18:15:46] [ info] [input:storage_backlog:storage_backlog.2] register tail.1/29458-1644238950.216326541.flb
[2022/02/15 18:15:46] [ info] [input:storage_backlog:storage_backlog.2] register tail.1/29458-1644238953.150198939.flb
[2022/02/15 18:15:46] [ info] [input:storage_backlog:storage_backlog.2] register tail.1/29458-1644238957.150224348.flb
[2022/02/15 18:15:46] [error] [storage] format check failed: tail.1/29458-1644260316.150179737.flb
[2022/02/15 18:15:46] [error] [engine] could not segregate backlog chunks
[2022/02/15 18:15:46] [ info] [output:stackdriver:stackdriver.0] thread worker #0 stopping...
[2022/02/15 18:15:46] [ info] [output:stackdriver:stackdriver.0] thread worker #0 stopped
[2022/02/15 18:15:46] [ info] [output:stackdriver:stackdriver.0] thread worker #1 stopping...

Restarting google-cloud-ops-agent-opentelemetry-collector.service and google-cloud-ops-agent.service did not help. Any ideas why this is happening and what these logs mean?

1 Answer


You didn't mention the version that is experiencing this issue, or whether you've upgraded from an earlier version, but there was a bug in Ops agent versions prior to 2.7.1 that caused buffer corruption, which manifested in later versions as the error you are quoting ("format check failed"). The solution is to delete the corrupted files until the agent runs properly. See the public issue tracker for detailed instructions.
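A rough sketch of that cleanup, assuming the default buffer location the Ops Agent's Fluent Bit uses on Linux (/var/lib/google-cloud-ops-agent/fluent-bit/buffers); the chunk name below is taken from the "format check failed" lines in your own log:

# stop the agent so nothing touches the buffers while cleaning up
systemctl stop google-cloud-ops-agent
# buffer path is the assumed default for the Ops Agent; verify it on your system
# remove the chunk reported as corrupt (repeat for any others that are flagged)
rm /var/lib/google-cloud-ops-agent/fluent-bit/buffers/tail.1/29458-1644260316.150179737.flb
# if everything in the buffer directory is suspect, it can be cleared entirely,
# at the cost of dropping logs still queued for delivery:
#   rm -rf /var/lib/google-cloud-ops-agent/fluent-bit/buffers/*
systemctl start google-cloud-ops-agent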

Igor Peshansky
  • Version of the agent is 2.7.1~ubuntu18.04, Fluent Bit v1.8.11. According to the issue tracker, the problem was already fixed in this version – Anton Makarov Mar 15 '22 at 09:35
  • If you have upgraded from a version prior to 2.7.1 *to* 2.7.1 or later, then the corruption caused by a bug in the earlier version would manifest as the later version crashing in the way you describe. 2.7.1 should no longer cause such buffer corruption (that's the problem that was fixed), but it may still crash when encountering corrupt buffers. You should delete the corrupt buffers (in your case, `tail.1/29458-1644260316.150179737.flb`). – Igor Peshansky Mar 17 '22 at 03:29
  • Thank you, Igor! I appreciate your help! – Anton Makarov Mar 19 '22 at 05:18