
I have installed the GCP Ops Agent on some production machines to collect metrics and create alerts. The machines have no service account attached, and that cannot be changed because they are in production (some of them also have local SSDs). Since I cannot set the service account on the instance directly, I decided to configure the credentials manually, and I have almost got it working.

To do this, I created a service account and a JSON key for it. I installed that key using the gcloud command:

gcloud auth application-default login --key-file=key-file.json

and copied the JSON file to root's config folder so that it is used as the default service account for Google applications:

cp key-file.json /root/.config/gcloud/application_default_credentials.json
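Before going further, it can help to confirm what kind of credential actually ended up in that file. As far as I know, `gcloud auth application-default login` normally writes user credentials (`"type": "authorized_user"`), whereas a key downloaded for a service account has `"type": "service_account"`. The sketch below is only an illustration: `credential_type` is a hypothetical helper name, and the demo file contains placeholder values rather than a real key.

```shell
#!/bin/sh
# Hypothetical helper: print the value of the "type" field in a Google
# credentials JSON file. Uses sed only, so jq is not required.
credential_type() {
    sed -n 's/.*"type"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' "$1" | head -n 1
}

# Demo on a throwaway file; the contents are placeholders, not a real key.
demo_key=$(mktemp)
printf '{\n  "type": "service_account",\n  "project_id": "example-project"\n}\n' > "$demo_key"
credential_type "$demo_key"   # prints: service_account
rm -f "$demo_key"
```

On the affected machine, the same function could be pointed at `/root/.config/gcloud/application_default_credentials.json` to check that the copied file is the service account key and not a leftover user credential.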

With those commands, I can now get an application access token using the following command:

gcloud auth application-default print-access-token

The metrics collector now seems to be working, because I can see data in GCP. My problem is that Fluent Bit is still having authentication problems because it is unable to get the OAuth token:

[2021/12/09 14:01:11] [error] [output:stackdriver:stackdriver.1] can't fetch token from the metadata server
[2021/12/09 14:01:11] [error] [output:stackdriver:stackdriver.1] cannot retrieve oauth2 token

I searched a bit more and read that the GOOGLE_SERVICE_CREDENTIALS variable should work, so I edited the systemd unit to add the environment variable with the following override:

[Service]
Environment='GOOGLE_SERVICE_CREDENTIALS=/root/.config/gcloud/application_default_credentials.json'

With that override in place, Fluent Bit seems to be able to get the access token:

[2021/12/09 14:07:56] [ info] [output:stackdriver:stackdriver.0] metadata_server set to http://metadata.google.internal
[2021/12/09 14:07:56] [ info] [oauth2] HTTP Status=200
[2021/12/09 14:07:56] [ info] [oauth2] access token from 'www.googleapis.com:443' retrieved
[2021/12/09 14:07:56] [ info] [output:stackdriver:stackdriver.0] worker #0 started
[2021/12/09 14:07:56] [ info] [output:stackdriver:stackdriver.0] worker #1 started
[2021/12/09 14:07:56] [ info] [output:stackdriver:stackdriver.0] worker #2 started
[2021/12/09 14:07:56] [ info] [output:stackdriver:stackdriver.0] worker #3 started
[2021/12/09 14:07:56] [ info] [output:stackdriver:stackdriver.0] worker #4 started
[2021/12/09 14:07:56] [ info] [output:stackdriver:stackdriver.0] worker #5 started
[2021/12/09 14:07:56] [ info] [output:stackdriver:stackdriver.0] worker #6 started
[2021/12/09 14:07:56] [ info] [output:stackdriver:stackdriver.0] worker #7 started
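For reference, the environment override described above can be installed as a systemd drop-in file. This is a minimal sketch, not the official Ops Agent procedure: `write_credentials_override` is a hypothetical helper, and the real drop-in directory named in the comment is an assumption derived from the unit name that appears in the logs. The demo writes to a scratch directory so that nothing under /etc is modified.

```shell
#!/bin/sh
# Hypothetical helper: write a systemd drop-in that sets
# GOOGLE_SERVICE_CREDENTIALS for a unit.
#   $1 = drop-in directory, $2 = path to the service account key file
write_credentials_override() {
    mkdir -p "$1" || return 1
    printf '[Service]\nEnvironment="GOOGLE_SERVICE_CREDENTIALS=%s"\n' "$2" \
        > "$1/override.conf"
}

# Demo against a scratch directory. On a real host the directory would be
# (assumption, based on the unit name in the logs):
#   /etc/systemd/system/google-cloud-ops-agent-fluent-bit.service.d
scratch=$(mktemp -d)
write_credentials_override "$scratch" \
    /root/.config/gcloud/application_default_credentials.json
cat "$scratch/override.conf"
rm -rf "$scratch"
```

After writing a drop-in to the real location, `systemctl daemon-reload` followed by a restart of the unit is needed for the new environment to take effect.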

But then it starts to crash:

Dec  9 14:05:48 tradeinn-web-pro-mariadb-11 systemd[1]: google-cloud-ops-agent-fluent-bit.service: Main process exited, code=killed, status=6/ABRT
Dec  9 14:05:48 tradeinn-web-pro-mariadb-11 systemd[1]: google-cloud-ops-agent-fluent-bit.service: Failed with result 'signal'.
Dec  9 14:05:48 tradeinn-web-pro-mariadb-11 systemd[1]: google-cloud-ops-agent-fluent-bit.service: Service RestartSec=100ms expired, scheduling restart.
Dec  9 14:05:48 tradeinn-web-pro-mariadb-11 systemd[1]: google-cloud-ops-agent-fluent-bit.service: Scheduled restart job, restart counter is at 4.
Dec  9 14:05:48 tradeinn-web-pro-mariadb-11 systemd[1]: Stopped Google Cloud Ops Agent - Logging Agent.
Dec  9 14:05:48 tradeinn-web-pro-mariadb-11 systemd[1]: Starting Google Cloud Ops Agent - Logging Agent...
Dec  9 14:05:48 tradeinn-web-pro-mariadb-11 systemd[1]: Started Google Cloud Ops Agent - Logging Agent.
Dec  9 14:05:48 tradeinn-web-pro-mariadb-11 fluent-bit[27095]: #033[1mFluent Bit v1.8.4#033[0m
Dec  9 14:05:48 tradeinn-web-pro-mariadb-11 fluent-bit[27095]: * #033[1m#033[93mCopyright (C) 2019-2021 The Fluent Bit Authors#033[0m
Dec  9 14:05:48 tradeinn-web-pro-mariadb-11 fluent-bit[27095]: * #033[1m#033[93mCopyright (C) 2015-2018 Treasure Data#033[0m
Dec  9 14:05:48 tradeinn-web-pro-mariadb-11 fluent-bit[27095]: * Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
Dec  9 14:05:48 tradeinn-web-pro-mariadb-11 fluent-bit[27095]: * https://fluentbit.io
Dec  9 14:05:49 tradeinn-web-pro-mariadb-11 fluent-bit[27095]: [2021/12/09 14:05:49] [engine] caught signal (SIGSEGV)
Dec  9 14:05:49 tradeinn-web-pro-mariadb-11 fluent-bit[27095]: [2021/12/09 14:05:49] [engine] caught signal (SIGSEGV)
Dec  9 14:05:49 tradeinn-web-pro-mariadb-11 fluent-bit[27095]: #0  0x5636b208e714      in  ???() at ???:0
Dec  9 14:05:49 tradeinn-web-pro-mariadb-11 fluent-bit[27095]: #1  0x5636b200fc87      in  ???() at ???:0
Dec  9 14:05:49 tradeinn-web-pro-mariadb-11 fluent-bit[27095]: #2  0x5636b227601f      in  ???() at ???:0
Dec  9 14:05:49 tradeinn-web-pro-mariadb-11 fluent-bit[27095]: #3  0x5636b208e714      in  ???() at ???:0
Dec  9 14:05:49 tradeinn-web-pro-mariadb-11 fluent-bit[27095]: #4  0x5636b200fc87      in  ???() at ???:0
Dec  9 14:05:49 tradeinn-web-pro-mariadb-11 fluent-bit[27095]: #5  0x5636b227601f      in  ???() at ???:0

Is there any way to make it work with a service account JSON key?

  • What GKE version are you using? How did you deploy and configure Fluent Bit? Did you follow a tutorial? Did you try [this guide](https://cloud.google.com/community/tutorials/kubernetes-engine-customize-fluentbit?hl=en)? – PjoterS Dec 10 '21 at 09:51
  • Hello. There is no GKE involved in this problem, just some GCP instances that don't have a service account attached because they were created incorrectly and now cannot be stopped to change it. I followed the official GCP documentation, which just involves downloading and running a bash script. The remaining steps were tests I did to try to solve the problem, and they solved it only partially. – Daniel Carrasco Dec 10 '21 at 11:44
  • When you use `$ gcloud auth application-default print-access-token`, it's for the default built-in SA; if you want to use a custom one, the command looks like this: `$ gcloud auth print-access-token SA_NAME@PROJECT_ID.iam.gserviceaccount.com`. To sum up, you have created a new SA, downloaded the JSON key, and set a variable. Did you set it in ~/.bashrc or ~/.profile as per [this doc](https://cloud.google.com/docs/authentication/getting-started#setting_the_environment_variable)? Does it work for a while and then get rejected, or does it work the whole time while you still get those error messages? – PjoterS Dec 13 '21 at 13:53
  • The `$ gcloud auth application-default print-access-token` command works fine all the time. The metrics collector was also failing until I copied the file to `/root/.config/gcloud/application_default_credentials.json`. Now it works perfectly and the only thing that fails is fluentd, which gives me an authentication error by default; when I add the environment variable to the service directly, it is able to authenticate but crashes. I checked the permissions and they are correct. – Daniel Carrasco Dec 14 '21 at 09:48
  • What OS are your VMs based on? What is your Ops Agent version? Could you share more logs regarding this crash (I guess they're from `/var/log/google-fluentd/google-fluentd.log`) without private information? Is it working and sending logs but crashing from time to time, or is it crashing over and over again? I guess you already checked the [Troubleshooting guide](https://cloud.google.com/logging/docs/agent/logging/troubleshooting)? – PjoterS Dec 15 '21 at 11:26
  • Did you solve your issue? If not, could you respond to my questions above? – PjoterS Dec 22 '21 at 08:52
  • Sorry, I was away for a while. No, I have not been able to solve the problem yet. Our VMs are based on Debian 10 and Ubuntu 18.04 (both fail in the same way), and the Ops Agent version is 2.7.1. I have to repeat the process to get the crash log (I left the agent disabled), so I need a bit of time to do it. Tomorrow I'll try to make it crash again and I'll go through the guide to see if I can locate the problem (I think I already did, but just to be sure). – Daniel Carrasco Jan 03 '22 at 16:11
