
Before going for a more sophisticated automation approach (Terraform and/or a Helm chart), I am trying to get a dev AWS EKS environment working by following this guide: https://aws-otel.github.io/docs/introduction

These steps go fine:

kubectl apply -f https://amazon-eks.s3.amazonaws.com/docs/addons-otel-permissions.yaml

eksctl create iamserviceaccount \
    --name adot-collector \
    --namespace opentelemetry-operator-system \
    --cluster <MY-CLUSTER> \
    --attach-policy-arn arn:aws:iam::aws:policy/AmazonPrometheusRemoteWriteAccess \
    --attach-policy-arn arn:aws:iam::aws:policy/AWSXrayWriteOnlyAccess \
    --attach-policy-arn arn:aws:iam::aws:policy/CloudWatchAgentServerPolicy \
    --approve \
    --override-existing-serviceaccounts
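
As a sanity check on those two steps (my own verification, not part of the guide), I confirm that the objects from the permissions manifest exist and that the service account eksctl created carries the IRSA role annotation:

# list exactly the objects declared in the permissions manifest
kubectl get -f https://amazon-eks.s3.amazonaws.com/docs/addons-otel-permissions.yaml

# the service account should carry an eks.amazonaws.com/role-arn annotation
kubectl get serviceaccount adot-collector \
    -n opentelemetry-operator-system \
    -o jsonpath='{.metadata.annotations}'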

The next part of the guide gets a little confusing because it states that you can either do this:

aws eks create-addon --addon-name adot --cluster-name <your_cluster_name>

or, if you wish to pass in a more customized configuration, do this:

aws eks create-addon \
    --cluster-name <YOUR-EKS-CLUSTER-NAME> \
    --addon-name adot \
    --configuration-values file://configuration-values.json \
    --resolve-conflicts=OVERWRITE
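
Either way the command returns quickly, so to see what state the add-on is actually in I use describe-addon, and describe-addon-configuration to pull down the JSON schema that --configuration-values is validated against (these are my own checks, not part of the guide):

# current status and any health issues the add-on reports
aws eks describe-addon \
    --cluster-name <YOUR-EKS-CLUSTER-NAME> \
    --addon-name adot

# the schema accepted by --configuration-values; pick an --addon-version
# from: aws eks describe-addon-versions --addon-name adot
aws eks describe-addon-configuration \
    --addon-name adot \
    --addon-version <ADDON-VERSION>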

My goal is to create the Collector in "statefulset" mode, but no matter what I try in the configuration-values.json file, nothing is ever created for the Collector: no statefulset, no pods. The operator pod is the only thing that gets created, and there is nothing I can make sense of in the operator log; it looks like standard startup output.

This is the configuration-values.json file I am trying:

{
  "replicaCount": 1,
  "manager": {
    "resources": {
      "limits": {
        "cpu": "200m",
        "memory": "256Mi"
      },
      "requests": {
        "cpu": "100m",
        "memory": "128Mi"
      }
    }
  },
  "kubeRBACProxy": {
    "resources": {
      "limits": {
        "cpu": "50m",
        "memory": "64Mi"
      },
      "requests": {
        "cpu": "10m",
        "memory": "32Mi"
      }
    }
  },
  "collector": {
    "mode": "statefulset",
    "serviceAccount": {
      "create": false,
      "name": "adot-collector"
    },
    "resources": {
      "limits": {
        "cpu": "1",
        "memory": "2Gi"
      },
      "requests": {
        "cpu": "500m",
        "memory": "1Gi"
      }
    }
  }
}
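
As I understand it from the operator logs further down, the operator only creates collector workloads in response to an OpenTelemetryCollector custom resource, so I also check (again my own step, not part of the guide) whether the add-on rendered one from the "collector" block:

# the CRDs installed by the add-on
kubectl api-resources --api-group=opentelemetry.io

# any collector instances rendered from the "collector" block above
kubectl get opentelemetrycollectors --all-namespaces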

I am confused about what the issue might be. The aws eks create-addon command actually completes successfully, but no Collector pods or statefulset are ever created. Could this be a lack of resources in my EKS cluster (it's a smaller, 3-node dev cluster)?
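
If it were a capacity problem I would expect the statefulset to at least exist, with pods stuck in Pending, so as a rough check (my own, not from the guide) I look for scheduling failures and at what the nodes have left:

# scheduling failures would surface as FailedScheduling events
kubectl get events -A --field-selector reason=FailedScheduling

# per-node usage vs. allocatable
kubectl describe nodes | grep -A 8 "Allocated resources"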

I am adding logs from the operator:

1. No collector pods:
❯ k get pods -n opentelemetry-operator-system
NAME                                      READY   STATUS    RESTARTS   AGE
opentelemetry-operator-79b9f86654-ntt9p   2/2     Running   0          3m16s
2. Operator logs:
I0814 21:11:50.958866       1 leaderelection.go:255] successfully acquired lease opentelemetry-operator-system/9f7554c3.opentelemetry.io
{"level":"info","ts":"2023-08-14T21:11:50Z","logger":"instrumentation-upgrade","msg":"looking for managed Instrumentation instances to upgrade"}
{"level":"info","ts":"2023-08-14T21:11:50Z","logger":"collector-upgrade","msg":"looking for managed instances to upgrade"}
{"level":"info","ts":"2023-08-14T21:11:50Z","msg":"Starting EventSource","controller":"opentelemetrycollector","controllerGroup":"opentelemetry.io","controllerKind":"OpenTelemetryCollector","source":"kind source: *v1alpha1.OpenTelemetryCollector"}
{"level":"info","ts":"2023-08-14T21:11:50Z","msg":"Starting EventSource","controller":"opentelemetrycollector","controllerGroup":"opentelemetry.io","controllerKind":"OpenTelemetryCollector","source":"kind source: *v1.ConfigMap"}
{"level":"info","ts":"2023-08-14T21:11:50Z","msg":"Starting EventSource","controller":"opentelemetrycollector","controllerGroup":"opentelemetry.io","controllerKind":"OpenTelemetryCollector","source":"kind source: *v1.ServiceAccount"}
{"level":"info","ts":"2023-08-14T21:11:50Z","msg":"Starting EventSource","controller":"opentelemetrycollector","controllerGroup":"opentelemetry.io","controllerKind":"OpenTelemetryCollector","source":"kind source: *v1.Service"}
{"level":"info","ts":"2023-08-14T21:11:50Z","msg":"Starting EventSource","controller":"opentelemetrycollector","controllerGroup":"opentelemetry.io","controllerKind":"OpenTelemetryCollector","source":"kind source: *v1.Deployment"}
{"level":"info","ts":"2023-08-14T21:11:50Z","msg":"Starting EventSource","controller":"opentelemetrycollector","controllerGroup":"opentelemetry.io","controllerKind":"OpenTelemetryCollector","source":"kind source: *v1.DaemonSet"}
{"level":"info","ts":"2023-08-14T21:11:50Z","msg":"Starting EventSource","controller":"opentelemetrycollector","controllerGroup":"opentelemetry.io","controllerKind":"OpenTelemetryCollector","source":"kind source: *v1.StatefulSet"}
{"level":"info","ts":"2023-08-14T21:11:50Z","msg":"Starting EventSource","controller":"opentelemetrycollector","controllerGroup":"opentelemetry.io","controllerKind":"OpenTelemetryCollector","source":"kind source: *v2.HorizontalPodAutoscaler"}
{"level":"info","ts":"2023-08-14T21:11:50Z","msg":"Starting Controller","controller":"opentelemetrycollector","controllerGroup":"opentelemetry.io","controllerKind":"OpenTelemetryCollector"}
{"level":"info","ts":"2023-08-14T21:11:51Z","logger":"collector-upgrade","msg":"no instances to upgrade"}
{"level":"info","ts":"2023-08-14T21:11:51Z","logger":"instrumentation-upgrade","msg":"no instances to upgrade"}
{"level":"info","ts":"2023-08-14T21:11:51Z","msg":"Starting workers","controller":"opentelemetrycollector","controllerGroup":"opentelemetry.io","controllerKind":"OpenTelemetryCollector","worker count":1}

Logs when trying "deployment" mode for the collector (the default):

❯ k get pods -n opentelemetry-operator-system
NAME                                      READY   STATUS    RESTARTS   AGE
opentelemetry-operator-79b9f86654-lqnjd   2/2     Running   0          79s
❯ k logs opentelemetry-operator-79b9f86654-lqnjd -n opentelemetry-operator-system
{"level":"info","ts":"2023-08-14T21:29:42Z","msg":"Starting the OpenTelemetry Operator","opentelemetry-operator":"0.76.1-adot-46-g803a86e","opentelemetry-collector":"public.ecr.aws/aws-observability/aws-otel-collector:v0.30.0","opentelemetry-targetallocator":"public.ecr.aws/aws-observability/adot-operator-targetallocator:0.78.1","operator-opamp-bridge":"public.ecr.aws/aws-observability/adot-operator-opamp-bridge:0.78.0","auto-instrumentation-java":"public.ecr.aws/aws-observability/adot-autoinstrumentation-java:1.27.0","auto-instrumentation-nodejs":"public.ecr.aws/aws-observability/adot-operator-autoinstrumentation-nodejs:0.39.1","auto-instrumentation-python":"public.ecr.aws/aws-observability/adot-operator-autoinstrumentation-python:0.39b0","auto-instrumentation-dotnet":"public.ecr.aws/aws-observability/adot-operator-autoinstrumentation-dotnet:0.7.0","auto-instrumentation-go":"ghcr.io/open-telemetry/opentelemetry-go-instrumentation/autoinstrumentation-go:v0.2.1-alpha","auto-instrumentation-apache-httpd":"ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-apache-httpd:1.0.2","feature-gates":"operator.autoinstrumentation.apache-httpd,operator.autoinstrumentation.dotnet,-operator.autoinstrumentation.go,operator.autoinstrumentation.java,operator.autoinstrumentation.nodejs,operator.autoinstrumentation.python,-operator.collector.rewritetargetallocator","build-date":"2023-06-15T16:35:10Z","go-version":"go1.20.5","go-arch":"amd64","go-os":"linux","labels-filter":[]}
{"level":"info","ts":"2023-08-14T21:29:42Z","logger":"setup","msg":"the env var WATCH_NAMESPACE isn't set, watching all namespaces"}
{"level":"info","ts":"2023-08-14T21:29:42Z","logger":"controller-runtime.metrics","msg":"Metrics server is starting to listen","addr":"0.0.0.0:8080"}
{"level":"info","ts":"2023-08-14T21:29:42Z","logger":"controller-runtime.builder","msg":"Registering a mutating webhook","GVK":"opentelemetry.io/v1alpha1, Kind=OpenTelemetryCollector","path":"/mutate-opentelemetry-io-v1alpha1-opentelemetrycollector"}
{"level":"info","ts":"2023-08-14T21:29:42Z","logger":"controller-runtime.webhook","msg":"Registering webhook","path":"/mutate-opentelemetry-io-v1alpha1-opentelemetrycollector"}
{"level":"info","ts":"2023-08-14T21:29:42Z","logger":"controller-runtime.builder","msg":"Registering a validating webhook","GVK":"opentelemetry.io/v1alpha1, Kind=OpenTelemetryCollector","path":"/validate-opentelemetry-io-v1alpha1-opentelemetrycollector"}
{"level":"info","ts":"2023-08-14T21:29:42Z","logger":"controller-runtime.webhook","msg":"Registering webhook","path":"/validate-opentelemetry-io-v1alpha1-opentelemetrycollector"}
{"level":"info","ts":"2023-08-14T21:29:42Z","logger":"controller-runtime.builder","msg":"Registering a mutating webhook","GVK":"opentelemetry.io/v1alpha1, Kind=Instrumentation","path":"/mutate-opentelemetry-io-v1alpha1-instrumentation"}
{"level":"info","ts":"2023-08-14T21:29:42Z","logger":"controller-runtime.webhook","msg":"Registering webhook","path":"/mutate-opentelemetry-io-v1alpha1-instrumentation"}
{"level":"info","ts":"2023-08-14T21:29:42Z","logger":"controller-runtime.builder","msg":"Registering a validating webhook","GVK":"opentelemetry.io/v1alpha1, Kind=Instrumentation","path":"/validate-opentelemetry-io-v1alpha1-instrumentation"}
{"level":"info","ts":"2023-08-14T21:29:42Z","logger":"controller-runtime.webhook","msg":"Registering webhook","path":"/validate-opentelemetry-io-v1alpha1-instrumentation"}
{"level":"info","ts":"2023-08-14T21:29:42Z","logger":"controller-runtime.webhook","msg":"Registering webhook","path":"/mutate-v1-pod"}
{"level":"info","ts":"2023-08-14T21:29:42Z","logger":"setup","msg":"starting manager"}
{"level":"info","ts":"2023-08-14T21:29:42Z","msg":"Starting server","kind":"health probe","addr":"[::]:8081"}
{"level":"info","ts":"2023-08-14T21:29:42Z","logger":"controller-runtime.webhook.webhooks","msg":"Starting webhook server"}
{"level":"info","ts":"2023-08-14T21:29:42Z","msg":"starting server","path":"/metrics","kind":"metrics","addr":"[::]:8080"}
I0814 21:29:42.639882       1 leaderelection.go:245] attempting to acquire leader lease opentelemetry-operator-system/9f7554c3.opentelemetry.io...
{"level":"info","ts":"2023-08-14T21:29:42Z","logger":"controller-runtime.certwatcher","msg":"Updated current TLS certificate"}
{"level":"info","ts":"2023-08-14T21:29:42Z","logger":"controller-runtime.webhook","msg":"Serving webhook server","host":"","port":9443}
{"level":"info","ts":"2023-08-14T21:29:42Z","logger":"controller-runtime.certwatcher","msg":"Starting certificate watcher"}
I0814 21:29:42.648681       1 leaderelection.go:255] successfully acquired lease opentelemetry-operator-system/9f7554c3.opentelemetry.io
{"level":"info","ts":"2023-08-14T21:29:42Z","logger":"instrumentation-upgrade","msg":"looking for managed Instrumentation instances to upgrade"}
{"level":"info","ts":"2023-08-14T21:29:42Z","logger":"collector-upgrade","msg":"looking for managed instances to upgrade"}
{"level":"info","ts":"2023-08-14T21:29:42Z","msg":"Starting EventSource","controller":"opentelemetrycollector","controllerGroup":"opentelemetry.io","controllerKind":"OpenTelemetryCollector","source":"kind source: *v1alpha1.OpenTelemetryCollector"}
{"level":"info","ts":"2023-08-14T21:29:42Z","msg":"Starting EventSource","controller":"opentelemetrycollector","controllerGroup":"opentelemetry.io","controllerKind":"OpenTelemetryCollector","source":"kind source: *v1.ConfigMap"}
{"level":"info","ts":"2023-08-14T21:29:42Z","msg":"Starting EventSource","controller":"opentelemetrycollector","controllerGroup":"opentelemetry.io","controllerKind":"OpenTelemetryCollector","source":"kind source: *v1.ServiceAccount"}
{"level":"info","ts":"2023-08-14T21:29:42Z","msg":"Starting EventSource","controller":"opentelemetrycollector","controllerGroup":"opentelemetry.io","controllerKind":"OpenTelemetryCollector","source":"kind source: *v1.Service"}
{"level":"info","ts":"2023-08-14T21:29:42Z","msg":"Starting EventSource","controller":"opentelemetrycollector","controllerGroup":"opentelemetry.io","controllerKind":"OpenTelemetryCollector","source":"kind source: *v1.Deployment"}
{"level":"info","ts":"2023-08-14T21:29:42Z","msg":"Starting EventSource","controller":"opentelemetrycollector","controllerGroup":"opentelemetry.io","controllerKind":"OpenTelemetryCollector","source":"kind source: *v1.DaemonSet"}
{"level":"info","ts":"2023-08-14T21:29:42Z","msg":"Starting EventSource","controller":"opentelemetrycollector","controllerGroup":"opentelemetry.io","controllerKind":"OpenTelemetryCollector","source":"kind source: *v1.StatefulSet"}
{"level":"info","ts":"2023-08-14T21:29:42Z","msg":"Starting EventSource","controller":"opentelemetrycollector","controllerGroup":"opentelemetry.io","controllerKind":"OpenTelemetryCollector","source":"kind source: *v2.HorizontalPodAutoscaler"}
{"level":"info","ts":"2023-08-14T21:29:42Z","msg":"Starting Controller","controller":"opentelemetrycollector","controllerGroup":"opentelemetry.io","controllerKind":"OpenTelemetryCollector"}
{"level":"info","ts":"2023-08-14T21:29:42Z","logger":"instrumentation-upgrade","msg":"no instances to upgrade"}
{"level":"info","ts":"2023-08-14T21:29:42Z","logger":"collector-upgrade","msg":"no instances to upgrade"}
{"level":"info","ts":"2023-08-14T21:29:42Z","msg":"Starting workers","controller":"opentelemetrycollector","controllerGroup":"opentelemetry.io","controllerKind":"OpenTelemetryCollector","worker count":1}

Still no collector deployment:

❯ k get deployments -n opentelemetry-operator-system
NAME                     READY   UP-TO-DATE   AVAILABLE   AGE
opentelemetry-operator   1/1     1            1           4m21s
Robert Campbell
  • Hi Robert, ADOT PM here. This looks good to me, but without the logs it's hard to tell why it fails for you. Can you try a simpler config (using the default `deployment` mode) like so: ``` { "collector": { "serviceAccount": { "create": false, "name": "adot-collector" } } } ``` If that doesn't work, I'd ask you to [cut us a ticket](https://github.com/aws-observability/aws-otel-collector/issues) with the logs of the custom controller pod? – Michael Hausenblas Aug 14 '23 at 09:21
    I added logs for the "mode": "statefulset" approach, I will try the "deployment" approach and add those logs. OF NOTE: one thing I did notice was that the sa that got created with the `eksctl create iamserviceaccount` command did not have a secret. I added a secret before redeploying the addon, and still was not able to bring up controller pods --I was hoping that was the issue. – Robert Campbell Aug 14 '23 at 21:26
  • Aha! If the secret doesn't have an annotation with the IAM role then it won't work, right. – Michael Hausenblas Aug 15 '23 at 15:17
  • Michael, thanks for following up. It feels like the issue I'm having is that the Collector doesn't want to install. Is the AWS Distro compatible with an existing Prometheus instance rather than AWS Managed Prometheus? That is what I am trying to do: connect the AWS Distro with an existing Prometheus instance. If that is an unsupported deployment pattern, I could see why this isn't working. – Robert Campbell Aug 16 '23 at 17:12
  • Yes, the ADOT collector has the Prometheus exporter for that. – Michael Hausenblas Aug 20 '23 at 21:28

1 Answer


There is a tool that was recently released called eksdemo, and it supports installing adot. It's still pretty early, but one of the things I like about it is that you can have it dump all of the steps it performs to configure a resource.

So, for example you could run:

eksdemo install adot -c <your_cluster> --dry-run

... and it will spit out everything that it does to configure adot.

I don't think it's doing the exact same configuration that you're using (i.e. a StatefulSet), but it does configure the Collector within a cluster, and you noted above that you might be open to an alternative configuration (i.e. a deployment).

Steven Evers