
On Cloud Composer I have long-running DAG tasks, each running for 4 to 6 hours. Each task ends with an error raised by the Kubernetes API: 401 Unauthorized.

The error message:

kubernetes.client.rest.ApiException: (401)
Reason: Unauthorized
HTTP response headers: HTTPHeaderDict({'Audit-Id': 'e1a37278-0693-4f36-8b04-0a7ce0b7f7a0', 'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'Date': 'Fri, 07 Jul 2023 08:10:15 GMT', 'Content-Length': '129'})
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"Unauthorized","reason":"Unauthorized","code":401}

The Kubernetes API token has an expiry of 1 hour, and Composer is not renewing the token before it expires. This issue never happened with Composer1; it only started after I migrated from Composer1 to Composer2.

Additional details: GKEStartPodOperator has an option called is_delete_operator_pod, which is set to true. This option deletes the pod from the cluster after the job is done. So after the task completes in about 4 hours, Composer tries to delete the pod, and that is when the 401 Unauthorized error appears.
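For reference, the failing task is defined roughly like the sketch below (the project, cluster, and image names are placeholders, not from the original setup):

```python
from airflow.providers.google.cloud.operators.kubernetes_engine import GKEStartPodOperator

# Illustrative task definition; all identifiers below are placeholders.
long_running_task = GKEStartPodOperator(
    task_id="long-running-job",
    project_id="my-project",
    location="us-central1",
    cluster_name="my-gke-cluster",
    name="long-running-pod",
    namespace="default",
    image="gcr.io/my-project/job:latest",
    # Pod deletion happens after the ~4-6h run, long after the
    # 1-hour Kubernetes API token has expired.
    is_delete_operator_pod=True,
)
```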

I have checked some Airflow configs such as kubernetes.enable_tcp_keepalive, which enables the TCP keepalive mechanism for Kubernetes clusters, but it does not help resolve the problem.
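For completeness, that setting can be applied as an Airflow configuration override on a Composer environment like this (the environment name and location are placeholders):

```shell
# Apply an Airflow config override on the Composer environment.
# Config keys use the "section-option" format expected by gcloud.
gcloud composer environments update my-composer-env \
    --location us-central1 \
    --update-airflow-configs=kubernetes-enable_tcp_keepalive=True
```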

What can be done to prevent this error?

Kavya
  • Hi @Kavya, can you follow this [Public Doc](https://cloud.google.com/kubernetes-engine/docs/troubleshooting#troubleshooting_error_400_issues)? I hope it will help you resolve the issue. You can find more details in this [Git link](https://github.com/apache/airflow/issues/31648). – Arpita Shrivastava Jul 26 '23 at 11:39
  • Is this programming related? – Vega Jul 26 '23 at 14:18
  • The 401 Kubernetes exception only happens when the DAG task runs for more than 1 hour; if the task finishes within 1 hour, it completes successfully without any errors. So the issue is that the Kubernetes token is not being refreshed by default in Composer2. – Kavya Jul 26 '23 at 16:34
  • @Vega No, it's not programming related. It is related to either the GKE autopilot cluster or the Google Cloud Composer2 – Kavya Jul 26 '23 at 16:36
  • So this question is off-topic on SO – Vega Jul 26 '23 at 19:18

2 Answers


As mentioned in the comment, this issue can occur when you try to run a kubectl command against your GKE cluster from a local environment. The command fails and displays an error message, usually with HTTP status code 401 (Unauthorized).

The cause of this issue might be one of the following:

  • The gke-gcloud-auth-plugin authentication plugin is not correctly installed or configured.

  • You lack the permissions to connect to the cluster API server and run kubectl commands.

To diagnose the cause, follow the steps in this Link.

If you get a 401 error or a similar authorization error, ensure that you have the correct permissions to perform the operation. For more information, see the Git Link.

  • It is actually the same issue as mentioned in the Git Link. The issue may be fixed in the Airflow 2.6.4 release – Kavya Jul 27 '23 at 12:04
  • Hi @Kavya, if my answer addressed your question, do consider accepting and upvoting it as per [Stack Overflow guidelines](https://stackoverflow.com/help/someone-answers), helping more Stack contributors with their research. If not, let me know so that I can improve my answer. – Arpita Shrivastava Aug 01 '23 at 09:58

After experiencing the same issue, I found a fix in the latest version of the Google provider for Airflow, which is not yet available in Cloud Composer. However, you can manually override this by adding the release candidate package to your Cloud Composer instance.

You can use the release candidate for version 10.5.0 of the apache-airflow-providers-google python package. It can be found here.

The override can be accomplished either by manually adding a PyPI package in the Cloud Composer environment's settings, or by adding the package to the Terraform resource. The update takes about 15-30 minutes.
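For example, with the gcloud CLI the package override can be applied like this (the environment name, location, and the exact "rc" suffix are placeholders; check the release-candidate page for the actual version string):

```shell
# Pin the provider release candidate as an extra PyPI package
# on the environment. All identifiers below are placeholders.
gcloud composer environments update my-composer-env \
    --location us-central1 \
    --update-pypi-package=apache-airflow-providers-google==10.5.0rc1
```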

I tested this and can confirm it works. Tasks can again run longer than 1h.

Jonny5