On Cloud Composer I have long-running DAG tasks, each of them running for 4 to 6 hours. Each task ends with an error raised by the Kubernetes API. The error message states 401 Unauthorized.
The error message:
kubernetes.client.rest.ApiException: (401)
Reason: Unauthorized
HTTP response headers: HTTPHeaderDict({'Audit-Id': 'e1a37278-0693-4f36-8b04-0a7ce0b7f7a0', 'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'Date': 'Fri, 07 Jul 2023 08:10:15 GMT', 'Content-Length': '129'})
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"Unauthorized","reason":"Unauthorized","code":401}
The Kubernetes API token has an expiry of 1 hour, and Composer is not renewing the token before it expires. This issue never happened with Composer 1; it started showing up only after I migrated from Composer 1 to Composer 2.
Additional details: GKEStartPodOperator has an option called is_delete_operator_pod, which I have set to true. This option deletes the pod from the cluster after the job is done. So, when the task completes after about 4 hours, Composer tries to delete the pod, and that is when the 401 Unauthorized error is shown.
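To make the timing I suspect concrete, here is a minimal sketch in plain Python. It makes no Airflow or Kubernetes calls. The function names are hypothetical, the start time is only illustrative, and it assumes the token obtained when the task starts is reused for the delete request without being refreshed.

```python
from datetime import datetime, timedelta

# Token lifetime as described above (1 hour).
TOKEN_LIFETIME = timedelta(hours=1)


def token_is_valid(issued_at: datetime, now: datetime) -> bool:
    """Return True while the token is younger than its lifetime."""
    return now - issued_at < TOKEN_LIFETIME


def status_for_delete_call(task_start: datetime, task_duration: timedelta) -> int:
    """Status the pod delete request would get, assuming the token
    issued at task start is reused and never refreshed (assumption)."""
    delete_time = task_start + task_duration
    return 200 if token_is_valid(task_start, delete_time) else 401


# Illustrative start time, roughly 4 hours before the error timestamp.
start = datetime(2023, 7, 7, 4, 10, 15)

short_task = status_for_delete_call(start, timedelta(minutes=30))
long_task = status_for_delete_call(start, timedelta(hours=4))
print(short_task, long_task)
```

Under these assumptions, a task shorter than the token lifetime can still delete its pod, while any task longer than 1 hour fails at the delete step, which matches what I see.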
I have checked some Airflow configs, such as kubernetes.enable_tcp_keepalive, which enables the TCP keepalive mechanism for Kubernetes clusters, but it doesn't help resolve the problem.
What can be done to prevent this error?