Context:
I run Envoy as a grpc-web proxy in front of a number of gRPC servers. Each server has a dedicated route and cluster (see config below). Envoy runs inside a Docker container with no special changes (only the config and TLS certificates). Envoy and the gRPC servers are connected via a Docker network.
Problem:
Whenever I restart the Envoy container, the first few grpc-web calls time out before requests start going through. This happens even after the container has fully started, and leaving it running longer does not prevent it (I left it for hours). After the first ~3 failing requests, everything works fine until the container is restarted again.
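To quantify the behavior, I count how many consecutive requests fail after a restart before the first success. A minimal sketch of that measurement (the actual probe against the proxy is left out; any callable returning True on a successful grpc-web call, e.g. one hitting https://localhost:8080, can be plugged in):

```python
import time
from typing import Callable


def count_failures_until_success(probe: Callable[[], bool],
                                 max_attempts: int = 20,
                                 delay_s: float = 0.0) -> int:
    """Call probe() until it first returns True; return how many attempts
    failed before that (or max_attempts if none succeed)."""
    for attempt in range(max_attempts):
        if probe():
            return attempt
        if delay_s:
            time.sleep(delay_s)
    return max_attempts


# Simulated demo: a probe that fails its first three calls, mimicking
# what I observe right after a container restart.
attempts = iter([False, False, False, True])
print(count_failures_until_success(lambda: next(attempts)))  # -> 3
```

Against the real setup, the measured count is consistently around 3, independent of how long the container has been up.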
Relevant configs:
I removed anything obviously unnecessary from the docker-compose file and condensed the Envoy config as much as possible (removed the repeated parts for each server).
envoy.yml:
admin:
  access_log_path: /tmp/admin_access.log
  address:
    socket_address: { address: 0.0.0.0, port_value: 9901 }
static_resources:
  listeners:
    - name: listener_0
      address:
        socket_address: { address: 0.0.0.0, port_value: 8080 }
      listener_filters:
        - name: "envoy.filters.listener.tls_inspector"
          typed_config: { }
      filter_chains:
        # Use HTTPS (TLS) encryption for ingress data
        # Disable this to allow tools like BloomRPC which don't work via HTTPS
        - transport_socket:
            name: envoy.transport_sockets.tls
            typed_config:
              "@type": type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.DownstreamTlsContext
              common_tls_context:
                tls_certificates:
                  - certificate_chain:
                      filename: "/etc/envoy/envoy.pem"
                    private_key:
                      filename: "/etc/envoy/envoy.key"
          filters:
            - name: envoy.filters.network.http_connection_manager
              typed_config:
                "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
                codec_type: auto
                stat_prefix: ingress_http
                access_log:
                  # Logger for gRPC requests (identified by the presence of the "x-grpc-web" header)
                  - name: envoy.access_loggers.file
                    filter:
                      header_filter:
                        header:
                          name: "x-grpc-web"
                    typed_config:
                      "@type": type.googleapis.com/envoy.extensions.access_loggers.file.v3.FileAccessLog
                      path: /dev/stdout
                      format: "[%START_TIME%] \"%DOWNSTREAM_REMOTE_ADDRESS_WITHOUT_PORT%\": \"%REQ(X-ENVOY-ORIGINAL-PATH?:PATH)%\" -> \"%UPSTREAM_HOST%\" [gRPC-status: %GRPC_STATUS%] (cluster: %UPSTREAM_CLUSTER% route: %ROUTE_NAME%)\n"
                  # Logger for HTTP(S) requests (everything that is not a gRPC request)
                  - name: envoy.access_loggers.file
                    filter:
                      header_filter:
                        header:
                          name: "x-grpc-web"
                          invert_match: true
                    typed_config:
                      "@type": type.googleapis.com/envoy.extensions.access_loggers.file.v3.FileAccessLog
                      path: /dev/stdout
                      format: "[%START_TIME%] \"%DOWNSTREAM_REMOTE_ADDRESS_WITHOUT_PORT%\": \"%REQ(:METHOD)% %REQ(X-ENVOY-ORIGINAL-PATH?:PATH)%\" -> \"%UPSTREAM_HOST%\" [http(s)-status: %RESPONSE_CODE%] (cluster: %UPSTREAM_CLUSTER% route: %ROUTE_NAME%)\n"
                stream_idle_timeout: 43200s # 12h
                route_config:
                  name: local_route
                  virtual_hosts:
                    - name: gRPC-Web-Proxy
                      domains: [ "*" ]
                      request_headers_to_add:
                        - header:
                            key: "source"
                            value: "envoy"
                          append: false
                        - header:
                            key: "downstream-address"
                            value: "%DOWNSTREAM_REMOTE_ADDRESS_WITHOUT_PORT%"
                          append: false
                      cors:
                        allow_origin_string_match:
                          - prefix: "*"
                        allow_methods: GET, PUT, DELETE, POST, OPTIONS
                        allow_headers: keep-alive,user-agent,cache-control,content-type,content-transfer-encoding,x-accept-content-transfer-encoding,x-accept-response-streaming,x-user-agent,x-grpc-web,grpc-timeout,x-envoy-retry-grpc-on,x-envoy-max-retries,auth-token,x-real-ip,client-ip,x-forwarded-for,x-forwarded,x-cluster-client-ip,forwarded-for,forwarded
                        max_age: "1728000"
                        expose_headers: grpc-status,grpc-message
                      routes: # https://www.envoyproxy.io/docs/envoy/latest/api-v3/config/route/v3/route_components.proto
                        - name: grpcserver_gRPCRoute
                          match:
                            prefix: "/api/services.grpcserver"
                          route:
                            cluster: grpcserver_gRPCCluster
                            prefix_rewrite: "/services.grpcserver"
                            timeout: 0s # No timeout. Otherwise, streams will be aborted regularly
                http_filters:
                  - name: envoy.filters.http.grpc_web
                  - name: envoy.filters.http.cors
                  - name: envoy.filters.http.router
  clusters:
    - name: grpcserver_gRPCCluster
      connect_timeout: 0.25s
      type: static
      http2_protocol_options: { }
      lb_policy: round_robin
      load_assignment:
        cluster_name: grpcserver_gRPCCluster
        endpoints:
          - lb_endpoints:
              - endpoint:
                  address:
                    socket_address:
                      address: 127.0.0.1
                      port_value: 20001
      transport_socket:
        # Connect to the microservice via TLS
        name: envoy.transport_sockets.tls
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.UpstreamTlsContext
          common_tls_context:
            tls_certificates:
              - certificate_chain: { "filename": "/etc/envoy/envoy.pem" }
                private_key: { "filename": "/etc/envoy/envoy.key" }
            # Validate CA of microservice
            validation_context:
              match_subject_alt_names:
              trusted_ca:
                filename: /etc/ssl/certs/ca-certificates.crt
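For diagnosis, the admin interface on port 9901 (configured above) exposes per-cluster counters at /stats, one "name: value" pair per line. A small parser like this (a sketch; the sample values are made up, but `upstream_cx_total`, `upstream_cx_connect_fail` and `upstream_cx_active` are real Envoy upstream connection stats) extracts the counters I compare right after a restart and again after the first successful call:

```python
def parse_envoy_stats(text: str, cluster: str) -> dict:
    """Parse Envoy admin /stats output ("name: value" per line) and
    return the numeric stats belonging to the given cluster."""
    prefix = f"cluster.{cluster}."
    stats = {}
    for line in text.splitlines():
        name, sep, value = line.partition(": ")
        if sep and name.startswith(prefix):
            try:
                stats[name[len(prefix):]] = int(value)
            except ValueError:
                pass  # skip histograms and other non-integer values
    return stats


# Illustrative sample of /stats output (values are made up):
sample = """\
cluster.grpcserver_gRPCCluster.upstream_cx_total: 4
cluster.grpcserver_gRPCCluster.upstream_cx_connect_fail: 3
cluster.grpcserver_gRPCCluster.upstream_cx_active: 1
cluster.other.upstream_cx_total: 7
"""
print(parse_envoy_stats(sample, "grpcserver_gRPCCluster"))
# -> {'upstream_cx_total': 4, 'upstream_cx_connect_fail': 3, 'upstream_cx_active': 1}
```

In practice the input would come from `curl -s http://localhost:9901/stats` against the running container.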
docker-compose.yml:
version: '2.4'
networks:
  core:
    name: Service_Core
    driver: bridge
    ipam:
      config:
        - subnet: 198.51.100.0/24
          gateway: 198.51.100.1
services:
  envoy:
    container_name: "envoy"
    image: "envoyproxy/envoy:v1.17.1"
    ports:
      - 8080:8080
    networks:
      - core
    restart: always
    security_opt:
      - apparmor:unconfined
    environment:
      - ENVOY_UID=17200
      - ENVOY_GID=17200
    volumes:
      - "/somepath/envoy.pem:/etc/envoy/envoy.pem:ro"
      - "/somepath/envoy.key:/etc/envoy/envoy.key:ro"
      - "/somepath/ca.pem:/etc/ssl/certs/ca-certificates.crt:ro"
      - "/somepath/envoy.yml:/etc/envoy/envoy.yaml:ro"
  grpcserver:
    image: "<grpcserver>"
    container_name: "grpcserver"
    restart: always
    networks:
      - core
    security_opt:
      - apparmor:unconfined
  frontend:
    image: "<frontend>" # an nginx serving the files for the UI
    container_name: "frontend"
    restart: always
    networks:
      - core
    ports:
      - 80:80
      - 443:443
    volumes:
      - "/somepath/ssl/:/opt/ssl/"
    security_opt:
      - apparmor:unconfined
What could be causing this behavior?
I am only interested in a fix in the Docker or Envoy config. I have considered workarounds, but I would rather fix the root cause.