Connection Timeout after 15 min when using sqlalchemy+docker swarm

Question

I have a FastAPI+SQLAlchemy+MariaDB application that works fine when run locally or with docker compose (docker compose up). But when I run it in swarm mode (docker stack deploy -c docker-compose.yml issuetest), it raises a connection error after exactly 15 minutes of idle time:

sqlalchemy.exc.OperationalError: (asyncmy.errors.OperationalError) (2013, 'Lost connection to MySQL server during query ([Errno 104] Connection reset by peer)')

The default MariaDB timeout should be 8 hours. I can avoid the issue by setting pool_recycle=60*10 (or any other value below 15 minutes), but I would like to understand what went wrong.

To reproduce, here is a minimal code sample of app/main.py:

import uvicorn
from fastapi import FastAPI
from sqlalchemy.ext.asyncio import AsyncSession, create_async_engine
from sqlmodel import Field, SQLModel, select

engine = create_async_engine("mysql+asyncmy://root:pw@mariadbhost/somedb", future=True)
app = FastAPI()


class Car(SQLModel, table=True):
    id: int = Field(nullable=True, primary_key=True)
    name: str


@app.on_event("startup")
async def on_startup():
    async with engine.begin() as conn:
        await conn.run_sync(SQLModel.metadata.create_all)


async def get_db_cars():
    async with AsyncSession(engine) as session:
        statement = select(Car)
        result = await session.execute(statement)
        cars = result.scalars().all()
    return cars


@app.get("/dbcall")
async def dbcall():
    return await get_db_cars()


if __name__ == "__main__":
    uvicorn.run("main:app", host="0.0.0.0", port=8000, reload=True)

And the docker-compose.yml file:

version: '3.1'

services:
  mariadbhost:
    image: mariadb:10.7
    environment:
      MYSQL_ROOT_PASSWORD: pw
      MYSQL_DATABASE: somedb

  mybackend:
    image: myimage
    ports:
      - 8089:80

PzaThief · Answer 1 · 2023-03-21T10:37:04.007

I'm late, but this answer is not only for you but also for anyone who reaches this question in the future.

You are probably using VIP endpoint mode for the database service in swarm. Even if you did not set endpoint_mode in the compose file, its default is VIP mode.

The problem is the idle timeout on the swarm network's virtual IP. The swarm network uses IPVS to route TCP connections, and the IPVS default timeout for idle connections is 900 s (see the Linux kernel source). Swarm does not modify it.

The same problem can appear with HTTP connections. If an HTTP client has a keep-alive time above 15 minutes and the connection carries no traffic for 15 minutes after being established, the client will still try to reuse the connection, but it has already been dropped on the VIP side.
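For a client that owns its sockets, one way to stop such a connection from sitting idle past the 900 s limit is TCP keepalive with an idle time below it. The following is a minimal sketch, not a tested fix for any particular HTTP client: the helper name enable_keepalive and its default values are hypothetical, and the TCP_KEEP* option names are platform-specific (available on Linux), so they are guarded with hasattr.

```python
import socket


def enable_keepalive(sock, idle_s=600, interval_s=60, probes=5):
    """Turn on TCP keepalive so probes are sent before 900 s of idleness.

    idle_s, interval_s and probes are illustrative defaults, chosen only
    so that the first probe goes out well before the 15 minute mark.
    """
    # Enable keepalive probes on this socket (portable option).
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

    # The fine-grained knobs below are not defined on every platform.
    if hasattr(socket, "TCP_KEEPIDLE"):
        # Seconds of idleness before the first probe is sent.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle_s)
    if hasattr(socket, "TCP_KEEPINTVL"):
        # Seconds between subsequent probes.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval_s)
    if hasattr(socket, "TCP_KEEPCNT"):
        # Unanswered probes before the connection is considered dead.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)
    return sock
```

Most HTTP and database client libraries create their sockets internally, so whether and how these options can be applied depends on the library in use.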

IPVS assumes connections are stateless and short-lived, and the VIP needs to refresh them after a while so that requests stay evenly distributed across multiple instances. This is the reason for the VIP's short timeout.

So if you want to fix this problem, there are several ways, listed below.

Change the endpoint mode to dnsrr. dnsrr returns the list of instance IPs and your application needs to choose one of them, so the VIP is no longer used. But you can't use it for a service that needs to publish a port in ingress mode.
Use a keep-alive method. In your case, you can run a health-check query such as SELECT 1. For the HTTP keep-alive case, you can use TCP keepalive probes, or just lower the keep-alive time to under 15 minutes.
Change the default IPVS timeout. Because this is not a recommended approach and I have not tested it on my machine, I won't describe it further.
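On the SQLAlchemy side, a related approach is to make sure the pool never hands out a connection that may already have been dropped. The sketch below derives pool_recycle from the 900 s timeout and turns on pool_pre_ping. The helper names engine_kwargs and make_engine and the 300 s margin are assumptions for illustration. Neither setting sends traffic while the connection is idle: pool_recycle replaces connections older than the given age at checkout, and pool_pre_ping tests a connection at checkout and reconnects if it is dead.

```python
IPVS_IDLE_TIMEOUT_S = 900  # IPVS default idle timeout cited in this answer


def engine_kwargs(idle_timeout_s: int = IPVS_IDLE_TIMEOUT_S,
                  margin_s: int = 300) -> dict:
    """Build pool settings that retire connections before the idle timeout."""
    if not 0 < margin_s < idle_timeout_s:
        raise ValueError("margin_s must be positive and smaller than idle_timeout_s")
    return {
        # Replace pooled connections older than this many seconds.
        "pool_recycle": idle_timeout_s - margin_s,
        # Test each connection on checkout and reconnect if it is dead.
        "pool_pre_ping": True,
    }


def make_engine(url: str):
    """Create an async engine with the settings above (hypothetical helper)."""
    # Imported here so engine_kwargs() can be used without SQLAlchemy installed.
    from sqlalchemy.ext.asyncio import create_async_engine

    return create_async_engine(url, **engine_kwargs())
```

With the defaults, engine_kwargs() yields pool_recycle=600, the same value as the 60*10 workaround from the question, and make_engine("mysql+asyncmy://root:pw@mariadbhost/somedb") would build the engine from the question's URL.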

I hope this answer is useful to anyone who has suffered from the same problem.
