0

My Python (FastAPI) app is deployed to an AKS cluster (several replicas/pods). It makes requests to a MySQL database managed by "Azure for MySQL Single Server". The app works just fine, but I occasionally get timeout errors (more details later). The database is used to synchronize the product catalogue and to register some user events on an e-commerce website, so it's mostly used during daily hours and barely used at night. These are the errors occurred over the last 7 days: enter image description here

I use sqlalchemy (with pymysql) and this is how I set up the engine and the session maker:

        self.engine = create_engine(
            db_uri,
            pool_pre_ping=True,
            connect_args={
                # To enable SSL, we need to pass the following non-empty dict.
                # We don't actually need any real SSL certificate (ssl_ca) or keys.
                "ssl": {"ssl_ca": ""}
            }
        )
        self.session = sessionmaker(autoflush=False, autocommit=False, bind=self.engine)

Note that I use pool_pre_ping=True, which is described here. This should prevent any error due to using a connection closed by mysql after wait_timeout seconds of inactivity (which is set quite low on Azure).

The full error stack trace is the following:

Traceback (most recent call last):"
  File "/usr/local/lib/python3.9/site-packages/uvicorn/protocols/http/h11_impl.py", line 404, in run_asgi"
    result = await app(  # type: ignore[func-returns-value]"
  File "/usr/local/lib/python3.9/site-packages/uvicorn/middleware/proxy_headers.py", line 78, in __call__"
    return await self.app(scope, receive, send)"
  File "/usr/local/lib/python3.9/site-packages/uvicorn/middleware/message_logger.py", line 86, in __call__"
    raise exc from None"
  File "/usr/local/lib/python3.9/site-packages/uvicorn/middleware/message_logger.py", line 82, in __call__"
    await self.app(scope, inner_receive, inner_send)"
  File "/usr/local/lib/python3.9/site-packages/fastapi/applications.py", line 270, in __call__"
    await super().__call__(scope, receive, send)"
  File "/usr/local/lib/python3.9/site-packages/starlette/applications.py", line 124, in __call__"
    await self.middleware_stack(scope, receive, send)"
  File "/usr/local/lib/python3.9/site-packages/starlette/middleware/errors.py", line 184, in __call__"
    raise exc"
  File "/usr/local/lib/python3.9/site-packages/starlette/middleware/errors.py", line 162, in __call__"
    await self.app(scope, receive, _send)"
  File "/usr/local/lib/python3.9/site-packages/prometheus_fastapi_instrumentator/middleware.py", line 103, in __call__"
    raise exc"
  File "/usr/local/lib/python3.9/site-packages/prometheus_fastapi_instrumentator/middleware.py", line 101, in __call__"
    await self.app(scope, receive, send_wrapper)"
  File "/usr/local/lib/python3.9/site-packages/asgi_correlation_id/middleware.py", line 81, in __call__"
    await self.app(scope, receive, handle_outgoing_request)"
  File "/usr/local/lib/python3.9/site-packages/starlette/middleware/exceptions.py", line 75, in __call__"
    raise exc"
  File "/usr/local/lib/python3.9/site-packages/starlette/middleware/exceptions.py", line 64, in __call__"
    await self.app(scope, receive, sender)"
  File "/usr/local/lib/python3.9/site-packages/fastapi/middleware/asyncexitstack.py", line 21, in __call__"
    raise e"
  File "/usr/local/lib/python3.9/site-packages/fastapi/middleware/asyncexitstack.py", line 18, in __call__"
    await self.app(scope, receive, send)"
  File "/usr/local/lib/python3.9/site-packages/starlette/routing.py", line 680, in __call__"
    await route.handle(scope, receive, send)"
  File "/usr/local/lib/python3.9/site-packages/starlette/routing.py", line 275, in handle"
    await self.app(scope, receive, send)"
  File "/usr/local/lib/python3.9/site-packages/starlette/routing.py", line 65, in app"
    response = await func(request)"
  File "/usr/local/lib/python3.9/site-packages/fastapi/routing.py", line 231, in app"
    raw_response = await run_endpoint_function("
  File "/usr/local/lib/python3.9/site-packages/fastapi/routing.py", line 162, in run_endpoint_function"
    return await run_in_threadpool(dependant.call, **values)"
  File "/usr/local/lib/python3.9/site-packages/starlette/concurrency.py", line 41, in run_in_threadpool"
    return await anyio.to_thread.run_sync(func, *args)"
  File "/usr/local/lib/python3.9/site-packages/anyio/to_thread.py", line 31, in run_sync"
    return await get_asynclib().run_sync_in_worker_thread("
  File "/usr/local/lib/python3.9/site-packages/anyio/_backends/_asyncio.py", line 937, in run_sync_in_worker_thread"
    return await future"
  File "/usr/local/lib/python3.9/site-packages/anyio/_backends/_asyncio.py", line 867, in run"
    result = context.run(func, *args)"
  File "/app/./main.py", line 147, in register_session_event"
    return handle_register_event(db, session_event)"
  File "/app/./modules/events.py", line 61, in handle_register_event"
    session.commit()"
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/orm/session.py", line 1451, in commit"
    self._transaction.commit(_to_root=self.future)"
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/orm/session.py", line 829, in commit"
    self._prepare_impl()"
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/orm/session.py", line 808, in _prepare_impl"
    self.session.flush()"
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/orm/session.py", line 3386, in flush"
    self._flush(objects)"
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/orm/session.py", line 3526, in _flush"
    transaction.rollback(_capture_exception=True)"
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/util/langhelpers.py", line 70, in __exit__"
    compat.raise_("
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/util/compat.py", line 208, in raise_"
    raise exception"
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/orm/session.py", line 3486, in _flush"
    flush_context.execute()"
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/orm/unitofwork.py", line 456, in execute"
    rec.execute(self)"
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/orm/unitofwork.py", line 630, in execute"
    util.preloaded.orm_persistence.save_obj("
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/orm/persistence.py", line 212, in save_obj"
    for ("
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/orm/persistence.py", line 373, in _organize_states_for_save"
    for state, dict_, mapper, connection in _connections_for_states("
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/orm/persistence.py", line 1737, in _connections_for_states"
    connection = uowtransaction.transaction.connection(base_mapper)"
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/orm/session.py", line 626, in connection"
    return self._connection_for_bind(bind, execution_options)"
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/orm/session.py", line 735, in _connection_for_bind"
    conn = self._parent._connection_for_bind(bind, execution_options)"
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/orm/session.py", line 747, in _connection_for_bind"
    conn = bind.connect()"
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 3315, in connect"
    return self._connection_cls(self, close_with_result=close_with_result)"
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 96, in __init__"
    else engine.raw_connection()"
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 3394, in raw_connection"
    return self._wrap_pool_connect(self.pool.connect, _connection)"
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 3364, in _wrap_pool_connect"
    Connection._handle_dbapi_exception_noconnection("
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 2198, in _handle_dbapi_exception_noconnection"
    util.raise_("
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/util/compat.py", line 208, in raise_"
    raise exception"
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 3361, in _wrap_pool_connect"
    return fn()"
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/pool/base.py", line 320, in connect"
    return _ConnectionFairy._checkout(self)"
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/pool/base.py", line 884, in _checkout"
    fairy = _ConnectionRecord.checkout(pool)"
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/pool/base.py", line 491, in checkout"
    rec._checkin_failed(err, _fairy_was_created=False)"
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/util/langhelpers.py", line 70, in __exit__"
    compat.raise_("
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/util/compat.py", line 208, in raise_"
    raise exception"
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/pool/base.py", line 488, in checkout"
    dbapi_connection = rec.get_connection()"
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/pool/base.py", line 646, in get_connection"
    self.__connect()"
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/pool/base.py", line 678, in __connect"
    pool.logger.debug("Error on connect(): %s", e)"
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/util/langhelpers.py", line 70, in __exit__"
    compat.raise_("
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/util/compat.py", line 208, in raise_"
    raise exception"
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/pool/base.py", line 673, in __connect"
    self.dbapi_connection = connection = pool._invoke_creator(self)"
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/engine/create.py", line 578, in connect"
    return dialect.connect(*cargs, **cparams)"
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/engine/default.py", line 598, in connect"
    return self.dbapi.connect(*cargs, **cparams)"
  File "/usr/local/lib/python3.9/site-packages/pymysql/connections.py", line 353, in __init__"
    self.connect()"
  File "/usr/local/lib/python3.9/site-packages/pymysql/connections.py", line 664, in connect"
    raise exc"
sqlalchemy.exc.OperationalError: (pymysql.err.OperationalError) (2003, "Can't connect to MySQL server on '<omitted>.mysql.database.azure.com' (timed out)")"

Some facts:

  • The error is due to a connection timeout and happens after 10 seconds from when the request is handled by the web server; 10 seconds is the default connect_timeout value in pymysql.
  • From my logs, I can see that meanwhile the worker handling the request that is going to fail 10 seconds later is waiting, other requests are being handled just fine without any connection error (this applies to all pods, including the one handling the failed request). In other words, the DB is up.
  • Errors don't seem to happen when the load is high. I checked for CPU usage, memory usage, IOPS, active connections. Everything is fine and times of high load don't match error timestamps.

EDIT: Here's the relevant versions:

  • MySQL: 8.0
  • sqlalchemy: 1.4.41
  • pymysql: 1.0.2
janluke
  • 1,567
  • 1
  • 15
  • 19
  • Post TEXT data on justpaste.it and share the links. From your SSH login root, Text results of: A) SELECT COUNT(*), sum(data_length), sum(index_length), sum(data_free) FROM information_schema.tables; B) SHOW GLOBAL STATUS; after minimum 24 hours UPTIME C) SHOW GLOBAL VARIABLES; D) SHOW FULL PROCESSLIST; E) STATUS; not SHOW STATUS, just STATUS; G) SHOW ENGINE INNODB STATUS; for server workload tuning analysis to provide timeout extension suggestions. – Wilson Hauck Jan 28 '23 at 19:07
  • Post TEXT data on justpaste.it and share the links. Additional very helpful OS information includes - please, htop 1st page, if available, TERMINATE, top -b -n 1 for most active apps, top -b -n 1 -H for details on your mysql threads memory and cpu usage, ulimit -a for list of limits, iostat -xm 5 3 for IOPS by device & core/cpu count, df -h for Used - Free space by device, df -i for inode info by device, free -h for Used - Free Mem: and Swap:, cat /proc/meminfo includes VMallocUused, for server workload tuning analysis to provide suggestions. – Wilson Hauck Jan 28 '23 at 19:07
  • I can share only some of the info: https://gist.github.com/janluke/2fbf7cf3a833563b0080d0ff45fd7380. I checked (max) metrics on Azure. CPU usage is low, memory is stable at 80%. Connections are fine. IOPS occasionally reach 100% or near 100% but my errors don't happen in periods of high IOPS load. – janluke Jan 30 '23 at 11:26
  • Your github posting ended before the end of B) SHOW GLOBAL STATUS; . You will likely find that justpaste.it is more reasonable in posting the data requested. Will be unable to complete workload analysis with limited data that has been posted. – Wilson Hauck Jan 30 '23 at 21:40
  • That's all I get for SHOW GLOBAL STATUS, no text is missing. – janluke Jan 31 '23 at 02:02
  • Could you post C) SHOW GLOBAL VARIABLES; minimum. Other data would be nice to have. Do you know your Azure type that is running, or RAM GB and vcpu's/cores? Is data storage on 1) HDD, 2) SSD or 3) NVME ? Thanks – Wilson Hauck Jan 31 '23 at 21:54
  • `SHOW VARIABLES LIKE '%timeout%';` – Rick James Feb 01 '23 at 05:04
  • Timeout variables: https://gist.github.com/janluke/2fbf7cf3a833563b0080d0ff45fd7380#file-timeout_variables-txt All variables: https://gist.github.com/janluke/2fbf7cf3a833563b0080d0ff45fd7380#file-all_variables-txt – janluke Feb 01 '23 at 11:59
  • @WilsonHauck Pricing tier: general purpose Compute generation: Gen 5 vCores: 2 Memory: 10GB (5GB per vCore) Storage: 150GB -> max 450 IOPS (can't find if SSD) More info here: https://learn.microsoft.com/en-us/azure/mysql/single-server/concepts-pricing-tiers#compute-generations-vcores-and-memory – janluke Feb 01 '23 at 12:00
  • @janluke Thanks for posting additional info. Hoping to have an Answer to you with Suggestions in 48 hours. Analysis in process. – Wilson Hauck Feb 01 '23 at 15:49

1 Answers1

0

Could you ask Azure to run powercfg.cpl to determine the Power Plan in use? If you could convince them to select High Performance, you problem may be resolved.

If you can not convince Azure to use High Performance consider the following for Azure portal, SETTINGS, Server parameters

connect_timeout=30  # from 10 seconds to be more patient establishing a connnection
innodb_log_flush_notifier_timeout=30  # from 10 seconds for patience 
innodb_log_flusher_timeout=30  # from 10 seconds for patience 
innodb_log_write_notifier_timeout=30  # from 10 seconds for patience 
innodb_log_writer_timeout=30  # from 10 seconds for patience 

innodb_lru_scan_depth=100  # from 1024 to conserve 90% CPU cycles used for function every second 
innodb_io_capacity=400  # from 200 to use more of your provisioned IOPS 
thread_cache_size=64  # from 11 to reduce threads_created overhead RPhr=7 
innodb_flush_sync=OFF  # from ON to honor innodb_io_capacity limit

Observations, com_flush-37272 for RPhr of 120, typical is 12. Why are flush'es so frequent?

View profile for contact info. There are many more opportunities to improve performance. Get in touch, please.

Wilson Hauck
  • 2,094
  • 1
  • 11
  • 19