0

So i have a flask app, that connects to a Postgres12 DB using a user that just has select privileges. Often times i see my apis return 400 error, and api does this when the sql query executed returns None.

I have built a small wrapper class over execute and executemany function for some error handling.

import time
from functools import wraps

import psycopg
from psycopg import InterfaceError, OperationalError

logger = logging.getLogger(__name__)


def retry(fn):
    @wraps(fn)
    def wrapper(*args, **kw):
        cls = args[0]
        exec = None
        for x in range(cls._reconnectTries):
            try:
                return fn(*args, **kw)
            except (InterfaceError, OperationalError) as e:
                logger.warning(f"Database Connection {e} exception type: {type(e)}")
                logger.info(f"Idle for {cls._reconnectIdle} seconds")
                time.sleep(cls._reconnectIdle)
                cls._connect()
                exec = e
        import sys

        logger.exception(f"Exiting the system, {exec} ")
        sys.exit(exec)

    return wrapper


class Connection:
    _reconnectTries = 5
    _reconnectIdle = 2

    def __init__(self, conn_string):
        self._conn_string = conn_string
        self.conn = None
        self.cursor = None
        self._connect()

    def _connect(self):
        self.conn = psycopg.connect(self._conn_string)
        self.conn.autocommit = True
        self.cursor = self.conn.cursor()

    @retry
    def execute(self, **kwargs):
        # self.conn.commit()
        if "query" in kwargs:
            """
            this is done to ensure postgres logs multi line queries sent by client in single line for easier
            log collection and debugging.
            """
            kwargs["query"] = kwargs["query"].replace("\n", " ")
            kwargs["query"] = " ".join(kwargs["query"].split())
        return self.cursor.execute(**kwargs)

    @retry
    def executemany(self, **kwargs):
        # self.conn.commit()
        return self.cursor.executemany(**kwargs)

For the sake of simplicity, the query code looks somewhat like

store_detail_by_link_query = """
SELECT
   json_build_object('id', store.id, 'uuid', store.uuid)
FROM
   optimus_store store 

WHERE
store.link= %(store_link_or_domain)s and store.is_active = TRUE and store.is_deleted = FALSE;

"""
optimus_connection = Connection(conn_string=CONN_STRING)
params = {
            "store_link_or_domain": "dipen28",
        }
row = optimus_connection.execute(
            query=store_detail_by_link_query,
            params=params,
        ).fetchone()

The problem is, in the same api call if i just do a check that result is None, then rerun the query, at that time the result comes.

I know the data is there in database, this is coming to us in our production system and i am unable to reproduce it at local.

Any help is much appreciated.

davidism
  • 121,510
  • 29
  • 395
  • 339
Dipendra bhatt
  • 729
  • 5
  • 14
  • 1) Not sure why you rebuild the query, Postgres logs multi-line queries. 2) Best guess is this is a visibility issue. That one session makes the insert/update but has not committed them so the other session that runs the select does not see the data. Then the insert/update session commits and the second run of the select finds the data. – Adrian Klaver Aug 17 '22 at 17:32
  • The logging thing was just for personal use. I use pg analyze and it somehow doesnt handle multiline logs nicely. So i had that guess, but you see the problem is that store with link 'dipen28' is my personal test store and nobody has access to it. And there are not going to be insert and just update if there are one. And even if theres a update the select should not be hindered by it right. The select should still return some version of that data, the problem is the result comes none. – Dipendra bhatt Aug 18 '22 at 06:22
  • But you said there was no issue on your local instance just the production one. On the production instance there will be inserts, correct? Another potential issue is locking on the table. More information is available here [Locking](https://www.postgresql.org/docs/current/explicit-locking.html). The only lock that interferes with a normal SELECT is ACCESS EXCLUSIVE. Since this seems to happen only on the production instance you will need to crank up the logging, if possible to see if can catch the culprit. – Adrian Klaver Aug 18 '22 at 15:18
  • So what i meant by not able to repro in local is, for debugging this obscure bug i was actually connecting my local application with the prod db and then trying to repro it. So while both the prod application and local application were connected with the same db, i was doing a load test on a specific endpoint and the local application was giving 200 response 100% of time while at the same time the prod application was giving 1 or 2 random 400 response in the response. Thats what is confusing as hell – Dipendra bhatt Aug 19 '22 at 06:02
  • The load test was done by using hey tool, with the config hey -n 100 -c 10 – Dipendra bhatt Aug 19 '22 at 06:04
  • So what is the difference in the network connectivity between the local and production application? – Adrian Klaver Aug 19 '22 at 14:49
  • Don't think anything major, the prod application is in the same data centre as db. – Dipendra bhatt Aug 19 '22 at 20:02
  • Yes but where is the local application? – Adrian Klaver Aug 19 '22 at 20:04
  • So a update, in my connection class i am using cursor to execute the query. But i i checked psycopg3 exposes the execute method directly as a property of connection object. And using that execute directly from connection object, the random errors has gone. It has been almost 8 hours and zero such errors have come in prod. So i believe maybe using the cursor might have been creating some issues. https://github.com/psycopg/psycopg2/issues/346#issuecomment-363259003 as i found this issue is kinda similar to what i was facing. – Dipendra bhatt Aug 19 '22 at 20:07
  • PS by local connection i meant my iMac. – Dipendra bhatt Aug 19 '22 at 20:12
  • 1) The issue was for `psycopg2` not `psycopg(3)` and it was fixed almost 3 years ago. 2) From [here](https://github.com/psycopg/psycopg2/issues/346#issuecomment-456502595) I'm thinking the answer to your issue would be to do `exec = optimus_connection.execute(..)` then do `row = exec.fetchone()` 3) I'm betting your iMac is not in the data center but is connecting to it over a network. – Adrian Klaver Aug 19 '22 at 20:54
  • Yup like i have mentioned, after directly running execute on connection object solved the problem. But using psycopg3 on the cursor the problem was still coming. And yup my system was connecting to it over a network, i dont see how that would be an issue. Besides the problem was coming on prod application and could be reproduced with the load test using hey while it wasnt being reproduced in the local. Anyways i will chalk it up to some obscure bug in cursor, which until i can deterministically reproduce ignore it. Thank you for your help. – Dipendra bhatt Aug 19 '22 at 21:23

0 Answers0