How to reproducibly query random rows with SQLAlchemy from PostgreSQL?

Question

I am trying to pseudo-randomly select rows from a PostgreSQL table using SQLAlchemy, but I need to use a seed to guarantee reproducibility of the query. The specific case is concerning a publication being submitted, tied to a codebase.

This answer does a great job introducing how one can leverage the following one-liner to select a single random row:

select.order_by(func.random()) # for PostgreSQL, SQLite

Further, one can select many of psuedo-random rows via:

select.order_by(func.random()).limit(n)

But how do I ensure that I select the same psuedo-random rows every time I run the above query?

score 0 · Answer 1 · answered Nov 28 '19 at 16:03

You can leverage the setseed(n) postgreSQL method. Using sqlalchemy and ChEMBL as our sample DB, the full solution looks like this:

from sqlalchemy import create_engine, func, select

query = select([MoleculeRecord.molregno]).order_by(func.random()).limit(500)
SEED = .5  # Change this to any float from -1.0 -> 1.0 to get different query results

e = create_engine("postgres:///chembl_25")
conn = e.connect()
conn.execute(f"SELECT setseed({SEED})")

firstQueryResults = [x.molregno for x in conn.execute(query)]
secondQueryResults = [x.molregno for x in conn.execute(query)]

assert(firstQueryResults == secondQueryResults)

This returns 500 random rows, with the invariant that these 500 rows will always be the same each time this query is executed. Too select different random rows, change the SEED variable too something different, between -1.0 and 1.0.

As a note for Postgresql, the `TABLESAMPLE` clause also supports using a seed for repeatable results. — Ilja Everilä, Nov 28 '19 at 20:22

How to reproducibly query random rows with SQLAlchemy from PostgreSQL?

1 Answers1