I realize that similar questions have been asked before, but none of the solutions have worked for me.
To be succinct: I have a SQLAlchemy Select object which is a query that includes a where(not_in(subquery)) clause:
subquery = select(table_2.c.ids).distinct()

query = (
    select(table.c.column)
    .where(...)
    .where(table.c.column.not_in(subquery))
)
Very simple. The subquery returns around half a million rows, all of them to be excluded from the main query.
I need the result in a pandas DataFrame, so I execute the query using read_sql:
data = pd.read_sql(con=engine.connect(), sql=query)
When I try to run this, the application freezes and eventually fails with this error:
pyodbc.Error: ('07002', '[07002] [Microsoft][SQL Server Native Client 11.0]COUNT field incorrect or syntax error (0) (SQLExecDirectW)')
Investigating the issue, I eventually found out that there is a hard limit on the number of parameters a query can have (2100), because every single id in the subquery gets translated into its own bound parameter, and the driver cannot cope with a statement carrying half a million parameters.
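To illustrate what I mean (a toy example, not my real tables), compiling an IN against a plain Python list shows one bound parameter per value, which is how the 2100 limit gets hit:

from sqlalchemy import column, select, table

t = table("t", column("id"))
stmt = select(t.c.id).where(t.c.id.in_([1, 2, 3]))

# render_postcompile expands the IN into its individual parameters,
# printing roughly: SELECT t.id FROM t WHERE t.id IN (:id_1_1, :id_1_2, :id_1_3)
print(stmt.compile(compile_kwargs={"render_postcompile": True}))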
What I have tried
I tried to add chunksize to read_sql. Didn't work, same error regardless of the chunksize. This approach would be complicated for me anyway, because the resulting dataframe is fed immediately to another function, so I would need to introduce some kind of barrier to stop execution until every chunk has finished computing.
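For reference, the chunked variant looked roughly like this (the concat at the end being the kind of barrier I mean; the chunk size is arbitrary):

# with chunksize set, read_sql returns an iterator of DataFrames instead of one frame
chunks = pd.read_sql(con=engine.connect(), sql=query, chunksize=50_000)

# pd.concat acts as the barrier: nothing downstream runs until every chunk has arrived
data = pd.concat(chunks, ignore_index=True)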
I tried messing around with the engine parameters, like passing fast_executemany=True to the create_engine function. Didn't work, exact same error.
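Concretely, that was just this (connection string is a placeholder, and as far as I can tell fast_executemany only affects executemany-style inserts, which may be why it changed nothing here):

from sqlalchemy import create_engine

# fast_executemany is a flag of the mssql+pyodbc dialect
engine = create_engine(
    "mssql+pyodbc://user:password@server/database?driver=SQL+Server+Native+Client+11.0",
    fast_executemany=True,
)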
I tried using bind params (db being my alias for the sqlalchemy module):

query = (
    select(table.c.column)
    .where(...)
    .where(table.c.column.not_in(db.bindparam("p1", expanding=True, literal_execute=True)))
)

with engine.connect() as connection:
    result = connection.execute(subquery)
    ids = [r[0] for r in result]

data = pd.read_sql(con=engine.connect(), sql=query, params={"p1": ids})
It didn't throw a syntax error, so I think I got that part right at least. But it didn't solve my problem: exact same error.
So, I have tried everything I could find and still no luck. At this point I'm thinking of rewriting the query as an outer join (sketched below), even though I have read that it is less efficient.
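Something like this is what I have in mind (same placeholder names as above):

# anti-join: keep only rows from table that have no match in table_2
join = table.outerjoin(table_2, table.c.column == table_2.c.ids)
query = (
    select(table.c.column)
    .select_from(join)
    .where(table_2.c.ids.is_(None))
)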
Does anyone have any ideas? Multiple people must have stumbled upon the same limitation, and there has to be a real solution. I just can't find it.