
I realize that similar questions have been asked before, but none of the solutions have worked for me.

To be succinct: I have a SQLAlchemy Select object, which is a query that includes a where(not_in(subquery)) clause:

subquery = select(table_2.c.ids).distinct()

query = (
    select(table.c.column)
    .where(...)
    .where(table.c.column.not_in(subquery))
)

Very simple. The subquery returns around half a million rows, all of them to be excluded from the main query.

I need the result in a pandas DataFrame, so I execute the query using read_sql:

data = pd.read_sql(con=engine.connect(), sql=query)

When I try to run this, the application freezes and then fails with this error:

pyodbc.Error: ('07002', '[07002] [Microsoft][SQL Server Native Client 11.0]COUNT field incorrect or syntax error (0) (SQLExecDirectW)')

Investigating the issue, I eventually found out that there is a hard limit on the number of parameters a query can have (2100 for SQL Server), because every single id in the subquery gets translated into a bound parameter, and the driver cannot execute a statement with half a million parameters.
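The mechanism can be seen directly in SQLAlchemy by compiling a small statement: an IN against a Python list expands to one bound parameter per value, so half a million ids would need half a million parameters. The table and column names below are made up for illustration.

```python
from sqlalchemy import column, select, table

# Lightweight stand-in table; an IN over a literal list compiles to one
# bound parameter per value once the expanding placeholder is rendered.
t = table("t", column("id"))
ids = list(range(5))
stmt = select(t.c.id).where(t.c.id.in_(ids))

compiled = stmt.compile(compile_kwargs={"render_postcompile": True})
# compiled.params now holds one entry per id in the list
```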

What I have tried

I tried adding chunksize to read_sql. It didn't work: same error regardless of the chunksize. This approach would be complicated for me anyway, because the resulting dataframe is fed immediately to another function, so I would need to introduce some kind of barrier to stop execution until every chunk has finished computing.
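For reference, the "barrier" part of the chunksize route is usually just pd.concat, which does not return until every chunk has been fetched. A minimal sketch, using an in-memory SQLite table as a stand-in for the real SQL Server source:

```python
import pandas as pd
from sqlalchemy import create_engine, text

# Toy source table; in the real case this would be the SQL Server engine.
engine = create_engine("sqlite://")
with engine.begin() as conn:
    conn.execute(text("CREATE TABLE t (id INTEGER)"))
    conn.execute(text("INSERT INTO t VALUES (1), (2), (3)"))

with engine.connect() as conn:
    # chunksize makes read_sql return an iterator of partial frames;
    # pd.concat acts as the barrier that waits for all of them.
    chunks = pd.read_sql(text("SELECT id FROM t"), con=conn, chunksize=2)
    data = pd.concat(chunks, ignore_index=True)
```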

I tried messing around with the engine parameters, like passing fast_executemany=True to the create_engine function. Didn't work, exact same error.

I tried using bind params:

where(table.c.column.not_in(db.bindparam("p1", expanding=True, literal_execute=True)))

with engine.connect() as connection:
    result = connection.execute(subquery)
    ids = [r[0] for r in result]

data = pd.read_sql(con=engine.connect(), sql=query, params={"p1": ids})

It didn't throw a syntax error, so I think I got that part right at least. But it didn't solve my problem: exact same error.

So, I have tried everything I could find and still no luck. At this point I'm thinking of rewriting the query as an outer join, even though I have read that it is less efficient.
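For what it's worth, the outer-join rewrite keeps the exclusion entirely on the server, so no ids ever become bound parameters. A sketch with made-up table names, using an in-memory SQLite database as a stand-in: rows with no match in `excluded` survive the IS NULL filter.

```python
from sqlalchemy import Column, Integer, MetaData, Table, create_engine, select

engine = create_engine("sqlite://")
meta = MetaData()
main = Table("main", meta, Column("id", Integer))
excluded = Table("excluded", meta, Column("id", Integer))
meta.create_all(engine)

with engine.begin() as conn:
    conn.execute(main.insert(), [{"id": 1}, {"id": 2}, {"id": 3}])
    conn.execute(excluded.insert(), [{"id": 2}])

query = (
    select(main.c.id)
    .select_from(main.outerjoin(excluded, main.c.id == excluded.c.id))
    .where(excluded.c.id.is_(None))  # keep only rows with no match
    .order_by(main.c.id)
)
with engine.connect() as conn:
    kept = [row[0] for row in conn.execute(query)]
```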

Does anyone have any ideas? There must be other people who have stumbled upon the same limitation, and there has to be a real solution. I just can't find it.

  • Can you just use raw queries? – siggemannen Apr 18 '23 at 09:50
  • You mean writing raw SQL? I have to dynamically generate a query based on user input. That's why I'm using SQLAlchemy. – DisplayName Apr 18 '23 at 09:54
  • i'm guessing your user doesn't input millions of IDs? – siggemannen Apr 18 '23 at 12:11
  • He does not. I can kind of see what you are hinting at, but could you please be more explicit? The main query is a mix of the dynamically generated wheres, and the problematic subquery. Are you suggesting I divide the process in two parts, one with raw SQL and the other with SQLAlchemy? – DisplayName Apr 18 '23 at 13:39
  • I'm mostly a raw SQL guy, so i'm thinking you could create your query in form of raw sql: select * from sometable t where somecondition1 = 1 and not exists(select 1 from anothertable t2 where t2.id = t.id and t2.condition2 = 'something' and t2.condition3 = 'something else') – siggemannen Apr 18 '23 at 13:42
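The NOT EXISTS shape from the last comment can also be written with SQLAlchemy Core, so it still composes with dynamically generated .where() clauses. Table and column names below are illustrative (SQLite stands in for SQL Server); the correlated subquery runs server-side, so the excluded ids never become bound parameters.

```python
from sqlalchemy import Column, Integer, MetaData, Table, create_engine, select

engine = create_engine("sqlite://")
meta = MetaData()
main = Table("main", meta, Column("id", Integer))
other = Table("other", meta, Column("id", Integer))
meta.create_all(engine)

with engine.begin() as conn:
    conn.execute(main.insert(), [{"id": 1}, {"id": 2}])
    conn.execute(other.insert(), [{"id": 1}])

# Correlated NOT EXISTS: keep rows of `main` with no matching id in `other`.
subq = select(other.c.id).where(other.c.id == main.c.id).exists()
query = select(main.c.id).where(~subq)

with engine.connect() as conn:
    result = [row[0] for row in conn.execute(query)]
```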

1 Answer


I ended up avoiding the problem by achieving my objective with pandas instead of SQLAlchemy.

For anyone having the same problem, these are the steps:

  • Query both tables into a pandas dataframe

  • Drop the rows from the main dataset that are present in the dataset of IDs to exclude, like so:

    main_df = main_df[~main_df.id.isin(exclusion_df.id)]

That's it.
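A minimal self-contained demonstration of the steps above, with two small frames standing in for the two query results:

```python
import pandas as pd

# Stand-ins for the main query result and the ids to exclude.
main_df = pd.DataFrame({"id": [1, 2, 3, 4], "value": ["a", "b", "c", "d"]})
exclusion_df = pd.DataFrame({"id": [2, 4]})

# Drop every row of main_df whose id appears in exclusion_df.
main_df = main_df[~main_df.id.isin(exclusion_df.id)]
```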