
I have a table with over 10 million rows in Dremio. I have connected to it from Python using pyodbc. I want to run a simple query like the one shown below:

SELECT REPORTDATE, TRANSDATE 
FROM TABLE 
WHERE TRANSDATE = '2020-01-05'

The issue is that it takes forever to run this query via Python. What would be the solution for this?

ravindu93
  • Make sure there's an index on the `TRANSDATE` column. Otherwise it has to read every row. – Barmar May 01 '23 at 20:08
  • @Barmar thanks a lot for your reply. Can you tell me how to do that? Does this mean that I have to add a new index column into the table? – ravindu93 May 01 '23 at 20:14
  • I don't know Dremio, but in normal SQL it's `ALTER TABLE tablename ADD INDEX (transdate)` – Barmar May 01 '23 at 20:15
  • @Barmar But to do that the field "TRANSDATE" should be unique, right? – ravindu93 May 01 '23 at 20:17
  • No. If you want to declare it unique you use `ADD UNIQUE INDEX` – Barmar May 01 '23 at 20:18
  • Perfect! If possible, can you explain to me what exactly happens when I add this new index? Thank you very much for your time. – ravindu93 May 01 '23 at 20:21
  • Any database textbook or tutorial should explain what indexes do. This is a fundamental DBMS concept. – Barmar May 01 '23 at 20:24

2 Answers


I would recommend using SQLAlchemy or pandas to make the call.

Personally, I use pandas (the example below uses cx_Oracle, since we use Oracle servers):

import cx_Oracle
import pandas as pd

qry = '''
SELECT REPORTDATE, TRANSDATE
FROM TABLE
WHERE TRANSDATE = '2020-01-05'
'''

# Build the DSN and open the connection
dsn_tns = cx_Oracle.makedsn('host-name', 'port', service_name='database')
conn = cx_Oracle.connect(user=user_name, password=pwd, dsn=dsn_tns)

# pandas executes the query and returns the result as a DataFrame
results = pd.read_sql(qry, conn)
conn.close()

If it still runs too slowly, you can pull the results in chunks using the chunksize parameter of read_sql:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_sql.html
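As a runnable sketch of the chunked pattern (using an in-memory SQLite table as a stand-in, since the connection details above are placeholders; with a real server you would pass your cx_Oracle or pyodbc connection instead):

```python
import sqlite3
import pandas as pd

# Hypothetical stand-in table so the example is self-contained
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trans (REPORTDATE TEXT, TRANSDATE TEXT)")
conn.executemany(
    "INSERT INTO trans VALUES (?, ?)",
    [("2020-01-0%d" % (i % 9 + 1), "2020-01-05") for i in range(10)],
)

qry = "SELECT REPORTDATE, TRANSDATE FROM trans WHERE TRANSDATE = '2020-01-05'"

# With chunksize set, read_sql returns an iterator of DataFrames,
# so the full result set is never materialized twice in memory
chunks = [chunk for chunk in pd.read_sql(qry, conn, chunksize=4)]
results = pd.concat(chunks, ignore_index=True)
conn.close()
```

Each chunk can also be processed and discarded inside the loop, which keeps peak memory at one chunk rather than the whole result set.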

Abe

Use Oracle's python-oracledb driver instead of pyodbc and increase the arraysize value; see https://python-oracledb.readthedocs.io/en/latest/user_guide/tuning.html#tuning-fetch-performance

cur = connection.cursor()

# Fetch more rows per network round trip (the default is 100)
cur.arraysize = 5000

for row in cur.execute("SELECT * FROM very_big_table"):
    print(row)

Unless you actually want a pandas DataFrame, using the driver's native calls will be fastest.

The 'Thin' mode of python-oracledb (which is the default) is likely to be faster than the 'Thick' mode.
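The same batched-fetch idea can be written explicitly with fetchmany, which pulls arraysize rows per call. A minimal sketch using the standard-library sqlite3 driver as a stand-in (python-oracledb exposes the same DB-API cursor interface; the table and row counts here are made up for the example):

```python
import sqlite3

# In-memory stand-in table so the DB-API pattern is runnable;
# with python-oracledb you would connect to the real server instead
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE very_big_table (id INTEGER)")
conn.executemany("INSERT INTO very_big_table VALUES (?)", [(i,) for i in range(12)])

cur = conn.cursor()
cur.arraysize = 5  # how many rows fetchmany pulls per call by default

cur.execute("SELECT id FROM very_big_table")
total = 0
while True:
    rows = cur.fetchmany()  # returns up to cur.arraysize rows
    if not rows:
        break
    total += len(rows)

conn.close()
```

Processing each batch as it arrives bounds memory use and lets the driver overlap fetching with your processing.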

Christopher Jones