
I have a table with over 10 million rows in Dremio. I have connected to it from Python using pyodbc. I want to run a simple query like the one shown below:

SELECT REPORTDATE, TRANSDATE 
FROM TABLE 
WHERE TRANSDATE = '2020-01-05'

The issue is that it takes forever to run this query via Python. What would be the solution for this?

ravindu93
  • Make sure there's an index on the `TRANSDATE` column. Otherwise it has to read every row. – Barmar May 01 '23 at 20:08
  • @Barmar thanks a lot for your reply. Can you tell me how to do that? Does this mean that I have to add a new index column into the table? – ravindu93 May 01 '23 at 20:14
  • I don't know Dremio, but in normal SQL it's `ALTER TABLE tablename ADD INDEX (transdate)` – Barmar May 01 '23 at 20:15
  • @Barmar But to do that the field "TRANSDATE" should be unique, right? – ravindu93 May 01 '23 at 20:17
  • No. If you want to declare it unique you use `ADD UNIQUE INDEX` – Barmar May 01 '23 at 20:18
  • Perfect! If possible, can you explain to me what exactly happens when I add this new index? Thank you very much for your time. – ravindu93 May 01 '23 at 20:21
  • Any database textbook or tutorial should explain what indexes do. This is a fundamental DBMS concept. – Barmar May 01 '23 at 20:24

2 Answers


I would recommend using SQLAlchemy or pandas to make the call.

Personally, I use pandas (the example below uses cx_Oracle, since we use Oracle servers):

import cx_Oracle
import pandas as pd

qry = '''
SELECT REPORTDATE, TRANSDATE
FROM TABLE
WHERE TRANSDATE = '2020-01-05'
'''

# Build the DSN and open the connection
dsn_tns = cx_Oracle.makedsn('host-name', 'port', service_name='database')
conn = cx_Oracle.connect(user=user_name, password=pwd, dsn=dsn_tns)

# pandas executes the query and returns the result as a DataFrame
results = pd.read_sql(qry, conn)
conn.close()

If it still runs too slowly, you can pull the results in chunks using the chunksize parameter of read_sql:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_sql.html
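As a runnable sketch of the chunked pattern (using an in-memory SQLite table as a stand-in, since the connection details above are placeholders; with a real server you would pass your cx_Oracle or pyodbc connection instead):

```python
import sqlite3
import pandas as pd

# Hypothetical stand-in table so the example is self-contained
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trans (REPORTDATE TEXT, TRANSDATE TEXT)")
conn.executemany(
    "INSERT INTO trans VALUES (?, ?)",
    [("2020-01-0%d" % (i % 9 + 1), "2020-01-05") for i in range(10)],
)

qry = "SELECT REPORTDATE, TRANSDATE FROM trans WHERE TRANSDATE = '2020-01-05'"

# With chunksize set, read_sql returns an iterator of DataFrames,
# so the full result set is never materialized twice in memory
chunks = [chunk for chunk in pd.read_sql(qry, conn, chunksize=4)]
results = pd.concat(chunks, ignore_index=True)
conn.close()
```

Each chunk can also be processed and discarded inside the loop, which keeps peak memory at one chunk rather than the whole result set.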

Abe

Use Oracle's python-oracledb driver instead of pyodbc and increase the arraysize value; see https://python-oracledb.readthedocs.io/en/latest/user_guide/tuning.html#tuning-fetch-performance

cur = connection.cursor()

# Fetch more rows per network round trip (the default is 100)
cur.arraysize = 5000

for row in cur.execute("SELECT * FROM very_big_table"):
    print(row)

Unless you actually want a pandas DataFrame, using the driver's native calls will be fastest.

The 'Thin' mode of python-oracledb (which is the default) is likely to be faster than the 'Thick' mode.
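The same batched-fetch idea can be written explicitly with fetchmany, which pulls arraysize rows per call. A minimal sketch using the standard-library sqlite3 driver as a stand-in (python-oracledb exposes the same DB-API cursor interface; the table and row counts here are made up for the example):

```python
import sqlite3

# In-memory stand-in table so the DB-API pattern is runnable;
# with python-oracledb you would connect to the real server instead
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE very_big_table (id INTEGER)")
conn.executemany("INSERT INTO very_big_table VALUES (?)", [(i,) for i in range(12)])

cur = conn.cursor()
cur.arraysize = 5  # how many rows fetchmany pulls per call by default

cur.execute("SELECT id FROM very_big_table")
total = 0
while True:
    rows = cur.fetchmany()  # returns up to cur.arraysize rows
    if not rows:
        break
    total += len(rows)

conn.close()
```

Processing each batch as it arrives bounds memory use and lets the driver overlap fetching with your processing.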

Christopher Jones