
How can one read a Teradata SQL query into a tempfile? The goal is to improve performance when ingesting data from a SQL query into a pandas DataFrame.

On https://towardsdatascience.com/optimizing-pandas-read-sql-for-postgres-f31cd7f707ab Tristan Crockett shows how this is done for Postgres:

import tempfile

import pandas

def read_sql_tmpfile(query, db_engine):
    with tempfile.TemporaryFile() as tmpfile:
        # COPY ... TO STDOUT is Postgres-specific; it streams the result set as CSV
        copy_sql = "COPY ({query}) TO STDOUT WITH CSV {head}".format(
            query=query, head="HEADER"
        )
        conn = db_engine.raw_connection()
        cur = conn.cursor()
        # copy_expert is a psycopg2 cursor method that writes the COPY output to the file object
        cur.copy_expert(copy_sql, tmpfile)
        tmpfile.seek(0)
        df = pandas.read_csv(tmpfile)
        return df

I couldn't figure out how to rewrite this code to make it work with a Teradata server.

Felix313
  • Both the SQL `COPY` syntax and `cursor.copy_expert` method are postgres-specific. Are you using the `teradatasql` driver for Teradata? – Fred Aug 17 '21 at 17:27
  • Hi Fred, thank you for your reply. So far I've tried using teradata, teradatasql and pyodbc to connect to the server. I'm looking for a similar solution to COPY and copy_expert on Teradata. – Felix313 Aug 18 '21 at 07:32
  • There is no direct equivalent. There is `TD_SYSFNLIB.CSV` table function that can be used to have the database convert to CSV, but no cursor method to directly write that result set to a file. You would need to use one of the fetch methods to bring data into memory and then write it to CSV. I have no idea if that would be faster or slower overall. – Fred Aug 18 '21 at 16:28
  • Ok, thank you so much for the info Fred! May I ask if you have another suggestion to improve the fetching process of large queries? The SQL server is Teradata Vantage. – Felix313 Aug 20 '21 at 11:26
  • If the slowness is in fetching millions of rows from the database, then current `teradatasql` driver allows you to add `{fn teradata_require_fastexport}` escape in front of the SELECT statement. If the slowness is in type conversions for pandas, though (as in the article you linked) maybe the intermediate CSV file approach is worth trying. – Fred Aug 20 '21 at 14:56
  • Thank you again for the help @Fred. I tested `read_sql()` against `fetchall()` to check where the slowness appears. Selecting the top 100k rows from a table with 57 columns took 29.6s vs 26.9s respectively; just fetching using `{fn teradata_require_fastexport}` took 38.1s. Testing with 1 million rows resulted in 5min 28s vs 4min 47s respectively, while just fetching using `{fn teradata_require_fastexport}` took 4min 8s. – Felix313 Aug 23 '21 at 18:33
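
For reference, here is a minimal sketch of the fetch-and-write-to-CSV approach Fred describes in the comments, assuming the `teradatasql` driver and a plain DB-API cursor. The helper name `read_teradata_tmpfile`, the connection parameters, and the decision to prepend the `{fn teradata_require_fastexport}` escape are assumptions for illustration, not a tested solution:

    import csv
    import tempfile

    import pandas
    import teradatasql  # assumed driver (pip install teradatasql)


    def read_teradata_tmpfile(query, host, user, password, use_fastexport=False):
        # Hypothetical helper: fetch the result set with the teradatasql driver,
        # spool it to a temporary CSV file, and let pandas parse that file.
        if use_fastexport:
            # Escape Fred mentions for speeding up large SELECTs; whether it helps
            # depends on the query shape (assumption).
            query = "{fn teradata_require_fastexport}" + query
        with teradatasql.connect(host=host, user=user, password=password) as conn:
            with conn.cursor() as cur:
                cur.execute(query)
                with tempfile.TemporaryFile(mode="w+", newline="") as tmpfile:
                    writer = csv.writer(tmpfile)
                    # Header row from the cursor metadata, then all fetched rows
                    writer.writerow(col[0] for col in cur.description)
                    writer.writerows(cur.fetchall())
                    tmpfile.seek(0)
                    return pandas.read_csv(tmpfile)

As Fred notes, whether the intermediate CSV file is actually faster than fetching directly depends on where the time goes (network fetch vs. pandas type conversion), so it would need benchmarking like the timings in the last comment.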

0 Answers