I'm trying to optimize a read/write task for a client using a Python script. They need the data in a CSV file even though I suggested Parquet. My code reads from a SQL Server database into a DataFrame and then writes that DataFrame to a CSV at a specified path. I've tried several packages (vaex, dask, pandas) hoping their columnar backends would make the CSV write faster, but I haven't noticed a significant performance improvement; the vaex version is shown below.

For context, the dataset is about 200 MB: roughly 4 million rows and 10 columns. Reading takes about a minute and a half, while writing takes anywhere from 30 minutes to an hour. Is there any way I can optimize writing to CSV?
import pandas as pd
import pyodbc
import vaex
conn_str = (
    r"DRIVER={SQL Server};"
    r"SERVER=1SQL01\D1PRD;"
    r"DATABASE=DB1;"
    r"Trusted_Connection=yes;"
)
# create pyodbc connection
conn = pyodbc.connect(conn_str)
query = '''
select ID, RiskResultAssetID, PaymentDate, Balance, Interest, Principal,
       sum(Interest + Principal) as Cashflow,
       isnull(WriteDown, 0) as Writedown,
       isnull(Liquidation, 0) as Liquidation,
       isnull(IntShortFall, 0) as IntShortFall
from nier..RiskCashflows_BaseCase
where RiskResultAssetID in (select distinct id from nier..riskresultasset where riskresultsetid = 1167)
group by ID, RiskResultAssetID, PaymentDate, Balance, Interest, Principal, WriteDown, Liquidation, IntShortFall
'''
# Load the data into a pandas DataFrame
df_pandas = pd.read_sql(query, conn)
# Close the database connection
conn.close()
# Convert the pandas DataFrame to a vaex DataFrame
df_vaex = vaex.from_pandas(df_pandas)
# Export the DataFrame to a CSV file
df_vaex.export_csv('basecaseCF.csv')
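
For reference, the plain pandas and dask attempts were along these lines (a rough sketch; the output file names and the npartitions value are illustrative, not the exact settings from my runs):

# Plain pandas: write the same DataFrame straight to CSV
df_pandas.to_csv('basecaseCF_pandas.csv', index=False)

# Dask: partition the pandas DataFrame and write the partitions out
import dask.dataframe as dd
df_dask = dd.from_pandas(df_pandas, npartitions=8)  # npartitions=8 is just an example value
df_dask.to_csv('basecaseCF_dask-*.csv', index=False)  # writes one CSV file per partition

Both finished in roughly the same ballpark as the vaex export, which is why I suspect the bottleneck isn't the library I'm using.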