I'm trying to optimize a read/write task for a client using a Python script. They need the data in a CSV file even though I suggested Parquet. My code reads from a SQL Server database into a DataFrame and then writes that DataFrame to a CSV at a specified path. I've tried several packages (vaex, dask, pandas) hoping their columnar backends would make the CSV write faster, but I haven't noticed a significant performance improvement; the vaex version is shown below.

For context, the dataset is about 200 MB: roughly 4 million rows and 10 columns. Reading takes about a minute and a half, while writing takes anywhere from 30 minutes to an hour. Is there any way I can optimize writing to CSV?
import pandas as pd
import pyodbc
import vaex
conn_str = (
    r"DRIVER={SQL Server};"
    r"SERVER=1SQL01\D1PRD;"
    r"DATABASE=DB1;"
    r"Trusted_Connection=yes;"
)
# create pyodbc connection
conn = pyodbc.connect(conn_str)
query = '''
select ID, RiskResultAssetID, PaymentDate, Balance, Interest, Principal,
       sum(Interest + Principal) as Cashflow,
       isnull(WriteDown, 0) as Writedown,
       isnull(Liquidation, 0) as Liquidation,
       isnull(IntShortFall, 0) as IntShortFall
from nier..RiskCashflows_BaseCase
where RiskResultAssetID in (select distinct id from nier..riskresultasset where riskresultsetid = 1167)
group by ID, RiskResultAssetID, PaymentDate, Balance, Interest, Principal, WriteDown, Liquidation, IntShortFall
'''
# Load the data into a pandas DataFrame
df_pandas = pd.read_sql(query, conn)
# Close the database connection
conn.close()
# Convert the pandas DataFrame to a vaex DataFrame
df_vaex = vaex.from_pandas(df_pandas)
# Export the DataFrame to a CSV file
df_vaex.export_csv('basecaseCF.csv')
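
For reference, the plain pandas and dask attempts were along these lines (a rough sketch; the output file names and the npartitions value are illustrative, not the exact settings from my runs):

# Plain pandas: write the same DataFrame straight to CSV
df_pandas.to_csv('basecaseCF_pandas.csv', index=False)

# Dask: partition the pandas DataFrame and write the partitions out
import dask.dataframe as dd
df_dask = dd.from_pandas(df_pandas, npartitions=8)  # npartitions=8 is just an example value
df_dask.to_csv('basecaseCF_dask-*.csv', index=False)  # writes one CSV file per partition

Both finished in roughly the same ballpark as the vaex export, which is why I suspect the bottleneck isn't the library I'm using.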