
I have a requirement to extract data from SQL Server and create .csv files from numerous tables. I wrote a Python script for this that uses a pyodbc/turbodbc connection with the SQL Server ODBC driver. It works fine sometimes, but it disconnects when it hits a large table (over 11M rows), and performance-wise it is very slow. I tried FreeTDS, but it performs about the same as pyodbc.

This is my connection:

pyodbc.connect(Driver='/opt/microsoft/msodbcsql17/lib64/libmsodbcsql-17.5.so.2.1',server=systemname,UID=user_name,PWD=pwd)
import logging

import pandas as pd
import pyodbc

def connect_to_SQL_Server(logins):
    '''Connect to SQL Server.

    Returns a connection object, or None on failure.
    '''
    try:
        hostname = logins['hostname']
        username = logins['sql_username']
        password = logins['snow_password']

        # Also tried: turbodbc with the same driver, FreeTDS
        # (/usr/lib64/libtdsodbc.so, TDS_Version=8.0) with both pyodbc and
        # turbodbc, ODBC tracing enabled, and a DSN-based connection
        # (DSN='MSSQLDEV') -- all performed about the same.
        con = pyodbc.connect(Driver='/opt/microsoft/msodbcsql17/lib64/libmsodbcsql-17.5.so.2.1',
                             server=hostname, UID=username, PWD=password)
        return con

    except pyodbc.Error as error:
        logging.critical(error)
        return None

sqlCon = connect_to_SQL_Server(logins)
sql = 'select * from table'
i = 0
for partial_df in pd.read_sql(sql, sqlCon, chunksize=300000):
    # Each 300k-row chunk goes to its own numbered gzipped file.
    partial_df.to_csv(filenamewithpath + '_' + str(i) + '.csv.gz',
                      compression='gzip', index=False, sep='\x01',
                      header=False, mode='a+')
    i += 1
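One thing worth trying (a sketch only, not tested against this setup; `dump_cursor_to_csv`, `path_prefix`, and `chunk_size` are illustrative names) is to skip the pandas layer entirely and stream rows from the raw DB-API cursor with `fetchmany`, writing each batch straight to a gzipped CSV. This avoids building a DataFrame per chunk, which can cut both memory use and per-chunk overhead:

```python
import csv
import gzip

def dump_cursor_to_csv(cursor, path_prefix, chunk_size=300_000, sep='\x01'):
    """Stream rows from an executed DB-API cursor (e.g. pyodbc) into
    numbered gzipped CSV part files. Returns the list of files written."""
    files = []
    part = 0
    while True:
        rows = cursor.fetchmany(chunk_size)
        if not rows:
            break
        path = f'{path_prefix}_{part}.csv.gz'
        # 'wt' opens the gzip stream in text mode for the csv writer.
        with gzip.open(path, 'wt', newline='') as fh:
            writer = csv.writer(fh, delimiter=sep)
            writer.writerows(rows)
        files.append(path)
        part += 1
    return files
```

Usage would be `cursor.execute(sql)` followed by `dump_cursor_to_csv(cursor, filenamewithpath)`. Whether this is faster than `pd.read_sql` depends on the driver and row width, so it's worth benchmarking on one of the smaller tables first.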

Are there any parameters I can try for performance improvement? Just to let you know, these Python scripts run on a different server than the one hosting SQL Server, and that server is a Linux cloud instance.

  • Can you share some code that reproduces the failure and quantify the slow performance? – David Browne - Microsoft Jul 22 '20 at 22:41
  • 1
    If you want speed why aren't you using bcp.exe? – AlwaysLearning Jul 22 '20 at 22:51
  • Thanks for the responses, I have updated the post with code snippets. – Data Engineer Jul 22 '20 at 23:13
  • Can bcp.exe generate multiple files with a certain number of rows each? Is it better than pyodbc? I saw a few issues with it when I tried it a couple of years back. – Data Engineer Jul 22 '20 at 23:24
  • @DataEngineer - why not use BCP to export the data, then use Python to chunk the file up? – gvee Jul 26 '20 at 10:59
  • Yes, I tried, but I wasn't successful. Code:
        import bcp
        con = bcp.Connection(driver='ODBC Driver 18 for SQL Server', host='hostname', username='username', password='password')
        my_bcp = bcp.BCP(con)
        file = bcp.DataFile(file_path='/storage/folder', delimiter=',')
        my_bcp.dump(query='select * from tablename', output_file=file)
    – Data Engineer Jul 28 '20 at 11:56
  • Error:
        Traceback (most recent call last):
          File "bcp_test.py", line 4, in <module>
            con = bcp.Connection(driver='ODBC Driver 18 for SQL Server', host='hostname', username='username', password='password')
          File "/home/ec2-user/.local/lib/python3.7/site-packages/bcp/connections.py", line 81, in __init__
            self.driver = driver
          File "/home/ec2-user/.local/lib/python3.7/site-packages/bcp/connections.py", line 90, in driver
            raise DriverNotSupportedException
        bcp.exceptions.DriverNotSupportedException
    – Data Engineer Jul 28 '20 at 11:59
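Since the Python `bcp` wrapper rejected the driver, the native `bcp` command-line utility (shipped with the `mssql-tools` package on Linux) is an alternative worth considering. A sketch of the invocation; the server name, credentials, table, and output path are placeholders, and it needs a live SQL Server to run against:

```shell
# -c: character mode, -t: field terminator, -S/-U/-P: server and credentials
bcp "select * from tablename" queryout /storage/folder/tablename.csv \
    -S hostname -U username -P password -c -t,
```

`bcp` writes one file per invocation, so splitting into fixed-size parts would still need a follow-up step (e.g. `split -l`), as suggested in the comments above.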

0 Answers