
I have a dask dataframe with 220 partitions and 7 columns. I imported it from a bcp file and completed some wrangling in dask. I then want to write the whole dataframe to MSSQL using turbodbc. I connect to the DB as follows:

mydb = 'TEST'

from turbodbc import connect, make_options

connection = connect(driver="ODBC Driver 17 for SQL Server",
                     server="TEST SERVER",
                     port="1433",
                     database=mydb,
                     uid="sa",
                     pwd="5pITfir3")

I then use a function I found in a Medium article to write to a test table in the DB:

import time

def turbo_write(mydb, df, table):
    """Use turbodbc to insert data into sql."""
    start = time.time()

    # preparing columns
    columns = '('
    columns += ', '.join(df.columns)
    columns += ')'

    # preparing value place holders
    val_place_holder = ['?' for col in df.columns]
    sql_val = '('
    sql_val += ', '.join(val_place_holder)
    sql_val += ')'

    # writing sql query for turbodbc
    sql = f"""
    INSERT INTO {mydb}.dbo.{table} {columns}
    VALUES {sql_val}
    """
    print(sql)
    print(sql_val)

    # writing array of values for turbodbc
    values_df = [df[col].values for col in df.columns]
    print(values_df)

    # cleans the previous head insert
    with connection.cursor() as cursor:
        cursor.execute(f"delete from {mydb}.dbo.{table}")
        connection.commit()

    # inserts data, for real
    with connection.cursor() as cursor:
        # try:
        cursor.executemanycolumns(sql, values_df)
        connection.commit()
        # except Exception:
        #     connection.rollback()
        #     print('something went wrong')

    stop = time.time() - start
    print(f'finished in {stop} seconds')

This works when I upload a small number of rows, as follows:

turbo_write(mydb, df_train.head(1000), table)

When I try a larger number of rows, it fails:

turbo_write(mydb, df_train.head(10000), table)

I get the error:

RuntimeError: Unable to cast Python instance to C++ type (compile in debug mode for details)
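
One way to check the likely cause (this is only a diagnostic sketch and an assumption: that the failure comes from missing values which first appear beyond the first 1000 rows, which matches the masked-array fix in the answer below) is to inspect the larger slice:

# diagnostic sketch: see whether the 10000-row slice is the first to contain NULLs,
# since turbodbc cannot cast NaN/None held in object columns
chunk = df_train.head(10000)
print(chunk.dtypes)
print(chunk.isnull().sum())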

How do I write the whole dask dataframe to MSSQL without any errors?

1 Answer


I needed to convert the values to a masked array, so that missing values are passed to turbodbc as NULLs rather than failing the cast, by changing:

# writing array of values for turbodbc
values_df = [df[col].values for col in df.columns]

to

import numpy as np
import pandas as pd

values_df = [np.ma.MaskedArray(df[col].values, pd.isnull(df[col].values))
             for col in df.columns]

I can then write all the data using:

for i in range(df_train.npartitions):
    partition = df_train.get_partition(i)
    turbo_write(mydb, partition, table)

This still takes a long time to write compared to saving the file and loading it into the DB with BCP. If anyone has more efficient suggestions, I would love to see them.
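
One possible speed-up, sketched here as an untested assumption rather than a confirmed solution: the loop above pushes partitions one at a time through a single connection, so each partition could instead be inserted from a dask worker that opens its own turbodbc connection, letting the inserts run in parallel. insert_partition below is a hypothetical helper that reuses the query and masked-array logic from turbo_write and the driver/server settings from the question:

import dask
import numpy as np
import pandas as pd
from turbodbc import connect

def insert_partition(pdf, mydb, table):
    # hypothetical helper: open a fresh connection per partition, because
    # connection objects cannot be shipped to dask workers
    conn = connect(driver="ODBC Driver 17 for SQL Server",
                   server="TEST SERVER",
                   port="1433",
                   database=mydb,
                   uid="sa",
                   pwd="5pITfir3")
    columns = '(' + ', '.join(pdf.columns) + ')'
    sql_val = '(' + ', '.join('?' for _ in pdf.columns) + ')'
    sql = f"INSERT INTO {mydb}.dbo.{table} {columns} VALUES {sql_val}"
    values = [np.ma.MaskedArray(pdf[col].values, pd.isnull(pdf[col].values))
              for col in pdf.columns]
    with conn.cursor() as cursor:
        cursor.executemanycolumns(sql, values)
        conn.commit()
    return len(pdf)

# one delayed pandas frame per partition, inserted in parallel
parts = df_train.to_delayed()
tasks = [dask.delayed(insert_partition)(part, mydb, table) for part in parts]
dask.compute(*tasks)

With this layout the table would still need to be cleared once up front (the "delete from" step in turbo_write) rather than inside each partition's insert.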
