1

Context: I am using MSSQL, pandas, and pyodbc.

Steps:

  • Obtain dataframe from query using pyodbc (no problemo)
  • Process columns to generate the context of a new (but already existing) column
  • Fill an auxilliary column with UPDATE statements (i.e. UPDATE t SET t.value = df.value FROM dbo.table t where t.ID = df.ID)

Now how do I execute the sql code in the auxilliary column, without looping through each row?

sample data

The first two columns are obtained by querying dbo.table, the third columns exists but is empty in the database. The fourth column only exists in the dataframe to prepare the SQL statement that would correspond to updating dbo.table

ID raw processed strSQL
1 lorum.ipsum@test.com lorum ipsum UPDATE t SET t.processed = 'lorum ipsum' FROM dbo.table t WHERE t.ID = 1
2 rumlo.sumip@test.com rumlo sumip UPDATE t SET t.processed = 'rumlo sumip' FROM dbo.table t WHERE t.ID = 2
3 ... ... ...

I would like to execute the SQL script in each row in an efficient manner.

Ayam
  • 169
  • 1
  • 12
  • 1
    can you porvide sample data. your question will be easy to understand – Vivs May 16 '21 at 14:29
  • 1
    Looping through the DataFrame rows and issuing individual UPDATE statements for each table row will be very inefficient. It would be much faster to dump the two columns to a list of tuples as `[(processed, id),(processed, id),…]` (note the order of the columns) and then use `crsr.executemany("UPDATE t SET t.processed=? WHERE t.ID=?", my_list_of_tuples)`. Be sure to call `crsr.fast_executemany = True` first for best performance. – Gord Thompson May 16 '21 at 14:58
  • For even better performance, use a Table valued Parameter, or create a temporary table and Bulk Copy into it. Then do a single joined update – Charlieface May 16 '21 at 15:20
  • @Charlieface I'm not quite sure what you mean. – Ayam May 16 '21 at 15:42
  • See https://stackoverflow.com/questions/51930062/python-call-sql-server-stored-procedure-with-table-valued-parameter for example. A TVP is basically a readonly table variable that you can use to do a joined update. There is no need to loop anything, just a single `update t ... from yourtable t join @yourTVP p on ...` – Charlieface May 16 '21 at 15:44
  • This feels like an [XY Problem](http://XYproblem.info). Why are you stopped ING SQL statements in a table to be run? Is the process your using to insert that data got a extremely good parses on it to stop the insertion of malicious code? – Thom A May 16 '21 at 15:44
  • @Larnu I am indeed asking about help with a possible solution to another problem. While your sentences confuse me, I think you are talking about inserting user input. However it's not the case. my problem consists out of processing a large table (>10m rows) in a way that is not possible with only SQL. That's why I'm using python, a language that i'm comfortable with. The auxiliary column is only a possible solution. Using execute many as Gord Thompson mentioned is also possibile. Do you have any other plausible solutions? – Ayam May 16 '21 at 16:04

1 Answers1

2

After I recommended .executemany() in a comment to the question, a subsequent comment from @Charlieface suggested that a table-valued parameter (TVP) would provide even better performance. I didn't think it would make that much difference, but I was wrong.

For an existing table named MillionRows

ID  TextField
--  ---------
 1  foo
 2  bar
 3  baz
…

and example data of the form

num_rows = 1_000_000
rows = [(f"text{x:06}", x + 1) for x in range(num_rows)]
print(rows)
# [('text000000', 1), ('text000001', 2), ('text000002', 3), …]

my test using a standard executemany() call with cnxn.autocommit = False and crsr.fast_executemany = True

crsr.executemany("UPDATE MillionRows SET TextField = ? WHERE ID = ?", rows)

took about 180 seconds (3 minutes).

However, by creating a user-defined table type

CREATE TYPE dbo.TextField_ID AS TABLE 
(
    TextField nvarchar(255) NULL, 
    ID int NOT NULL, 
    PRIMARY KEY (ID)
)

and a stored procedure

CREATE PROCEDURE [dbo].[mr_update]
@tbl dbo.TextField_ID READONLY
AS
BEGIN
    SET NOCOUNT ON;
    UPDATE MillionRows SET TextField = t.TextField
    FROM MillionRows mr INNER JOIN @tbl t ON mr.ID = t.ID
END

when I used

crsr.execute("{CALL mr_update (?)}", (rows,))

it did the same update in approximately 80 seconds (less than half the time).

Gord Thompson
  • 116,920
  • 32
  • 215
  • 418