Parallelizing pandas pyodbc SQL database calls

Question

I am currently querying data into dataframe via the pandas.io.sql.read_sql() command. I wanted to parallelize the calls similar to what this guys is advocating: (Embarrassingly parallel database calls with Python (PyData Paris 2015 ))

Something like (very general):

pools = [ThreadedConnectionPool(1,20,dsn=d) for d in dsns]
connections = [pool.getconn() for pool in pools]
parallel_connection = ParallelConnection(connections)
pandas_cursor = parallel_connection.cursor()
pandas_cursor.execute(my_query)

Is something like that possible?

what is your SQL database type and driver, and do they support multi-threaded calls? — denfromufa, Aug 29 '15 at 19:19
not sure about pyodbc, but since 2013 pymssql seems to be thread-safe for multi-threading: http://pymssql.org/en/latest/changelog.html?highlight=threading — denfromufa, Aug 30 '15 at 21:39

Tristan Reid · Answer 1 · 2015-09-10T16:47:57.437

2

Yes, this should work, although with the caveat that you'll need to change parallel_connection.py in that talk that you site. In that code there's a fetchall function which executes each of the cursors in parallel, then combines the results. This is the core of what you'll change:

Old Code:

def fetchall(self):
    results = [None] * len(self.cursors)
    def do_work(index, cursor):
        results[index] = cursor.fetchall()
    self._do_parallel(do_work)
    return list(chain(*[rs for rs in results]))

New Code:

def fetchall(self):
    results = [None] * len(self.sql_connections)
    def do_work(index, sql_connection):
        sql, conn = sql_connection  #  Store tuple of sql/conn instead of cursor
        results[index] = pd.read_sql(sql, conn)
    self._do_parallel(do_work)
    return pd.DataFrame().append([rs for rs in results])

Repo: https://github.com/godatadriven/ParallelConnection

edited Sep 10 '15 at 16:47

answered Sep 10 '15 at 16:41

Tristan Reid

5,844
2
26
31

is it possible to show a example on how you actually pass a query `sql, conn = sql_connection` basically we need to pass tuple of sql and connection?. – add-semi-colons Apr 22 '19 at 19:49
1

It's been a few years, so I don't fully remember the context - but it looks like from the linked code that you would pass in an array of `(sql, conn)` tuples to the constructor of ParallelConnection. Something like `ParallelConnection([(sql1, con1), (sql2, con2)])` – Tristan Reid Apr 23 '19 at 16:10
so no need to call `execute()` while passing a query string? just like in the question... – add-semi-colons Apr 25 '19 at 01:41
1

In the above example, I used `fetchall` instead of `execute`, but you can do the same thing with `execute`. After initializing the ParallelConnection with the array of tuples, call either `execute` or `fetchall` and the `_do_parallel` function handles passing out the work to the individual connections/queries. – Tristan Reid Apr 25 '19 at 18:07

Parallelizing pandas pyodbc SQL database calls

1 Answers1

Linked