
How does one run SQL queries with different column dimensions in parallel using dask? Below was my attempt:

import cx_Oracle
import pandas as pd
import dask
from dask.delayed import delayed
from dask.diagnostics import ProgressBar
ProgressBar().register()

con = cx_Oracle.connect(user="BLAH",password="BLAH",dsn = "BLAH")

@delayed
def loadsql(sql):
    return pd.read_sql_query(sql,con)

results = [loadsql(x) for x in sql_to_run] 

dask.compute(results)

df1=results[0]
df2=results[1]
df3=results[2]
df4=results[3]
df5=results[4]
df6=results[5]

However this results in the following error being thrown:

DatabaseError: Execution failed on sql: "SQL QUERY" ORA-01013: user requested cancel of current operation unable to rollback

and then shortly thereafter another error comes up:

MultipleInstanceError: Multiple incompatible subclass instances of TerminalInteractiveShell are being created.

sql_to_run is a list of different SQL queries.

Any suggestions or pointers? Thanks!


Update 9.7.18

I think this was more a case of me not reading the documentation closely enough. Indeed, creating con outside the loadsql function was causing the problem. Below is the code change that now seems to work as intended.

import cx_Oracle
import pandas as pd
import dask
from dask import delayed

def loadsql(sql):
    # Each call opens (and closes) its own connection
    con = cx_Oracle.connect(user="BLAH", password="BLAH", dsn="BLAH")
    result = pd.read_sql_query(sql, con)
    con.close()
    return result

values = [delayed(loadsql)(x) for x in sql_to_run]

# Multiprocessing version
results = dask.compute(*values, scheduler='processes')
# My sample queries took 56.2 seconds

# Multithreaded version
results = dask.compute(*values, scheduler='threads')
# My sample queries took 51.5 seconds
Nick

1 Answer


My guess is that the Oracle client is not thread-safe. You could try running with processes instead (using the multiprocessing scheduler, or the distributed one), if the conn object serialises - this may be unlikely. More likely to work would be to create the connection within loadsql, so it gets remade for each call, and the different connections hopefully don't interfere with one another.
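A minimal, self-contained sketch of the per-call-connection idea, using sqlite3 and a stdlib thread pool in place of cx_Oracle and dask so it runs anywhere (the table, file path, and queries here are invented for illustration):

```python
import os
import sqlite3
import tempfile
from concurrent.futures import ThreadPoolExecutor

# Hypothetical demo database file
DB = os.path.join(tempfile.gettempdir(), "per_call_conn_demo.db")

def setup():
    con = sqlite3.connect(DB)
    con.execute("CREATE TABLE IF NOT EXISTS t (x INTEGER)")
    con.execute("DELETE FROM t")
    con.executemany("INSERT INTO t VALUES (?)", [(i,) for i in range(10)])
    con.commit()
    con.close()

def loadsql(sql):
    # Open a fresh connection inside the worker: nothing is shared
    # between threads, which is the point of the fix above.
    con = sqlite3.connect(DB)
    try:
        return con.execute(sql).fetchall()
    finally:
        con.close()

setup()
queries = ["SELECT COUNT(*) FROM t", "SELECT MAX(x) FROM t"]
with ThreadPoolExecutor() as ex:
    results = list(ex.map(loadsql, queries))
print(results)  # [[(10,)], [(9,)]]
```

The same shape carries over to cx_Oracle: the connection is created and closed inside the function that each worker calls.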

mdurant
  • I tried a similar approach unloading data to S3 and I could see only a modest (~20%) speed-up even using 8 threads. – rpanai Sep 07 '18 at 13:16
  • 1
    The speed you get depends on many things, and you could try to find where the bottleneck is for you; you could start by trying the different schedulers, in case it is a GIL effect. The main use case is to take pieces that each fit safely in memory. – mdurant Sep 07 '18 at 13:25
  • (I notice that you immediately `compute` your queries, but you probably want to do more delayed operations and aggregations before pulling everything into the main thread) – mdurant Sep 07 '18 at 13:39
  • I guess you refer to the OP – rpanai Sep 07 '18 at 14:07
  • Yes, sorry @user32185; for your S3 code, of course I haven't seen it, so I couldn't guess. – mdurant Sep 07 '18 at 14:34
  • @mdurant; thank you for your feedback, indeed moving the connection inside the function resolved the larger issue. The other issue was just an incorrect use of the delayed function. A good night's rest and your feedback helped me to resolve. Thanks! – Nick Sep 07 '18 at 14:41
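The comment about doing more delayed operations before pulling everything back can be sketched as follows; this assumes dask and pandas are installed, and loadsql is stubbed out so the example is self-contained:

```python
import dask
from dask import delayed
import pandas as pd

def loadsql(sql):
    # Stand-in for the real per-connection query function
    return pd.DataFrame({"query": [sql]})

sql_to_run = ["SELECT 1", "SELECT 2", "SELECT 3"]
frames = [delayed(loadsql)(q) for q in sql_to_run]
combined = delayed(pd.concat)(frames)        # still lazy: nothing has run yet
df = combined.compute(scheduler="threads")   # a single compute at the end
print(len(df))  # 3
```

Keeping the concatenation (or any aggregation) lazy means only the final, combined result crosses back into the main process, rather than each intermediate frame.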