
The setting in which I am working can be described as follows:

Database and what I want to extract from it

The data required for the analysis is stored in a single de-normalized Oracle table with more than 100 columns. Financial reporting data is published to the table every day, and the table is range-partitioned on the reporting date (one partition per day). Here's the structure of the query I intend to run:

SELECT col1, 
       col2, 
       col3
FROM table
WHERE date BETWEEN start_date AND end_date

Strategy to load data with Dask

I am using sqlalchemy with the cx_Oracle driver to access the database. The strategy I am following to load data in parallel with Dask is:

from dask import bag as db
from sqlalchemy import create_engine

def read_rows(from_date, to_date, engine):
    engine.dispose()  # discard pooled connections inherited from the parent process
    query = """
        -- Query Text --
    """.format(from_date, to_date)
    with engine.connect() as conn:
        ret = conn.execute(query).fetchall()
    return ret

engine = create_engine(...)  # initialise sqlalchemy engine
add_engine_pidguard(engine)  # adding pidguard to engine
date_ranges = [...]  # list of (start_date, end_date)-tuples
data_db = (db.from_sequence(date_ranges)
             .map(lambda x: read_rows(from_date=x[0], to_date=x[1], engine=engine))
             .concat())

# ---- further process data ----
...

add_engine_pidguard is taken from the SQLAlchemy documentation: "How do I use engines / connections / sessions with Python multiprocessing, or os.fork()?"
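For reference, the guard in that FAQ entry looks roughly like the following (paraphrased from the SQLAlchemy documentation of that era; check the docs for the exact, current version):

import os

from sqlalchemy import event, exc

def add_engine_pidguard(engine):
    """Invalidate pooled connections that were created in a different process."""

    @event.listens_for(engine, "connect")
    def connect(dbapi_connection, connection_record):
        # remember which process opened this DBAPI connection
        connection_record.info['pid'] = os.getpid()

    @event.listens_for(engine, "checkout")
    def checkout(dbapi_connection, connection_record, connection_proxy):
        pid = os.getpid()
        if connection_record.info['pid'] != pid:
            # the connection was inherited across a fork: drop it so the
            # pool opens a fresh one in the current process
            connection_record.connection = connection_proxy.connection = None
            raise exc.DisconnectionError(
                "Connection record belongs to pid %s, "
                "attempting to check out in pid %s" %
                (connection_record.info['pid'], pid))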

Questions

  • Is the current way of running the query in blocks fine, or is there a cleaner way of achieving this in sqlalchemy?
  • Since the queries run in a multiprocessing environment, is the way the engines are managed here appropriate?
  • Currently I am executing a "raw query" - would it be beneficial, from a performance point of view, to define the table in a declarative_base (with the respective column types) and use session.query on the required columns from within read_rows? (A sketch of that alternative follows below.)
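For illustration, the mapped-table alternative I have in mind would look roughly like this (the table name, column types, primary key, and connection string are placeholders, not the real schema):

from sqlalchemy import Column, Date, Numeric, String, create_engine
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker

Base = declarative_base()

class Report(Base):
    __tablename__ = 'reporting_table'             # placeholder name
    report_date = Column(Date, primary_key=True)  # the ORM needs some primary key
    col1 = Column(String(30), primary_key=True)
    col2 = Column(Numeric)
    col3 = Column(Numeric)
    # ... remaining ~100 columns omitted

engine = create_engine('oracle+cx_oracle://user:password@dsn')  # placeholder DSN
Session = sessionmaker(bind=engine)

def read_rows_orm(from_date, to_date):
    session = Session()
    try:
        return (session.query(Report.col1, Report.col2, Report.col3)
                       .filter(Report.report_date.between(from_date, to_date))
                       .all())
    finally:
        session.close()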
sim

1 Answer


I would be apt to try code along the lines of

import pandas as pd
import dask.dataframe as dd
from dask.delayed import delayed

...

con = engine.connect()
# one delayed pandas query per date range; each result becomes one dask partition
df = dd.from_delayed([
    delayed(pd.read_sql_query)(QUERY, con, params=params)
    for params in date_ranges
])

In this case, I just make one connection; as I understand it, cx_Oracle connections can be used from multiple threads. The data is loaded with dask.dataframe, without doing anything yet to switch away from the default threaded scheduler. Database IO and many pandas operations release the GIL, so the threaded scheduler is a good candidate here.

This will let us jump right to having a dataframe, which is nice for many operations on structured data.
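As a hypothetical follow-up on the resulting dataframe (col1/col2 are just the placeholder columns from the question's SELECT, and the get= keyword reflects the dask API of that period):

from dask import threaded

# lazy aggregation over the concatenated partitions
summary = df.groupby('col1').col2.sum()

# force evaluation, explicitly requesting the threaded scheduler
result = summary.compute(get=threaded.get)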


Currently I am executing a "raw query", would it be beneficial from a performance point of view to define the table in a declarative_base (with respective column types) and use session.query on the required columns from within read_rows?

This is not especially likely to improve performance, as I understand things.

Mike Graham
  • Thank you very much, that does indeed look a bit cleaner. Performance is fine; in fact, with either approach the import completed (after triggering it with a `count()` on the `df`) in approx. 5 minutes, while an SQL*Plus trace (without splitting the query into blocks) took 11 minutes. For using session factories in threads, the `sqlalchemy` documentation recommends `scoped_session()` - not sure whether something similar is required when sharing the connection. To switch to threaded computation, running it in a `with set_options(get=dask.threaded.get):` context should be fine, right? – sim Aug 24 '16 at 13:11
  • `threaded` is the default for `dask.dataframe`, but that code can ensure it as well. – Mike Graham Aug 24 '16 at 18:32