The setting in which I am working can be described as follows:
Database and what I want to extract from it
The data required to run the analysis is stored in a single de-normalized (more than 100 columns) Oracle table. Financial reporting data is published to the table every day, and the table is range-partitioned on the reporting date (one partition per day). Here is the structure of the query I intend to run:
SELECT col1,
       col2,
       col3
FROM table
WHERE date BETWEEN start_date AND end_date
Strategy to load data with Dask
I am using sqlalchemy with the cx_Oracle driver to access the database. The strategy I am following to load data in parallel with Dask is:
from dask import bag as db
from sqlalchemy import create_engine


def read_rows(from_date, to_date, engine):
    # dispose of any pooled connections inherited from the parent process
    engine.dispose()
    query = """
    -- Query Text --
    """.format(from_date, to_date)
    with engine.connect() as conn:
        ret = conn.execute(query).fetchall()
    return ret


engine = create_engine(...)  # initialise sqlalchemy engine
add_engine_pidguard(engine)  # adding pidguard to engine

date_ranges = [...]  # list of (start_date, end_date)-tuples

data_db = (
    db.from_sequence(date_ranges)
      .map(lambda x: read_rows(from_date=x[0], to_date=x[1], engine=engine))
      .concat()
)

# ---- further process data ----
...
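The date_ranges list above is elided; as a minimal sketch, the daily (start_date, end_date) tuples could be built along these lines (the use of pandas, the helper name, and the concrete dates are just illustrative):

import pandas as pd

def daily_ranges(start, end):
    # one (start_date, end_date) tuple per reporting day, matching the
    # table's daily range partitions; BETWEEN is inclusive, so start and
    # end of each range are the same day
    days = pd.date_range(start, end, freq="D")
    return [(d.date(), d.date()) for d in days]

date_ranges = daily_ranges("2017-01-01", "2017-03-31")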
add_engine_pidguard is taken from the SQLAlchemy documentation: "How do I use engines / connections / sessions with Python multiprocessing, or os.fork()?"
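For completeness, the pattern from that FAQ invalidates pooled connections that were created in a different process. It looks roughly like this (reproduced from the linked documentation, so refer there for the authoritative version):

import os

from sqlalchemy import event, exc


def add_engine_pidguard(engine):
    """Force a connection to be reconnected if it was created in another process."""

    @event.listens_for(engine, "connect")
    def connect(dbapi_connection, connection_record):
        # remember which process created this connection
        connection_record.info['pid'] = os.getpid()

    @event.listens_for(engine, "checkout")
    def checkout(dbapi_connection, connection_record, connection_proxy):
        pid = os.getpid()
        if connection_record.info['pid'] != pid:
            # the connection was inherited across a fork: invalidate it
            connection_record.connection = connection_proxy.connection = None
            raise exc.DisconnectionError(
                "Connection record belongs to pid %s, "
                "attempting to check out in pid %s" %
                (connection_record.info['pid'], pid)
            )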
Questions
- Is the current way of running blocked queries fine, or is there a cleaner way of achieving this in sqlalchemy?
- Since the queries operate in a multiprocessing environment, is the approach of managing the engines fine the way it is implemented?
- Currently I am executing a "raw query"; would it be beneficial from a performance point of view to define the table via declarative_base (with the respective column types) and use session.query on the required columns from within read_rows? (A sketch of what I have in mind follows below.)
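For concreteness, the alternative I have in mind for the last question would look roughly like this; the table name, column names and column types below are placeholders:

from sqlalchemy import Column, Date, Numeric, String
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker

Base = declarative_base()


class Reporting(Base):
    __tablename__ = 'reporting_table'            # placeholder
    report_date = Column(Date, primary_key=True)
    col1 = Column(String(32), primary_key=True)
    col2 = Column(Numeric)
    col3 = Column(Numeric)


Session = sessionmaker()


def read_rows(from_date, to_date, engine):
    engine.dispose()
    session = Session(bind=engine)
    try:
        return (session.query(Reporting.col1, Reporting.col2, Reporting.col3)
                .filter(Reporting.report_date.between(from_date, to_date))
                .all())
    finally:
        session.close()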