
I have a 55-million-row table in MSSQL and I only need to pull about 5 million of those rows into a Dask dataframe. Currently, dask.dataframe.read_sql_table doesn't accept raw SQL queries, but it does accept SQLAlchemy expressions; however, there is an issue with that approach, as described here: Dask read_sql_table errors out when using an SQLAlchemy expression

I attempted to implement the solution suggested there, and the process still takes around 1 minute.
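For context, the SQLAlchemy-expression route looks roughly like the sketch below; the table and column names are placeholders rather than my real schema, and the actual code also includes the workaround from the linked answer:

import sqlalchemy as sa
import dask.dataframe as dd

# Placeholder schema -- stand-ins for the real table and columns.
meta = sa.MetaData()
big_table = sa.Table(
    'big_table', meta,
    sa.Column('ID', sa.Integer, primary_key=True),
    sa.Column('colA', sa.String),
    sa.Column('colB', sa.Float),
)

# Select only the columns and rows actually needed (~5M of the 55M rows).
stmt = (
    sa.select([big_table.c.ID, big_table.c.colA, big_table.c.colB])
    .where(big_table.c.colB > 0)
)

ddf = dd.read_sql_table(stmt, uri, index_col='ID')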

I also attempted to load the data into a pandas dataframe all at once and then convert it to Dask, and that takes around 2 minutes.

For comparison, loading a ~4.5-million-row table the accepted Dask way (dd.read_sql_table) takes about 16 seconds. Here is what I've tried, with timings:

import pandas as pd
import dask
import dask.dataframe as dd

# Attempt 1: stream the query in 50k-row chunks, wrap each chunk for Dask.
generator = pd.read_sql(sql=query, con=uri, chunksize=50000)
dds = []
for chunk in generator:
    dds.append(dask.delayed(dd.from_pandas)(chunk, npartitions=5))
ddf = dd.from_delayed(dds)
CPU times: user 50.1 s, sys: 2.13 s, total: 52.2 s
Wall time: 52.3 s


# Attempt 2: pull everything into pandas first, then hand it to Dask.
# (engine is a SQLAlchemy engine built from the same connection string,
#  e.g. engine = sqlalchemy.create_engine(uri))
result = engine.execute(query)
df = pd.DataFrame(result.fetchall())
df.columns = result.keys()
ddf = dd.from_pandas(df, npartitions=10)
CPU times: user 54.3 s, sys: 3.14 s, total: 57.4 s
Wall time: 2min 41s

# Baseline: the standard Dask read of a ~4.5-million-row table.
ddf = dd.read_sql_table(table="4.5mil_table",
                        uri=uri, index_col='ID')
CPU times: user 117 ms, sys: 4.12 ms, total: 122 ms
Wall time: 16 s

I know there has to be a more efficient way to do this that I am missing.
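One direction I've been considering, sketched below and untested: split the 5 million target rows into ID ranges and read each range in parallel with dask.delayed around pd.read_sql, then assemble the pieces with dd.from_delayed. The bounds, table, and column names are placeholders, and it assumes ID is a roughly contiguous integer key:

import pandas as pd
import dask
import dask.dataframe as dd

# Placeholder bounds: split the ~5M target rows into ID ranges of 500k each.
lo, hi, step = 0, 5_000_000, 500_000
parts = []
for start in range(lo, hi, step):
    q = (
        "SELECT ID, colA, colB FROM big_table "        # placeholder table/columns
        f"WHERE ID >= {start} AND ID < {start + step}"
    )
    # Each range becomes a lazy task; nothing is read until compute time.
    parts.append(dask.delayed(pd.read_sql)(q, uri))

ddf = dd.from_delayed(parts)

I'm not sure whether this actually buys much over the chunked pd.read_sql attempt above, which is part of why I'm asking.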

asked by msolomon87
  • Hi, how much memory does your system have to obtain the performance you are seeing? I'm able to get through maybe 250k rows in a minute on 16 GB of RAM (is this expected?). Adding that my table also has 200+ columns (most with empty cells); I'm only selecting 3 columns, however. – mathee Mar 27 '20 at 11:04

0 Answers