Does anyone have a nice, neat, and stable way to achieve the equivalent of:

pandas.read_sql(sql, con, chunksize=None)

and/or

pandas.read_sql_table(table_name, con, schema=None, chunksize=None)

connected to Redshift with SQLAlchemy & psycopg2, directly into a dask DataFrame?

The solution should be able to handle large amounts of data.
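For reference, here is the pandas pattern I want to replicate, as a minimal sketch (the connection string, table, and handle function are placeholders):

import pandas as pd
from sqlalchemy import create_engine

# placeholder connection string; Redshift speaks the Postgres wire protocol
engine = create_engine("postgresql+psycopg2://user:password@host:5439/dbname")

# chunksize makes read_sql yield an iterator of DataFrames
for chunk in pd.read_sql("SELECT * FROM my_table", engine, chunksize=100_000):
    handle(chunk)  # hypothetical per-chunk handler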

Leonard Aukea
  • Are you sure this is the approach you want to take? On AWS we ended up doing this: create a temp table from the query -> unload the table to S3 (GZIP compressed) -> read from the folder with dask -> delete the folder on S3. It is very fast and the impact on RS is minimal. – rpanai Jan 30 '19 at 12:17
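A minimal sketch of the pipeline rpanai describes in the comment above, assuming a placeholder endpoint, bucket, IAM role, and query (none of these names are real, and reading from S3 also requires s3fs):

import dask.dataframe as dd
from sqlalchemy import create_engine, text

# placeholder Redshift endpoint; Redshift speaks the Postgres wire protocol
engine = create_engine("postgresql+psycopg2://user:password@host:5439/dbname")

# 1. UNLOAD the query result to S3 as GZIP-compressed CSV parts
unload = text("""
    UNLOAD ('SELECT * FROM my_table')
    TO 's3://my-bucket/my-prefix/part_'
    IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
    DELIMITER ',' GZIP ALLOWOVERWRITE
""")
with engine.begin() as conn:
    conn.execute(unload)

# 2. Read the parts into a dask DataFrame. gzip is not splittable,
#    so blocksize=None gives one partition per file; UNLOAD writes
#    no header row by default, hence header=None.
df = dd.read_csv(
    "s3://my-bucket/my-prefix/part_*",
    compression="gzip",
    blocksize=None,
    header=None,
)

# 3. When done, delete the S3 folder (e.g. with boto3 or the aws cli).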

1 Answer


You might consider the read_sql_table function in dask.dataframe.

http://dask.pydata.org/en/latest/dataframe-api.html#dask.dataframe.read_sql_table

>>> df = dd.read_sql_table('accounts', 'sqlite:///path/to/bank.db',
...                        npartitions=10, index_col='id')

This relies on the pandas.read_sql_table function internally, so it should operate under the same restrictions, except that you are now asked to provide a number of partitions and an index column.
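For Redshift specifically (see the comments below about the uri parameter), the connection string is just a standard SQLAlchemy URL built from your cluster endpoint. A hedged sketch with placeholder host, credentials, and table:

import dask.dataframe as dd

# placeholder endpoint and credentials; Redshift speaks the Postgres
# wire protocol, so the postgresql+psycopg2 dialect applies
uri = ("postgresql+psycopg2://user:password@"
       "my-cluster.abc123.us-east-1.redshift.amazonaws.com:5439/dbname")

df = dd.read_sql_table(
    "accounts",       # placeholder table name
    uri,
    index_col="id",   # column dask uses to split the table into partitions
    npartitions=10,
)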

MRocklin
  • I am trying to use "dd.read_sql" to connect to Redshift, but how would I pass the uri parameter? In Redshift I have a JDBC or endpoint link! And in your experience, is this capable of loading 300 million rows with 16 GB RAM? In principle dask should handle it, right? I have also tried this suggestion: https://stackoverflow.com/questions/35004936/how-to-connect-to-a-cluster-in-amazon-redshift-using-sqlalchemy, but got a "Could not parse rfc1738 URL from string" error. – TwinPenguins Jul 06 '18 at 05:48
  • @MRocklin if you want to use something like that in production, you would have to change these settings every time you increase your RS nodes. – rpanai Jan 30 '19 at 12:21
  • @MajidMortazavi if by load you mean persist or compute, I think you can't do it. – rpanai Jan 30 '19 at 12:21