
I'm connecting to an Oracle database with dask.dataframe.read_sql_table to bring across some larger tables, some with over 100 million rows, and then write them to an S3 bucket in Parquet format. However, I keep running into memory errors, even when I specify the number of partitions Dask recommends. I've read a bit about dask.distributed but am not sure how to use it with dask.dataframe.read_sql_table. I also frequently run into a KeyError:

Only a column name can be used for the key in a dtype mappings argument
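For reference, the call I'm running looks roughly like this (the connection string, table name, column names, and partition count are placeholders for the real ones):

```python
import dask.dataframe as dd

# Placeholder connection string for the real Oracle instance
uri = "oracle+cx_oracle://user:password@host:1521/?service_name=MYDB"

# Read the table split on an indexed numeric column, using roughly the
# partition count Dask suggested when it inspected the table
df = dd.read_sql_table(
    "BIG_TABLE",
    uri,
    index_col="ID",
    npartitions=500,
)

# Write the result straight out to S3 as Parquet
df.to_parquet("s3://my-bucket/big_table/", engine="pyarrow")
```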

If anyone has any ideas on how to use dask.dataframe.read_sql_table to read tables with 100 million rows, it would be greatly appreciated.

Thanks

Pete

1 Answer


In principle, using read_sql_table followed by a to_parquet call should be fine.
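For example, something along these lines (untested; the connection string, table, column names, and resource settings are placeholders you would swap for your own):

```python
import dask.dataframe as dd
from dask.distributed import Client, LocalCluster

# A local distributed scheduler spills data to disk when workers approach
# their memory limit, which often helps with out-of-memory failures
cluster = LocalCluster(n_workers=4, memory_limit="4GB")
client = Client(cluster)

# Placeholder Oracle connection string
uri = "oracle+cx_oracle://user:password@host:1521/?service_name=MYDB"

df = dd.read_sql_table(
    "BIG_TABLE",
    uri,
    index_col="ID",             # an indexed, numeric column used for partitioning
    bytes_per_chunk="256 MiB",  # cap the size of each partition rather than guessing a count
)

# Partitions are read and written incrementally, so the full table
# never has to fit in memory at once
df.to_parquet("s3://my-bucket/big_table/", engine="pyarrow")
```

With a distributed Client active, the same read_sql_table/to_parquet graph runs on the workers, and bytes_per_chunk gives you a more direct handle on per-partition memory than a raw partition count.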

Without additional information, like a minimal reproducible example, it's not clear how else we can help. Good luck!

MRocklin