
I'm connecting to an Oracle database with dask.dataframe.read_sql_table to bring across some larger tables, some with over 100 million rows, and then write them to an S3 bucket in Parquet format. However, I keep running into memory errors, even when I specify the number of partitions Dask recommends. I've read a bit about dask.distributed but am not sure how to use it with dask.dataframe.read_sql_table. I also frequently run into a KeyError:

Only a column name can be used for the key in a dtype mappings argument
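For reference, the call I'm running looks roughly like this (the connection string, table name, column names, and partition count are placeholders for the real ones):

```python
import dask.dataframe as dd

# Placeholder connection string for the real Oracle instance
uri = "oracle+cx_oracle://user:password@host:1521/?service_name=MYDB"

# Read the table split on an indexed numeric column, using roughly the
# partition count Dask suggested when it inspected the table
df = dd.read_sql_table(
    "BIG_TABLE",
    uri,
    index_col="ID",
    npartitions=500,
)

# Write the result straight out to S3 as Parquet
df.to_parquet("s3://my-bucket/big_table/", engine="pyarrow")
```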

If anyone has any ideas on how to use dask.dataframe.read_sql_table to read tables with 100 million rows, it would be greatly appreciated.

Thanks

Pete

1 Answer


In principle, using read_sql_table followed by a to_parquet call should be fine.
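For example, something along these lines (untested; the connection string, table, column names, and resource settings are placeholders you would swap for your own):

```python
import dask.dataframe as dd
from dask.distributed import Client, LocalCluster

# A local distributed scheduler spills data to disk when workers approach
# their memory limit, which often helps with out-of-memory failures
cluster = LocalCluster(n_workers=4, memory_limit="4GB")
client = Client(cluster)

# Placeholder Oracle connection string
uri = "oracle+cx_oracle://user:password@host:1521/?service_name=MYDB"

df = dd.read_sql_table(
    "BIG_TABLE",
    uri,
    index_col="ID",             # an indexed, numeric column used for partitioning
    bytes_per_chunk="256 MiB",  # cap the size of each partition rather than guessing a count
)

# Partitions are read and written incrementally, so the full table
# never has to fit in memory at once
df.to_parquet("s3://my-bucket/big_table/", engine="pyarrow")
```

With a distributed Client active, the same read_sql_table/to_parquet graph runs on the workers, and bytes_per_chunk gives you a more direct handle on per-partition memory than a raw partition count.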

Without additional information, like a minimal reproducible example, it's not clear how else we can help. Good luck!

MRocklin