
I need to run some offline algorithms on a large dataset to test their scalability. The dataset can be as large as 10 million × 10 thousand.

I don't think I can use small batches in this case, since my algorithms are offline and need all the data at once. I get a memory error when creating such a large array with NumPy. I don't have root access either, since I am running jobs on a cluster.

In this situation, is it still possible to generate such a large dataset in Python?
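
One option that stays within plain NumPy is a disk-backed `numpy.memmap` array, which the rest of the code can treat like an ordinary ndarray while the data lives on disk. A minimal sketch follows; the shape, dtype, file name, and block size are assumptions to be adjusted to the real dataset:

```python
import numpy as np

# Disk-backed array; NumPy pages data in and out as needed.
# float32 at 10M x 10K is roughly 400 GB of disk.
n_rows, n_cols = 10_000_000, 10_000
X = np.memmap("dataset.dat", dtype=np.float32, mode="w+",
              shape=(n_rows, n_cols))

# Fill it in row blocks so only one block (~400 MB here) sits in RAM at a time.
rng = np.random.default_rng(0)
block = 10_000
for start in range(0, n_rows, block):
    stop = min(start + block, n_rows)
    X[start:stop] = rng.standard_normal((stop - start, n_cols),
                                        dtype=np.float32)
X.flush()
```

Whether this is usable depends on the algorithm's access pattern: sequential passes over rows are cheap, while random access to a memory-mapped file can be slow.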

  • I would recommend you check out [dask](https://dask.org/) and [xarray](http://xarray.pydata.org/en/stable/) (see the sketch after these comments) – Energya Oct 03 '19 at 21:25
  • I recommend Dask as well. Alternatively, if you want to do this in 'pure' Python, look at writing a generator and lazily evaluating your batches in chunks, see: https://stackoverflow.com/questions/519633/lazy-method-for-reading-big-file-in-python – Martin Dinov Oct 03 '19 at 23:15
  • Thank you, guys! Your suggestions are truly helpful! – FoolBoy Oct 06 '19 at 03:54
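
Following the dask suggestion in the comments, here is a minimal sketch of the `dask.array` approach; the array shape and chunk size are assumptions and should be tuned to the node's RAM:

```python
import dask.array as da

# A lazily evaluated random array split into chunks;
# each 10_000 x 10_000 float64 chunk is about 800 MB.
X = da.random.random((10_000_000, 10_000), chunks=(10_000, 10_000))

# Nothing is materialised until compute(); only the small result is held in memory.
col_means = X.mean(axis=0).compute()
print(col_means.shape)  # (10000,)
```

This only helps if the offline algorithm can be expressed (or approximated) with dask's chunked operations; an algorithm that genuinely needs random access to the full dense matrix in RAM would still hit the memory limit.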

0 Answers