I need to run some offline algorithms on a large dataset (to test their scalability). The dataset can be as large as 10 million by 10 thousand.
I don't think I can use small batches in this case, since my algorithm is offline, meaning it needs all the data at once. I get a memory error when creating such a large dataset with NumPy. I also don't have root access, since I am running jobs on a cluster.
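For context, here is a rough sketch of the kind of allocation that fails for me (the exact shape and dtype are just what I'm assuming; the real generation code is more involved):

```python
import numpy as np

n_rows, n_cols = 10_000_000, 10_000

# Back-of-the-envelope memory estimate for a dense float64 matrix:
# 10_000_000 * 10_000 * 8 bytes ~= 800 GB, far beyond the RAM of a single node.
print(n_rows * n_cols * np.dtype(np.float64).itemsize / 1e9, "GB")

# Roughly what I tried -- this raises MemoryError (or gets the job killed):
# X = np.random.rand(n_rows, n_cols)
```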
I wonder, in this situation, is it still possible to generate such a large dataset in Python?