I am running a SageMaker instance that always throws an exception at the same point in the cycle, even if I allocate more storage. So it might not be a storage issue, but I am at a loss as to why it fails.
I get the error at the same spot whether I allocate 1024 GB or 100 GB of storage (the estimator's volume_size). DiskUtilization sits at 23% when it crashes with 100 GB allocated (however, that number does not update in real time, so it is probably higher).
2021-06-10T13:25:32.141+02:00 terminate called after throwing an instance of 'dmlc::Error'
2021-06-10T13:25:45.144+02:00 what(): [11:25:31] src/io/local_filesys.cc:38: Check failed: std::fwrite(ptr, 1, size, fp_) == size: FileStream.Write incomplete
2021-06-10T13:25:45.144+02:00 Stack trace: [bt] (0) /usr/local/lib/python3.6/dist-packages/mxnet/libmxnet.so(+0x3c58ea9) [0x7f0280805ea9]
I am loading parquet files and saving them as ndarrays of 10000 rows per file. At around 120000 rows, it crashes. I am doing this in order to give mxnet a dataset with random access, which I cannot do with just parquet files.
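Since the DiskUtilization metric lags, one way to see how much space is really left at the moment of the failure would be to log the free space on the output filesystem right before each chunk is written. The following is only a rough sketch of that idea, not the actual job code: `free_gb`, `save_chunks`, `pickle_writer` and the file naming are hypothetical names made up for illustration, and the pickle-based writer is a stand-in for whatever actually saves the ndarrays (for example mxnet's `nd.save`).

```python
import os
import pickle
import shutil
import tempfile


def free_gb(path):
    """Return the free space (in GB) on the filesystem that holds `path`."""
    return shutil.disk_usage(path).free / 1e9


def pickle_writer(path, chunk):
    """Stand-in writer for the demo; the real job would save an ndarray here."""
    with open(path, "wb") as fp:
        pickle.dump(chunk, fp)


def save_chunks(rows, out_dir, write_fn, chunk_size=10000):
    """Write `rows` in files of at most `chunk_size` rows, logging free space first.

    Returns the list of file paths that were written.
    """
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    for start in range(0, len(rows), chunk_size):
        chunk = rows[start:start + chunk_size]
        # Logged before the write, so the last line in the log shows the
        # state of the disk just before a failing write.
        print(f"row {start}: {free_gb(out_dir):.2f} GB free in {out_dir}")
        path = os.path.join(out_dir, f"chunk_{start // chunk_size:05d}.bin")
        write_fn(path, chunk)
        paths.append(path)
    return paths


if __name__ == "__main__":
    # Small demo on a temporary directory with dummy rows.
    with tempfile.TemporaryDirectory() as tmp:
        written = save_chunks(list(range(25000)), tmp, pickle_writer)
        print(len(written), "files written")
```

If the logged free space is already close to zero around row 120000, the files are probably being written to a filesystem other than the volume sized by volume_size; if there is plenty of space left, the cause is likely something else.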
Any help is appreciated.