
I am running a SageMaker instance which always throws an exception at the same place in the cycle, even if I allocate more storage. So it might not be a storage issue, but I am at a loss as to why it fails.

I get the error at the same spot whether I allocate 1024 GB or 100 GB of storage (the estimator's volume_size). DiskUtilization sits at 23% when it crashes with 100 GB allocated (however, that number does not update in real time, so it is probably higher).

2021-06-10T13:25:32.141+02:00   terminate called after throwing an instance of 'dmlc::Error'

2021-06-10T13:25:45.144+02:00   what(): [11:25:31] src/io/local_filesys.cc:38: Check failed: std::fwrite(ptr, 1, size, fp_) == size: FileStream.Write incomplete

2021-06-10T13:25:45.144+02:00   Stack trace: [bt] (0) /usr/local/lib/python3.6/dist-packages/mxnet/libmxnet.so(+0x3c58ea9) [0x7f0280805ea9]

I am loading parquet files and saving them as ndarrays of 10000 rows per file. At around 120000 rows, it crashes. I am doing this in order to give mxnet a dataset with random access, which I cannot do with just parquet files.

Any help is appreciated.

Mikkel F

2 Answers


On a SageMaker notebook instance, to write to the user-sized EBS volume, you need to write your data within the directory /home/ec2-user/SageMaker. If you run df -h, you will see that your user-sized EBS volume (the 1024 GB storage) is mounted on /home/ec2-user/SageMaker. If you don't write inside this directory, your data won't be persisted when you shut down your notebook instance. In your case, I am assuming you are writing to the 100 GB storage and hence running out of space.

rmilletich
  • Okay, that might be true, I'll look into that. However, I am not running a notebook, just a training job. Or is that a notebook in disguise? – Mikkel F Jun 14 '21 at 08:52

I have been struggling with the same issue, and there seems to be very little explicit documentation online. Quite a few resources (GitHub, Stack Overflow, etc.) kept pointing me toward using the directory /home/ec2-user/SageMaker. When I tried it, this did not help, so I don't think it applies to training jobs, only to SageMaker Notebooks / Studio.

I ran a SageMaker training job today with a VolumeSizeInGB of 200 GB. When I added df -h at the beginning of the training script, I got the following output:

Filesystem      Size  Used Avail Use% Mounted on
...
tmpfs            64M     0   64M   0% /dev
tmpfs            30G     0   30G   0% /sys/fs/cgroup
/dev/xvdf       196G   65M  186G   1% /tmp
/dev/xvda1       40G   17G   24G  42% /etc/hosts
shm              31G     0   31G   0% /dev/shm
tmpfs            30G   12K   30G   1% /proc/driver/nvidia
tmpfs            30G  4.0K   30G   1% /etc/nvidia/nvidia-application-profiles-rc.d
devtmpfs         30G  120K   30G   1% /dev/nvidia0
tmpfs            30G     0   30G   0% /proc/acpi
tmpfs            30G     0   30G   0% /proc/scsi
tmpfs            30G     0   30G   0% /sys/firmware

Based on this, it seemed that the EBS volume was the device /dev/xvdf, mounted at /tmp.
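If shelling out to df -h is inconvenient, the same check can probably be done from inside the training script. The following is only a rough sketch using Python's standard shutil.disk_usage; the function name report_free_space and the list of paths are my own choices for illustration, not anything SageMaker provides.

```python
import shutil


def report_free_space(paths):
    """Return {path: (total_gib, used_gib, free_gib)} for each readable path."""
    gib = 1024 ** 3
    report = {}
    for path in paths:
        try:
            usage = shutil.disk_usage(path)
        except OSError:
            # Path is missing or not accessible in this container; skip it.
            continue
        report[path] = (usage.total / gib, usage.used / gib, usage.free / gib)
    return report


if __name__ == "__main__":
    # Hypothetical list of mount points; adjust to what df -h shows for you.
    for path, (total, used, free) in report_free_space(["/", "/tmp", "/dev/shm"]).items():
        print(f"{path}: total={total:.1f} GiB used={used:.1f} GiB free={free:.1f} GiB")
```

Logging this at the start of the job, and again shortly before the step that crashes, should show which mount point is actually filling up.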

So I modified my training script to use /tmp as the base directory for downloading files, and I am no longer running into the No space left on device error I was facing.
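For anyone wanting to do the same, here is a minimal sketch of what "use /tmp as base" could look like, with an early free-space check so the job fails with a readable message instead of a truncated write. It assumes /tmp is the EBS-backed mount, as in my df -h output; the SCRATCH_DIR environment variable and the write_chunk helper are hypothetical names I made up for this example.

```python
import os
import shutil

# Assumption: /tmp is the large EBS-backed mount (check with df -h first).
# SCRATCH_DIR is a hypothetical override, not a SageMaker-defined variable.
SCRATCH_DIR = os.environ.get("SCRATCH_DIR", "/tmp")


def write_chunk(name, payload, base_dir=SCRATCH_DIR, safety_margin=0):
    """Write bytes to base_dir/name, failing early if the volume lacks space."""
    os.makedirs(base_dir, exist_ok=True)
    free = shutil.disk_usage(base_dir).free
    needed = len(payload) + safety_margin
    if free < needed:
        raise OSError(f"only {free} bytes free in {base_dir}, need {needed}")
    path = os.path.join(base_dir, name)
    with open(path, "wb") as fh:
        fh.write(payload)
    return path
```

In the original question's setup, each 10000-row array would be serialized and written under the same base directory; the serialization itself is left out here because it depends on the library used.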

Nicholas Coles