This is a sporadic issue, and I have not been able to find a condition that reliably reproduces it.
The gist of the issue is that an instance or the controller node will randomly fail to find files that already exist on Amazon FSx. A sample script can be as simple as this:
    import dask

    fn = '/mnt/fsx/home/user/something.txt'

    def run():
        with open(fn) as f:
            s1 = f.readlines()
        with open(fn) as g:  # <-- this open() can sporadically fail to find the file
            s2 = g.readlines()
        return len(s1) + len(s2)

    # Create the file once, before any task is scheduled.
    with open(fn, 'w') as f:
        f.write('blah blah blah')

    ret = [dask.delayed(run)() for _ in range(2000)]
    result = dask.compute(ret)
The second open(...) in run() can fail with a plain Python FileNotFoundError, even though the file was written before any task was scheduled.
I could not find any information on why this happens or how to mitigate it. I did consider keeping the file on S3 so that there are built-in retries around file access, but that would bring its own load and cost issues.
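For comparison with the S3 option, here is a minimal sketch of doing the retry on the client side instead. The helper name read_lines_with_retry and the attempts and delay values are hypothetical, not something FSx or Dask provides. It only masks the symptom and does not explain why the file briefly goes missing.

```python
import time


def read_lines_with_retry(path, attempts=5, delay=0.2):
    """Read all lines from path, retrying if the file is reported missing.

    Hypothetical helper: attempts and delay are illustrative defaults,
    not tuned values.
    """
    for attempt in range(attempts):
        try:
            with open(path) as f:
                return f.readlines()
        except FileNotFoundError:
            # Give up and re-raise once the last attempt has failed.
            if attempt == attempts - 1:
                raise
            # Exponential backoff before the next attempt.
            time.sleep(delay * 2 ** attempt)
```

In the sample script, each `with open(fn)` block in run() would become a call such as `s1 = read_lines_with_retry(fn)`. A file that is genuinely missing still raises FileNotFoundError after the last attempt, so real errors are not hidden.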