I have access to a couple dozen Dask servers without GPUs but with complete control over the software (I can wipe them and install something different), and I want to accelerate pytorch-lightning model training. What could be a possible solution to integrate them with as little additional code as possible?
I've researched this topic a bit and found the options below, but I can't determine which one to choose:
| # | option | info | pro | con |
|---|--------|------|-----|-----|
| 1 | dask-pytorch-ddp | package for writing models with easier integration into Dask | will likely work | cannot use an existing model out of the box; the model itself needs rewriting |
| 2 | PL docs, on-prem cluster (intermediate) | multiple copies of PyTorch Lightning on the network | simplest way, according to the Lightning docs | fiddly to launch, according to the docs |
| 3 | PL docs, SLURM cluster | wipe/redeploy the cluster and set up SLURM | less fiddly to launch individual jobs | need to redeploy the cluster OS/software |
| 4 | PyTorch + Dask | officially supported and documented use of Skorch | has a package handling this (skorch) | will need to use plain PyTorch, not Lightning |
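For context, option 2 appears to boil down to running the same, unmodified training script on every node, coordinated via environment variables. This is only a sketch of my understanding of the docs; the hostname, port, and script name are placeholders:

```shell
# Run on EVERY node; only NODE_RANK differs per machine.
# node0.example.com, port 29500, and train.py are placeholders.
export MASTER_ADDR=node0.example.com   # reachable address of the rank-0 node
export MASTER_PORT=29500               # a free TCP port, identical on all nodes
export WORLD_SIZE=2                    # total number of nodes
export NODE_RANK=0                     # 0 on the first node, 1 on the second, ...

# train.py would construct something like
# Trainer(num_nodes=2, accelerator="cpu", strategy="ddp")
python train.py
```

The appeal is that the existing LightningModule stays unchanged; the "fiddly" part the docs warn about seems to be keeping these variables consistent and launching the script on every node by hand.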
Are there any more options or tutorials to learn about this?