
I am wondering how to do model parallelism using PyTorch's distributed modules. Basically, what I want to do is the following:


import torch.nn as nn


class LargeModel(nn.Module):
    def __init__(self, in_features, n_hid, out_features) -> None:
        super().__init__()
        # Small part that should live and train on my laptop.
        self.to_train_locally = nn.Linear(in_features, n_hid)
        # Much larger part that should live and train on the EC2 instance.
        self.to_train_on_aws = nn.Linear(n_hid, out_features)

    def forward(self, input):
        intermediate = self.to_train_locally(input)
        # Placeholder for the piece I don't know how to implement: ship the
        # intermediate activations to AWS, run the forward pass there, and
        # get the result back.
        res = send_to_aws_for_forward_pass(self.to_train_on_aws, intermediate)
        return res

Basically, I want to train a "large" model split into two components. The "local" component, self.to_train_locally, can consist of an arbitrary number of layers and should reside on my personal laptop. The second component, self.to_train_on_aws, should reside and train on an AWS EC2 instance that I have, and it will be much bigger than the self.to_train_locally component. In this toy example the entire model could of course be stored locally, but that is not the point - I want a framework that will let me train part of the model locally while doing the bulk of the training on AWS.
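For concreteness, here is a rough sketch of what I imagine the split could look like with torch.distributed.rpc's RemoteModule. This is just my guess - the worker name "aws_worker" and the layer sizes are placeholders I made up, and I don't know whether this is the right approach at all.

import torch.nn as nn
from torch.distributed.nn.api.remote_module import RemoteModule


class SplitModel(nn.Module):
    def __init__(self, in_features, n_hid, out_features) -> None:
        super().__init__()
        # This half stays on the laptop.
        self.to_train_locally = nn.Linear(in_features, n_hid)
        # This half is constructed on the remote worker. "aws_worker/cpu" would be
        # the name the EC2 process registers with init_rpc, plus the target device.
        self.to_train_on_aws = RemoteModule(
            "aws_worker/cpu",
            nn.Linear,
            args=(n_hid, out_features),
        )

    def forward(self, x):
        intermediate = self.to_train_locally(x)
        # forward() on the RemoteModule runs the computation on the EC2 worker.
        return self.to_train_on_aws.forward(intermediate)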

How would I set up a training routine for this? I have looked at the following tutorials from PyTorch's official documentation, but none of them have been of much help:

  1. https://pytorch.org/tutorials/intermediate/dist_tuto.html This talks about Gloo and other coordination tools that I assume make more sense for larger teams. If I want to do this for a personal project, where some layers live and train on my local laptop and the rest on a personal EC2 instance, how would I do this?

  2. https://pytorch.org/tutorials/intermediate/rpc_tutorial.html This is not useful either, because it covers model parallelism on a single machine, where multiple processes are spawned/forked on that one machine, which is not what I want to do. For example, I can't just set

os.environ['MASTER_ADDR'] = 'my:aws:instance:public:ip:addr'
os.environ['MASTER_PORT'] = '29500' # any port

as in the RNN example given in this tutorial (a guess at what I might do instead is sketched below).
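
My guess is that instead of those environment variables I would need to point both machines at a rendezvous address the laptop can actually reach, something like the following (completely untested; the IP address, port and worker names are placeholders):

import torch.distributed.rpc as rpc

# Untested guess: "MY_EC2_PUBLIC_IP" and port 29500 are placeholders. I put the EC2
# instance at rank 0 so the rendezvous address is one the laptop can reach, with the
# port opened in the instance's security group.
options = rpc.TensorPipeRpcBackendOptions(
    init_method="tcp://MY_EC2_PUBLIC_IP:29500",
)

# On the EC2 instance:
# rpc.init_rpc("aws_worker", rank=0, world_size=2, rpc_backend_options=options)

# On the laptop:
rpc.init_rpc("laptop", rank=1, world_size=2, rpc_backend_options=options)

# ... build the split model and run the training loop from the laptop ...

rpc.shutdown()  # blocks until outstanding RPC work on both workers is done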

Has anyone tried to do something like this before? How would distributed autograd and a distributed optimiser fit into the training loop in this setup? Any help would be appreciated, especially pointers to code where someone has done something similar.


EDIT 1

What I am trying to do is basically what is described in this paper, https://arxiv.org/pdf/1903.11314.pdf, under the heading "Model parallelism":

"In model parallelism, the DL model is split, and each worker loads a different part of the DL model for training (see Figure 5). The worker(s) that hold the input layer of the DL model are fed with the training data. In the forward pass, they compute their output signal which is propagated to the workers that hold the next layer of the DL model. In the backpropagation pass, gradients are computed starting at the workers that hold the output layer of the DL model, propagating to the workers that hold the input layers of the DL model."

Does anyone have an example of how to do this on AWS and using PyTorch?
