
I have read the distributed TensorFlow documentation and this answer.

According to them, in the data parallelism approach:

  • The algorithm distributes the data among the various cores.
  • Each core independently tries to estimate the same parameter(s).
  • Cores then exchange their estimate(s) with each other to come up with the right estimate for the step.

And in the model parallelism approach:

  • The algorithm sends the same data to all the cores.
  • Each core is responsible for estimating different parameter(s).
  • Cores then exchange their estimate(s) with each other to come up with the right estimate for all the parameters (I try to sketch both approaches in code below).
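
To check my understanding, here is how I picture the two approaches in code. This is only my rough sketch with the TF 1.x API, assuming a single machine with two GPUs; the shapes and layer sizes are made up:

```python
import tensorflow as tf  # TF 1.x API

x = tf.placeholder(tf.float32, [128, 100])  # made-up batch/feature sizes

# Data parallelism: each device gets a different slice of the data
# but estimates the *same* parameters (hence the variable reuse).
for i, shard in enumerate(tf.split(x, 2)):
    with tf.device('/gpu:%d' % i):
        with tf.variable_scope('shared_model', reuse=(i > 0)):
            out = tf.layers.dense(shard, 10)

# Model parallelism: every device gets the *same* data but holds
# different parameters (here, different layers of the network).
with tf.device('/gpu:0'):
    hidden = tf.layers.dense(x, 50)       # layer 1's parameters on GPU 0
with tf.device('/gpu:1'):
    logits = tf.layers.dense(hidden, 10)  # layer 2's parameters on GPU 1
```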

How do In-graph replication and Between-graph replication relate to these approaches?

This article says:

For example, different layers in a network may be trained in parallel on different GPUs. This training procedure is commonly known as "model parallelism" (or "in-graph replication" in the TensorFlow documentation).

And:

In "data parallelism" (or “between-graph replication” in the TensorFlow documentation), you use the same model for every device, but train the model in each device using different training samples.

Is that accurate?

From the TensorFlow DevSummit video linked on the TensorFlow documentation page, it looks like the data is split and distributed to each worker. So isn't in-graph replication following the data parallelism approach?

Amila

In my understanding, the difference between in-graph and between-graph is where you run the code that builds the dependency graph (aka your model): on one server, or on all servers in the cluster. In large clusters, between-graph doesn't bottleneck on a single server and is preferred. Both methods allow you, the user, to run any operations you define, be it a data-parallel approach, a distributed model, or something in between. Since I haven't coded this, I'll let someone else answer in case my understanding is in error. – David Parks Jun 20 '18 at 20:50

1 Answer


In-graph replication and between-graph replication are not directly related to data parallelism and model parallelism. Data parallelism and model parallelism are terms that divide parallelization algorithms into two categories, as described in the Quora answer you link to. In-graph replication and between-graph replication, on the other hand, are two ways to implement parallelism in TensorFlow. Data parallelism, for instance, can be implemented with both in-graph replication and between-graph replication.
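
For instance, here is a rough sketch of data parallelism implemented with in-graph replication (the model, shapes, and two-GPU setup are made up for illustration; TF 1.x API). A single client builds one graph with a "tower" per device, and averaging the tower gradients is the step where the cores exchange their estimates:

```python
import tensorflow as tf  # TF 1.x API

x = tf.placeholder(tf.float32, [128, 100])  # illustrative shapes
y = tf.placeholder(tf.int64, [128])

opt = tf.train.GradientDescentOptimizer(0.1)
tower_grads = []
for i, (x_shard, y_shard) in enumerate(zip(tf.split(x, 2), tf.split(y, 2))):
    # One "tower" per device, all towers sharing the same variables.
    with tf.device('/gpu:%d' % i), tf.variable_scope('model', reuse=(i > 0)):
        logits = tf.layers.dense(x_shard, 10)
        loss = tf.losses.sparse_softmax_cross_entropy(y_shard, logits)
        tower_grads.append(opt.compute_gradients(loss))

# Average each variable's gradient over the towers and apply it once:
# the "exchange estimates" step, done inside one single graph.
avg_grads = [(tf.reduce_mean(tf.stack([g for g, _ in gvs]), axis=0), gvs[0][1])
             for gvs in zip(*tower_grads)]
train_op = opt.apply_gradients(avg_grads)
```

The same averaging could just as well happen across separate worker processes, which is why the data/model distinction does not determine the in-graph/between-graph distinction.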

As shown in the video, in-graph replication is achieved by assigning different parts of a single graph to different devices. Between-graph replication, by contrast, runs multiple graphs in parallel, which is achieved using distributed TensorFlow.
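
Here is a rough sketch of between-graph replication with distributed TensorFlow (host names and the task index are placeholders; in practice each process would get its own values from the cluster configuration). Every worker runs the same client code, each building its own copy of the graph, while the shared variables are pinned to a parameter server:

```python
import tensorflow as tf  # TF 1.x distributed API

# Placeholder cluster: one parameter server and two workers.
cluster = tf.train.ClusterSpec({
    'ps':     ['ps0.example.com:2222'],
    'worker': ['worker0.example.com:2222', 'worker1.example.com:2222'],
})
# Every process starts a server for its own job/task.
server = tf.train.Server(cluster, job_name='worker', task_index=0)

# replica_device_setter pins variables to the ps job and ops to this
# worker, so each worker's private graph shares the same parameters.
with tf.device(tf.train.replica_device_setter(cluster=cluster)):
    x = tf.placeholder(tf.float32, [None, 100])
    logits = tf.layers.dense(x, 10)
    # ...loss, optimizer, and train_op as in a single-machine graph...

# Each worker trains in its own session, on its own shard of the data.
with tf.Session(server.target) as sess:
    sess.run(tf.global_variables_initializer())
```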

BlueSun