
Hi, I am an old hand at programming but very new to Julia, so the answer may be obvious.

Forward Mode AD is often compared with the forward pass of a neural net, and Reverse Mode AD with back propagation, and clearly you cannot replace back propagation with a forward pass. Forward and Reverse Mode AD both compute the gradient. But are they the same function, or, ignoring efficiency, is Reverse Mode doing something that Forward Mode is not? Alternatively, are there applications using Reverse Mode where, ignoring efficiency, Forward Mode could not be used?

The reason for asking is that the paper https://richarde.dev/papers/2022/ad/higher-order-ad.pdf, Provably Correct, Asymptotically Efficient, Higher-Order Reverse-Mode Automatic Differentiation, presents Reverse Mode AD as a correct and efficient way to compute the gradient when there is a large number of inputs. Yet their definition computes the same function as Forward Mode AD.

Any correction of my understanding much appreciated.

Clarification:

Let Forward Algorithm mean an algorithm that takes dx and computes df(x), and Reverse Algorithm mean an algorithm that takes df(x) and computes a possible dx.
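
To make the two directions concrete, here is a tiny hand-worked toy example (my own, not taken from the paper) for f(x) = x[1]^2 + 3*x[2] at x = (1, 2), whose gradient is (2, 3):

```julia
# Hand-written Forward and Reverse Algorithms for f(x) = x[1]^2 + 3*x[2],
# whose Jacobian at x is the row vector J = [2*x[1]  3].

# Forward Algorithm: takes a perturbation dx of the input and computes the
# induced change in the output, df = J * dx.
forward(x, dx) = 2 * x[1] * dx[1] + 3 * dx[2]

# Reverse Algorithm: takes a sensitivity dy on the (scalar) output and computes
# the induced sensitivities on the inputs, dx = J' * dy.
reverse_(x, dy) = (2 * x[1] * dy, 3 * dy)

x = (1.0, 2.0)
forward(x, (1.0, 0.0))   # 2.0 -> first component of the gradient
forward(x, (0.0, 1.0))   # 3.0 -> second component of the gradient
reverse_(x, 1.0)         # (2.0, 3.0) -> the whole gradient in one call
```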

The easiest Automatic Differentiation algorithm to understand is a Forward Algorithm called Forward Mode Automatic Differentiation, but it is not efficient when there is a large vector of inputs (x). Hence Reverse Algorithms were invented that are efficient with a large vector of inputs; these were called Reverse Mode Automatic Differentiation. These Reverse Algorithms appear much harder both to understand and to implement.

In the cited paper a Haskell forward algorithm for Automatic Differentiation is given that is efficient with a large vector of inputs. Because of that efficiency it was, not unreasonably, called Reverse Mode Automatic Differentiation. On the assumption both that the authors are correct and that their algorithm can be implemented for Julia programs, does this mean that Reverse Algorithms are no longer useful? Or are there use cases (not Automatic Differentiation) for which Reverse Algorithms are still needed?

david streader
  • It depends on what you mean by "computes the same function". The same numbers (up to numerics) will fall out at the end; the operations are different, though. In essence, it's a question of matrix multiplication: do you compute `Jv` for all basis vectors `v`, or `dJ'` for a scalar `d` to get the whole gradient at once? – phipsgabler Feb 24 '22 at 07:45
  • Are you saying that, ignoring efficiency, wherever Reverse Mode AD is used, Forward Mode AD could be used in its place? If not, do you regard the algorithm in the cited paper as being Forward Mode AD and not Reverse Mode AD? Or are you saying that Reverse Mode AD takes as input what Forward Mode AD outputs, and outputs what Forward Mode AD takes as input? – david streader Feb 24 '22 at 18:53

1 Answer


It depends on what you mean by "computes the same function". Used the right way, the same numbers (up to numerical error) will fall out at the end: the gradient. The operations involved are different, though.

AD always evaluates a certain linear operator at a given point. In the case of forward mode, the operator is the Jacobian J, and for backward mode, it's its adjoint J'. But as you have the operator only in algorithmic form, not in matrix form, you cannot just "transpose it" to get one from the other. Instead, you can evaluate both, but in different bases, to recover the same gradient.

In essence, it's a question of matrix multiplication: do you compute Jv for every one-hot basis vector v, or dJ' for the single scalar seed d = 1 to get the whole gradient at once? The inefficiency of forward mode comes from the fact that for R^n -> R functions, the number of one-hot basis vectors you have to sweep through scales with n.
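
For concreteness, here is a minimal Julia sketch of that comparison (it assumes ForwardDiff.jl and Zygote.jl are installed, and it is not the paper's Haskell construction): the same gradient is recovered from n forward-mode Jacobian-vector products or from one reverse-mode vector-Jacobian product.

```julia
using ForwardDiff, Zygote

f(x) = sum(abs2, x) + prod(x)          # a toy R^n -> R function
x = [1.0, 2.0, 3.0]
n = length(x)

# Forward mode: J*v for each one-hot basis vector v, so n sweeps in total.
basis(i, n) = [j == i ? 1.0 : 0.0 for j in 1:n]
jvp(f, x, v) = ForwardDiff.derivative(t -> f(x .+ t .* v), 0.0)
grad_forward = [jvp(f, x, basis(i, n)) for i in 1:n]

# Reverse mode: seed the scalar output with d = 1 and pull it back once,
# which evaluates d*J' and yields the whole gradient in a single sweep.
y, back = Zygote.pullback(f, x)
grad_reverse = back(1.0)[1]

grad_forward ≈ grad_reverse            # true (up to floating-point error)
```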

So: they are not the same function, but adjoint, and you can recover either result from the other.

Sorry if that's a bit dense, but anything more would become a lecture introducing AD. SO is probably not the right place for that. You should take the time to find some introductory material, though.

phipsgabler
  • Many thanks for your help. I have been reading a lot about Julia and about Automatic Differentiation, both Forward Mode and Reverse Mode, but what I lack is a good enough overview to see the importance of the cited paper. By function I meant a mathematical function: a Domain, a Range, and a subset of Domain × Range. To clarify what I am trying to ask, I have reworded the question in a Clarification that avoids the misleading use of "function". – david streader Feb 28 '22 at 03:55
  • I don't exactly know what you mean, but let me try again: AD is an inherently structural operation. You treat a function as the composition of primitive functions. Forward and backward mode are different ways of building a total derivative function through compositionality of these primitive functions. Given them, you can construct two gradient functions of the same original function, which will be extensionally equivalent, but intensionally different. – phipsgabler Feb 28 '22 at 07:27
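
To make that compositional picture concrete, here is a minimal forward-mode-only dual-number sketch in Julia (an illustration only, not the paper's construction): each primitive states how to propagate a value together with its derivative, and composing the primitives composes the derivatives.

```julia
# A Dual number carries a value and a derivative; overloading the primitives
# (+, *, sin) on Duals is what makes the chain rule happen compositionally.
struct Dual
    val::Float64
    der::Float64
end

Base.:+(a::Dual, b::Dual) = Dual(a.val + b.val, a.der + b.der)
Base.:*(a::Dual, b::Dual) = Dual(a.val * b.val, a.der * b.val + a.val * b.der)
Base.sin(a::Dual) = Dual(sin(a.val), cos(a.val) * a.der)

g(x, y) = x * y + sin(x)

# Partial derivative of g with respect to x at (2, 3): seed x with derivative 1
# and y with derivative 0.
g(Dual(2.0, 1.0), Dual(3.0, 0.0)).der   # 3 + cos(2) ≈ 2.5839
```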