
In my data flow, I query a small subset of a database, use those results to construct about a dozen arrays, and then, given some parameter values, compute a likelihood value. I then repeat this for another subset of the database. I want to compute the gradient of the likelihood function with respect to the parameters but not the data, yet ReverseDiff computes the gradient with respect to all inputs. How can I get around this? Specifically, how can I construct a ReverseDiff.Tape object that differentiates only with respect to the parameters?

TL;DR: How to marry stochastic gradient descent and ReverseDiff? (I'm not wedded to using ReverseDiff. It just seemed like the right tool for the job.)

It seems like this has to be a common coding pattern; it's used all the time in my field, so I must be missing something. Julia's scoping rules seem to undermine the scoped/anonymous-function approach, and ReverseDiff holds on to the original data values in the generated tape instead of using the mutated values.

Here is some sample code showing things that don't work:

using ReverseDiff
using Base.Test


mutable struct data
    X::Array{Float64, 2}
end

const D = data(zeros(Float64, 2, 2))

# baseline known data to compare against
function f1(params)
    X = float.([1 2; 3 4])
    f2(params, X)
end

# X is data, want derivative wrt to params only
function f2(params, X)
    sum(params[1]' * X[:, 1] - (params[1] .* params[2])' * X[:, 2].^2)
end

# store data of interest in D.X so that we can call just f2(params) and get our
# gradient
f2(params) = f2(params, D.X)

# use an inner function and swap out Z's data
function scope_test()
    function f2_only_params(params)
        f2(params, Z)
    end
    Z = float.([6 7; 1 3])
    f2_tape = ReverseDiff.GradientTape(f2_only_params, [1, 2])

    Z[:] = float.([1 2; 3 4])
    grad = ReverseDiff.gradient!(f2_tape, [3,4])
    return grad
end

function struct_test()
    D.X[:] = float.([6 7; 1 3])
    f2_tape = ReverseDiff.GradientTape(f2, [1., 2.])
    D.X[:] = float.([1 2; 3 4])
    grad = ReverseDiff.gradient!(f2_tape, [3., 4.])
    return grad
end

function struct_test2()
    D.X[:] = float.([1 2; 3 4])
    f2_tape = ReverseDiff.GradientTape(f2, [3., 4.])
    D.X[:] = float.([1 2; 3 4])
    grad = ReverseDiff.gradient!(f2_tape, [3., 4.])
    return grad
end

D.X[:] = float.([1 2; 3 4])

@test f1([3., 4.]) == f2([3., 4.], D.X)
@test f1([3., 4.]) == f2([3., 4.])

f1_tape = ReverseDiff.GradientTape(f1, [3,4])
f1_grad = ReverseDiff.gradient!(f1_tape, [3,4])
# fails! the gradient uses the values Z held when the tape was recorded
@test scope_test() == f1_grad
# fails, the gradient uses the values D.X held when the tape was recorded
@test struct_test() == f1_grad
# succeeds, so, not completely random
@test struct_test2() == f1_grad
James

2 Answers


This is currently not possible (sadly). There is a GitHub issue describing the two work-arounds: https://github.com/JuliaDiff/ReverseDiff.jl/issues/36

  • either do not use a prerecorded tape
  • or differentiate with respect to all arguments and ignore the gradient for some of the input parameters (see the sketch just after this list).
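
For concreteness, here is a minimal sketch of the second work-around, reusing the question's f2 and assuming the tuple-of-inputs form of ReverseDiff.GradientTape (the variable names are illustrative):

using ReverseDiff

f2(params, X) = sum(params[1]' * X[:, 1] - (params[1] .* params[2])' * X[:, 2].^2)

params0 = [3.0, 4.0]
X = float.([1 2; 3 4])

# Record (and compile) a tape that treats *both* arguments as differentiable inputs.
tape = ReverseDiff.compile(ReverseDiff.GradientTape(f2, (params0, X)))

# Because X is a tape input rather than a baked-in constant, fresh data can be
# passed on every call; we simply throw away the gradient with respect to it.
results = (similar(params0), similar(X))
ReverseDiff.gradient!(results, tape, (params0, X))
param_grad = results[1]   # gradient w.r.t. params; results[2] (w.r.t. X) is ignored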

I had the same issue, so I used the grad function of Knet instead. It supports differentiation with respect to only one argument, but that argument can be quite flexible (e.g. an array of arrays, or a dict of arrays).
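
For example, with the older Knet/AutoGrad grad API (a hedged sketch; grad(f) returns a function that differentiates f with respect to its first argument, so the data argument is treated as constant):

using Knet   # or: using AutoGrad, which provides grad

f2(params, X) = sum(params[1]' * X[:, 1] - (params[1] .* params[2])' * X[:, 2].^2)

∇f2 = grad(f2)                  # gradient function w.r.t. the first argument only
X = float.([1 2; 3 4])
g = ∇f2([3.0, 4.0], X)          # X is treated as constant data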

Alex338207

Thanks Alex, your answer got me 90% of the way there. AutoGrad (which Knet uses at the time of writing) does provide a very nice interface that I think is natural for most users. However, it turns out that using anonymous functions with ReverseDiff is faster than the approach taken by AutoGrad, for reasons I don't quite understand.

If you follow the chain of issues referenced in what you linked, this seems to be what the ReverseDiff/ForwardDiff folks want people doing:

ReverseDiff.gradient(p -> f(p, non_differentiated_data), params)
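
For instance, with the f2 and D.X from the question (no prerecorded tape, so the closure is re-recorded on every call):

D.X[:] = float.([1 2; 3 4])
grad = ReverseDiff.gradient(p -> f2(p, D.X), [3.0, 4.0])   # gradient w.r.t. p only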

It's certainly disappointing that we can't get a precompiled tape for this incredibly common usage scenario, and maybe future work will change things. But this seems to be where things stand now.
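
As a minimal sketch of how this fits a stochastic-gradient-descent loop (the batching and step size below are illustrative assumptions, not part of the answer):

function sgd!(params, batches; lr = 1e-3, epochs = 10)
    for _ in 1:epochs, X in batches
        # A fresh closure captures the current batch, so no tape is reused.
        g = ReverseDiff.gradient(p -> f2(p, X), params)
        params .-= lr .* g        # plain gradient-descent step on the parameters
    end
    return params
end

batches = [float.(rand(1:9, 2, 2)) for _ in 1:5]   # stand-in for database subsets
sgd!([3.0, 4.0], batches)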

Some references for those interested in further reading:

James