
Say I have a bivariate function, for example: z = x^2 + y^2. I learned that on Keras I can compute nth-order derivatives using Lambda layers:

def bivariate_function(x, y):

    x2 = Lambda(lambda u: K.pow(u,2))(x)
    y2 = Lambda(lambda u: K.pow(u,2))(y)

    return Add()([x2,y2])

def grad(y,x):
    return Lambda(lambda u: K.gradients(u[0],u[1]))([y,x])

f = bivariate_function(x,y)
df_dx = grad(f,x)      # 1st derivative wrt x
df_dy = grad(f,y)      # 1st derivative wrt y
df_dx2 = grad(df_dx,x) # 2nd derivative wrt x
df_dy2 = grad(df_dy,y) # 2nd derivative wrt y
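
For reference, the values these derivatives should take can be cross-checked without Keras. The sketch below is plain Python using central finite differences; the helper names (d_dx, d_dy), the step size h, and the evaluation point are assumptions made for this illustration, not part of any Keras API.

```python
def f(x, y):
    # The example function from above: z = x^2 + y^2
    return x**2 + y**2

def d_dx(g, h=1e-3):
    """Return a function approximating dg/dx by central differences."""
    return lambda x, y: (g(x + h, y) - g(x - h, y)) / (2 * h)

def d_dy(g, h=1e-3):
    """Return a function approximating dg/dy by central differences."""
    return lambda x, y: (g(x, y + h) - g(x, y - h)) / (2 * h)

# Nesting the operators gives higher-order derivatives,
# mirroring grad(grad(f, x), x) in the Keras version.
df_dx_num = d_dx(f)
df_dy_num = d_dy(f)
d2f_dx2_num = d_dx(df_dx_num)
d2f_dy2_num = d_dy(df_dy_num)

x0, y0 = 1.5, -2.0  # arbitrary evaluation point
print(df_dx_num(x0, y0))    # analytic value: 2*x0 = 3
print(df_dy_num(x0, y0))    # analytic value: 2*y0 = -4
print(d2f_dx2_num(x0, y0))  # analytic value: 2
print(d2f_dy2_num(x0, y0))  # analytic value: 2
```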

However, how do I apply this approach to the derivatives of a NN output wrt its inputs in the loss function? I can't (?) simply feed two inputs into a dense layer (like the ones created above).

For example, trying to use as the loss the sum of the first derivative wrt the first variable and the second derivative wrt the second variable (i.e. d/dx + d²/dy²), using Input(shape=(2,)), I managed to arrive here:

import numpy as np
import tensorflow as tf
from keras.models import *
from keras.layers import *
from keras import backend as K

def grad(f, x):
    return Lambda(lambda u: K.gradients(u[0], u[1]), output_shape=[2])([f, x])

def custom_loss(input_tensor,output_tensor):
    def loss(y_true, y_pred):

        df1 = grad(output_tensor,input_tensor)
        df2 = grad(df1,input_tensor)
        df = tf.add(df1[0,0],df2[0,1])      

        return df
    return loss

input_tensor = Input(shape=(2,))
hidden_layer = Dense(100, activation='relu')(input_tensor)
output_tensor = Dense(1, activation='softplus')(hidden_layer)

model = Model(input_tensor, output_tensor)
model.compile(loss=custom_loss(input_tensor,output_tensor), optimizer='sgd')

xy = np.mgrid[-3.0:3.0:0.1, -3.0:3.0:0.1].reshape(2,-1).T
model.fit(x=xy,y=xy, batch_size=10, epochs=100, verbose=2)

But it just feels like I'm not doing it the proper way. Even worse, after the first epoch I'm getting nothing but NaNs.

Lucas Farias
    Can't run your code. You renamed the `grad()` function to `derivative()` but the calls are still to `grad()`. Also, how do you get `x` from `input_tensor`. Can you please post a runnable, one-piece code I can just copy-paste to see what's going on? – Peter Szoldan Apr 26 '18 at 10:57
    @PeterSzoldan just edited with a working snippet! Thanks for helping! – Lucas Farias Apr 26 '18 at 14:17

1 Answer


The main issue here is theoretical.

You're trying to minimize d(output_tensor)/dx + d²(output_tensor)/dx². However, your network just linearly combines the inputs and passes them through relu and softplus activations. Softplus adds a bit of a twist, but it also has a monotonically increasing derivative. Therefore, to make the derivative as small as possible (that is, a really large negative number), the network will just scale the input up as much as it can with negative weights, at some point reaching NaN. I reduced the first layer to 5 neurons and ran the model for 2 epochs, and the weights became:

('dense_1',
[array([[ 1.0536456 , -0.32706773, 0.0072904 , 0.01986691, 0.9854533 ],
[-0.3242108 , -0.56753945, 0.8098554 , -0.7545874 , 0.2716419 ]],
dtype=float32),
array([ 0.01207507, 0.09927677, -0.01768671, -0.12874101, 0.0210707 ], dtype=float32)])

('dense_2', [array([[-0.4332278 ], [ 0.6621602 ], [-0.07802075], [-0.5798264 ], [-0.40561703]],
dtype=float32),
array([0.11167384], dtype=float32)])

You can see that the second layer keeps a negative sign where the first has a positive, and vice versa. (Biases don't get any gradient because they don't contribute to the derivative. Well, not exactly true because of the softplus but more or less.)

So you have to come up with a loss function that does not diverge towards extreme parameter values. Otherwise the model will not be trainable: it will just keep increasing the magnitude of the weights until they become NaN.
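
To illustrate the point with a toy case (not the Keras model in question): take f(x) = w * x, so that df/dx = w. The sketch below runs plain gradient descent on a single weight for two hypothetical losses. Using the raw derivative as the loss has no minimum, so w keeps decreasing. Squaring the derivative is one possible bounded-below alternative, shown here only as an assumption about what a trainable loss could look like. The helper name train, the learning rate, and the step count are all made up for the example.

```python
def train(loss_grad, w0=1.0, lr=0.1, steps=100):
    """Plain gradient descent on a single scalar weight w."""
    w = w0
    for _ in range(steps):
        w -= lr * loss_grad(w)
    return w

# Toy model: f(x) = w * x, hence df/dx = w.

# Loss A: the raw derivative, L(w) = w, so dL/dw = 1 everywhere.
# There is no minimum; every step pushes w further negative.
w_raw = train(lambda w: 1.0)

# Loss B: the squared derivative, L(w) = w**2, so dL/dw = 2*w.
# This is bounded below and has its minimum at w = 0.
w_sq = train(lambda w: 2.0 * w)

print("raw derivative loss, final w:    ", w_raw)
print("squared derivative loss, final w:", w_sq)
```

With more steps, the first weight keeps drifting towards minus infinity, while the second settles at zero.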

This was the version I ran:

import numpy as np
import tensorflow as tf
from keras.models import *
from keras.layers import *
from keras import backend as K

def grad(f, x):
    return Lambda(lambda u: K.gradients(u[0], u[1]), output_shape=[2])([f, x])

def ngrad(f, x, n):
    if 0 == n:
        return f
    else:
        return Lambda(lambda u: K.gradients(u[0], u[1]), output_shape=[2])([ngrad( f, x, n - 1 ), x])

def custom_loss(input_tensor,output_tensor):
    def loss(y_true, y_pred):

        _df1 = grad(output_tensor,input_tensor)
        df1 = tf.Print( _df1, [ _df1 ], message = "df1" )
        _df2 = grad(df1,input_tensor)
        df2 = tf.Print( _df2, [ _df2 ], message = "df2" )
        df = tf.add(df1,df2)      

        return df
    return loss

input_tensor = Input(shape=(2,))
hidden_layer = Dense(5, activation='softplus')(input_tensor)
output_tensor = Dense(1, activation='softplus')(hidden_layer)

model = Model(input_tensor, output_tensor)
model.compile(loss=custom_loss(input_tensor,output_tensor), optimizer='sgd')

xy = np.mgrid[-3.0:3.0:0.1, -3.0:3.0:0.1].reshape( 2, -1 ).T
#print( xy )
model.fit(x=xy,y=xy, batch_size=10, epochs=2, verbose=2)
for layer in model.layers: print(layer.get_config()['name'], layer.get_weights())
Peter Szoldan