
Abstract
I am trying to build a neural network with a custom training loop. My attempts end in ValueError: No gradients provided for any variable. While trying to figure it out, I found that the error appears when GradientTape sees no connection between the arguments passed to GradientTape.gradient(), which in turn usually happens because not every tensor is watched by default and some of them were never marked with watch().
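For reference, here is a tiny, self-contained sketch (toy values, unrelated to my actual model) of the kind of disconnection I mean: the tape returns None when the requested target has no recorded path to the given variable.

import tensorflow as tf

v = tf.Variable(3.0)

with tf.GradientTape() as tape:
    y = tf.constant(2.0) * tf.constant(2.0)  # never touches v

print(tape.gradient(y, v))  # None - the tape sees no path from v to y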

Problem
The code below raises ValueError: No gradients provided for any variable: (['dense/kernel:0', 'dense/bias:0', 'dense_1/kernel:0', 'dense_1/bias:0', 'dense_2/kernel:0', 'dense_2/bias:0'],). Provided `grads_and_vars` is (None,... because GradientTape.gradient() returns a list of None values instead of gradients, and I cannot tell why.
The code (it is only a draft, so it is still rough):

import numpy as np
import tensorflow as tf
from tensorflow.keras.layers import Dense
import matplotlib.pyplot as plt


# Second derivative via central finite differences; the first two and
# last two entries reuse the nearest computable estimate.
def der2(f, x):
    res = [(f[0] - 2 * f[1] + f[2]) / ((x[1] - x[0]) * (x[2] - x[1]))] * 2
    res.extend(
        [(f[i - 1] - 2 * f[i] + f[i + 1]) / ((x[i] - x[i - 1]) * (x[i + 1] - x[i])) for i in range(2, len(f) - 2)])
    res.extend([(f[-3] - 2 * f[-2] + f[-1]) / ((x[-1] - x[-2]) * (x[-2] - x[-3]))] * 2)
    return res


dim1 = 50
dim2 = 50


# Reshapes each prediction into a dim1 x dim2 grid, sums the second
# derivatives along both axes and adds an exponential source term per grid point.
def loss(model, x, y, training):
    y_ = model(x)

    s = 0
    for inputs, result in zip(x, y_):
        res_matrix = result.numpy()
        res_matrix = np.reshape(res_matrix, (dim1, dim2))
        transposed = res_matrix.transpose()
        si = list()
        sj = list()
        for yi in res_matrix:
            si.append(der2(yi, inputs[:dim1]))
        for yi in transposed:
            sj.append(der2(yi, inputs[dim1:]))
        si = np.array(si) + np.transpose(np.array(sj))
        si = tf.reduce_sum(si)
        for i in range(len(res_matrix)):
            for j in range(len(transposed)):
                si += 30 * np.exp(0.007 * ((inputs[i] - 5) ** 2) * ((inputs[dim1 + j] - 5) ** 2))
        s += abs(si)

    return s / 2500


def grad(model, inputs, targets):
    with tf.GradientTape() as tape:
        loss_value = loss(model, inputs, targets, training=True)
    return loss_value, tape.gradient(loss_value, model.trainable_variables)


num_epochs = 1501

x = np.linspace(0, 10, num=dim1)
x = np.append(x, np.linspace(0, 10, num=dim2))
y = 0
train_dataset = tf.data.Dataset.from_tensor_slices(([[x]], [[y]]))

model = tf.keras.Sequential([
    Dense(100, input_shape=(100,), activation=tf.nn.relu),  # input shape required
    Dense(100, activation=tf.nn.relu),
    Dense(2500, dtype='float64')
])

optimizer = tf.keras.optimizers.SGD(learning_rate=0.00001)

for epoch in range(num_epochs):
    epoch_loss_avg = tf.keras.metrics.Mean()
    # Training loop (the dataset here holds a single example)
    for x, y in train_dataset:
        # Optimize the model
        loss_value, grads = grad(model, x, y)
        optimizer.apply_gradients(zip(grads, model.trainable_variables))
        epoch_loss_avg.update_state(loss_value)

    if epoch % 50 == 0:
        avg_loss = epoch_loss_avg.result()
        optimizer.learning_rate = avg_loss / 1000
        print("Epoch {:03d}: Loss: {:.3f}".format(epoch, avg_loss))

for x, y in train_dataset:
    y_ = model(x)
    plt.plot(x[0], y_[0])
plt.show()

My understanding of what happens

I have three code samples.

  1. The one that works fine:
def der(f, x):
    res = [(f[1] - f[0]) / (x[1] - x[0])] * 2
    res.extend([(f[i] - f[i - 1]) / (x[i] - x[i - 1]) for i in range(2, len(f))])
    return res


def loss(model, x, y, tape, training):
    y_ = model(x)

    s = 0
    for xi, yi in zip(x, y_):
        lq = (1 + 3 * (xi ** 2)) / (1 + xi + xi ** 3)
        s += tf.reduce_sum(abs((der(yi, xi) + (xi + lq) * yi - xi ** 3 - 2 * xi - lq * xi * xi)))

    return s / tf.size(x).numpy()

Since it works fine, I thought the problem was probably that GradientTape only watches the output layer (yi in this sample) but not res_matrix and transposed from the main problem code. So I tried to simulate the same operations in sample #1 by adding yi = yi.numpy() and yi = tf.convert_to_tensor(yi), which turns yi into a new object, and got code sample #2:

  2. This code results in the same error as the main problem code:
def loss(model, x, y, tape, training):
    y_ = model(x)

    s = 0
    for xi, yi in zip(x, y_):
        yi = yi.numpy()
        yi = tf.convert_to_tensor(yi)
        lq = (1 + 3 * (xi ** 2)) / (1 + xi + xi ** 3)
        s += tf.reduce_sum(abs((der(yi, xi) + (xi + lq) * yi - xi ** 3 - 2 * xi - lq * xi * xi)))

    return s / tf.size(x).numpy()

So I thought that simply adding tape.watch(yi) would solve it, but it did not:

  3. This code sample also results in the same error:
def loss(model, x, y, tape, training):
    y_ = model(x)

    s = 0
    for xi, yi in zip(x, y_):
        yi = yi.numpy()
        yi = tf.convert_to_tensor(yi)
        tape.watch(yi)
        lq = (1 + 3 * (xi ** 2)) / (1 + xi + xi ** 3)
        s += tf.reduce_sum(abs((der(yi, xi) + (xi + lq) * yi - xi ** 3 - 2 * xi - lq * xi * xi)))

    return s / tf.size(x).numpy()

Since that did not fix the problem, I clearly misunderstand something about how GradientTape operates.
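
For completeness, here is the pattern from sample #3 boiled down to a few lines (toy values, not my real model); it reproduces the None gradients I am seeing:

import tensorflow as tf

x = tf.Variable([1.0, 2.0])

with tf.GradientTape() as tape:
    y = x * 2.0
    y = tf.convert_to_tensor(y.numpy())  # same round-trip as in sample #3
    tape.watch(y)
    loss = tf.reduce_sum(y)

print(tape.gradient(loss, x))  # None, even with the watch() call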

  • Part of the issue could be that you use numpy operations in your loss function. Gradients will not flow through numpy arrays. Only use tensorflow operations. – jkr Feb 20 '23 at 20:29
  • @jkr Well, you were right, thank you. I've replaced every numpy operation with an equivalent tensorflow operation and it no longer raises the error. However, I still don't understand why code sample #3 would raise it if I turned `yi` back into a tensor and explicitly used watch() on it. – Rabter Feb 20 '23 at 20:39
  • Why do you need `tape.watch(yi)`? – Alberto Sinigaglia Feb 20 '23 at 21:02
  • Regarding code sample 3, `yi.numpy()` will break the flow of gradients, even though it is converted back to a tensor in the next line. – jkr Feb 20 '23 at 21:13
  • [Info] You can try the tensorflow numpy api (https://www.tensorflow.org/guide/tf_numpy). – Innat Feb 20 '23 at 23:05
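
A minimal sketch of the fix the comments point to (toy values, not the original model): keeping every step a TensorFlow op, in contrast with the numpy round-trip shown earlier, lets the tape record the whole chain.

import tensorflow as tf

x = tf.Variable([1.0, 2.0, 3.0])

with tf.GradientTape() as tape:
    y = x * 2.0                      # stays a tf.Tensor, so the tape records it
    loss = tf.reduce_sum(tf.abs(y))

print(tape.gradient(loss, x))        # tf.Tensor([2. 2. 2.], shape=(3,), dtype=float32)

Replacing the numpy calls in der2 and loss (np.reshape, np.array, np.transpose, np.exp) with TensorFlow counterparts such as tf.reshape, tf.transpose, tf.exp and tf.stack follows the same idea.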
