I am trying to write a neural network in Rust + ArrayFire, and while gradient descent works, Adam does not.
fn back_propagate(
    &mut self,
    signals: &Vec<Array<f32>>,
    labels: &Array<u8>,
    learning_rate_alpha: f64,
    batch_size: i32,
) {
    // Start from the output layer's activations and the error at the output.
    let mut output = signals.last().unwrap();
    let mut error = output - labels;

    // Walk the layers backwards, propagating the error signal.
    for layer_index in (0..self.num_layers - 1).rev() {
        let signal = Self::add_bias(&signals[layer_index]);
        let deriv = self.layer_activations[layer_index].apply_deriv(output);
        // delta = element-wise (activation derivative * error), transposed.
        let delta = &(deriv * error).T();
        let matmul = matmul(&delta, &signal, MatProp::NONE, MatProp::NONE);
        // Average over the batch and transpose back to the weight layout.
        let gradient_t = (matmul / batch_size).T();

        match self.optimizer {
            Optimizer::GradientDescent => {
                let weight_update = learning_rate_alpha * gradient_t;
                self.weights[layer_index] -= weight_update;
            }
            Optimizer::Adam => {
                let exponents = constant(2f32, gradient_t.dims());
                // Exponential moving averages of the gradient and the squared gradient.
                self.first_moment_vectors[layer_index] = (&self.beta1[layer_index]
                    * &self.first_moment_vectors[layer_index])
                    + (&self.one_minus_beta1[layer_index] * &gradient_t);
                self.second_moment_vectors[layer_index] = (&self.beta2[layer_index]
                    * &self.second_moment_vectors[layer_index])
                    + (&self.one_minus_beta2[layer_index]
                        * arrayfire::pow(&gradient_t, &exponents, true));
                // Bias correction, dividing by the stored 1 - beta constants.
                let corrected_first_moment_vector = &self.first_moment_vectors[layer_index]
                    / &self.one_minus_beta1[layer_index];
                let corrected_second_moment_vector = &self.second_moment_vectors[layer_index]
                    / &self.one_minus_beta2[layer_index];
                let denominator = sqrt(&corrected_second_moment_vector) + 1e-8;
                let weight_update =
                    learning_rate_alpha * (corrected_first_moment_vector / denominator);
                self.weights[layer_index] -= weight_update;
            }
        }

        // Propagate the error back through this layer's weights and drop the bias column.
        output = &signals[layer_index];
        let err = matmulTT(
            &delta,
            &self.weights[layer_index],
            MatProp::NONE,
            MatProp::NONE,
        );
        error = index(&err, &[seq!(), seq!(1, output.dims()[1] as i32, 1)]);
    }
}
I've stored beta1, beta2, 1 - beta1, and 1 - beta2 in constant arrays for every layer, just to avoid having to recompute them; it doesn't appear to have made any difference.
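In case the setup matters, this is roughly how the per-layer Adam state gets initialized. It's a simplified sketch rather than the real code: in the actual network these are fields on the same struct as weights, and AdamState / init_adam_state are just illustrative names.

use arrayfire::{constant, Array, Dim4};

// Simplified sketch of the per-layer Adam state, not the actual code.
// The beta constants are broadcast to the same shape as each layer's weight
// matrix so the element-wise ops in back_propagate line up.
struct AdamState {
    beta1: Vec<Array<f32>>,
    beta2: Vec<Array<f32>>,
    one_minus_beta1: Vec<Array<f32>>,
    one_minus_beta2: Vec<Array<f32>>,
    first_moment_vectors: Vec<Array<f32>>,
    second_moment_vectors: Vec<Array<f32>>,
}

fn init_adam_state(weight_dims: &[Dim4], beta1: f32, beta2: f32) -> AdamState {
    let mut state = AdamState {
        beta1: Vec::new(),
        beta2: Vec::new(),
        one_minus_beta1: Vec::new(),
        one_minus_beta2: Vec::new(),
        first_moment_vectors: Vec::new(),
        second_moment_vectors: Vec::new(),
    };
    for &dims in weight_dims {
        state.beta1.push(constant(beta1, dims));
        state.beta2.push(constant(beta2, dims));
        state.one_minus_beta1.push(constant(1.0 - beta1, dims));
        state.one_minus_beta2.push(constant(1.0 - beta2, dims));
        // Moment estimates start at zero.
        state.first_moment_vectors.push(constant(0.0f32, dims));
        state.second_moment_vectors.push(constant(0.0f32, dims));
    }
    state
}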
GradientDescent converges with a learning rate of alpha = 2.0; with Adam, however, if I use alpha greater than roughly 0.02, the network appears to get locked in. Funnily enough, it does work if I remove all the hidden layers, which tells me something, but I'm not sure what.
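For reference, the update rule I'm trying to implement is the one from the Adam paper (Kingma & Ba, 2014), which as I understand it is, at step t with gradient g_t:

$$
\begin{aligned}
m_t &= \beta_1\, m_{t-1} + (1 - \beta_1)\, g_t \\
v_t &= \beta_2\, v_{t-1} + (1 - \beta_2)\, g_t^2 \\
\hat{m}_t &= \frac{m_t}{1 - \beta_1^t}, \qquad
\hat{v}_t = \frac{v_t}{1 - \beta_2^t} \\
\theta_t &= \theta_{t-1} - \alpha\, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}
\end{aligned}
$$

with the moment estimates starting at m_0 = v_0 = 0 and epsilon = 1e-8.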