I'm writing a pruning algorithm for tf.keras that simply zeroes out the lowest x-th percentile of weights (by magnitude) in a layer/filter. From what I've read, setting a weight to zero should have the same effect as "removing" it from the network, but even when I set every weight in the network to zero, I observe no decrease in inference time.
If I were to hypothetically set all the weights in a layer to zero, the code would be as follows:
weights = self.model.layers[layer_index].get_weights()
original_shape = weights[0].shape
flat_weights = weights[0].flatten()
for weight_index, weight in enumerate(flat_weights):
    # if weight < self.delta_percentiles[layer_index]:
    flat_weights[weight_index] = 0
weights[0] = np.reshape(flat_weights, original_shape)
weights[1] = np.zeros(np.shape(weights[1]))  # zero the biases as well
self.model.layers[layer_index].set_weights(weights)
Theoretically, the inference time of a model pruned this way should decrease, but I measure no change at all. Am I pruning correctly?
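For reference, here is the percentile-based masking step (the commented-out condition above) as a standalone NumPy sketch, so the intended behaviour is unambiguous. The function name and the use of `np.percentile` on absolute values are my own illustration, not part of my actual class:

```python
import numpy as np

def prune_by_percentile(weights, percentile):
    """Zero out all weights whose magnitude falls below the given percentile.

    Illustrative sketch only: returns a new array, leaving `weights` untouched.
    """
    threshold = np.percentile(np.abs(weights), percentile)
    mask = np.abs(weights) >= threshold  # keep only weights at/above threshold
    return weights * mask

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 4))
pruned = prune_by_percentile(w, 50)  # roughly half the entries become zero
```

Note that this produces a dense array of the same shape that merely contains zeros, which is exactly what `set_weights` receives in my code above.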