You could check whether this paper helps.
They say that if you don't use normalization, you need to train "more carefully", i.e. with a lower learning rate.
Skimming the first pages, I could imagine it works like this:
For some nonlinearities (e.g. sigmoid or tanh), there's a 'good' input range, and batch norm pulls the pre-activations into that range. Inputs with a large magnitude are bad: they push the function into its saturated regions, where the slope is nearly zero and you get "vanishing gradients".
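Here's a toy NumPy sketch of that saturation argument. The pre-activation values are made up, and the "batch norm" step is just mean/variance standardization without the learned scale and shift parameters:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)

# Made-up pre-activations: one set in the 'good' range, one far out in the tails.
small = np.array([-1.0, 0.0, 1.0])
large = np.array([-8.0, 10.0, 12.0])

print(sigmoid_grad(small))  # roughly 0.20-0.25: plenty of slope to learn from
print(sigmoid_grad(large))  # all under ~1e-3: saturated, gradients effectively vanish

# Standardizing the batch pulls the large values back into the high-slope region
# (ignoring batch norm's learned scale/shift here).
normalized = (large - large.mean()) / (large.std() + 1e-5)
print(sigmoid_grad(normalized))  # back to roughly 0.15-0.25
```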
So, if you don't normalize, you need to take smaller steps - a lower learning rate - to avoid 'jumping' to weights that produce such large values inside the net, and you also need to be more careful about how you initialize the weights. I'd guess that with ReLUs this is less of a problem, since they don't saturate for positive inputs (see the sketch below). But please correct me if someone else has had different experiences with ReLUs.
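A quick sketch of both points - again purely illustrative: the layer sizes are arbitrary, and I'm using He-style scaling just as one example of "careful" initialization:

```python
import numpy as np

def relu_grad(x):
    # ReLU'(x) is 1 for x > 0 and 0 otherwise: no saturation for large positive inputs.
    return (x > 0).astype(float)

print(relu_grad(np.array([0.5, 10.0, 1000.0])))     # [1. 1. 1.] -- slope never shrinks
print(relu_grad(np.array([-0.5, -10.0, -1000.0])))  # [0. 0. 0.] -- but negative units go 'dead'

# 'Careful' initialization: scale the weights by 1/sqrt(fan_in) so the pre-activation
# variance stays O(1) per layer instead of growing with the layer width.
fan_in = 256
W = np.random.randn(fan_in, 128) * np.sqrt(2.0 / fan_in)  # He-style scaling for ReLU
x = np.random.randn(64, fan_in)                            # a dummy input batch
print(np.var(x @ W))                                       # ~2, independent of fan_in
```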