First of all, I think you should forget the idea of "ON" or "OFF", because that is not usually how it works: the result of such a function does not have to be binary. Threshold activation functions exist, but they are not the only ones. The sigmoid function, for instance, maps the reals to the open interval (0, 1). Once it is applied, and unless you add a threshold on top of it, your neuron always outputs something, however small or large, that is neither 0 nor 1.
Take the example of the linear activation function: the output can then be any real number, so the idea of on/off isn't relevant at all.
The goal of such a function is to add complexity to the model and to make it non-linear. If you had a neural network without these functions, the output would just be a linear weighted sum of the inputs plus a bias, which is often not expressive enough to solve problems (the example of simulating a XOR gate with a network is often used; you won't manage it without activation functions). With activation functions, you can use whichever non-linearity you want: tanh, sigmoid, ReLU...
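To see the linearity point concretely, here is a small NumPy sketch (the weights and inputs are made-up numbers, purely for illustration): stacking two layers with no activation collapses into a single linear map, while putting a sigmoid in between does not.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Made-up weights/biases for two tiny layers (illustrative only)
W1, b1 = np.array([[0.5, -1.0], [0.2, 0.3]]), np.array([0.1, -0.2])
W2, b2 = np.array([[1.5, 0.7]]), np.array([0.05])

x = np.array([0.4, 0.9])

# Without an activation, two layers are equivalent to ONE linear layer:
no_activation = W2 @ (W1 @ x + b1) + b2
W_combined, b_combined = W2 @ W1, W2 @ b1 + b2
print(np.allclose(no_activation, W_combined @ x + b_combined))  # True

# With a sigmoid in between, the output is no longer a linear map of x:
with_activation = W2 @ sigmoid(W1 @ x + b1) + b2
```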
That being said, the answer is 1 and 3.
If you take a random neuron $n$ in a hidden layer, its input is a sum of values weighted by weights, plus a bias (also multiplied by a weight, often called $w_0$); the neuron then applies the activation function to that sum. Imagine the weighted values coming from the previous neurons are 0.5 and 0.2, and the weighted bias is 0.1. You then apply a function, let's take the sigmoid, to 0.5 + 0.2 + 0.1 = 0.8. That gives about 0.69.
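In code, that single neuron does nothing more than this (0.5, 0.2 and 0.1 are the made-up values from the example above):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

weighted_inputs = 0.5 + 0.2   # weighted values from the previous layer
weighted_bias = 0.1           # bias * w0
print(sigmoid(weighted_inputs + weighted_bias))  # ~0.69 (0.6899...)
```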
The output of the neuron is the result of that function. Each neuron of the next layer will compute a weighted sum of the outputs of the current layer, including the output of our neuron. Note that each neuron of the next layer has its own weights between the previous layer and itself. Then, the neurons of the next layer apply an activation function (not necessarily the same as the current layer's) to produce their own outputs. So, informally, such a neuron does something like activ_func(.. + .. + 0.69*weight_n + ..).
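Continuing the example, one neuron of the next layer would do something like this (its weights and the other outputs of the current layer are invented, just to show the shape of the computation):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Invented weights of ONE neuron in the next layer (one per neuron
# of the current layer), plus an invented weighted bias
w = [0.3, -0.8, 1.2]
outputs_current_layer = [0.69, 0.45, 0.12]   # 0.69 is our neuron's output
weighted_bias = 0.2

z = sum(wi * oi for wi, oi in zip(w, outputs_current_layer)) + weighted_bias
print(sigmoid(z))   # that neuron's own output, passed on to the layer after
```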
In other words, each layer takes as its values the result of the activation function applied to the weighted sum of the values of the neurons of the previous layer, plus a weighted bias. If you managed to read that without suffocating, you can apply this definition recursively for each layer (except the input layer, of course).
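Written compactly, with $a^{(l)}$ the vector of values of layer $l$, $W^{(l)}$ its weight matrix, $b^{(l)}$ its bias term and $f^{(l)}$ its activation function (this is just the usual notation for the sentence above):

$$a^{(l)} = f^{(l)}\!\left(W^{(l)} a^{(l-1)} + b^{(l)}\right), \qquad a^{(0)} = x \text{ (the input).}$$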