9

For example, we always assume that the data or signal error follows a Gaussian distribution. Why?

laotao

7 Answers

22

The answer you'll get from mathematically minded people is "because of the central limit theorem". This expresses the idea that when you take a bunch of random numbers from almost any distribution* and add them together, you will get something approximately normally distributed. The more numbers you add together, the more normally distributed it gets.

I can demonstrate this in Matlab/Octave. If I generate 1000 random numbers between 1 and 10 and plot a histogram, I get something like this:

[Figure: histogram of the 1,000 single random numbers]

If instead of generating a single random number, I generate 12 of them and add them together, and do this 1000 times and plot a histogram, I get something like this:

[Figure: histogram of the 1,000 sums, which is approximately bell-shaped]

I've plotted a normal distribution with the same mean and variance over the top, so you can get an idea of how close the match is. You can see the code I used to generate these plots at this gist.
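The gist contains the author's Matlab/Octave code; here is an equivalent sketch of the same experiment in Python/NumPy (my own rendering of the idea, not the gist itself):

```
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)

# 1,000 single draws, uniform between 1 and 10.
single = rng.uniform(1, 10, size=1000)

# 1,000 sums of 12 such draws: the sums already look roughly normal.
sums = rng.uniform(1, 10, size=(1000, 12)).sum(axis=1)

fig, (ax1, ax2) = plt.subplots(1, 2)
ax1.hist(single, bins=30, density=True)
ax2.hist(sums, bins=30, density=True)

# Overlay a normal density with the same mean and variance as the sums.
grid = np.linspace(sums.min(), sums.max(), 200)
mu, sigma = sums.mean(), sums.std()
ax2.plot(grid, np.exp(-(grid - mu) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi)))
plt.show()
```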

In a typical machine learning problem you will have errors from many different sources (e.g. measurement error, data entry error, classification error, data corruption...) and it's not completely unreasonable to think that the combined effect of all of these errors is approximately normal (although of course, you should always check!)

More pragmatic answers to the question include:

  • Because it makes the math simpler. The probability density function for the normal distribution is an exponential of a quadratic. Taking the logarithm (as you often do, because you want to maximize the log likelihood) gives you a quadratic. Differentiating this (to find the maximum) gives you a set of linear equations, which are easy to solve analytically (see the short sketch after this list).

  • It's simple - the entire distribution is described by two numbers, the mean and variance.

  • It's familiar to most people who will be reading your code/paper/report.
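To make the "simpler math" point concrete (a minimal sketch assuming a linear model, which is not spelled out in the answer above): for observations $y_i = x_i^\top\theta + \epsilon_i$ with i.i.d. $\epsilon_i \sim \mathcal{N}(0,\sigma^2)$, the log-likelihood is

$$\log L(\theta) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}\left(y_i - x_i^\top\theta\right)^2,$$

which is a quadratic in $\theta$. Setting its gradient to zero yields the normal equations $X^\top X\,\theta = X^\top y$, a linear system with a closed-form solution.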

It's generally a good starting point. If you find that your distributional assumptions are giving you poor performance, then maybe you can try a different distribution. But you should probably look at other ways to improve the model's performance first.

*Technical point - it needs to have finite variance.

Chris Taylor
18

Gaussian distributions are the most "natural" distributions. They show up everywhere. Here is a list of the properties that make me think that Gaussians are the most natural distributions:

  • The sum of several random variables (like dice) tends to be Gaussian as noted by nikie. (Central Limit Theorem).
  • There are two natural ideas that appear in machine learning: the standard deviation and the maximum entropy principle. If you ask, "Among all distributions with standard deviation 1 and mean 0, which is the distribution with maximum entropy?", the answer is the Gaussian (a quick numerical check follows this list).
  • Randomly select a point inside a high dimensional hypersphere. The distribution of any particular coordinate is approximately Gaussian. The same is true for a random point on the surface of the hypersphere.
  • Take several samples from a Gaussian distribution. Compute the discrete Fourier transform of the samples. The results have a Gaussian distribution. I am pretty sure that the Gaussian is the only distribution with this property.
  • The eigenfunctions of the Fourier transform are products of Hermite polynomials and Gaussians.
  • The solution to the differential equation y' = -xy is a Gaussian. This fact makes computations with Gaussians easier. (Higher derivatives involve Hermite polynomials.)
  • I think Gaussians are the only distributions closed under multiplication, convolution, and linear transformations.
  • Maximum likelihood estimators for problems involving Gaussians tend to also be the least-squares solutions.
  • I think all solutions to stochastic differential equations involve Gaussians. (This is mainly a consequence of the Central Limit Theorem.)
  • "The normal distribution is the only absolutely continuous distribution all of whose cumulants beyond the first two (i.e. other than the mean and variance) are zero." - Wikipedia.
  • For even n, the nth central moment of the Gaussian is simply an integer multiplied by the standard deviation to the nth power.
  • Many of the other standard distributions are strongly related to the Gaussian (e.g. binomial, Poisson, chi-squared, Student t, Rayleigh, logistic, log-normal, hypergeometric ...)
  • "If X1 and X2 are independent and their sum X1 + X2 is distributed normally, then both X1 and X2 must also be normal" -- From the Wikipedia.
  • "The conjugate prior of the mean of a normal distribution is another normal distribution." -- From the Wikipedia.
  • When using Gaussians, the math is easier.
  • The Erdős–Kac theorem implies that the distribution of the number of distinct prime factors of a "random" large integer is approximately Gaussian.
  • The velocity components of random molecules in a gas are distributed as a Gaussian. (With standard deviation = z*sqrt(kT/m), where z is a constant and k is Boltzmann's constant.)
  • "A Gaussian function is the wave function of the ground state of the quantum harmonic oscillator." -- From Wikipedia
  • Kalman Filters.
  • The Gauss–Markov theorem.
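To illustrate the maximum-entropy property from the list above numerically (a minimal sketch using SciPy; the particular comparison distributions are my choice, not the answerer's): among a few zero-mean, unit-variance distributions, the Gaussian has the largest differential entropy.

```
import numpy as np
from scipy import stats

# Differential entropy (in nats) of three zero-mean, unit-variance distributions.
print("normal :", stats.norm(0, 1).entropy())                            # ~1.419
print("laplace:", stats.laplace(0, 1 / np.sqrt(2)).entropy())            # ~1.347
print("uniform:", stats.uniform(-np.sqrt(3), 2 * np.sqrt(3)).entropy())  # ~1.242
```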

This post is cross posted at http://artent.net/blog/2012/09/27/why-are-gaussian-distributions-great/

Hans Scundal
  • *I think all solutions to stochastic differential equations involve Gaussians.* -- Isn't that because SDEs are most often defined using a Brownian motion for the stochastic part? Since Brownian motion has Gaussian increments, it's not surprising that the solution typically involves a Gaussian! – Chris Taylor Sep 27 '12 at 11:11
4

The signal error is often a sum of many independent errors. For example, in a CCD camera you could have photon noise, transmission noise, digitization noise (and maybe more) that are mostly independent, so the total error will often be normally distributed due to the central limit theorem.
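As a quick numerical illustration (a sketch with made-up noise parameters, not a calibrated CCD model): summing a few independent, individually non-Gaussian error sources already gives a total error that is much closer to Gaussian than its worst component.

```
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 100_000

# Three independent, individually non-Gaussian error sources (illustrative parameters only).
photon = rng.poisson(lam=5, size=n) - 5        # shot noise, mean-centred
quant  = rng.uniform(-0.5, 0.5, size=n)        # digitization error
trans  = rng.exponential(1.0, size=n) - 1.0    # skewed transmission error, mean-centred

total = photon + quant + trans

# Skewness and excess kurtosis of the total are far smaller than those of the worst component.
for name, x in [("photon", photon), ("quant", quant), ("trans", trans), ("total", total)]:
    print(f"{name:6s} skew={stats.skew(x):+.2f}  excess kurtosis={stats.kurtosis(x):+.2f}")
```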

Also, modeling the error as a normal distribution often makes calculations very simple.

Niki
2

I had the same question: "What is the advantage of applying a Gaussian transformation to the predictors or the target?" In fact, the caret package has a pre-processing step that enables this transformation.

Here is my understanding:

1) Data in nature often approximately follows a normal distribution (a few examples: age, income, height, weight, etc.). So it is the best approximation when we are not aware of the underlying distribution.

2) Most often the goal in ML/AI is to make the data linearly separable, even if that means projecting the data into a higher-dimensional space so as to find a fitting "hyperplane" (for example, SVM kernels, neural-net layers, softmax). The reason is that linear boundaries help reduce variance and are the most simple, natural and interpretable, besides reducing mathematical/computational complexity. And when we aim for linear separability, it is always good to reduce the effect of outliers, influential points and leverage points. Why? Because the hyperplane is very sensitive to influential points and leverage points (a.k.a. outliers). To understand this, let's shift to a 2D space where we have one predictor (X) and one target (y), and assume there exists a good positive correlation between X and y. Given this, if our X is normally distributed and y is also normally distributed, you are most likely to fit a straight line anchored by the many points centred in the middle rather than by the end-points (a.k.a. outliers, leverage/influential points). So the fitted regression line will most likely suffer little variance when predicting on unseen data.

Extrapolating the above understanding to an n-dimensional space, fitting a hyperplane to make things linearly separable really does make sense, because it helps in reducing the variance.
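To illustrate the sensitivity to leverage points mentioned above (a small NumPy sketch of my own, not part of the answer): a single extreme point can drag the slope of an ordinary least-squares fit far from the value suggested by the bulk of the data.

```
import numpy as np

rng = np.random.default_rng(0)

# Clean data: X roughly Gaussian, y positively correlated with X (true slope 2).
x = rng.normal(0, 1, 100)
y = 2.0 * x + rng.normal(0, 0.5, 100)
slope_clean = np.polyfit(x, y, 1)[0]

# Add one high-leverage outlier far from the bulk of the data.
x_out = np.append(x, 10.0)
y_out = np.append(y, -20.0)
slope_outlier = np.polyfit(x_out, y_out, 1)[0]

print(slope_clean, slope_outlier)  # the single extreme point drags the slope toward zero
```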

Ravindra
1
  1. The math often would not come out. :)

  2. The normal distribution is very common. See nikie's answer.

  3. Even non-normal distributions can often be looked at as a normal distribution with a large deviation. Yes, it's a dirty hack.

The first point might look funny, but I did some research on problems where we had non-normal distributions and the maths got horribly complicated. In practice, computer simulations are often carried out to "prove the theorems".

Ali
0

Why it is used a lot in machine learning is a great question, since the usual justifications of its use outside mathematics are often bogus.

You will see people giving the standard explanation of the normal distribution by way of the "central limit theorem".

However, there is a problem with that.

What you find with many things in the real world is that the conditions of this theorem are often not met ... not even closely. Despite that, these things APPEAR to be normally distributed!

So I am not talking ONLY about things that do not appear normally distributed, but also about those that do.

There is a long history about this in statistics and the empirical sciences.

Still, there is also a lot of intellectual inertia and misinformation about the central limit theorem explanation that has persisted for decades. I guess that may be part of the answer.

Even though normal distributions may not be as normal as once thought, there must be some natural basis for times when things are distributed this way.

The best, but not entirely adequate, reasons are maximum entropy explanations. The problem here is that there are different measures of entropy.

Anyway, machine learning may just have developed with a certain mindset, along with confirmation bias from data that happens to fit Gaussians.

mszlazak
0

I recently read an interesting perspective on this in David MacKay's book "Information Theory, Inference, and Learning Algorithms", Chapter 28, which I'll briefly summarize here.

Say we want to approximate the posterior probability of a parameter given some data, P(w|D). A reasonable approximation is a Taylor series expansion around some point of interest. A good candidate for this point is the maximum of the distribution, w* (e.g. the maximum-likelihood estimate). Using the 2nd-order Taylor series expansion of the log-probability of P at w*:

log P(w|D) = log P(w*|D) + ∇log P(w*|D)·(w-w*) + (1/2)(w-w*)^T ∇∇log P(w*|D) (w-w*) + higher-order terms

Since w* is a maximum, ∇log P(w*|D) = 0. Defining Γ = -∇∇log P(w*|D) (the negative Hessian of the log-probability at w*), we have:

log P(w|D) ≈ log P(w*|D) - (1/2)(w-w*)^T Γ (w-w*).

Exponentiating both sides gives:

P(w|D) ≈ c·exp(-(1/2)(w-w*)^T Γ (w-w*))

where c = P(w*|D). So:

The Gaussian N(w*, Γ^(-1)) is the second-order Taylor series approximation of any given distribution around its maximum,

where w* is the maximum of the distribution and Γ is the negative Hessian of its log-probability at w*.
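As a quick numerical check of this statement (a toy example of my own, not taken from MacKay's book): approximate a Beta(5, 3) density, standing in for a posterior over w in (0, 1), by a Gaussian centred at its mode.

```
import numpy as np
from scipy import optimize, stats

a, b = 5, 3
log_post = stats.beta(a, b).logpdf   # stand-in for log P(w|D)

# 1) Find the maximum w* numerically.
res = optimize.minimize_scalar(lambda w: -log_post(w), bounds=(1e-6, 1 - 1e-6), method="bounded")
w_star = res.x                       # analytic mode: (a - 1) / (a + b - 2) = 2/3

# 2) Gamma = negative second derivative of the log-density at w* (central finite difference).
h = 1e-5
gamma = -(log_post(w_star + h) - 2 * log_post(w_star) + log_post(w_star - h)) / h**2

# 3) Gaussian approximation N(w*, 1/Gamma).
approx = stats.norm(w_star, np.sqrt(1 / gamma))
print("mode:", w_star, " std:", approx.std())
print("true pdf at 0.6  :", stats.beta(a, b).pdf(0.6))   # ~2.18
print("approx pdf at 0.6:", approx.pdf(0.6))             # ~1.95, reasonably close near the mode
```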

idnavid