Manipulate data to better fit a Gaussian distribution

Question

I have a question about the normal distribution (with mu = 0 and sigma = 1).

Let's say that I first call randn or normrnd this way:

x = normrnd(0,1,[4096,1]); % x = randn(4096,1)

Now, to assess how well the values of x fit the normal distribution, I call

[a,b] = normfit(x);

and, for a graphical check,

histfit(x)

Now to the core of the question: if I am not satisfied with how well x fits the given normal distribution, how can I adjust x so that it better fits the expected normal distribution with mean 0 and standard deviation 1? Sometimes, because of the small number of samples (i.e. 4096 in this case), x fits the expected Gaussian really poorly, so I want to manipulate x (linearly or not, it does not really matter at this stage) in order to get a better fit.

I should note that I have access to the Statistics Toolbox.
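
A numerical check can complement the histogram when judging the fit. The following is a minimal sketch, assuming the Statistics Toolbox is available; the fixed seed and the variable names (muHat, sigmaHat, ksstat) are illustrative and not part of the original post. It relies on kstest, which by default compares a sample against the standard normal distribution.

```matlab
% Sketch: quantify how far a sample is from N(0,1) (Statistics Toolbox assumed)
rng(0);                           % fixed seed, only so the run is repeatable
x = randn(4096, 1);

[muHat, sigmaHat] = normfit(x);   % sample mean and standard deviation
[h, p, ksstat] = kstest(x);       % one-sample Kolmogorov-Smirnov test against N(0,1)

fprintf('mean = %.4f, std = %.4f\n', muHat, sigmaHat);
fprintf('KS statistic = %.4f, p-value = %.4f, reject = %d\n', ksstat, p, h);
```

A small KS statistic together with h = 0 means the test finds no evidence against N(0,1) at the default 5% level; normplot(x) gives a graphical counterpart.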

EDIT

I used normrnd and randn in the example because my data are supposed and expected to be normally distributed. Within the question, though, those functions only serve to illustrate my concern.
Would it be possible to apply a least-squares fit?
Generally the distribution I get is similar to the following:

[image: plot of my distribution, not preserved]

Maybe you'll have better luck with quasi-random numbers than with pseudo-random numbers if your data set is small. http://www.mathworks.com/help/stats/generating-quasi-random-numbers.html – Dan, Mar 19 '13 at 10:28
If you show us how your distribution looks, that would help. – Memming, Mar 19 '13 at 14:59
What you uploaded looks like a pretty good fit to me. You probably just need more samples. – Memming, Mar 19 '13 at 15:14
I know that it already represents a normal distribution quite well; the point is that I cannot increase the number of samples. The most I can reach is generally around **2^13** – fpe, Mar 19 '13 at 15:16
@tashuhka: but isn't it possible just to somehow use `lsqcurvefit`? – fpe, Mar 19 '13 at 18:42
From my point of view, your data fits N(0,1) quite well. I am just suggesting that the problem might not be the data but the representation. If you cannot increase the number of samples, you could use overlapping bins to smooth the histogram and virtually increase the number of samples by reusing them in several bins. – tashuhka, Mar 20 '13 at 11:00
In case you have another dataset with a poorer fit, try kernel smoothing with `ksdensity`. – tashuhka, Mar 20 '13 at 11:07
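
The kernel-smoothing suggestion in the comments could look roughly like the sketch below. It assumes the Statistics Toolbox; the evaluation grid, the seed, and the names f, g and maxErr are arbitrary choices for illustration. The idea is to compare a smooth density estimate against the N(0,1) density instead of judging the fit from histogram bins.

```matlab
% Sketch: compare a kernel density estimate with the target N(0,1) density
rng(1);                           % fixed seed, only so the run is repeatable
x = randn(4096, 1);

xi = linspace(-4, 4, 200);        % evaluation grid (arbitrary choice)
f = ksdensity(x, xi);             % kernel density estimate, default bandwidth
g = normpdf(xi, 0, 1);            % target standard normal density
maxErr = max(abs(f - g));         % largest pointwise deviation on the grid

figure;
plot(xi, f, xi, g, '--');
legend('ksdensity estimate', 'N(0,1) density');
```

Unlike a histogram, the estimate does not depend on where the bin edges fall, so a ragged histogram of otherwise well-behaved data tends to look much closer to the target here.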

tashuhka · Accepted Answer · 2013-03-20T10:12:48.770

3

Maybe you can try normalizing your input data to have mean = 0 and sigma = 1, like this:

y=(x-mean(x))/std(x);
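
A short check of what this normalization does and does not change, as a sketch assuming the Statistics Toolbox (for zscore and skewness); the shifted and scaled test data are hypothetical. A linear rescaling sets the sample mean and standard deviation exactly, but it leaves the shape of the distribution (for example its skewness) untouched.

```matlab
% Sketch: linear standardization fixes mean and std but not the shape
rng(2);                           % fixed seed, only so the run is repeatable
x = 3 + 2 * randn(4096, 1);       % hypothetical data with mean 3 and std 2

y = (x - mean(x)) / std(x);       % manual standardization, as in the answer
z = zscore(x);                    % the same operation via the toolbox
maxDiff = max(abs(y - z));        % should be at rounding-error level

sBefore = skewness(x);            % shape statistic before rescaling
sAfter  = skewness(y);            % identical up to rounding after rescaling

fprintf('mean(y) = %.2e, std(y) = %.6f\n', mean(y), std(y));
fprintf('skewness before = %.4f, after = %.4f\n', sBefore, sAfter);
```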

edited Mar 20 '13 at 10:12

answered Mar 19 '13 at 12:51

tashuhka


I'm already playing around with these trivial tricks, which don't really solve the problem. Btw, thanks for the support – fpe Mar 19 '13 at 13:00
2

You should normalize by `std` not `var`. Also could just use `zscore`. – Memming Mar 19 '13 at 14:57

score 1 · Answer 2 · answered Mar 19 '13 at 15:12

If you are searching for a nonlinear transformation that would make your distribution look normal, you can first estimate the cumulative distribution function, then compose it with the inverse of the standard normal CDF. This way you can transform almost any distribution to a normal one through an invertible transformation. Take a look at the example code below.

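x = randn(1000, 1) + 4 * (rand(1000, 1) < 0.5); % some funky bimodal distribution
xr = linspace(-5, 9, 2000); % grid that should cover the whole range of x
% smoothed CDF estimate: cumulative sum of a kernel density, scaled to end at 1
% (named Fhat to avoid shadowing the toolbox function cdf)
Fhat = cumsum(ksdensity(x, xr, 'width', 0.5)); Fhat = Fhat / Fhat(end); % you may want to use a better smoother
c = interp1(xr, Fhat, x); % function composition step 1: estimated CDF at each sample
y = norminv(c); % function composition step 2: inverse standard normal CDF
% take a look at the result
figure;
subplot(2,1,1); hist(x, 100); % original data
subplot(2,1,2); hist(y, 100); % transformed data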

If you don't smooth the empirical CDF, it'll be exactly normal, but what would be the point of doing such a manipulation? :) – Memming, Mar 19 '13 at 15:22
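
The unsmoothed case mentioned in this comment can be sketched with ranks, assuming the Statistics Toolbox (tiedrank, norminv). The offset of 0.5 in the rank-to-probability mapping is one common convention chosen here for illustration, not something from the original answer; it keeps the probabilities strictly inside (0,1) so norminv never returns infinite values.

```matlab
% Sketch: rank-based transform using the raw (unsmoothed) empirical CDF
rng(3);                                          % fixed seed for repeatability
x = randn(1000, 1) + 4 * (rand(1000, 1) < 0.5);  % bimodal sample as in the answer

n = numel(x);
r = tiedrank(x);                  % ranks 1..n (ties receive averaged ranks)
u = (r - 0.5) / n;                % map ranks into the open interval (0,1)
y = norminv(u);                   % inverse standard normal CDF of each probability

fprintf('mean(y) = %.2e, std(y) = %.4f\n', mean(y), std(y));
```

The result depends only on the ordering of x, which is why the comment questions the point: every sample of the same size without ties maps to the same set of values, whatever its original shape.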
