How to find PDF from sample set in MATLAB

Question

I have some observations from an unknown source. Call this set of observations x, for example:

x = [97 , 102.3, 95.05 , 89.1 , 117 , ...]; % this is just an example. The data set could contain anything.

Provided x is large enough, I should be able to say something about the probability distribution function, right?

So how can I do this in MATLAB, so that I can get p(x = 101) or p(x = 5)? The first one will probably be very high.

Any kind of assumption (normal distribution etc.) is OK; I just want a simple answer for probabilities. Maybe I don't even have to know the PDF explicitly. I just need a way to implement p(x = x_star), where x_star is not necessarily a member of x. How can I do this?

Thanks for any help!

My Attempts

The simplest attempt is `length(find(x==x_star))/length(x)`, but this returns zero if, for example, there is no 101 in the observations, even though the shape of the distribution suggests the probability should be high.

Edit:

My function, following Kamtal's answer:

function p = get_probability_from_sample_set(S, X)
% finds the probability that a sample from S is equal to X
[mu,sigma] = normfit(S);
 z = 1:200;
 xfit = normpdf(z,mu,sigma);
 p = xfit(find(z == X)); 
end

`p` returns `[]`. What am I doing wrong?

Check [`hist`](http://es.mathworks.com/help/matlab/ref/hist.html) — Luis Mendo, Nov 01 '14 at 19:37
@Kamtal no they are floats. Luis Mendo, ok, then how do I get the probability from the histogram? — jeff, Nov 01 '14 at 19:45
@halilpazarlama Check for example [here](http://stackoverflow.com/questions/5320677/how-to-normalize-a-histogram-in-matlab) — Luis Mendo, Nov 02 '14 at 04:18
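
For reference, the histogram route suggested in these comments could look roughly like the sketch below. It is only an illustration under assumptions: the data `x`, the query `x_star`, and the choice of 30 bins are made up, and it uses `histcounts` (available from R2014b) rather than `hist`. A histogram gives a density per bin, so the probability is reported for the bin that contains the query, not for the single point.

```matlab
% Sketch: histogram-based density estimate (needs histcounts, R2014b or later).
rng(0);                                  % reproducible example data (assumption)
x = 100 + 10*randn(5000, 1);             % stand-in for the real observations
x_star = 101;                            % hypothetical query value

% 'pdf' normalization scales the bars so that their areas sum to 1.
[dens, edges] = histcounts(x, 30, 'Normalization', 'pdf');

% Locate the bin containing x_star; outside the binned range the estimate is 0.
k = find(x_star >= edges(1:end-1) & x_star < edges(2:end), 1);
if isempty(k)
    p_density = 0;
else
    p_density = dens(k);
end

% Approximate probability of landing in that bin = density * bin width.
binWidth = edges(2) - edges(1);
p_bin = p_density * binWidth;
```

The result depends on the number of bins: too few bins blur the shape, and too many leave empty bins that return zero for values the data plausibly covers.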

Rashid · Answer 1 · 2014-11-01T20:41:10.443


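 x = randi(200,[1000 1]);       % example data: integers in 1..200
 [mu,sigma] = normfit(x);       % fit a normal distribution to the sample
 z = 1:200;                     % integer grid covering the data range
 xfit = normpdf(z,mu,sigma);    % fitted density at each grid point
 X = 101.4;                     % example query value
 p = xfit(z == round(X));       % density at the grid point nearest to X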

If your values are in [0, 0.1]:

 x = randi(1000,[1000 1])/10000;   % example data in (0, 0.1]
 [mu,sigma] = normfit(x);
 z = 0:1e-5:0.1;                   % finer grid for the smaller range
 xfit = normpdf(z,mu,sigma);
 X = 0.04321;                      % example query value
 [~, idx] = min(abs(z - X));       % index of the grid point nearest to X
 p = xfit(idx);

edited Nov 01 '14 at 20:41

answered Nov 01 '14 at 19:37

Rashid


Thanks! This looks right, but does it assume integer values? That was just an example; my actual values are floats, and they range from the order of 1e-3 to 1e3. Will this work for all types of x and x_star? – jeff Nov 01 '14 at 19:46
@halilpazarlama, Since `z = 1:200;` it will give integers, if you change `z = 1:stepsize:200;` you could have access to floats, depending on your data. – Rashid Nov 01 '14 at 19:48
Oh so it should be `z = min(S):stepsize:max(S)` ? Please see the edit to the question. – jeff Nov 01 '14 at 19:50
@halilpazarlama, yes. you can plot `xfit` to see the pdf. – Rashid Nov 01 '14 at 19:51
Ok so does `p = xfit(..)` make sense? Or how do I get the probability? – jeff Nov 01 '14 at 19:53
@halilpazarlama, `p=xfit(find(z == x_star))`, and try `z=unique(x);`, I think that is more efficient. – Rashid Nov 01 '14 at 19:55
Ok but this still gives zero for queries that are not in the sample set, right? – jeff Nov 01 '14 at 19:57
@halilpazarlama, I forgot that you want probabilities for data that aren't in the set. You have to use `z = min(S):stepsize:max(S)` with a step small enough to cover all the values whose probability you want. – Rashid Nov 01 '14 at 19:59
Ok thanks. This works if the query is in z, but still not for all values unless stepsize goes to zero (which runs out of memory). I still think there should be a way that works for **every** query. – jeff Nov 01 '14 at 20:02
@halilpazarlama, Do you have your queries in an array? so we could somehow insert them all in `z`, or they are random? – Rashid Nov 01 '14 at 20:03
No, they are not pre-determined. I want to make a system that gives a probability for any given query. So we can say that the query set is the set of real numbers. **edit** : Maybe we can use the "closest" member of z. – jeff Nov 01 '14 at 20:05
@halilpazarlama, you could also use `p = xfit(find(z == round(X)));`. They will almost be the same anyway. – Rashid Nov 01 '14 at 20:06
no, neither the query set nor the data set is limited to integers. – jeff Nov 01 '14 at 20:08
@halilpazarlama, Ok that is the reason to use `round`. What would be the problem? – Rashid Nov 01 '14 at 20:09
Hmm. What if my data set is between 0 and 0.1? – jeff Nov 01 '14 at 20:17
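
One way around the grid problem discussed in this thread is to skip `z` altogether: `normpdf` accepts any real query directly, so no step size or rounding is needed. The sketch below is an illustration under assumptions, not the answerer's code. The data `S`, the query `X`, and the half-width `h` are made-up values, and it relies on the same Statistics Toolbox functions as the answer (`normfit`, `normpdf`) plus `normcdf`. Note that the value returned by `normpdf` is a density, which can exceed 1 for data on a small scale such as 0 to 0.1; an actual probability needs an interval around the query.

```matlab
% Sketch: evaluate the fitted normal density at any real query, no grid needed.
% Requires the Statistics Toolbox (normfit, normpdf, normcdf).
rng(0);                                  % reproducible example data (assumption)
S = 0.05 + 0.01*randn(1000, 1);          % example data centred near 0.05
[mu, sigma] = normfit(S);                % fit a normal distribution to the sample

X = 0.0512345;                           % hypothetical query, not a member of S
density = normpdf(X, mu, sigma);         % density at X; may be larger than 1

% A probability needs an interval: P(X - h <= x <= X + h).
h = 0.001;                               % hypothetical half-width of the interval
p = normcdf(X + h, mu, sigma) - normcdf(X - h, mu, sigma);
```

For a continuous distribution the probability of hitting one exact value is zero, so `density` is useful for comparing how plausible two queries are, while `p` is what can be read as a probability.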
