3

Consider a 5-variate cumulative distribution function (cdf) which I call F.

I want to sample random 5x1 vectors from this cdf in Matlab. F is not a cdf that has been already implemented in Matlab (such as normal, t-student, etc.). Specifically, it is defined as

enter image description here

I have read several question/answers in this and other forums on how to sample from customised probability distribution functions in Matlab. However,

1) Most of them are for univariate cdf, e.g. here. The idea is to apply inverse transform sampling. My problem is a bit more complicated because I would need to "invert"a 5-variate function.

2) Another option could be to use slicesample as suggested here but I don't know how to write the analytical expression of the probability density function in my case.

3) Here's another idea but specific for the bivariate case.

Could you kindly help me to understand how I can proceed?

TEX
  • 2,249
  • 20
  • 43
  • i am not sure if i understand your problem, you said that you have a CDF which is F(r, sigma_a, sigma_b), then you want to randomly sample from this function ! so your problem is how to make a random 5x1 vector of r and submitting it to F? or your problem is how to transform this CDF to a proper PDF for custom values of r? – Hadi Jun 21 '18 at 23:21
  • @user3285148: The code I posted was correct only for independent probabilities. It picked r1, r2 and r4 correctly, but r3 depends on r2 and r5 on r4, so these did not show the correct distributions. I have fixed the code, it's become a bit more complex now, but it seems correct. – Cris Luengo Jun 23 '18 at 06:57

1 Answers1

5

Your link under #3 gives a hint of the solution. It explains the bivariate case when you have a PDF. Here we'll extend this to any number of dimensions, for the case when you have a CDF.

So the process is:

  1. Compute the marginal CDF for r1.
  2. Take a random sample using this marginal CDF (another link you posted explains how to do this).
  3. Compute the marginal CDF for r2 given r1.
  4. Take a random sample using this marginal CDF
  5. Compute the marginal CDF for r3 given r1 and r2.
  6. etc. etc. You see where this is going.

Note that if you have a PDF, computing marginal distributions involves integrating over the remaining variables. So a marginal distribution for r1 requires integrating over r2..r5, a marginal distribution for r2 given r1 requires integrating over r3..r5, etc.

When you have a CDF, computing the marginal distributions is trivial, as it already integrates the PDF: the marginal distribution for r1 is F(x,∞,∞,∞,∞). However, obtaining the marginal distribution given one or more variables requires differentiating: a marginal distribution for r2 given r1 requires differentiating along r1, a marginal distribution for r3 given r1 and r2 requires differentiating along r1 and r2, etc.

It might be possible to obtain these derivatives analytically (this would be the more efficient solution). Here we use the finite difference derivative approximation instead (this makes it easier to plug in any CDF).

Let's see some MATLAB code:

sigma_a = 0.5;
sigma_b = 0.3;
F = @(r1,r2,r3,r4,r5)exp(-exp(-r1) - (exp(-r2/sigma_a)+exp(-r3/sigma_a)).^sigma_a ...
                                   - (exp(-r4/sigma_b)+exp(-r5/sigma_b)).^sigma_b);

lims = [-5,10]; % This is the area along all dimensions containing 99.99% of the PDF

N = 1000;
values = zeros(N,5);
for n=1:N
   values(n,:) = sample_random(F,5,lims);
end

Here I picked some random values for sigma_a and sigma_b, and used those to define a function F of 5 variables r1..r5. I figured that the domain of the PDF is the same along all dimensions, I found a region slightly larger than really necessary (lims). Next, I obtain 1000 random samples from the distribution F by calling sample_random:

function r = sample_random(F,N,lims)
delta = diff(lims)/10000;
x = linspace(lims(1),lims(2),300);
r = inf(1,N);
for ii = 1:N
   marginal = get_marginal(F,r,ii,x,delta);
   p = rand * marginal(end);
   [~,I] = unique(marginal); % interp1 cannot handle duplicated points, let's remove them
   r(ii) = interp1(marginal(I),x(I),p);
end

delta is the distance we'll use for the finite difference approximation to the derivative. x represents sample points along any one dimension of F.

We first define r as the vector [inf,inf,inf,inf,inf], we'll use this as the sample locations, and at the end of the function it will contain the random value drawn from our distribution.

Next we loop over the 5 dimensions, in each iteration we sample the marginal distribution for dimension ii, given the values for the previous dimensions (which have already been picked). The function get_marginal is below. We pick a random value in between 0 and the max of this marginal CDF (note that the max decreases as we pick values of r for each dimension, when ii==1 the max is 1), and we use this random value to interpolate into the inverse sampled marginal CDF (inverse simply means swapping x and y). I needed to remove repeated values from marginal because it becomes the x in interp1, and this function requires the x values to be unique.

Finally, the function get_marginal:

function marginal = get_marginal(F,r,ii,x,delta)
N = length(r);
marginal = zeros(size(x));
for jj=0:2^(ii-1)-1
   rr = flip(dec2bin(jj,N)-'0');
   sign = mod(sum(rr,2),2);
   if sign == 0
      sign = 1;
   else
      sign = -1;
   end
   args = num2cell(r - delta * rr);
   args{ii} = x;
   marginal = marginal + sign * F(args{:});
end

This contains quite a bit of complication. It samples the CDF along a given dimension ii, at the points x, given the fixed values r(1:ii-1).

The complication comes from computing partial derivatives. If we were to compute the marginal distribution for any one dimension without having picked any fixed values yet, we would simply do e.g.

marginal = F(inf,x,inf,inf,inf);

Having picked one value, we would do

marginal = F(r1,x,inf,inf,inf) - F(r1-delta,x,inf,inf,inf);

(this is the approximation to the partial derivative along the first dimension).

The code in get_marginal does this for ii-1 fixed values. This requires sampling F twice for each of these fixed values, as well as for each combination of delta shifts, a total of n^2 times (for n fixed values). The dec2bin bit is to get all these combinations. sign determines whether to add or subtract the given sample from the running total. args is a cell array with the 5 arguments to the function F, elements 1:ii-1 are the fixed values, element ii is set to x, and elements ii+1:N are inf.


Finally, I draw the marginal distributions of the data set values (which contains the 1000 elements randomly drawn from the CDF), and overlay with the marginal distributions of the CDF, to verify all is correct:

lims = [-2,5];
x = linspace(lims(1),lims(2),300);
figure
for ii=1:5
   subplot(5,1,ii)
   histogram(values(:,ii),'normalization','cdf','BinLimits',lims)
   hold on
   args = num2cell(inf(1,5));
   args{ii} = x;
   plot(x,F(args{:}))
   text(5.2,0.5,['r_',num2str(ii)])
end

marginal distributions, input and sampled

Cris Luengo
  • 55,762
  • 10
  • 62
  • 120
  • I have posted an issue with the answer (but it may be me implementing your procedure wrongly). Please advise if you have time. Thanks – TEX Nov 01 '18 at 17:41
  • @user3285148: It looks like your "comparison" graph is inverted from the other ones. It starts at -6 and ends at 2, the other ones start at -2 and end at 6. A minus sign issue? – Cris Luengo Nov 01 '18 at 18:11
  • @CrisLuengo To draw the comparison I used the pre-built function `evrnd`. – TEX Nov 01 '18 at 18:14
  • @user3285148: That implements a different equation: https://www.mathworks.com/help/stats/extreme-value-distribution.html. Why not compare to `exp(-exp(-r0))`, which is what you expect the output to be? – Cris Luengo Nov 01 '18 at 18:18
  • @CrisLuengo : `evrnd` implements draws from this distribution https://en.wikipedia.org/wiki/Gumbel_distribution (see PDF and CDF formulas on the right). When `sigma=1`, `F` = joint CDF of 3 i.i.d. Gumbel with scale `0` and location `1`. Hence, when I plot each marginal of `F` against the empirical CDF of `evrnd(0,1,N,1)` I should get the same thing. – TEX Nov 01 '18 at 18:23
  • @user3285148: I don't have the stats toolbox, so cannot compare, but the [documentation to `evrnd`](https://www.mathworks.com/help/stats/evrnd.html) links to the page I linked above, that shows a different equation: `exp(x)*exp(-exp(x))` vs your `exp(-exp(-x))` (ignoring all sigma and mu). Why don't you plot your `exp(-exp(-x))` against `evrnd(x)` and see the difference? – Cris Luengo Nov 01 '18 at 18:29
  • Interesting: notice that the first equation here https://en.wikipedia.org/wiki/Gumbel_distribution refers to the PDF and not to the CDF. In my question I reported the CDF. Though, there are flipped signs! In Wikipedia the Gumbel PDF with scale 0 and location 1 is `exp(-x)*exp(-exp(-x))`. In Matlab the Gumbel PDF with scale 0 and location 1 is `exp(x)*exp(-exp(x))`. – TEX Nov 01 '18 at 18:41
  • I need to think more, sorry. I didn't expect this. I will post a question on this flipped sign. – TEX Nov 01 '18 at 18:42
  • @user3285148: [Mathworld](http://mathworld.wolfram.com/GumbelDistribution.html) agrees with MATLAB, I would guess Wikipedia is incorrect with its flipped sign. Interesting discussion on the [Talk page](https://en.wikipedia.org/wiki/Talk:Gumbel_distribution). – Cris Luengo Nov 01 '18 at 18:49
  • Thanks a lot. I guess that is the issue. – TEX Nov 01 '18 at 19:04
  • Could you help me with this related question https://stackoverflow.com/questions/69290732/drawing-from-gev-cumulative-distribution-function-in-matlab – TEX Sep 22 '21 at 20:15