
I would like to compute the inverse cumulative distribution function (inverse CDF) of a given pdf. The pdf is given directly as a histogram, i.e., a vector of N equally spaced components.

My current approach is:

cdf = cumsum(pdf);
K = 3;        %// some upsampling factor
maxVal = 1;   %// just for my own usage - a scaling factor
M = length(cdf);
N = M*K;      %// increase resolution for higher accuracy
y = zeros(N, 1);
cursor = 2;
for i = 1:N
   desiredF = (i-1)/(N-1)*maxVal;
   while (cursor < M && cdf(cursor) < desiredF)
       cursor = cursor + 1;
   end

   if (cdf(cursor) == cdf(cursor-1))
       y(i) = (cursor-1)/maxVal;   %// flat cdf segment: take the left bin
   else
       alpha = min(1, max(0, (desiredF - cdf(cursor-1))/(cdf(cursor) - cdf(cursor-1))));
       y(i) = ((cursor-1)*(1-alpha) + alpha*cursor)/maxVal;
   end

end

y = resample(y, 1, K, 0);

which means that I upsample with linear interpolation, invert, and downsample my histogram. This code is rather ugly, not very robust (changing the upsampling factor can give really different results), and needlessly slow... could anyone suggest a better approach?
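For reference, here is a vectorized sketch of what the loop above computes (the `histc`-based binning and the example data are mine, not part of the code above; it assumes the pdf is a non-negative column vector):

```matlab
%// Vectorized sketch of the same generalized inverse.
%// Assumption: pdf is a non-negative column vector (example data below).
pdf = [0.1; 0.3; 0.2; 0.4];      %// example histogram
cdf = cumsum(pdf) / sum(pdf);    %// normalize so the cdf ends at 1
M   = length(cdf);
t   = linspace(0, 1, M)';        %// probabilities to invert

%// For each t(i), histc finds the first bin k with cdf(k) > t(i),
%// i.e. the generalized inverse, in one pass over the sorted inputs.
[~, idx] = histc(t, [0; cdf]);
idx = min(idx, M);               %// clamp the t == 1 endpoint
invcdf = idx;                    %// bin indices; map to bin centers if needed
```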

Note: the generalized inverse I am trying to compute (for the case where the cdf is not invertible) is:

F^{-1}(t) = \inf \{ x \in \mathbb{R} : F(x) > t \}

with F the cumulative distribution function.

[EDIT: actually, K = 1 (i.e., no upsampling) seems to give more accurate results...]

Thanks!

nbonneel

2 Answers


If your input is specified in the form of a non-normalized histogram, then the built-in quantile() function directly computes the data point for a specified quantile, which is exactly what the inverse CDF does. If the histogram has been normalized by the number of data points (making it a probability vector), then just multiply it by the number of data points first. See the documentation for quantile() for details. Basically, given your histogram/data, the first parameter is fixed, which turns quantile() into a function of only the specified probability values p. You could easily write a wrapper function to make it more convenient if necessary. This also removes the need to explicitly compute the CDF with cumsum().

Added

If we assume the histogram, bins, and number of data points are h, b, and N, respectively, then:

 h1 = N*h; %// Only if histogram frequencies have been normalized.
 data = [];
 for kk = 1:length(h1)
     data = [data repmat(b(kk), 1, h1(kk))];
 end

 %// Set p to the probability you want the inv-cdf for...
 p = 0.5;
 inv_cdf = quantile(data,p)

Added

For solutions that must leverage the existing PDF vector, we can do the following. Assume that x_old and pdf_old are the histogram bins and histogram frequencies, respectively.

 p = 0.5; %// the inv-cdf probability that I want
 num_points_i_want = 100; %// the number of points I want in my histogram vector

 x_new = linspace(min(x_old),max(x_old),num_points_i_want);
 pdf_new = interp1(x_old,pdf_old,x_new);
 cdf_new = cumsum(pdf_new);
 cdf_new = cdf_new / cdf_new(end); %// renormalize so the cdf ends at 1
 inv_cdf = min(x_new(cdf_new >= p));

Alternatively, we could create the cumsum() CDF first and use interp1() on that if it's not desirable to interpolate first.
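For instance, a sketch of that alternative order (the example data is mine; it assumes a strictly positive pdf so the CDF is strictly increasing, which interp1() requires):

```matlab
%// Sketch of the CDF-first alternative. Assumes pdf_old is strictly
%// positive so cdf_old is strictly increasing (interp1 needs distinct
%// abscissae). The example data below is made up.
x_old   = linspace(0, 1, 50);
pdf_old = exp(-((x_old - 0.5)/0.15).^2) + 0.01;

cdf_old = cumsum(pdf_old);
cdf_old = cdf_old / cdf_old(end);        %// normalize to end at 1

p = linspace(cdf_old(1), 1, 100);        %// probabilities to invert
inv_cdf = interp1(cdf_old, x_old, p);    %// interpolate the inverse relation
```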

ely
  • as I understood, the quantile function doesn't work on histograms but rather on samples drawn from a given histogram... isn't it? – nbonneel Feb 08 '12 at 01:24
  • A histogram is a set of frequencies `f` and a set of bins `b`. If you repeat each bin value `b(i)` for `f(i)` number of repetitions, then you've got yourself a synthetic data set that has the exact same empirical distribution as your original, up to binning errors which would affect both the synthetic set and the histogram identically. – ely Feb 08 '12 at 01:34
  • Indeed, thanks. But since the f(i) are real numbers, everything needs to be scaled by some big factor. In the end, this means generating N samples from the histogram (with N possibly very large) and finding the percentiles. Although definitely shorter to write, I am not sure this is much better in terms of complexity – nbonneel Feb 08 '12 at 02:01
  • I am also not sure this is very precise, since we cannot have linear interpolation between the bins of the histogram. A better approach (in terms of precision) would be to really generate N samples from the pdf... which would mean sampling the pdf using the inverse cdf formula, which is a chicken-and-egg problem! ;) (or sampling it with a rejection method, which would handle the linear interpolation in the histogram, but which would add yet another level of complexity) – nbonneel Feb 08 '12 at 02:09
  • `f(i)` won't be real numbers if you just use the regular output of Matlab's `hist()` function -- they'll only become floating point representations if you normalize that vector to make it a probability vector. Additionally, the `quantile()` and `repmat()` functions are pretty efficient. – ely Feb 08 '12 at 02:13
  • You can even improve my code because you'll know specifically how many data points will be appended to `data`, so you can pre-allocate. I agree that there are faster solutions that involve inverting the empirical CDF, but that if you ever hit a limiting size of your data where this matters, then it is a mistake to be using Matlab for such large data in the first place. – ely Feb 08 '12 at 02:13
  • Lastly, the precision is the same regardless. If you sample according to the histogram, you're still only ever generating values from the bins (i.e. all binned histograms offer a reduced-resolution distribution over the original data unless you just have a uniform empirical pdf over every data point). The `quantile()` function handles the linear interpolation when determining that, which is just as well. – ely Feb 08 '12 at 02:16
  • in practice my f(i) don't come from the hist() function and can be real. I agree quantile() handles linear interpolation... but since the data have been generated exactly at the center of each bin, this linear interpolation is not useful (we have *exactly* 10 samples with value 1 and 20 samples with value 2: the quantile function cannot guess that in fact there was a bunch of samples which would have fallen in-between). Also, using this kind of approach, I am not sure this would exactly meet the definition of the generalized inverse that I have given (? the doc about quantile is rather poor). – nbonneel Feb 08 '12 at 02:24
  • Right, I understand that `quantiles()` won't be able to capture how the data falls within a bin, but that's a problem with all possible inversions of a histogram, not any particular Matlab approach. As soon as you snapshot your data into histograms, you destroy all information about how it fell within a bin. Making the assumption of linear interpolations is not any more valid than simply replicating each bin's center value, unless you assume something about the underlying distribution. – ely Feb 08 '12 at 02:29
  • And if you're going to do that, you might as well fit some function to it straight out and then compute quantiles analytically or using a built-in inverse CDF function. But you can't re-capture the smoothness of your distribution with linear interpolation between bins... how do you know that the between-bin data fell linearly? – ely Feb 08 '12 at 02:30
  • Also, I updated a second approach that leverages the given histogram vector and just assumes linear interpolation. But using `interp1()` you can also do cubic, spline, etc. – ely Feb 08 '12 at 02:30
  • I really don't manage to get as smooth results from your (first) code as my code, even using many samples. I really consider the histogram as a discretized version of a continuous and smooth pdf. For your second code, I am now worried : to generate a set of samples which follows a given pdf, we have to use the inverse cdf formula and not directly generate the samples according to the pdf... but here, I don't understand why we can directly use the pdf as it. For your 2nd code, since I want to generate the whole inverse cdf and not just for one point, I would need to do the min() for each point – nbonneel Feb 08 '12 at 03:00
  • I still fail to see your point. That's something you cannot avoid doing! There's not an analytical form for an empirical inverse CDF. You have to find the minimal element of the sample space such that the corresponding entries of the CDF vector are larger than your given probability. You do indeed have to find the min repeatedly. – ely Feb 08 '12 at 03:03
  • Also, this is the precise heart of stochastic sampling problems. To generate a set of samples that follows a given PDF, you need to do Gibbs sampling or Metropolis-Hastings or rejection method... but you cannot just assume that a discretized inverse CDF gives you data according to your distribution. The `hist()` function can't be trusted as a reliably smooth discretization of your data unless you let the number of bins get extremely large.. but then you need even larger amounts of data to avoid small-sample discretization error. – ely Feb 08 '12 at 03:05
  • In short, all of the problems you're mentioning about the code are true, but they are inherent numerical analysis problems with this whole concept. If what you really want is the capability to sample data sets according to your empirical PDF, you can use the built-in sampling functions from discrete distributions, or you can attempt a "real" solution with Markov chain Monte Carlo stuff. [This book](http://www.amazon.com/Monte-Carlo-Strategies-Scientific-Computing/dp/0387952306) is a good place for you to start. – ely Feb 08 '12 at 03:08
  • indeed, I need to find the inverse cdf for each point, thus I need to compute a min() for each point. However, the min() doesn't take into account any consistency between the individual calls, although they are obviously sorted (using quantile, we can directly pass a vector for p, so that's fine). With your first code, I get about 5 times larger error than with my code. I appreciate your help though and will continue to investigate in your direction as well. I know there are no closed forms, but there are sampling theory results (Shannon etc.) and I'd like to have the best approximation of it – nbonneel Feb 08 '12 at 03:09
  • hehe great, that's funny - you're actually just a floor above me (and I just went home, so I can't come and say hello now) :) I upvote your answer and wait a few days to see whether I will have other answers before validating anything. – nbonneel Feb 08 '12 at 03:47
  • I did a short comparison code at : http://pastebin.com/PTbu61L4 I realized that my upsampling factor in fact degrades the result, so I removed it. With the above comparison, depending on the number of samples I draw: either I get an error three times larger with your code, which is in this case 8000 times slower (without taking into account creating the data vector, otherwise it's 138000x slower), or I get an error 111x higher with your code for a slowdown of "only" 12 times.... :s That should explain why I am not yet completely satisfied with this code ;) – nbonneel Feb 08 '12 at 17:43
  • I don't understand how it is so much slower. In the `while` loop where you increment your `cursor` variable, you're doing exactly what Matlab's logical indexing coupled with `min()` would do, except Matlab's built-ins are much faster than that `while` loop. Have you profiled it to see what is being slow? – ely Feb 08 '12 at 18:29
  • This was for the version using quantile, not min(). If I use the version with min(), and repeat it for each value ( http://pastebin.com/Juep0CGy ), it gives 10x larger errors for a 780x slowdown. Even without profiling, this kind of makes sense: the min() function will definitely not take into account the consistency between the individual calls (it has to find the min element from scratch each time), and likely doesn't even check whether the vector is sorted (so it can't even do an O(log(n)) dichotomy). So this is comparing something in O(k*n^2), with k the upsampling factor, with something in O(n) – nbonneel Feb 08 '12 at 20:22

Ok, I think I found a much shorter version, which runs at least as fast and is as accurate:

cdf = cumsum(pdf);
M = length(cdf);
xx = linspace(0,1,M);
invcdf = interp1(cdf,xx,xx);

[EDIT: No, this is actually still two to three times slower than the initial code... don't ask me why! And it does not handle non-strictly-monotonic functions: it produces the error "The values of X should be distinct".]
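One workaround I can think of for the flat (non-invertible) case is to keep only the first sample of each repeated cdf value before inverting, so interp1 sees strictly increasing abscissae. A sketch (the unique() trick and example pdf are mine):

```matlab
%// Sketch: tolerate flat cdf regions (zero-pdf bins) by deduplicating
%// the cdf before inverting it with interp1. Example pdf is made up.
pdf = [0.2 0 0 0.3 0.5];
cdf = cumsum(pdf) / sum(pdf);
M   = length(cdf);
xx  = linspace(0, 1, M);

[cdf_u, iu] = unique(cdf, 'first');    %// drop duplicates, keep first index
invcdf = interp1(cdf_u, xx(iu), xx);   %// NaN below cdf_u(1); clamp if needed
```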

nbonneel
  • Also, is there a typo in the arguments to `interp1()`? It looks like you're using the exact same grid. – ely Feb 09 '12 at 00:39
  • no typo: the goal is to replace the function (x, f(x)) by (f(x), x) sampled at x... hence the redundancy. – nbonneel Feb 09 '12 at 04:11
  • I copied and pasted this code as is, and it works when the cdf is invertible, albeit more slowly. – nbonneel Feb 09 '12 at 04:17
  • I'm still not following. I didn't mean to imply it would fail to work; I just don't understand how invcdf can be different from xx in the code. You're basically asking it to interpolate a vector into itself. xx already represents the 'independent variable' values at each point in cdf, how does the interpolation change anything? – ely Feb 09 '12 at 04:22
  • The thing is to realize that, in terms of graphs, the inverse of the function whose graph is plot(x,y) is the function whose graph is plot(y,x). So this is just a matter of resampling the data "x", which were originally sampled at f^-1(y), at the new sample points "x". One of the "x" represents the values of the function to be interpolated while the other "x" represents the positions where we want the interpolation... The syntax for interp1 is interp1(x,Y,xi), where xi are the new sample points. – nbonneel Feb 09 '12 at 14:57