Calculating a 2D joint probability distribution

Question

I have many points inside a square. I want to partition the square in many small rectangles and check how many points fall in each rectangle, i.e. I want to compute the joint probability distribution of the points. I am reporting a couple of common sense approaches, using loops and not very efficient:

% Data
N = 1e5;    % number of points
xy = rand(N, 2);    % coordinates of points
xy(randi(2*N, 100, 1)) = 0;    % add some points on one side
xy(randi(2*N, 100, 1)) = 1;    % add some points on the other side
xy(randi(N, 100, 1), :) = 0;    % add some points on one corner
xy(randi(N, 100, 1), :) = 1;    % add some points on one corner
inds= unique(randi(N, 100, 1)); xy(inds, :) = repmat([0 1], numel(inds), 1);    % add some points on one corner
inds= unique(randi(N, 100, 1)); xy(inds, :) = repmat([1 0], numel(inds), 1);    % add some points on one corner

% Intervals for rectangles
K1 = ceil(sqrt(N/5));    % number of intervals along x
K2 = K1;    % number of intervals along y
int_x = [0:(1 / K1):1, 1+eps];    % intervals along x
int_y = [0:(1 / K2):1, 1+eps];    % intervals along y

% First approach
tic
count_cells = zeros(K1 + 1, K2 + 1);
for k1 = 1:K1+1
  inds1 = (xy(:, 1) >= int_x(k1)) & (xy(:, 1) < int_x(k1 + 1));
  for k2 = 1:K2+1
    inds2 = (xy(:, 2) >= int_y(k2)) & (xy(:, 2) < int_y(k2 + 1));
    count_cells(k1, k2) = sum(inds1 .* inds2);
  end
end
toc
% Elapsed time is 46.090677 seconds.

% Second approach
tic
count_again = zeros(K1 + 2, K2 + 2);
for k1 = 1:K1+1
  inds1 = (xy(:, 1) >= int_x(k1));
  for k2 = 1:K2+1
    inds2 = (xy(:, 2) >= int_y(k2));
    count_again(k1, k2) = sum(inds1 .* inds2);
  end
end
count_again_fix = diff(diff(count_again')');
toc
% Elapsed time is 22.903767 seconds.

% Check: the two solutions are equivalent
all(count_cells(:) == count_again_fix(:))

How can I do it more efficiently in terms of time, memory, and possibly avoiding loops?

EDIT --> I have just found this as well, it's the best solution found so far:

tic
count_cells_hist = hist3(xy, 'Edges', {int_x int_y});
count_cells_hist(end, :) = []; count_cells_hist(:, end) = [];
toc
all(count_cells(:) == count_cells_hist(:))
% Elapsed time is 0.245298 seconds.

but it requires the Statistics Toolbox.

EDIT --> Testing solution suggested by chappjc

tic
xcomps = single(bsxfun(@ge,xy(:,1),int_x));
ycomps = single(bsxfun(@ge,xy(:,2),int_y));
count_again = xcomps.' * ycomps; %' 143x143 = 143x1e5 * 1e5x143
count_again_fix = diff(diff(count_again')');
toc
% Elapsed time is 0.737546 seconds.
all(count_cells(:) == count_again_fix(:))

Pssible duplicate of http://stackoverflow.com/questions/18639518/generate-and-plot-the-empirical-joint-pdf-and-cdf-in-matlab/18640944#18640944 — Luis Mendo, Nov 02 '13 at 19:59
I am also checking http://stackoverflow.com/questions/16313949/how-to-plot-joint-distribtuion-of-2-random-variable-having-1000-data - I'm not sure if hist3 can be used to obtain the same result. — , Nov 02 '13 at 20:06
@LuisMendo - That's a very thorough answer to the other question and it is rightly linked here. However, the other question was not specific and contained no code, and hence it was closed. So, I think francesco's question here warrants answers for making a good attempt at solving the problem. Definite +1 to your well conceived solution to the other question. Just my 2 cents. :) — chappjc, Nov 02 '13 at 20:37
@chappjc Yes, since the other question was closed, it makes sense to answer here. — Luis Mendo, Nov 02 '13 at 20:51
@francesco If you use `single` instead of `double` in my solution, it runs twice as fast and should not be a problem since the matrix elements are just 0 and 1. — chappjc, Nov 02 '13 at 21:23
@Luis, I have tested your solution at the link provided - it does not return the outcome requested, it is very slow and it also takes a lot of memory (!). Perhaps I got it wrong (?) — , Nov 02 '13 at 22:31
@francesco Did it not speed up much with single? Maybe because I was testing on an old version if MATLAB. BTW, beware of going over 10 edits when question ownership reverts to the community. — chappjc, Nov 02 '13 at 23:02
@chappjc: my previous comment was about the solution suggested by Luis at the link provided above. Use of single improves indeed your solution, especially for N>1e5. What does "beware of going over 10 edits when question ownership reverts to the community" mean? — , Nov 02 '13 at 23:11
Since this question is still getting answers, I thought I would post another one using `accumarray`. This function is designed for this kind of thing and it is _extemely_ fast; all you have to do is bin your data. — chappjc, Nov 03 '13 at 23:59

score 2 · Answer 1 · answered Nov 03 '13 at 12:19

I have written a simple mex function which works very well when N is large. Of course it's cheating but still ...

The function is

#include "mex.h"

void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])
{
    unsigned long int hh, ctrl;       /*  counters                       */
    unsigned long int N, m, n;        /*  size of matrices               */
    unsigned long int *xy;            /*  data                           */
    unsigned long int *count_cells;   /*  joint frequencies              */
    /*  matrices needed */
    mxArray *count_cellsArray;

/*  Now we need to get the data */
    if (nrhs == 3) {
        xy = (unsigned long int*) mxGetData(prhs[0]);
        N = (unsigned long int) mxGetM(prhs[0]);
        m = (unsigned long int) mxGetScalar(prhs[1]);
        n = (unsigned long int) mxGetScalar(prhs[2]);
    }

/*  Then build the matrices for the output */
    count_cellsArray = mxCreateNumericMatrix(m + 1, n + 1, mxUINT32_CLASS, mxREAL);
    count_cells = mxGetData(count_cellsArray);
    plhs[0] = count_cellsArray;

    hh = 0; /* counter for elements of xy */
    /* for all points from 1 to N */
    for(hh=0; hh<N; hh++) {
        ctrl = (m + 1) * xy[N + hh] + xy[hh];
        count_cells[ctrl] = count_cells[ctrl] + 1;
    }
}

It can be saved in a file "joint_dist_points_2D.c", then compiled:

mex joint_dist_points_2D.c

And check it out:

% Data
N = 1e7;    % number of points
xy = rand(N, 2);    % coordinates of points
xy(randi(2*N, 1000, 1)) = 0;    % add some points on one side
xy(randi(2*N, 1000, 1)) = 1;    % add some points on the other side
xy(randi(N, 1000, 1), :) = 0;    % add some points on one corner
xy(randi(N, 1000, 1), :) = 1;    % add some points on one corner
inds= unique(randi(N, 1000, 1)); xy(inds, :) = repmat([0 1], numel(inds), 1);    % add some points on one corner
inds= unique(randi(N, 1000, 1)); xy(inds, :) = repmat([1 0], numel(inds), 1);    % add some points on one corner

% Intervals for rectangles
K1 = ceil(sqrt(N/5));    % number of intervals along x
K2 = ceil(sqrt(N/7));    % number of intervals along y
int_x = [0:(1 / K1):1, 1+eps];    % intervals along x
int_y = [0:(1 / K2):1, 1+eps];    % intervals along y

% Use Statistics Toolbox: hist3
tic
count_cells_hist = hist3(xy, 'Edges', {int_x int_y});
count_cells_hist(end, :) = []; count_cells_hist(:, end) = [];
toc
% Elapsed time is 4.414768 seconds.

% Use mex function
tic
xy2 = uint32(floor(xy ./ repmat([1 / K1, 1 / K2], N, 1)));
count_cells = joint_dist_points_2D(xy2, uint32(K1), uint32(K2));
toc
% Elapsed time is 0.586855 seconds.

% Check: the two solutions are equivalent
all(count_cells_hist(:) == count_cells(:))

Good contribution! But MEX is kind of a cheat, yeah. ;) However, I used a MEX file when making joint PDFs for my research, so at the end of the day I would agree that is the way to go. However, for this `N=1e7` test data, my updated `accumarray` approach takes 1.1 seconds on my PC, so that might be a good general alternative, no toolboxes required. — chappjc, Nov 04 '13 at 00:19
I Agree! I tested your solution with accumarray and it is fast even with N=3e7! Chapeau! — , Nov 04 '13 at 00:53

chappjc · Accepted Answer · 2013-11-04T00:10:06.427

Improving on code in question

Your loops (and the nested dot product) can be eliminated with bsxfun and matrix multiplication as follows:

xcomps = bsxfun(@ge,xy(:,1),int_x);
ycomps = bsxfun(@ge,xy(:,2),int_y);
count_again = double(xcomps).'*double(ycomps); %' 143x143 = 143x1e5 * 1e5x143
count_again_fix = diff(diff(count_again')');

The multiplication step accomplishes the AND and summation done in sum(inds1 .* inds2), but without looping over the density matrix. EDIT: If you use single instead of double, execution time is nearly halved, but be sure to convert your answer to double or whatever is required for the rest of the code. On my computer this takes around 0.5 sec.

Note: With rot90(count_again/size(xy,1),2) you have a CDF, and in rot90(count_again_fix/size(xy,1),2) you have a PDF.

Using accumarray

Another approach is to use accumarray to make the joint histogram after we bin the data.

Starting with int_x, int_y, K1, xy, etc.:

% take (0,1) data onto [1 K1], following A.Dondas approach for easy comparison
ii = floor(xy(:,1)*(K1-eps))+1; ii(ii<1) = 1; ii(ii>K1) = K1;
jj = floor(xy(:,2)*(K1-eps))+1; jj(jj<1) = 1; jj(jj>K1) = K1;

% create the histogram and normalize
H = accumarray([ii jj],ones(1,size(ii,1)));
PDF = H / size(xy,1); % for probabilities summing to 1

On my computer, this takes around 0.01 sec.

The output is the same as A.Donda's converted from sparse to full (full(H)). Although, as he A.Donda pointed out, it is correct to have the dimensions be K1xK1, rather than the size of count_again_fix in the OPs code that was K1+1xK1+1.

To get the CDF, I believe you can simply apply cumsum to each axis of the PDF.

Note to all: I'm not necessarily going for a solution to the general joint probability distribution question, but rather for a way to change francesco's code to "do it more efficiently in terms of time, memory, and possibly avoiding loops". There's a fine line here I think, and it comes down to the scope and quality of the two questions. I'm going outside now. :p — chappjc, Nov 02 '13 at 21:03
Using hist3 seems to be the best option if the Statistics Toolbox is available - otherwise the solution suggested by chappjc is the best alternative solution that I have tested so far. — , Nov 02 '13 at 23:00
Your solution using accumarray is very fast indeed - it is comparable with my mex function! Although I need also the extremes, 0s and 1s, so I think the size of the matrix should be (K1+1)x(K2+1) - consider that I am using the *edges*, not the *bins*. — , Nov 04 '13 at 00:44

A. Donda · Answer 3 · 2013-11-03T13:00:23.710

chappjc's answer and using hist3 are all good, but since I happened to want to have something like this some time ago and for some reason didn't find hist3 I wrote it myself, and I thought I'd post it here as a bonus. It uses sparse to do the actual counting and returns the result as a sparse matrix, so it may be useful for dealing with a multimodal distribution where different modes are far apart – or for someone who doesn't have the Statistics Toolbox.

Application to francesco's data:

K1 = ceil(sqrt(N/5));
[H, xs, ys] = hist2d(xy(:, 1), xy(:, 2), [K1 K1], [0, 1 + eps, 0, 1 + eps]);

Called with output parameters the function just returns the result, without it makes a color plot.

Here's the function:

function [H, xs, ys] = hist2d(x, y, n, ax)

% plot 2d-histogram as an image
%
% hist2d(x, y, n, ax)
% [H, xs, ys] = hist2d(x, y, n, ax)
%
% x:    data for horizontal axis
% y:    data for vertical axis
% n:    how many bins to use for each axis, default is [100 100]
% ax:   axis limits for the plot, default is [min(x), max(x), min(y), max(y)]
% H:    2d-histogram as a sparse matrix, indices 1 & 2 correspond to x & y
% xs:   corresponding vector of x-values
% ys:   corresponding vector of y-values
%
% x and y have to be column vectors of the same size. Data points
% outside of the axis limits are allocated to the first or last bin,
% respectively. If output arguments are given, no plot is generated;
% it can be reproduced by "imagesc(ys, xs, H'); axis xy".


% defaults
if nargin < 3
    n = [100 100];
end
if nargin < 4
    ax = [min(x), max(x), min(y), max(y)];
end

% parameters
nx = n(1);
ny = n(2);
xl = ax(1 : 2);
yl = ax(3 : 4);

% generate histogram
i = floor((x - xl(1)) / diff(xl) * nx) + 1;
i(i < 1) = 1;
i(i > nx) = nx;
j = floor((y - yl(1)) / diff(yl) * ny) + 1;
j(j < 1) = 1;
j(j > ny) = ny;
H = sparse(i, j, ones(size(i)), nx, ny);

% generate axes
xs = (0.5 : nx) / nx * diff(xl) + xl(1);
ys = (0.5 : ny) / ny * diff(yl) + yl(1);

% possibly plot
if nargout == 0
    imagesc(ys, xs, H')
    axis xy
    clear H xs ys
end

The function is brilliant but the outcome is not exactly the same - I guess the edges on the right are treated differently. I'm trying to understand if I can fix it accordingly. — , Nov 03 '13 at 12:31
Thx! Maybe its because "indices 1 & 2 correspond to y & x"? I did it that way because that's the way imagesc wants it input, but maybe that was a bad idea. In that case, transposition should fix it. — A. Donda, Nov 03 '13 at 12:33
Also, your hist3 solution produces a 143 x 143 matrix, while K1 = K2 = 142, and my function produces a 142 x 142 accordingly. — A. Donda, Nov 03 '13 at 12:36
@francesco, I changed my function to give the output with the "natural" order of coordinates. The remaining difference is due to the fact that `hist3` with specifying 'Edges' ignores data points lying outside, while my function counts them towards the margin bins. Its output is identical to that of `hist3` if called like this: `hist3(xy, 'Ctrs', {xs ys})` where `xs` and `ys` are the bin centers returned by my function. Thanks for pointing out these inconsistencies! — A. Donda, Nov 03 '13 at 13:04
@A.Donda That's a nice way. Instead of using `sparse` to do counting, MATLAB's `accumarray` is quite nice for accumulating binned data like this. I posted a second solution in my answer just for completeness. — chappjc, Nov 04 '13 at 00:02

Calculating a 2D joint probability distribution

3 Answers3

Improving on code in question

Using accumarray

Linked