
I am trying to come up with a very fast algorithm for calculating this interesting statistic that takes full advantage of a powerful GPU. Ideally I will do this in Matlab using Jacket, but ideas in CUDA code or OpenCL would also be much appreciated. Basically, I want to crowd-source lots of clever ideas that I can try to put together, and then I will attempt to open-source the result so others can use it.

Despite the power of this dependency coefficient (Hoeffding's D can detect even "one-to-many" dependency relationships), there is almost nothing about it online except for two sources: SAS statistical software, and the excellent R package Hmisc by Frank Harrell. You can read a description of the algorithm here:

http://support.sas.com/documentation/cdl/en/procstat/63104/HTML/default/viewer.htm#procstat_corr_sect016.htm

And here is Harrell's code in Fortran (which is surprisingly easy to follow if you understand the calculation already):

http://hmisc.sourcearchive.com/documentation/3.7-0/hoeffd_8f-source.html

(also, page 128 of the pdf documentation for Hmisc has some more details.)

This is a very computationally demanding algorithm-- if you wanted to apply it to a data set with thousands of rows and a few thousand columns, even with the fast Fortran implementation you would be waiting many days for the result, even on a new machine. My hope is that using an Nvidia GTX 580 level card, or better yet a Tesla, would bring that down to a couple of hours. If so, the combination would be an analytical force to be reckoned with, whether identifying genes or finding correlations in experimental data.

Anyway, I look forward to people's responses and hope we can make a fast, GPU-based algorithm for Hoeffding's D a reality.

Thanks in advance for any ideas-- and please don't hesitate to share partial or half-baked ones!

Update: So, Jascha has generously provided a working implementation of Hoeffding's D in Matlab, which fulfilled one of my objectives. The other was to radically enhance the speed by using a GPU, preferably using Jacket. Does anyone see a path or strategy to do this in a clever way on the GPU?

Doodles
  • I don't think you are asking the right question. The statistic you are interested in is a scalar value; you would never compute it using a GPU. But it is composed of a number of sums, and those sums are computed over ranks of the input data. So what you should be asking about is how to compute those sums and ranks using data-parallel primitives on a GPU. The rest of your question is either irrelevant or self-evident once those underlying values are available. – talonmies Feb 14 '12 at 06:17
  • Thanks for your response. I should clarify-- although the atomic calculation, if you will, is indeed a scalar value (a pair-wise correlation coefficient), the typical use case at least for me would be to take as input a set of equal-length column vectors (i.e., a matrix), and then look at the set of all Hoeffding D's for every column with every other column (really we only need half because it is symmetric). Does that not introduce more opportunities for parallelism in the computation? Does calculating the ranks for one pairwise combination help with others if you are clever? Thanks. – Doodles Feb 14 '12 at 15:08
  • Can you post some Matlab with Jacket to get us started? Then we can help you optimize it in comments. – arrayfire Feb 14 '12 at 15:42
  • Sure, but I don't have any ideas at the moment to parallelize it in any way (other than talonmies' suggestion, which I need to think about some more)-- so what I come up with will more or less be a direct port of the Fortran code, so not sure how useful that would be. Thanks for your interest. – Doodles Feb 14 '12 at 16:10

2 Answers


I intended this to be just a comment, but it is too long. No worries if you don't want to accept it as an answer or anything.

First, it is commendable that you are thinking about how to map this to the GPU. Many more scientists should take the time to consider ideas like this for their algorithms. However, after reading the description, I am not convinced that a GPU is the ideal way to parallelize this particular method. The reason I say this is that GP-GPU programming tends to be very effective at "batch processing", i.e. algorithms that naturally lend themselves to element-wise manipulation at the thread level.

It isn't clear that there's a useful division of this algorithm along those lines, except at the summation / rank-calculation level that others have already mentioned (and these sorts of sub-functions are already well understood on the GPU). But if you zoom out and think about how to use the GPU to speed things up at the level of column-to-column comparisons, there's not much you can do. The per-column ranks can be computed once and reused across all pairs, but knowing where particular entries are less than a given point's value won't let you avoid redoing the joint comparisons when one or the other of the columns is changed.

In short, you're going to have to do N(N-1)/2 separate iterations of the algorithm at the top level, one per distinct pair of columns, and there's no way to avoid that. Within each of those iterations, there's plenty of room to send your column arrays to the GPU and do the comparisons at the thread level to calculate the various rank statistics quickly.
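
To make the thread-level idea concrete, here is a minimal sketch for a single pair of columns, in plain MATLAB since that is what the question uses. It computes the bivariate rank Q for all points at once via N-by-N comparison matrices; the tie corrections follow the Fortran reference. These element-wise comparisons are exactly the kind of work that maps to one GPU thread per matrix element, but whether they actually run on the device (Jacket device arrays, or gpuArray in the Parallel Computing Toolbox) depends on your toolchain, so treat this as a starting point rather than a tuned kernel.

% x, y: the two columns being compared (N-by-1)
R = tiedrank( x );            % mid-ranks of x, ties averaged
S = tiedrank( y );            % mid-ranks of y
% N-by-N comparison matrices: ltR(j,i) is true when R(j) < R(i)
ltR = bsxfun( @lt, R, R.' );  eqR = bsxfun( @eq, R, R.' );
ltS = bsxfun( @lt, S, S.' );  eqS = bsxfun( @eq, S, S.' );
% bivariate rank Q(i) for all i at once, with the usual tie corrections
Q = 1 + sum( ltR & ltS, 1 ).' ...
      + ( sum( eqR & eqS, 1 ).' - 1 )/4 ...  % both coordinates tie; -1 removes point i itself
      + sum( eqR & ltS, 1 ).'/2 ...          % x ties, y strictly less
      + sum( ltR & eqS, 1 ).'/2;             % x strictly less, y ties

Note that the N-by-N intermediates cost O(N^2) memory, so for very long columns you would process them in tiles.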

A better approach might be to use MPI in a multiprocessor setup, so that you farm out each high-level iteration to a different processor. The master-slave parallel model (terribly named) seems appropriate if you don't have enough processors for all the pairs, and/or if you want each processor to utilize the GPU within a single high-level iteration: as long as there are upper-triangular elements of the Hoeffding "covariance" matrix left to compute, you keep pumping out tasks to the available processors.
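
As a rough sketch of that farming strategy using MATLAB workers instead of MPI (parfor stands in for the master-slave distribution here, and hoeffdingsD is assumed to be any per-pair implementation, such as the one in the other answer):

% X: N-by-nCols data matrix; compute D for every distinct pair of columns
nCols = size( X, 2 );
[ iIdx, jIdx ] = find( triu( true( nCols ), 1 ) );   % all pairs with i < j
Dvals = zeros( numel( iIdx ), 1 );
parfor k = 1:numel( iIdx )
    Dvals(k) = hoeffdingsD( X(:, iIdx(k)), X(:, jIdx(k)) );
end
Dmat = zeros( nCols );                               % re-assemble the symmetric matrix
Dmat( sub2ind( [nCols nCols], iIdx, jIdx ) ) = Dvals;
Dmat = Dmat + Dmat.';

Each worker handles one high-level iteration, which is exactly the task granularity described above.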

Secondly, I think Matlab and Jacket are more of a bottleneck here than you might believe. Yes, Matlab is optimized for some linear algebra operations, but for loop-heavy code like this it is almost always slower than a compiled language. The tradeoff is that you get a lot of convenience functions with commercial-grade documentation, and that is sometimes valuable.

An alternative I would suggest is PyCUDA together with mpi4py in Python. With NumPy and Cython, Python can be faster than Matlab while matching it in convenience functions and ease of use. If you use the PyDev plug-in for the Eclipse IDE, even the user experience is basically identical to the Matlab terminal. Plus, you don't have to pay for a Matlab license or the extra Jacket license.

You'd have to do a bit of work to get PyCUDA and mpi4py working together, but in the end I think the open-source-ness would make your effort far more valuable to a greater number of people, and the code would likely be much faster.

Also, if you do stick with Matlab, have you timed the existing Fortran code in the single-pair-of-columns case, for thousands of entries but just two columns? If that is reasonably fast, you should be able to wrap that Fortran in a MEX file, and in that case multiple Matlab sessions spread across processors could be all you need.
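
The MATLAB-side flow might look like the following. The file name hoeffd_gateway is a placeholder: you would still need to write a small Fortran mexFunction gateway around Harrell's hoeffd routine, and you need a supported Fortran compiler installed.

mex hoeffd_gateway.F hoeffd.f   % compile once from the MATLAB prompt
D = hoeffd_gateway( x, y );     % then call it like any other MATLAB function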

ely
  • Thanks for your detailed response. I will hold off on accepting it since I'd like to allow others to weigh in-- perhaps there's another angle to the problem that no one has thought of yet. I do wonder if there is an "approximation" (whatever that might mean here) to Hoeffding's D, where you would use a divide-and-conquer strategy to break the original data matrix up into blocks, do the calculations on the blocks, and then somehow re-assemble them without losing too much accuracy. As for PyCUDA, I appreciate the recommendation, but I am already so familiar with Matlab that I won't switch now. – Doodles Feb 16 '12 at 22:05
  • Also, your point about just wrapping the existing Fortran code into a MEX file is a very good one. I am not familiar with how this is done, but will investigate it. Hopefully it isn't too difficult. I also wonder if using a fancy Fortran compiler like Intel's Visual Fortran Composer XE, which has some "auto vectorizing" functionality, might also speed things up. – Doodles Feb 16 '12 at 22:09

Below is a simple implementation of Hoeffding's D measure of dependence in MATLAB. It is NOT GPU-ized, but may prove useful to people who want to calculate this statistic and are not using Fortran, or as a starting point for putting it on the GPU. (The extended header illustrates Hoeffding's D values for several types of joint distributions.)

function [ D ] = hoeffdingsD( x, y )
%Computes Hoeffding's D measure of dependence between x and y
%   inputs x and y are both N x 1 arrays
%   output D is a scalar
%     The formula for Hoeffding's D is taken from
%     http://support.sas.com/documentation/cdl/en/procstat/63104/HTML/default/viewer.htm#procstat_corr_sect016.htm
% Below is demonstration code for several types of dependencies.
%
% % this case should be 0 - there are no dependencies
% x = randn(1000,1);
% y = randn(1000,1);
% D = hoeffdingsD( x, y );
% desc = 'x = N( 0, 1 ), y and x independent';
% desc = sprintf( '%s, Hoeffding''s D = %f', desc, D );
% fprintf( '%s\n', desc );
% figure; plot(x,y,'.'); title( desc );
% 
% % the rest of these cases have dependencies of different types
% x = randn(1000,1);
% y = x;
% D = hoeffdingsD( x, y );
% desc = 'x = N( 0, 1 ), y = x';
% desc = sprintf( '%s, Hoeffding''s D = %f', desc, D );
% fprintf( '%s\n', desc );
% figure; plot(x,y,'.'); title( desc );
% 
% x = randn(1000,1);
% y = cos(10*x);
% D = hoeffdingsD( x, y );
% desc = 'x = N( 0, 1 ), y = cos(10x)';
% desc = sprintf( '%s, Hoeffding''s D = %f', desc, D );
% fprintf( '%s\n', desc );
% figure; plot(x,y,'.'); title( desc );
% 
% x = randn(1000,1);
% y = x.^2;
% D = hoeffdingsD( x, y );
% desc = 'x = N( 0, 1 ), y = x^2';
% desc = sprintf( '%s, Hoeffding''s D = %f', desc, D );
% fprintf( '%s\n', desc );
% figure; plot(x,y,'.'); title( desc );
% 
% x = randn(1000,1);
% y = x.^2 + randn(1000,1);
% D = hoeffdingsD( x, y );
% desc = 'x = N( 0, 1 ), y = x^2 + N( 0, 1 )';
% desc = sprintf( '%s, Hoeffding''s D = %f', desc, D );
% fprintf( '%s\n', desc );
% figure; plot(x,y,'.'); title( desc );
% 
% x = rand(1000,1);
% y = rand(1000,1);
% z = sign(randn(1000,1));
% x = z.*x; y = z.*y;
% D = hoeffdingsD( x, y );
% desc = 'x = z U( [0, 1) ), y = z U( [0, 1) ), z = U( {-1,1} )';
% desc = sprintf( '%s, Hoeffding''s D = %f', desc, D );
% fprintf( '%s\n', desc );
% figure; plot(x,y,'.'); title( desc );
% 
% x = rand(1000,1);
% y = rand(1000,1);
% z = sign(randn(1000,1));
% x = z.*x; y = -z.*y;
% D = hoeffdingsD( x, y );
% desc = 'x = z U( [0, 1) ), y = -z U( [0, 1) ), z = U( {-1,1} )';
% desc = sprintf( '%s, Hoeffding''s D = %f', desc, D );
% fprintf( '%s\n', desc );
% figure; plot(x,y,'.'); title( desc );

N = size(x,1);

R = tiedrank( x );
S = tiedrank( y );

Q = zeros(N,1);
parfor i = 1:N
    Q(i) = 1 + sum( R < R(i) & S < S(i) );
    % and deal with cases where one or both values are ties, which contribute less
    Q(i) = Q(i) + 1/4 * (sum( R == R(i) & S == S(i) ) - 1); % both indices tie.  -1 because we know point i matches
    Q(i) = Q(i) + 1/2 * sum( R == R(i) & S < S(i) ); % one index ties.
    Q(i) = Q(i) + 1/2 * sum( R < R(i) & S == S(i) ); % one index ties.
end

D1 = sum( (Q-1).*(Q-2) );
D2 = sum( (R-1).*(R-2).*(S-1).*(S-2) );
D3 = sum( (R-2).*(S-2).*(Q-1) );

D = 30*((N-2)*(N-3)*D1 + D2 - 2*(N-2)*D3) / (N*(N-1)*(N-2)*(N-3)*(N-4));

end
Jascha
  • Thanks Jascha, this looks really great! And it is already parallelized to a degree using the parfor. OK, so now there is some code to look at. Does anyone have any ingenious ideas to speed this up using the GPU? – Doodles Feb 17 '12 at 05:02