
I have two uint16 3D (GPU) arrays A and B in MATLAB, which have the same 2nd and 3rd dimensions. For instance, size(A,1) = 300 000, size(B,1) = 2000, size(A,2) = size(B,2) = 20, and size(A,3) = size(B,3) = 100, to give an idea of the orders of magnitude. Actually, size(A,3) = size(B,3) is very big, say ~ 1 000 000, but the arrays are stored externally in small pieces cut along the 3rd dimension. The point is that there is a very long loop along the 3rd dimension (cf. the MWE below), so the code inside it needs to be optimized further (if possible). Furthermore, the values of A and B can be assumed to be bounded way below 65535, but there are still hundreds of different values.

For each i,j, and d, the rows A(i,:,d) and B(j,:,d) represent multisets of the same size, and I need to find the size of the largest common submultiset (multisubset?) of the two, i.e. the size of their intersection as multisets. Moreover, the rows of B can be assumed sorted.

For example, if [2 3 2 1 4 5 5 5 6 7] and [1 2 2 3 5 5 7 8 9 11] are two such multisets, respectively, then their multiset intersection is [1 2 2 3 5 5 7], which has the size 7 (7 elements as a multiset).
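
To make this concrete, here is a tiny standalone snippet (the variable names x, y, vals, cx, cy, intersection_size are just for this illustration and do not appear in the routine below) that computes the intersection size of this example by counting the multiplicity of each value and taking the minimum per value:

x = [2 3 2 1 4 5 5 5 6 7];
y = [1 2 2 3 5 5 7 8 9 11];
vals = unique([x y]);                 % all distinct values occurring in either multiset
cx = sum(x(:) == vals, 1);            % multiplicity of each value in x
cy = sum(y(:) == vals, 1);            % multiplicity of each value in y
intersection_size = sum(min(cx, cy))  % gives 7 for this example

This min-of-multiplicities idea is exactly what the routine below does, just vectorized over all rows and values at once.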

I am currently using the following routine to do this:

s = 300000; % 1st dim. of A
n = 2000; % 1st dim. of B
c = 10; % 2nd dim. of A and B
depth = 10; % 3rd dim. of A and B (corresponds to a batch of size 10 of A and B along the 3rd dim.)
N = 100; % upper bound on the possible values of A and B

A = randi(N,s,c,depth,'uint16','gpuArray');
B = randi(N,n,c,depth,'uint16','gpuArray');

Sizes_of_multiset_intersections = zeros(s,n,depth,'uint8'); % too big to fit in GPU memory together with A and B
for d=1:depth
    A_slice = A(:,:,d);
    B_slice = B(:,:,d);
    unique_B_values = permute(unique(B_slice),[3 2 1]); % B is smaller than A

    % compute counts of the unique B-values for each multiset:
    A_values_counts = permute(sum(uint8(A_slice==unique_B_values),2,'native'),[1 3 2]);
    B_values_counts = permute(sum(uint8(B_slice==unique_B_values),2,'native'),[1 3 2]);

    % compute the count of each unique B-value in the intersection:
    Sizes_of_multiset_intersections_tmp = gpuArray.zeros(s,n,'uint8');
    for i=1:n
        Sizes_of_multiset_intersections_tmp(:,i) = sum(min(A_values_counts,B_values_counts(i,:)),2,'native');
    end

    Sizes_of_multiset_intersections(:,:,d) = gather(Sizes_of_multiset_intersections_tmp);
end

One can also easily adapt the above code to compute the result in batches along dimension 3 rather than slice by slice (d=1:depth, i.e. batch size 1), though at the expense of an even bigger unique_B_values vector; a rough sketch of such an adaptation follows below.
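
For reference, here is a rough, untested sketch of that batched adaptation. batch_size is a hypothetical parameter; the sketch reuses A, B, s, n, depth and the preallocated Sizes_of_multiset_intersections from the MWE above. Note that the comparison array A_batch==unique_B_values grows by a factor of batch_size, so batch_size has to stay small:

batch_size = 5; % hypothetical batch size along the 3rd dimension
for d0 = 1:batch_size:depth
    idx = d0:min(d0+batch_size-1,depth);
    A_batch = A(:,:,idx);
    B_batch = B(:,:,idx);

    % one (longer) list of unique values shared by the whole batch:
    unique_B_values = permute(unique(B_batch),[4 2 3 1]); % 1 x 1 x 1 x u

    % counts along a 4th dimension holding the unique values, for all batch slices at once:
    A_counts = permute(sum(uint8(A_batch==unique_B_values),2,'native'),[1 4 3 2]); % s x u x numel(idx)
    B_counts = permute(sum(uint8(B_batch==unique_B_values),2,'native'),[1 4 3 2]); % n x u x numel(idx)

    % same inner loop as before, per slice of the batch:
    for b = 1:numel(idx)
        tmp = gpuArray.zeros(s,n,'uint8');
        for i = 1:n
            tmp(:,i) = sum(min(A_counts(:,:,b),B_counts(i,:,b)),2,'native');
        end
        Sizes_of_multiset_intersections(:,:,idx(b)) = gather(tmp);
    end
end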

Since the depth dimension is large (even when working in batches along it), I am interested in faster alternatives to the code inside the outer loop. So my question is this: is there a faster (e.g. better vectorized) way to compute sizes of intersections of multisets of equal size?

M.G.
  • How do you decide which ones get included if you have, for example, `[1 1 1 1]` and `[1 1]`? Or are they indistinguishable? – Neil Aug 08 '22 at 03:58
  • have you checked the function `intersect`? – Ander Biguri Aug 08 '22 at 10:39
  • @Neil: yes, they are indistinguishable, so in your example the intersection would be just `[1 1]`. In other words, the intersection of finite multisets is just their intersection as sets, but with multiplicities, given by the minimum of the multiplicities from each multiset. – M.G. Aug 08 '22 at 12:06
  • @AnderBiguri: yes, I am aware of `intersect`, but it does not seem suitable for several reasons. Firstly, it works for sets, not for multisets: it returns only the set-theoretical intersection, so I still have to find some other way to count multiplicities. Secondly, and more importantly, it would have to compare each row from `A` to each row from `B`. My current code is faster than that, as it compares a whole `A`-slice (matrix) to a `B`-row via vectorization (while taking care of multiplicities at the same time), i.e. one less loop. I don't see how to avoid the extra loop with `intersect`. – M.G. Aug 08 '22 at 12:15
  • Are you sure about the orders of magnitude of the sizes of your matrices? Not sure I can handle a 12000GB array. You should add a [Minimal, Reproducible Example](https://stackoverflow.com/help/minimal-reproducible-example) (e.g. with random values and the `default` rng seed) – BillBokeey Aug 08 '22 at 12:38
  • @BillBokeey: Ah, I see your point. Indeed the size of the 3rd dimension `depth` is correct, but the arrays are stored externally in small pieces along the 3rd dimension. My point was that there are a lot of iterations for the outer loop, so the code inside needs to be faster. I will edit my question accordingly to reflect this and make the code a MRE. – M.G. Aug 08 '22 at 13:11
  • The thing is, you are always going to have to do `size(A,1)*size(B,1)` comparisons per slice, which is going to be a lot. Are you sure you need to do all these? What do you want to do with the `Sizes_of_multiset_intersections` array? As is, I don't think there is a faster way than the one you have – BillBokeey Aug 08 '22 at 13:21
  • [This](https://stackoverflow.com/questions/51829635/finding-multiset-difference-between-two-arrays) seems like it would be helpful, particularly obchardon’s `histc`-based approach. Note that `histc` takes an optional dimension, while the newer `histcounts` does not. – beaker Aug 08 '22 at 13:27
  • Also, are you sure your code actually does what you want? The call to `unique(B_slice)` is going to lose you the information of which `n` index your values correspond to – BillBokeey Aug 08 '22 at 13:27
  • @BillBokeey: yes, I am certain. `unique(B_slice)` is only needed to determine the distinct elements whose multiplicities are to be counted. The next two lines count the multiplicities of these values in `A` and `B` and the final (inner) loop takes `min` of these multiplicities to determine the intersection. – M.G. Aug 08 '22 at 13:34
  • @BillBokeey: in terms of performance, I had a routine that was doing this in an even slower way. It boils down to optimal use of vectorization in MATLAB and that some vectorized MATLAB functions are faster than others. What happens to `Sizes_of_multiset_intersections` next is not relevant. I will edit my post to make the code into MRE as soon as I get back home. – M.G. Aug 08 '22 at 13:37
  • @beaker: Thanks, nice find! I will check it out. Looks very much like what I need. Maybe it can be adapted to my scenario and then it only remains to compare performance. – M.G. Aug 08 '22 at 13:39
  • I have now modified the included code example to a proper MWE, tested on a GTX 1060 with 6GB of VRAM. – M.G. Aug 08 '22 at 15:32

1 Answer


Disclaimer: this is not a GPU-based solution (I don't have a good GPU). I find the results interesting and want to share them, but I can delete this answer if you think I should.


Below is a vectorized version of your code that gets rid of the inner loop, at the cost of handling a bigger intermediate array that might be too big to fit in memory.

The idea is to shape the matrices A_values_counts and B_values_counts as 3D arrays such that calling min(A_values_counts,B_values_counts) computes everything in one go via implicit expansion. Behind the scenes this creates a big array of size s x n x length(unique_B_values), which will probably be too big most of the time.

To work around this size constraint, the results are calculated in batches along the n dimension, i.e. the first dimension of B:

tic

nBatches_B = 2000;
sBatches_B = n/nBatches_B;
Sizes_of_multiset_intersections_new = zeros(s,n,depth,'uint8');

for d=1:depth
    A_slice = A(:,:,d);
    B_slice = B(:,:,d);

    % compute counts of the unique B-values for each multiset:    
    unique_B_values = reshape(unique(B_slice),1,1,[]);

    A_values_counts = sum(uint8(A_slice==unique_B_values),2,'native'); % s x 1 x length(uniqueB) array
    B_values_counts = reshape(sum(uint8(B_slice==unique_B_values),2,'native'),1,n,[]); % 1 x n x length(uniqueB) array

    % Not possible to do it all in one go, must split in batches along B

    for ii = 1:nBatches_B
        cols = ((ii-1)*sBatches_B+1):(ii*sBatches_B);
        Sizes_of_multiset_intersections_new(:,cols,d) = ...
            sum(min(A_values_counts,B_values_counts(:,cols,:)),3,'native'); % vectorized via implicit expansion
    end

end

toc

Here is a little benchmark with different numbers of batches. You can see that a minimum is found around 400 batches (batch size 50), with a decrease of around 10% in processing time (each point is an average over 3 runs). (Edit: the x axis is the number of batches, not the batch size.)

[Benchmark plot: processing time vs. number of batches]

I'd be interested in knowing how it behaves for GPU arrays as well!
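
In case someone wants to try it on gpuArray inputs: below is an untested sketch of the same loop that accumulates each slice's result on the device and does a single explicit gather per slice (mirroring the original routine in the question). I can't benchmark it myself, so treat it only as a starting point; it assumes A and B are the gpuArrays from the MWE and s, n, depth, nBatches_B, sBatches_B are defined as above.

Sizes_of_multiset_intersections_new = zeros(s,n,depth,'uint8'); % host array, too big for the GPU

for d=1:depth
    A_slice = A(:,:,d);
    B_slice = B(:,:,d);

    unique_B_values = reshape(unique(B_slice),1,1,[]);

    A_values_counts = sum(uint8(A_slice==unique_B_values),2,'native');                 % s x 1 x length(uniqueB)
    B_values_counts = reshape(sum(uint8(B_slice==unique_B_values),2,'native'),1,n,[]); % 1 x n x length(uniqueB)

    tmp = gpuArray.zeros(s,n,'uint8'); % keep the per-slice result on the device
    for ii = 1:nBatches_B
        cols = ((ii-1)*sBatches_B+1):(ii*sBatches_B);
        tmp(:,cols) = sum(min(A_values_counts,B_values_counts(:,cols,:)),3,'native');
    end
    Sizes_of_multiset_intersections_new(:,:,d) = gather(tmp); % one device-to-host transfer per slice
end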

BillBokeey
  • Thanks, this is a nice idea that halved the execution time of the MWE on the GPU. On GTX 1060, my code takes ~17.7s to complete, whereas yours takes only ~8.6s, a substantial improvement. Discussing the problem here actually helped me come up with three other routines that run at least 5 times faster, and I will try to combine them with your idea! But the most promising one is kinda difficult and I need more time to finish it. It relies on computing the Incidence matrix of the multisets, which can be done very fast, but it's a bit tricky to deduce the multiplicities from the incidence matrix. – M.G. Aug 09 '22 at 21:52
  • The other two are basically about substantially reducing the size of `unique_B_values`, since my original MWE makes these comparisons rather inefficiently. Maybe all of them can be combined with your idea. Then I will need to benchmark them and post the results. But first I need to finish the incidence-matrix-based method... – M.G. Aug 09 '22 at 21:55
  • Forgot to mention that the fastest option for me was `nBatches_B = 2000`. – M.G. Aug 10 '22 at 04:24
  • I've accepted your answer. For some reason, the functions I came up with work much faster on my datasets, but slower on the random datasets. Currently I have no idea why this is the case, since both sets of datasets have the same type and sizes and are not sparse or anything like that, so vectorization should not behave fundamentally differently. If I ever figure out why, I will post an answer too. – M.G. Aug 13 '22 at 16:25