
I am using OpenMP to parallelize my code. I have an original array:

A=[3,5,2,5,7,9,-4,6,7,-3,1,7,6,8,-1,2]

and a marks array:

M=[1,0,1,0,0,0,1,0,0,1,1,0,0,0,1,1]

Using the array M, I can compact my original array into this packed array:

A=[3,2,-4,-3,1,-1,2]

I'd like to solve this problem using a multi-threaded approach. The Thrust library for C++ solves this problem, but I haven't been able to find a similar tool for Fortran. Is there a library, like Thrust for C++, that I can use to execute a stream compaction? Alternatively, is there an algorithm that I can write myself, using Fortran and OpenMP, to solve this?

Pierpym
  • I think you'll struggle to write an OpenMP program to outperform `A = pack(A,M==1)`. I think the overhead of having multiple threads write to `A` will kill any speedup from distributing the work of `pack`ing. But I look forward to being proved wrong. How does Thrust solve the problem? – High Performance Mark Nov 05 '14 at 13:16
  • I could, and perhaps should, have added to my previous comment that I know of no library to implement a parallelised version of the `pack` intrinsic in Fortran. I suppose you might find it easy enough to call the C++ routines from Thrust. – High Performance Mark Nov 05 '14 at 13:24
  • If your vector is very very long, you can try and split it in a few chunks in a `OMP do` loop and use `pack` on each subset. You'd need to store the resulting subsets independently and merge them at the end. – damienfrancois Nov 05 '14 at 13:25
  • First, thanks for your reply. I don't know how Thrust solves the problem, but I have read that this library offers a lot of APIs that perform these kinds of operations (reductions, prefix sums, reordering, etc.) in multithreaded fashion for GPU applications. Here (http://http.developer.nvidia.com/GPUGems3/gpugems3_ch39.html) it seems that a parallel prefix scan followed by a scatter is necessary, but I don't understand how to write that in Fortran using OpenMP. The `pack` intrinsic function is serial, so I don't know if I can achieve better performance with large arrays and a lot of threads (MIC or GPU). – Pierpym Nov 05 '14 at 14:26
    It's my belief, unjustified by evidence, that the time it takes your computer to move very large arrays between host RAM and co-processor RAM, will outweigh any benefit you might achieve from parallelising this `pack`ing operation. I'll be interested to see your results should you go down this route. I think most of the material in the URL you point us to shows the advantages of using parallel algorithms once you've got the data in the GPU's RAM, but is rather coy about the time taken to shift it there and back. – High Performance Mark Nov 05 '14 at 14:53

1 Answer


Is there a library, like 'thrust' for C++, that i can use to execute a stream compaction?

It shouldn't be that difficult to call a Thrust routine from Fortran (if you're willing to write a little bit of C++ code). Furthermore, Thrust can target an OMP backend instead of a GPU backend.

Alternatively, is there an algorithm that i can write myself using fortran and openmp, to solve this?

The basic parallel stream compaction algorithm is as follows. We will assume that there is one thread assigned per element in your data array, initially.

  1. Perform a parallel prefix sum (inclusive scan) on the M array:

     M=[1,0,1,0,0,0,1,0,0,1,1,0,0,0,1,1]
    sM=[1,1,2,2,2,2,3,3,3,4,5,5,5,5,6,7]
    
  2. Each thread will then inspect its element in the M array, and if that element is non-zero, it will copy its corresponding element in the A array to the output array (let's call it O):

     M=[1,0,1,0,0,0, 1,0,0, 1,1,0,0,0, 1,1]
    sM=[1,1,2,2,2,2, 3,3,3, 4,5,5,5,5, 6,7]
     A=[3,5,2,5,7,9,-4,6,7,-3,1,7,6,8,-1,2]
     O=[3,  2,      -4,    -3,1,      -1,2]
    

If you were doing this in OMP, you would need an OMP barrier between steps 1 and 2. The work in step 2 is relatively simple and completely independent, so you could use an OMP parallel do loop and break the work up in any fashion you wish. Step 1 is more complicated, and I suggest following the outline provided in the chapter you and I both linked. The OMP code there will require various barriers along the way, but it is parallelizable.

As mentioned already in the comments, if this is the only piece of work you want to parallelize, I wouldn't recommend a GPU, because the cost of transferring the data to/from the GPU would probably outweigh any parallel execution time benefits you might accrue. But as I mentioned already, thrust can target an OMP realization rather than a GPU realization. It might be worth a try.
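For reference, selecting Thrust's OMP backend is a compile-time switch, so no CUDA compiler is needed. A sketch of the build line, assuming g++ and the Thrust headers on the include path (the file name `compact.cpp` is just a placeholder):

```shell
# Retarget Thrust's "device" algorithms at OpenMP instead of CUDA.
# Assumes g++ and a Thrust installation under /path/to/thrust.
g++ -O2 -fopenmp \
    -DTHRUST_DEVICE_SYSTEM=THRUST_DEVICE_SYSTEM_OMP \
    -I/path/to/thrust \
    compact.cpp -o compact
```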

Regarding calling Thrust from Fortran, most of what you need is here. That is admittedly CUDA Fortran, but the only differences should be not using the `device` attribute, and using `thrust::host_vector` instead of `thrust::device_vector` (at least, to get started).

Robert Crovella