
Here's a simple pyopencl copy_if() example.

First, let's create a large array (2**25) of random ints, and select those below the 500,000 threshold:

import pyopencl as cl
import numpy as np
import my_pyopencl_algorithm
import time

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)

from pyopencl.clrandom import rand as clrand
random_gpu = clrand(queue, (2**25,), dtype=np.int32, a=0, b=10**6)

start = time.time()
final_gpu, count_gpu, evt = my_pyopencl_algorithm.copy_if(random_gpu, "ary[i] < 500000", queue = queue)
final = final_gpu.get()
count = int(count_gpu.get())
print '\ncopy_if():\nresults=',final[:count], '\nfound=', count, '\ntime=', (time.time()-start), '\n========\n'
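
As a quick sanity check on what copy_if() hands back (a small illustration using the variables above; as far as I know pyopencl arrays expose nbytes and dtype just like numpy arrays):

# The returned array is as long as the input (2**25 int32s, roughly 128 MiB),
# even though only the first `count` entries are meaningful.
print 'allocated bytes :', final_gpu.nbytes
print 'meaningful bytes:', count * final_gpu.dtype.itemsize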

You may have noticed that I'm not calling pyopencl's copy_if, but a fork of it (my_pyopencl_algorithm.copy_if). The fork of pyopencl.algorithm.py can be found here.

The beauty of copy_if is that it gives you a ready-made count of the desired output, with the results packed contiguously from index 0 to count-1. What doesn't seem optimal is that it allocates and returns (from the GPU) a buffer the same size as the input, with only the first count entries having meaning. So in my fork of pyopencl.algorithm.py I'm trying to minimize the size of the returned buffer, and I've got this:

def sparse_copy_if(ary, predicate, extra_args=[], preamble="", queue=None, wait_for=None):
    """Copy the elements of *ary* satisfying *predicate* to an output array.

    :arg predicate: a C expression evaluating to a `bool`, represented as a string.
        The value to test is available as `ary[i]`, and if the expression evaluates
        to `true`, then this value ends up in the output.
    :arg extra_args: |scan_extra_args|
    :arg preamble: |preamble|
    :arg wait_for: |explain-waitfor|
    :returns: a tuple *(out, event)* where *out* is the output array, trimmed to
        exactly the number of elements that satisfied *predicate*, and *event* is a
        :class:`pyopencl.Event` for dependency management.

    .. versionadded:: 2013.1
    """
    if len(ary) > np.iinfo(np.int32).max:
        scan_dtype = np.int64
    else:
        scan_dtype = np.int32

    extra_args_types, extra_args_values = extract_extra_args_types_values(extra_args)

    knl = _copy_if_template.build(ary.context,
            type_aliases=(("scan_t", scan_dtype), ("item_t", ary.dtype)),
            var_values=(("predicate", predicate),),
            more_preamble=preamble, more_arguments=extra_args_types)
    out = cl.array.empty_like(ary)
    count = ary._new_with_changes(data=None, offset=0,
            shape=(), strides=(), dtype=scan_dtype)

    # **dict is a Py2.5 workaround
    evt = knl(ary, out, count, *extra_args_values,
            **dict(queue=queue, wait_for=wait_for))

    # Copy the first `count` values from `out` into final_gpu, whose buffer
    # is exactly the size of the meaningful output.
    prg = cl.Program(ary.context, """
        __kernel void copy_final_results(__global int *final_gpu, __global int *out_gpu)
        {
            uint gid = get_global_id(0);
            final_gpu[gid] = out_gpu[gid];
        }
        """).build()

    num_results = int(count.get())

    final_gpu = cl.array.zeros(queue, (num_results,), dtype=ary.dtype)

    prg.copy_final_results(queue, (num_results,), None, final_gpu.data, out.data).wait()

    return final_gpu, evt
    #return out, count, evt

That is, I'm creating a final_gpu buffer exactly the size of the output, then copying the meaningful entries to it, and returning it.
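
As an aside, I suspect the same trim could be done without a hand-written kernel, by replacing the program/kernel step above with a device-to-device copy of just the first num_results elements. A rough, untested sketch, assuming pyopencl.enqueue_copy accepts byte_count for buffer-to-buffer copies (it uses the variables in scope at the end of sparse_copy_if):

    # Allocate a buffer of exactly num_results elements and copy only the
    # meaningful prefix of `out` into it, device to device.
    num_results = int(count.get())
    final_gpu = cl.array.empty(queue, (num_results,), dtype=ary.dtype)
    cl.enqueue_copy(queue, final_gpu.data, out.data,
            byte_count=num_results * ary.dtype.itemsize).wait()
    return final_gpu, evt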

If I now run:

start = time.time()
final_gpu, evt = my_pyopencl_algorithm.sparse_copy_if(random_gpu, "ary[i] < 500000", queue=queue)
final = final_gpu.get()
print '\nsparse_copy_if():\nresults=', final, '\nfound=', len(final), '\ntime=', (time.time()-start), '\n========\n'

... this seems to yield orders-of-magnitude improvements in speed. The sparser the results are, the faster it becomes, since the buffer that has to be transferred back (a high-latency operation) is that much smaller.
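
To put rough numbers on it: the full out buffer for 2**25 int32s is 2**25 * 4 bytes = 128 MiB, while the trimmed final_gpu is only count * 4 bytes, so a predicate that keeps, say, 1% of the elements should transfer roughly 1% of the data.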

My question is: is there a reason we are returning a full-sized buffer? In other words, am I introducing any bugs, or should I just submit a patch?

linhares
    _"am I introducing any bugs, or should I just submit a patch"_ Peoples at pyopencl are probably much more qualified to answer that question. I don't know if some of them are hanging around here. You should directly discuss that on the [dev mailing list](http://lists.tiker.net/listinfo/pyopencl) instead. – Sylvain Leroux Jan 31 '15 at 10:17
  • Yes, I have sent this to the mailing list. Thanks. – linhares Jan 31 '15 at 10:33

0 Answers