1) In the kernel,
get_global_size(0) gives the number of work-items in the x dimension
get_global_size(1) gives the number of work-items in the y dimension (how many rows of items)
get_global_size(2) gives the number of work-items in the z dimension (how many item matrices)
The total number of work-items is the product of these, but if the kernel is launched in only 1 dimension, the first call alone is enough.
get_local_size(0 or 1 or 2)
gives the same thing for the items in one work-group, not the total items.
get_num_groups(0 or 1 or 2)
is similar but gives the number of work-groups in that dimension.
The number of dimensions comes from
int dims = get_work_dim();
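To make the relationship between these values concrete, here is a minimal host-side sketch in plain C. It is not kernel code: the ndrange_model struct and the model_* functions are hypothetical names made up for illustration. They only mimic the arithmetic, assuming each global size is an exact multiple of the local size in that dimension.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical host-side model of a kernel launch; not an OpenCL type. */
typedef struct {
    unsigned dims;       /* what get_work_dim() would return: 1, 2 or 3 */
    size_t   global[3];  /* what get_global_size(d) would return */
    size_t   local[3];   /* what get_local_size(d) would return */
} ndrange_model;

/* Models get_num_groups(d), assuming global[d] is a multiple of local[d]. */
size_t model_num_groups(const ndrange_model *r, unsigned d)
{
    assert(d < r->dims);
    assert(r->local[d] != 0 && r->global[d] % r->local[d] == 0);
    return r->global[d] / r->local[d];
}

/* Total work-items: product of the global sizes over the used dimensions. */
size_t model_total_items(const ndrange_model *r)
{
    size_t total = 1;
    for (unsigned d = 0; d < r->dims; ++d)
        total *= r->global[d];
    return total;
}
```

For a 2-dim launch with global sizes 1024 x 512 and local sizes 16 x 8, this model gives 524288 items in total and 64 groups in each dimension.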
2) Event-based performance queries from host code, for example with the JOCL Events utility class:
http://www.jocl.org/cloth/docs/doc-utils/org/jocl/utils/Events.html
computeExecutionTimeMs(org.jocl.cl_event event)
Compute the execution time for the given event, in milliseconds.
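The arithmetic behind such a query is small. In OpenCL the start and end timestamps of an event are reported in nanoseconds through clGetEventProfilingInfo (CL_PROFILING_COMMAND_START and CL_PROFILING_COMMAND_END), and the command queue has to be created with CL_QUEUE_PROFILING_ENABLE. I am assuming the JOCL helper does something equivalent internally. The function below is a hypothetical helper written only to show the conversion; it does not call OpenCL itself.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: converts two device timestamps, given in
 * nanoseconds, into the elapsed time in milliseconds. In real host
 * code the two values would be read with clGetEventProfilingInfo. */
double elapsed_ms(uint64_t start_ns, uint64_t end_ns)
{
    assert(end_ns >= start_ns);  /* the end timestamp must not precede the start */
    return (double)(end_ns - start_ns) / 1.0e6;
}
```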
1), 2) and 3) A profiler can show all of these except the "each core" part. It does give per-"lane" information; a lane may not map to the same core at all times, but you can still see what a single thread was doing. The visuals and tables in NVIDIA Nsight Visual Studio Edition (https://developer.nvidia.com/nvidia-nsight-visual-studio-edition) give enough information about bottlenecks and kernel hotspots.