If you mean replicating the same kernel several times, you can increase the number of compute units. There is an attribute that you can add before the kernel:
__attribute__((num_compute_units(N)))
__kernel void test(...){
...
}
By doing this you essentially replicate the kernel N times. However, the Programming Guide recommends that you first look into the SIMD attribute (num_simd_work_items), which performs the same operation over multiple data items at once. This way, accesses to global memory become more efficient. With more compute units, kernels that access global memory may run into contention, because the compute units compete for global memory bandwidth.
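As a rough sketch of the SIMD route, assuming the Intel (Altera) FPGA SDK for OpenCL: as far as I know, num_simd_work_items has to be combined with reqd_work_group_size, and the work-group size must be divisible by the SIMD width. The kernel name vec_add, its arguments and the values 64 and 4 are made up for illustration. The #ifndef block is not part of OpenCL; it is only there so the snippet also builds as plain C for a quick host-side check (a host compiler will just warn about the attributes it does not know).

```c
#ifndef __OPENCL_VERSION__
/* Host-only stand-ins (hypothetical, not part of OpenCL) so the sketch builds as plain C. */
#include <stddef.h>
#define __kernel
#define __global
static size_t host_gid;  /* pretend id of the current work-item */
static size_t get_global_id(unsigned int dim) { (void)dim; return host_gid; }
#endif

/* Fixed work-group size of 64, vectorized 4 wide (64 is divisible by 4). */
__attribute__((reqd_work_group_size(64, 1, 1)))
__attribute__((num_simd_work_items(4)))
__kernel void vec_add(__global const float *restrict a,
                      __global const float *restrict b,
                      __global float *restrict sum)
{
    /* Each work-item adds one element; the compiler widens the datapath. */
    size_t gid = get_global_id(0);
    sum[gid] = a[gid] + b[gid];
}
```

On the host, the kernel would then be launched with a local work size of 64 so that it matches reqd_work_group_size.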
You can also replicate operations at a fine-grained level by using loop unrolling. For example,
#pragma unroll N
for(short i = 0; i < N; i++)
    sum[i] = a[i] + b[i];
This performs the element-wise addition of the two vectors for N elements in one go, because the compiler creates hardware for N additions. If an iteration depends on the result of the previous one, the loop is still unrolled, but the replicated operations have to be chained one after another, so you gain less.
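To make the difference between independent and dependent iterations concrete, here is a minimal sketch of two helper functions that could be called from a kernel on private data. The names add_n and reduce_n and the trip count UNROLL_N are hypothetical. It assumes that a bare #pragma unroll fully unrolls a loop with a known trip count, which is how I understand the Intel FPGA compiler to behave; a host compiler that does not know the pragma simply ignores it.

```c
#define UNROLL_N 8  /* hypothetical trip count, known at compile time */

/* Independent iterations: after unrolling, the UNROLL_N additions
   do not depend on each other and can all happen side by side. */
void add_n(const float *a, const float *b, float *sum)
{
    #pragma unroll
    for (int i = 0; i < UNROLL_N; i++)
        sum[i] = a[i] + b[i];
}

/* Loop-carried dependency: every iteration needs acc from the one
   before it, so the unrolled additions feed into each other. */
float reduce_n(const float *a)
{
    float acc = 0.0f;
    #pragma unroll
    for (int i = 0; i < UNROLL_N; i++)
        acc += a[i];
    return acc;
}
```

Both loops get replicated hardware, but only the first one can use all of it at the same time.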
On the other hand, if your goal is to run different kernels with different operations, you can define all of them in one OpenCL file. When you compile the file, the kernels are mapped, placed and routed into the FPGA together. Afterwards, you just need to invoke the kernels from your host by calling clEnqueueNDRangeKernel or clEnqueueTask. The kernels can then run side by side in parallel. Note that commands in a single in-order command queue run one after another, so enqueue each kernel on its own command queue if you want them to overlap.