
I have implemented a small CNN in RenderScript and want to profile the performance on different hardware. On my Nexus 7 the times make sense, but on the NVIDIA Shield they do not.

The CNN (LeNet) is implemented as 9 layers residing in a queue; computation is performed in sequence. Each layer is timed individually.

Here is an example (times in ms):

       conv1  pool1 conv2  pool2 resh1 ip1    relu1  ip2    softmax
nexus7 11.177 7.813 13.357 8.367 8.097 2.1    0.326  1.557  2.667
shield 13.219 1.024 1.567  1.081 0.988 14.588 13.323 14.318 40.347

The distribution of the times is about right for the Nexus, with conv1 and conv2 (the convolution layers) taking most of the time. But on the Shield, the times drop way below what's reasonable for the middle layers (pool1 through resh1) and seem to pile up towards the end. The softmax layer is a relatively small job, so 40 ms is way too large. Either my timing method is faulty, or something else is going on.

The code running the layers looks something like this:

double[] times = new double[layers.size()];
int layerindex = 0;
for (Layer a : layers) {

    double t = SystemClock.elapsedRealtime();
    //double t = System.currentTimeMillis(); // makes no difference

    blob = a.forward(blob); // here we call RenderScript forEach_(), invoke_(), etc.

    //mRS.finish(); // makes no difference

    t = SystemClock.elapsedRealtime() - t;
    //t = System.currentTimeMillis() - t; // makes no difference

    times[layerindex] += t; // later we take the average, etc.

    layerindex++;
}

It is my understanding that once forEach_() returns, the job is supposed to be finished. In any case, mRS.finish() should provide a final barrier. But looking at the times, the only reasonable explanation is that the jobs are still being processed in the background.

The app is very simple: I just run the test from MainActivity and print to logcat. Android Studio builds the app as a release build and runs it on the device, which is connected by USB.

(1) What is the correct way to time RenderScript processes? (2) Is it true that when forEach_() returns, the threads spawned by the script are guaranteed to be done? (3) In my test app, I simply run directly from the MainActivity. Is this a problem (other than blocking the UI thread and making the app unresponsive)? If this influences the timing or causes the weirdness, what is a proper way to set up a test app like this?

frankhond

2 Answers

I've implemented CNNs in RenderScript myself, and as you describe, it does require chaining multiple processes and calling forEach_*() several times, once per layer, if you implement each layer as a separate kernel. As such, I can assure you that the return of a forEach call does not guarantee that the process has completed. In principle, the call only schedules the kernel, and all queued-up requests will actually run whenever the system determines it's best to, especially if they get processed on the tablet's GPU.

Usually, the only way to make absolutely sure a kernel has truly run is to explicitly read the output of the RS kernel in between layers, for example by calling .copyTo() on the kernel's output Allocation object. This "forces" any queued-up RS jobs that have not run yet (and on which that layer's output allocation depends) to execute at that time. Granted, this may introduce data-transfer overheads, so your timing will not be fully accurate -- in fact, the execution time of the full network will quite surely be lower than the sum of the individual layers timed in this manner. But as far as I know, it's the only reliable way to time individual kernels in a chain, and it will give you some feedback to find out where the bottlenecks are and to better guide your optimization, if that's what you're after.
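
For illustration, here is a minimal sketch of what your timing loop could look like with such an explicit sync point. I'm assuming blob is (or exposes) the layer's output Allocation and hostBuffer is a preallocated float[] of matching size; adapt to your actual types:

    double[] times = new double[layers.size()];
    int layerindex = 0;
    for (Layer a : layers) {
        double t = SystemClock.elapsedRealtime();

        blob = a.forward(blob);

        // Reading the output back to the host blocks until every queued
        // kernel this Allocation depends on has actually executed.
        blob.copyTo(hostBuffer);

        times[layerindex] += SystemClock.elapsedRealtime() - t;
        layerindex++;
    }

Note that the copy itself is included in the measurement, so treat these numbers as relative indicators of where the time goes rather than exact kernel times.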

monoeci
  • Thanks! I have read your other questions which contributed to my understanding of RenderScript. May I pick your brain some more? When testing something like this on a tablet with a GPU like the Shield, what would be a good way to find out if the code actually runs on the GPU? – frankhond May 07 '16 at 10:18
  • Hmm, as usual it's a bit difficult to tell where an RS kernel will execute, but once you have your timing setup fixed, you can do test runs with your kernels running normally, and then while forcing RS to use the CPU only with `adb shell setprop debug.rs.default-CPU-driver 1` (and 0 to turn it back off; make sure to fully terminate the app after changing this for it to take effect). If your kernel is well built and it really does use the GPU, you should see a noticeable difference; especially in convolutions you should get at least a 4-6x speedup compared to CPU-only mode. – monoeci May 09 '16 at 23:58
  • Nice, I'll try that! Was hoping there was a profiler available somewhere, but this would certainly give an indication. Thanks again for your help! – frankhond May 10 '16 at 19:53
  • This is a great answer. And on certain Nvidia drivers, mRS.finish() is actually treated as a NOOP. – Miao Wang May 13 '16 at 21:15
  • 1
    I have now tried "adb shell setprop debug.rs.default-CPU-driver 1" and the result is puzzling. The command gives a speedup of about 10x on both Nvidia Shield and Nexus 7. But this is supposed to force RS to use default reference CPU implementation. What is the explanation for this? The reference app goes from about 50 to 130 fps... – frankhond May 14 '16 at 14:51
  • 1
    I added a new question about this here: http://stackoverflow.com/questions/37228427/renderscript-speedup-10x-when-forcing-default-cpu-implementation – frankhond May 14 '16 at 15:26

Maybe a little off topic, but for a CNN, if you can structure your algorithm with matrix-matrix multiplication as the basic computing block, you can actually use RenderScript IntrinsicBLAS, especially BNNM and SGEMM.

Pros:

  1. High-performance implementation of 8-bit matrix multiplication (BNNM), available in the N Preview.
  2. Backward support down to Android 2.3 through the RenderScript support lib, when using Build-Tools 24.0.0 rc3 and above.
  3. High-performance GPU acceleration of SGEMM on the Nexus 5X and 6P with the N Preview build NPC91K.
  4. If you only use RenderScript intrinsics, you can code everything in Java.

Cons:

  1. Your algorithm may need to be refactored so that it is based on 2D matrix multiplication (see the im2col sketch after this list).
  2. BNNM is available in Android 6.0, but its performance in 6.0 is not satisfactory, so it is better to use the support lib for BNNM and set targetSdkVersion to 24.
  3. SGEMM GPU acceleration is currently only available on the Nexus 5X and Nexus 6P, and it currently requires the width and height of the matrices to be multiples of 8.
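
To make point 1 concrete, here is a rough sketch of im2col, the usual way of lowering a convolution to the 2D matrix multiplication these intrinsics expect. This is my own illustration with hypothetical names, stride 1, and no padding, not library code:

    // Unfold each kH x kW patch of a [channels][height][width] input into
    // one row, so convolution becomes: patch matrix (outH*outW x channels*kH*kW)
    // times the weight matrix (channels*kH*kW x numFilters).
    static float[] im2col(float[] in, int channels, int height, int width,
                          int kH, int kW) {
        int outH = height - kH + 1;
        int outW = width - kW + 1;
        float[] cols = new float[outH * outW * channels * kH * kW];
        int idx = 0;
        for (int y = 0; y < outH; y++)
            for (int x = 0; x < outW; x++)
                for (int c = 0; c < channels; c++)
                    for (int ky = 0; ky < kH; ky++)
                        for (int kx = 0; kx < kW; kx++)
                            cols[idx++] = in[(c * height + y + ky) * width + (x + kx)];
        return cols;
    }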

It's worth checking whether BLAS fits into your algorithm, and it is easy to use:

    import android.support.v8.renderscript.*;
    // if you are not using support lib:
    // import android.renderscript.*;

    private void runBNNM(int m, int n, int k, byte[] a_byte, byte[] b_byte, int c_offset, RenderScript mRS) {
        Allocation A, B, C;
        Type.Builder builder = new Type.Builder(mRS, Element.U8(mRS));
        Type a_type = builder.setX(k).setY(m).create();
        Type b_type = builder.setX(k).setY(n).create();
        Type c_type = builder.setX(n).setY(m).create();

        // If you are reusing the input Allocations, just create and cache them somewhere else.
        A = Allocation.createTyped(mRS, a_type);
        B = Allocation.createTyped(mRS, b_type);
        C = Allocation.createTyped(mRS, c_type);
        A.copyFrom(a_byte);
        B.copyFrom(b_byte);

        ScriptIntrinsicBLAS blas = ScriptIntrinsicBLAS.create(mRS);
        // Computes: C = A * B.Transpose
        int a_offset = 0;
        int b_offset = 0;
        // c_offset is already passed in as a method parameter above.
        int c_multiplier = 1;
        blas.BNNM(A, a_offset, B, b_offset, C, c_offset, c_multiplier);
    }

SGEMM is similar:

        ScriptIntrinsicBLAS blas = ScriptIntrinsicBLAS.create(mRS);
        // Construct the Allocations: A, B, C somewhere and make sure the dimensions match.
        // Computes: C = 1.0f * A * B + 0.0f * C
        float alpha = 1.0f;
        float beta = 0.0f;
        blas.SGEMM(ScriptIntrinsicBLAS.NO_TRANSPOSE, ScriptIntrinsicBLAS.NO_TRANSPOSE,
                   alpha, A, B, beta, C);
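
For reference, a minimal sketch of how the F32 Allocations for that SGEMM call could be constructed; m, n, k and the float arrays are assumed names, analogous to the BNNM example above:

    // A is m x k, B is k x n, C is m x n. For 2D Types, setX() is the
    // number of columns and setY() the number of rows.
    Type.Builder fb = new Type.Builder(mRS, Element.F32(mRS));
    Allocation A = Allocation.createTyped(mRS, fb.setX(k).setY(m).create());
    Allocation B = Allocation.createTyped(mRS, fb.setX(n).setY(k).create());
    Allocation C = Allocation.createTyped(mRS, fb.setX(n).setY(m).create());
    A.copyFrom(a_float); // float[] of length m * k, row-major
    B.copyFrom(b_float); // float[] of length k * n, row-major

After the SGEMM call, C.copyTo(c_float) reads the result back; as noted in the other answer, that copy also acts as a synchronization point.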

Miao Wang
  • Thanks for the examples. I have already implemented convolution using im2col and matrix multiplication with SGEMM, as well as a moving-kernel version. The reason for my question is that I'm trying to profile these two algorithms against each other. Depending on hardware, I'm getting wildly different results, and I'm trying to figure out why. Your BNNM example is very useful as I was going to move to that next. – frankhond May 14 '16 at 06:36
  • Also thanks for the GPU info above, very useful. Can this be found online somewhere? – frankhond May 14 '16 at 06:43
  • Not yet. But there will be more tutorial & documentation about it. I will update it here once that is available, stay tuned. Also, if you have any suggestions or feature requests for RenderScript, please let me know. I can forward it to the RenderScript team. Thanks! – Miao Wang May 14 '16 at 23:22
  • 1
    @MiaoWang Can you please have a look into this: http://stackoverflow.com/questions/40452679/renderscript-c-style-pointer-usage-performance-issue and https://code.google.com/p/android/issues/detail?id=227607 – Vardan95 Nov 17 '16 at 13:39