Thursday, January 30, 2014

RenderScript in Android - anatomy of the benchmark program

 In the previous post I have presented our RenderScript benchmark and demonstrated that RenderScript implementation of the same algorithm can be 2-3 times faster than Java. How can a "script" be so fast? In order to understand this speed difference, let's see how the RenderScript fragment is executed.

The example program is attached to the end of this post. You have to be logged to the Sfonge site to access it.

First, let's see how the script looks like. The source can be found in, in the same directory where other Java sources (just one file in this case) are.

It looks like an innocent C function but there are some specialties. All the global variables like these ones:

int32_t s2len = 0;
int32_t *d0;

can be used to pass data to the script.  The toolchain generates a Java wrapper for each .rs file, ours is called In order to set s2len, for example, one calls the set_s2len function in the ScriptC_dtw class. The d0 global variable is a pointer type, setting this variable requires an Allocation Java object. Open and look up the findReferenceSignalC99() method. There you will find:

        Allocation signal1Allocation = Allocation.createSized(

Here we created an allocation that holds 16-bit integers, copied the input signal into it and bound the allocation so that the allocation's data is available to the script. When the script is invoked, the data area of this allocation is simply available to the script as:

int16_t *signal1;

This sort of parameter passing is one-way for simple values but two-way for allocations. So whatever you write in your script into e.g. s2len, you won't be able to read it in the Java layer after the script finishes executing. In contrast, Allocations provide two-way data transfer, that's why the result value is passed back to Java in an Allocation.


        Allocation rAllocation = Allocation.createSized(


        *r = d1[s1len-1];

And again in after the script finished executing:

        int result[] = new int[1];
        int maxc = result[0];

The execution of the script seems simple enough but there's more than meets the eye.

Execution context and the script instance are created.

        RenderScript rsC = RenderScript.create( this);
        ScriptC_dtw script = new ScriptC_dtw(

ScriptC_dtw is the wrapper which was generated by RenderScript toolchain. But what is R.raw.dtw? Let's see how our "script" was turned into executable code. If you unzip the APK file, you find some interesting artifacts. Under the res/raw directory, you find dtw.bc. This is the LLVM bytecode that was compiled to. In addition, under the lib directory, you will find .so files for the ARM, MIPS and Intel platform. If you disassemble, you will find highly optimized binary compiled from our script which is really a piece of valid C code.

The RenderScript name generates a confusion. This name evokes a proprietary scripting language when in fact it is Clang's C99 front-end compiler with a set of libraries that are ported to a large number of processors. Optimized C code is fast, what is so surprising in it? When our "script" is executed on an ARM processor, the RenderScript runtime just has to load the precompiled ARM code and execute it. If it turns out that there is no precompiled native code for the target processor (e.g. it is a GPU) then LLVM backend compiler swings into action and generates code for that processor at installation time. Both compilation steps (from C to LLVM bytecode and from LLVM bytecode to native) are subject to optimization so the resulting native code is very fast. No wonder therefore that RenderScript beats Dalvik VM so easily and with such a large margin.

After all the global variables have been initialized, the script can be invoked.


Note the finish() invocation here. The invoke_dtw() method is asynchronous meaning that when it returns, the execution of the "script" has not finished, in fact, it was not even started. The finish() method on the RenderScript instance blocks until the script invocations on that context all finish. Script invocations in the same context are executed sequentially.

But what happens when more than one context is created? Allocations and script execution in those contexts are independent. If you have enough cores/processors, script invocations in those contexts will execute in parallel. Be aware, however, that if you create more contexts than the number of processing units you have then those contexts will compete for the same processing units by means of context switching and these context switches will eventually decrease your performance. If your algorithm requires an element scan which is more complicated than the sequence that foreach() supports, you can always create a dummy allocation with as many elements as the processing elements your algorithm supports and release foreach() on that dummy allocation. Then your kernel will access elements of the data set in any order it wishes.

How does RenderScript compare to established technologies like Android SDK or NDK? For Google, the equation is simple: RenderScript is mainly for GPUs, hence its name. I tried to present the case here that for an average Android programmer, RenderScript provides a much more productive way to offload computation-intensive code fragments to highly optimized native code than NDK. RenderScript is integrated with the Android SDK, compilation is super-fast, wrappers are generated automatically, JNI issues are non-existing, coding parallel execution is simpler than either with the SDK or with the NDK. Faster execution also means lower battery consumption as this presentation demonstrated in a different context. And who knows, one day a device with a multicore CPU, GPU or DSP comes along that speeds up your application even further, at no cost. As RenderScript has LLVM at its heart, the possibility is there.

No comments: