In the previous post I presented our RenderScript benchmark and
demonstrated that a RenderScript implementation of the same algorithm
can be 2-3 times faster than Java. How can a "script" be so fast? To
understand this speed difference, let's see how the
RenderScript fragment is executed.
The example program is available here.
First, let's see what the script looks like. The source can be found in dtw.rs, in the same directory as the Java sources (just one file in this case).
It looks like an innocent C function, but it has some
peculiarities. Global variables like these:
int32_t s2len = 0;
int32_t *d0;
can be used to pass data to the script. The toolchain generates a Java wrapper for each .rs file; ours is called ScriptC_dtw.java. To set s2len, for example, one calls the set_s2len method of the ScriptC_dtw class. The d0 global variable is a pointer, so setting it requires an Allocation Java object. Open MainActivity.java and look up the findReferenceSignalC99() method. There you will find:
Allocation signal1Allocation = Allocation.createSized(
rsC,
Element.I16(rsC),
refSignal.length);
signal1Allocation.copyFrom(refSignal);
script.bind_signal1(signal1Allocation);
Here we created an allocation that holds 16-bit integers, copied the input signal into it, and bound the allocation so that its data is visible to the script. When the script is invoked, the data area of this allocation is available to the script as:
int16_t *signal1;
This sort of parameter passing is one-way for simple values but two-way for allocations. Whatever your script writes into a simple global such as s2len cannot be read back in the Java layer after the script finishes executing. Allocations, in contrast, provide two-way data transfer, which is why the result value is passed back to Java in an Allocation.
In MainActivity.java:
Allocation rAllocation = Allocation.createSized(
rsC,
Element.I32(rsC),
1);
script.bind_r(rAllocation);
In dtw.rs:
*r = d1[s1len-1];
And again in MainActivity.java, after the script has finished executing:
int result[] = new int[1];
rAllocation.copyTo(result);
...
int maxc = result[0];
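To make the line *r = d1[s1len-1] a little more concrete, here is a plain-Java sketch of the kind of computation that could sit behind it. It is not the code from dtw.rs or MainActivity.java. It assumes that dtw stands for dynamic time warping, that the local cost is the absolute difference of two samples, and that d0 and d1 are the previous and current rows of the cost table. The class name DtwSketch and the method signature are made up for illustration.

```java
// Hypothetical illustration only; not the code from dtw.rs.
final class DtwSketch {

    // Two-row dynamic time warping over two non-empty signals.
    // d0 holds the previous row of the cost table, d1 the row being filled.
    static int dtw(short[] s1, short[] s2) {
        if (s1.length == 0 || s2.length == 0) {
            throw new IllegalArgumentException("signals must not be empty");
        }
        int[] d0 = new int[s1.length];
        int[] d1 = new int[s1.length];
        for (int j = 0; j < s2.length; j++) {
            for (int i = 0; i < s1.length; i++) {
                int cost = Math.abs(s1[i] - s2[j]); // assumed local cost
                int best;
                if (i == 0 && j == 0) {
                    best = 0;                        // start of the warping path
                } else if (j == 0) {
                    best = d1[i - 1];                // first row: only the left neighbour exists
                } else if (i == 0) {
                    best = d0[0];                    // first column: only the upper neighbour exists
                } else {
                    best = Math.min(d0[i], Math.min(d1[i - 1], d0[i - 1]));
                }
                d1[i] = cost + best;
            }
            // The row just filled becomes the "previous" row of the next iteration.
            int[] tmp = d0;
            d0 = d1;
            d1 = tmp;
        }
        // After the final swap the last completed row is in d0.
        return d0[s1.length - 1];
    }

    public static void main(String[] args) {
        short[] a = {1, 2, 3};
        short[] b = {1, 2, 2, 3};
        System.out.println(dtw(a, b));
    }
}
```

Keeping only two rows bounds the memory needed by the length of one signal, which is presumably why the script exposes two row buffers rather than a full matrix.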
The execution of the script seems simple enough, but there is more to it than meets the eye.
First, the execution context and the script instance are created:
RenderScript rsC = RenderScript.create(this);
ScriptC_dtw script = new ScriptC_dtw(
rsC,
getResources(),
R.raw.dtw);
ScriptC_dtw is the wrapper that was generated by the RenderScript toolchain. But what is R.raw.dtw? Let's see how our "script" was turned into executable code. If you unzip the APK file, you find some interesting artifacts. Under the res/raw directory, you find dtw.bc. This is the LLVM bitcode that dtw.rs was compiled to. In addition, under the lib directory, you will find .so files for the ARM, MIPS and Intel platforms. If you disassemble librs.dtw.so, you will find a highly optimized binary compiled from our script, which is really a piece of valid C code.
The name RenderScript causes confusion. It evokes a proprietary scripting language when in fact it is Clang's C99 front-end compiler with a set of libraries that are ported to a large number of processors. Optimized C code is fast; what is so surprising about that? When our "script" is executed on an ARM processor, the RenderScript runtime just has to load the precompiled ARM code and execute it. If it turns out that there is no precompiled native code for the target processor (e.g. it is a GPU), then the LLVM backend compiler swings into action and generates code for that processor at installation time. Both compilation steps (from C to LLVM bitcode and from LLVM bitcode to native code) are subject to optimization, so the resulting native code is very fast. No wonder, therefore, that RenderScript beats the Dalvik VM so easily and by such a large margin.
After all the global variables have been initialized, the script can be invoked.
script.invoke_dtw();
rsC.finish();
Note the finish() invocation here. The invoke_dtw() method is asynchronous: when it returns, the "script" has not necessarily finished executing and may not even have started. The finish() method on the RenderScript instance blocks until all script invocations on that context have finished. Script invocations in the same context are executed sequentially.
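The queue-then-wait behaviour can be modelled in plain Java, without any Android classes. The sketch below is only an analogy: ContextQueueSketch is a made-up class, not part of the RenderScript API. It assumes that a context behaves like a single worker thread that runs queued invocations one after another, with invoke() standing in for invoke_dtw() and finish() for rsC.finish().

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical stand-in for one RenderScript context; not an Android API.
final class ContextQueueSketch {
    // One worker thread: queued work runs one item at a time, in submission order.
    private final ExecutorService worker = Executors.newSingleThreadExecutor();
    private final List<Future<?>> pending = new ArrayList<>();

    // Like invoke_dtw(): queues the work and returns immediately.
    void invoke(Runnable script) {
        pending.add(worker.submit(script));
    }

    // Like rsC.finish(): blocks until everything queued so far has completed.
    void finish() {
        try {
            for (Future<?> f : pending) {
                f.get();
            }
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pending.clear();
        }
    }

    // Stops the worker thread once the context is no longer needed.
    void destroy() {
        worker.shutdown();
    }

    // Queues two "scripts", waits for both, and returns the order in which they ran.
    static List<String> demo() {
        List<String> log = Collections.synchronizedList(new ArrayList<>());
        ContextQueueSketch context = new ContextQueueSketch();
        try {
            context.invoke(() -> log.add("first"));
            context.invoke(() -> log.add("second"));
            context.finish();
            return new ArrayList<>(log);
        } finally {
            context.destroy();
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Reading the results before finish() returns would be a race in this model, which is the same reason rAllocation.copyTo(result) belongs after rsC.finish() in the real code.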
But what happens when more than one context is created? Allocations and script execution in those contexts are independent. If you have enough cores/processors, script invocations in those contexts can execute in parallel. Be aware, however, that if you create more contexts than you have processing units, those contexts will compete for the same processing units through context switching, and these context switches will eventually decrease your performance.
If your algorithm requires an element scan that is more complicated than the sequence that foreach() supports, you can always create a dummy allocation with as many elements as the number of processing elements your algorithm supports and launch foreach() on that dummy allocation. Your kernel can then access the elements of the data set in any order it wishes.
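The dummy-allocation trick can be pictured with a plain-Java analogy. This is not RenderScript code. It assumes that the index range 0..workers-1 plays the role of the dummy allocation and that the lambda plays the role of the kernel, which uses its index only to decide which elements of the real data set to visit. The class and method names are made up for illustration.

```java
import java.util.stream.IntStream;

// Hypothetical analogy for the dummy-allocation trick; not RenderScript code.
final class DummyIndexSketch {

    // Each "kernel invocation" w receives only its own index and then walks
    // the real data set in whatever pattern it likes (here: a stride).
    static long[] stridedSums(int[] data, int workers) {
        long[] partial = new long[workers];
        IntStream.range(0, workers)      // stands in for the dummy allocation
                .parallel()              // invocations may run on several cores
                .forEach(w -> {
                    long sum = 0;
                    for (int i = w; i < data.length; i += workers) {
                        sum += data[i];
                    }
                    partial[w] = sum;    // each invocation writes only its own slot
                });
        return partial;
    }

    public static void main(String[] args) {
        int[] data = {1, 2, 3, 4, 5, 6, 7};
        long[] sums = stridedSums(data, 3);
        System.out.println(sums[0] + " " + sums[1] + " " + sums[2]);
    }
}
```

Because every invocation writes to a distinct output slot, the invocations do not need to synchronize with each other. A real kernel launched over a dummy allocation would want the same property.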
How does RenderScript compare to established technologies like the Android SDK or NDK? For Google, the equation is simple: RenderScript is mainly for GPUs, hence its name. I have tried to make the case here that, for an average Android programmer, RenderScript provides a much more productive way than the NDK to offload computation-intensive code fragments to highly optimized native code. RenderScript is integrated with the Android SDK, compilation is very fast, wrappers are generated automatically, JNI issues are nonexistent, and coding parallel execution is simpler than with either the SDK or the NDK. Faster execution also means lower battery consumption, as this presentation demonstrated in a different context. And who knows, one day a device with a multicore CPU, GPU or DSP may come along that speeds up your application even further, at no cost. As RenderScript has LLVM at its heart, the possibility is there.