Thursday, January 30, 2014

RenderScript in Android - anatomy of the benchmark program

 In the previous post I have presented our RenderScript benchmark and demonstrated that RenderScript implementation of the same algorithm can be 2-3 times faster than Java. How can a "script" be so fast? In order to understand this speed difference, let's see how the RenderScript fragment is executed.

The example program is attached to the end of this post. You have to be logged to the Sfonge site to access it.

First, let's see how the script looks like. The source can be found in dtw.rs, in the same directory where other Java sources (just one file in this case) are.




It looks like an innocent C function but there are some specialties. All the global variables like these ones:

int32_t s2len = 0;
int32_t *d0;


can be used to pass data to the script.  The toolchain generates a Java wrapper for each .rs file, ours is called ScriptC_dtw.java. In order to set s2len, for example, one calls the set_s2len function in the ScriptC_dtw class. The d0 global variable is a pointer type, setting this variable requires an Allocation Java object. Open MainActivity.java and look up the findReferenceSignalC99() method. There you will find:

        Allocation signal1Allocation = Allocation.createSized(
                rsC,
                Element.I16(rsC),
                refSignal.length);
        signal1Allocation.copyFrom(refSignal);
        script.bind_signal1(signal1Allocation);


Here we created an allocation that holds 16-bit integers, copied the input signal into it and bound the allocation so that the allocation's data is available to the script. When the script is invoked, the data area of this allocation is simply available to the script as:

int16_t *signal1;

This sort of parameter passing is one-way for simple values but two-way for allocations. So whatever you write in your script into e.g. s2len, you won't be able to read it in the Java layer after the script finishes executing. In contrast, Allocations provide two-way data transfer, that's why the result value is passed back to Java in an Allocation.

In MainActivity.java:

        Allocation rAllocation = Allocation.createSized(
                rsC,
                Element.I32(rsC),
                1);
        script.bind_r(rAllocation);


In dtw.rs:

        *r = d1[s1len-1];

And again in MainActivity.java after the script finished executing:

        int result[] = new int[1];
        rAllocation.copyTo(result);
...
        int maxc = result[0];


The execution of the script seems simple enough but there's more than meets the eye.

Execution context and the script instance are created.

        RenderScript rsC = RenderScript.create( this);
        ScriptC_dtw script = new ScriptC_dtw(
                rsC,
                getResources(),
                R.raw.dtw);


ScriptC_dtw is the wrapper which was generated by RenderScript toolchain. But what is R.raw.dtw? Let's see how our "script" was turned into executable code. If you unzip the APK file, you find some interesting artifacts. Under the res/raw directory, you find dtw.bc. This is the LLVM bytecode that dtw.rs was compiled to. In addition, under the lib directory, you will find .so files for the ARM, MIPS and Intel platform. If you disassemble librs.dtw.so, you will find highly optimized binary compiled from our script which is really a piece of valid C code.



The RenderScript name generates a confusion. This name evokes a proprietary scripting language when in fact it is Clang's C99 front-end compiler with a set of libraries that are ported to a large number of processors. Optimized C code is fast, what is so surprising in it? When our "script" is executed on an ARM processor, the RenderScript runtime just has to load the precompiled ARM code and execute it. If it turns out that there is no precompiled native code for the target processor (e.g. it is a GPU) then LLVM backend compiler swings into action and generates code for that processor at installation time. Both compilation steps (from C to LLVM bytecode and from LLVM bytecode to native) are subject to optimization so the resulting native code is very fast. No wonder therefore that RenderScript beats Dalvik VM so easily and with such a large margin.

After all the global variables have been initialized, the script can be invoked.

        script.invoke_dtw();
        rsC.finish();


Note the finish() invocation here. The invoke_dtw() method is asynchronous meaning that when it returns, the execution of the "script" has not finished, in fact, it was not even started. The finish() method on the RenderScript instance blocks until the script invocations on that context all finish. Script invocations in the same context are executed sequentially.

But what happens when more than one context is created? Allocations and script execution in those contexts are independent. If you have enough cores/processors, script invocations in those contexts will execute in parallel. Be aware, however, that if you create more contexts than the number of processing units you have then those contexts will compete for the same processing units by means of context switching and these context switches will eventually decrease your performance. If your algorithm requires an element scan which is more complicated than the sequence that foreach() supports, you can always create a dummy allocation with as many elements as the processing elements your algorithm supports and release foreach() on that dummy allocation. Then your kernel will access elements of the data set in any order it wishes.

How does RenderScript compare to established technologies like Android SDK or NDK? For Google, the equation is simple: RenderScript is mainly for GPUs, hence its name. I tried to present the case here that for an average Android programmer, RenderScript provides a much more productive way to offload computation-intensive code fragments to highly optimized native code than NDK. RenderScript is integrated with the Android SDK, compilation is super-fast, wrappers are generated automatically, JNI issues are non-existing, coding parallel execution is simpler than either with the SDK or with the NDK. Faster execution also means lower battery consumption as this presentation demonstrated in a different context. And who knows, one day a device with a multicore CPU, GPU or DSP comes along that speeds up your application even further, at no cost. As RenderScript has LLVM at its heart, the possibility is there.

Monday, January 27, 2014

RenderScript in Android - Java optimization in the benchmark program

I got into discussion with Michael Leahy with regards to the benchmark program posted at the end of the previous post. Michael claimed that the Java implementation of the benchmark algorithm in my test program is suboptimal and he contributed an optimized version that he claims could deliver 70% better performance. 
Therefore I decided to re-evaluate the benchmark results - but with a twist. Not only I added Michael's optimized Java implementation but I optimized the RenderScript implementation too. The results are the following.
  • Michael's implementation did improve the execution time of the Java implementation by about 2.6 times.
  • RenderScript implementation is still about 2.3 times faster than Michael's optimized Java implementation.
The new version is attached to the end of this post. You need to be logged to Sfonge site to access it. I will continue with explaining this example program in the next post.

Saturday, January 25, 2014

RenderScript in Android - the benchmark program

The previous post of this series tried to present argument, why it is a good idea to consider implementing part of your code in RenderScript, instead of Java or native code. This part presents a benchmark that we will use to evaluate the technology. I intentionally chose the benchmark not to be an image manipulation algorithm or a convolution filter. I wanted a use case that processes large amount of data. Sound processing yields itself as a quite evident choice.

The algorithm I am going to present is quite naive and would require a lot of refinement for commercial deployment. But my goal was RenderScript benchmarking so the algorithm's tolerance to all kinds of environmental disturbances was not relevant. We will do voice fingerprinting that takes a voice sample array, compares this sample to a reference sample and returns true or false depending on whether the voice sample is sufficiently similar to the reference sample. The algorithm should return true if the voice sample contains the same word from the same speaker, false otherwise.

I implemented Dynamic Time-Warping (DTW) for the purpose. DTW compares two data series assuming that the timing between the two series may be different. For example the same speaker may say the same word with slightly higher pitch. Therefore DTW calculates a so called warping path, a pair of corresponding timestamps from series1 and series2. E.g. a part of a warping path may be (1,1), (1,2),(2,3),(3,3) meaning that a[1] (data at index #1 from series1) corresponds to b[1], a[1] corresponds to b[2], a[2] corresponds to b[3] and a[3] corresponds to b[3]. Knowing the warping path and a cost function, the distance between the two series can be calculated. DTW finds the warping path with the lowest cumulative cost (i.e. the sum of cost between the corresponding data items along the warping path). A detailed introduction to the algorithm can be found here.

This example program implements classic DTW calculating every possible pairing between the data items of the two series. This requires an NxM matrix where N is the size of series1 and M is the size of series2. DTW(i,n) (element at position (i,n) in this matrix) is the accumulated cost value (sum of the cost function values) of the optimal warping path that leads through the position at i,n. (i.e. with assuming that index i in series1 and index n in series2 correspond to each other). The function calculating DTW(i,n) is thus:

DTW(i,n)=min(DTW(i-1,n),DTW(i-1,n-1),DTW(i,n-1))+c(i,n)

where c(i,n) is the cost between a[i] and b[n]. The example program uses

c(i,n)=abs(a[i]-b[n]))

Cells at the edge of the matrix are initialized like

DTW(i,0)=sum(c(i,0))

and

DTW(0,i)=sum(c(0,i))

Classic DTW can require a large amount of space. Take for example a short voice sample. With a sampling rate of 44100Hz/5=8820Hz, even a short word can generate easily a vector of 5000-8000 elements. If the reference vector is of a similar size, the DTW matrix will consist of 25 million to 64 million elements, all of them integer (meaning 4 bytes each). A data item of that size may easily exhaust the fixed maximum process space allocated to Android applications. Therefore we do a simple optimization. The output of our algorithm is the optimal distance which will be found at DTW(M,N), the right-bottom element in the matrix. In order to calculate elements in row i, we need only elements in row i-1 plus the already calculated elements in row i. We can drop row i-2 and calculate DTW(i,0) on the fly, when we get there. This will reduce the memory requirement to 2*M data items instead of N*M.

DTW, however naive its application in the example program, satisfies our RenderScript benchmark needs because

  • The kernel function is non-linear, it is not so evident to break it down to sub-ranges (although FastDTW does precisely that).
  • Calculation of matrix cells depend on an intermediate result of the calculation and not only on the input values.

The example program is attached to the end of this post. You need a free registration to the Sfonge site to access it. The application exercises the DTW algorithm on speech samples. Launch the application and record a reference sample. E.g. record a word like "four". Then record a same or different word from the same or different speaker and observe the distance that DTW calculates. If the environment is not too noisy, you will find that same word from the same speaker yields a (normalized) distance below 1000-1200 while other words from the same speaker or the same word from a different speaker yields higher distance. The calculation takes a bit of time because the DTW algorithm is executed twice, once in Java and once in RenderScript. The application displays the execution time for both implementations.



For now, you can observe that RenderScript execution times are 4-5 times smaller than Java execution times even (Update: visit this post for an updated number. After optimization, RenderScript is still 2.3 faster). though the implementation does not exploit any parallelization features. I will go deeper into this observation in the next post.

Wednesday, January 22, 2014

RenderScript in Android - why should you care?

Before 3.0, Android application development meant either using the SDK - which is a Java toolchain - or the NDK - which is a C/C++ toolchain based on the gcc compiler. The 3.0 version added a third option, RenderScript. RenderScript was positioned as an auxiliary toolchain for special tasks like image manipulation operations. In this post I will investigate whether it makes sense to use RenderScript for more mainstream coding problems.

If you search the web for RenderScript, you won't get too many hits. This is due to the fact that RenderScript is really just the Google name for the technology which is based on two powerful software packages: the clang compiler and the LLVM "virtual machine". Clang is a C/C++ front-end compiler and is a strong challenger to gcc. RenderScript code is processed by clang and is turned into LLVM bytecode. LLVM is not a virtual machine in a strong sense as it does not "execute" this bytecode. Rather, it provides a second compilation step when it turns the LLVM bytecode into machine code of the target processor.

Much like Android's new runtime, the ART, LLVM is an Ahead-of-Time compiler. The second compilation step either happens on the developer's machine or directly on the device. The first option means that the Clang-LLVM toolchain generates code for 3 processor families: ARM, MIPS and Intel so your APK file already contains native code generated from RenderScript code for these processors and no further compilation is needed on the device. RenderScript code may run on graphical co-processors and other DSPs, however. In this case the on-device LLVM compiler accesses the LLVM bytecode which is also packaged into the APK file and may generate code for these specialized processors at install time.

 So if you write RenderScript instead of native implementation with the traditional NDK toolchain, you get the performance of the native code, support of many processor architectures including more exotic GPUs and DSPs without any hassles, get a simpler integration with the Java code and a much smoother integration with the Eclipse developer environment.

RenderScript appeared in Android 3.0 and supporting Java packages reside under the android.renderscript hierarchy. This is called native RenderScript. Google, however, backported the feature to Android 2.2 (and higher) and this backported version is now the recommended way. Instead of being part of the platform, RenderScript support now resides in a support library and is under the android.support.v8.renderscript package hierarchy. Functionality of android.renderscript and android.support.v8.renderscript are largely similar.

Even though the RenderScript runtime is general-purpose, Google positions the technology as a solution for specific tasks. The RenderScript tutorial, for example, concentrates mainly on the filtering runtime that executes a RenderScript function - so called "kernel" - on a large number of data items. This assumes a certain class of algorithm, i.e. when the output of the algorithm depends solely on the input and not on some intermediate result of the computation. Simple convolution filters (e.g. the 2D filters used widely for image filtering) are such algorithms and therefore they are very suitable for parallel execution on multiple CPU cores. We will see, however, that RenderScript can be exploited to speed up algorithms that do not fit into this computation model.

In the next post I will present our benchmark program.