My life with Android :-)

Tuesday, March 11, 2014

Beyond RenderScript - parallelism with NEON


My last post about the parallel implementation of the Dynamic Time Warping (DTW) algorithm was a disappointment. The RenderScript runtime executed the parallel implementation significantly slower than the single-core implementation (also written in RenderScript). It turned out that parallelizing the processing of 10000-50000 element vectors on multiple cores was not worth the cost of multi-threaded processing and all the overhead that comes with it (threads, semaphores, etc.). Each core must be handed a significantly larger independent workload to make threading pay off, but our DTW algorithm cannot generate one because the rows of the DTW matrix depend on each other. So in order to exploit RenderScript multi-core support, it is best to have an algorithm where the output depends only on the input and not on some intermediate result, because that type of algorithm can easily be sliced up across multiple cores.

It would have been such a waste to discard our quite complicated parallel DTW algorithm, so I turned to other means of parallel execution. Multi-core is one option, but the ARM processors in popular Android devices have another parallel execution engine, internal to the core: the NEON execution engine. One NEON instruction is able to process 4 32-bit integers in parallel (see the picture below). Can we speed up DTW fourfold with this option?



NEON is actually quite an old technology; even the Nexus One was equipped with it. It is therefore much more widely deployed than multi-core CPUs. While ordinary applications can take advantage of multi-core CPUs (e.g. two processes can execute in parallel on two cores), NEON programs are difficult to write. Although some compilers claim the ability to generate NEON code and template libraries are available, the experience is that the potential performance benefits cannot be exploited without hand-coding in assembly, and that is not for the faint-hearted.
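For illustration, this is roughly what lane-parallel NEON code looks like when written with compiler intrinsics instead of assembly. It is only a sketch of an absolute-difference sum (similar in spirit to the DTW cost function); the function and variable names are mine and this is not code from the example program:

#include <arm_neon.h>
#include <stdint.h>

/* Accumulate |a[i] - b[i]| over 4 int32 lanes at a time.
   Sketch only: assumes len is a multiple of 4 and ignores overflow;
   a real implementation needs a scalar tail loop. */
int32_t abs_diff_sum_neon(const int32_t *a, const int32_t *b, int len)
{
    int32x4_t acc = vdupq_n_s32(0);              // 4 parallel accumulators
    for (int i = 0; i < len; i += 4) {
        int32x4_t va = vld1q_s32(a + i);         // load 4 elements of a
        int32x4_t vb = vld1q_s32(b + i);         // load 4 elements of b
        acc = vaddq_s32(acc, vabdq_s32(va, vb)); // acc += |va - vb|, lane by lane
    }
    /* horizontal sum of the 4 lanes */
    return vgetq_lane_s32(acc, 0) + vgetq_lane_s32(acc, 1)
         + vgetq_lane_s32(acc, 2) + vgetq_lane_s32(acc, 3);
}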

The example program is attached at the end of this post. You have to be logged in to the Sfonge site to access it.

The relevant functions are in jni/cpucore.c. There are 3 implementations, processNativeSlow, processNative and processNativeNEON, each progressively more optimized than the previous one. The processNativeSlow and processNative functions are written in C; in processNativeNEON the most time-critical loop (the "tight loop") is implemented entirely in mixed ARM/NEON assembly. This tight loop produces 4 result elements in parallel, so we expect a huge performance gain over the single-core RenderScript implementation (dtw.rs).

The measurements tell a completely different story. While the NEON implementation is significantly faster on small data sets, one second of voice is 8000 samples, so data sizes grow quickly. On 10-second data sets (80000 samples, a 6.4-billion-element DTW matrix) the simple nested-loop C99 implementation and the complex, hard-to-understand NEON implementation produce about the same execution time.

How is this possible? Let's take the example of 10-second reference and evaluation samples. This means 80000 elements each, i.e. 80000*80000=6.4 billion values to calculate. Calculating each value requires 20 bytes of memory access: 2 input samples (2 bytes each), 3 neighbor cells (4 bytes each) and storing the result (4 bytes). A1 SD Bench measures about 800 Mbyte/sec copying performance on my Galaxy Nexus (and similar values on the two cheap Android tablets the family has); copying of course means two accesses per byte (one read and one write). For simplicity, let's assume that reads and writes take about the same time. According to the rough calculation below, the memory accesses alone then take about 80 sec. The real execution time is about 120 sec; the difference can be explained by the simplifications. The cache does not really help because of the large data size. Performance is determined by RAM speed, and the simplest single-core implementation already hits that bottleneck. All the wizardry with parallelism is futile.
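Spelled out, the back-of-the-envelope calculation is:

80000 * 80000 = 6.4 billion cells
6.4 billion cells * 20 bytes = 128 Gbytes of memory traffic
800 Mbyte/sec copy speed = roughly 1.6 Gbyte/sec of combined read+write bandwidth
128 Gbytes / 1.6 Gbyte/sec = about 80 sec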

Obviously the case was not helped by the choice of the DTW algorithm as the benchmark, which intentionally does not fit into the class of algorithms normally used to demonstrate the benefits of parallel processing. Grayscale conversion would be a better fit (one read, one write and 3 multiplications per pixel). But this means that you actually have to be quite lucky with your algorithms for these parallel options to speed up your code significantly. Even then, it is worth looking at the parallel options inside the core before going multi-core. And you definitely should not forget the auxiliary costs of parallel computation, e.g. distributing/gathering the data to/from the parallel processing units, or whether other hardware (e.g. memory) is able to keep pace with the CPU.

One wild idea at the end: could the RenderScript computation model be used to generate NEON code? With some limitations, the answer is probably yes.

Friday, February 7, 2014

RenderScript in Android - the parallel version

In the previous post I promised to revisit the parallel case. The big promise of RenderScript is to exploit parallelism among the different CPUs, GPUs and DSPs in the device at no additional cost. Once the algorithm is properly transformed into a parallel version, the RenderScript runtime grabs whatever computing devices are available and schedules the subtasks automatically.

The problem with DTW is that it is not so trivial to parallelize. Each cell in the matrix depends on the cells at (x-1,y), (x-1,y-1) and (x,y-1) (provided that the cell to calculate is at (x,y)). When the matrix is traversed horizontally or vertically, cells within a row depend on their predecessors, so at most two rows (one horizontal and one vertical) can be evaluated in parallel.

Michael Leahy recommended a paper that solves this problem. This algorithm traverses the matrix diagonally. Each diagonal row depends on the two previous diagonal rows, but cells in one diagonal row don't depend on each other. One diagonal row can then be fed to RenderScript to iterate over it. The picture below illustrates the concept.



The example program can be downloaded at the end of this post. You have to be logged in to the Sfonge site to access it.
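Before looking at the two implementations, here is a minimal sketch of the diagonal traversal in plain C. compute_cell() is a hypothetical per-cell kernel; this is not the code of the example program, which implements the idea in dtwparallel2.rs:

extern void compute_cell(int i, int j);   /* hypothetical per-cell kernel */

/* Anti-diagonal traversal of an N x M DTW matrix. Cell (i,j) only reads
   cells on the two previous diagonals, so cells on one diagonal are independent. */
void traverse_diagonals(int N, int M)
{
    for (int d = 0; d <= N + M - 2; d++) {        // one anti-diagonal per step
        int iStart = (d < M) ? 0 : d - M + 1;     // clip the diagonal to the matrix
        int iEnd   = (d < N) ? d : N - 1;
        // the iterations of this inner loop do not depend on each other,
        // so this is the part that can be handed to the parallel runtime
        for (int i = iStart; i <= iEnd; i++) {
            int j = d - i;
            compute_cell(i, j);
        }
    }
}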

You will notice that there are two parallel implementations. findReferenceSignalC99Parallel() is the "proper" implementation that closely follows the RenderScript tutorial. Here the diagonal rows are iterated in Java and only the parallel kernel is implemented in RenderScript. This version, even though it is functional, is not invoked by default because it delivers completely unacceptable performance on my 2-core Galaxy Nexus. By looking closely at the execution times, I concluded that even though RenderScript runtime invocations (copying into Allocations and invoking forEach) are normally fast, sometimes very innocent-looking invocations (like copying 5 integers into an Allocation) can take about a second. This completely ruined this implementation's performance.

The other parallel implementation, which is actually invoked and whose performance is compared to the 1-core RenderScript implementation (the fastest one), is findReferenceSignalC99ParallelRSOnly(). This version is implemented entirely in RenderScript. Unfortunately it is 2-2.5 times slower than the 1-core implementation. How can that be?

First, if you compare dtw.rs and dtwparallel2.rs, you will notice that the parallel implementation is considerably more complex. Indexing out those varying-length diagonal rows takes a bit of fiddling, while the 1-core implementation can take advantage of fast pointer arithmetic to move from cell to cell sequentially. So the parallel implementation starts with a handicap, and that handicap is not compensated for by the 2 cores of the Galaxy Nexus.

OK, the Galaxy Nexus is stone-age hardware, but what happens on a 4-core processor like the one in the Nexus 4? The runtime does launch with 4 cores, but then the Adreno driver kicks in and the result is that the parallel implementation is about 3 times slower than the serial one. What happens in the driver I don't know; as far as I can see, the source code is not available.

Jason Sams recommended disabling the GPU driver with
adb shell setprop debug.rs.default-CPU-driver 1
but I decided to stop my adventures here. The conclusion I drew for myself is that RenderScript in its present form is not ready for parallel programming. Clang-LLVM is a very promising compilation technology, but the parallel runtime suffers from a number of problems. IMHO, there should be a way to programmatically control how workloads are allocated to CPUs/GPUs. Until then, if you want to harness the power of your multicore processor, code the parallel runtime yourself, using RenderScript for the serial code if you wish.

Thursday, January 30, 2014

RenderScript in Android - anatomy of the benchmark program

In the previous post I presented our RenderScript benchmark and demonstrated that a RenderScript implementation of the same algorithm can be 2-3 times faster than Java. How can a "script" be so fast? In order to understand this speed difference, let's see how the RenderScript fragment is executed.

The example program is attached to the end of this post. You have to be logged in to the Sfonge site to access it.

First, let's see what the script looks like. The source can be found in dtw.rs, in the same directory as the other Java sources (just one file in this case).




It looks like an innocent C function but there are some special features. Global variables like these:

int32_t s2len = 0;
int32_t *d0;


can be used to pass data to the script. The toolchain generates a Java wrapper for each .rs file; ours is called ScriptC_dtw.java. In order to set s2len, for example, one calls the set_s2len method of the ScriptC_dtw class. The d0 global variable is a pointer type; setting it requires an Allocation Java object. Open MainActivity.java and look up the findReferenceSignalC99() method. There you will find:

        Allocation signal1Allocation = Allocation.createSized(
                rsC,
                Element.I16(rsC),
                refSignal.length);
        signal1Allocation.copyFrom(refSignal);
        script.bind_signal1(signal1Allocation);


Here we create an Allocation that holds 16-bit integers, copy the input signal into it and bind the allocation so that its data becomes available to the script. When the script is invoked, the data area of this allocation is simply visible to it as:

int16_t *signal1;

This sort of parameter passing is one-way for simple values but two-way for allocations. So whatever your script writes into e.g. s2len, you won't be able to read in the Java layer after the script finishes executing. In contrast, Allocations provide two-way data transfer; that's why the result value is passed back to Java in an Allocation.

In MainActivity.java:

        Allocation rAllocation = Allocation.createSized(
                rsC,
                Element.I32(rsC),
                1);
        script.bind_r(rAllocation);


In dtw.rs:

        *r = d1[s1len-1];

And again in MainActivity.java after the script has finished executing:

        int result[] = new int[1];
        rAllocation.copyTo(result);
...
        int maxc = result[0];


The execution of the script seems simple enough but there's more than meets the eye.

The execution context and the script instance are created:

        RenderScript rsC = RenderScript.create( this);
        ScriptC_dtw script = new ScriptC_dtw(
                rsC,
                getResources(),
                R.raw.dtw);


ScriptC_dtw is the wrapper that was generated by the RenderScript toolchain. But what is R.raw.dtw? Let's see how our "script" was turned into executable code. If you unzip the APK file, you find some interesting artifacts. Under the res/raw directory you find dtw.bc. This is the LLVM bytecode that dtw.rs was compiled to. In addition, under the lib directory you will find .so files for the ARM, MIPS and Intel platforms. If you disassemble librs.dtw.so, you will find a highly optimized binary compiled from our script, which is really a piece of valid C code.



The RenderScript name generates confusion. It evokes a proprietary scripting language when in fact the technology is Clang's C99 front-end compiler plus a set of libraries ported to a large number of processors. Optimized C code is fast; what is so surprising about that? When our "script" is executed on an ARM processor, the RenderScript runtime just has to load the precompiled ARM code and execute it. If it turns out that there is no precompiled native code for the target processor (e.g. it is a GPU), then the LLVM back-end compiler swings into action and generates code for that processor at installation time. Both compilation steps (from C to LLVM bytecode and from LLVM bytecode to native) are subject to optimization, so the resulting native code is very fast. No wonder, therefore, that RenderScript beats the Dalvik VM so easily and by such a large margin.

After all the global variables have been initialized, the script can be invoked.

        script.invoke_dtw();
        rsC.finish();


Note the finish() invocation here. The invoke_dtw() method is asynchronous, meaning that when it returns, the execution of the "script" has not finished; in fact, it may not even have started. The finish() method on the RenderScript instance blocks until all script invocations on that context have finished. Script invocations in the same context are executed sequentially.

But what happens when more than one context is created? Allocations and script execution in different contexts are independent, so if you have enough cores/processors, script invocations in those contexts will execute in parallel. Be aware, however, that if you create more contexts than the number of processing units you have, those contexts will compete for the same processing units by means of context switching, and these context switches will eventually decrease your performance. If your algorithm requires an element scan that is more complicated than the sequence forEach() supports, you can always create a dummy allocation with as many elements as the number of parallel work items your algorithm supports and invoke forEach() on that dummy allocation. Your kernel can then access the elements of the real data set in any order it wishes.
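A minimal sketch of this dummy-allocation trick in the legacy RenderScript C99 kernel style. The package name, variable names and the strided traversal are illustrative and not taken from the example program:

#pragma version(1)
#pragma rs java_package_name(com.example.dtw)   // illustrative package name

int32_t *input;    // bound from Java: the real data set
int32_t *output;   // bound from Java: one result slot per work item
int32_t total;     // number of elements in input, set from Java
int32_t stride;    // number of work items, i.e. the size of the dummy allocation

/* The dummy allocation passed to forEach only determines how many times
   root() runs; x tells each invocation which slice of the real data to scan. */
void root(const int32_t *dummy, uint32_t x) {
    int32_t acc = 0;
    for (int32_t i = x; i < total; i += stride) {   // any traversal order you need
        acc += input[i];
    }
    output[x] = acc;
}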

How does RenderScript compare to established technologies like the Android SDK or the NDK? For Google, the equation is simple: RenderScript is mainly for GPUs, hence its name. I have tried to present the case here that for an average Android programmer, RenderScript provides a much more productive way than the NDK to offload computation-intensive code fragments to highly optimized native code. RenderScript is integrated with the Android SDK, compilation is very fast, wrappers are generated automatically, JNI issues are non-existent, and coding parallel execution is simpler than with either the SDK or the NDK. Faster execution also means lower battery consumption, as this presentation demonstrated in a different context. And who knows, one day a device with a multicore CPU, GPU or DSP may come along that speeds up your application even further, at no cost. As RenderScript has LLVM at its heart, the possibility is there.

Monday, January 27, 2014

RenderScript in Android - Java optimization in the benchmark program

I got into a discussion with Michael Leahy regarding the benchmark program posted at the end of the previous post. Michael claimed that the Java implementation of the benchmark algorithm in my test program was suboptimal, and he contributed an optimized version that he said could deliver 70% better performance.
Therefore I decided to re-evaluate the benchmark results, but with a twist. Not only did I add Michael's optimized Java implementation, I also optimized the RenderScript implementation. The results are the following.
  • Michael's optimized implementation improved the execution time of the Java version by a factor of about 2.6.
  • The RenderScript implementation is still about 2.3 times faster than Michael's optimized Java implementation.
The new version is attached to the end of this post. You need to be logged in to the Sfonge site to access it. I will continue explaining this example program in the next post.

Saturday, January 25, 2014

RenderScript in Android - the benchmark program

The previous post of this series tried to present arguments why it is a good idea to consider implementing part of your code in RenderScript instead of Java or native code. This part presents a benchmark that we will use to evaluate the technology. I intentionally chose a benchmark that is not an image manipulation algorithm or a convolution filter. I wanted a use case that processes a large amount of data, and sound processing is quite an evident choice.

The algorithm I am going to present is quite naive and would require a lot of refinement for commercial deployment. But my goal was RenderScript benchmarking so the algorithm's tolerance to all kinds of environmental disturbances was not relevant. We will do voice fingerprinting that takes a voice sample array, compares this sample to a reference sample and returns true or false depending on whether the voice sample is sufficiently similar to the reference sample. The algorithm should return true if the voice sample contains the same word from the same speaker, false otherwise.

I implemented Dynamic Time Warping (DTW) for the purpose. DTW compares two data series assuming that the timing between the two series may be different. For example, the same speaker may say the same word with slightly higher pitch. Therefore DTW calculates a so-called warping path, a sequence of corresponding index pairs from series1 and series2. E.g. a part of a warping path may be (1,1), (1,2), (2,3), (3,3), meaning that a[1] (the data at index #1 of series1) corresponds to b[1], a[1] corresponds to b[2], a[2] corresponds to b[3] and a[3] corresponds to b[3]. Knowing the warping path and a cost function, the distance between the two series can be calculated. DTW finds the warping path with the lowest cumulative cost (i.e. the sum of the costs between the corresponding data items along the warping path). A detailed introduction to the algorithm can be found here.

This example program implements classic DTW, calculating every possible pairing between the data items of the two series. This requires an NxM matrix where N is the size of series1 and M is the size of series2. DTW(i,n) (the element at position (i,n) in this matrix) is the accumulated cost value (the sum of the cost function values) of the optimal warping path that leads through position (i,n), i.e. assuming that index i in series1 and index n in series2 correspond to each other. The function calculating DTW(i,n) is thus:

DTW(i,n)=min(DTW(i-1,n),DTW(i-1,n-1),DTW(i,n-1))+c(i,n)

where c(i,n) is the cost between a[i] and b[n]. The example program uses

c(i,n)=abs(a[i]-b[n])

Cells at the edges of the matrix are initialized as cumulative sums along the edge:

DTW(i,0)=c(0,0)+c(1,0)+...+c(i,0)

and

DTW(0,n)=c(0,0)+c(0,1)+...+c(0,n)

Classic DTW can require a large amount of space. Take for example a short voice sample. With a sampling rate of 44100Hz/5=8820Hz, even a short word can easily generate a vector of 5000-8000 elements. If the reference vector is of a similar size, the DTW matrix will consist of 25 million to 64 million elements, all of them integers (4 bytes each), i.e. 100-256 Mbytes. A data structure of that size may easily exhaust the fixed maximum heap space allocated to an Android application. Therefore we apply a simple optimization. The output of our algorithm is the optimal distance, which will be found at DTW(N,M), the bottom-right element of the matrix. In order to calculate the elements in row i, we only need the elements in row i-1 plus the already calculated elements in row i. We can drop row i-2 and calculate DTW(i,0) on the fly, when we get there. This reduces the memory requirement to 2*M data items instead of N*M.
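A minimal sketch of this two-row scheme in plain C (illustrative only; the Java and RenderScript implementations in the example program are organized differently):

#include <stdlib.h>

/* DTW distance between samples a[0..N-1] and b[0..M-1] using two rows of M
   cells instead of the full NxM matrix. Sketch only, no overflow checks. */
int dtw_distance(const short *a, int N, const short *b, int M)
{
    int *prev = malloc(M * sizeof(int));           /* row i-1 */
    int *curr = malloc(M * sizeof(int));           /* row i   */
    prev[0] = abs(a[0] - b[0]);                    /* DTW(0,0) */
    for (int n = 1; n < M; n++)                    /* top edge: DTW(0,n) */
        prev[n] = prev[n - 1] + abs(a[0] - b[n]);
    for (int i = 1; i < N; i++) {
        curr[0] = prev[0] + abs(a[i] - b[0]);      /* left edge, computed on the fly */
        for (int n = 1; n < M; n++) {
            int best = prev[n];                          /* DTW(i-1,n)   */
            if (prev[n - 1] < best) best = prev[n - 1];  /* DTW(i-1,n-1) */
            if (curr[n - 1] < best) best = curr[n - 1];  /* DTW(i,n-1)   */
            curr[n] = best + abs(a[i] - b[n]);
        }
        int *tmp = prev; prev = curr; curr = tmp;  /* drop row i-1, reuse its buffer */
    }
    int result = prev[M - 1];                      /* bottom-right element */
    free(prev);
    free(curr);
    return result;
}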

DTW, however naive its application in the example program, satisfies our RenderScript benchmark needs because

  • The kernel function is non-linear; it is not so evident how to break it down into sub-ranges (although FastDTW does precisely that).
  • The calculation of the matrix cells depends on intermediate results of the calculation and not only on the input values.

The example program is attached to the end of this post. You need a free registration to the Sfonge site to access it. The application exercises the DTW algorithm on speech samples. Launch the application and record a reference sample, e.g. a word like "four". Then record the same or a different word from the same or a different speaker and observe the distance that DTW calculates. If the environment is not too noisy, you will find that the same word from the same speaker yields a (normalized) distance below 1000-1200, while other words from the same speaker or the same word from a different speaker yield a higher distance. The calculation takes a bit of time because the DTW algorithm is executed twice, once in Java and once in RenderScript. The application displays the execution time of both implementations.



For now, you can observe that RenderScript execution times are 4-5 times lower than the Java execution times, even though the implementation does not exploit any parallelization features (Update: visit this post for updated numbers; after optimization, RenderScript is still 2.3 times faster). I will go deeper into this observation in the next post.

Wednesday, January 22, 2014

RenderScript in Android - why should you care?

Before 3.0, Android application development meant using either the SDK, which is a Java toolchain, or the NDK, which is a C/C++ toolchain based on the gcc compiler. Version 3.0 added a third option, RenderScript. RenderScript was positioned as an auxiliary toolchain for special tasks like image manipulation. In this post I will investigate whether it makes sense to use RenderScript for more mainstream coding problems.

If you search the web for RenderScript, you won't get too many hits. This is because RenderScript is really just Google's name for a technology based on two powerful software packages: the clang compiler and the LLVM "virtual machine". Clang is a C/C++ front-end compiler and a strong challenger to gcc. RenderScript code is processed by clang and turned into LLVM bytecode. LLVM is not a virtual machine in the strong sense, as it does not "execute" this bytecode. Rather, it provides a second compilation step in which the LLVM bytecode is turned into machine code for the target processor.

Much like Android's new runtime, ART, LLVM is an ahead-of-time compiler. The second compilation step happens either on the developer's machine or directly on the device. In the first case the Clang-LLVM toolchain generates code for 3 processor families (ARM, MIPS and Intel), so your APK file already contains native code generated from the RenderScript sources for these processors and no further compilation is needed on the device. RenderScript code may also run on graphical co-processors and other DSPs, however. In this case the on-device LLVM compiler takes the LLVM bytecode, which is also packaged into the APK file, and may generate code for these specialized processors at install time.

So if you write RenderScript instead of a native implementation with the traditional NDK toolchain, you get the performance of native code, hassle-free support for many processor architectures including the more exotic GPUs and DSPs, simpler integration with the Java code and a much smoother integration with the Eclipse development environment.

RenderScript appeared in Android 3.0 and its supporting Java packages reside under the android.renderscript hierarchy. This is called native RenderScript. Google, however, backported the feature to Android 2.2 (and higher) and this backported version is now the recommended one. Instead of being part of the platform, RenderScript support now resides in a support library under the android.support.v8.renderscript package hierarchy. The functionality of android.renderscript and android.support.v8.renderscript is largely the same.

Even though the RenderScript runtime is general-purpose, Google positions the technology as a solution for specific tasks. The RenderScript tutorial, for example, concentrates mainly on the filtering runtime that executes a RenderScript function (a so-called "kernel") on a large number of data items. This assumes a certain class of algorithms, namely those whose output depends solely on the input and not on some intermediate result of the computation. Simple convolution filters (e.g. the 2D filters widely used for image processing) are such algorithms, and they are therefore very suitable for parallel execution on multiple CPU cores. We will see, however, that RenderScript can be exploited to speed up algorithms that do not fit into this computation model.
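For illustration, a per-element kernel of this kind could look like the following grayscale conversion, written in the legacy RenderScript C99 style. This is a hypothetical sketch, not part of the benchmark program; each output pixel depends only on the corresponding input pixel, so the runtime is free to hand any subset of the pixels to any core:

#pragma version(1)
#pragma rs java_package_name(com.example.gray)   // illustrative package name

/* Integer approximation of luma: (77*R + 151*G + 28*B) / 256.
   One read, one write and three multiplications per pixel. */
void root(const uchar4 *in, uchar4 *out) {
    uchar4 p = *in;
    uchar y = (uchar)((77 * p.r + 151 * p.g + 28 * p.b) >> 8);
    uchar4 q = {y, y, y, p.a};
    *out = q;
}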

In the next post I will present our benchmark program.