In the previous post I promised to revisit the parallel case. The big
promise of RenderScript is to exploit parallelism among different CPUs,
GPUs and DSPs in the device at no additional cost. Once the algorithm is
properly transformed into parallel version, the RenderScript runtime
grabs whatever computing devices are available and schedules the subtask
automatically.

The problem with DTW is that it is not so trivial to parallelize. Each cell in the matrix depends on cells at (x-1,y), (x-1,y-1) and (x,y-1) (provided that the cell to calculate is at (x,y)). By traversing the matrix horizontally or vertically, only two rows (one horizontal and one vertical) can be evaluated in parallel.

Michael Leahy recommended a paper that solves this problem. This algorithm traverses the matrix diagonally. Each diagonal row depends on the two previous diagonal rows but cells in one diagonal row don't depend on each other. One diagonal row can be then fed to RenderScript to iterate over it. The picture below illustrates the concept.

The example program can be downloaded at the end of this post. You have to be logged to Sfonge site to access it.

You will notice that there are two parallel implementations. The findReferenceSignalC99Parallel() is the "proper" implementation that follows closely the RenderScript tutorial. Here the diagonal rows are iterated in Java and only the parallel kernel is implemented in RenderScript. This version - even though it is functional - is not invoked by default because it delivers completely inacceptable performance on my 2-core Galaxy Nexus. By looking closely at the execution times, I concluded that even though RenderScript runtime invocations ( copying into Allocations and invoking forEach) are normally fast, sometimes very innocent-looking invocations (like copying 5 integers into an Allocation) can take about a second. This completely ruined this implementation's performance.

The other parallel implementation which is actually invoked and whose performance is compared to the 1-core RenderScript implementation (the fastest one) is findReferenceSignalC99ParallelRSOnly(). This version is implemented entirely in RenderScript. Unfortunately its performance is 2-2.5 times

First, if you compare dtw.rs and dtwparallel2.rs, you will notice that the parallel implementation is considerably more complex. Indexing out those varying-length diagonal rows takes a bit of fiddling while the 1-core implementation can take the advantage of fast pointer arithmetic to move from cell to cell sequentially. So the parallel implementation starts with a handicap. This handicap is not compensated by the 2 cores of the Galaxy Nexus.

OK, Galaxy Nexus is the stone age but what happens on a 4-core processor like on a Nexus 4? The runtime does launch with 4 cores but then the Adreno driver kicks in and the result is that the parallel implementation is about 3 times slower than the serial one. What happens in the driver, I don't know, as far as I can see, the source code is not available.

Jasons Sams recommended to disable the GPU driver with

adb shell setprop debug.rs.default-CPU-driver 1

but I decided to stop my adventures here. The conclusion I drew for myself is that RenderScript in its present form is not ready for parallel programming. Clang-LLVM is a very promising compilation technology but the parallel runtime suffers from a number of problems. IMHO, there should be a way to programmatically control the way the workloads are allocated to CPUs/GPUs. Until then, if you want to harness the power of your multicore processor, code the parallel runtime yourself. Using RenderScript for the serial code if you wish.

The problem with DTW is that it is not so trivial to parallelize. Each cell in the matrix depends on cells at (x-1,y), (x-1,y-1) and (x,y-1) (provided that the cell to calculate is at (x,y)). By traversing the matrix horizontally or vertically, only two rows (one horizontal and one vertical) can be evaluated in parallel.

Michael Leahy recommended a paper that solves this problem. This algorithm traverses the matrix diagonally. Each diagonal row depends on the two previous diagonal rows but cells in one diagonal row don't depend on each other. One diagonal row can be then fed to RenderScript to iterate over it. The picture below illustrates the concept.

The example program can be downloaded at the end of this post. You have to be logged to Sfonge site to access it.

You will notice that there are two parallel implementations. The findReferenceSignalC99Parallel() is the "proper" implementation that follows closely the RenderScript tutorial. Here the diagonal rows are iterated in Java and only the parallel kernel is implemented in RenderScript. This version - even though it is functional - is not invoked by default because it delivers completely inacceptable performance on my 2-core Galaxy Nexus. By looking closely at the execution times, I concluded that even though RenderScript runtime invocations ( copying into Allocations and invoking forEach) are normally fast, sometimes very innocent-looking invocations (like copying 5 integers into an Allocation) can take about a second. This completely ruined this implementation's performance.

The other parallel implementation which is actually invoked and whose performance is compared to the 1-core RenderScript implementation (the fastest one) is findReferenceSignalC99ParallelRSOnly(). This version is implemented entirely in RenderScript. Unfortunately its performance is 2-2.5 times

*than the 1-core implementation. How can it be?***slower**First, if you compare dtw.rs and dtwparallel2.rs, you will notice that the parallel implementation is considerably more complex. Indexing out those varying-length diagonal rows takes a bit of fiddling while the 1-core implementation can take the advantage of fast pointer arithmetic to move from cell to cell sequentially. So the parallel implementation starts with a handicap. This handicap is not compensated by the 2 cores of the Galaxy Nexus.

OK, Galaxy Nexus is the stone age but what happens on a 4-core processor like on a Nexus 4? The runtime does launch with 4 cores but then the Adreno driver kicks in and the result is that the parallel implementation is about 3 times slower than the serial one. What happens in the driver, I don't know, as far as I can see, the source code is not available.

Jasons Sams recommended to disable the GPU driver with

adb shell setprop debug.rs.default-CPU-driver 1

but I decided to stop my adventures here. The conclusion I drew for myself is that RenderScript in its present form is not ready for parallel programming. Clang-LLVM is a very promising compilation technology but the parallel runtime suffers from a number of problems. IMHO, there should be a way to programmatically control the way the workloads are allocated to CPUs/GPUs. Until then, if you want to harness the power of your multicore processor, code the parallel runtime yourself. Using RenderScript for the serial code if you wish.