It would have been a waste to discard our quite complicated parallel DTW algorithm, so I turned to other means of parallel execution. Multi-core is one option, but the ARM processors in popular Android devices have another parallel execution engine internal to the core: the NEON engine. One NEON instruction can process four 32-bit integers in parallel (see the picture below). Can we speed up DTW fourfold with this option?
NEON is actually quite an old technology; even the Nexus One was equipped with it, so it is much more widely deployed than multi-core CPUs. While ordinary applications can take advantage of multiple cores without modification (e.g. two processes executing in parallel on two cores), NEON programs are difficult to write. Although some compilers claim the ability to generate NEON code and template libraries are available, the common experience is that the potential performance benefits cannot be exploited without hand-coding in assembly, and that is not for the faint-hearted.
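For illustration, this is what the 4-wide lanes from the picture look like when written with compiler intrinsics rather than raw assembly. This is a minimal sketch, not code from the project (the post's actual tight loop is hand-written assembly, and the function name add4 is made up here):

```c
#include <arm_neon.h>
#include <stdint.h>

/* One vaddq_s32 adds four 32-bit integers at once. */
void add4(const int32_t *a, const int32_t *b, int32_t *out)
{
    int32x4_t va = vld1q_s32(a);            /* load 4 lanes from a */
    int32x4_t vb = vld1q_s32(b);            /* load 4 lanes from b */
    vst1q_s32(out, vaddq_s32(va, vb));      /* 4 adds in a single instruction */
}
```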
The example program can be downloaded from here.
The relevant functions are in jni/cpucore.c. There are three implementations, processNativeSlow, processNative and processNativeNEON, each progressively more optimized than the previous one. processNativeSlow and processNative are written in C; in processNativeNEON the most time-critical loop (the "tight loop") is implemented entirely in mixed ARM/NEON assembly. This tight loop produces 4 result elements in parallel, so we expect a huge performance gain over the single-core RenderScript implementation (dtw.rs). A sketch of the underlying C recurrence follows below.
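For readers who do not want to download the project, here is a minimal sketch of the kind of recurrence the C implementations compute: the classic DTW cost update over 16-bit samples with 32-bit cost cells, as in the byte counts discussed below. This is an assumption about the structure, not the actual source; the name dtw_rows is made up, and the real code may organize the matrix differently:

```c
#include <stdint.h>
#include <stdlib.h>

/* DTW cost between two 16-bit sample streams, using two rolling rows
   instead of the full n*m matrix. Error handling omitted for brevity. */
static uint32_t dtw_rows(const int16_t *ref, int n,
                         const int16_t *eval, int m)
{
    uint32_t *prev = malloc((size_t)m * sizeof *prev);
    uint32_t *curr = malloc((size_t)m * sizeof *curr);

    /* First row: only the left neighbor exists. */
    prev[0] = abs(ref[0] - eval[0]);
    for (int j = 1; j < m; j++)
        prev[j] = prev[j - 1] + abs(ref[0] - eval[j]);

    for (int i = 1; i < n; i++) {
        curr[0] = prev[0] + abs(ref[i] - eval[0]);
        for (int j = 1; j < m; j++) {   /* the "tight loop" */
            uint32_t up   = prev[j];
            uint32_t left = curr[j - 1];
            uint32_t diag = prev[j - 1];
            uint32_t best = up < left ? up : left;
            if (diag < best) best = diag;
            curr[j] = best + abs(ref[i] - eval[j]);
        }
        uint32_t *tmp = prev; prev = curr; curr = tmp;
    }
    uint32_t result = prev[m - 1];
    free(prev); free(curr);
    return result;
}
```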
The measured experience is completely different. While the NEON implementation is significantly faster on small datasets, one second of voice is 8000 samples, so data sizes grow quickly. On 10-second datasets (80000 samples, a 6.4 billion element DTW matrix) the simple nested-loop C99 implementation and the complex, hard-to-understand NEON implementation produce about the same execution time.
How is this possible? Let's take the example of 10-second reference and evaluation samples. This means 80000 elements each, so 80000*80000 = 6.4 billion values to calculate. Calculating each value requires 20 bytes of memory traffic: 2 input samples (2 bytes each), 3 neighbor cells (4 bytes each) and the stored result (4 bytes). A1 SD Bench measures 800 Mbyte/sec copying performance on my Galaxy Nexus (and similar values on the family's two cheap Android tablets); copying obviously means 2 accesses per byte (one read and one write), so the raw access rate is about 1.6 Gbyte/sec. For simplicity, let's assume reads and writes take about the same time. The total traffic is then 6.4 billion * 20 bytes = 128 Gbytes, which at 1.6 Gbyte/sec means the memory accesses alone take about 80 sec. The real execution time is about 120 sec; the difference can be explained by the simplifications. The cache does not really help because of the large data size. The performance is determined by RAM speed, and even the simplest single-core implementation already hits that bottleneck. All the wizardry with parallelism is futile.
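The same back-of-the-envelope calculation as a few lines of C, using only the constants from the text:

```c
#include <stdio.h>

int main(void)
{
    double cells      = 80000.0 * 80000.0;  /* 6.4e9 DTW matrix elements     */
    double bytes_cell = 2*2 + 3*4 + 4;      /* samples + neighbors + result  */
    double bandwidth  = 2 * 800e6;          /* copy = one read + one write   */
    printf("traffic: %.0f GB, time: %.0f s\n",
           cells * bytes_cell / 1e9,        /* ~128 GB of memory traffic     */
           cells * bytes_cell / bandwidth); /* ~80 s at 1.6 GB/s             */
    return 0;
}
```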
Admittedly, the case was not helped by choosing DTW as the benchmark: it intentionally does not fit into the class of algorithms normally used to demonstrate the benefits of parallel processing. Grayscale conversion would be a better fit (one pixel read, one write and 3 multiplications per pixel, as sketched below). But this means that you actually have to be lucky with your algorithms for these parallel options to speed up your code significantly. Even then, it is worth looking at the parallel options inside the core before going multi-core. And you definitely should not forget the auxiliary costs of parallel computation, e.g. distributing/gathering the data to/from the parallel processing units, or whether other hardware (e.g. memory) can keep pace with the CPU.
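As a contrast to DTW's 20 bytes of traffic per cell, here is a hypothetical grayscale kernel of the kind mentioned above: per pixel it reads 4 bytes, writes 1 and performs 3 multiplications, so arithmetic rather than memory bandwidth dominates and SIMD lanes pay off. The fixed-point weights are the standard BT.601 luma coefficients; the function name is made up:

```c
#include <stdint.h>
#include <stddef.h>

/* RGBA8888 -> 8-bit grayscale, 3 multiplies per pixel. */
void grayscale(const uint8_t *rgba, uint8_t *gray, size_t pixels)
{
    for (size_t i = 0; i < pixels; i++) {
        uint32_t r = rgba[4*i], g = rgba[4*i + 1], b = rgba[4*i + 2];
        gray[i] = (uint8_t)((77*r + 150*g + 29*b) >> 8);  /* weights sum to 256 */
    }
}
```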
One wild idea at the end: could the RenderScript computation model be used to generate NEON code? With some limitations, the answer is probably yes.