Saturday, January 25, 2014

RenderScript in Android - the benchmark program

The previous post of this series tried to present argument, why it is a good idea to consider implementing part of your code in RenderScript, instead of Java or native code. This part presents a benchmark that we will use to evaluate the technology. I intentionally chose the benchmark not to be an image manipulation algorithm or a convolution filter. I wanted a use case that processes large amount of data. Sound processing yields itself as a quite evident choice.

The algorithm I am going to present is quite naive and would require a lot of refinement for commercial deployment. But my goal was RenderScript benchmarking so the algorithm's tolerance to all kinds of environmental disturbances was not relevant. We will do voice fingerprinting that takes a voice sample array, compares this sample to a reference sample and returns true or false depending on whether the voice sample is sufficiently similar to the reference sample. The algorithm should return true if the voice sample contains the same word from the same speaker, false otherwise.

I implemented Dynamic Time-Warping (DTW) for the purpose. DTW compares two data series assuming that the timing between the two series may be different. For example the same speaker may say the same word with slightly higher pitch. Therefore DTW calculates a so called warping path, a pair of corresponding timestamps from series1 and series2. E.g. a part of a warping path may be (1,1), (1,2),(2,3),(3,3) meaning that a[1] (data at index #1 from series1) corresponds to b[1], a[1] corresponds to b[2], a[2] corresponds to b[3] and a[3] corresponds to b[3]. Knowing the warping path and a cost function, the distance between the two series can be calculated. DTW finds the warping path with the lowest cumulative cost (i.e. the sum of cost between the corresponding data items along the warping path). A detailed introduction to the algorithm can be found here.

This example program implements classic DTW calculating every possible pairing between the data items of the two series. This requires an NxM matrix where N is the size of series1 and M is the size of series2. DTW(i,n) (element at position (i,n) in this matrix) is the accumulated cost value (sum of the cost function values) of the optimal warping path that leads through the position at i,n. (i.e. with assuming that index i in series1 and index n in series2 correspond to each other). The function calculating DTW(i,n) is thus:

DTW(i,n)=min(DTW(i-1,n),DTW(i-1,n-1),DTW(i,n-1))+c(i,n)

where c(i,n) is the cost between a[i] and b[n]. The example program uses

c(i,n)=abs(a[i]-b[n]))

Cells at the edge of the matrix are initialized like

DTW(i,0)=sum(c(i,0))

and

DTW(0,i)=sum(c(0,i))

Classic DTW can require a large amount of space. Take for example a short voice sample. With a sampling rate of 44100Hz/5=8820Hz, even a short word can generate easily a vector of 5000-8000 elements. If the reference vector is of a similar size, the DTW matrix will consist of 25 million to 64 million elements, all of them integer (meaning 4 bytes each). A data item of that size may easily exhaust the fixed maximum process space allocated to Android applications. Therefore we do a simple optimization. The output of our algorithm is the optimal distance which will be found at DTW(M,N), the right-bottom element in the matrix. In order to calculate elements in row i, we need only elements in row i-1 plus the already calculated elements in row i. We can drop row i-2 and calculate DTW(i,0) on the fly, when we get there. This will reduce the memory requirement to 2*M data items instead of N*M.

DTW, however naive its application in the example program, satisfies our RenderScript benchmark needs because

  • The kernel function is non-linear, it is not so evident to break it down to sub-ranges (although FastDTW does precisely that).
  • Calculation of matrix cells depend on an intermediate result of the calculation and not only on the input values.

The example program is attached to the end of this post. You need a free registration to the Sfonge site to access it. The application exercises the DTW algorithm on speech samples. Launch the application and record a reference sample. E.g. record a word like "four". Then record a same or different word from the same or different speaker and observe the distance that DTW calculates. If the environment is not too noisy, you will find that same word from the same speaker yields a (normalized) distance below 1000-1200 while other words from the same speaker or the same word from a different speaker yields higher distance. The calculation takes a bit of time because the DTW algorithm is executed twice, once in Java and once in RenderScript. The application displays the execution time for both implementations.



For now, you can observe that RenderScript execution times are 4-5 times smaller than Java execution times even (Update: visit this post for an updated number. After optimization, RenderScript is still 2.3 faster). though the implementation does not exploit any parallelization features. I will go deeper into this observation in the next post.

3 comments:

MichaelEGR said...

Thanks for the follow up blog post.

Unfortunately, like I said extraordinary claims need extraordinary evidence and you didn't deliver with obviously unoptimized Java code. With a few minor tweaks without even getting really fancy the core Java method performing the calculation can be reduced to ~27.5% (3-4 times) the execution time of your original code.

I'm not sure if you intentionally tried to make the Java code slow, but you raised some serious doubts by commenting out the System.arraycopy and putting in a for loop to copy data between two arrays mind you once again inside of an existing loop! Also, the completely unnecessary "signalCost" method which just is a Math.abs() call is called 3 times (twice in a nested for loop). You should know that calling methods in tight loops is extremely bad form for performance.

Here is the modified method:

http://pastebin.com/mfmHKkCb

Since you mention the Renderscript code having a 4-5 times improvement over your poorly written Java example I posit that the ~8-10% improvement is simply not worth the loss of cross-platform compatibility and headache free maintenance. The code is shorter, clearer, doesn't require developers to bother with Renderscript and multiple files. If one wanted to improve upon it then evaluating the NDK / C++ for a native cross-platform solution is certainly worthy. There are ways of writing ones C++ code inline to Java code and this method would be a perfect example for such a solution.

Of course the parallel solution is the real avenue to explore. And multithreaded Java code is not going to be that far behind C++ and once again given the data set to be calculated one would implement a multithreaded parallel CPU version of the algorithm and where available OpenCL to greatly speed it up choosing which one depending on the data size to process. Do this once and you have an algorithm that runs cross-platform everywhere.

I am curious if you care to revisit your assumptions as otherwise it seems confirmation bias is in play at least for the moment.

Gabor Paller said...

Thanks for the contribution, Michael. I re-evaluated the benchmark results, here are the new results, along with the updated benchmark program. In short: RenderScript is still 2.3 times faster.

With regards to the NDK: there's a long discussion between the GCC and Clang community, who generates faster code. The reality is that the two compilers are head-to-head when it comes to the efficiency of the generated code. Clang however offers much faster compilation time. NDK implementation is about as fast as RenderScript but it is much faster to write a RenderScript function. The reasons: Clang/LLVM is better integrated into the SDK toolchain (one reason is that Clang compiles so quickly) and the SDK automatically generates all the glue code while in case of NDK you have to fiddle with all the complexities of the JNI.

Parallelization is another issue, I will return to that later.

MichaelEGR said...

Indeed, the optimizations provided for the Java side are basic. Marking the method static will help slightly, but it must also be noted that the current execution environment of running the method once without any warm up is a bad micro test essentially. Things approach or best 70% (66-72% improvement) with a warm up before grabbing a result of the Java test. I haven't tested if loop unrolling and removing the Math. methods will help, but it should be noted that doing so would likely slow down a one time / single execution of the calculation method.

As far as JNI is concerned one actually doesn't need to fiddle with the glue code. Using a build tool like jnigen which is a standalone effort part of libgdx one can easily build and integrate inline to the Java code native methods and automatically generate all platform resources and not just Android. This solution trumps any serial Renderscript option since it's cross-platform.

I posit that a parallel CPU Java version will likely be twice as fast as the current Renderscript version. A parallel OpenCL version upwards of 5-10 times faster than Renderscript.