Monday, January 27, 2014

RenderScript in Android - Java optimization in the benchmark program

I got into a discussion with Michael Leahy about the benchmark program posted at the end of the previous post. Michael claimed that the Java implementation of the benchmark algorithm in my test program is suboptimal, and he contributed an optimized version that he claims delivers up to 70% better performance. 
Therefore I decided to re-evaluate the benchmark results - but with a twist. Not only did I add Michael's optimized Java implementation, I also optimized the RenderScript implementation. The results are as follows.
  • Michael's implementation improved the execution time of the Java implementation by a factor of about 2.6.
  • The RenderScript implementation is still about 2.3 times faster than Michael's optimized Java implementation.
The new version is available here. I will continue explaining this example program in the next post.
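The actual benchmark code is in the linked project; as a rough, hypothetical sketch of the kind of DTW-style cost loop being compared (the variable names like `signal1`, `d0`, `d1` follow the fragment Michael quotes in the comments below, but the details here are illustrative, not the project source):

```java
// Hypothetical sketch of a DTW-style distance using two rolling rows of
// the cost matrix. Not the actual benchmark source - names and boundary
// handling are illustrative.
public class DtwSketch {
    static double dtw(double[] signal1, double[] signal2) {
        int s1len = signal1.length, s2len = signal2.length;
        double[] d0 = new double[s1len]; // previous row of the cost matrix
        double[] d1 = new double[s1len]; // current row

        // Row 0: cumulative cost along signal1.
        d0[0] = Math.abs(signal1[0] - signal2[0]);
        for (int x = 1; x < s1len; x++) {
            d0[x] = d0[x - 1] + Math.abs(signal1[x] - signal2[0]);
        }

        double s = d0[0]; // running accumulator for column 0
        for (int y = 1; y < s2len; y++) {
            double signal2Y = signal2[y];
            d1[0] = (s += Math.abs(signal1[0] - signal2Y));
            for (int x = 1; x < s1len; x++) {
                double cost = Math.abs(signal1[x] - signal2Y);
                d1[x] = cost + Math.min(d1[x - 1],
                                Math.min(d0[x], d0[x - 1]));
            }
            double[] tmp = d0; d0 = d1; d1 = tmp; // roll the rows
        }
        return d0[s1len - 1];
    }
}
```

Note the inner loop: each `d1[x]` depends on `d1[x - 1]`, which is why a naive per-element parallel launch cannot be applied directly (this is the dependency discussed in the comments).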

13 comments:

MichaelEGR said...

There is a small mistake in the code I submitted.

Between the first for loop and the second, it's necessary to separately track the accumulation of the first index of d0:

s = 0; // reset the running accumulator for the first column

for (int y = 1; y < s2len; y++)
{
    signal2Y = signal2[y];
    d1[0] = (s += Math.abs(signal1[0] - signal2Y));
    // ... rest of the inner loop unchanged ...

------

Without this fix, some runs may produce slightly incorrect results. This basically just adds one more assignment into the mix.

------

I commented more on your last post about the 70% speed improvement. You need to mark the method static and a bit more performance will result. Also, it's a bad micro-benchmark scenario to execute the calculation only once. With a warm-up period I've seen speed-ups from 66% to 72% over the original Java code. What you posted is the lower bound, due to no warm-up and removing the static keyword.
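The warm-up point matters because the JIT only compiles a hot method after it has run many times; timing a single cold invocation measures interpreted code. A minimal, hypothetical harness (not from the benchmark project; all names are illustrative) might look like:

```java
// Hypothetical micro-benchmark harness: run the workload a number of
// warm-up iterations first so the JIT compiles the hot path, then time it.
public class WarmupTimer {
    // Returns average nanoseconds per call over 'reps' timed runs.
    static double timePerCall(Runnable work, int warmup, int reps) {
        for (int i = 0; i < warmup; i++) {
            work.run(); // warm-up: trigger JIT compilation, warm caches
        }
        long start = System.nanoTime();
        for (int i = 0; i < reps; i++) {
            work.run();
        }
        return (System.nanoTime() - start) / (double) reps;
    }
}
```

A real harness would also guard against dead-code elimination by consuming the workload's result; this sketch only illustrates the warm-up idea.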

A C++ version will run just as fast as the current Renderscript implementation. Using jnigen one can have the C++ code inline to the Java files and have automatic build scripts generated along with all of the platform specific resources.

I posit that a parallel Java / CPU version will likely run twice as fast as the existing Renderscript solution and be quite portable, without the need for any build tools or other nuisances. Certainly it shouldn't run any slower than the current Renderscript solution.

As far as I'm currently aware you won't be able to improve or parallelize DTW with Renderscript. I'd be interested if you can greatly improve / parallelize it though, and write about how you did it... :) Unfortunately I don't have much time in the coming month or two to work on curiosities, but it would be interesting to keep in touch on expanding the comparisons.

Gabor Paller said...

The RenderScript implementation can of course be parallelized, but not with the (quite naive) foreach semantics. The parallel implementation will keep its speed advantage over the parallel Java implementation. I will get to that later, but I guess you will understand why after you read the next post (which will just explain the current benchmark and will not present the parallel version).

Comparison with the NDK is trickier. NDK and RenderScript performance will be similar if the RenderScript code runs on the main CPU core(s). In this case the ease of development will be the main factor that favors RenderScript: if you have just some small code fragments that you want to optimize, it is much faster to use the RenderScript runtime, which is fully integrated with the Eclipse environment, than to fiddle with the NDK and JNI - particularly if you support only Android (cross-platform development is not an issue for me). If you already have a large amount of native code and you also care about portability (e.g. you cross-compile your native code on different mobile platforms to simplify application porting), then RenderScript is a bad choice.

My motivation has always been the efficient implementation of signal-processing algorithms that are small in terms of lines of code but can consume a lot of CPU cycles (and hence battery power). RenderScript is therefore an attractive option for me, particularly if my RenderScript code can really execute on some DSP or microcontroller on the device. This is just a promise now, but it may happen one day. This is the story I try to present in these posts.

Unknown said...

You may also want to try using the built-in min/max functions rather than rolling your own. On hardware that has dedicated instructions for this it will be faster. If nothing else, it reduces code.

RS vs. NDK: I did some benchmarks on this. RS is typically fast due to its ability to compile on the device and inline processor-specific instructions. In theory, if you compiled separately for every possible processor you would break even with the NDK, but no sane person would do that much work.

For this code I don't think it would be a big delta.

MichaelEGR said...

@Gabor
>RenderScript implementation can of course be parallelized but not with the (quite naive) foreach semantic.

Yes, that is my point. There is no work-efficient way to parallelize this and many other algorithms with Renderscript.

There are several research papers that detail how DTW is parallelized with the Cuda / OpenCL model. There are next to no research papers at all that describe even the simplest algorithm implementation via Renderscript, compared to OpenCL / Cuda.

Here is the one I'm currently perusing:
http://membres-liglab.imag.fr/termier/ParallelDMWorkshop/WorkshopNotes_PDM11.pdf#page=17

>The parallel implementation will keep its speed advantage over the parallel Java implementation.

Only if there are semantics in the language that allow one to actually implement the algorithm in a work-efficient manner; otherwise the parallel Java version may still be in the same ballpark.

It is also the case that a parallel OpenCL or Cuda version will be up to ~10 times faster than a CPU version and will likely easily beat any further Renderscript implementations. I'm arguing that the parallel Java version will be as fast as the serial CPU Renderscript version (the current demo) and conceivably twice as fast, along with the benefit that it runs anywhere and without additional developer tools.

If the language features needed to parallelize the algorithm in a work-efficient manner are not available in Renderscript (which, to my knowledge, is the situation), then there will be an upper bound on performance that will be far exceeded by an execution environment like OpenCL or Cuda, which does have the constructs to implement this algorithm, and most other intermediate-to-advanced algorithms, in a work-efficient manner.

>If you have just some small code fragments that you want to optimize, it is much faster to use the RenderScript runtime which is fully integrated with the Eclipse environment than to fiddle with the NDK and JNI.

Who uses Eclipse anymore? Well, I was a long-time IDEA user, so the move of Android Studio to IDEA CE was a nice situation. Have you ever checked out all the tools / IDE bugs Renderscript has with Android Studio? IDE integration with Renderscript isn't as trouble-free as you frame it.

Also, seriously take a look at jnigen:
https://github.com/libgdx/libgdx/wiki/jnigen

One doesn't have to fiddle with the NDK & JNI... Well, sure, there is setup for the environment and one's system, but that is a one-time setup cost. The actual programming experience is transparent, quick, and instant.

There are other projects too which provide an abstracted bridge to JNI / NDK without the need to write all the boilerplate code. I do have a preference for jnigen though.

MichaelEGR said...

@Gabor
>If you already have large amount of code in native and you also think about portability

There are several things in play here. First, most organizations, from small (startups definitely included!) to large, don't just deal with one platform and must address multiple platforms when developing software, i.e. iOS, Android, and often the desktop.

Algorithm development is difficult, particularly parallel algorithm development, so there is a major cost reason to work with cross-platform technology. Only boutique operations not concerned about the future, or ones receiving funds to release on a single platform only, will accept tool lock-in. That, or the organization is flush with cash and able to implement algorithms in both cross-platform and platform-specific technologies. Not many if any of those around...

All of the benefits in the last paragraph of your previous comment are addressed more fully by open standards such as OpenCL, in regard to both target devices and runtime environments. Google has a poor track record when it comes to working on something that they themselves don't directly need. You say it's a promise... I say it will likely never happen. So waiting for them to get Renderscript code working on general-purpose DSPs is practically a pipe dream. As you say, you are presenting a story.

I too am interested in fast, efficient implementation of DSP algorithms running on Android and everywhere else. I still think it would be neat if in the coming months I can find some time to provide performance-based Java, C++, and OpenCL comparisons to further accelerate the DTW algorithm and post about it. I may make this the topic of a presentation at an Android conference later in the year... It would be handy to be able to reference Renderscript counterpart examples. :)

Gabor Paller said...

Jason, I found min/max only for float, did I overlook something?

Unknown said...

min and max are there. Might be a bug with the docs, I'll check it tomorrow.

Gabor Paller said...

Jason, one more question: is it OK to create more than one RenderScript context to enforce parallel execution (the same script executed in two or more RenderScript contexts)? It works nicely for me; is there any trap I'm not aware of?

Unknown said...

You can create multiple contexts. Context switching overhead can add up fast.

I'm curious what the algorithm that might benefit from this looks like. I've done a lot of performance work and may be able to suggest alternative options which will be less work.

Gabor Paller said...

"I'm curious what the algorithm that might benefit from this looks like."

Well, it is the same DTW algorithm we have been discussing with Michael for a while. There is some description here, along with a reference to a more detailed paper. I chose this algorithm because its output needs intermediate results, not only the input values; hence it cannot be parallelized as easily as a convolution filter. I would say that the foreach() semantics cannot be used for DTW, so I intend to implement the parallel execution scheme myself.
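For reference, the parallelization usually described in the literature exploits the fact that cells on the same anti-diagonal of the DTW cost matrix depend only on the previous two anti-diagonals, so each anti-diagonal can be computed in parallel. A serial Java sketch of that traversal order (hypothetical, full-matrix version kept deliberately simple; not from the project):

```java
// Hypothetical anti-diagonal ("wavefront") traversal of the DTW cost
// matrix. All cells visited by one inner loop are mutually independent,
// so they could be farmed out to parallel workers.
public class DtwWavefront {
    static double dtw(double[] s1, double[] s2) {
        int n = s1.length, m = s2.length;
        double[][] D = new double[m][n];
        // Anti-diagonal k contains the cells with x + y == k; they depend
        // only on diagonals k-1 and k-2, which are already complete.
        for (int k = 0; k <= n + m - 2; k++) {
            for (int y = Math.max(0, k - n + 1); y <= Math.min(k, m - 1); y++) {
                int x = k - y;
                double cost = Math.abs(s1[x] - s2[y]);
                if (x == 0 && y == 0) {
                    D[y][x] = cost;
                } else if (x == 0) {
                    D[y][x] = cost + D[y - 1][x];
                } else if (y == 0) {
                    D[y][x] = cost + D[y][x - 1];
                } else {
                    D[y][x] = cost + Math.min(D[y][x - 1],
                                     Math.min(D[y - 1][x], D[y - 1][x - 1]));
                }
            }
        }
        return D[m - 1][n - 1];
    }
}
```

The catch, as discussed above, is that the amount of parallelism varies per diagonal and the traversal is not a simple per-element map, which is exactly what makes it awkward to express with a plain foreach().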

Unknown said...

For the cases where you want parallelism but the level of parallel execution is not a direct property of the input or output, you have a few options.

Clipping the launch. I tried this approach in the histogram example we did, where you have a reduction: clip the launch to either one row or one column and then loop within the kernel. This is a very simple way to alter the thread-to-work ratio.

For more complex cases I often create a third allocation to hold per-thread data (or it can just be a dummy), then use rsGetElementAt / rsSetElementAt to walk your in/out data any way you see fit.

Without looking at FastDTW in detail, I'm not sure which technique will be best here. My work on codecs leads me to think you will probably want to try the second option first; it's more flexible.
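In plain Java terms, the "clip the launch and loop within the kernel" idea amounts to decoupling the number of workers from the size of the data. A rough, hypothetical analogy with ordinary threads (illustrative only; not RenderScript code):

```java
// Hypothetical analogy of a clipped launch: a fixed number of workers,
// each striding over the whole input, regardless of how large it is.
public class StridedWorkers {
    static int[] squareAll(int[] in, int nWorkers) {
        int[] out = new int[in.length];
        Thread[] workers = new Thread[nWorkers];
        for (int t = 0; t < nWorkers; t++) {
            final int offset = t;
            workers[t] = new Thread(() -> {
                // Each worker handles every nWorkers-th element,
                // like a kernel looping after a clipped launch.
                for (int i = offset; i < in.length; i += nWorkers) {
                    out[i] = in[i] * in[i];
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            try {
                w.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return out;
    }
}
```

The per-thread-data option described above is the same idea taken further: each worker gets its own scratch element and is free to walk the input and output in whatever pattern the algorithm needs.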

Gabor Paller said...

"For more complex cases I often create a third allocation to hold per-thread data (or it can just be a dummy), then use rsGetElementAt / rsSetElementAt to walk your in/out data any way you see fit."

But then you can't control the maximum number of processing elements allocated to the task. E.g. if the parallelization of the algorithm supports only a certain number of processing elements (e.g. 2), the RenderScript runtime can still allocate more (e.g. 4 in the case of a 4-core CPU) and you have to idle the remaining 2 cores.

DTW has one such parallelization, if one traverses the cost matrix horizontally and vertically at the same time. This suits 2 processing elements but not more.

Unknown said...

If you want "at most 2 threads" you can create a 1D allocation with 2 elements. It's completely valid to run a forEach on that.

The allocation size caps the number of threads that get used and factors into processor selection.