Tuesday, March 11, 2014

Beyond RenderScript - parallelism with NEON


My last post about the parallel implementation of Distributed Time Warping (DTW) algorithm was a disappointment. The RenderScript runtime executed the parallel implementation significantly slower than the single-core implementation (also implemented with RenderScript). It turned out that parallelizing the processing of 10000-50000 element vectors on multiple cores were not worth the cost of the multi-thread processing and all the overhead that comes with it (threads, semaphores, etc.). One core must be allocated a significantly larger workload but our DTW algorithm is not able to generate such a large, independent workload because rows of the DTW matrix depend on each other. So in order to exploit RenderScript multi-core support, it is best to have an algorithm where the output depends on only the input and not on some intermediate result because this type of algorithm can be sliced up easily to multiple cores.

It would have been such a waste to discard our quite complicated parallel processing DTW algorithm so I turned to other means of parallel execution. Multi-core is one option but the ARM processors in popular Android devices have another parallel execution engine, internal to the core, the NEON execution engine. One NEON instruction is able to process 4 32-bit integers in parallel (see picture below). Can we speed up DTW fourfold with this option?



NEON is actually quite an old technology, even Nexus One was equipped with it. It is much more widely deployed therefore than multi-core CPUs. While ordinary applications can take advantage of multi-core CPUs (e.g. two processes can execute in parallel on two cores), NEON programs are difficult to write. Although some compilers claim the ability to generate NEON code and template libraries are available, the experience is that the potential performance benefits cannot be exploited without hand-coding in assembly and that's not for the faint hearted.

The example program is attached at the end of this post. You have to be logged to the Sfonge site to access it.

The relevant functions are in jni/cpucore.c. There are 3 implementations, processNativeSlow, processNative and processNativeNEON, each is progressively more optimized than the previous one. The processNativeSlow and processNative functions are in C, in processNativeNEON the most time-critical loop ("tight loop") is entirely implemented in mixed ARM/NEON assembly. This tight loop produces 4 result elements in parallel so we expect huge performance gain over the single-core RenderScript implementation (dtw.rs).

The experience is completely different. While the NEON implementation is significantly faster on small datasets, one second of voice is 8000 samples so data sizes grow quickly. On 10 second data sets (80000 samples, 6.4 billion element DTW matrix) the simple nested loop C99 implementation and the complex, hard to understand NEON implementation produces about the same execution time.

How is this possible? Let's take an example of 10 second reference and evaluation samples. This means 80000 elements, 80000*80000=6.4 billion values to calculate. Calculating each value takes 20 bytes to access (2 input samples (2 bytes each), 3 neighbor cells (4 bytes each) and storing the result (4 bytes)). A1 SD Bench measures 800 Mbyte/sec copying performance on my Galaxy Nexus (and similar values on the two cheap Android tablets that the family has), that obviously means 2 accesses (one read and one write). For simplicity, let's assume that reads and writes take about the same time. This means that according to this very rough calculation, the memory accesses themselves take about 80 sec. The real execution time is about 120 sec, the difference can be explained by the simplifications. Cache does not really help because of the large data size. The performance is determined by the RAM speed and the simplest single-core implementation already reaches the bottleneck. All the wizardry with parallelism is futile.

Obviously the case was not helped by the selection of the DTW algorithm as benchmark which intentionally does not fit into the class of algorithms normally used to demonstrate the benefits of parallel processing. Grayscale conversion would be better (one read, one write and 3 multiplications per pixel). But this means that you actually have to be really lucky with your algorithms for these parallel options to speed up your code significantly. Even then, it is worth looking at the parallel options inside the core before going multi-core. And you definitely should not forget the auxiliary costs of parallel computation, e.g. distributing/gathering the data to/from the parallel processing units or whether other hardware (e.g. memory) is able to keep the pace with the CPU.

One wild idea at the end. Could RenderScript computation model be used to generate NEON code? With some limitations, the answer is probably yes.

20 comments:

Romain Guy said...

Renderscript will generate NEON code when using the CPU.

MichaelEGR said...

*"Obviously the case was not helped by the selection of the DTW algorithm as benchmark which intentionally does not fit into the class of algorithms normally used to demonstrate the benefits of parallel processing."

---

I'd argue that DTW is more than appropriate because it falls in the 95%+ of algorithms that can be accelerated by CUDA / OpenCL, but not Renderscript in a work efficient manner.

I think DTW is an excellent algorithm to select because it shows the deficiencies of Renderscript where only structured grid (IE image processing) algorithms are being cherry picked and shown off as use cases. There is a whole world of algorithms outside the structured grid case.

In the CUDA / OpenCL world DTW and many other classes of algorithms can be implemented in a work efficient manner as opposed to Renderscript. On the desktop OpenCL / CUDA versions of DTW are upwards of 40 times faster than a naive serial version. multicore (4) CPU versions of DTW are upwards of 16 times faster. It would be interesting to have similar comparisons on Android with a recent quad core SoC.

A large amount of common algorithms also rely on parallel scan techniques which simply isn't work efficient with Renderscript. Choose Radix Sort if you need a simple algorithm that will show the deficiencies of Renderscripts inability to handle parallel scan efficiently.

You mentioned choosing grayscale image processing as a test, but it's no stress test at all. If an image processing algorithm must be tested then something like the Kuwahara filter or other non linear / edge preserving algorithm should be tested. Also if image processing is going to be the use case then GLSL needs to be included as a prime implementation mechanism compared against Renderscript and or CUDA / OpenCL.

I still think the DTW saga test is not over on Android. There really should be multicore C and Java versions tested in addition to an OpenCL version tested. Too bad I don't have the free time to do so right at this moment as I'd compare your Renderscript version against an OpenCL version on the ODROID-XU.

Thanks for keeping the posts going on this topic!

Gabor Paller said...

Romain, could you be more specific on that NEON code generation statement?

I inspected the code that RenderScript (LLVM) generated from the dtw.rs file (check out the previous post and look for librs.dtw.so) and there were no NEON statements in it. No wonder those instructions were not present because the optimization I did here by hand is not trivial to perform automatically. A RenderScript kernel could look like:

int kernel(int input) {
return 2*input;
}

For this simple kernel, it would be easy to generate a code that actually takes 4 inputs and produces 4 outputs, map the multiplication statement to a NEON I32 statement and enjoy the benefits (if you don't run into some other bottleneck like memory).

But take this code:

int kernel(int input) {
if( input < 0 )
return 2*input;
else
return 4*input;
}

Here the execution takes different path depending on the input value. Even if you map it to NEON, the performance will not be higher than the naive, serial version.

Gabor Paller said...

Michael, RenderScript *is* the multicore test. The multicore RenderScript driver (so-called reference CPU driver) distributes the payload to all the cores and executes them in parallel. There are two problems with this approach as Jason Sams pointed out in the comment section of the previous post.

- Multicore execution in Android is implemented by means of threads. Thread management and synchronization comes as an additional cost that the relatively small payloads of DTW cannot counterbalance. If you have an image with 1 million pixels, you can send 500000-500000 pixels to the two cores and the thread join at the end may be paid off. But the data dependency of DTW needs resynchronization after each row. In this demo, the input is sampled at 8000 Hz so with 10 sec of data, the payload that can be executed paralelly by the cores is 80000 sample. That payload is turned to be too small.
- Android environment has an additional quirk that affects even the RenderScript rumtime. Unlike on the desktop, Android moves the threads around. So if you have two threads running on 2 cores, they are likely to be executed on both cores (1 thread per core). But if you have 3 threads, one core will be shared by 2 threads. In Java, that third thread may likely be the garbage collector. So at the end you will find that your threads expected to run on multiple cores will actually compete for a single core.

I actually implemented multicore version of the test application but the results were disappointing. Basically one could observe all the problems that multicore RenderScript has. No wonder here - the implementation approach was basically the same except that I managed the threads instead of the RenderScript runtime.

Gabor Paller said...

Michael, one more comment about the OpenCL/CUDA/RenderScript discussion. I don't doubt that high-performance NVIDIA systems installed in PCs are capable of miracles (even though I am not a graphics expert but I know the basics). We are talking about mobile devices, however, that are necessarily constrained. If you could move all the data needed for the computation into the GPU memory, share the workload there and obtain just the result at the end, the performance improvement would be dramatic. Let's use the 10 second DTW benchmark case: that would mean 80000+80000 samples (320 kbytes total) plus 3*80000=240000 data items (4 bytes each, total of 960 kbytes), all in the GPU memory. Are you aware of any Android device that has the hardware that is in theory capable of doing that? Because if we have that hardware, then we can start thinking about the best way of implementing the code. But as long as we are stuck with GPUs that only understand OpenGL (as far as I know, my Galaxy Nexus has such a GPU) and RAM with 800 MBytes/sec copy rate, this OpenCL/CUDA/RenderScript discussion is completely pointless.

MichaelEGR said...

@Gabor
Indeed the Galaxy Nexus is a rather old device. I use it as the baseline device for the work I do now and both CPU & GPU of the GN is quite underperforming compared against the current gen let alone the next gen SoCs on the near horizon. Since the GN is the same SoC as in the PandaBoard and I'm aware that there were OpenCL drivers for the PandaBoard, but apparently to get access to the SDK on the PandaBoard one needed to have an SLA with Imagination Tech and TI. IE be a larger company. Of course no chance to get access via the GN.

All mobile GPU manufacturers have OpenCL drivers available for current generation hardware such that an OEM can expose it on any device by simply making the driver accessible. It's not until the next generation which we are on the doorsteps right now (Adreno 4xx, Tegra K1, PowerVR 6 series) that OpenCL will really be powerful on mobile. This mostly has to do with the previous GPU generation being in development before mobile OpenCL really became a focus. This led to work group limitations with most previous gen mobile OpenCL implementations. The ODROID-XU while having OpenCL drivers accessible has work group limitations which could likely affect non-structured grid algorithms like DTW.

I'd say that OpenCL on a next gen device (Adreno 4xx, Tegra K1, PowerVR 6 series) will be more than capable at making a dramatic improvements beyond structured grid oriented algorithms. It would be neat to compare the performance on the next gen against limited OpenCL environments like the ODROID-XU.

I suppose regarding multicore Java / C the point isn't necessarily to try and best Renderscript, but come up with clear examples of cross-platform implementations. They likely will be similar in performance to Renderscript. I view it as no different than on the desktop when I compare a multicore Java implementation against an OpenCL software implementation.

Jason Sams said...

NEON: Yes we will generate neon if the script opts into the "rs_fp_relaxed" pragma indicating its allowed to reduce precision on some FP operation such as flushing denorms to zero. However, you will not see any neon instructing in the .bc code. Remember, the .bc must be processable by any CPU including non-neon ARM, x86, and MIPS. Its on device the vector ops get turned into NEON. You can pull the object file from the on device cache after running the app and disassemble it. IIRC its in /data/[app]/cache/...

If you would like to see the difference it make you can download one of our benchmarks:
https://android.googlesource.com/platform/frameworks/rs/+/master/java/tests/ImageProcessing_jb/

Just look at the performance difference between "relaxed" and "full".

Gabor Paller said...

Jason, we are not talking about the same thing. You talk about some optimizations in floating-point calculations. The idea presented in this post is completely different. Depending on your data length, NEON SIMD instructions deliver 4x (in case of 32 bit data)-16x (in case of 8 bit data) performance improvement because the NEON engine is capable of executing 4-16 operations in parallel. A similar mechanism exist in Intel architectures, there it is called "vectorization". Advantage of parallelism inside one core is that it is widely deployed, I would be in trouble finding an Android device (even obsolete one) that does not have ARMv7 core with NEON. Disadvantage of single-core parallelism is that it is very badly supported by tools. After some struggling, everybody finds that it is better to write NEON assembly code by hand.

Now the big challenge for RenderScript is that even for the algorithm classes RS does support, it needs either a RenderScript-compatible GPU or a multicore CPU to exploit any parallelism. On multicore CPUs, RenderScript is at the mercy of Android task scheduler. Meanwhile, NEON can deliver significant performance boost (if not bottlenecked by the memory speed as in my case) on the *existing* devices. The cost is a bit more coding but coders are cheap.

Therefore my question is whether is possible to exploit the RS execution model for supporting SIMD parallelism.

I did not inspect the .bc bytecode for NEON instructions, I inspected the pre-generated .so file under the libs directory as described by the post. As my code does not use floating point, floating point optimization would not be present anyway.

Gabor Paller said...

Michael, there may be a middle ground that takes advantage of both LLVM and OpenCL. What do you think about SPIR?

Jason Sams said...

Vectorization: Yes I know what this is...

There are many ways to generate neon. If you use the primitives such as float4, short4, ... you will get neon code today.

The second option as you point out is auto-vectorization. LLVM has been working on code to do this but its not quite ready yet.

The pre-generated .so is for "compatibility" and intended for the NDK. Because not all legacy devices support NEON the code will be different than what is generated on device from the same .rs file. PIC vs non-PIC and soft vs hard float are other differences between the .so and the on device version.

MichaelEGR said...

@Gabor
SPIR is very exciting in regard to protecting kernel source and also intermediate CL code generation from various higher level languages. As a middleware developer (TyphonRT) which is a Java based component architecture that has different usage patterns than traditional OOP there are possibilities to capture these patterns and generate SPIR code directly from the Java based code. On my side I'd probably accomplish this through annotation pre-processing or conceivably at runtime along with the inherent component structure of the TyphonRT API which splits apart data from logic.

It's really exciting because like the rest of OpenCL et al SPIR enables a generic cross-platform execution environment. We'll see soon if the next gen of mobile GPUs (Adreno 4xx, Tegra K1, PowerVR 6 series) which should all support OpenCL 1.2 will also support SPIR.

I wished I had a GDC conference ticket this year because there will be a lot of interesting Khronos and OpenCL developments announced next week at the 2 days of sessions surrounding Khronos APIs.

SPIR is another reason why OpenCL is very important to be widely available on Android because it provides an alternate high performance execution environment that allows parallel / performance techniques to be applied in lieu of direct improvements to the Java language advancements that are not going to come to Android, but are being developed in the J2SE world. We may not see some of the built in parallel improvements to J2SE until Java 9 is released, but a good deal of language evolution which will not hit Android at the Java level (heck annotations is still broken on Android! without fixes / workarounds!) will be possible to realize in a similar way with SPIR & OpenCL while maintaining cross-platform capabilities with J2SE and Android. IE TyphonRT runs on J2SE and Android hence my strong preference towards OpenCL vs Renderscript.

@Jason.. Thanks for contributing to the discussion without the rhetoric of Mr. Hines! Beyond the basic information about Renderscript on the API guide section of the android developers web site is there an in depth programmers guide for RS available? I must say when I got the ODROID-XU and joined the Imagination Tech OpenCL SDK early access program they sent over an extremely detailed ~40 page programmers guide. All GPU vendors have similar documentation and it would help if something similar is available for RS along with detailed information on specifically which manufacturers have RS enabled drivers.

Jason Sams said...

We don't have a programmers guide yet. I try to answer questions about using RS whenever I get the chance. There are a lot of misunderstandings about what it can and cannot do and I try to help out when I have time.

Gabor Paller said...

Jason, auto-vectorization is hard in case of a general coding problem. But in case of the RS programming model, it is actually very tempting and quite easy to do. If the kernel code does not have conditional branches that depend on the input value then auto-vectorization just means that input variables are allocated in NEON registers and filled up with 4 (in case of 32-bit input values) at a time. It would work just fine for that grayscale conversion example code.

MichaelEGR said...

@Jason - Having a detailed programmers guide will go a long way supporting adoption for those that can play just within the Android sphere of things. I know Renderscript is a moving target, but it is indeed 3 years after release at this point.

@Gabor et al - The OpenGL ES 3.1 spec was just announced today and it includes compute shaders which is a very welcome addition which I didn't expect to see on mobile until full profile is supported. We'll see if Android team bumps the API version for GLES as quick as they did for 3.0 with Android 4.3. The right thing to do at this point would be to quickly bump the OpenGL ES API and officially support OpenCL. As we know though we just need accessible OpenCL drivers and can roll our own 3rd party APIs (on the Java side). Being able to get started with compute shaders through the GLES 3.1 API via official Android APIs which will be hard for Google to work around without deliberately hampering the GL APIs is a very strategic play on the part of Khronos. There are additional OpenCL announcements for Wednesday.

Jason Sams said...

I'll never call auto-vectorization easy as there are so many corner cases you need to handle. The RS api was designed to make this a tractable problem. Hopefully it makes it to the top of the task list sometime soon.

Docs: Agree, its something I'm trying to get fixed.

lovesports said...

Hi, I enjoy reading your site! Is it okay to contact your through your email? Please email me back.

Thanks!

Cailyn
cailynxxx gmail.com

Joel said...

Hey, I had a great time reading your website. Do you have an email address that I can contact you on? Thank you and hope to hear from you soon.

Regards,

Joel
JHouston791 gmail.com

Kevin Collins said...
This comment has been removed by the author.
Kevin Collins said...

Hi you have a broken link

kevincollins1011 @gmail.com

Angela Brooks said...

Hey, I had a great time reading your website. Can I contact you through email?. Please email me back.

Regards,

Angela
angelabrooks741 gmail.com