Friday, February 7, 2014

RenderScript in Android - the parallel version

In the previous post I promised to revisit the parallel case. The big promise of RenderScript is to exploit the parallelism of the different CPUs, GPUs and DSPs in the device at no additional cost: once the algorithm is properly transformed into a parallel version, the RenderScript runtime grabs whatever computing devices are available and schedules the subtasks automatically.

The problem with DTW is that it is not so trivial to parallelize. Each cell at (x,y) in the matrix depends on the cells at (x-1,y), (x-1,y-1) and (x,y-1). If the matrix is traversed horizontally or vertically, only two rows (one horizontal and one vertical) can be evaluated in parallel.
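For reference, the serial fill looks roughly like this (a minimal sketch in the C99 style of the .rs files; cost is the flattened n-by-m cost matrix, d() stands for the local distance function, and the border initialisation is omitted):

/* Serial DTW fill: cell (x,y) needs its left, lower and lower-left
   neighbours, so the cells of a row or column cannot be computed
   independently of each other. */
for (int x = 1; x < n; x++) {
    for (int y = 1; y < m; y++) {
        float best = fmin(cost[(x - 1) * m + y],           /* (x-1,y)   */
                     fmin(cost[x * m + (y - 1)],           /* (x,y-1)   */
                          cost[(x - 1) * m + (y - 1)]));   /* (x-1,y-1) */
        cost[x * m + y] = d(x, y) + best;
    }
}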

Michael Leahy recommended a paper that solves this problem. The algorithm traverses the matrix diagonally: each diagonal row depends on the two previous diagonal rows, but the cells within one diagonal row don't depend on each other. One diagonal row can then be fed to RenderScript to iterate over in parallel. The picture below illustrates the concept.



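In code, the diagonal traversal looks roughly like this (again a sketch, with the same placeholders as above; it is the inner loop that can be handed over to RenderScript):

// Wavefront order: diagonal k holds the cells with x + y == k.
// A cell on diagonal k reads only diagonals k-1 and k-2, so the
// iterations of the inner loop are independent of each other.
for (int k = 2; k <= (n - 1) + (m - 1); k++) {
    int xMin = max(1, k - (m - 1));       // clip the diagonal...
    int xMax = min(n - 1, k - 1);         // ...to the matrix
    for (int x = xMin; x <= xMax; x++) {  // parallelisable
        int y = k - x;
        // same three-neighbour recurrence as in the serial version
        cost[x * m + y] = d(x, y) + fmin(cost[(x - 1) * m + y],
                          fmin(cost[x * m + (y - 1)],
                               cost[(x - 1) * m + (y - 1)]));
    }
}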
The example program can be downloaded from here.

You will notice that there are two parallel implementations. findReferenceSignalC99Parallel() is the "proper" implementation that closely follows the RenderScript tutorial: the diagonal rows are iterated in Java and only the parallel kernel is implemented in RenderScript. Even though it is functional, this version is not invoked by default because it delivers completely unacceptable performance on my 2-core Galaxy Nexus. By looking closely at the execution times, I concluded that even though RenderScript runtime invocations (copying into Allocations and invoking forEach) are normally fast, sometimes very innocent-looking invocations (like copying 5 integers into an Allocation) can take about a second. This completely ruined the implementation's performance.

The other parallel implementation, the one that is actually invoked and whose performance is compared to the 1-core RenderScript implementation (the fastest one), is findReferenceSignalC99ParallelRSOnly(). This version is implemented entirely in RenderScript. Unfortunately its performance is 2-2.5 times worse than that of the 1-core implementation. How can that be?
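Structurally, the RenderScript-only version does something like this (an illustrative sketch rather than the actual dtwparallel2.rs; gScript, gIn, gOut, gDiagonal and gMaxDiagonal are names made up for the example):

#pragma version(1)
#pragma rs java_package_name(com.example.dtw)

rs_script gScript;     // set from Java to refer to this script itself
rs_allocation gIn;     // dummy input sized to the longest diagonal
rs_allocation gOut;    // dummy output of the same size
int gDiagonal;         // the diagonal the kernel is currently processing
int gMaxDiagonal;      // (n-1)+(m-1), set from Java

void root(const int *in, int *out, uint32_t x) {
    // map index x on diagonal gDiagonal to a matrix cell and apply
    // the three-neighbour recurrence to it
}

// Invoked once from Java, so there is a single Java->RenderScript
// round trip; each diagonal still costs one rsForEach launch, though.
void dtw() {
    for (int k = 2; k <= gMaxDiagonal; k++) {
        gDiagonal = k;
        rsForEach(gScript, gIn, gOut);
    }
}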

First, if you compare dtw.rs and dtwparallel2.rs, you will notice that the parallel implementation is considerably more complex. Indexing those varying-length diagonal rows takes a bit of fiddling, while the 1-core implementation can take advantage of fast pointer arithmetic to move from cell to cell sequentially. So the parallel implementation starts with a handicap, and that handicap is not compensated by the 2 cores of the Galaxy Nexus.

OK, the Galaxy Nexus is stone-age hardware, but what happens on a 4-core processor like the one in the Nexus 4? The runtime does launch with 4 cores, but then the Adreno driver kicks in and the result is that the parallel implementation is about 3 times slower than the serial one. What happens in the driver I don't know; as far as I can see, its source code is not available.

Jason Sams recommended disabling the GPU driver with
adb shell setprop debug.rs.default-CPU-driver 1
but I decided to stop my adventures here. The conclusion I drew for myself is that RenderScript in its present form is not ready for parallel programming. Clang/LLVM is a very promising compilation technology, but the parallel runtime suffers from a number of problems. IMHO, there should be a way to programmatically control how workloads are allocated to CPUs/GPUs. Until then, if you want to harness the power of your multicore processor, code the parallel runtime yourself, using RenderScript for the serial code if you wish.

4 comments:

Unknown said...

A few comments. I finally had a chance to look at the code and I think I know where your performance issues are coming from.

You are doing a kernel launch per line. This will be slow in any language; it's not RS-specific. It's entirely due to power management: it takes time to bring additional cores out of sleep and up to speed. There is also overhead due to mutexing and to making sure the work completes before returning.

You would be much better off doing one large kernel launch and using atomics to subdivide the work. This will let all the cores get up to speed and greatly reduce the cost of coordinating your threads.

The CPU/GPU comment was a guess on my part, made before I had seen the source; I was just trying to give you debugging options that might be helpful. The code as structured is not GPU-friendly and should not be run on one. Based on this source it never ran on the GPU, which is the correct call. That command would have made no difference.

I'd say that when you want to revisit this, try using atomics to coordinate the threads within one launch. It will almost certainly work a lot better.

Gabor Paller said...

Jason, I understand that this algorithm does not fit the RenderScript parallel computation model - that's why I chose it. For example, the rows depend on each other and the source/target matrix is not rectangular. On the other hand, we are talking about rows that are 5000-15000 elements long. Such workloads should be easy to partition among multiple CPU cores, as implementations in OpenCL demonstrate. Do you want me to write a parallel implementation in Java to demonstrate it? :-)

You say that the cores should be brought out of sleep. I guess that should be done only once, not once per diagonal - and here we typically have 10000-18000 diagonals.

You also say that there should be only one kernel launch. Can you do that in RenderScript on a matrix which is not rectangular?

I don't understand your reference to atomics. Do you mean the methods in rs_atomic.rsh? How would that solve the problem that this runtime obviously does not exploit any parallelism? Because no matter how I set PARALLEL_LIMIT in dtwparallel2.rs (which controls whether a row is processed by a simple for() loop or by rsForEach), the results are almost the same.

Unknown said...

The issue has nothing to do with the RenderScript model. The code as written wakes up a worker pool, waits for it to do the work, waits for the threads to join, and then repeats this many times. This is slow in any language.

What I am recommending is that you do ONE small 1D launch on an allocation of size 2-16. Then, within that kernel, walk multiple rows, i.e. write it exactly as you would if you had a worker pool in C. Use atomic ops to coordinate the threads if you need to share info between them.
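Roughly like this (an untested sketch; gNextRow, gRowsDone and gTotalRows are placeholder names, initialized from Java):

volatile int32_t gNextRow;   // shared cursor into the row queue
volatile int32_t gRowsDone;  // rows completed so far
int gTotalRows;              // total number of rows, set from Java

void root(const int *in, int *out, uint32_t x) {
    // Launched once over a tiny allocation, so each of the 2-16
    // kernel instances behaves like one worker thread of a C pool.
    for (;;) {
        int row = rsAtomicInc(&gNextRow);  // claim the next row
        if (row >= gTotalRows) {
            break;                         // no work left
        }
        // ... process the whole row here; before reading cells a
        // neighbour may still be writing, poll gRowsDone (or a
        // per-row progress counter) until they are ready ...
        rsAtomicInc(&gRowsDone);           // publish completion
    }
}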

Power management in mobile is very aggressive; it's not like desktop. If you have an algorithm with 4 threads that only achieves 75% load, one of those cores will be put to sleep. Then you get additional overhead from thread switching on the remaining cores. What's worse, you then run into the problem where one thread completes more slowly than the others because it was suspended.

Larger workloads greatly diminish this problem because the scheduler has time to react and you only go through one set of launches and joins. IIRC there is about 50us of overhead per kernel launch, so if you do 5000-15000 of those it adds up fast.

Anonymous said...

Hi Gabor,

I'm an editor at DZone, a content site for software developers. You've written a lot of posts that I think would be very popular with our audience - our Mobile Zone and Javalobby readers in particular - and I'd like to invite you to join our Most Valuable Blogger program. You can find details here:

http://www.dzone.com/aboutmvb

If you're interested, please email me at alecn@dzone.com and I can help you get started.

Thanks,
Alec