Comments on "My life with Android :-): RenderScript in Android - the parallel version" by Gabor Paller

Comment by Anonymous, 2014-02-17 23:34:

Hi Gabor,
I'm an editor at DZone, a content site for software developers. You've written a lot of posts that I think would be very popular with our audience - our Mobile Zone and Javalobby readers in particular - and I'd like to invite you to join our Most Valuable Blogger program. You can find details here:

http://www.dzone.com/aboutmvb

If you're interested, please email me at alecn@dzone.com and I can help you get started.

Thanks,
Alec

Comment by Anonymous, 2014-02-08 22:35:

The issue has nothing to do with the RenderScript model. The code as written wakes up a worker pool, waits for the threads to do their work, waits for them to join, then repeats this many times. This is slow in any language.

What I am recommending is ONE small 1D launch on an allocation of size 2-16. Then, within that kernel, walk multiple rows - i.e. write it exactly how you would if you had a worker pool in C. Use atomic ops to coordinate the threads if you need to share information between them.

Power management in mobile is very aggressive. It's not like desktops. If you have an algorithm with four threads that only achieves 75% load, one of those cores will be put to sleep. You will then get additional overhead from thread switching on the remaining cores. What's worse, you then run into the problem where one thread completes later than the others because it was suspended while waiting for them to finish.

Larger workloads greatly diminish this problem because the scheduler has time to react and you only go through one set of launches and joins.
IIRC there is about 50us of overhead per kernel launch, so if you do 5000-15000 of those it adds up fast.

Comment by Gabor Paller, 2014-02-08 19:31:

Jason, I understand that this algorithm does not fit into the RenderScript parallel computation model - that's why I chose it. For example, the rows depend on each other and the source/target matrix is not rectangular. On the other hand, we are talking about rows that are 5000-15000 elements long. These workloads should be easy to partition among multiple CPUs, as the OpenCL implementation demonstrates. Do you want me to write a parallel implementation in Java to demonstrate it? :-)

You say that the cores should be brought out of sleep. I guess that should be done only once, not per diagonal - and here we typically have 10000-18000 diagonals.

You also say that there should be only one kernel launch. Can you do that in RenderScript on a matrix which is not rectangular?

I don't understand your reference to atomics. Do you mean the methods in rs_atomic.rsh? How would those solve the problem that this runtime obviously does not exploit any parallelism? No matter how I set PARALLEL_LIMIT in dtwparallel2.rs - which controls whether a row is processed by a simple for() loop or by rsForEach - the results are almost the same.

Comment by Anonymous, 2014-02-08 00:21:

A few comments. I finally had a chance to look at the code, and I think I know where your performance issues are coming from.

You are doing a kernel launch per line.
This will be slow in any language; it is not RS-specific. It is mostly due to power management: it takes time to bring additional cores out of sleep and up to speed. There is also overhead from mutexes and from making sure the work completes before each launch returns.

You would be much better off doing one large kernel launch and using atomics to subdivide the work. That lets all the cores get up to speed and greatly reduces the cost of coordinating your threads.

The CPU/GPU comment was a guess on my part, made before I had seen the source - I was just trying to give you debugging options that might be helpful. The code as structured is not GPU friendly and should not be run on one. Based on this source it never ran on the GPU, which is the correct call, so that command would have made no difference.

I'd say that when you want to revisit this, try using atomics to coordinate the threads within one launch. It will almost certainly work a lot better.