Migrate scripts to OpenGL ES 3.1

For workloads where GPU compute is ideal, migrating RenderScript scripts to OpenGL ES (GLES) allows applications written in Kotlin, Java, or using the NDK to take advantage of GPU hardware. A high-level overview follows to help you use OpenGL ES 3.1 compute shaders to replace RenderScript scripts.

GLES Initialization

Instead of creating a RenderScript context object, perform the following steps to create a GLES offscreen context using EGL:

Get the default display.
Initialize EGL using the default display, specifying the GLES version.
Choose an EGL config with a surface type of EGL_PBUFFER_BIT.
Use the display and config to create an EGL context.
Create the offscreen surface with eglCreatePbufferSurface. If the context is only going to be used for compute, this can be a trivially small (1x1) surface.
Create the render thread and call eglMakeCurrent in the render thread with the display, surface, and EGL context to bind the GL context to the thread.
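The steps above can be sketched roughly as follows. This is a minimal illustration rather than the sample app's code: the OffscreenEgl holder and the createOffscreenEgl function are hypothetical names, and the sketch assumes it is called on the thread that will issue GL commands. It requests an OpenGL ES 3.x context, so it is worth checking the reported GL version afterwards to confirm that 3.1 is available.

```kotlin
import android.opengl.EGL14
import android.opengl.EGLConfig
import android.opengl.EGLContext
import android.opengl.EGLDisplay
import android.opengl.EGLExt
import android.opengl.EGLSurface

// Hypothetical holder for the EGL objects created below.
class OffscreenEgl(
    val display: EGLDisplay,
    val context: EGLContext,
    val surface: EGLSurface
)

// Assumption: this runs on the render thread, because eglMakeCurrent
// binds the context to the calling thread.
fun createOffscreenEgl(): OffscreenEgl {
    // Steps 1 and 2: get and initialize the default display.
    val display = EGL14.eglGetDisplay(EGL14.EGL_DEFAULT_DISPLAY)
    check(display != EGL14.EGL_NO_DISPLAY) { "No default EGL display" }
    val version = IntArray(2)
    check(EGL14.eglInitialize(display, version, 0, version, 1)) { "eglInitialize failed" }

    // Step 3: choose a config that supports ES 3.x rendering to a pbuffer.
    val configAttribs = intArrayOf(
        EGL14.EGL_RENDERABLE_TYPE, EGLExt.EGL_OPENGL_ES3_BIT_KHR,
        EGL14.EGL_SURFACE_TYPE, EGL14.EGL_PBUFFER_BIT,
        EGL14.EGL_NONE
    )
    val configs = arrayOfNulls<EGLConfig>(1)
    val numConfigs = IntArray(1)
    check(
        EGL14.eglChooseConfig(display, configAttribs, 0, configs, 0, 1, numConfigs, 0) &&
            numConfigs[0] > 0
    ) { "No matching EGL config" }
    val config = checkNotNull(configs[0])

    // Step 4: create the context (ES 3.x requested).
    val contextAttribs = intArrayOf(EGL14.EGL_CONTEXT_CLIENT_VERSION, 3, EGL14.EGL_NONE)
    val context = EGL14.eglCreateContext(display, config, EGL14.EGL_NO_CONTEXT, contextAttribs, 0)
    check(context != EGL14.EGL_NO_CONTEXT) { "eglCreateContext failed" }

    // Step 5: a 1x1 pbuffer is enough for compute-only use.
    val surfaceAttribs = intArrayOf(EGL14.EGL_WIDTH, 1, EGL14.EGL_HEIGHT, 1, EGL14.EGL_NONE)
    val surface = EGL14.eglCreatePbufferSurface(display, config, surfaceAttribs, 0)
    check(surface != EGL14.EGL_NO_SURFACE) { "eglCreatePbufferSurface failed" }

    // Step 6: bind the context to the current thread.
    check(EGL14.eglMakeCurrent(display, surface, surface, context)) { "eglMakeCurrent failed" }
    return OffscreenEgl(display, context, surface)
}
```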

The sample app demonstrates how to initialize the GLES context in GLSLImageProcessor.kt. To learn more, see EGLSurfaces and OpenGL ES.

GLES debug output

To get useful errors from OpenGL, enable debug logging through an extension that sets a debug output callback. The SDK method for doing this, glDebugMessageCallbackKHR, has never been implemented and throws an exception. The sample app includes a wrapper that sets the callback from NDK code.

GLES Allocations

A RenderScript Allocation can be migrated to an Immutable storage texture or a Shader Storage Buffer Object. For read-only images, you can use a Sampler Object, which allows for filtering.
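As a rough illustration of the first option, the sketch below allocates an immutable storage texture and binds it to an image unit so that a compute shader can write to it. The function name createWritableImage and its parameters are hypothetical, the RGBA8 format is an assumption, and a GLES 3.1 context must already be current on the calling thread.

```kotlin
import android.opengl.GLES31

// Creates a width x height RGBA8 texture with immutable storage and binds it
// to image unit `unit` for write-only access from a shader.
// Returns the texture handle.
fun createWritableImage(width: Int, height: Int, unit: Int): Int {
    val ids = IntArray(1)
    GLES31.glGenTextures(1, ids, 0)
    GLES31.glBindTexture(GLES31.GL_TEXTURE_2D, ids[0])
    // glTexStorage2D fixes the size and format for the texture's lifetime.
    GLES31.glTexStorage2D(GLES31.GL_TEXTURE_2D, 1, GLES31.GL_RGBA8, width, height)
    // Level 0, not layered, layer 0; the format must match the shader's
    // image layout qualifier (rgba8 in this sketch).
    GLES31.glBindImageTexture(
        unit, ids[0], 0, false, 0, GLES31.GL_WRITE_ONLY, GLES31.GL_RGBA8
    )
    return ids[0]
}
```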

GLES resources are allocated within GLES. To avoid memory copying overhead when interacting with other Android components, there is an extension for KHR Images that allows the sharing of 2D arrays of image data. This extension has been required for Android devices beginning with Android 8.0. The graphics-core Android Jetpack library includes support for creating these images within managed code and mapping them to an allocated HardwareBuffer:

val outputBuffers = Array(numberOfOutputImages) {
    HardwareBuffer.create(
        width, height, HardwareBuffer.RGBA_8888, 1,
        HardwareBuffer.USAGE_GPU_SAMPLED_IMAGE
    )
}
val outputEGLImages = Array(numberOfOutputImages) { i ->
    androidx.opengl.EGLExt.eglCreateImageFromHardwareBuffer(
        display,
        outputBuffers[i]
    )!!
}

Unfortunately, this doesn't create the immutable storage texture required for a compute shader to write directly to the buffer. The sample uses glCopyTexSubImage2D to copy the storage texture used by the compute shader into the KHR Image. If the OpenGL driver supports the EGL Image Storage extension, then that extension can be used to create a shared immutable storage texture to avoid the copy.

Conversion to GLSL compute shaders

Your RenderScript scripts are converted into GLSL compute shaders.

Write a GLSL compute shader

In OpenGL ES, compute shaders are written in the OpenGL Shading Language (GLSL).

Adaptation of script globals

Based on the characteristics of the script globals, you can either use uniforms or uniform buffer objects for globals that are not modified within the shader:

Uniform buffer: Recommended for frequently changed script globals that are too large to update conveniently as individual uniforms. (OpenGL ES has no push constants; that limit is a Vulkan concept.)

For globals that are changed within the shader, you can use an Immutable storage texture or a Shader Storage Buffer Object.
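To make the mapping concrete, here is a hypothetical shader source, held in a Kotlin constant, that shows one possible home for each kind of script global. Every name in it (GLOBALS_SHADER, u_gain, Params, Histogram, u_input, u_output) is invented for illustration, and the binding points and 8x8 workgroup size are arbitrary choices.

```kotlin
// Hypothetical compute shader illustrating where script globals can live.
const val GLOBALS_SHADER = """#version 310 es
precision highp float;
layout (local_size_x = 8, local_size_y = 8, local_size_z = 1) in;

// Small read-only global: a plain uniform.
uniform float u_gain;

// Larger read-only globals: a uniform buffer object.
layout (std140, binding = 0) uniform Params {
    mat4 colorMatrix;
    vec4 offset;
};

// Global modified by the shader: a shader storage buffer object.
layout (std430, binding = 1) buffer Histogram {
    uint bins[256];
};

// Input and output images (immutable storage textures bound as images).
layout (rgba8, binding = 0) uniform readonly highp image2D u_input;
layout (rgba8, binding = 1) uniform writeonly highp image2D u_output;

void main() {
    ivec2 pos = ivec2(gl_GlobalInvocationID.xy);
    // Skip invocations that fall outside the image.
    if (any(greaterThanEqual(pos, imageSize(u_input)))) return;

    vec4 pixel = colorMatrix * imageLoad(u_input, pos) * u_gain + offset;
    // Atomic update of the writable global.
    atomicAdd(bins[uint(clamp(pixel.r, 0.0, 1.0) * 255.0)], 1u);
    imageStore(u_output, pos, pixel);
}
"""
```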

Execute Computations

Compute shaders aren't part of the graphics pipeline; they are general purpose and designed to compute highly-parallelizable jobs. This lets you have more control over how they execute, but it also means that you have to understand a bit more about how your job is parallelized.

Create and initialize the compute program

Creating and initializing the compute program has lots in common with working with any other GLES shader.

Create the program and the compute shader associated with it.
Attach the shader source, compile the shader (and check the results of the compilation).
Attach the shader, link the program, and use the program.
Create, initialize, and bind any uniforms.
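These steps might look like the following sketch. The function name createComputeProgram is hypothetical and error handling is reduced to simple checks; a GLES 3.1 context is assumed to be current on the calling thread.

```kotlin
import android.opengl.GLES31

// Compiles `source` as a compute shader, links it into a program, makes the
// program current, and returns the program handle.
fun createComputeProgram(source: String): Int {
    val status = IntArray(1)

    // Create the compute shader, attach its source, and compile it.
    val shader = GLES31.glCreateShader(GLES31.GL_COMPUTE_SHADER)
    GLES31.glShaderSource(shader, source)
    GLES31.glCompileShader(shader)
    GLES31.glGetShaderiv(shader, GLES31.GL_COMPILE_STATUS, status, 0)
    check(status[0] == GLES31.GL_TRUE) {
        "Compile failed: " + GLES31.glGetShaderInfoLog(shader)
    }

    // Attach the shader to a new program and link it.
    val program = GLES31.glCreateProgram()
    GLES31.glAttachShader(program, shader)
    GLES31.glLinkProgram(program)
    GLES31.glGetProgramiv(program, GLES31.GL_LINK_STATUS, status, 0)
    check(status[0] == GLES31.GL_TRUE) {
        "Link failed: " + GLES31.glGetProgramInfoLog(program)
    }

    // The shader object is no longer needed once the program is linked.
    GLES31.glDeleteShader(shader)

    GLES31.glUseProgram(program)
    return program
}
```

With the program in use, a uniform can then be set in the usual way, for example with glGetUniformLocation followed by glUniform1f for a float uniform.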

Start a computation

Compute shaders operate within an abstract 1D, 2D, or 3D space on a series of workgroups, which are defined within the shader source code, and represent the minimum invocation size as well as the geometry of the shader. The following shader works on a 2D image and defines the workgroups in two dimensions:

private const val WORKGROUP_SIZE_X = 8
private const val WORKGROUP_SIZE_Y = 8
private const val ROTATION_MATRIX_SHADER =
    """#version 310 es
    layout (local_size_x = $WORKGROUP_SIZE_X, local_size_y = $WORKGROUP_SIZE_Y, local_size_z = 1) in;

Invocations within a workgroup can share memory. The amount available is reported by GL_MAX_COMPUTE_SHARED_MEMORY_SIZE, which is at least 16 KB in OpenGL ES 3.1, and shaders can use memoryBarrierShared() to provide coherent access to it.

Define workgroup size

Even if your problem space works well with a workgroup size of 1, setting an appropriate workgroup size is important for parallelizing the compute shader. For example, if the size is too small, the GPU driver may not parallelize your computation enough. Ideally, these sizes should be tuned per GPU, although reasonable defaults work well enough on current devices, such as the workgroup size of 8x8 in the shader snippet.

There is a GL_MAX_COMPUTE_WORK_GROUP_COUNT, but it is substantial; it must be at least 65535 in all three axes according to the specification.
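If you want to tune or validate sizes per device, the limits can be queried at runtime. The sketch below is one way to do that; the ComputeLimits class and queryComputeLimits function are hypothetical names, and a GLES 3.1 context is assumed to be current.

```kotlin
import android.opengl.GLES31

// Hypothetical container for the queried limits.
data class ComputeLimits(
    val maxGroupCount: IntArray,   // per axis: x, y, z
    val maxGroupSize: IntArray,    // per axis: x, y, z
    val maxInvocations: Int,       // max local_size_x * local_size_y * local_size_z
    val maxSharedMemoryBytes: Int
)

fun queryComputeLimits(): ComputeLimits {
    val count = IntArray(3)
    val size = IntArray(3)
    for (axis in 0..2) {
        // Indexed queries: one value per axis, written at offset `axis`.
        GLES31.glGetIntegeri_v(GLES31.GL_MAX_COMPUTE_WORK_GROUP_COUNT, axis, count, axis)
        GLES31.glGetIntegeri_v(GLES31.GL_MAX_COMPUTE_WORK_GROUP_SIZE, axis, size, axis)
    }
    val single = IntArray(1)
    GLES31.glGetIntegerv(GLES31.GL_MAX_COMPUTE_WORK_GROUP_INVOCATIONS, single, 0)
    val invocations = single[0]
    GLES31.glGetIntegerv(GLES31.GL_MAX_COMPUTE_SHARED_MEMORY_SIZE, single, 0)
    return ComputeLimits(count, size, invocations, single[0])
}
```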

Dispatch the shader

The final step in executing computations is to dispatch the shader using one of the dispatch functions such as glDispatchCompute. The dispatch function is responsible for setting the number of workgroups for each axis:

GLES31.glDispatchCompute(
  roundUp(inputImage.width, WORKGROUP_SIZE_X),
  roundUp(inputImage.height, WORKGROUP_SIZE_Y),
  1 // Number of workgroups along Z. 1 == only one z level, which indicates a 2D kernel
)
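The snippet calls roundUp without showing its definition. Assuming it is meant as ceiling division (the number of workgroups needed to cover the image along one axis), a minimal sketch could be:

```kotlin
// Assumed behavior: the smallest number of groups of size `divisor`
// that covers `base` items (ceiling division).
fun roundUp(base: Int, divisor: Int): Int {
    require(base >= 0 && divisor > 0) { "base must be >= 0 and divisor > 0" }
    return (base + divisor - 1) / divisor
}
```

When the image size is not a multiple of the workgroup size, the last workgroup along that axis extends past the image, so the shader should skip invocations whose coordinates fall outside the image bounds.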

Before reading the result, use a memory barrier to make sure that the compute shader's writes are visible:

GLES31.glMemoryBarrier(GLES31.GL_SHADER_IMAGE_ACCESS_BARRIER_BIT)

To chain multiple kernels together (for example, to migrate code using ScriptGroup), create and dispatch multiple programs and synchronize their access to the output with memory barriers.
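A two-pass chain could be sketched as follows. The function and parameter names are hypothetical, and the sketch assumes that both programs are already linked, that their images are already bound, and that the second pass reads the first pass's output through image loads.

```kotlin
import android.opengl.GLES31

// Runs two compute programs back to back over the same workgroup grid.
fun runTwoPasses(firstProgram: Int, secondProgram: Int, groupsX: Int, groupsY: Int) {
    GLES31.glUseProgram(firstProgram)
    GLES31.glDispatchCompute(groupsX, groupsY, 1)

    // Make the first pass's image writes visible to the second pass's image loads.
    GLES31.glMemoryBarrier(GLES31.GL_SHADER_IMAGE_ACCESS_BARRIER_BIT)

    GLES31.glUseProgram(secondProgram)
    GLES31.glDispatchCompute(groupsX, groupsY, 1)

    // Make the final writes visible to later image accesses.
    GLES31.glMemoryBarrier(GLES31.GL_SHADER_IMAGE_ACCESS_BARRIER_BIT)
}
```

The barrier bit should match how the next stage reads the data. For example, if the second pass read the intermediate result from a Shader Storage Buffer Object instead, GL_SHADER_STORAGE_BARRIER_BIT would be the one to use.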

The sample app demonstrates two compute tasks:

Hue rotation: A compute task with a single compute shader. See GLSLImageProcessor::rotateHue for the code sample.
Blur: A more complex compute task that sequentially executes two compute shaders. See GLSLImageProcessor::blur for the code sample.

To learn more about memory barriers, refer to Ensuring visibility as well as Shared variables.