GPULib: 2013

Wednesday, October 2, 2013

GPULib 1.6.2 released

GPULib 1.6.2 has been released today! Updates include:

support for CUDA 5.5
updated to MAGMA 1.4
improved syntax in GPUEXECUTEFUNCTION
self-extracting executable installer on Linux and package installer on OS X
small bug fixes

We intend to keep updating GPULib to use the newest versions of IDL and CUDA as those packages are released. If you need a different combination of IDL and CUDA, please let us know.

The update to MAGMA 1.4 should fix some of the issues that some were experiencing with GPUINVERT at particular matrix sizes.

For the most part, we updated the libraries that GPULib is built on in this release and not added new features, but there is one new convenience in using GPUEXECUTEFUNCTION in GPULb 1.6.2. After reading PTX code and loading a module and function, the CUDA kernel can be launched with the nicer syntax:

  gpuExecuteFunction, kernel, $
                      dx, $
                      dy, $
                      dz, $
                      n, $
                      GRID=[nBlocksPerGrid], $
                      BLOCKS=[nThreadsPerBlock], $
                      ERROR=err

instead of the old, cumbersome syntax of calling the _getHandle method for each GPU variable:

  gpuExecuteFunction, kernel, $
                      dx->_getHandle(), $
                      dy->_getHandle(), $
                      dz->_getHandle(), $
                      n, $
                      GRID=[nBlocksPerGrid], $
                      BLOCKS=[nThreadsPerBlock], $
                      ERROR=err

It should be possible to get the full functionality of GPULib without calling any of the private methods of the GPUVariable class now.

Monday, June 17, 2013

GPULib VISualize talk

My colleague Jon Rood presented on the recently released GPULib 1.6 at VISualize last week. Here are the slides for the talk.

Tuesday, May 7, 2013

Custom kernels in GPULib 1.6

One of the new features of GPULib 1.6 is the ability to load and execute custom CUDA kernels at runtime. This means you can run CUDA code does not know about GPULib (or even IDL). Let's run through what is required to do this for a simple example.

Our example will simply add two n element arrays and store the result in a third array. The kernel for this will be written in a file `vectorAdd_kernel.cu` and should be straight-forward for anyone writing CUDA:

  extern "C" __global__ void VecAdd_kernel(const float* A,
                                           const float* B,
                                           float* C,
                                           int N)
  {
      int i = blockDim.x * blockIdx.x + threadIdx.x;
      if (i < N)
          C[i] = A[i] + B[i];
  }

This code is then compiled to PTX code using the --ptx flag of nvcc:

  $ nvcc --ptx vectorAdd_kernel.cu

This produces the file `vectorAdd_Kernel.ptx` which can be loaded and executed with a couple of new GPULib routines. First, load the source code of the kernel:

  ptx_source = gpu_read_ptxfile('vectorAdd_kernel.ptx')

This simply loads the source code from the PTX file into an IDL string. The source code could contain multiple kernel definitions; the module represents the entirety of the code, while the function represents one particular kernel. We call GPULib routines to load the module and function:

  module = gpuLoadModule(ptx_source, error=err)
  kernel = gpuLoadFunction(module, 'VecAdd_kernel', error=err)

Next, we create two arrays to add and a third array to contain the result:

  n = 20L
  dx = gpuFindgen(n)
  dy = gpuFindgen(n)
  dz = gpuFltarr(n)

Lastly, we are ready to call our kernel. The grid and block structure is set through the GRID and BLOCKS keywords, while otherwise we just pass the kernel and its arguments to GPUEXECUTEFUNCTION:

  nThreadsPerBlock = 256L
  nBlocksPerGrid = (n + nThreadsPerBlock - 1L) / nThreadsPerBlock

  gpuExecuteFunction, kernel, $
                      dx->_getHandle(), $
                      dy->_getHandle(), $
                      dz->_getHandle(), $
                      n, $
                      GRID=[nBlocksPerGrid], $
                      BLOCKS=[nThreadsPerBlock], $
                      ERROR=err

There is a SHARED_MEMORY keyword to pass the amount of shared memory that your kernel requires, but that is not needed for our simple example.

Friday, May 3, 2013

Linear algebra in GPULib 1.6

GPULib 1.6 adds 100+ linear algebra routines available through MAGMA, a GPU accelerated LAPACK implementation.

The IDL bindings for the low-level MAGMA routines are automatically generated wrappers for the standard LAPACK interface, but GPULib also offers a higher-level routine mimicking the LA_ routines present in the IDL library. GPUINVERT is comparable to LA_INVERT, inverting a non-singular matrix in a single call:

  dinverse = gpuInvert(da, lhs=dinverse)

While these routines document their technique for performing their operations, they do not require much user knowledge of the underlying techniques used. More high-level routines similar to the LA_ are planned for future GPULib versions.

Calling the low-level routines directly in MAGMA requires more knowledge of the mathematical algorithms and a bit more cumbersome notation. For example, the following example shows how to call the MAGMA version of SGELS:

  status = gpusgels((byte('N'))[0], $
                    m, n, nrhs, $
                    da->_getHandle(), lda, $
                    db->_getHandle(), ldb, $
                    work, lwork, $
                    info)

SGELS solves overdetermined or underdetermined real linear systems using a QR or LQ factorization of a matrix of full rank. This low-level operation has options for performing multiple solves in a call and operating on the transpose.

The calling syntax for the MAGMA routines is in the API documentation for GPULib as well as in lib/magma_routines.h in the GPULib distribution.

Wednesday, May 1, 2013

GPULib 1.6 released

Now available for download from the Tech-X website, GPULib 1.6 offers some exciting new features and a lot more stability. Here’s the brief rundown of what's new:

All platforms, Windows, Linux, and OS X, are now distributed as binaries. No building from source required!
Added wrappers for MAGMA, a GPU accelerated LAPACK library. This makes over 100 low-level linear algebra routines available. In addition to these low-level routines, we have one high-level routine, GPUINVERT, that is similar to the IDL routine LA_INVERT. We intend to write analogs to the rest of the LA_ IDL routines as well. This feature requires the Basic membership plan.
GPULib can now load and execute custom CUDA kernels without having to link to GPULib; you just compile your kernel to a .ptx file. GPULib provides routines to load and execute that kernel at runtime. This feature requires the Basic membership plan.
Support for CUDA 5.0.
Added support for up to 8-dimensional arrays. This is basic support for higher-dimensional arrays in GPULib, not every routine understands them yet.
Added optimized scalar/array operations. No more tricks just to add a scalar to an array!
Fixed bugs in GPUHISTOGRAM and GPUHIST_2D.
Fixed bugs in GPUATAN2 for complex and double complex.
Fixed bugs in GPUREAL.

I have planned a few more posts with details and examples about some of the big features above, like the MAGMA wrappers and the custom kernels.

A lot of work was done on infrastructure to make releasing an easier process, hopefully resulting in more frequent updates in the future. We have plans for some very exciting features in the coming year!