GPULib

Monday, February 2, 2015

GPULib 1.8 released

GPULib 1.8 has been released with updates to the underlying libraries as well as many other features in many areas of the library. It has been updated to use the most recent versions of IDL and CUDA, IDL 8.4 and CUDA 6.5. The new features are:

Support for integer data types. I have been wanting to support integer types in GPULib for awhile and now GPULib supports all the numeric types provided by IDL! We can finally do:
```
dx = gpuIndgen(10)
```
Added GPUREPMAT routine. This is a handy routine to create a new array by repeating a 2-dimensional array in a grid.
Added GPUCREATEKERNEL routine to create the source code of a simple kernel. This is a code generation routine that can be loaded with GPULOADMODULE/GPULOADFUNCTION and executed with GPUEXECUTEFUNCTION.
Added GPUFINITE routine similar to IDL's library routine.
Added linear algebra routines GPULUDC, GPULUSOL, and GPULEAST_SQUARES. This fills out more of the GPU equivalent of the convenience routines provided by IDL so that the LAPACK interface of MAGMA is not required to perform linear algebra computations.
Added support for RHO and THETA keywords in GPURADON.
Added GPUMEAN routine. This routine has DIMENSION and NAN keywords with the same functionality as IDL's library routine.

Wednesday, October 2, 2013

GPULib 1.6.2 released

GPULib 1.6.2 has been released today! Updates include:

support for CUDA 5.5
updated to MAGMA 1.4
improved syntax in GPUEXECUTEFUNCTION
self-extracting executable installer on Linux and package installer on OS X
small bug fixes

We intend to keep updating GPULib to use the newest versions of IDL and CUDA as those packages are released. If you need a different combination of IDL and CUDA, please let us know.

The update to MAGMA 1.4 should fix some of the issues that some were experiencing with GPUINVERT at particular matrix sizes.

For the most part, we updated the libraries that GPULib is built on in this release and not added new features, but there is one new convenience in using GPUEXECUTEFUNCTION in GPULb 1.6.2. After reading PTX code and loading a module and function, the CUDA kernel can be launched with the nicer syntax:

  gpuExecuteFunction, kernel, $
                      dx, $
                      dy, $
                      dz, $
                      n, $
                      GRID=[nBlocksPerGrid], $
                      BLOCKS=[nThreadsPerBlock], $
                      ERROR=err

instead of the old, cumbersome syntax of calling the _getHandle method for each GPU variable:

  gpuExecuteFunction, kernel, $
                      dx->_getHandle(), $
                      dy->_getHandle(), $
                      dz->_getHandle(), $
                      n, $
                      GRID=[nBlocksPerGrid], $
                      BLOCKS=[nThreadsPerBlock], $
                      ERROR=err

It should be possible to get the full functionality of GPULib without calling any of the private methods of the GPUVariable class now.

Monday, June 17, 2013

GPULib VISualize talk

My colleague Jon Rood presented on the recently released GPULib 1.6 at VISualize last week. Here are the slides for the talk.

Tuesday, May 7, 2013

Custom kernels in GPULib 1.6

One of the new features of GPULib 1.6 is the ability to load and execute custom CUDA kernels at runtime. This means you can run CUDA code does not know about GPULib (or even IDL). Let's run through what is required to do this for a simple example.

Our example will simply add two n element arrays and store the result in a third array. The kernel for this will be written in a file `vectorAdd_kernel.cu` and should be straight-forward for anyone writing CUDA:

  extern "C" __global__ void VecAdd_kernel(const float* A,
                                           const float* B,
                                           float* C,
                                           int N)
  {
      int i = blockDim.x * blockIdx.x + threadIdx.x;
      if (i < N)
          C[i] = A[i] + B[i];
  }

This code is then compiled to PTX code using the --ptx flag of nvcc:

  $ nvcc --ptx vectorAdd_kernel.cu

This produces the file `vectorAdd_Kernel.ptx` which can be loaded and executed with a couple of new GPULib routines. First, load the source code of the kernel:

  ptx_source = gpu_read_ptxfile('vectorAdd_kernel.ptx')

This simply loads the source code from the PTX file into an IDL string. The source code could contain multiple kernel definitions; the module represents the entirety of the code, while the function represents one particular kernel. We call GPULib routines to load the module and function:

  module = gpuLoadModule(ptx_source, error=err)
  kernel = gpuLoadFunction(module, 'VecAdd_kernel', error=err)

Next, we create two arrays to add and a third array to contain the result:

  n = 20L
  dx = gpuFindgen(n)
  dy = gpuFindgen(n)
  dz = gpuFltarr(n)

Lastly, we are ready to call our kernel. The grid and block structure is set through the GRID and BLOCKS keywords, while otherwise we just pass the kernel and its arguments to GPUEXECUTEFUNCTION:

  nThreadsPerBlock = 256L
  nBlocksPerGrid = (n + nThreadsPerBlock - 1L) / nThreadsPerBlock

  gpuExecuteFunction, kernel, $
                      dx->_getHandle(), $
                      dy->_getHandle(), $
                      dz->_getHandle(), $
                      n, $
                      GRID=[nBlocksPerGrid], $
                      BLOCKS=[nThreadsPerBlock], $
                      ERROR=err

There is a SHARED_MEMORY keyword to pass the amount of shared memory that your kernel requires, but that is not needed for our simple example.

Friday, May 3, 2013

Linear algebra in GPULib 1.6

GPULib 1.6 adds 100+ linear algebra routines available through MAGMA, a GPU accelerated LAPACK implementation.

The IDL bindings for the low-level MAGMA routines are automatically generated wrappers for the standard LAPACK interface, but GPULib also offers a higher-level routine mimicking the LA_ routines present in the IDL library. GPUINVERT is comparable to LA_INVERT, inverting a non-singular matrix in a single call:

  dinverse = gpuInvert(da, lhs=dinverse)

While these routines document their technique for performing their operations, they do not require much user knowledge of the underlying techniques used. More high-level routines similar to the LA_ are planned for future GPULib versions.

Calling the low-level routines directly in MAGMA requires more knowledge of the mathematical algorithms and a bit more cumbersome notation. For example, the following example shows how to call the MAGMA version of SGELS:

  status = gpusgels((byte('N'))[0], $
                    m, n, nrhs, $
                    da->_getHandle(), lda, $
                    db->_getHandle(), ldb, $
                    work, lwork, $
                    info)

SGELS solves overdetermined or underdetermined real linear systems using a QR or LQ factorization of a matrix of full rank. This low-level operation has options for performing multiple solves in a call and operating on the transpose.

The calling syntax for the MAGMA routines is in the API documentation for GPULib as well as in lib/magma_routines.h in the GPULib distribution.

Wednesday, May 1, 2013

GPULib 1.6 released

Now available for download from the Tech-X website, GPULib 1.6 offers some exciting new features and a lot more stability. Here’s the brief rundown of what's new:

All platforms, Windows, Linux, and OS X, are now distributed as binaries. No building from source required!
Added wrappers for MAGMA, a GPU accelerated LAPACK library. This makes over 100 low-level linear algebra routines available. In addition to these low-level routines, we have one high-level routine, GPUINVERT, that is similar to the IDL routine LA_INVERT. We intend to write analogs to the rest of the LA_ IDL routines as well. This feature requires the Basic membership plan.
GPULib can now load and execute custom CUDA kernels without having to link to GPULib; you just compile your kernel to a .ptx file. GPULib provides routines to load and execute that kernel at runtime. This feature requires the Basic membership plan.
Support for CUDA 5.0.
Added support for up to 8-dimensional arrays. This is basic support for higher-dimensional arrays in GPULib, not every routine understands them yet.
Added optimized scalar/array operations. No more tricks just to add a scalar to an array!
Fixed bugs in GPUHISTOGRAM and GPUHIST_2D.
Fixed bugs in GPUATAN2 for complex and double complex.
Fixed bugs in GPUREAL.

I have planned a few more posts with details and examples about some of the big features above, like the MAGMA wrappers and the custom kernels.

A lot of work was done on infrastructure to make releasing an easier process, hopefully resulting in more frequent updates in the future. We have plans for some very exciting features in the coming year!

Monday, December 10, 2012

AGU 2012 poster

Here is the abstract for my poster at AGU this year, “GPU accelerated curve fitting with IDL”:

Curve fitting is a common mathematical calculation done in all scientific areas. The Interactive Data Language (IDL) is also widely used in this community for data analysis and visualization. We are creating a general-purpose, GPU accelerated curve fitting library for use from within IDL.

We have developed GPULib, a library of routines in IDL for accelerating common scientific operations including arithmetic, FFTs, interpolation, and others. These routines are accelerated using modern GPUs using NVIDIA’s CUDA architecture. We will add curve fitting routines to the GPULib library suite, making curve fitting much faster.

In addition, library routines required for efficient curve fitting will also be generally useful to other users of GPULib. In particular, a GPU accelerated LAPACK implementation such as MAGMA is required for the Levenberg- Marquardt curve fitting and is commonly used in many other scientific computations. Furthermore, the ability to evaluate custom expressions at runtime necessary for specifying a function model will be useful for users in all areas.

The poster is available to download on the AGU website.