Friday, May 3, 2013

Linear algebra in GPULib 1.6

GPULib 1.6 adds 100+ linear algebra routines available through MAGMA, a GPU-accelerated LAPACK implementation.

The IDL bindings for the low-level MAGMA routines are automatically generated wrappers for the standard LAPACK interface, but GPULib also offers a higher-level routine mimicking the LA_ routines present in the IDL library. GPUINVERT is comparable to LA_INVERT, inverting a non-singular matrix in a single call:

  dinverse = gpuInvert(da, lhs=dinverse)
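
For context, a minimal round trip from the CPU and back might look like the following sketch (using the GPUPUTARR/GPUGETARR transfer routines shown elsewhere on this blog; calling gpuInvert without the LHS keyword is assumed to allocate its result):

  a = randomu(seed, 100, 100)       ; host matrix, assumed non-singular
  da = gpuputarr(a)                 ; transfer the matrix to the GPU
  dinverse = gpuInvert(da)          ; invert on the GPU (return form assumed)
  inverse = gpugetarr(dinverse)     ; transfer the result back to the CPU

In the original call above, the LHS keyword lets an existing device variable be reused for the result, avoiding an extra allocation.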
While these routines document the techniques used to perform their operations, they do not require much knowledge of those techniques from the user. More high-level routines similar to the LA_ family are planned for future versions of GPULib.

Calling the low-level MAGMA routines directly requires more knowledge of the mathematical algorithms, as well as somewhat more cumbersome notation. For example, the following shows how to call the MAGMA version of SGELS:

  status = gpusgels((byte('N'))[0], $        ; 'N' to solve A x = B, 'T' for the transpose
                    m, n, nrhs, $            ; rows, columns, and number of right-hand sides
                    da->_getHandle(), lda, $ ; device matrix A and its leading dimension
                    db->_getHandle(), ldb, $ ; device right-hand sides B and leading dimension
                    work, lwork, $           ; workspace array and its size
                    info)                    ; LAPACK-style status code
SGELS solves overdetermined or underdetermined real linear systems, using a QR or LQ factorization of a matrix of full rank; in the overdetermined case, it finds the solution minimizing the residual ||B - AX|| in the least squares sense. This low-level routine has options for performing multiple solves in a single call and for operating on the transpose.
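
To give a sense of the setup for this call, the arguments for an overdetermined least squares problem might be initialized along the following lines (a sketch only; the workspace size here is a placeholder and the proper LWORK value should be taken from the API documentation):

  m = 100L & n = 50L & nrhs = 1L            ; 100 equations, 50 unknowns, 1 right-hand side
  lda = m & ldb = m                         ; leading dimensions of A and B
  da = gpuputarr(randomu(seed, m, n))       ; device copy of the coefficient matrix A
  db = gpuputarr(randomu(seed, m, nrhs))    ; device copy of the right-hand side B
  lwork = 64L * (m > n)                     ; workspace size: a placeholder guess
  work = fltarr(lwork)                      ; workspace array
  info = 0L                                 ; receives the status code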

The calling syntax for the MAGMA routines is given in the GPULib API documentation, as well as in lib/magma_routines.h in the GPULib distribution.

Wednesday, May 1, 2013

GPULib 1.6 released

Now available for download from the Tech-X website, GPULib 1.6 offers some exciting new features and a lot more stability. Here's a brief rundown of what's new:

  • Binaries are now distributed for all platforms (Windows, Linux, and OS X). No building from source required!
  • Added wrappers for MAGMA, a GPU-accelerated LAPACK library. This makes over 100 low-level linear algebra routines available. In addition to these low-level routines, we have one high-level routine, GPUINVERT, that is similar to the IDL routine LA_INVERT. We intend to write analogs to the rest of the LA_ IDL routines as well. This feature requires the Basic membership plan.
  • GPULib can now load and execute custom CUDA kernels without having to link against GPULib: just compile your kernel to a .ptx file and GPULib provides routines to load and execute it at runtime. This feature requires the Basic membership plan.
  • Support for CUDA 5.0.
  • Added support for up to 8-dimensional arrays. This is basic support for higher-dimensional arrays in GPULib; not every routine understands them yet.
  • Added optimized scalar/array operations. No more tricks just to add a scalar to an array! (See the sketch after this list.)
  • Fixed bugs in GPUHISTOGRAM and GPUHIST_2D.
  • Fixed bugs in GPUATAN2 for complex and double complex.
  • Fixed bugs in GPUREAL.
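
For the optimized scalar/array operations, a natural guess at the new calling form is sketched below; the exact convention is an assumption on my part, with details to come in a follow-up post:

  da = gpuputarr(findgen(10))
  dc = gpuadd(2.0, da)     ; assumed: a scalar operand is now handled directly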

I have a few more posts planned with details and examples of some of the big features above, like the MAGMA wrappers and the custom kernels.

A lot of work was done on infrastructure to make releasing an easier process, hopefully resulting in more frequent updates in the future. We have plans for some very exciting features in the coming year!

Monday, December 10, 2012

AGU 2012 poster

Here is the abstract for my poster at AGU this year, “GPU accelerated curve fitting with IDL”:

Curve fitting is a common mathematical calculation in all scientific areas. The Interactive Data Language (IDL) is also widely used in the scientific community for data analysis and visualization. We are creating a general-purpose, GPU-accelerated curve fitting library for use from within IDL.

We have developed GPULib, a library of routines in IDL for accelerating common scientific operations including arithmetic, FFTs, interpolation, and others. These routines are accelerated on modern GPUs using NVIDIA's CUDA architecture. We will add curve fitting routines to the GPULib library suite, making curve fitting much faster.

In addition, library routines required for efficient curve fitting will also be generally useful to other users of GPULib. In particular, a GPU-accelerated LAPACK implementation such as MAGMA is required for Levenberg-Marquardt curve fitting and is commonly used in many other scientific computations. Furthermore, the ability to evaluate custom expressions at runtime, necessary for specifying a model function, will be useful for users in all areas.

The poster is available to download on the AGU website.

Monday, August 27, 2012

Experiments with OpenCL

I've been looking at what it would take to convert GPULib from CUDA to OpenCL. As part of that effort, I have created a simple DLM that provides some basic arithmetic operations and some more advanced features. For example, we can create variables on the GPU in standard ways:

  IDL> dx = cl_findgen(10)
  IDL> dresult = cl_fltarr(10)
We can also transfer data from the CPU to the GPU:
  IDL> hy = 2.0 * findgen(10)
  IDL> dy = cl_putvar(hy)
Performing a simple operation is easy (though I will eventually change the form of this call to return the value of the result):
  IDL> status = cl_add(dx, dy, dresult)
Retrieve the result:
  IDL> result = cl_getvar(dresult)
I have implemented memory operations for allocating memory, transferring data, and freeing memory. I also have routines for obtaining metadata about variables, as well as about the available OpenCL devices and platforms.

OpenCL kernels are always constructed from strings, so it is easy to make custom kernels on-the-fly to evaluate user-entered expressions. In the next example, we will evaluate the expression:

  dz = 2. * dx + sin(dy)
First, allocate and initialize the inputs/outputs:
  dx = cl_findgen(10)
  dy = cl_putvar(2. * findgen(10))
  dz = cl_fltarr(10, /nozero)
Next, create the kernel from a string representing the expression, a string array of the variable names, and an array of the variables' type codes:
  kernel = cl_compile('z[i] = 2. * x[i] + sin(y[i])', $
                      ['x', 'y', 'z'], $
                      lonarr(3) + 4L, $   ; type code 4 is IDL's float
                      /simple)
The SIMPLE keyword indicates that we are just filling in an expression in a simple element-wise kernel. To run the kernel on our data, call CL_EXECUTE with the kernel identifier and a structure containing the values for the variables in the expression:
  status = cl_execute(kernel, { x: dx, y: dy, z: dz })
Finally, retrieve the results as before:
  z = cl_getvar(dz)
Without the SIMPLE keyword set, the user can specify the full code for the kernel.
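
For reference, a full kernel for the same expression might look something like the following sketch; the entry-point name and the argument conventions that CL_COMPILE expects without /SIMPLE are assumptions here:

  ; a sketch of the full-kernel form; names and conventions are assumptions
  source = ['__kernel void expr_kernel(__global float *x,', $
            '                          __global float *y,', $
            '                          __global float *z) {', $
            '  size_t i = get_global_id(0);', $
            '  z[i] = 2.f * x[i] + sin(y[i]);', $
            '}']
  kernel = cl_compile(strjoin(source, string(10B)), $
                      ['x', 'y', 'z'], $
                      lonarr(3) + 4L)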

There are some great advantages to using OpenCL:

  1. OpenCL is an open standard.
  2. OpenCL is not limited to NVIDIA GPUs, or even to GPUs at all; OpenCL can run on various platforms, including CPUs.
  3. It is easy to create on-the-fly kernels.
  4. OpenCL uses the same concepts as CUDA, so many kernels can be translated fairly easily.
Unfortunately, there is one big disadvantage right now:
  1. OpenCL libraries are not as far along as CUDA libraries, both those provided by NVIDIA and those from third parties.
This means that converting right now would eliminate everything beyond basic arithmetic in the current GPULib. I don't think it will be long before OpenCL catches up in this regard, but it is a serious issue currently.

Thursday, February 23, 2012

Multiplying a scalar by an array

Currently, you have to be a bit tricky to multiply a scalar by an array. (In the next version of GPULib, there will be an efficient, direct way to do scalar/array operations.)

Most GPULib operations have a short form and a long form. The short form is a simple version of the operation that takes two array arguments. For addition, the short form is equivalent to:

  C = A + B
where A, B, and C are all arrays of the same size. In code, this is done with:
  c = gpuadd(a, b)
The long form takes five arguments, three scalars and two arrays, in the form of:
  C = s1*A + s2*B + s3
where A, B, and C are arrays of the same size and s1, s2, and s3 are scalars. In code, this is:
  c = gpuadd(s1, a, s2, b, s3)
If you just want to multiply a scalar by an array, use the long form with the same array passed for both A and B, and set s2 and s3 to 0.0. For example, to compute 2. * findgen(10), do:
  IDL> gpuinit
  Welcome to GPULib 1.5.0 (Revision: 2199M)
  Graphics card: Tesla C2070, compute capability: 2.0, memory: 751 MB available, 1279 MB total
  Checking GPU memory allocation...no errors
  IDL> a = findgen(10)
  IDL> da = gpuputarr(a)
  IDL> dc = gpufltarr(10)
  IDL> scalar = 2.0
  IDL> ; dc = scalar * da + 0. * da + 0.
  IDL> dc = gpuadd(scalar, da, 0., da, 0., lhs=dc)
  IDL> c = gpugetarr(dc)
  IDL> print, c
        0.00000      2.00000      4.00000      6.00000      8.00000
        10.0000      12.0000      14.0000      16.0000      18.0000
Note that most operations have a long-form version that performs an affine transform along with the operation.
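
For instance, given device arrays da and db of the same size and a result array dc, the following single call computes an affine combination of the two (a usage sketch based on the long-form signature above):

  dc = gpuadd(2., da, -3., db, 1., lhs=dc)   ; dc = 2.*da - 3.*db + 1.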

Monday, January 31, 2011

GPULib 1.4.4 released!

We are pleased to announce that GPULib 1.4.4 is available from the Tech-X website.


The following changes have been made for version 1.4.4:
  • Support for CUDA 3.2 (rather than CUDA 3.2 rc2).
  • Added more extensive checks for issues in GPUINIT.
  • Added ability to transpose 1-dimensional arrays like TRANSPOSE does.
  • Now handles ranges, index arrays, and * in any indexing position of an array in the same manner as IDL.
  • Now can use single-dimensional indexing for multi-dimensional arrays.
  • Trying to create a view from discontinuous memory is now disallowed.
  • Fixed garbage collection issue with repeated assignment to the same variable name.
  • Fixed an issue whereby multiple calls to GPUINIT were causing cudaErrorSetOnActiveProcess errors.
  • Fixed memory allocation bug in GPUTRANSPOSE.
  • Fixed typo in GPUMATRIX_MULTIPLY.
  • Added unit tests for GPUTRANSPOSE, GPUMAXOP.
  • Now prints an error message when trying to create a view from discontinuous memory.
  • Fixed build issue on Mac OS X Leopard.
  • Fixed several install issues for Windows when building from source.
  • Improvements to the QuickStartGuide for Windows.
  • Fixed getDimensions bug for certain views.
  • Added error checking for changing dimensions.
  • Fixed memory leak in GPUFFT.
  • Fixed errors in GPUMAXOP.
  • Fixed failing unit tests of GPUMAXOP.
  • Added missing demo file to install list.
  • Fixed digital signing of installer file. 
  • Fixed disappearing graphic in transform3d demo.
  • Corrected IDL_LIBDIR path.
  • Fixed path of file gpulib_version.txt in CMakeLists.txt.
  • Fixed full reduction in GPUTOTAL.
  • Fixed -fPIC warnings when building from source on Windows.
  • Fixed color table issue in GPU_SWIRL_DEMO.
  • Added "press q to stop" option to GPU_FDTD_DEMO.
  • Fixed flipped image in GPU_DECONHUBBLE_DEMO.


Monday, November 15, 2010

GPULib 1.4.2 released!

We are pleased to announce that GPULib 1.4.2 is available from the Tech-X website.

For Windows users, the installer contains the following pre-built versions of GPULib 1.4.2:
  1. 32- or 64-bit single precision (compute capability 1.0) built with CUDA 3.2
  2. 32- or 64-bit double precision (compute capability 1.3) built with CUDA 3.2
  3. 32- or 64-bit Fermi (compute capability 2.0) built with CUDA 3.2
(As before, Windows users can opt to build GPULib from source if they so desire.)

The following changes have been made for version 1.4.2:
  • Fixed make install so that appropriate files are now installed.
  • All demos cleaned up and numerous memory leaks fixed.
  • Created a new comprehensive PDF for Windows users, covering general use as well as how to build from source.
  • Windows installation now includes all source files in a zip archive, so that users who are not building from source do not have their install folder cluttered with source files.
  • gpuinit() now displays the amount of free memory as well as the total memory available on the device.
  • Fixed bug in GPUMATRIX_MULTIPLY. 
  • Fixed several small bugs and memory leaks.

We are grateful to Mort Canty and David Grier for submitting bug reports which led to fixes for this release!