Device Interface Tutorial - Intermediate

Using the Device Interface

CULA also offers a device interface for more advanced users who want to run CULA on data that is already on the GPU, perhaps placed there by other means. The data could come from custom PyCUDA kernels or from CUDA libraries such as cuBLAS or cuFFT.

To implement this in PyCULA, we extend PyCUDA's numpy-like, CUDA-driver-based GPUArray objects. Unfortunately, we cannot use plain PyCUDA GPUArrays because PyCULA needs to know a little more about the arrays (such as any special pitching).

What we have done is provide various cula_gpuarray functions that return extended PyCUDA arrays. At their heart, cula_gpuarrays are PyCUDA arrays, and therefore, with proper initialization, you may use them anywhere you would use PyCUDA gpuarrays, and with CULA! This is one of those features that embodies what PyCULA is all about: bringing useful features together while keeping it simple.

So let's get down to business. The immediate differences from the simple host_interface routines are:

  • Use mixed_init() instead of culaInitialize() to create a special mixed context environment.
  • Use cula_gpuarray instead of numpy.arrays.

Let's revisit our simple host_interface tutorial, this time using a device_interface routine:

# import numpy and the PyCULA module
import numpy
from PyCULA.cula import *

#initialize special mixed environment;
mixed_init()

#make a numpy array; you may use float32 or float64 dtypes
cat = numpy.array([[1,2],[3,4]], dtype=numpy.float32)
print cat

#make a cula_gpuarray on gpu device like cat
dog_ = cula_gpuarray_like(cat)
print dog_

#run PyCULA gpu_dev* routine; print results
lamb = gpu_devsvd(dog_)
print lamb

Some notes:

  • Make sure you call mixed_init() when you plan to use the device_interface or mix with other CUDA libs. If you don’t, you will probably crash and burn.
  • gpu_dev* routines take in cula_gpuarrays on device and return numpy arrays on host.
  • If you want PyCULA to keep its answers on the device as cula_gpuarrays, just use gpu_devdev*. These gpu_devdev* routines take cula_gpuarrays and return cula_gpuarrays.
  • Pitching can make CULA run faster (see the next section).
  • CULA expects FORTRAN-style (column-major) storage.

Pitching and Column Major Storage

CULA expects FORTRAN-style (column-major) storage. In addition, CULA claims that proper pitching allows its routines to run up to twice as fast. If these terms are foreign to you, read the relevant CULA documentation and then come back here.
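
If column-major order is new to you, the following host-side sketch may help. It uses plain Python lists rather than PyCULA, and the helper names flatten_row_major and flatten_column_major are made up for this illustration. It only shows the order in which the elements of the 2x2 matrix from the first example would sit in memory under each convention.

```python
# Illustration only: plain Python lists, no GPU or PyCULA involved.
# flatten_row_major and flatten_column_major are hypothetical helper names.

def flatten_row_major(matrix):
    """C-style order: walk each row left to right, one row after another."""
    return [value for row in matrix for value in row]

def flatten_column_major(matrix):
    """FORTRAN-style order: walk each column top to bottom, one column after another."""
    rows, cols = len(matrix), len(matrix[0])
    return [matrix[i][j] for j in range(cols) for i in range(rows)]

cat = [[1, 2],
       [3, 4]]

row_major = flatten_row_major(cat)        # [1, 2, 3, 4]
column_major = flatten_column_major(cat)  # [1, 3, 2, 4]
print(row_major)
print(column_major)
```

For a host array, numpy.asfortranarray(cat) produces the same column-major memory layout without any hand-written loops.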

Expert users coming from CUDA C may call our function culaGetOptimalPitch(rows,cols,elesize), which queries CULA and returns the proper pitch. Then make your array shape=(pitch,cols), transfer your data as column-major, and you will be ‘cooking with gas’, as my favorite advisor(s) would say.
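
To make the layout concrete, here is a rough host-side sketch of a pitched, column-major buffer. Everything in it is an assumption for illustration: round_up_rows, element_offset and pack_pitched_column_major are hypothetical helpers, the pitch is counted in elements, and the alignment of 4 is arbitrary. In real code the pitch should come from culaGetOptimalPitch, not from a rule like this one.

```python
# Illustration only: plain Python, no GPU or PyCULA involved.
# All helper names and the alignment value are assumptions, not PyCULA API.

def round_up_rows(rows, alignment):
    """Round the row count up to the next multiple of `alignment` (made-up rule)."""
    return ((rows + alignment - 1) // alignment) * alignment

def element_offset(i, j, pitch):
    """Flat index of element (row i, column j) when each column occupies `pitch` slots."""
    return j * pitch + i

def pack_pitched_column_major(matrix, pitch, fill=0.0):
    """Copy a list-of-rows matrix into a flat, padded, column-major buffer."""
    rows, cols = len(matrix), len(matrix[0])
    if pitch < rows:
        raise ValueError("pitch must be at least the number of rows")
    buf = [fill] * (pitch * cols)
    for j in range(cols):
        for i in range(rows):
            buf[element_offset(i, j, pitch)] = matrix[i][j]
    return buf

a = [[1.0, 1.0],
     [0.0, 1.0]]

pitch = round_up_rows(len(a), alignment=4)  # 2 rows padded up to 4 slots per column
buf = pack_pitched_column_major(a, pitch)
print(buf)
```

Each column starts at a multiple of the pitch, and the slots between the last row and the next column are padding that the routines skip over.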

For other users, PyCULA provides some functions to help get your data onto the card in the optimal fashion. Specifically, cula_Fpitched_gpuarray_like() takes a numpy array and returns a pitched, column-major ordered cula_pitched_gpuarray instance holding your data. Time for an example:

# import numpy and the PyCULA module
import numpy
from PyCULA.cula import *

#initialize special mixed environment;
mixed_init()

#make a numpy array; you may use float32 or float64 dtypes
a = numpy.array([[1,1],[0,1]], dtype=numpy.float32)
print a

#make a cula_Fpitched_gpuarray on gpu device like a
a_ = cula_Fpitched_gpuarray_like(a)

#this is just for show:
print 'Pitch:',a_.pitch,'Rows:',a_.rows,'Cols:',a_.cols,'Shape:',a_.shape,'dtype:',a_.dtype
print 'Device Pointer:',a_._as_parameter_
print a_

#note that a_ is transposed now, and has a pitch of 48.

#run PyCULA gpu_devdev* routine, this keeps results on device; print results
#PyCULA checks for pitch and performs all the necessary work behind the scenes.
golden_ = gpu_devdevsvd(a_)
print golden_

#let's do something PyCUDA to this answer; multiply it by 2 elementwise:
#PyCUDA has some overhead, so this seems slow for a tiny array, but it becomes quite fast with, say, >1000 elements
platinum_ = 2*golden_

#let's take the data off the device now; print results
platinum = platinum_.get()
print platinum

Now you have seen a simple device_interface gpu_dev* example and a more advanced gpu_devdev* example that used optimal pitching and then applied a PyCUDA kernel. Your GPGPU mastery is fast approaching wizardry status. To complete your training, let's start mixing with cuBLAS and custom PyCUDA kernels in the next sections.
