Within the mixed context environment created by mixed_init(), PyCULA allows you freely mix CULA LAPACK calls with your own calls to cuBLAS, cuFFT, (cuSP is coming soon!), and any other custom PyCUDA kernels you can cook up. The magic all occurs behind the scenes, but we only claim it will work with CULA if you use mixed_init() and cula_gpuarrays. Below are two examples using just this technique.

*As a note:* Right now we don’t support multiple devices or other black magic, but neither does anyone else, so you aren’t missing much. As other packages move forward, we will add such features.

In the dawn of CUDA time, we began to wrap various libraries, including skeletons of cuBLAS and cudart. It turns out that a fellow python~cuda developer, Lev Givon, has provided a python package called scikits.cuda that includes wrappers for cuBLAS, and cuFFT. There are even some wrappers for CULA, but they are rather bare, and we recommend you use PyCULA for this aspect. The scikits.cuda cuBLAS wrappers are very nice, and here is a very simple example of how to use them:

```
# import PyCULA module
from PyCULA.cula import *
# import scki
import scikits.cuda.cublas as cublas
#initialize special mixed environment;
mixed_init()
#make a numpy array; you may use float32 or float64 dtypes
a = numpy.array([[1,1],[0,1]], dtype=numpy.float32)
print a
#make a cula_Fpitched_gpuarray on gpu device like a
a_ = cula_Fpitched_gpuarray_like(a)
#note that a_ is transposed now, and has a pitch of 48.
print a_
#run PyCULA gpu_devdev* routine, this keeps results on device; print results
#PyCULA checks for pitch and performs all the necessary work behind the scenes.
golden_ = gpu_devdevsvd(a_)
print golden_
#Note that golden_ is a cula_gpuarray instance, but can be used exactly in place of PyCUDA.gpuarray
## run scikits.cuda.cublas's cublasSnrm2 routine to find norm:
norm = cublas.cublasSnrm2(golden_.size, golden_.gpudata, 1)
print norm
```

Similarly, you can mix with other CUDA runtime libraries using scikits.cuda wrappers to your hearts content. Should you wish to use a different runtime libs wrapper, or your own wrappers... if you can make the code work with PyCUDA.GPUArrays then it should work just as well with cula_gpuarrays allowing you to mix with calls to CULA LAPACK by the scheme above.

Thats all fine and dandy, but you have your own kernels that you wish to run? Well, we can use PyCUDA to run your custom kernels. This is exactly how the PyCULA project started... we had some graph generators and wished to find the spectrums of such graphs quickly. Noticing that we were spending a lot of time making (and moving) large matrixes we decided it would be faster to ship a much smaller edge-list and use the GPU to expand it into a matrix on device in parallel.

First we need to write our custom kernel. More information about this can be found in the PyCUDA documentation, but it basically works like this:

- Write some C code for your kernel
- Use PyCUDA.compiler.SourceModule to create a SourceModule
- Use ‘get_function’ to create a python callable function
- Call your CUDA-C from within python

This example is slightly complicated, but it is really useful! It shows how to use PyCULA to improve a real world computing problem. Within the code you can also see places where parameters can be tuned for performance. This example (and all the others) is included in the PyCULA/examples folder.

```
#imports
import numpy
from collections import defaultdict
from pycuda.compiler import SourceModule
from PyCULA.cula import *
# verbose printing for debugging etc
verbose = None
tuning = None
def vprint(s):
global verbose
if verbose:
print s
def tprint(s):
global tuning
if tuning:
print s
#-----------------------------------------------------------------------
# cpu_expand_kernel
#-----------------------------------------------------------------------
# Input edgelist as numpy.array, datatype
# For testing/comparing to host cpu
# Output discrete Laplacian as numpy.array in host memory
def cpu_expand_kernel(eList_nx2):
#find size of nxn matrix
n = max(eList_nx2.flatten())
M=numpy.zeros((n,n),eList_nx2.dtype)
vprint('.... expanding edgelist on cpu....')
for row in eList_nx2:
[i,j] = row
#fix indexing to start at 1
[i,j] = [i-1,j-1]
M[i,j] = 1
M[j,i] = 1
if i==j:
M[i,i] = M[i,i] - 1
else:
M[i,i] = M[i,i] - 1
M[j,j] = M[j,j] - 1
vprint(M)
return M
#---end-cpu_expand_kernel
#-----------------------------------------------------------------------
# gpu_expand_kernel
#-----------------------------------------------------------------------
# Input edgelist as numpy.array, datatype
# For fast GPU expansion
# Output discrete Laplacian as gpuarray in device memory
def gpu_expand_kernel(edges):
dt=edges.dtype
def degrees(edges):
h = defaultdict(int)
for e in edges:
i, j = e
h[i] += 1
h[j] += 1
return [d for (v,d) in sorted(h.items())]
D = degrees(edges)
n = len(D)
m = len(edges)
nsquare = int(n*n)
vprint('.... expanding edgelist on gpu....')
# host to device
# Allocate n^2 zero array on GPU device
M_gpu = cula_zeros(nsquare,dt)
edges_gpu=cula_gpuarray_like(numpy.array(edges,dt))
D_gpu=cula_gpuarray_like(-1*numpy.array(D,dt))
mod_expand = SourceModule("""
__global__ void offDiagonal( float *M_gpu, int n, float *edges, int q, int blocksize1)
{
int edge = threadIdx.x + (blocksize1*blockIdx.x) + q;
int i = (int) edges[2*edge] - 1;
int j = (int) edges[2*edge + 1] - 1;
M_gpu[i * n + j] = 1.0;
M_gpu[j * n + i] = 1.0;
}
__global__ void onDiagonal(float *M_gpu, int n, float *degrees, int p,int blocksize2)
{
int vertex = threadIdx.x + (blocksize2*blockIdx.x) + p;
M_gpu[vertex*n + vertex] = degrees[vertex];
}
""")
mod_expand_64 = SourceModule("""
__global__ void offDiagonal( double *M_gpu, int n, double *edges, int q, int blocksize1)
{
int edge = threadIdx.x + (blocksize1*blockIdx.x) + q;
int i = (int) edges[2*edge] - 1;
int j = (int) edges[2*edge + 1] - 1;
M_gpu[i * n + j] = 1.0;
M_gpu[j * n + i] = 1.0;
}
__global__ void onDiagonal(double *M_gpu, int n, double *degrees, int p,int blocksize2)
{
int vertex = threadIdx.x + (blocksize2*blockIdx.x) + p;
M_gpu[vertex*n + vertex] = degrees[vertex];
}
""")
#Define GPU C Kernels in pyCUDA
if dt==numpy.float32:
offDiagonal = mod_expand.get_function("offDiagonal")
onDiagonal = mod_expand.get_function("onDiagonal")
elif dt==numpy.float64:
offDiagonal = mod_expand_64.get_function("offDiagonal")
onDiagonal = mod_expand_64.get_function("onDiagonal")
#Manual Block index tuning for Kernel1
blocksize1=512
if m <= blocksize1:
k=1
b=0
q=0
d=0
#offdiagonal kernel call small
offDiagonal(M_gpu,numpy.int32(n),edges_gpu,numpy.int32(q),numpy.int32(blocksize1),block=(m,1,1),grid=(1,1))
else:
q=0
k=int(m/blocksize1)
#offdiagonal kernel call full blocks
offDiagonal(M_gpu,numpy.int32(n),edges_gpu,numpy.int32(q),numpy.int32(blocksize1),block=(blocksize1,1,1),grid=(k,1))
d=int(m%blocksize1)
if d !=0:
q=k*blocksize1
#offdiagonal kernel call partial blocks
offDiagonal(M_gpu,numpy.int32(n),edges_gpu,numpy.int32(q),numpy.int32(blocksize1),block=(d,1,1))
del edges_gpu
#Manual Block Index tuning for Kernel2
blocksize2=512
if n <= blocksize2:
l=1
b=0
p=0
# ondiagonal kernel call small
onDiagonal(M_gpu,numpy.int32(n),D_gpu,numpy.int32(p),numpy.int32(blocksize2),block=(n,1,1))
else:
p=0
l=int(n/blocksize2)
# ondiagonal kernel call, full blocks
onDiagonal(M_gpu,numpy.int32(n),D_gpu,numpy.int32(p),numpy.int32(blocksize2),block=(blocksize2,1,1),grid=(l,1))
b=int(n%blocksize2)
if b != 0:
p=l*blocksize2
# ondiagonal kernel call, partial blocks
onDiagonal(M_gpu,numpy.int32(n),D_gpu,numpy.int32(p),numpy.int32(blocksize2),block=(b,1,1))
del D_gpu
# tprint for tuning
tprint('length of edgelist')
tprint(m)
tprint('number of blocks kernel1')
tprint(k)
tprint('partial block length kernel1')
tprint(d)
tprint('length of "degree" list')
tprint(n)
tprint('number of whole blocks kernel2')
tprint(l)
tprint('partial block length kernel2')
tprint(b)
#reshape M_gpu to nxn
M_gpu.shape = n,n
vprint(M_gpu)
return M_gpu
#-----end-gpu-kernel-stuff-------------------------------
```

pheww, that was long. Your functions need not be this complicated, but I wanted to provide an example that *really improved performance* over the best known alternatives(read below). The next step is to call this module then follow with some CULA linear algebra, and fortunately this is quite easy from here on out!

So here the steps are simpler:

- We need to import our tools, including our kernel_modules created above
- Run mixed_init() to create a proper mixed enviornment
- Make an input data set (provided)
- Call our Custom Kernel
- Call CULA syev to find the eigenvalues of the resulting matrix

As code it looks like:

```
from PyCULA.cula import *
import numpy as np
import kernel_modules #import custom kernel module code
mixed_init()
# This is an edgelist, it represents a reduced intermediary form of a larger sparse matrix.
E=np.array([[1 , 2 ],
[ 1 , 16 ],
[ 2 , 3],
[3 , 4 ],
[ 4 , 5 ],
[ 5 , 6 ],
[ 6 , 7 ],
[ 7 , 8 ],
[ 8 , 9 ],
[ 8 , 5 ],
[ 9 , 10 ],
[ 9 , 4 ],
[ 10 , 11 ],
[ 11 , 12 ],
[ 12 , 13 ],
[ 13 , 14 ],
[ 13 , 10 ],
[ 14 , 15 ],
[ 14 , 9 ],
[ 15 , 16 ],
[ 15 , 4 ],
[ 16 , 17 ],
[ 16 , 3 ],
[ 17 , 18 ],
[ 18 , 19 ],
[ 19 , 20 ],
[ 19 , 16 ],
[ 20 , 1 ]], dtype=np.float64)
print E
print
# expand reduced form into full matrix using GPU kernel
# holds array M on GPU
M = kernel_modules.gpu_expand_kernel(E)
print M
print
eigvals = gpu_devsyev(M) # run CULA syev routine on GPU array M
print eigvals # print solutions returned by CULA as numpy array on host
```

When I claimed this is an improvement, I really meant it. To chug along a similar problem on a 10kx10k or larger graph in MATLAB may take anywhere from 10-40minutes using a fast machine. Just using the GPU based CULA for matrix computations yielded times in the 5-10 minute range on large problems with good hardware. Using this reduced intermediary form and GPU expansion technique above, the some computations can be had in 1 or 2 minutes. That is a 5-40x speedup depending on precision and problem size.

Best of all, is that since you are developing in Python, its easy to input data, store your results, and share your science with the world.

Thanks for checking out our tutorial documentation. Should you have more questions, the docstrings from the source code follow in the next section. Here you will find more detailed information about what particular wrappers are expecting as input and will be returning as output. There are MANY functions, so use search to speed your query, and don’t forget that docstring magic allows you to get help interactively in the interpreter:

```
>>>from PyCULA.cula import *
>>>help(gpu_svd)
Help on function gpu_svd in module PyCULA.cula:
gpu_svd(A, left_vectors=False, right_vectors=False)
Takes numpy.array A and computes the SVD, returns Singular Values (optionally left/right vectors) as np.arrays.
Keyword arguments:
A -- input matrix
left_vectors -- default == False; computes left singular vectors
right_vectors -- default == False; computes right singular vectors
Refer to CULA docs regarding gesvd for specific application.
(END)
```

~Cheers from the PyCULA team.