Mixing PyCULA - Expert
=======================

Within the mixed context environment created by mixed_init(), PyCULA allows you freely mix CULA LAPACK calls with your own calls to cuBLAS, cuFFT, (cuSP is coming soon!), and any other custom PyCUDA kernels you can cook up.  The magic all occurs behind the scenes, but we only claim it will work with CULA if you use mixed_init() and cula_gpuarrays.  Below are two examples using just this technique.

*As a note:* Right now we don't support multiple devices or other black magic, but neither does anyone else, so you aren't missing much.  As other packages move forward, we will add such features.


CUDA Libraries: cuBLAS, cuFFT,cuSP
-----------------------------------

In the dawn of CUDA time, we began to wrap various libraries, including skeletons of cuBLAS and cudart.  It turns out that a fellow python~cuda developer, Lev Givon, has provided a python package called scikits.cuda that includes wrappers for cuBLAS, and cuFFT.  There are even some wrappers for CULA, but they are rather bare, and we recommend you use PyCULA for this aspect.  The scikits.cuda cuBLAS wrappers are very nice, and here is a very simple example of how to use them::

    # import PyCULA module
    from PyCULA.cula import *

    # import scki
    import scikits.cuda.cublas as cublas

    #initialize special mixed environment;
    mixed_init()

    #make a numpy array; you may use float32 or float64 dtypes
    a = numpy.array([[1,1],[0,1]], dtype=numpy.float32)
    print a

    #make a cula_Fpitched_gpuarray on gpu device like a
    a_ = cula_Fpitched_gpuarray_like(a)

    #note that a_ is transposed now, and has a pitch of 48.
    print a_

    #run PyCULA gpu_devdev* routine, this keeps results on device; print results
    #PyCULA checks for pitch and performs all the necessary work behind the scenes.
    golden_ = gpu_devdevsvd(a_)
    print golden_
    
    #Note that golden_ is a cula_gpuarray instance, but can be used exactly in place of PyCUDA.gpuarray

    ## run scikits.cuda.cublas's  cublasSnrm2 routine to find norm:
    norm = cublas.cublasSnrm2(golden_.size, golden_.gpudata, 1)
    print norm

Similarly, you can mix with other CUDA runtime libraries using scikits.cuda wrappers to your hearts content.  Should you wish to use a different runtime libs wrapper, or your own wrappers... if you can make the code work with PyCUDA.GPUArrays then it should work just as well with cula_gpuarrays allowing you to mix with calls to CULA LAPACK by the scheme above.


PyCUDA Custom Kernels
----------------------

Thats all fine and dandy, but you have your own kernels that you wish to run?  Well, we can use PyCUDA to run your custom kernels.  This is exactly how the PyCULA project started... we had some graph generators and wished to find the spectrums of such graphs quickly.  Noticing that we were spending a lot of time making (and moving) large matrixes we decided it would be faster to ship a much smaller edge-list and use the GPU to expand it into a matrix on device in parallel.

Writing a Kernel Source Module
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

First we need to write our custom kernel.  More information about this can be found in the PyCUDA documentation, but it basically works like this:

#. Write some C code for your kernel
#. Use PyCUDA.compiler.SourceModule to create a SourceModule
#. Use 'get_function' to create a python callable function
#. Call your CUDA-C from within python

This example is slightly complicated, but it is really useful!  It shows how to use PyCULA to improve a real world computing problem.  Within the code you can also see places where parameters can be tuned for performance.  This example (and all the others) is included in the PyCULA/examples folder.

::

    #imports  
    import numpy
    from collections import defaultdict

    from pycuda.compiler import SourceModule
    from PyCULA.cula import *

    # verbose printing for debugging etc
    verbose = None
    tuning = None
    def vprint(s):
        global verbose
        if verbose:
            print s
    def tprint(s):
        global tuning
        if tuning:
    	print s

    #-----------------------------------------------------------------------
    # cpu_expand_kernel
    #-----------------------------------------------------------------------
    # Input edgelist as numpy.array, datatype
    # For testing/comparing to host cpu
    # Output discrete Laplacian as numpy.array in host memory

    def cpu_expand_kernel(eList_nx2):
        #find size of nxn matrix
        n =  max(eList_nx2.flatten())
        M=numpy.zeros((n,n),eList_nx2.dtype)
        vprint('.... expanding edgelist on cpu....')
        for row in eList_nx2:
            [i,j] = row
            #fix indexing to start at 1
            [i,j] = [i-1,j-1]
            M[i,j] = 1
            M[j,i] = 1
            if i==j:
                M[i,i] = M[i,i] - 1
            else:
                M[i,i] = M[i,i] - 1
                M[j,j] = M[j,j] - 1
        vprint(M)
        return M
    #---end-cpu_expand_kernel


    #-----------------------------------------------------------------------
    # gpu_expand_kernel
    #-----------------------------------------------------------------------
    # Input edgelist as numpy.array, datatype
    # For fast GPU expansion
    # Output discrete Laplacian as gpuarray in device memory
    
    def gpu_expand_kernel(edges):
        dt=edges.dtype
        def degrees(edges):
            h = defaultdict(int)
            for e in edges:
                i, j = e
                h[i] += 1
                h[j] += 1
            return [d for (v,d) in sorted(h.items())]
        D = degrees(edges)
        n =  len(D)
        m = len(edges)
        nsquare = int(n*n)
        vprint('.... expanding edgelist on gpu....')
        # host to device
        # Allocate n^2 zero array on GPU device
    	
        M_gpu = cula_zeros(nsquare,dt)
        edges_gpu=cula_gpuarray_like(numpy.array(edges,dt))
        D_gpu=cula_gpuarray_like(-1*numpy.array(D,dt))


        mod_expand = SourceModule("""
            __global__ void offDiagonal( float *M_gpu, int n, float *edges, int q, int blocksize1)
            {
                int edge = threadIdx.x + (blocksize1*blockIdx.x) + q;
                int i = (int) edges[2*edge] - 1;
                int j = (int) edges[2*edge + 1] - 1;
                M_gpu[i * n + j] = 1.0;
                M_gpu[j * n + i] = 1.0;
             }
            
            __global__ void onDiagonal(float *M_gpu, int n, float *degrees, int p,int blocksize2)
            {
                int vertex = threadIdx.x + (blocksize2*blockIdx.x) + p;
                M_gpu[vertex*n + vertex] = degrees[vertex];
            }
        """)


        mod_expand_64 = SourceModule("""
            __global__ void offDiagonal( double *M_gpu, int n, double *edges, int q, int blocksize1)
            {
                int edge = threadIdx.x + (blocksize1*blockIdx.x) + q;
                int i = (int) edges[2*edge] - 1;
                int j = (int) edges[2*edge + 1] - 1;
                M_gpu[i * n + j] = 1.0;
                M_gpu[j * n + i] = 1.0;
             }
            
            __global__ void onDiagonal(double *M_gpu, int n, double *degrees, int p,int blocksize2)
            {
                int vertex = threadIdx.x + (blocksize2*blockIdx.x) + p;
                M_gpu[vertex*n + vertex] = degrees[vertex];
            }
        """)

        #Define GPU C Kernels in pyCUDA    
        if dt==numpy.float32:
            offDiagonal = mod_expand.get_function("offDiagonal")
            onDiagonal = mod_expand.get_function("onDiagonal")
        elif dt==numpy.float64:
            offDiagonal = mod_expand_64.get_function("offDiagonal")
            onDiagonal = mod_expand_64.get_function("onDiagonal")

        #Manual Block index tuning for Kernel1
        blocksize1=512
        if m <= blocksize1:
            k=1
            b=0
            q=0
    	d=0
            #offdiagonal kernel call small
            offDiagonal(M_gpu,numpy.int32(n),edges_gpu,numpy.int32(q),numpy.int32(blocksize1),block=(m,1,1),grid=(1,1))
        else:
            q=0
            k=int(m/blocksize1)
            #offdiagonal kernel call full blocks
            offDiagonal(M_gpu,numpy.int32(n),edges_gpu,numpy.int32(q),numpy.int32(blocksize1),block=(blocksize1,1,1),grid=(k,1))
            d=int(m%blocksize1)
    	if d !=0:
    	    q=k*blocksize1
    	    #offdiagonal kernel call partial blocks
                offDiagonal(M_gpu,numpy.int32(n),edges_gpu,numpy.int32(q),numpy.int32(blocksize1),block=(d,1,1))

        del edges_gpu

        #Manual Block Index tuning for Kernel2
        blocksize2=512
        if n <= blocksize2:
            l=1
            b=0
            p=0
            # ondiagonal kernel call small
            onDiagonal(M_gpu,numpy.int32(n),D_gpu,numpy.int32(p),numpy.int32(blocksize2),block=(n,1,1))
        else:
            p=0
            l=int(n/blocksize2)
            # ondiagonal kernel call, full blocks  
            onDiagonal(M_gpu,numpy.int32(n),D_gpu,numpy.int32(p),numpy.int32(blocksize2),block=(blocksize2,1,1),grid=(l,1))
            b=int(n%blocksize2)
            if b != 0:
                p=l*blocksize2
    	    # ondiagonal kernel call, partial blocks
    	    onDiagonal(M_gpu,numpy.int32(n),D_gpu,numpy.int32(p),numpy.int32(blocksize2),block=(b,1,1))         
    
        del D_gpu


        # tprint for tuning
        tprint('length of edgelist')
        tprint(m)
        tprint('number of blocks kernel1')	
        tprint(k)	
        tprint('partial block length kernel1')
        tprint(d)
        tprint('length of "degree" list')
        tprint(n)
        tprint('number of whole blocks kernel2')	
        tprint(l)
        tprint('partial block length kernel2')
        tprint(b)


        #reshape M_gpu to nxn
        M_gpu.shape = n,n
        
        vprint(M_gpu) 
        return M_gpu
    #-----end-gpu-kernel-stuff-------------------------------


pheww, that was long.  Your functions need not be this complicated, but I wanted to provide an example that *really improved performance* over the best known alternatives(read below).  The next step is to call this module then follow with some CULA linear algebra, and fortunately this is quite easy from here on out!

Calling Kernel and Mixing With CULA
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

So here the steps are simpler:

#. We need to import our tools, including our kernel_modules created above
#. Run mixed_init() to create a proper mixed enviornment
#. Make an input data set (provided)
#. Call our Custom Kernel
#. Call CULA syev to find the eigenvalues of the resulting matrix

As code it looks like::

    from PyCULA.cula import *
    import numpy as np
    import kernel_modules  			#import custom kernel module code

    mixed_init()

    # This is an edgelist, it represents a reduced intermediary form of a larger sparse matrix.
    E=np.array([[1 , 2 ],
    [ 1 , 16 ],
    [ 2 , 3], 
    [3 , 4 ],
    [ 4 , 5 ],
    [ 5 , 6 ],
    [ 6 , 7 ],
    [ 7 , 8 ],
    [ 8 , 9 ],
    [ 8 , 5 ],
    [ 9 , 10 ],
    [ 9 , 4 ],
    [ 10 , 11 ],
    [ 11 , 12 ],
    [ 12 , 13 ],
    [ 13 , 14 ],
    [ 13 , 10 ],
    [ 14 , 15 ],
    [ 14 , 9 ],
    [ 15 , 16 ],
    [ 15 , 4 ],
    [ 16 , 17 ],
    [ 16 , 3 ],
    [ 17 , 18 ],
    [ 18 , 19 ],
    [ 19 , 20 ],
    [ 19 , 16 ],
    [ 20 , 1 ]], dtype=np.float64)

    print E
    print

    # expand reduced form into full matrix using GPU kernel
    # holds array M on GPU
    M = kernel_modules.gpu_expand_kernel(E)

    print M
    print

    eigvals = gpu_devsyev(M) 		# run CULA syev routine on GPU array M
    print eigvals				# print solutions returned by CULA as numpy array on host


When I claimed this is an improvement, I really meant it.  To chug along a similar problem on a 10kx10k or larger graph in MATLAB may take anywhere from 10-40minutes using a fast machine.  Just using the GPU based CULA for matrix computations yielded times in the 5-10 minute range on large problems with good hardware. Using this reduced intermediary form and GPU expansion technique above, the some computations can be had in 1 or 2 minutes.  That is a 5-40x speedup depending on precision and problem size.  

Best of all, is that since you are developing in Python, its easy to input data, store your results, and share your science with the world.


Thanks Again!
=============

Thanks for checking out our tutorial documentation.  Should you have more questions, the docstrings from the source code follow in the next section.  Here you will find more detailed information about what particular wrappers are expecting as input and will be returning as output.  There are MANY functions, so use search to speed your query, and don't forget that docstring magic allows you to get help interactively in the interpreter::

    >>>from PyCULA.cula import *
    >>>help(gpu_svd)

        Help on function gpu_svd in module PyCULA.cula:

        gpu_svd(A, left_vectors=False, right_vectors=False)
            Takes numpy.array A and computes the SVD, returns Singular Values (optionally left/right vectors) as np.arrays.
    
            Keyword arguments:
            A -- input matrix
            left_vectors -- default == False; computes left singular vectors
            right_vectors -- default == False; computes right singular vectors
    
            Refer to CULA docs regarding gesvd for specific application.
        (END) 

~Cheers from the PyCULA team.