This module contains an implementation of batched FFT, ported from Apple’s OpenCL implementation. OpenCL’s ideology of constructing kernel code on the fly maps perfectly onto PyCuda/PyOpenCL, and the variety of Python’s templating engines makes code generation simpler. I used the Mako templating engine simply out of personal preference; the code can easily be changed to use any other engine.
Note
The “Cuda” part of pyfft requires PyCuda 0.94 or newer; the “CL” part requires PyOpenCL 0.92 or newer.
This overview contains basic usage examples for both backends, Cuda and OpenCL. The Cuda part goes first and contains somewhat more detailed comments, but they carry over directly to the OpenCL part, since the code is very similar.
First, import numpy and the plan creation interface from pyfft.
>>> from pyfft.cuda import Plan
>>> import numpy
Import the Cuda driver API root and the context creation function. In addition, we will need the gpuarray module to pass data to and from the GPU.
>>> import pycuda.driver as cuda
>>> from pycuda.tools import make_default_context
>>> import pycuda.gpuarray as gpuarray
Since we are using Cuda, it must be initialized before any Cuda functions are called (by default, the plan will use the existing context, but there are other possibilities; see the reference entry for Plan for further information). Stream creation is optional; if no stream is provided, Plan will create its own.
>>> cuda.init()
>>> context = make_default_context()
>>> stream = cuda.Stream()
Then the plan must be created. Plan creation is not very fast, mainly because of kernel compilation. Fortunately, PyCuda and PyOpenCL cache compiled sources, so if you use the same plan on each run of your program, it will be compiled only the first time.
>>> plan = Plan((16, 16), stream=stream)
Now let’s prepare a simple test array:
>>> data = numpy.ones((16, 16), dtype=numpy.complex64)
>>> gpu_data = gpuarray.to_gpu(data)
>>> print gpu_data
[[ 1.+0.j 1.+0.j 1.+0.j 1.+0.j 1.+0.j 1.+0.j 1.+0.j 1.+0.j 1.+0.j
1.+0.j 1.+0.j 1.+0.j 1.+0.j 1.+0.j 1.+0.j 1.+0.j]
...
[ 1.+0.j 1.+0.j 1.+0.j 1.+0.j 1.+0.j 1.+0.j 1.+0.j 1.+0.j 1.+0.j
1.+0.j 1.+0.j 1.+0.j 1.+0.j 1.+0.j 1.+0.j 1.+0.j]]
... and execute our plan:
>>> plan.execute(gpu_data)
<pycuda._driver.Stream object at ...>
>>> result = gpu_data.get()
>>> print result
[[ 256.+0.j 0.+0.j 0.+0.j 0.+0.j 0.+0.j 0.+0.j 0.+0.j
0.+0.j 0.+0.j 0.+0.j 0.+0.j 0.+0.j 0.+0.j 0.+0.j
0.+0.j 0.+0.j]
...
[ 0.+0.j 0.+0.j 0.+0.j 0.+0.j 0.+0.j 0.+0.j 0.+0.j
0.+0.j 0.+0.j 0.+0.j 0.+0.j 0.+0.j 0.+0.j 0.+0.j
0.+0.j 0.+0.j]]
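As a quick cross-check, the forward transform can also be compared against numpy’s reference FFT (the tolerance below is deliberately loose, since the GPU computation runs in single precision):
>>> reference = numpy.fft.fftn(data)
>>> numpy.allclose(result, reference, atol=1e-2)
True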
As expected, we got an array whose only non-zero element is the first one, equal to the array size (16 × 16 = 256). Let’s now perform the inverse transform:
>>> plan.execute(gpu_data, inverse=True)
<pycuda._driver.Stream object at ...>
>>> result = gpu_data.get()
Since the data is floating-point, we cannot simply compare it element-wise; we will calculate the error instead.
>>> error = numpy.abs(numpy.sum(numpy.abs(data) - numpy.abs(result)) / data.size)
>>> error < 1e-6
True
That’s good enough for single precision numbers.
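The benchmarks at the end of this overview show that double precision is supported on capable devices as well. Here is a minimal sketch of a double-precision plan; the dtype keyword name is an assumption, not something shown in this overview:
>>> # hypothetical double-precision plan; the 'dtype' keyword is an assumption
>>> plan_dp = Plan((16, 16), dtype=numpy.complex128, stream=stream)
>>> gpu_data_dp = gpuarray.to_gpu(data.astype(numpy.complex128))
>>> res = plan_dp.execute(gpu_data_dp)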
The last step is releasing the Cuda context:
>>> context.pop()
The OpenCL example consists of the same steps as the Cuda one.
Import the plan class:
>>> from pyfft.cl import Plan
>>> import numpy
Import the OpenCL API root and the array class:
>>> import pyopencl as cl
>>> import pyopencl.array as cl_array
Initialize the context and queue (unlike the Cuda backend, the OpenCL plan class requires either a context or a queue to be specified; there is no “current context” in OpenCL):
>>> ctx = cl.create_some_context(interactive=False)
>>> queue = cl.CommandQueue(ctx)
Create the plan (the remark about caching made for PyCuda applies here too):
>>> plan = Plan((16, 16), queue=queue)
Prepare data:
>>> data = numpy.ones((16, 16), dtype=numpy.complex64)
>>> gpu_data = cl_array.to_device(ctx, queue, data)
>>> print gpu_data
[[ 1.+0.j 1.+0.j 1.+0.j 1.+0.j 1.+0.j 1.+0.j 1.+0.j 1.+0.j 1.+0.j
1.+0.j 1.+0.j 1.+0.j 1.+0.j 1.+0.j 1.+0.j 1.+0.j]
...
[ 1.+0.j 1.+0.j 1.+0.j 1.+0.j 1.+0.j 1.+0.j 1.+0.j 1.+0.j 1.+0.j
1.+0.j 1.+0.j 1.+0.j 1.+0.j 1.+0.j 1.+0.j 1.+0.j]]
Forward transform:
>>> plan.execute(gpu_data.data)
<pyopencl._cl.CommandQueue object at ...>
>>> result = gpu_data.get()
>>> print result
[[ 256.+0.j 0.+0.j 0.+0.j 0.+0.j 0.+0.j 0.+0.j 0.+0.j
0.+0.j 0.+0.j 0.+0.j 0.+0.j 0.+0.j 0.+0.j 0.+0.j
0.+0.j 0.+0.j]
...
[ 0.+0.j 0.+0.j 0.+0.j 0.+0.j 0.+0.j 0.+0.j 0.+0.j
0.+0.j 0.+0.j 0.+0.j 0.+0.j 0.+0.j 0.+0.j 0.+0.j
0.+0.j 0.+0.j]]
Inverse transform:
>>> plan.execute(gpu_data.data, inverse=True)
<pyopencl._cl.CommandQueue object at ...>
>>> result = gpu_data.get()
Check that the result is equal to the initial data array:
>>> error = numpy.abs(numpy.sum(numpy.abs(data) - numpy.abs(result)) / data.size)
>>> print error < 1e-6
True
PyOpenCL does not require explicit context destruction; Python will do it for us.
A tuple of integers containing the module version, for example (0, 3, 4).
Creates an object containing a precalculated FFT plan.
Warning
2D and 3D plans with y == 1 or z == 1 are not supported at the moment.
Executes the plan for the given data. The function signature depends on the data type chosen during plan creation.
Returns: None if waiting for scheduled kernels to finish; a Stream or CommandQueue object otherwise. The user is expected to handle this object with care, since it can be reused during the next call to execute().
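For instance, in the Cuda tutorial above the plan was created with an explicit stream, so execute() does not wait by default and returns that Stream; one way to wait manually is PyCuda’s standard synchronization call (a sketch, not part of pyfft itself):
>>> s = plan.execute(gpu_data)  # a Stream is returned, since wait_for_finish defaults to False here
>>> s.synchronize()             # block until the scheduled FFT kernels have finished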
Plan behavior can differ depending on the values of the context, stream/queue and wait_for_finish parameters. These differences should, in theory, make the module more convenient to use.
The wait_for_finish parameter can be set on three levels. First, there is a default value which depends on the context and stream/queue parameters (see the details below). It can be overridden by explicitly passing it as an argument to the constructor. This setting, in turn, can be overridden by passing the wait_for_finish keyword to execute(). A short sketch follows the case descriptions below.
Cuda backend, if no stream is given:
- The current (at the moment of plan creation) context and device will be used to create the kernels.
- A Stream will be created internally and used for each execute() call.
- The default value of wait_for_finish is True.
Cuda backend, if a stream is given:
- context is ignored.
- The stream is remembered and used.
- execute() will assume that the context corresponding to the given stream is active at the time of the call.
- The default value of wait_for_finish is False.
OpenCL backend: either context or queue must be set.
If a queue is given:
- The queue is remembered and used.
- The target context and device are obtained from the queue.
- The default value of wait_for_finish is False.
If a context is given:
- The context is remembered.
- A CommandQueue will be created internally and used for each execute() call.
- The default value of wait_for_finish is True.
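To make the three levels concrete, here is a short sketch using the Cuda backend from the tutorial above (it assumes the constructor keyword shares the parameter’s name, wait_for_finish):
>>> plan = Plan((16, 16), stream=stream)                        # level 1: default (False, since a stream is given)
>>> plan = Plan((16, 16), stream=stream, wait_for_finish=True)  # level 2: override in the constructor
>>> res = plan.execute(gpu_data, wait_for_finish=False)         # level 3: override for a single call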
Here is a comparison to a pure Cuda program using CUFFT. For the Cuda test program, see the cuda folder in the distribution. The pyfft tests were executed with fast_math=True (the default option of the performance test script).
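For orientation, a batched 1D case such as “[1024, 1, 1], batch 2048” might be set up roughly as follows (a sketch only; the fast_math and batch keywords are assumptions based on the test script description, not documented in this overview):
>>> plan = Plan((1024,), stream=stream, fast_math=True)  # 'fast_math' keyword is an assumption
>>> buf = gpuarray.to_gpu(numpy.ones((2048, 1024), dtype=numpy.complex64))
>>> res = plan.execute(buf, batch=2048)                  # 'batch' keyword is an assumption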
In the following tables “sp” stands for “single precision”, “dp” for “double precision”.
Mac OS 10.6.6, Python 2.6, Cuda 3.2, PyCuda 2011.1, nVidia GeForce 9600M, 32 MB buffer:
Problem size / GFLOPS | CUFFT, sp | pyfft, sp |
---|---|---|
[16, 1, 1], batch 131072 | 12.3 | 7.7 |
[1024, 1, 1], batch 2048 | 19.9 | 15.7 |
[8192, 1, 1], batch 256 | 13.0 | 12.3 |
[16, 16, 1], batch 8192 | 10.6 | 10.0 |
[128, 128, 1], batch 128 | 14.4 | 15.1 |
[1024, 1024, 1], batch 2 | 15.6 | 13.4 |
[16, 16, 16], batch 512 | 10.2 | 11.4 |
[32, 32, 128], batch 16 | 15.2 | 15.7 |
[128, 128, 128], batch 1 | 12.7 | 14.5 |
Ubuntu 10.04.2, Python 2.6.5, Cuda 3.2, PyCuda 2011.1, nVidia Tesla C2050, 32 MB buffer:
Problem size / GFLOPS | CUFFT, sp | pyfft, sp | CUFFT, dp | pyfft, dp |
---|---|---|---|---|
[16, 1, 1], batch 131072 | 127.1 | 91.4 | 28.3 | 37.8 |
[1024, 1, 1], batch 2048 | 316.2 | 254.0 | 75.9 | 28.4 |
[8192, 1, 1], batch 256 | 250.6 | 117.3 | 94.7 | 29.3 |
[16, 16, 1], batch 8192 | 119.7 | 106.7 | 39.4 | 43.4 |
[128, 128, 1], batch 128 | 198.7 | 187.2 | 53.4 | 47.4 |
[1024, 1024, 1], batch 2 | 184.4 | 168.5 | 73.7 | 27.7 |
[16, 16, 16], batch 512 | 117.6 | 117.6 | 45.0 | 47.0 |
[32, 32, 128], batch 16 | 161.6 | 163.9 | 63.2 | 58.8 |
[128, 128, 128], batch 1 | 191.2 | 184.7 | 52.2 | 44.6 |