1 2.. _simulator: 3 4================================================= 5Debugging CUDA Python with the the CUDA Simulator 6================================================= 7 8Numba includes a CUDA Simulator that implements most of the semantics in CUDA 9Python using the Python interpreter and some additional Python code. This can 10be used to debug CUDA Python code, either by adding print statements to your 11code, or by using the debugger to step through the execution of an individual 12thread. 13 14Execution of kernels is performed by the simulator one block at a time. One 15thread is spawned for each thread in the block, and scheduling of the execution 16of these threads is left up to the operating system. 17 18Using the simulator 19=================== 20 21The simulator is enabled by setting the environment variable 22:envvar:`NUMBA_ENABLE_CUDASIM` to 1. CUDA Python code may then be executed as 23normal. The easiest way to use the debugger inside a kernel is to only stop a 24single thread, otherwise the interaction with the debugger is difficult to 25handle. For example, the kernel below will stop in the thread ``<<<(3,0,0), (1, 260, 0)>>>``:: 27 28 @cuda.jit 29 def vec_add(A, B, out): 30 x = cuda.threadIdx.x 31 bx = cuda.blockIdx.x 32 bdx = cuda.blockDim.x 33 if x == 1 and bx == 3: 34 from pdb import set_trace; set_trace() 35 i = bx * bdx + x 36 out[i] = A[i] + B[i] 37 38when invoked with a one-dimensional grid and one-dimensional blocks. 39 40Supported features 41================== 42 43The simulator aims to provide as complete a simulation of execution on a real 44GPU as possible - in particular, the following are supported: 45 46* Atomic operations 47* Constant memory 48* Local memory 49* Shared memory: declarations of shared memory arrays must be on separate source 50 lines, since the simulator uses source line information to keep track of 51 allocations of shared memory across threads. 52* :func:`.syncthreads` is supported - however, in the case where divergent 53 threads enter different :func:`.syncthreads` calls, the launch will not fail, 54 but unexpected behaviour will occur. A future version of the simulator may 55 detect this condition. 56* The stream API is supported, but all operations occur sequentially and 57 synchronously, unlike on a real device. Synchronising on a stream is therefore 58 a no-op. 59* The event API is also supported, but provides no meaningful timing 60 information. 61* Data transfer to and from the GPU - in particular, creating array objects with 62 :func:`.device_array` and :func:`.device_array_like`. The APIs for pinned memory 63 :func:`.pinned` and :func:`.pinned_array` are also supported, but no pinning 64 takes place. 65* The driver API implementation of the list of GPU contexts (``cuda.gpus`` and 66 ``cuda.cudadrv.devices.gpus``) is supported, and reports a single GPU context. 67 This context can be closed and reset as the real one would. 68* The :func:`.detect` function is supported, and reports one device called 69 `SIMULATOR`. 70 71Some limitations of the simulator include: 72 73* It does not perform type checking/type inference. If any argument types to a 74 jitted function are incorrect, or if the specification of the type of any 75 local variables are incorrect, this will not be detected by the simulator. 76* Only one GPU is simulated. 77* Multithreaded accesses to a single GPU are not supported, and will result in 78 unexpected behaviour. 79* Most of the driver API is unimplemented. 80* It is not possible to link PTX code with CUDA Python functions. 81* Warps and warp-level operations are not yet implemented. 82 83Obviously, the speed of the simulator is also much lower than that of a real 84device. It may be necessary to reduce the size of input data and the size of the 85CUDA grid in order to make debugging with the simulator tractable. 86