=============================
User Guide for NVPTX Back-end
=============================

.. contents::
   :local:
   :depth: 3


Introduction
============

To support GPU programming, the NVPTX back-end supports a subset of LLVM IR
along with a defined set of conventions used to represent GPU programming
concepts. This document provides an overview of the general usage of the
back-end, including a description of the conventions used and the set of
accepted LLVM IR.

.. note::

   This document assumes a basic familiarity with CUDA and the PTX
   assembly language. Information about the CUDA Driver API and the PTX assembly
   language can be found in the `CUDA documentation
   <http://docs.nvidia.com/cuda/index.html>`_.



Conventions
===========

Marking Functions as Kernels
----------------------------

In PTX, there are two types of functions: *device functions*, which are only
callable by device code, and *kernel functions*, which are callable by host
code. By default, the back-end will emit device functions. Metadata is used to
declare a function as a kernel function. This metadata is attached to the
``nvvm.annotations`` named metadata object, and has the following format:

.. code-block:: text

   !0 = !{<function-ref>, !"kernel", i32 1}

The first parameter is a reference to the kernel function. The following
example shows a kernel function calling a device function in LLVM IR. The
function ``@my_kernel`` is callable from host code, but ``@my_fmad`` is not.

.. code-block:: llvm

    define float @my_fmad(float %x, float %y, float %z) {
      %mul = fmul float %x, %y
      %add = fadd float %mul, %z
      ret float %add
    }

    define void @my_kernel(float* %ptr) {
      %val = load float, float* %ptr
      %ret = call float @my_fmad(float %val, float %val, float %val)
      store float %ret, float* %ptr
      ret void
    }

    !nvvm.annotations = !{!1}
    !1 = !{void (float*)* @my_kernel, !"kernel", i32 1}

When compiled, the PTX kernel functions are callable by host-side code.


.. _address_spaces:

Address Spaces
--------------

The NVPTX back-end uses the following address space mapping:

   ============= ======================
   Address Space Memory Space
   ============= ======================
   0             Generic
   1             Global
   2             Internal Use
   3             Shared
   4             Constant
   5             Local
   ============= ======================

Every global variable and pointer type is assigned to one of these address
spaces, with 0 being the default address space. Intrinsics are provided which
can be used to convert pointers between the generic and non-generic address
spaces.

As an example, the following IR will define an array ``@g`` that resides in
global device memory.

.. code-block:: llvm

    @g = internal addrspace(1) global [4 x i32] [ i32 0, i32 1, i32 2, i32 3 ]

LLVM IR functions can read and write to this array, and host-side code can
copy data to it by name with the CUDA Driver API.
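
For example, a device function can increment the first element of ``@g`` as
follows (a minimal sketch; ``@inc_first`` is an illustrative name):

.. code-block:: llvm

    define void @inc_first() {
      %ptr = getelementptr [4 x i32], [4 x i32] addrspace(1)* @g, i32 0, i32 0
      %val = load i32, i32 addrspace(1)* %ptr, align 4
      %inc = add i32 %val, 1
      store i32 %inc, i32 addrspace(1)* %ptr, align 4
      ret void
    }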

Note that since address space 0 is the generic space, it is illegal to have
global variables in address space 0.  Address space 0 is the default address
space in LLVM, so the ``addrspace(N)`` annotation is *required* for global
variables.


Triples
-------

The NVPTX target uses the module triple to select between 32/64-bit code
generation and the driver-compiler interface to use. The triple architecture
can be one of ``nvptx`` (32-bit PTX) or ``nvptx64`` (64-bit PTX). The
operating system should be one of ``cuda`` or ``nvcl``, which determines the
interface used by the generated code to communicate with the driver.  Most
users will want to use ``cuda`` as the operating system, which makes the
generated PTX compatible with the CUDA Driver API.

Example: 32-bit PTX for CUDA Driver API: ``nvptx-nvidia-cuda``

Example: 64-bit PTX for CUDA Driver API: ``nvptx64-nvidia-cuda``
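
In LLVM IR, the triple is given by a module-level declaration, for example:

.. code-block:: llvm

    target triple = "nvptx64-nvidia-cuda"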



.. _nvptx_intrinsics:

NVPTX Intrinsics
================

Address Space Conversion
------------------------

'``llvm.nvvm.ptr.*.to.gen``' Intrinsics
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Syntax:
"""""""

These are overloaded intrinsics.  You can use these on any pointer types.

.. code-block:: llvm

    declare i8* @llvm.nvvm.ptr.global.to.gen.p0i8.p1i8(i8 addrspace(1)*)
    declare i8* @llvm.nvvm.ptr.shared.to.gen.p0i8.p3i8(i8 addrspace(3)*)
    declare i8* @llvm.nvvm.ptr.constant.to.gen.p0i8.p4i8(i8 addrspace(4)*)
    declare i8* @llvm.nvvm.ptr.local.to.gen.p0i8.p5i8(i8 addrspace(5)*)

Overview:
"""""""""

The '``llvm.nvvm.ptr.*.to.gen``' intrinsics convert a pointer in a non-generic
address space to a generic address space pointer.

Semantics:
""""""""""

These intrinsics modify the pointer value to be a valid generic address space
pointer.
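
For example, a pointer into shared memory can be converted to a generic
pointer and then accessed with ordinary (generic) loads and stores. This is a
minimal sketch; ``@use_shared`` is an illustrative name:

.. code-block:: llvm

    define void @use_shared(i8 addrspace(3)* %p) {
      %gen = call i8* @llvm.nvvm.ptr.shared.to.gen.p0i8.p3i8(i8 addrspace(3)* %p)
      store i8 0, i8* %gen
      ret void
    }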


'``llvm.nvvm.ptr.gen.to.*``' Intrinsics
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Syntax:
"""""""

These are overloaded intrinsics.  You can use these on any pointer types.

.. code-block:: llvm

    declare i8 addrspace(1)* @llvm.nvvm.ptr.gen.to.global.p1i8.p0i8(i8*)
    declare i8 addrspace(3)* @llvm.nvvm.ptr.gen.to.shared.p3i8.p0i8(i8*)
    declare i8 addrspace(4)* @llvm.nvvm.ptr.gen.to.constant.p4i8.p0i8(i8*)
    declare i8 addrspace(5)* @llvm.nvvm.ptr.gen.to.local.p5i8.p0i8(i8*)

Overview:
"""""""""

The '``llvm.nvvm.ptr.gen.to.*``' intrinsics convert a pointer in the generic
address space to a pointer in the target address space.  Note that these
intrinsics are only useful if the non-generic address space that the pointer
actually refers to is known.  It is not legal to use address space conversion
intrinsics to convert a pointer from one non-generic address space to another
non-generic address space.

Semantics:
""""""""""

These intrinsics modify the pointer value to be a valid pointer in the target
non-generic address space.
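
For example, if a generic pointer is known to refer to global memory,
converting it allows the back-end to emit a direct global-memory access (a
minimal sketch; ``@load_global`` is an illustrative name):

.. code-block:: llvm

    define i8 @load_global(i8* %p) {
      %g = call i8 addrspace(1)* @llvm.nvvm.ptr.gen.to.global.p1i8.p0i8(i8* %p)
      %v = load i8, i8 addrspace(1)* %g
      ret i8 %v
    }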


Reading PTX Special Registers
-----------------------------

'``llvm.nvvm.read.ptx.sreg.*``'
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Syntax:
"""""""

.. code-block:: llvm

    declare i32 @llvm.nvvm.read.ptx.sreg.tid.x()
    declare i32 @llvm.nvvm.read.ptx.sreg.tid.y()
    declare i32 @llvm.nvvm.read.ptx.sreg.tid.z()
    declare i32 @llvm.nvvm.read.ptx.sreg.ntid.x()
    declare i32 @llvm.nvvm.read.ptx.sreg.ntid.y()
    declare i32 @llvm.nvvm.read.ptx.sreg.ntid.z()
    declare i32 @llvm.nvvm.read.ptx.sreg.ctaid.x()
    declare i32 @llvm.nvvm.read.ptx.sreg.ctaid.y()
    declare i32 @llvm.nvvm.read.ptx.sreg.ctaid.z()
    declare i32 @llvm.nvvm.read.ptx.sreg.nctaid.x()
    declare i32 @llvm.nvvm.read.ptx.sreg.nctaid.y()
    declare i32 @llvm.nvvm.read.ptx.sreg.nctaid.z()
    declare i32 @llvm.nvvm.read.ptx.sreg.warpsize()

Overview:
"""""""""

The '``@llvm.nvvm.read.ptx.sreg.*``' intrinsics provide access to the PTX
special registers, in particular the kernel launch parameters.  These
registers map in the following way to CUDA builtins:

   ============= =====================================
   CUDA Builtin  PTX Special Register Intrinsic
   ============= =====================================
   ``threadIdx`` ``@llvm.nvvm.read.ptx.sreg.tid.*``
   ``blockIdx``  ``@llvm.nvvm.read.ptx.sreg.ctaid.*``
   ``blockDim``  ``@llvm.nvvm.read.ptx.sreg.ntid.*``
   ``gridDim``   ``@llvm.nvvm.read.ptx.sreg.nctaid.*``
   ============= =====================================
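
For example, a thread's global index in a one-dimensional launch, equivalent
to the CUDA expression ``blockIdx.x * blockDim.x + threadIdx.x``, can be
computed as follows (a minimal sketch; ``@global_id`` is an illustrative
name):

.. code-block:: llvm

    declare i32 @llvm.nvvm.read.ptx.sreg.tid.x()
    declare i32 @llvm.nvvm.read.ptx.sreg.ntid.x()
    declare i32 @llvm.nvvm.read.ptx.sreg.ctaid.x()

    define i32 @global_id() {
      %tid   = call i32 @llvm.nvvm.read.ptx.sreg.tid.x()
      %ntid  = call i32 @llvm.nvvm.read.ptx.sreg.ntid.x()
      %ctaid = call i32 @llvm.nvvm.read.ptx.sreg.ctaid.x()
      %base  = mul i32 %ctaid, %ntid
      %id    = add i32 %base, %tid
      ret i32 %id
    }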


Barriers
--------

'``llvm.nvvm.barrier0``'
^^^^^^^^^^^^^^^^^^^^^^^^^^^

Syntax:
"""""""

.. code-block:: llvm

  declare void @llvm.nvvm.barrier0()

Overview:
"""""""""

The '``@llvm.nvvm.barrier0()``' intrinsic emits a PTX ``bar.sync 0``
instruction, equivalent to the ``__syncthreads()`` call in CUDA.
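
For example, a barrier is typically placed between writes to shared memory
and reads of values written by other threads in the CTA (a minimal sketch;
``@exchange`` is an illustrative name):

.. code-block:: llvm

    declare void @llvm.nvvm.barrier0()

    define void @exchange() {
      ; ... write this thread's value to shared memory ...
      call void @llvm.nvvm.barrier0()
      ; ... every thread in the CTA has reached this point, so it is now
      ; safe to read values written by other threads ...
      ret void
    }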


Other Intrinsics
----------------

For the full set of NVPTX intrinsics, please see the
``include/llvm/IR/IntrinsicsNVVM.td`` file in the LLVM source tree.


.. _libdevice:

Linking with Libdevice
======================

The CUDA Toolkit comes with an LLVM bitcode library called ``libdevice`` that
implements many common mathematical functions. This library can be used as a
high-performance math library for any compiler using the LLVM NVPTX target.
The library can be found under ``nvvm/libdevice/`` in the CUDA Toolkit, and
there is a separate version for each compute architecture.

For a list of all math functions implemented in libdevice, see
`libdevice Users Guide <http://docs.nvidia.com/cuda/libdevice-users-guide/index.html>`_.

To accommodate various math-related compiler flags that can affect code
generation of libdevice code, the library code depends on a special LLVM IR
pass (``NVVMReflect``) to handle conditional compilation within LLVM IR. This
pass looks for calls to the ``@__nvvm_reflect`` function and replaces them
with constants based on the defined reflection parameters. Such conditional
code often follows a pattern:

.. code-block:: c++

  float my_function(float a) {
    if (__nvvm_reflect("FASTMATH"))
      return my_function_fast(a);
    else
      return my_function_precise(a);
  }

The default value for all unspecified reflection parameters is zero.
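
At the IR level, such a call appears as an ordinary call to
``@__nvvm_reflect`` taking a pointer to the parameter name, which the pass
folds to a constant. The following is a rough sketch of the shape of that IR
(front-ends may materialize the string argument differently; ``@str`` and
``@use_fast_math`` are illustrative names):

.. code-block:: llvm

    @str = private unnamed_addr constant [9 x i8] c"FASTMATH\00"

    declare i32 @__nvvm_reflect(i8*)

    define i1 @use_fast_math() {
      %v = call i32 @__nvvm_reflect(
               i8* getelementptr ([9 x i8], [9 x i8]* @str, i32 0, i32 0))
      %fast = icmp ne i32 %v, 0
      ret i1 %fast
    }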

The ``NVVMReflect`` pass should be executed early in the optimization
pipeline, immediately after the link stage. The ``internalize`` pass is also
recommended to remove unused math functions from the resulting PTX. For an
input IR module ``module.bc``, the following compilation flow is recommended:

1. Save list of external functions in ``module.bc``
2. Link ``module.bc`` with ``libdevice.compute_XX.YY.bc``
3. Internalize all functions not in list from (1)
4. Eliminate all unused internal functions
5. Run ``NVVMReflect`` pass
6. Run standard optimization pipeline

.. note::

  The ``linkonce`` and ``linkonce_odr`` linkage types are not suitable for
  the libdevice functions, since it must remain possible to link two IR
  modules that have been linked against libdevice using different reflection
  variables.

Since the ``NVVMReflect`` pass replaces conditionals with constants, it will
often leave behind dead code of the form:

.. code-block:: llvm

  entry:
    ..
    br i1 true, label %foo, label %bar
  foo:
    ..
  bar:
    ; Dead code
    ..

Therefore, it is recommended that ``NVVMReflect`` be executed early in the
optimization pipeline, before any dead-code elimination passes run.

The NVPTX TargetMachine knows how to schedule ``NVVMReflect`` at the beginning
of your pass manager; just use the following code when setting up your pass
manager:

.. code-block:: c++

    std::unique_ptr<TargetMachine> TM = ...;
    PassManagerBuilder PMBuilder(...);
    if (TM)
      TM->adjustPassManager(PMBuilder);

Reflection Parameters
---------------------

The libdevice library currently uses the following reflection parameters to
control code generation:

==================== ======================================================
Flag                 Description
==================== ======================================================
``__CUDA_FTZ=[0,1]`` Use optimized code paths that flush subnormals to zero
==================== ======================================================

The value of this flag is determined by the "nvvm-reflect-ftz" module flag.
The following sets the ftz flag to 1.

.. code-block:: llvm

    !llvm.module.flags = !{!0}
    !0 = !{i32 4, !"nvvm-reflect-ftz", i32 1}

(``i32 4`` indicates that the value set here overrides the value in another
module we link with.  See the `LangRef <LangRef.html#module-flags-metadata>`_
for details.)

Executing PTX
=============

The most common way to execute PTX assembly on a GPU device is to use the CUDA
Driver API. This API is a low-level interface to the GPU driver and allows for
JIT compilation of PTX code to native GPU machine code.

Initializing the Driver API:

.. code-block:: c++

    CUdevice device;
    CUcontext context;

    // Initialize the driver API
    cuInit(0);
    // Get a handle to the first compute device
    cuDeviceGet(&device, 0);
    // Create a compute device context
    cuCtxCreate(&context, 0, device);

JIT compiling a PTX string to a device binary:

.. code-block:: c++

    CUmodule module;
    CUfunction function;

    // JIT compile a null-terminated PTX string
    cuModuleLoadData(&module, (void*)PTXString);

    // Get a handle to the "myfunction" kernel function
    cuModuleGetFunction(&function, module, "myfunction");
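
Launching the kernel once a handle has been obtained (a minimal sketch; the
grid/block sizes and the ``devPtr`` argument are illustrative):

.. code-block:: c++

    // devPtr is an illustrative device buffer handle (CUdeviceptr);
    // each entry of params points at one kernel argument.
    void *params[] = { &devPtr };

    // Launch a 1x1x1 grid of 16x1x1 blocks, with no dynamic shared
    // memory, on the default stream.
    cuLaunchKernel(function, 1, 1, 1,   // grid dimensions
                   16, 1, 1,            // block dimensions
                   0, NULL,             // shared memory size, stream
                   params, NULL);       // kernel arguments, extra options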

For full examples of executing PTX assembly, please see the `CUDA Samples
<https://developer.nvidia.com/cuda-downloads>`_ distribution.


Common Issues
=============

ptxas complains of undefined function: __nvvm_reflect
-----------------------------------------------------

When linking with libdevice, the ``NVVMReflect`` pass must be used. See
:ref:`libdevice` for more information.


Tutorial: A Simple Compute Kernel
=================================

To start, let us take a look at a simple compute kernel written directly in
LLVM IR. The kernel implements vector addition, where each thread computes one
element of the output vector C from the input vectors A and B.  To make this
easier, we also assume that only a single CTA (thread block) will be launched,
and that it will be one dimensional.


The Kernel
----------

.. code-block:: llvm

  target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v16:16:16-v32:32:32-v64:64:64-v128:128:128-n16:32:64"
  target triple = "nvptx64-nvidia-cuda"

  ; Intrinsic to read X component of thread ID
  declare i32 @llvm.nvvm.read.ptx.sreg.tid.x() readnone nounwind

  define void @kernel(float addrspace(1)* %A,
                      float addrspace(1)* %B,
                      float addrspace(1)* %C) {
  entry:
    ; What is my ID?
    %id = tail call i32 @llvm.nvvm.read.ptx.sreg.tid.x() readnone nounwind

    ; Compute pointers into A, B, and C
    %ptrA = getelementptr float, float addrspace(1)* %A, i32 %id
    %ptrB = getelementptr float, float addrspace(1)* %B, i32 %id
    %ptrC = getelementptr float, float addrspace(1)* %C, i32 %id

    ; Read A, B
    %valA = load float, float addrspace(1)* %ptrA, align 4
    %valB = load float, float addrspace(1)* %ptrB, align 4

    ; Compute C = A + B
    %valC = fadd float %valA, %valB

    ; Store back to C
    store float %valC, float addrspace(1)* %ptrC, align 4

    ret void
  }

  !nvvm.annotations = !{!0}
  !0 = !{void (float addrspace(1)*,
               float addrspace(1)*,
               float addrspace(1)*)* @kernel, !"kernel", i32 1}


We can use the LLVM ``llc`` tool to directly run the NVPTX code generator:

.. code-block:: text

  # llc -mcpu=sm_20 kernel.ll -o kernel.ptx


.. note::

  If you want to generate 32-bit code, change ``p:64:64:64`` to ``p:32:32:32``
  in the module data layout string and use ``nvptx-nvidia-cuda`` as the
  target triple.


The output we get from ``llc`` (as of LLVM 3.4):

.. code-block:: text

  //
  // Generated by LLVM NVPTX Back-End
  //

  .version 3.1
  .target sm_20
  .address_size 64

    // .globl kernel
                                          // @kernel
  .visible .entry kernel(
    .param .u64 kernel_param_0,
    .param .u64 kernel_param_1,
    .param .u64 kernel_param_2
  )
  {
    .reg .f32   %f<4>;
    .reg .s32   %r<2>;
    .reg .s64   %rl<8>;

  // %bb.0:                                // %entry
    ld.param.u64    %rl1, [kernel_param_0];
    mov.u32         %r1, %tid.x;
    mul.wide.s32    %rl2, %r1, 4;
    add.s64         %rl3, %rl1, %rl2;
    ld.param.u64    %rl4, [kernel_param_1];
    add.s64         %rl5, %rl4, %rl2;
    ld.param.u64    %rl6, [kernel_param_2];
    add.s64         %rl7, %rl6, %rl2;
    ld.global.f32   %f1, [%rl3];
    ld.global.f32   %f2, [%rl5];
    add.f32         %f3, %f1, %f2;
    st.global.f32   [%rl7], %f3;
    ret;
  }


Dissecting the Kernel
---------------------

Now let us dissect the LLVM IR that makes up this kernel.

Data Layout
^^^^^^^^^^^

The data layout string determines the size in bits of common data types, their
ABI alignment, and their storage size.  For NVPTX, you should use one of the
following:

32-bit PTX:

.. code-block:: llvm

  target datalayout = "e-p:32:32:32-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v16:16:16-v32:32:32-v64:64:64-v128:128:128-n16:32:64"

64-bit PTX:

.. code-block:: llvm

  target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v16:16:16-v32:32:32-v64:64:64-v128:128:128-n16:32:64"


Target Intrinsics
^^^^^^^^^^^^^^^^^

In this example, we use the ``@llvm.nvvm.read.ptx.sreg.tid.x`` intrinsic to
read the X component of the current thread's ID, which corresponds to a read
of register ``%tid.x`` in PTX. The NVPTX back-end supports a large set of
intrinsics.  A short list is shown below; please see
``include/llvm/IR/IntrinsicsNVVM.td`` for the full list.


================================================ ====================
Intrinsic                                        CUDA Equivalent
================================================ ====================
``i32 @llvm.nvvm.read.ptx.sreg.tid.{x,y,z}``     threadIdx.{x,y,z}
``i32 @llvm.nvvm.read.ptx.sreg.ctaid.{x,y,z}``   blockIdx.{x,y,z}
``i32 @llvm.nvvm.read.ptx.sreg.ntid.{x,y,z}``    blockDim.{x,y,z}
``i32 @llvm.nvvm.read.ptx.sreg.nctaid.{x,y,z}``  gridDim.{x,y,z}
``void @llvm.nvvm.barrier0()``                   __syncthreads()
================================================ ====================


Address Spaces
^^^^^^^^^^^^^^

You may have noticed that all of the pointer types in the LLVM IR example had
an explicit address space specifier. What is address space 1? NVIDIA GPU
devices (generally) have four types of memory:

- Global: Large, off-chip memory
- Shared: Small, on-chip memory shared among all threads in a CTA
- Local: Per-thread, private memory
- Constant: Read-only memory shared across all threads

These different types of memory are represented in LLVM IR as address spaces.
There is also a fifth address space used by the NVPTX code generator that
corresponds to the "generic" address space.  This address space can represent
addresses in any other address space (with a few exceptions).  This allows
users to write IR functions that can load/store memory using the same
instructions. Intrinsics are provided to convert pointers between the generic
and non-generic address spaces.

See :ref:`address_spaces` and :ref:`nvptx_intrinsics` for more information.
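
As an illustration, a fixed-size shared memory buffer can be declared as a
module-level global in address space 3 (a minimal sketch; ``@scratch`` is an
illustrative name, and the ``undef`` initializer reflects the fact that
shared memory cannot be statically initialized):

.. code-block:: llvm

  @scratch = internal addrspace(3) global [64 x float] undef, align 4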


Kernel Metadata
^^^^^^^^^^^^^^^

In PTX, a function can be either a `kernel` function (callable from the host
program), or a `device` function (callable only from GPU code). You can think
of `kernel` functions as entry-points in the GPU program. To mark an LLVM IR
function as a `kernel` function, we make use of special LLVM metadata. The
NVPTX back-end will look for a named metadata node called
``nvvm.annotations``. This named metadata must contain a list of metadata that
describe the IR. For our purposes, we need to declare a metadata node that
assigns the "kernel" attribute to the LLVM IR function that should be emitted
as a PTX `kernel` function. These metadata nodes take the form:

.. code-block:: text

  !{<function ref>, !"kernel", i32 1}

For the previous example, we have:

.. code-block:: llvm

  !nvvm.annotations = !{!0}
  !0 = !{void (float addrspace(1)*,
               float addrspace(1)*,
               float addrspace(1)*)* @kernel, !"kernel", i32 1}

Here, we have a single metadata declaration in ``nvvm.annotations``. This
metadata annotates our ``@kernel`` function with the ``kernel`` attribute.


Running the Kernel
------------------

Generating PTX from LLVM IR is all well and good, but how do we execute it on
a real GPU device? The CUDA Driver API provides a convenient mechanism for
loading and JIT compiling PTX to a native GPU device, and launching a kernel.
The API is similar to OpenCL.  A simple example showing how to load and
execute our vector addition code is shown below. Note that for brevity this
code does not perform much error checking!

.. note::

  You can also use the ``ptxas`` tool provided by the CUDA Toolkit to offline
  compile PTX to machine code (SASS) for a specific GPU architecture. Such
  binaries can be loaded by the CUDA Driver API in the same way as PTX. This
  can be useful for reducing startup time by precompiling the PTX kernels.
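
  For example (assuming an ``sm_20`` device; ``kernel.cubin`` is an
  illustrative output name):

  .. code-block:: text

    # ptxas -arch=sm_20 kernel.ptx -o kernel.cubin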


.. code-block:: c++

  #include <iostream>
  #include <fstream>
  #include <cassert>
  #include "cuda.h"


  void checkCudaErrors(CUresult err) {
    assert(err == CUDA_SUCCESS);
  }

  /// main - Program entry point
  int main(int argc, char **argv) {
    CUdevice    device;
    CUmodule    cudaModule;
    CUcontext   context;
    CUfunction  function;
    CUlinkState linker;
    int         devCount;

    // CUDA initialization
    checkCudaErrors(cuInit(0));
    checkCudaErrors(cuDeviceGetCount(&devCount));
    checkCudaErrors(cuDeviceGet(&device, 0));

    char name[128];
    checkCudaErrors(cuDeviceGetName(name, 128, device));
    std::cout << "Using CUDA Device [0]: " << name << "\n";

    int devMajor, devMinor;
    checkCudaErrors(cuDeviceComputeCapability(&devMajor, &devMinor, device));
    std::cout << "Device Compute Capability: "
              << devMajor << "." << devMinor << "\n";
    if (devMajor < 2) {
      std::cerr << "ERROR: Device 0 is not SM 2.0 or greater\n";
      return 1;
    }

    std::ifstream t("kernel.ptx");
    if (!t.is_open()) {
      std::cerr << "kernel.ptx not found\n";
      return 1;
    }
    std::string str((std::istreambuf_iterator<char>(t)),
                      std::istreambuf_iterator<char>());

    // Create driver context
    checkCudaErrors(cuCtxCreate(&context, 0, device));

    // Create module for object
    checkCudaErrors(cuModuleLoadDataEx(&cudaModule, str.c_str(), 0, 0, 0));

    // Get kernel function
    checkCudaErrors(cuModuleGetFunction(&function, cudaModule, "kernel"));

    // Device data
    CUdeviceptr devBufferA;
    CUdeviceptr devBufferB;
    CUdeviceptr devBufferC;

    checkCudaErrors(cuMemAlloc(&devBufferA, sizeof(float)*16));
    checkCudaErrors(cuMemAlloc(&devBufferB, sizeof(float)*16));
    checkCudaErrors(cuMemAlloc(&devBufferC, sizeof(float)*16));

    float* hostA = new float[16];
    float* hostB = new float[16];
    float* hostC = new float[16];

    // Populate input
    for (unsigned i = 0; i != 16; ++i) {
      hostA[i] = (float)i;
      hostB[i] = (float)(2*i);
      hostC[i] = 0.0f;
    }

    checkCudaErrors(cuMemcpyHtoD(devBufferA, &hostA[0], sizeof(float)*16));
    checkCudaErrors(cuMemcpyHtoD(devBufferB, &hostB[0], sizeof(float)*16));


    unsigned blockSizeX = 16;
    unsigned blockSizeY = 1;
    unsigned blockSizeZ = 1;
    unsigned gridSizeX  = 1;
    unsigned gridSizeY  = 1;
    unsigned gridSizeZ  = 1;

    // Kernel parameters
    void *KernelParams[] = { &devBufferA, &devBufferB, &devBufferC };

    std::cout << "Launching kernel\n";

    // Kernel launch
    checkCudaErrors(cuLaunchKernel(function, gridSizeX, gridSizeY, gridSizeZ,
                                   blockSizeX, blockSizeY, blockSizeZ,
                                   0, NULL, KernelParams, NULL));

    // Retrieve device data
    checkCudaErrors(cuMemcpyDtoH(&hostC[0], devBufferC, sizeof(float)*16));


    std::cout << "Results:\n";
    for (unsigned i = 0; i != 16; ++i) {
      std::cout << hostA[i] << " + " << hostB[i] << " = " << hostC[i] << "\n";
    }


    // Clean up after ourselves
    delete [] hostA;
    delete [] hostB;
    delete [] hostC;

    // Clean-up
    checkCudaErrors(cuMemFree(devBufferA));
    checkCudaErrors(cuMemFree(devBufferB));
    checkCudaErrors(cuMemFree(devBufferC));
    checkCudaErrors(cuModuleUnload(cudaModule));
    checkCudaErrors(cuCtxDestroy(context));

    return 0;
  }


You will need to link with the CUDA driver and specify the path to ``cuda.h``.

.. code-block:: text

  # clang++ sample.cpp -o sample -O2 -g -I/usr/local/cuda-5.5/include -lcuda

We don't need to specify a path to ``libcuda.so`` since this is installed in a
system location by the driver, not the CUDA Toolkit.

If everything goes as planned, you should see the following output when
running the compiled program:

.. code-block:: text

  Using CUDA Device [0]: GeForce GTX 680
  Device Compute Capability: 3.0
  Launching kernel
  Results:
  0 + 0 = 0
  1 + 2 = 3
  2 + 4 = 6
  3 + 6 = 9
  4 + 8 = 12
  5 + 10 = 15
  6 + 12 = 18
  7 + 14 = 21
  8 + 16 = 24
  9 + 18 = 27
  10 + 20 = 30
  11 + 22 = 33
  12 + 24 = 36
  13 + 26 = 39
  14 + 28 = 42
  15 + 30 = 45

.. note::

  You will likely see a different device identifier based on your hardware.


Tutorial: Linking with Libdevice
================================

In this tutorial, we show a simple example of linking LLVM IR with the
libdevice library. We will use the same kernel as the previous tutorial,
except that we will compute ``C = pow(A, B)`` instead of ``C = A + B``.
Libdevice provides an ``__nv_powf`` function that we will use.

.. code-block:: llvm

  target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v16:16:16-v32:32:32-v64:64:64-v128:128:128-n16:32:64"
  target triple = "nvptx64-nvidia-cuda"

  ; Intrinsic to read X component of thread ID
  declare i32 @llvm.nvvm.read.ptx.sreg.tid.x() readnone nounwind
  ; libdevice function
  declare float @__nv_powf(float, float)

  define void @kernel(float addrspace(1)* %A,
                      float addrspace(1)* %B,
                      float addrspace(1)* %C) {
  entry:
    ; What is my ID?
    %id = tail call i32 @llvm.nvvm.read.ptx.sreg.tid.x() readnone nounwind

    ; Compute pointers into A, B, and C
    %ptrA = getelementptr float, float addrspace(1)* %A, i32 %id
    %ptrB = getelementptr float, float addrspace(1)* %B, i32 %id
    %ptrC = getelementptr float, float addrspace(1)* %C, i32 %id

    ; Read A, B
    %valA = load float, float addrspace(1)* %ptrA, align 4
    %valB = load float, float addrspace(1)* %ptrB, align 4

    ; Compute C = pow(A, B)
    %valC = call float @__nv_powf(float %valA, float %valB)

    ; Store back to C
    store float %valC, float addrspace(1)* %ptrC, align 4

    ret void
  }

  !nvvm.annotations = !{!0}
  !0 = !{void (float addrspace(1)*,
               float addrspace(1)*,
               float addrspace(1)*)* @kernel, !"kernel", i32 1}


To compile this kernel, we perform the following steps:

1. Link with libdevice
2. Internalize all but the public kernel function
3. Run ``NVVMReflect`` and set ``__CUDA_FTZ`` to 0
4. Optimize the linked module
5. Codegen the module


These steps can be performed by the LLVM ``llvm-link``, ``opt``, and ``llc``
tools. In a complete compiler, these steps can also be performed entirely
programmatically by setting up an appropriate pass configuration (see
:ref:`libdevice`).

.. code-block:: text

  # llvm-link t2.bc libdevice.compute_20.10.bc -o t2.linked.bc
  # opt -internalize -internalize-public-api-list=kernel -nvvm-reflect-list=__CUDA_FTZ=0 -nvvm-reflect -O3 t2.linked.bc -o t2.opt.bc
  # llc -mcpu=sm_20 t2.opt.bc -o t2.ptx

.. note::

  The ``-nvvm-reflect-list=__CUDA_FTZ=0`` option is not strictly required, as
  any undefined variables will default to zero. It is shown here for
  evaluation purposes.


This gives us the following PTX (excerpt):

.. code-block:: text

  //
  // Generated by LLVM NVPTX Back-End
  //

  .version 3.1
  .target sm_20
  .address_size 64

    // .globl kernel
                                          // @kernel
  .visible .entry kernel(
    .param .u64 kernel_param_0,
    .param .u64 kernel_param_1,
    .param .u64 kernel_param_2
  )
  {
    .reg .pred  %p<30>;
    .reg .f32   %f<111>;
    .reg .s32   %r<21>;
    .reg .s64   %rl<8>;

  // %bb.0:                                // %entry
    ld.param.u64  %rl2, [kernel_param_0];
    mov.u32   %r3, %tid.x;
    ld.param.u64  %rl3, [kernel_param_1];
    mul.wide.s32  %rl4, %r3, 4;
    add.s64   %rl5, %rl2, %rl4;
    ld.param.u64  %rl6, [kernel_param_2];
    add.s64   %rl7, %rl3, %rl4;
    add.s64   %rl1, %rl6, %rl4;
    ld.global.f32   %f1, [%rl5];
    ld.global.f32   %f2, [%rl7];
    setp.eq.f32 %p1, %f1, 0f3F800000;
    setp.eq.f32 %p2, %f2, 0f00000000;
    or.pred   %p3, %p1, %p2;
    @%p3 bra  BB0_1;
    bra.uni   BB0_2;
  BB0_1:
    mov.f32   %f110, 0f3F800000;
    st.global.f32   [%rl1], %f110;
    ret;
  BB0_2:                                  // %__nv_isnanf.exit.i
    abs.f32   %f4, %f1;
    setp.gtu.f32  %p4, %f4, 0f7F800000;
    @%p4 bra  BB0_4;
  // %bb.3:                                // %__nv_isnanf.exit5.i
    abs.f32   %f5, %f2;
    setp.le.f32 %p5, %f5, 0f7F800000;
    @%p5 bra  BB0_5;
  BB0_4:                                  // %.critedge1.i
    add.f32   %f110, %f1, %f2;
    st.global.f32   [%rl1], %f110;
    ret;
  BB0_5:                                  // %__nv_isinff.exit.i

    ...

  BB0_26:                                 // %__nv_truncf.exit.i.i.i.i.i
    mul.f32   %f90, %f107, 0f3FB8AA3B;
    cvt.rzi.f32.f32 %f91, %f90;
    mov.f32   %f92, 0fBF317200;
    fma.rn.f32  %f93, %f91, %f92, %f107;
    mov.f32   %f94, 0fB5BFBE8E;
    fma.rn.f32  %f95, %f91, %f94, %f93;
    mul.f32   %f89, %f95, 0f3FB8AA3B;
    // inline asm
    ex2.approx.ftz.f32 %f88,%f89;
    // inline asm
    add.f32   %f96, %f91, 0f00000000;
    ex2.approx.f32  %f97, %f96;
    mul.f32   %f98, %f88, %f97;
    setp.lt.f32 %p15, %f107, 0fC2D20000;
    selp.f32  %f99, 0f00000000, %f98, %p15;
    setp.gt.f32 %p16, %f107, 0f42D20000;
    selp.f32  %f110, 0f7F800000, %f99, %p16;
    setp.eq.f32 %p17, %f110, 0f7F800000;
    @%p17 bra   BB0_28;
  // %bb.27:
    fma.rn.f32  %f110, %f110, %f108, %f110;
  BB0_28:                                 // %__internal_accurate_powf.exit.i
    setp.lt.f32 %p18, %f1, 0f00000000;
    setp.eq.f32 %p19, %f3, 0f3F800000;
    and.pred    %p20, %p18, %p19;
    @!%p20 bra  BB0_30;
    bra.uni   BB0_29;
  BB0_29:
    mov.b32    %r9, %f110;
    xor.b32   %r10, %r9, -2147483648;
    mov.b32    %f110, %r10;
  BB0_30:                                 // %__nv_powf.exit
    st.global.f32   [%rl1], %f110;
    ret;
  }