=========================
Compiling CUDA with clang
=========================

.. contents::
   :local:

Introduction
============

This document describes how to compile CUDA code with clang, and gives some
details about LLVM and clang's CUDA implementations.

This document assumes a basic familiarity with CUDA. Information about CUDA
programming can be found in the
`CUDA programming guide
<http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html>`_.

Compiling CUDA Code
===================

Prerequisites
-------------

CUDA is supported since llvm 3.9. The current release of clang (7.0.0)
supports CUDA 7.0 through 9.2. If you need support for CUDA 10, you will need
to use clang built from r342924 or newer.

Before you build CUDA code, you'll need to have installed the appropriate
driver for your NVIDIA GPU and the CUDA SDK.  See `NVIDIA's CUDA installation
guide
<https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html>`_ for
details.  Note that clang `does not support
<https://llvm.org/bugs/show_bug.cgi?id=26966>`_ the CUDA toolkit as installed
by many Linux package managers; you probably need to install CUDA in a single
directory from NVIDIA's package.

CUDA compilation is supported on Linux. Compilation on MacOS and Windows may
or may not work, and currently has no maintainers. Compilation with CUDA-9.x
is `currently broken on Windows <https://bugs.llvm.org/show_bug.cgi?id=38811>`_.

Invoking clang
--------------

Invoking clang for CUDA compilation works similarly to compiling regular C++.
You just need to be aware of a few additional flags.

You can use `this <https://gist.github.com/855e277884eb6b388cd2f00d956c2fd4>`_
program as a toy example.  Save it as ``axpy.cu``.  (Clang detects that you're
compiling CUDA code by noticing that your filename ends with ``.cu``.
Alternatively, you can pass ``-x cuda``.)
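
If the gist is unavailable, a minimal program along these lines (a sketch, not
necessarily the gist's exact contents) produces the output shown below:

.. code-block:: c++

  #include <iostream>

  __global__ void axpy(float a, float* x, float* y) {
    y[threadIdx.x] = a * x[threadIdx.x] + y[threadIdx.x];
  }

  int main() {
    const int kN = 4;
    float host_x[kN] = {1, 2, 3, 4};
    float host_y[kN] = {0, 0, 0, 0};

    float *device_x, *device_y;
    cudaMalloc(&device_x, kN * sizeof(float));
    cudaMalloc(&device_y, kN * sizeof(float));
    cudaMemcpy(device_x, host_x, kN * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(device_y, host_y, kN * sizeof(float), cudaMemcpyHostToDevice);

    // Launch one block of kN threads; computes y = 2*x + y, elementwise.
    axpy<<<1, kN>>>(2.0f, device_x, device_y);

    // The device-to-host copy also synchronizes with the kernel launch.
    cudaMemcpy(host_y, device_y, kN * sizeof(float), cudaMemcpyDeviceToHost);
    for (int i = 0; i < kN; ++i)
      std::cout << "y[" << i << "] = " << host_y[i] << "\n";

    cudaFree(device_x);
    cudaFree(device_y);
    return 0;
  }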

To build and run, run the following commands, filling in the parts in angle
brackets as described below:

.. code-block:: console

  $ clang++ axpy.cu -o axpy --cuda-gpu-arch=<GPU arch> \
      -L<CUDA install path>/<lib64 or lib>             \
      -lcudart_static -ldl -lrt -pthread
  $ ./axpy
  y[0] = 2
  y[1] = 4
  y[2] = 6
  y[3] = 8

On MacOS, replace ``-lcudart_static`` with ``-lcudart``; otherwise, you may get
"CUDA driver version is insufficient for CUDA runtime version" errors when you
run your program.

* ``<CUDA install path>`` -- the directory where you installed the CUDA SDK.
  Typically, ``/usr/local/cuda``.

  Pass e.g. ``-L/usr/local/cuda/lib64`` if compiling in 64-bit mode; otherwise,
  pass e.g. ``-L/usr/local/cuda/lib``.  (In CUDA, the device code and host code
  always have the same pointer widths, so if you're compiling 64-bit code for
  the host, you're also compiling 64-bit code for the device.) Note that as of
  v10.0, the CUDA SDK `no longer supports compilation of 32-bit
  applications <https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html#deprecated-features>`_.

* ``<GPU arch>`` -- the `compute capability
  <https://developer.nvidia.com/cuda-gpus>`_ of your GPU. For example, if you
  want to run your program on a GPU with compute capability of 3.5, specify
  ``--cuda-gpu-arch=sm_35``.

  Note: You cannot pass ``compute_XX`` as an argument to ``--cuda-gpu-arch``;
  only ``sm_XX`` is currently supported.  However, clang always includes PTX in
  its binaries, so e.g. a binary compiled with ``--cuda-gpu-arch=sm_30`` would be
  forwards-compatible with e.g. ``sm_35`` GPUs.

  You can pass ``--cuda-gpu-arch`` multiple times to compile for multiple archs.

The ``-L`` and ``-l`` flags only need to be passed when linking.  When
compiling, you may also need to pass ``--cuda-path=/path/to/cuda`` if you
didn't install the CUDA SDK into ``/usr/local/cuda`` or ``/usr/local/cuda-X.Y``.
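
For example, here is a sketch of separate compile and link steps, assuming the
SDK is installed in the hypothetical path ``/opt/cuda`` and the GPU has compute
capability 3.5:

.. code-block:: console

  $ clang++ -c axpy.cu --cuda-gpu-arch=sm_35 --cuda-path=/opt/cuda -o axpy.o
  $ clang++ axpy.o -o axpy -L/opt/cuda/lib64 \
      -lcudart_static -ldl -lrt -pthread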

Flags that control numerical code
---------------------------------

If you're using GPUs, you probably care about making numerical code run fast.
GPU hardware allows for more control over numerical operations than most CPUs,
but this results in more compiler options for you to juggle.

Flags you may wish to tweak include the following; a combined example follows
the list.

* ``-ffp-contract={on,off,fast}`` (defaults to ``fast`` on host and device when
  compiling CUDA) Controls whether the compiler emits fused multiply-add
  operations.

  * ``off``: never emit fma operations, and prevent ptxas from fusing multiply
    and add instructions.
  * ``on``: fuse multiplies and adds within a single statement, but never
    across statements (C11 semantics).  Prevent ptxas from fusing other
    multiplies and adds.
  * ``fast``: fuse multiplies and adds wherever profitable, even across
    statements.  Doesn't prevent ptxas from fusing additional multiplies and
    adds.

  Fused multiply-add instructions can be much faster than the unfused
  equivalents, but because the intermediate result in an fma is not rounded,
  this flag can affect numerical code.

* ``-fcuda-flush-denormals-to-zero`` (default: off) When this is enabled,
  floating point operations may flush `denormal
  <https://en.wikipedia.org/wiki/Denormal_number>`_ inputs and/or outputs to 0.
  Operations on denormal numbers are often much slower than the same operations
  on normal numbers.

* ``-fcuda-approx-transcendentals`` (default: off) When this is enabled, the
  compiler may emit calls to faster, approximate versions of transcendental
  functions, instead of using the slower, fully IEEE-compliant versions.  For
  example, this flag allows clang to emit the ptx ``sin.approx.f32``
  instruction.

  This is implied by ``-ffast-math``.
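
For example, two sketches of how these flags might be combined when building
the ``axpy.cu`` example from above, one conservative and one tuned for speed:

.. code-block:: console

  $ # Conservative: no fused multiply-adds anywhere.
  $ clang++ axpy.cu -o axpy --cuda-gpu-arch=sm_35 -ffp-contract=off \
      -L/usr/local/cuda/lib64 -lcudart_static -ldl -lrt -pthread

  $ # Fast: flush denormals and allow approximate transcendentals.
  $ clang++ axpy.cu -o axpy --cuda-gpu-arch=sm_35 \
      -fcuda-flush-denormals-to-zero -fcuda-approx-transcendentals \
      -L/usr/local/cuda/lib64 -lcudart_static -ldl -lrt -pthread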

Standard library support
========================

In clang and nvcc, most of the C++ standard library is not supported on the
device side.

``<math.h>`` and ``<cmath>``
----------------------------

In clang, ``math.h`` and ``cmath`` are available and `pass
<https://github.com/llvm/llvm-test-suite/blob/master/External/CUDA/math_h.cu>`_
`tests
<https://github.com/llvm/llvm-test-suite/blob/master/External/CUDA/cmath.cu>`_
adapted from libc++'s test suite.

In nvcc, ``math.h`` and ``cmath`` are mostly available.  Versions of ``::foof``
in namespace std (e.g. ``std::sinf``) are not available, and where the standard
calls for overloads that take integral arguments, these are usually not
available.

.. code-block:: c++

  #include <math.h>
  #include <cmath>

  // clang is OK with everything in this function.
  __device__ void test() {
    std::sin(0.); // nvcc - ok
    std::sin(0);  // nvcc - error, because no std::sin(int) overload is available.
    sin(0);       // nvcc - same as above.

    sinf(0.);       // nvcc - ok
    std::sinf(0.);  // nvcc - no such function
  }

``std::complex``
----------------

nvcc does not officially support ``std::complex``.  It's an error to use
``std::complex`` in ``__device__`` code, but it often works in ``__host__
__device__`` code due to nvcc's interpretation of the "wrong-side rule" (see
below).  However, we have heard from implementers that it's possible to get
into situations where nvcc will omit a call to an ``std::complex`` function,
especially when compiling without optimizations.

As of 2016-11-16, clang supports ``std::complex`` without these caveats.  It is
tested with libstdc++ 4.8.5 and newer, but is known to work only with libc++
newer than 2016-11-16.
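
For instance, a minimal sketch of the kind of ``__device__`` use that clang
accepts (the function name is illustrative):

.. code-block:: c++

  #include <complex>

  // Multiply by i, i.e. rotate 90 degrees counterclockwise.
  __device__ std::complex<float> rotate(std::complex<float> z) {
    return z * std::complex<float>(0, 1);
  }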

``<algorithm>``
---------------

In C++14, many useful functions from ``<algorithm>`` (notably, ``std::min`` and
``std::max``) become constexpr.  You can therefore use these in device code
when compiling with clang.
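
A minimal sketch, assuming you compile with ``-std=c++14`` (the function name
is illustrative):

.. code-block:: c++

  #include <algorithm>

  // constexpr std::min/std::max from C++14 are callable from device code.
  __device__ int clamp_to_byte(int x) {
    return std::min(std::max(x, 0), 255);
  }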

Detecting clang vs NVCC from code
=================================

Although clang's CUDA implementation is largely compatible with NVCC's, you may
still want to detect when you're compiling CUDA code specifically with clang.

This is tricky, because NVCC may invoke clang as part of its own compilation
process!  For example, NVCC uses the host compiler's preprocessor when
compiling for device code, and that host compiler may in fact be clang.

When clang is actually compiling CUDA code -- rather than being used as a
subtool of NVCC's -- it defines the ``__CUDA__`` macro.  ``__CUDA_ARCH__`` is
defined only in device mode (but will be defined if NVCC is using clang as a
preprocessor).  So you can use the following incantations to detect clang CUDA
compilation, in host and device modes:

.. code-block:: c++

  #if defined(__clang__) && defined(__CUDA__) && !defined(__CUDA_ARCH__)
  // clang compiling CUDA code, host mode.
  #endif

  #if defined(__clang__) && defined(__CUDA__) && defined(__CUDA_ARCH__)
  // clang compiling CUDA code, device mode.
  #endif

Both clang and nvcc define ``__CUDACC__`` during CUDA compilation.  You can
detect NVCC specifically by looking for ``__NVCC__``.
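
For example:

.. code-block:: c++

  #if defined(__NVCC__)
  // Code being compiled by nvcc, in either host or device mode, regardless
  // of which host compiler nvcc is wrapping.
  #endif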

Dialect Differences Between clang and nvcc
==========================================

There is no formal CUDA spec, and clang and nvcc speak slightly different
dialects of the language.  Below, we describe some of the differences.

This section is painful; hopefully you can skip this section and live your life
blissfully unaware.

Compilation Models
------------------

Most of the differences between clang and nvcc stem from the different
compilation models used by clang and nvcc.  nvcc uses *split compilation*,
which works roughly as follows:

 * Run a preprocessor over the input ``.cu`` file to split it into two source
   files: ``H``, containing source code for the host, and ``D``, containing
   source code for the device.

 * For each GPU architecture ``arch`` that we're compiling for, do:

   * Compile ``D`` using nvcc proper.  The result of this is a ``ptx`` file,
     ``P_arch``.

   * Optionally, invoke ``ptxas``, the PTX assembler, to generate a file,
     ``S_arch``, containing GPU machine code (SASS) for ``arch``.

 * Invoke ``fatbin`` to combine all ``P_arch`` and ``S_arch`` files into a
   single "fat binary" file, ``F``.

 * Compile ``H`` using an external host compiler (gcc, clang, or whatever you
   like).  ``F`` is packaged up into a header file which is force-included into
   ``H``; nvcc generates code that calls into this header to e.g. launch
   kernels.

clang uses *merged parsing*.  This is similar to split compilation, except all
of the host and device code is present and must be semantically correct in both
compilation steps.

  * For each GPU architecture ``arch`` that we're compiling for, do:

    * Compile the input ``.cu`` file for device, using clang.  ``__host__`` code
      is parsed and must be semantically correct, even though we're not
      generating code for the host at this time.

      The output of this step is a ``ptx`` file ``P_arch``.

    * Invoke ``ptxas`` to generate a SASS file, ``S_arch``.  Note that, unlike
      nvcc, clang always generates SASS code.

  * Invoke ``fatbin`` to combine all ``P_arch`` and ``S_arch`` files into a
    single fat binary file, ``F``.

  * Compile the input ``.cu`` file for host, using clang.  ``__device__`` code
    is parsed and must be semantically correct, even though we're not
    generating code for the device at this time.

    ``F`` is passed to this compilation, and clang includes it in a special ELF
    section, where it can be found by tools like ``cuobjdump``.

(You may ask at this point, why does clang need to parse the input file
multiple times?  Why not parse it just once, and then use the AST to generate
code for the host and each device architecture?

Unfortunately this can't work because we have to define different macros during
host compilation and during device compilation for each GPU architecture.)

clang's approach allows it to be highly robust to C++ edge cases, as it doesn't
need to decide at an early stage which declarations to keep and which to throw
away.  But it has some consequences you should be aware of.
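
You can inspect the individual sub-compilations with clang's
``--cuda-device-only`` and ``--cuda-host-only`` flags.  For example (a sketch,
using the ``axpy.cu`` file from above):

.. code-block:: console

  $ # Emit PTX for the device-side compilation only.
  $ clang++ --cuda-device-only --cuda-gpu-arch=sm_35 -S axpy.cu -o axpy.ptx
  $ # Run only the host-side compilation.
  $ clang++ --cuda-host-only -S axpy.cu -o axpy.s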

Overloading Based on ``__host__`` and ``__device__`` Attributes
---------------------------------------------------------------

Let "H", "D", and "HD" stand for "``__host__`` functions", "``__device__``
functions", and "``__host__ __device__`` functions", respectively.  Functions
with no attributes behave the same as H.

nvcc does not allow you to create H and D functions with the same signature:

.. code-block:: c++

  // nvcc: error - function "foo" has already been defined
  __host__ void foo() {}
  __device__ void foo() {}

However, nvcc allows you to "overload" H and D functions with different
signatures:

.. code-block:: c++

  // nvcc: no error
  __host__ void foo(int) {}
  __device__ void foo() {}

In clang, the ``__host__`` and ``__device__`` attributes are part of a
function's signature, and so it's legal to have H and D functions with
(otherwise) the same signature:

.. code-block:: c++

  // clang: no error
  __host__ void foo() {}
  __device__ void foo() {}

HD functions cannot be overloaded by H or D functions with the same signature:

.. code-block:: c++

  // nvcc: error - function "foo" has already been defined
  // clang: error - redefinition of 'foo'
  __host__ __device__ void foo() {}
  __device__ void foo() {}

  // nvcc: no error
  // clang: no error
  __host__ __device__ void bar(int) {}
  __device__ void bar() {}

When resolving an overloaded function, clang considers the host/device
attributes of the caller and callee.  These are used as a tiebreaker during
overload resolution.  See `IdentifyCUDAPreference
<http://clang.llvm.org/doxygen/SemaCUDA_8cpp.html>`_ for the full set of rules,
but at a high level they are:

 * D functions prefer to call other Ds.  HDs are given lower priority.

 * Similarly, H functions prefer to call other Hs, or ``__global__`` functions
   (with equal priority).  HDs are given lower priority.

 * HD functions prefer to call other HDs.

   When compiling for device, HDs will call Ds with lower priority than HD, and
   will call Hs with still lower priority.  If it's forced to call an H, the
   program is malformed if we emit code for this HD function.  We call this the
   "wrong-side rule", see example below.

   The rules are symmetrical when compiling for host.

Some examples:

.. code-block:: c++

   __host__ void foo();
   __device__ void foo();

   __host__ void bar();
   __host__ __device__ void bar();

   __host__ void test_host() {
     foo();  // calls H overload
     bar();  // calls H overload
   }

   __device__ void test_device() {
     foo();  // calls D overload
     bar();  // calls HD overload
   }

   __host__ __device__ void test_hd() {
     foo();  // calls H overload when compiling for host, otherwise D overload
     bar();  // always calls HD overload
   }

Wrong-side rule example:

.. code-block:: c++

  __host__ void host_only();

  // We don't codegen inline functions unless they're referenced by a
  // non-inline function.  inline_hd1() is called only from the host side, so
  // does not generate an error.  inline_hd2() is called from the device side,
  // so it generates an error.
  inline __host__ __device__ void inline_hd1() { host_only(); }  // no error
  inline __host__ __device__ void inline_hd2() { host_only(); }  // error

  __host__ void host_fn() { inline_hd1(); }
  __device__ void device_fn() { inline_hd2(); }

  // This function is not inline, so it's always codegen'ed on both the host
  // and the device.  Therefore, it generates an error.
  __host__ __device__ void not_inline_hd() { host_only(); }

For the purposes of the wrong-side rule, templated functions also behave like
``inline`` functions: They aren't codegen'ed unless they're instantiated
(usually as part of the process of invoking them).
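
For example, a sketch of the same rule applied to a template (the function
names are illustrative):

.. code-block:: c++

  __host__ void host_only();

  template <typename T>
  __host__ __device__ void template_hd() { host_only(); }

  // OK: template_hd<int> is codegen'ed only for the host.
  __host__ void host_caller() { template_hd<int>(); }

  // Would be an error: this instantiation forces device codegen of
  // template_hd<float>, including its call to host_only().
  // __device__ void device_caller() { template_hd<float>(); }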

clang's behavior with respect to the wrong-side rule matches nvcc's, except
nvcc only emits a warning for ``not_inline_hd``; device code is allowed to call
``not_inline_hd``.  In its generated code, nvcc may omit ``not_inline_hd``'s
call to ``host_only`` entirely, or it may try to generate code for
``host_only`` on the device.  What you get seems to depend on whether or not
the compiler chooses to inline ``host_only``.

Member functions, including constructors, may be overloaded using H and D
attributes.  However, destructors cannot be overloaded.
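
For example, a sketch of a constructor overloaded on the H and D attributes
(the class is illustrative):

.. code-block:: c++

  struct Buffer {
    __host__ Buffer() { /* host-side initialization */ }
    __device__ Buffer() { /* device-side initialization */ }

    // Not allowed: destructors cannot be overloaded this way.
    // __host__ ~Buffer() {}
    // __device__ ~Buffer() {}
  };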

Using a Different Class on Host/Device
--------------------------------------

Occasionally you may want to have a class with different host/device versions.

If all of the class's members are the same on the host and device, you can just
provide overloads for the class's member functions.

However, if you want your class to have different members on host/device, you
won't be able to provide working H and D overloads in both classes. In this
case, clang is likely to be unhappy with you.

.. code-block:: c++

  #ifdef __CUDA_ARCH__
  struct S {
    __device__ void foo() { /* use device_only */ }
    int device_only;
  };
  #else
  struct S {
    __host__ void foo() { /* use host_only */ }
    double host_only;
  };

  __device__ void test() {
    S s;
    // clang generates an error here, because during host compilation, we
    // have ifdef'ed away the __device__ overload of S::foo().  The __device__
    // overload must be present *even during host compilation*.
    s.foo();
  }
  #endif

We posit that you don't really want to have classes with different members on H
and D.  For example, if you were to pass one of these as a parameter to a
kernel, it would have a different layout on H and D, so would not work
properly.

To make code like this compatible with clang, we recommend you separate it out
into two classes.  If you need to write code that works on both host and
device, consider writing an overloaded wrapper function that returns different
types on host and device.

.. code-block:: c++

  struct HostS { ... };
  struct DeviceS { ... };

  __host__ HostS MakeStruct() { return HostS(); }
  __device__ DeviceS MakeStruct() { return DeviceS(); }

  // Now host and device code can call MakeStruct().

Unfortunately, this idiom isn't compatible with nvcc, because it doesn't allow
you to overload based on the H/D attributes.  Here's an idiom that works with
both clang and nvcc:

.. code-block:: c++

  struct HostS { ... };
  struct DeviceS { ... };

  #ifdef __NVCC__
    #ifndef __CUDA_ARCH__
      __host__ HostS MakeStruct() { return HostS(); }
    #else
      __device__ DeviceS MakeStruct() { return DeviceS(); }
    #endif
  #else
    __host__ HostS MakeStruct() { return HostS(); }
    __device__ DeviceS MakeStruct() { return DeviceS(); }
  #endif

  // Now host and device code can call MakeStruct().

Hopefully you don't have to do this sort of thing often.

Optimizations
=============

Modern CPUs and GPUs are architecturally quite different, so code that's fast
on a CPU isn't necessarily fast on a GPU.  We've made a number of changes to
LLVM to make it generate good GPU code.  Among these changes are:

* `Straight-line scalar optimizations <https://goo.gl/4Rb9As>`_ -- These
  reduce redundancy within straight-line code.

* `Aggressive speculative execution
  <http://llvm.org/docs/doxygen/html/SpeculativeExecution_8cpp_source.html>`_
  -- This is mainly for promoting straight-line scalar optimizations, which are
  most effective on code along dominator paths.

* `Memory space inference
  <http://llvm.org/doxygen/NVPTXInferAddressSpaces_8cpp_source.html>`_ --
  In PTX, we can operate on pointers that are in a particular "address space"
  (global, shared, constant, or local), or we can operate on pointers in the
  "generic" address space, which can point to anything.  Operations in a
  non-generic address space are faster, but pointers in CUDA are not explicitly
  annotated with their address space, so it's up to LLVM to infer it where
  possible.

* `Bypassing 64-bit divides
  <http://llvm.org/docs/doxygen/html/BypassSlowDivision_8cpp_source.html>`_ --
  This was an existing optimization that we enabled for the PTX backend.

  64-bit integer divides are much slower than 32-bit ones on NVIDIA GPUs.
  Many of the 64-bit divides in our benchmarks have a divisor and dividend
  which fit in 32-bits at runtime. This optimization provides a fast path for
  this common case.

* Aggressive loop unrolling and function inlining -- Loop unrolling and
  function inlining need to be more aggressive for GPUs than for CPUs because
  control flow transfer on a GPU is more expensive. More aggressive unrolling
  and inlining also promote other optimizations, such as constant propagation
  and SROA, which sometimes speed up code by over 10x.

  (Programmers can force unrolling and inlining using clang's `loop unrolling
  pragmas
  <http://clang.llvm.org/docs/AttributeReference.html#pragma-unroll-pragma-nounroll>`_
  and ``__attribute__((always_inline))``; a sketch follows this list.)
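
A sketch of both mechanisms:

.. code-block:: c++

  // Ask the compiler to fully unroll this fixed-trip-count loop.
  __device__ int sum16(const int *v) {
    int total = 0;
  #pragma unroll
    for (int i = 0; i < 16; ++i)
      total += v[i];
    return total;
  }

  // Ask the compiler to always inline this helper.
  __device__ __attribute__((always_inline)) int add(int a, int b) {
    return a + b;
  }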

Publication
===========

The team at Google published a paper in CGO 2016 detailing the optimizations
they'd made to clang/LLVM.  Note that "gpucc" is no longer a meaningful name:
The relevant tools are now just vanilla clang/LLVM.

| `gpucc: An Open-Source GPGPU Compiler <http://dl.acm.org/citation.cfm?id=2854041>`_
| Jingyue Wu, Artem Belevich, Eli Bendersky, Mark Heffernan, Chris Leary, Jacques Pienaar, Bjarke Roune, Rob Springer, Xuetian Weng, Robert Hundt
| *Proceedings of the 2016 International Symposium on Code Generation and Optimization (CGO 2016)*
|
| `Slides from the CGO talk <http://wujingyue.github.io/docs/gpucc-talk.pdf>`_
|
| `Tutorial given at CGO <http://wujingyue.github.io/docs/gpucc-tutorial.pdf>`_

Obtaining Help
==============

To obtain help on LLVM in general and its CUDA support, see `the LLVM
community <http://llvm.org/docs/#mailing-lists>`_.
