1=============================
2User Guide for AMDGPU Backend
3=============================
4
5.. contents::
6   :local:
7
8.. toctree::
9   :hidden:
10
11   AMDGPU/AMDGPUAsmGFX7
12   AMDGPU/AMDGPUAsmGFX8
13   AMDGPU/AMDGPUAsmGFX9
14   AMDGPU/AMDGPUAsmGFX900
15   AMDGPU/AMDGPUAsmGFX904
16   AMDGPU/AMDGPUAsmGFX906
17   AMDGPU/AMDGPUAsmGFX908
18   AMDGPU/AMDGPUAsmGFX10
19   AMDGPU/AMDGPUAsmGFX1011
20   AMDGPUModifierSyntax
21   AMDGPUOperandSyntax
22   AMDGPUInstructionSyntax
23   AMDGPUInstructionNotation
24   AMDGPUDwarfProposalForHeterogeneousDebugging
25
26Introduction
27============
28
29The AMDGPU backend provides ISA code generation for AMD GPUs, starting with the
30R600 family up until the current GCN families. It lives in the
31``llvm/lib/Target/AMDGPU`` directory.
32
33LLVM
34====
35
36.. _amdgpu-target-triples:
37
38Target Triples
39--------------
40
41Use the ``clang -target <Architecture>-<Vendor>-<OS>-<Environment>`` option to
42specify the target triple:
43
44  .. table:: AMDGPU Architectures
45     :name: amdgpu-architecture-table
46
47     ============ ==============================================================
48     Architecture Description
49     ============ ==============================================================
50     ``r600``     AMD GPUs HD2XXX-HD6XXX for graphics and compute shaders.
51     ``amdgcn``   AMD GPUs GCN GFX6 onwards for graphics and compute shaders.
52     ============ ==============================================================
53
54  .. table:: AMDGPU Vendors
55     :name: amdgpu-vendor-table
56
57     ============ ==============================================================
58     Vendor       Description
59     ============ ==============================================================
60     ``amd``      Can be used for all AMD GPU usage.
61     ``mesa3d``   Can be used if the OS is ``mesa3d``.
62     ============ ==============================================================
63
64  .. table:: AMDGPU Operating Systems
65     :name: amdgpu-os-table
66
67     ============== ============================================================
68     OS             Description
69     ============== ============================================================
70     *<empty>*      Defaults to the *unknown* OS.
71     ``amdhsa``     Compute kernels executed on HSA [HSA]_ compatible runtimes
72                    such as AMD's ROCm [AMD-ROCm]_.
73     ``amdpal``     Graphic shaders and compute kernels executed on AMD PAL
74                    runtime.
75     ``mesa3d``     Graphic shaders and compute kernels executed on Mesa 3D
76                    runtime.
77     ============== ============================================================
78
79  .. table:: AMDGPU Environments
80     :name: amdgpu-environment-table
81
82     ============ ==============================================================
83     Environment  Description
84     ============ ==============================================================
85     *<empty>*    Default.
86     ============ ==============================================================
87
88.. _amdgpu-processors:
89
90Processors
91----------
92
93Use the ``clang -mcpu <Processor>`` option to specify the AMDGPU processor. The
94names from both the *Processor* and *Alternative Processor* can be used.
95
96  .. table:: AMDGPU Processors
97     :name: amdgpu-processor-table
98
99     =========== =============== ============ ===== ================= ======= ======================
100     Processor   Alternative     Target       dGPU/ Target            ROCm    Example
101                 Processor       Triple       APU   Features          Support Products
102                                 Architecture       Supported
103                                                    [Default]
104     =========== =============== ============ ===== ================= ======= ======================
105     **Radeon HD 2000/3000 Series (R600)** [AMD-RADEON-HD-2000-3000]_
106     -----------------------------------------------------------------------------------------------
107     ``r600``                    ``r600``     dGPU
108     ``r630``                    ``r600``     dGPU
109     ``rs880``                   ``r600``     dGPU
110     ``rv670``                   ``r600``     dGPU
111     **Radeon HD 4000 Series (R700)** [AMD-RADEON-HD-4000]_
112     -----------------------------------------------------------------------------------------------
113     ``rv710``                   ``r600``     dGPU
114     ``rv730``                   ``r600``     dGPU
115     ``rv770``                   ``r600``     dGPU
116     **Radeon HD 5000 Series (Evergreen)** [AMD-RADEON-HD-5000]_
117     -----------------------------------------------------------------------------------------------
118     ``cedar``                   ``r600``     dGPU
119     ``cypress``                 ``r600``     dGPU
120     ``juniper``                 ``r600``     dGPU
121     ``redwood``                 ``r600``     dGPU
122     ``sumo``                    ``r600``     dGPU
123     **Radeon HD 6000 Series (Northern Islands)** [AMD-RADEON-HD-6000]_
124     -----------------------------------------------------------------------------------------------
125     ``barts``                   ``r600``     dGPU
126     ``caicos``                  ``r600``     dGPU
127     ``cayman``                  ``r600``     dGPU
128     ``turks``                   ``r600``     dGPU
129     **GCN GFX6 (Southern Islands (SI))** [AMD-GCN-GFX6]_
130     -----------------------------------------------------------------------------------------------
131     ``gfx600``  - ``tahiti``    ``amdgcn``   dGPU
132     ``gfx601``  - ``hainan``    ``amdgcn``   dGPU
133                 - ``oland``
134                 - ``pitcairn``
135                 - ``verde``
136     **GCN GFX7 (Sea Islands (CI))** [AMD-GCN-GFX7]_
137     -----------------------------------------------------------------------------------------------
138     ``gfx700``  - ``kaveri``    ``amdgcn``   APU                             - A6-7000
139                                                                              - A6 Pro-7050B
140                                                                              - A8-7100
141                                                                              - A8 Pro-7150B
142                                                                              - A10-7300
143                                                                              - A10 Pro-7350B
144                                                                              - FX-7500
145                                                                              - A8-7200P
146                                                                              - A10-7400P
147                                                                              - FX-7600P
148     ``gfx701``  - ``hawaii``    ``amdgcn``   dGPU                    ROCm    - FirePro W8100
149                                                                              - FirePro W9100
150                                                                              - FirePro S9150
151                                                                              - FirePro S9170
152     ``gfx702``                  ``amdgcn``   dGPU                    ROCm    - Radeon R9 290
153                                                                              - Radeon R9 290x
154                                                                              - Radeon R390
155                                                                              - Radeon R390x
156     ``gfx703``  - ``kabini``    ``amdgcn``   APU                             - E1-2100
157                 - ``mullins``                                                - E1-2200
158                                                                              - E1-2500
159                                                                              - E2-3000
160                                                                              - E2-3800
161                                                                              - A4-5000
162                                                                              - A4-5100
163                                                                              - A6-5200
164                                                                              - A4 Pro-3340B
165     ``gfx704``  - ``bonaire``   ``amdgcn``   dGPU                            - Radeon HD 7790
166                                                                              - Radeon HD 8770
167                                                                              - R7 260
168                                                                              - R7 260X
169     **GCN GFX8 (Volcanic Islands (VI))** [AMD-GCN-GFX8]_
170     -----------------------------------------------------------------------------------------------
171     ``gfx801``  - ``carrizo``   ``amdgcn``   APU   - xnack                   - A6-8500P
172                                                      [on]                    - Pro A6-8500B
173                                                                              - A8-8600P
174                                                                              - Pro A8-8600B
175                                                                              - FX-8800P
176                                                                              - Pro A12-8800B
177     \                           ``amdgcn``   APU   - xnack           ROCm    - A10-8700P
178                                                      [on]                    - Pro A10-8700B
179                                                                              - A10-8780P
180     \                           ``amdgcn``   APU   - xnack                   - A10-9600P
181                                                      [on]                    - A10-9630P
182                                                                              - A12-9700P
183                                                                              - A12-9730P
184                                                                              - FX-9800P
185                                                                              - FX-9830P
186     \                           ``amdgcn``   APU   - xnack                   - E2-9010
187                                                      [on]                    - A6-9210
188                                                                              - A9-9410
189     ``gfx802``  - ``iceland``   ``amdgcn``   dGPU  - xnack           ROCm    - FirePro S7150
190                 - ``tonga``                          [off]                   - FirePro S7100
191                                                                              - FirePro W7100
192                                                                              - Radeon R285
193                                                                              - Radeon R9 380
194                                                                              - Radeon R9 385
195                                                                              - Mobile FirePro
196                                                                                M7170
197     ``gfx803``  - ``fiji``      ``amdgcn``   dGPU  - xnack           ROCm    - Radeon R9 Nano
198                                                      [off]                   - Radeon R9 Fury
199                                                                              - Radeon R9 FuryX
200                                                                              - Radeon Pro Duo
201                                                                              - FirePro S9300x2
202                                                                              - Radeon Instinct MI8
203     \           - ``polaris10`` ``amdgcn``   dGPU  - xnack           ROCm    - Radeon RX 470
204                                                      [off]                   - Radeon RX 480
205                                                                              - Radeon Instinct MI6
206     \           - ``polaris11`` ``amdgcn``   dGPU  - xnack           ROCm    - Radeon RX 460
207                                                      [off]
208     ``gfx810``  - ``stoney``    ``amdgcn``   APU   - xnack
209                                                      [on]
210     **GCN GFX9** [AMD-GCN-GFX9]_
211     -----------------------------------------------------------------------------------------------
212     ``gfx900``                  ``amdgcn``   dGPU  - xnack           ROCm    - Radeon Vega
213                                                      [off]                     Frontier Edition
214                                                                              - Radeon RX Vega 56
215                                                                              - Radeon RX Vega 64
216                                                                              - Radeon RX Vega 64
217                                                                                Liquid
218                                                                              - Radeon Instinct MI25
219     ``gfx902``                  ``amdgcn``   APU   - xnack                   - Ryzen 3 2200G
220                                                      [on]                    - Ryzen 5 2400G
221     ``gfx904``                  ``amdgcn``   dGPU  - xnack                   *TBA*
222                                                      [off]
223                                                                              .. TODO::
224                                                                                 Add product
225                                                                                 names.
226     ``gfx906``                  ``amdgcn``   dGPU  - xnack                   - Radeon Instinct MI50
227                                                      [off]                   - Radeon Instinct MI60
228                                                                              - Radeon VII
229                                                                              - Radeon Pro VII
230     ``gfx908``                  ``amdgcn``   dGPU  - xnack                   *TBA*
231                                                      [off]
232                                                      sram-ecc
233                                                      [on]
234                                                                              .. TODO::
235                                                                                 Add product
236                                                                                 names.
237     ``gfx909``                  ``amdgcn``   APU   - xnack                   *TBA*
238                                                      [on]
239                                                                              .. TODO::
240                                                                                 Add product
241                                                                                 names.
242     **GCN GFX10** [AMD-GCN-GFX10]_
243     -----------------------------------------------------------------------------------------------
244     ``gfx1010``                 ``amdgcn``   dGPU  - xnack                   - Radeon RX 5700
245                                                      [off]                   - Radeon RX 5700 XT
246                                                    - wavefrontsize64         - Radeon Pro 5600 XT
247                                                      [off]
248                                                    - cumode
249                                                      [off]
250     ``gfx1011``                 ``amdgcn``   dGPU  - xnack                   - Radeon Pro 5600M
251                                                      [off]
252                                                    - wavefrontsize64
253                                                      [off]
254                                                    - cumode
255                                                      [off]
256     ``gfx1012``                 ``amdgcn``   dGPU  - xnack                   - Radeon RX 5500
257                                                      [off]                   - Radeon RX 5500 XT
258                                                    - wavefrontsize64
259                                                      [off]
260                                                    - cumode
261                                                      [off]
262     ``gfx1030``                 ``amdgcn``   dGPU  - wavefrontsize64         *TBA*
263                                                      [off]
264                                                    - cumode
265                                                      [off]
266                                                                              .. TODO
267                                                                                 Add product
268                                                                                 names.
269     =========== =============== ============ ===== ================= ======= ======================
270
271.. _amdgpu-target-features:
272
273Target Features
274---------------
275
276Target features control how code is generated to support certain
277processor specific features. Not all target features are supported by
278all processors. The runtime must ensure that the features supported by
279the device used to execute the code match the features enabled when
280generating the code. A mismatch of features may result in incorrect
281execution, or a reduction in performance.
282
283The target features supported by each processor, and the default value
284used if not specified explicitly, is listed in
285:ref:`amdgpu-processor-table`.
286
287Use the ``clang -m[no-]<TargetFeature>`` option to specify the AMDGPU
288target features.
289
290For example:
291
292``-mxnack``
293  Enable the ``xnack`` feature.
294``-mno-xnack``
295  Disable the ``xnack`` feature.
296
297  .. table:: AMDGPU Target Features
298     :name: amdgpu-target-feature-table
299
300     ====================== ==================================================
301     Target Feature         Description
302     ====================== ==================================================
303     -m[no-]xnack           Enable/disable generating code that has
304                            memory clauses that are compatible with
305                            having XNACK replay enabled.
306
307                            This is used for demand paging and page
308                            migration. If XNACK replay is enabled in
309                            the device, then if a page fault occurs
310                            the code may execute incorrectly if the
311                            ``xnack`` feature is not enabled. Executing
312                            code that has the feature enabled on a
313                            device that does not have XNACK replay
314                            enabled will execute correctly but may
315                            be less performant than code with the
316                            feature disabled.
317
318     -m[no-]sram-ecc        Enable/disable generating code that assumes SRAM
319                            ECC is enabled/disabled.
320
321     -m[no-]wavefrontsize64 Control the default wavefront size used when
322                            generating code for kernels. When disabled
323                            native wavefront size 32 is used, when enabled
324                            wavefront size 64 is used.
325
326     -m[no-]cumode          Control the default wavefront execution mode used
327                            when generating code for kernels. When disabled
328                            native WGP wavefront execution mode is used,
329                            when enabled CU wavefront execution mode is used
330                            (see :ref:`amdgpu-amdhsa-memory-model`).
331     ====================== ==================================================
332
333.. _amdgpu-address-spaces:
334
335Address Spaces
336--------------
337
338The AMDGPU architecture supports a number of memory address spaces. The address
339space names use the OpenCL standard names, with some additions.
340
341The AMDGPU address spaces correspond to target architecture specific LLVM
342address space numbers used in LLVM IR.
343
344The AMDGPU address spaces are described in
345:ref:`amdgpu-address-spaces-table`. Only 64-bit process address spaces are
346supported for the ``amdgcn`` target.
347
348  .. table:: AMDGPU Address Spaces
349     :name: amdgpu-address-spaces-table
350
351     ================================= =============== =========== ================ ======= ============================
352     ..                                                                                     64-Bit Process Address Space
353     --------------------------------- --------------- ----------- ---------------- ------------------------------------
354     Address Space Name                LLVM IR Address HSA Segment Hardware         Address NULL Value
355                                       Space Number    Name        Name             Size
356     ================================= =============== =========== ================ ======= ============================
357     Generic                           0               flat        flat             64      0x0000000000000000
358     Global                            1               global      global           64      0x0000000000000000
359     Region                            2               N/A         GDS              32      *not implemented for AMDHSA*
360     Local                             3               group       LDS              32      0xFFFFFFFF
361     Constant                          4               constant    *same as global* 64      0x0000000000000000
362     Private                           5               private     scratch          32      0xFFFFFFFF
363     Constant 32-bit                   6               *TODO*                               0x00000000
364     Buffer Fat Pointer (experimental) 7               *TODO*
365     ================================= =============== =========== ================ ======= ============================
366
367**Generic**
368  The generic address space uses the hardware flat address support available in
369  GFX7-GFX10. This uses two fixed ranges of virtual addresses (the private and
370  local apertures), that are outside the range of addressable global memory, to
371  map from a flat address to a private or local address.
372
373  FLAT instructions can take a flat address and access global, private
374  (scratch), and group (LDS) memory depending on if the address is within one
375  of the aperture ranges. Flat access to scratch requires hardware aperture
376  setup and setup in the kernel prologue (see
377  :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`). Flat access to LDS requires
378  hardware aperture setup and M0 (GFX7-GFX8) register setup (see
379  :ref:`amdgpu-amdhsa-kernel-prolog-m0`).
380
381  To convert between a private or group address space address (termed a segment
382  address) and a flat address the base address of the corresponding aperture
383  can be used. For GFX7-GFX8 these are available in the
384  :ref:`amdgpu-amdhsa-hsa-aql-queue` the address of which can be obtained with
385  Queue Ptr SGPR (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). For
386  GFX9-GFX10 the aperture base addresses are directly available as inline
387  constant registers ``SRC_SHARED_BASE/LIMIT`` and ``SRC_PRIVATE_BASE/LIMIT``.
388  In 64-bit address mode the aperture sizes are 2^32 bytes and the base is
389  aligned to 2^32 which makes it easier to convert from flat to segment or
390  segment to flat.
391
392  A global address space address has the same value when used as a flat address
393  so no conversion is needed.
394
395**Global and Constant**
396  The global and constant address spaces both use global virtual addresses,
397  which are the same virtual address space used by the CPU. However, some
398  virtual addresses may only be accessible to the CPU, some only accessible
399  by the GPU, and some by both.
400
401  Using the constant address space indicates that the data will not change
402  during the execution of the kernel. This allows scalar read instructions to
403  be used. The vector and scalar L1 caches are invalidated of volatile data
404  before each kernel dispatch execution to allow constant memory to change
405  values between kernel dispatches.
406
407**Region**
408  The region address space uses the hardware Global Data Store (GDS). All
409  wavefronts executing on the same device will access the same memory for any
410  given region address. However, the same region address accessed by wavefronts
411  executing on different devices will access different memory. It is higher
412  performance than global memory. It is allocated by the runtime. The data
413  store (DS) instructions can be used to access it.
414
415**Local**
416  The local address space uses the hardware Local Data Store (LDS) which is
417  automatically allocated when the hardware creates the wavefronts of a
418  work-group, and freed when all the wavefronts of a work-group have
419  terminated. All wavefronts belonging to the same work-group will access the
420  same memory for any given local address. However, the same local address
421  accessed by wavefronts belonging to different work-groups will access
422  different memory. It is higher performance than global memory. The data store
423  (DS) instructions can be used to access it.
424
425**Private**
426  The private address space uses the hardware scratch memory support which
427  automatically allocates memory when it creates a wavefront and frees it when
428  a wavefronts terminates. The memory accessed by a lane of a wavefront for any
429  given private address will be different to the memory accessed by another lane
430  of the same or different wavefront for the same private address.
431
432  If a kernel dispatch uses scratch, then the hardware allocates memory from a
433  pool of backing memory allocated by the runtime for each wavefront. The lanes
434  of the wavefront access this using dword (4 byte) interleaving. The mapping
435  used from private address to backing memory address is:
436
437    ``wavefront-scratch-base +
438    ((private-address / 4) * wavefront-size * 4) +
439    (wavefront-lane-id * 4) + (private-address % 4)``
440
441  If each lane of a wavefront accesses the same private address, the
442  interleaving results in adjacent dwords being accessed and hence requires
443  fewer cache lines to be fetched.
444
445  There are different ways that the wavefront scratch base address is
446  determined by a wavefront (see
447  :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
448
449  Scratch memory can be accessed in an interleaved manner using buffer
450  instructions with the scratch buffer descriptor and per wavefront scratch
451  offset, by the scratch instructions, or by flat instructions. Multi-dword
452  access is not supported except by flat and scratch instructions in
453  GFX9-GFX10.
454
455**Constant 32-bit**
456  *TODO*
457
458**Buffer Fat Pointer**
459  The buffer fat pointer is an experimental address space that is currently
460  unsupported in the backend. It exposes a non-integral pointer that is in
461  the future intended to support the modelling of 128-bit buffer descriptors
462  plus a 32-bit offset into the buffer (in total encapsulating a 160-bit
463  *pointer*), allowing normal LLVM load/store/atomic operations to be used to
464  model the buffer descriptors used heavily in graphics workloads targeting
465  the backend.
466
467.. _amdgpu-memory-scopes:
468
469Memory Scopes
470-------------
471
472This section provides LLVM memory synchronization scopes supported by the AMDGPU
473backend memory model when the target triple OS is ``amdhsa`` (see
474:ref:`amdgpu-amdhsa-memory-model` and :ref:`amdgpu-target-triples`).
475
476The memory model supported is based on the HSA memory model [HSA]_ which is
477based in turn on HRF-indirect with scope inclusion [HRF]_. The happens-before
478relation is transitive over the synchronizes-with relation independent of scope
479and synchronizes-with allows the memory scope instances to be inclusive (see
480table :ref:`amdgpu-amdhsa-llvm-sync-scopes-table`).
481
482This is different to the OpenCL [OpenCL]_ memory model which does not have scope
483inclusion and requires the memory scopes to exactly match. However, this
484is conservatively correct for OpenCL.
485
486  .. table:: AMDHSA LLVM Sync Scopes
487     :name: amdgpu-amdhsa-llvm-sync-scopes-table
488
489     ======================= ===================================================
490     LLVM Sync Scope         Description
491     ======================= ===================================================
492     *none*                  The default: ``system``.
493
494                             Synchronizes with, and participates in modification
495                             and seq_cst total orderings with, other operations
496                             (except image operations) for all address spaces
497                             (except private, or generic that accesses private)
498                             provided the other operation's sync scope is:
499
500                             - ``system``.
501                             - ``agent`` and executed by a thread on the same
502                               agent.
503                             - ``workgroup`` and executed by a thread in the
504                               same work-group.
505                             - ``wavefront`` and executed by a thread in the
506                               same wavefront.
507
508     ``agent``               Synchronizes with, and participates in modification
509                             and seq_cst total orderings with, other operations
510                             (except image operations) for all address spaces
511                             (except private, or generic that accesses private)
512                             provided the other operation's sync scope is:
513
514                             - ``system`` or ``agent`` and executed by a thread
515                               on the same agent.
516                             - ``workgroup`` and executed by a thread in the
517                               same work-group.
518                             - ``wavefront`` and executed by a thread in the
519                               same wavefront.
520
521     ``workgroup``           Synchronizes with, and participates in modification
522                             and seq_cst total orderings with, other operations
523                             (except image operations) for all address spaces
524                             (except private, or generic that accesses private)
525                             provided the other operation's sync scope is:
526
527                             - ``system``, ``agent`` or ``workgroup`` and
528                               executed by a thread in the same work-group.
529                             - ``wavefront`` and executed by a thread in the
530                               same wavefront.
531
532     ``wavefront``           Synchronizes with, and participates in modification
533                             and seq_cst total orderings with, other operations
534                             (except image operations) for all address spaces
535                             (except private, or generic that accesses private)
536                             provided the other operation's sync scope is:
537
538                             - ``system``, ``agent``, ``workgroup`` or
539                               ``wavefront`` and executed by a thread in the
540                               same wavefront.
541
542     ``singlethread``        Only synchronizes with and participates in
543                             modification and seq_cst total orderings with,
544                             other operations (except image operations) running
545                             in the same thread for all address spaces (for
546                             example, in signal handlers).
547
548     ``one-as``              Same as ``system`` but only synchronizes with other
549                             operations within the same address space.
550
551     ``agent-one-as``        Same as ``agent`` but only synchronizes with other
552                             operations within the same address space.
553
554     ``workgroup-one-as``    Same as ``workgroup`` but only synchronizes with
555                             other operations within the same address space.
556
557     ``wavefront-one-as``    Same as ``wavefront`` but only synchronizes with
558                             other operations within the same address space.
559
560     ``singlethread-one-as`` Same as ``singlethread`` but only synchronizes with
561                             other operations within the same address space.
562     ======================= ===================================================
563
564LLVM IR Intrinsics
565------------------
566
567The AMDGPU backend implements the following LLVM IR intrinsics.
568
569*This section is WIP.*
570
571.. TODO::
572
573   List AMDGPU intrinsics.
574
575LLVM IR Attributes
576------------------
577
578The AMDGPU backend supports the following LLVM IR attributes.
579
580  .. table:: AMDGPU LLVM IR Attributes
581     :name: amdgpu-llvm-ir-attributes-table
582
583     ======================================= ==========================================================
584     LLVM Attribute                          Description
585     ======================================= ==========================================================
586     "amdgpu-flat-work-group-size"="min,max" Specify the minimum and maximum flat work group sizes that
587                                             will be specified when the kernel is dispatched. Generated
588                                             by the ``amdgpu_flat_work_group_size`` CLANG attribute [CLANG-ATTR]_.
589     "amdgpu-implicitarg-num-bytes"="n"      Number of kernel argument bytes to add to the kernel
590                                             argument block size for the implicit arguments. This
591                                             varies by OS and language (for OpenCL see
592                                             :ref:`opencl-kernel-implicit-arguments-appended-for-amdhsa-os-table`).
593     "amdgpu-num-sgpr"="n"                   Specifies the number of SGPRs to use. Generated by
594                                             the ``amdgpu_num_sgpr`` CLANG attribute [CLANG-ATTR]_.
595     "amdgpu-num-vgpr"="n"                   Specifies the number of VGPRs to use. Generated by the
596                                             ``amdgpu_num_vgpr`` CLANG attribute [CLANG-ATTR]_.
597     "amdgpu-waves-per-eu"="m,n"             Specify the minimum and maximum number of waves per
598                                             execution unit. Generated by the ``amdgpu_waves_per_eu``
599                                             CLANG attribute [CLANG-ATTR]_.
600     "amdgpu-ieee" true/false.               Specify whether the function expects the IEEE field of the
601                                             mode register to be set on entry. Overrides the default for
602                                             the calling convention.
603     "amdgpu-dx10-clamp" true/false.         Specify whether the function expects the DX10_CLAMP field of
604                                             the mode register to be set on entry. Overrides the default
605                                             for the calling convention.
606     ======================================= ==========================================================
607
608.. _amdgpu-elf-code-object:
609
610ELF Code Object
611===============
612
613The AMDGPU backend generates a standard ELF [ELF]_ relocatable code object that
614can be linked by ``lld`` to produce a standard ELF shared code object which can
615be loaded and executed on an AMDGPU target.
616
617.. _amdgpu-elf-header:
618
619Header
620------
621
622The AMDGPU backend uses the following ELF header:
623
624  .. table:: AMDGPU ELF Header
625     :name: amdgpu-elf-header-table
626
627     ========================== ===============================
628     Field                      Value
629     ========================== ===============================
630     ``e_ident[EI_CLASS]``      ``ELFCLASS64``
631     ``e_ident[EI_DATA]``       ``ELFDATA2LSB``
632     ``e_ident[EI_OSABI]``      - ``ELFOSABI_NONE``
633                                - ``ELFOSABI_AMDGPU_HSA``
634                                - ``ELFOSABI_AMDGPU_PAL``
635                                - ``ELFOSABI_AMDGPU_MESA3D``
636     ``e_ident[EI_ABIVERSION]`` - ``ELFABIVERSION_AMDGPU_HSA``
637                                - ``ELFABIVERSION_AMDGPU_PAL``
638                                - ``ELFABIVERSION_AMDGPU_MESA3D``
639     ``e_type``                 - ``ET_REL``
640                                - ``ET_DYN``
641     ``e_machine``              ``EM_AMDGPU``
642     ``e_entry``                0
643     ``e_flags``                See :ref:`amdgpu-elf-header-e_flags-table`
644     ========================== ===============================
645
646..
647
648  .. table:: AMDGPU ELF Header Enumeration Values
649     :name: amdgpu-elf-header-enumeration-values-table
650
651     =============================== =====
652     Name                            Value
653     =============================== =====
654     ``EM_AMDGPU``                   224
655     ``ELFOSABI_NONE``               0
656     ``ELFOSABI_AMDGPU_HSA``         64
657     ``ELFOSABI_AMDGPU_PAL``         65
658     ``ELFOSABI_AMDGPU_MESA3D``      66
659     ``ELFABIVERSION_AMDGPU_HSA``    1
660     ``ELFABIVERSION_AMDGPU_PAL``    0
661     ``ELFABIVERSION_AMDGPU_MESA3D`` 0
662     =============================== =====
663
664``e_ident[EI_CLASS]``
665  The ELF class is:
666
667  * ``ELFCLASS32`` for ``r600`` architecture.
668
669  * ``ELFCLASS64`` for ``amdgcn`` architecture which only supports 64-bit
670    process address space applications.
671
672``e_ident[EI_DATA]``
673  All AMDGPU targets use ``ELFDATA2LSB`` for little-endian byte ordering.
674
675``e_ident[EI_OSABI]``
676  One of the following AMDGPU target architecture specific OS ABIs
677  (see :ref:`amdgpu-os-table`):
678
679  * ``ELFOSABI_NONE`` for *unknown* OS.
680
681  * ``ELFOSABI_AMDGPU_HSA`` for ``amdhsa`` OS.
682
683  * ``ELFOSABI_AMDGPU_PAL`` for ``amdpal`` OS.
684
685  * ``ELFOSABI_AMDGPU_MESA3D`` for ``mesa3D`` OS.
686
687``e_ident[EI_ABIVERSION]``
688  The ABI version of the AMDGPU target architecture specific OS ABI to which the code
689  object conforms:
690
691  * ``ELFABIVERSION_AMDGPU_HSA`` is used to specify the version of AMD HSA
692    runtime ABI.
693
694  * ``ELFABIVERSION_AMDGPU_PAL`` is used to specify the version of AMD PAL
695    runtime ABI.
696
697  * ``ELFABIVERSION_AMDGPU_MESA3D`` is used to specify the version of AMD MESA
698    3D runtime ABI.
699
700``e_type``
701  Can be one of the following values:
702
703
704  ``ET_REL``
705    The type produced by the AMDGPU backend compiler as it is relocatable code
706    object.
707
708  ``ET_DYN``
709    The type produced by the linker as it is a shared code object.
710
711  The AMD HSA runtime loader requires a ``ET_DYN`` code object.
712
713``e_machine``
714  The value ``EM_AMDGPU`` is used for the machine for all processors supported
715  by the ``r600`` and ``amdgcn`` architectures (see
716  :ref:`amdgpu-processor-table`). The specific processor is specified in the
717  ``EF_AMDGPU_MACH`` bit field of the ``e_flags`` (see
718  :ref:`amdgpu-elf-header-e_flags-table`).
719
720``e_entry``
721  The entry point is 0 as the entry points for individual kernels must be
722  selected in order to invoke them through AQL packets.
723
724``e_flags``
725  The AMDGPU backend uses the following ELF header flags:
726
727  .. table:: AMDGPU ELF Header ``e_flags``
728     :name: amdgpu-elf-header-e_flags-table
729
730     ================================= ========== =============================
731     Name                              Value      Description
732     ================================= ========== =============================
733     **AMDGPU Processor Flag**                    See :ref:`amdgpu-processor-table`.
734     -------------------------------------------- -----------------------------
735     ``EF_AMDGPU_MACH``                0x000000ff AMDGPU processor selection
736                                                  mask for
737                                                  ``EF_AMDGPU_MACH_xxx`` values
738                                                  defined in
739                                                  :ref:`amdgpu-ef-amdgpu-mach-table`.
740     ``EF_AMDGPU_XNACK``               0x00000100 Indicates if the ``xnack``
741                                                  target feature is
742                                                  enabled for all code
743                                                  contained in the code object.
744                                                  If the processor
745                                                  does not support the
746                                                  ``xnack`` target
747                                                  feature then must
748                                                  be 0.
749                                                  See
750                                                  :ref:`amdgpu-target-features`.
751     ``EF_AMDGPU_SRAM_ECC``            0x00000200 Indicates if the ``sram-ecc``
752                                                  target feature is
753                                                  enabled for all code
754                                                  contained in the code object.
755                                                  If the processor
756                                                  does not support the
757                                                  ``sram-ecc`` target
758                                                  feature then must
759                                                  be 0.
760                                                  See
761                                                  :ref:`amdgpu-target-features`.
762     ================================= ========== =============================
763
764  .. table:: AMDGPU ``EF_AMDGPU_MACH`` Values
765     :name: amdgpu-ef-amdgpu-mach-table
766
767     ================================= ========== =============================
768     Name                              Value      Description (see
769                                                  :ref:`amdgpu-processor-table`)
770     ================================= ========== =============================
771     ``EF_AMDGPU_MACH_NONE``           0x000      *not specified*
772     ``EF_AMDGPU_MACH_R600_R600``      0x001      ``r600``
773     ``EF_AMDGPU_MACH_R600_R630``      0x002      ``r630``
774     ``EF_AMDGPU_MACH_R600_RS880``     0x003      ``rs880``
775     ``EF_AMDGPU_MACH_R600_RV670``     0x004      ``rv670``
776     ``EF_AMDGPU_MACH_R600_RV710``     0x005      ``rv710``
777     ``EF_AMDGPU_MACH_R600_RV730``     0x006      ``rv730``
778     ``EF_AMDGPU_MACH_R600_RV770``     0x007      ``rv770``
779     ``EF_AMDGPU_MACH_R600_CEDAR``     0x008      ``cedar``
780     ``EF_AMDGPU_MACH_R600_CYPRESS``   0x009      ``cypress``
781     ``EF_AMDGPU_MACH_R600_JUNIPER``   0x00a      ``juniper``
782     ``EF_AMDGPU_MACH_R600_REDWOOD``   0x00b      ``redwood``
783     ``EF_AMDGPU_MACH_R600_SUMO``      0x00c      ``sumo``
784     ``EF_AMDGPU_MACH_R600_BARTS``     0x00d      ``barts``
785     ``EF_AMDGPU_MACH_R600_CAICOS``    0x00e      ``caicos``
786     ``EF_AMDGPU_MACH_R600_CAYMAN``    0x00f      ``cayman``
787     ``EF_AMDGPU_MACH_R600_TURKS``     0x010      ``turks``
788     *reserved*                        0x011 -    Reserved for ``r600``
789                                       0x01f      architecture processors.
790     ``EF_AMDGPU_MACH_AMDGCN_GFX600``  0x020      ``gfx600``
791     ``EF_AMDGPU_MACH_AMDGCN_GFX601``  0x021      ``gfx601``
792     ``EF_AMDGPU_MACH_AMDGCN_GFX700``  0x022      ``gfx700``
793     ``EF_AMDGPU_MACH_AMDGCN_GFX701``  0x023      ``gfx701``
794     ``EF_AMDGPU_MACH_AMDGCN_GFX702``  0x024      ``gfx702``
795     ``EF_AMDGPU_MACH_AMDGCN_GFX703``  0x025      ``gfx703``
796     ``EF_AMDGPU_MACH_AMDGCN_GFX704``  0x026      ``gfx704``
797     *reserved*                        0x027      Reserved.
798     ``EF_AMDGPU_MACH_AMDGCN_GFX801``  0x028      ``gfx801``
799     ``EF_AMDGPU_MACH_AMDGCN_GFX802``  0x029      ``gfx802``
800     ``EF_AMDGPU_MACH_AMDGCN_GFX803``  0x02a      ``gfx803``
801     ``EF_AMDGPU_MACH_AMDGCN_GFX810``  0x02b      ``gfx810``
802     ``EF_AMDGPU_MACH_AMDGCN_GFX900``  0x02c      ``gfx900``
803     ``EF_AMDGPU_MACH_AMDGCN_GFX902``  0x02d      ``gfx902``
804     ``EF_AMDGPU_MACH_AMDGCN_GFX904``  0x02e      ``gfx904``
805     ``EF_AMDGPU_MACH_AMDGCN_GFX906``  0x02f      ``gfx906``
806     ``EF_AMDGPU_MACH_AMDGCN_GFX908``  0x030      ``gfx908``
807     ``EF_AMDGPU_MACH_AMDGCN_GFX909``  0x031      ``gfx909``
808     *reserved*                        0x032      Reserved.
809     ``EF_AMDGPU_MACH_AMDGCN_GFX1010`` 0x033      ``gfx1010``
810     ``EF_AMDGPU_MACH_AMDGCN_GFX1011`` 0x034      ``gfx1011``
811     ``EF_AMDGPU_MACH_AMDGCN_GFX1012`` 0x035      ``gfx1012``
812     ``EF_AMDGPU_MACH_AMDGCN_GFX1030`` 0x036      ``gfx1030``
813     ================================= ========== =============================
814
815Sections
816--------
817
818An AMDGPU target ELF code object has the standard ELF sections which include:
819
820  .. table:: AMDGPU ELF Sections
821     :name: amdgpu-elf-sections-table
822
823     ================== ================ =================================
824     Name               Type             Attributes
825     ================== ================ =================================
826     ``.bss``           ``SHT_NOBITS``   ``SHF_ALLOC`` + ``SHF_WRITE``
827     ``.data``          ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
828     ``.debug_``\ *\**  ``SHT_PROGBITS`` *none*
829     ``.dynamic``       ``SHT_DYNAMIC``  ``SHF_ALLOC``
830     ``.dynstr``        ``SHT_PROGBITS`` ``SHF_ALLOC``
831     ``.dynsym``        ``SHT_PROGBITS`` ``SHF_ALLOC``
832     ``.got``           ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
833     ``.hash``          ``SHT_HASH``     ``SHF_ALLOC``
834     ``.note``          ``SHT_NOTE``     *none*
835     ``.rela``\ *name*  ``SHT_RELA``     *none*
836     ``.rela.dyn``      ``SHT_RELA``     *none*
837     ``.rodata``        ``SHT_PROGBITS`` ``SHF_ALLOC``
838     ``.shstrtab``      ``SHT_STRTAB``   *none*
839     ``.strtab``        ``SHT_STRTAB``   *none*
840     ``.symtab``        ``SHT_SYMTAB``   *none*
841     ``.text``          ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_EXECINSTR``
842     ================== ================ =================================
843
844These sections have their standard meanings (see [ELF]_) and are only generated
845if needed.
846
847``.debug``\ *\**
848  The standard DWARF sections. See :ref:`amdgpu-dwarf-debug-information` for
849  information on the DWARF produced by the AMDGPU backend.
850
851``.dynamic``, ``.dynstr``, ``.dynsym``, ``.hash``
852  The standard sections used by a dynamic loader.
853
854``.note``
855  See :ref:`amdgpu-note-records` for the note records supported by the AMDGPU
856  backend.
857
858``.rela``\ *name*, ``.rela.dyn``
859  For relocatable code objects, *name* is the name of the section that the
860  relocation records apply. For example, ``.rela.text`` is the section name for
861  relocation records associated with the ``.text`` section.
862
863  For linked shared code objects, ``.rela.dyn`` contains all the relocation
864  records from each of the relocatable code object's ``.rela``\ *name* sections.
865
866  See :ref:`amdgpu-relocation-records` for the relocation records supported by
867  the AMDGPU backend.
868
869``.text``
870  The executable machine code for the kernels and functions they call. Generated
871  as position independent code. See :ref:`amdgpu-code-conventions` for
872  information on conventions used in the isa generation.
873
874.. _amdgpu-note-records:
875
876Note Records
877------------
878
879The AMDGPU backend code object contains ELF note records in the ``.note``
880section. The set of generated notes and their semantics depend on the code
881object version; see :ref:`amdgpu-note-records-v2` and
882:ref:`amdgpu-note-records-v3`.
883
884As required by ``ELFCLASS32`` and ``ELFCLASS64``, minimal zero-byte padding
885must be generated after the ``name`` field to ensure the ``desc`` field is 4
886byte aligned. In addition, minimal zero-byte padding must be generated to
887ensure the ``desc`` field size is a multiple of 4 bytes. The ``sh_addralign``
888field of the ``.note`` section must be at least 4 to indicate at least 8 byte
889alignment.
890
891.. _amdgpu-note-records-v2:
892
893Code Object V2 Note Records (-mattr=-code-object-v3)
894~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
895
896.. warning:: Code Object V2 is not the default code object version emitted by
897  this version of LLVM. For a description of the notes generated with the
898  default configuration (Code Object V3) see :ref:`amdgpu-note-records-v3`.
899
900The AMDGPU backend code object uses the following ELF note record in the
901``.note`` section when compiling for Code Object V2 (-mattr=-code-object-v3).
902
903Additional note records may be present, but any which are not documented here
904are deprecated and should not be used.
905
906  .. table:: AMDGPU Code Object V2 ELF Note Records
907     :name: amdgpu-elf-note-records-table-v2
908
909     ===== ============================== ======================================
910     Name  Type                           Description
911     ===== ============================== ======================================
912     "AMD" ``NT_AMD_AMDGPU_HSA_METADATA`` <metadata null terminated string>
913     ===== ============================== ======================================
914
915..
916
917  .. table:: AMDGPU Code Object V2 ELF Note Record Enumeration Values
918     :name: amdgpu-elf-note-record-enumeration-values-table-v2
919
920     ============================== =====
921     Name                           Value
922     ============================== =====
923     *reserved*                       0-9
924     ``NT_AMD_AMDGPU_HSA_METADATA``    10
925     *reserved*                        11
926     ============================== =====
927
928``NT_AMD_AMDGPU_HSA_METADATA``
929  Specifies extensible metadata associated with the code objects executed on HSA
930  [HSA]_ compatible runtimes such as AMD's ROCm [AMD-ROCm]_. It is required when
931  the target triple OS is ``amdhsa`` (see :ref:`amdgpu-target-triples`). See
932  :ref:`amdgpu-amdhsa-code-object-metadata-v2` for the syntax of the code
933  object metadata string.
934
935.. _amdgpu-note-records-v3:
936
937Code Object V3 Note Records (-mattr=+code-object-v3)
938~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
939
940The AMDGPU backend code object uses the following ELF note record in the
941``.note`` section when compiling for Code Object V3 (-mattr=+code-object-v3).
942
943Additional note records may be present, but any which are not documented here
944are deprecated and should not be used.
945
946  .. table:: AMDGPU Code Object V3 ELF Note Records
947     :name: amdgpu-elf-note-records-table-v3
948
949     ======== ============================== ======================================
950     Name     Type                           Description
951     ======== ============================== ======================================
952     "AMDGPU" ``NT_AMDGPU_METADATA``         Metadata in Message Pack [MsgPack]_
953                                             binary format.
954     ======== ============================== ======================================
955
956..
957
958  .. table:: AMDGPU Code Object V3 ELF Note Record Enumeration Values
959     :name: amdgpu-elf-note-record-enumeration-values-table-v3
960
961     ============================== =====
962     Name                           Value
963     ============================== =====
964     *reserved*                     0-31
965     ``NT_AMDGPU_METADATA``         32
966     ============================== =====
967
968``NT_AMDGPU_METADATA``
969  Specifies extensible metadata associated with an AMDGPU code
970  object. It is encoded as a map in the Message Pack [MsgPack]_ binary
971  data format. See :ref:`amdgpu-amdhsa-code-object-metadata-v3` for the
972  map keys defined for the ``amdhsa`` OS.
973
974.. _amdgpu-symbols:
975
976Symbols
977-------
978
979Symbols include the following:
980
981  .. table:: AMDGPU ELF Symbols
982     :name: amdgpu-elf-symbols-table
983
984     ===================== ================== ================ ==================
985     Name                  Type               Section          Description
986     ===================== ================== ================ ==================
987     *link-name*           ``STT_OBJECT``     - ``.data``      Global variable
988                                              - ``.rodata``
989                                              - ``.bss``
990     *link-name*\ ``.kd``  ``STT_OBJECT``     - ``.rodata``    Kernel descriptor
991     *link-name*           ``STT_FUNC``       - ``.text``      Kernel entry point
992     *link-name*           ``STT_OBJECT``     - SHN_AMDGPU_LDS Global variable in LDS
993     ===================== ================== ================ ==================
994
995Global variable
996  Global variables both used and defined by the compilation unit.
997
998  If the symbol is defined in the compilation unit then it is allocated in the
999  appropriate section according to if it has initialized data or is readonly.
1000
1001  If the symbol is external then its section is ``STN_UNDEF`` and the loader
1002  will resolve relocations using the definition provided by another code object
1003  or explicitly defined by the runtime.
1004
1005  If the symbol resides in local/group memory (LDS) then its section is the
1006  special processor specific section name ``SHN_AMDGPU_LDS``, and the
1007  ``st_value`` field describes alignment requirements as it does for common
1008  symbols.
1009
1010  .. TODO::
1011
1012     Add description of linked shared object symbols. Seems undefined symbols
1013     are marked as STT_NOTYPE.
1014
1015Kernel descriptor
1016  Every HSA kernel has an associated kernel descriptor. It is the address of the
1017  kernel descriptor that is used in the AQL dispatch packet used to invoke the
1018  kernel, not the kernel entry point. The layout of the HSA kernel descriptor is
1019  defined in :ref:`amdgpu-amdhsa-kernel-descriptor`.
1020
1021Kernel entry point
1022  Every HSA kernel also has a symbol for its machine code entry point.
1023
1024.. _amdgpu-relocation-records:
1025
1026Relocation Records
1027------------------
1028
1029AMDGPU backend generates ``Elf64_Rela`` relocation records. Supported
1030relocatable fields are:
1031
1032``word32``
1033  This specifies a 32-bit field occupying 4 bytes with arbitrary byte
1034  alignment. These values use the same byte order as other word values in the
1035  AMDGPU architecture.
1036
1037``word64``
1038  This specifies a 64-bit field occupying 8 bytes with arbitrary byte
1039  alignment. These values use the same byte order as other word values in the
1040  AMDGPU architecture.
1041
1042Following notations are used for specifying relocation calculations:
1043
1044**A**
1045  Represents the addend used to compute the value of the relocatable field.
1046
1047**G**
1048  Represents the offset into the global offset table at which the relocation
1049  entry's symbol will reside during execution.
1050
1051**GOT**
1052  Represents the address of the global offset table.
1053
1054**P**
1055  Represents the place (section offset for ``et_rel`` or address for ``et_dyn``)
1056  of the storage unit being relocated (computed using ``r_offset``).
1057
1058**S**
1059  Represents the value of the symbol whose index resides in the relocation
1060  entry. Relocations not using this must specify a symbol index of
1061  ``STN_UNDEF``.
1062
1063**B**
1064  Represents the base address of a loaded executable or shared object which is
1065  the difference between the ELF address and the actual load address.
1066  Relocations using this are only valid in executable or shared objects.
1067
1068The following relocation types are supported:
1069
1070  .. table:: AMDGPU ELF Relocation Records
1071     :name: amdgpu-elf-relocation-records-table
1072
1073     ========================== ======= =====  ==========  ==============================
1074     Relocation Type            Kind    Value  Field       Calculation
1075     ========================== ======= =====  ==========  ==============================
1076     ``R_AMDGPU_NONE``                  0      *none*      *none*
1077     ``R_AMDGPU_ABS32_LO``      Static, 1      ``word32``  (S + A) & 0xFFFFFFFF
1078                                Dynamic
1079     ``R_AMDGPU_ABS32_HI``      Static, 2      ``word32``  (S + A) >> 32
1080                                Dynamic
1081     ``R_AMDGPU_ABS64``         Static, 3      ``word64``  S + A
1082                                Dynamic
1083     ``R_AMDGPU_REL32``         Static  4      ``word32``  S + A - P
1084     ``R_AMDGPU_REL64``         Static  5      ``word64``  S + A - P
1085     ``R_AMDGPU_ABS32``         Static, 6      ``word32``  S + A
1086                                Dynamic
1087     ``R_AMDGPU_GOTPCREL``      Static  7      ``word32``  G + GOT + A - P
1088     ``R_AMDGPU_GOTPCREL32_LO`` Static  8      ``word32``  (G + GOT + A - P) & 0xFFFFFFFF
1089     ``R_AMDGPU_GOTPCREL32_HI`` Static  9      ``word32``  (G + GOT + A - P) >> 32
1090     ``R_AMDGPU_REL32_LO``      Static  10     ``word32``  (S + A - P) & 0xFFFFFFFF
1091     ``R_AMDGPU_REL32_HI``      Static  11     ``word32``  (S + A - P) >> 32
1092     *reserved*                         12
1093     ``R_AMDGPU_RELATIVE64``    Dynamic 13     ``word64``  B + A
1094     ========================== ======= =====  ==========  ==============================
1095
1096``R_AMDGPU_ABS32_LO`` and ``R_AMDGPU_ABS32_HI`` are only supported by
1097the ``mesa3d`` OS, which does not support ``R_AMDGPU_ABS64``.
1098
1099There is no current OS loader support for 32-bit programs and so
1100``R_AMDGPU_ABS32`` is not used.
1101
1102.. _amdgpu-loaded-code-object-path-uniform-resource-identifier:
1103
1104Loaded Code Object Path Uniform Resource Identifier (URI)
1105---------------------------------------------------------
1106
1107The AMD GPU code object loader represents the path of the ELF shared object from
1108which the code object was loaded as a textual Unifom Resource Identifier (URI).
1109Note that the code object is the in memory loaded relocated form of the ELF
1110shared object.  Multiple code objects may be loaded at different memory
1111addresses in the same process from the same ELF shared object.
1112
1113The loaded code object path URI syntax is defined by the following BNF syntax:
1114
1115.. code::
1116
1117  code_object_uri ::== file_uri | memory_uri
1118  file_uri        ::== "file://" file_path [ range_specifier ]
1119  memory_uri      ::== "memory://" process_id range_specifier
1120  range_specifier ::== [ "#" | "?" ] "offset=" number "&" "size=" number
1121  file_path       ::== URI_ENCODED_OS_FILE_PATH
1122  process_id      ::== DECIMAL_NUMBER
1123  number          ::== HEX_NUMBER | DECIMAL_NUMBER | OCTAL_NUMBER
1124
1125**number**
1126  Is a C integral literal where hexadecimal values are prefixed by "0x" or "0X",
1127  and octal values by "0".
1128
1129**file_path**
1130  Is the file's path specified as a URI encoded UTF-8 string. In URI encoding,
1131  every character that is not in the regular expression ``[a-zA-Z0-9/_.~-]`` is
1132  encoded as two uppercase hexidecimal digits proceeded by "%".  Directories in
1133  the path are separated by "/".
1134
1135**offset**
1136  Is a 0-based byte offset to the start of the code object.  For a file URI, it
1137  is from the start of the file specified by the ``file_path``, and if omitted
1138  defaults to 0. For a memory URI, it is the memory address and is required.
1139
1140**size**
1141  Is the number of bytes in the code object.  For a file URI, if omitted it
1142  defaults to the size of the file.  It is required for a memory URI.
1143
1144**process_id**
1145  Is the identity of the process owning the memory.  For Linux it is the C
1146  unsigned integral decimal literal for the process ID (PID).
1147
1148For example:
1149
1150.. code::
1151
1152  file:///dir1/dir2/file1
1153  file:///dir3/dir4/file2#offset=0x2000&size=3000
1154  memory://1234#offset=0x20000&size=3000
1155
1156.. _amdgpu-dwarf-debug-information:
1157
1158DWARF Debug Information
1159=======================
1160
1161.. warning::
1162
1163   This section describes a **provisional proposal** for AMDGPU DWARF [DWARF]_
1164   that is not currently fully implemented and is subject to change.
1165
1166AMDGPU generates DWARF [DWARF]_ debugging information ELF sections (see
1167:ref:`amdgpu-elf-code-object`) which contain information that maps the code
1168object executable code and data to the source language constructs. It can be
1169used by tools such as debuggers and profilers. It uses features defined in
1170:doc:`AMDGPUDwarfProposalForHeterogeneousDebugging` that are made available in
1171DWARF Version 4 and DWARF Version 5 as an LLVM vendor extension.
1172
1173This section defines the AMDGPU target architecture specific DWARF mappings.
1174
1175.. _amdgpu-dwarf-register-identifier:
1176
1177Register Identifier
1178-------------------
1179
1180This section defines the AMDGPU target architecture register numbers used in
1181DWARF operation expressions (see DWARF Version 5 section 2.5 and
1182:ref:`amdgpu-dwarf-operation-expressions`) and Call Frame Information
1183instructions (see DWARF Version 5 section 6.4 and
1184:ref:`amdgpu-dwarf-call-frame-information`).
1185
1186A single code object can contain code for kernels that have different wavefront
1187sizes. The vector registers and some scalar registers are based on the wavefront
1188size. AMDGPU defines distinct DWARF registers for each wavefront size. This
1189simplifies the consumer of the DWARF so that each register has a fixed size,
1190rather than being dynamic according to the wavefront size mode. Similarly,
1191distinct DWARF registers are defined for those registers that vary in size
1192according to the process address size. This allows a consumer to treat a
1193specific AMDGPU processor as a single architecture regardless of how it is
1194configured at run time. The compiler explicitly specifies the DWARF registers
1195that match the mode in which the code it is generating will be executed.
1196
1197DWARF registers are encoded as numbers, which are mapped to architecture
1198registers. The mapping for AMDGPU is defined in
1199:ref:`amdgpu-dwarf-register-mapping-table`. All AMDGPU targets use the same
1200mapping.
1201
1202.. table:: AMDGPU DWARF Register Mapping
1203   :name: amdgpu-dwarf-register-mapping-table
1204
1205   ============== ================= ======== ==================================
1206   DWARF Register AMDGPU Register   Bit Size Description
1207   ============== ================= ======== ==================================
1208   0              PC_32             32       Program Counter (PC) when
1209                                             executing in a 32-bit process
1210                                             address space. Used in the CFI to
1211                                             describe the PC of the calling
1212                                             frame.
1213   1              EXEC_MASK_32      32       Execution Mask Register when
1214                                             executing in wavefront 32 mode.
1215   2-15           *Reserved*                 *Reserved for highly accessed
1216                                             registers using DWARF shortcut.*
1217   16             PC_64             64       Program Counter (PC) when
1218                                             executing in a 64-bit process
1219                                             address space. Used in the CFI to
1220                                             describe the PC of the calling
1221                                             frame.
1222   17             EXEC_MASK_64      64       Execution Mask Register when
1223                                             executing in wavefront 64 mode.
1224   18-31          *Reserved*                 *Reserved for highly accessed
1225                                             registers using DWARF shortcut.*
1226   32-95          SGPR0-SGPR63      32       Scalar General Purpose
1227                                             Registers.
1228   96-127         *Reserved*                 *Reserved for frequently accessed
1229                                             registers using DWARF 1-byte ULEB.*
1230   128            SCC               32       Scalar Condition Code Register.
1231   129-511        *Reserved*                 *Reserved for future Scalar
1232                                             Architectural Registers.*
1233   512            VCC_32            32       Vector Condition Code Register
1234                                             when executing in wavefront 32
1235                                             mode.
1236   513-1023       *Reserved*                 *Reserved for future Vector
1237                                             Architectural Registers when
1238                                             executing in wavefront 32 mode.*
1239   768            VCC_64            32       Vector Condition Code Register
1240                                             when executing in wavefront 64
1241                                             mode.
1242   769-1023       *Reserved*                 *Reserved for future Vector
1243                                             Architectural Registers when
1244                                             executing in wavefront 64 mode.*
1245   1024-1087      *Reserved*                 *Reserved for padding.*
1246   1088-1129      SGPR64-SGPR105    32       Scalar General Purpose Registers.
1247   1130-1535      *Reserved*                 *Reserved for future Scalar
1248                                             General Purpose Registers.*
1249   1536-1791      VGPR0-VGPR255     32*32    Vector General Purpose Registers
1250                                             when executing in wavefront 32
1251                                             mode.
1252   1792-2047      *Reserved*                 *Reserved for future Vector
1253                                             General Purpose Registers when
1254                                             executing in wavefront 32 mode.*
1255   2048-2303      AGPR0-AGPR255     32*32    Vector Accumulation Registers
1256                                             when executing in wavefront 32
1257                                             mode.
1258   2304-2559      *Reserved*                 *Reserved for future Vector
1259                                             Accumulation Registers when
1260                                             executing in wavefront 32 mode.*
1261   2560-2815      VGPR0-VGPR255     64*32    Vector General Purpose Registers
1262                                             when executing in wavefront 64
1263                                             mode.
1264   2816-3071      *Reserved*                 *Reserved for future Vector
1265                                             General Purpose Registers when
1266                                             executing in wavefront 64 mode.*
1267   3072-3327      AGPR0-AGPR255     64*32    Vector Accumulation Registers
1268                                             when executing in wavefront 64
1269                                             mode.
1270   3328-3583      *Reserved*                 *Reserved for future Vector
1271                                             Accumulation Registers when
1272                                             executing in wavefront 64 mode.*
1273   ============== ================= ======== ==================================
1274
1275The vector registers are represented as the full size for the wavefront. They
1276are organized as consecutive dwords (32-bits), one per lane, with the dword at
1277the least significant bit position corresponding to lane 0 and so forth. DWARF
1278location expressions involving the ``DW_OP_LLVM_offset`` and
1279``DW_OP_LLVM_push_lane`` operations are used to select the part of the vector
1280register corresponding to the lane that is executing the current thread of
1281execution in languages that are implemented using a SIMD or SIMT execution
1282model.
1283
1284If the wavefront size is 32 lanes then the wavefront 32 mode register
1285definitions are used. If the wavefront size is 64 lanes then the wavefront 64
1286mode register definitions are used. Some AMDGPU targets support executing in
1287both wavefront 32 and wavefront 64 mode. The register definitions corresponding
1288to the wavefront mode of the generated code will be used.
1289
1290If code is generated to execute in a 32-bit process address space, then the
129132-bit process address space register definitions are used. If code is generated
1292to execute in a 64-bit process address space, then the 64-bit process address
1293space register definitions are used. The ``amdgcn`` target only supports the
129464-bit process address space.
1295
1296.. _amdgpu-dwarf-address-class-identifier:
1297
1298Address Class Identifier
1299------------------------
1300
1301The DWARF address class represents the source language memory space. See DWARF
1302Version 5 section 2.12 which is updated by the propoal in
1303:ref:`amdgpu-dwarf-segment_addresses`.
1304
1305The DWARF address class mapping used for AMDGPU is defined in
1306:ref:`amdgpu-dwarf-address-class-mapping-table`.
1307
1308.. table:: AMDGPU DWARF Address Class Mapping
1309   :name: amdgpu-dwarf-address-class-mapping-table
1310
1311   ========================= ====== =================
1312   DWARF                            AMDGPU
1313   -------------------------------- -----------------
1314   Address Class Name        Value  Address Space
1315   ========================= ====== =================
1316   ``DW_ADDR_none``          0x0000 Generic (Flat)
1317   ``DW_ADDR_LLVM_global``   0x0001 Global
1318   ``DW_ADDR_LLVM_constant`` 0x0002 Global
1319   ``DW_ADDR_LLVM_group``    0x0003 Local (group/LDS)
1320   ``DW_ADDR_LLVM_private``  0x0004 Private (Scratch)
1321   ``DW_ADDR_AMDGPU_region`` 0x8000 Region (GDS)
1322   ========================= ====== =================
1323
1324The DWARF address class values defined in the proposal at
1325:ref:`amdgpu-dwarf-segment_addresses` are used.
1326
1327In addition, ``DW_ADDR_AMDGPU_region`` is encoded as a vendor extension. This is
1328available for use for the AMD extension for access to the hardware GDS memory
1329which is scratchpad memory allocated per device.
1330
1331For AMDGPU if no ``DW_AT_address_class`` attribute is present, then the default
1332address class of ``DW_ADDR_none`` is used.
1333
1334See :ref:`amdgpu-dwarf-address-space-identifier` for information on the AMDGPU
1335mapping of DWARF address classes to DWARF address spaces, including address size
1336and NULL value.
1337
1338.. _amdgpu-dwarf-address-space-identifier:
1339
1340Address Space Identifier
1341------------------------
1342
1343DWARF address spaces correspond to target architecture specific linear
1344addressable memory areas. See DWARF Version 5 section 2.12 and
1345:ref:`amdgpu-dwarf-segment_addresses`.
1346
1347The DWARF address space mapping used for AMDGPU is defined in
1348:ref:`amdgpu-dwarf-address-space-mapping-table`.
1349
1350.. table:: AMDGPU DWARF Address Space Mapping
1351   :name: amdgpu-dwarf-address-space-mapping-table
1352
1353   ======================================= ===== ======= ======== ================= =======================
1354   DWARF                                                          AMDGPU            Notes
1355   --------------------------------------- ----- ---------------- ----------------- -----------------------
1356   Address Space Name                      Value Address Bit Size Address Space
1357   --------------------------------------- ----- ------- -------- ----------------- -----------------------
1358   ..                                            64-bit  32-bit
1359                                                 process process
1360                                                 address address
1361                                                 space   space
1362   ======================================= ===== ======= ======== ================= =======================
1363   ``DW_ASPACE_none``                      0x00  8       4        Global            *default address space*
1364   ``DW_ASPACE_AMDGPU_generic``            0x01  8       4        Generic (Flat)
1365   ``DW_ASPACE_AMDGPU_region``             0x02  4       4        Region (GDS)
1366   ``DW_ASPACE_AMDGPU_local``              0x03  4       4        Local (group/LDS)
1367   *Reserved*                              0x04
1368   ``DW_ASPACE_AMDGPU_private_lane``       0x05  4       4        Private (Scratch) *focused lane*
1369   ``DW_ASPACE_AMDGPU_private_wave``       0x06  4       4        Private (Scratch) *unswizzled wavefront*
1370   *Reserved*                              0x07-
1371                                           0x1F
1372   ``DW_ASPACE_AMDGPU_private_lane<0-63>`` 0x20- 4       4        Private (Scratch) *specific lane*
1373                                           0x5F
1374   ======================================= ===== ======= ======== ================= =======================
1375
1376See :ref:`amdgpu-address-spaces` for information on the AMDGPU address spaces
1377including address size and NULL value.
1378
1379The ``DW_ASPACE_none`` address space is the default target architecture address
1380space used in DWARF operations that do not specify an address space. It
1381therefore has to map to the global address space so that the ``DW_OP_addr*`` and
1382related operations can refer to addresses in the program code.
1383
1384The ``DW_ASPACE_AMDGPU_generic`` address space allows location expressions to
1385specify the flat address space. If the address corresponds to an address in the
1386local address space, then it corresponds to the wavefront that is executing the
1387focused thread of execution. If the address corresponds to an address in the
1388private address space, then it corresponds to the lane that is executing the
1389focused thread of execution for languages that are implemented using a SIMD or
1390SIMT execution model.
1391
1392.. note::
1393
1394  CUDA-like languages such as HIP that do not have address spaces in the
1395  language type system, but do allow variables to be allocated in different
1396  address spaces, need to explicitly specify the ``DW_ASPACE_AMDGPU_generic``
1397  address space in the DWARF expression operations as the default address space
1398  is the global address space.
1399
1400The ``DW_ASPACE_AMDGPU_local`` address space allows location expressions to
1401specify the local address space corresponding to the wavefront that is executing
1402the focused thread of execution.
1403
1404The ``DW_ASPACE_AMDGPU_private_lane`` address space allows location expressions
1405to specify the private address space corresponding to the lane that is executing
1406the focused thread of execution for languages that are implemented using a SIMD
1407or SIMT execution model.
1408
1409The ``DW_ASPACE_AMDGPU_private_wave`` address space allows location expressions
1410to specify the unswizzled private address space corresponding to the wavefront
1411that is executing the focused thread of execution. The wavefront view of private
1412memory is the per wavefront unswizzled backing memory layout defined in
1413:ref:`amdgpu-address-spaces`, such that address 0 corresponds to the first
1414location for the backing memory of the wavefront (namely the address is not
1415offset by ``wavefront-scratch-base``). The following formula can be used to
1416convert from a ``DW_ASPACE_AMDGPU_private_lane`` address to a
1417``DW_ASPACE_AMDGPU_private_wave`` address:
1418
1419::
1420
1421  private-address-wavefront =
1422    ((private-address-lane / 4) * wavefront-size * 4) +
1423    (wavefront-lane-id * 4) + (private-address-lane % 4)
1424
1425If the ``DW_ASPACE_AMDGPU_private_lane`` address is dword aligned, and the start
1426of the dwords for each lane starting with lane 0 is required, then this
1427simplifies to:
1428
1429::
1430
1431  private-address-wavefront =
1432    private-address-lane * wavefront-size
1433
1434A compiler can use the ``DW_ASPACE_AMDGPU_private_wave`` address space to read a
1435complete spilled vector register back into a complete vector register in the
1436CFI. The frame pointer can be a private lane address which is dword aligned,
1437which can be shifted to multiply by the wavefront size, and then used to form a
1438private wavefront address that gives a location for a contiguous set of dwords,
1439one per lane, where the vector register dwords are spilled. The compiler knows
1440the wavefront size since it generates the code. Note that the type of the
1441address may have to be converted as the size of a
1442``DW_ASPACE_AMDGPU_private_lane`` address may be smaller than the size of a
1443``DW_ASPACE_AMDGPU_private_wave`` address.
1444
1445The ``DW_ASPACE_AMDGPU_private_lane<N>`` address space allows location
1446expressions to specify the private address space corresponding to a specific
1447lane N. For example, this can be used when the compiler spills scalar registers
1448to scratch memory, with each scalar register being saved to a different lane's
1449scratch memory.
1450
1451.. _amdgpu-dwarf-lane-identifier:
1452
1453Lane identifier
1454---------------
1455
1456DWARF lane identifies specify a target architecture lane position for hardware
1457that executes in a SIMD or SIMT manner, and on which a source language maps its
1458threads of execution onto those lanes. The DWARF lane identifier is pushed by
1459the ``DW_OP_LLVM_push_lane`` DWARF expression operation. See DWARF Version 5
1460section 2.5 which is updated by the proposal in
1461:ref:`amdgpu-dwarf-operation-expressions`.
1462
1463For AMDGPU, the lane identifier corresponds to the hardware lane ID of a
1464wavefront. It is numbered from 0 to the wavefront size minus 1.
1465
1466Operation Expressions
1467---------------------
1468
1469DWARF expressions are used to compute program values and the locations of
1470program objects. See DWARF Version 5 section 2.5 and
1471:ref:`amdgpu-dwarf-operation-expressions`.
1472
1473DWARF location descriptions describe how to access storage which includes memory
1474and registers. When accessing storage on AMDGPU, bytes are ordered with least
1475significant bytes first, and bits are ordered within bytes with least
1476significant bits first.
1477
1478For AMDGPU CFI expressions, ``DW_OP_LLVM_select_bit_piece`` is used to describe
1479unwinding vector registers that are spilled under the execution mask to memory:
1480the zero-single location description is the vector register, and the one-single
1481location description is the spilled memory location description. The
1482``DW_OP_LLVM_form_aspace_address`` is used to specify the address space of the
1483memory location description.
1484
1485In AMDGPU expressions, ``DW_OP_LLVM_select_bit_piece`` is used by the
1486``DW_AT_LLVM_lane_pc`` attribute expression where divergent control flow is
1487controlled by the execution mask. An undefined location description together
1488with ``DW_OP_LLVM_extend`` is used to indicate the lane was not active on entry
1489to the subprogram. See :ref:`amdgpu-dwarf-dw-at-llvm-lane-pc` for an example.
1490
1491Debugger Information Entry Attributes
1492-------------------------------------
1493
1494This section describes how certain debugger information entry attributes are
1495used by AMDGPU. See the sections in DWARF Version 5 section 2 which are updated
1496by the proposal in :ref:`amdgpu-dwarf-debugging-information-entry-attributes`.
1497
1498.. _amdgpu-dwarf-dw-at-llvm-lane-pc:
1499
1500``DW_AT_LLVM_lane_pc``
1501~~~~~~~~~~~~~~~~~~~~~~
1502
1503For AMDGPU, the ``DW_AT_LLVM_lane_pc`` attribute is used to specify the program
1504location of the separate lanes of a SIMT thread.
1505
1506If the lane is an active lane then this will be the same as the current program
1507location.
1508
1509If the lane is inactive, but was active on entry to the subprogram, then this is
1510the program location in the subprogram at which execution of the lane is
1511conceptual positioned.
1512
1513If the lane was not active on entry to the subprogram, then this will be the
1514undefined location. A client debugger can check if the lane is part of a valid
1515work-group by checking that the lane is in the range of the associated
1516work-group within the grid, accounting for partial work-groups. If it is not,
1517then the debugger can omit any information for the lane. Otherwise, the debugger
1518may repeatedly unwind the stack and inspect the ``DW_AT_LLVM_lane_pc`` of the
1519calling subprogram until it finds a non-undefined location. Conceptually the
1520lane only has the call frames that it has a non-undefined
1521``DW_AT_LLVM_lane_pc``.
1522
1523The following example illustrates how the AMDGPU backend can generate a DWARF
1524location list expression for the nested ``IF/THEN/ELSE`` structures of the
1525following subprogram pseudo code for a target with 64 lanes per wavefront.
1526
1527.. code::
1528  :number-lines:
1529
1530  SUBPROGRAM X
1531  BEGIN
1532    a;
1533    IF (c1) THEN
1534      b;
1535      IF (c2) THEN
1536        c;
1537      ELSE
1538        d;
1539      ENDIF
1540      e;
1541    ELSE
1542      f;
1543    ENDIF
1544    g;
1545  END
1546
1547The AMDGPU backend may generate the following pseudo LLVM MIR to manipulate the
1548execution mask (``EXEC``) to linearize the control flow. The condition is
1549evaluated to make a mask of the lanes for which the condition evaluates to true.
1550First the ``THEN`` region is executed by setting the ``EXEC`` mask to the
1551logical ``AND`` of the current ``EXEC`` mask with the condition mask. Then the
1552``ELSE`` region is executed by negating the ``EXEC`` mask and logical ``AND`` of
1553the saved ``EXEC`` mask at the start of the region. After the ``IF/THEN/ELSE``
1554region the ``EXEC`` mask is restored to the value it had at the beginning of the
1555region. This is shown below. Other approaches are possible, but the basic
1556concept is the same.
1557
1558.. code::
1559  :number-lines:
1560
1561  $lex_start:
1562    a;
1563    %1 = EXEC
1564    %2 = c1
1565  $lex_1_start:
1566    EXEC = %1 & %2
1567  $if_1_then:
1568      b;
1569      %3 = EXEC
1570      %4 = c2
1571  $lex_1_1_start:
1572      EXEC = %3 & %4
1573  $lex_1_1_then:
1574        c;
1575      EXEC = ~EXEC & %3
1576  $lex_1_1_else:
1577        d;
1578      EXEC = %3
1579  $lex_1_1_end:
1580      e;
1581    EXEC = ~EXEC & %1
1582  $lex_1_else:
1583      f;
1584    EXEC = %1
1585  $lex_1_end:
1586    g;
1587  $lex_end:
1588
1589To create the DWARF location list expression that defines the location
1590description of a vector of lane program locations, the LLVM MIR ``DBG_VALUE``
1591pseudo instruction can be used to annotate the linearized control flow. This can
1592be done by defining an artificial variable for the lane PC. The DWARF location
1593list expression created for it is used as the value of the
1594``DW_AT_LLVM_lane_pc`` attribute on the subprogram's debugger information entry.
1595
1596A DWARF procedure is defined for each well nested structured control flow region
1597which provides the conceptual lane program location for a lane if it is not
1598active (namely it is divergent). The DWARF operation expression for each region
1599conceptually inherits the value of the immediately enclosing region and modifies
1600it according to the semantics of the region.
1601
1602For an ``IF/THEN/ELSE`` region the divergent program location is at the start of
1603the region for the ``THEN`` region since it is executed first. For the ``ELSE``
1604region the divergent program location is at the end of the ``IF/THEN/ELSE``
1605region since the ``THEN`` region has completed.
1606
1607The lane PC artificial variable is assigned at each region transition. It uses
1608the immediately enclosing region's DWARF procedure to compute the program
1609location for each lane assuming they are divergent, and then modifies the result
1610by inserting the current program location for each lane that the ``EXEC`` mask
1611indicates is active.
1612
1613By having separate DWARF procedures for each region, they can be reused to
1614define the value for any nested region. This reduces the total size of the DWARF
1615operation expressions.
1616
1617The following provides an example using pseudo LLVM MIR.
1618
1619.. code::
1620  :number-lines:
1621
1622  $lex_start:
1623    DEFINE_DWARF %__uint_64 = DW_TAG_base_type[
1624      DW_AT_name = "__uint64";
1625      DW_AT_byte_size = 8;
1626      DW_AT_encoding = DW_ATE_unsigned;
1627    ];
1628    DEFINE_DWARF %__active_lane_pc = DW_TAG_dwarf_procedure[
1629      DW_AT_name = "__active_lane_pc";
1630      DW_AT_location = [
1631        DW_OP_regx PC;
1632        DW_OP_LLVM_extend 64, 64;
1633        DW_OP_regval_type EXEC, %uint_64;
1634        DW_OP_LLVM_select_bit_piece 64, 64;
1635      ];
1636    ];
1637    DEFINE_DWARF %__divergent_lane_pc = DW_TAG_dwarf_procedure[
1638      DW_AT_name = "__divergent_lane_pc";
1639      DW_AT_location = [
1640        DW_OP_LLVM_undefined;
1641        DW_OP_LLVM_extend 64, 64;
1642      ];
1643    ];
1644    DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
1645      DW_OP_call_ref %__divergent_lane_pc;
1646      DW_OP_call_ref %__active_lane_pc;
1647    ];
1648    a;
1649    %1 = EXEC;
1650    DBG_VALUE %1, $noreg, %__lex_1_save_exec;
1651    %2 = c1;
1652  $lex_1_start:
1653    EXEC = %1 & %2;
1654  $lex_1_then:
1655      DEFINE_DWARF %__divergent_lane_pc_1_then = DW_TAG_dwarf_procedure[
1656        DW_AT_name = "__divergent_lane_pc_1_then";
1657        DW_AT_location = DIExpression[
1658          DW_OP_call_ref %__divergent_lane_pc;
1659          DW_OP_addrx &lex_1_start;
1660          DW_OP_stack_value;
1661          DW_OP_LLVM_extend 64, 64;
1662          DW_OP_call_ref %__lex_1_save_exec;
1663          DW_OP_deref_type 64, %__uint_64;
1664          DW_OP_LLVM_select_bit_piece 64, 64;
1665        ];
1666      ];
1667      DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
1668        DW_OP_call_ref %__divergent_lane_pc_1_then;
1669        DW_OP_call_ref %__active_lane_pc;
1670      ];
1671      b;
1672      %3 = EXEC;
1673      DBG_VALUE %3, %__lex_1_1_save_exec;
1674      %4 = c2;
1675  $lex_1_1_start:
1676      EXEC = %3 & %4;
1677  $lex_1_1_then:
1678        DEFINE_DWARF %__divergent_lane_pc_1_1_then = DW_TAG_dwarf_procedure[
1679          DW_AT_name = "__divergent_lane_pc_1_1_then";
1680          DW_AT_location = DIExpression[
1681            DW_OP_call_ref %__divergent_lane_pc_1_then;
1682            DW_OP_addrx &lex_1_1_start;
1683            DW_OP_stack_value;
1684            DW_OP_LLVM_extend 64, 64;
1685            DW_OP_call_ref %__lex_1_1_save_exec;
1686            DW_OP_deref_type 64, %__uint_64;
1687            DW_OP_LLVM_select_bit_piece 64, 64;
1688          ];
1689        ];
1690        DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
1691          DW_OP_call_ref %__divergent_lane_pc_1_1_then;
1692          DW_OP_call_ref %__active_lane_pc;
1693        ];
1694        c;
1695      EXEC = ~EXEC & %3;
1696  $lex_1_1_else:
1697        DEFINE_DWARF %__divergent_lane_pc_1_1_else = DW_TAG_dwarf_procedure[
1698          DW_AT_name = "__divergent_lane_pc_1_1_else";
1699          DW_AT_location = DIExpression[
1700            DW_OP_call_ref %__divergent_lane_pc_1_then;
1701            DW_OP_addrx &lex_1_1_end;
1702            DW_OP_stack_value;
1703            DW_OP_LLVM_extend 64, 64;
1704            DW_OP_call_ref %__lex_1_1_save_exec;
1705            DW_OP_deref_type 64, %__uint_64;
1706            DW_OP_LLVM_select_bit_piece 64, 64;
1707          ];
1708        ];
1709        DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
1710          DW_OP_call_ref %__divergent_lane_pc_1_1_else;
1711          DW_OP_call_ref %__active_lane_pc;
1712        ];
1713        d;
1714      EXEC = %3;
1715  $lex_1_1_end:
1716      DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
1717        DW_OP_call_ref %__divergent_lane_pc;
1718        DW_OP_call_ref %__active_lane_pc;
1719      ];
1720      e;
1721    EXEC = ~EXEC & %1;
1722  $lex_1_else:
1723      DEFINE_DWARF %__divergent_lane_pc_1_else = DW_TAG_dwarf_procedure[
1724        DW_AT_name = "__divergent_lane_pc_1_else";
1725        DW_AT_location = DIExpression[
1726          DW_OP_call_ref %__divergent_lane_pc;
1727          DW_OP_addrx &lex_1_end;
1728          DW_OP_stack_value;
1729          DW_OP_LLVM_extend 64, 64;
1730          DW_OP_call_ref %__lex_1_save_exec;
1731          DW_OP_deref_type 64, %__uint_64;
1732          DW_OP_LLVM_select_bit_piece 64, 64;
1733        ];
1734      ];
1735      DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
1736        DW_OP_call_ref %__divergent_lane_pc_1_else;
1737        DW_OP_call_ref %__active_lane_pc;
1738      ];
1739      f;
1740    EXEC = %1;
1741  $lex_1_end:
1742    DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc DIExpression[
1743      DW_OP_call_ref %__divergent_lane_pc;
1744      DW_OP_call_ref %__active_lane_pc;
1745    ];
1746    g;
1747  $lex_end:
1748
1749The DWARF procedure ``%__active_lane_pc`` is used to update the lane pc elements
1750that are active, with the current program location.
1751
1752Artificial variables %__lex_1_save_exec and %__lex_1_1_save_exec are created for
1753the execution masks saved on entry to a region. Using the ``DBG_VALUE`` pseudo
1754instruction, location list entries will be created that describe where the
1755artificial variables are allocated at any given program location. The compiler
1756may allocate them to registers or spill them to memory.
1757
1758The DWARF procedures for each region use the values of the saved execution mask
1759artificial variables to only update the lanes that are active on entry to the
1760region. All other lanes retain the value of the enclosing region where they were
1761last active. If they were not active on entry to the subprogram, then will have
1762the undefined location description.
1763
1764Other structured control flow regions can be handled similarly. For example,
1765loops would set the divergent program location for the region at the end of the
1766loop. Any lanes active will be in the loop, and any lanes not active must have
1767exited the loop.
1768
1769An ``IF/THEN/ELSEIF/ELSEIF/...`` region can be treated as a nest of
1770``IF/THEN/ELSE`` regions.
1771
1772The DWARF procedures can use the active lane artificial variable described in
1773:ref:`amdgpu-dwarf-amdgpu-dw-at-llvm-active-lane` rather than the actual
1774``EXEC`` mask in order to support whole or quad wavefront mode.
1775
1776.. _amdgpu-dwarf-amdgpu-dw-at-llvm-active-lane:
1777
1778``DW_AT_LLVM_active_lane``
1779~~~~~~~~~~~~~~~~~~~~~~~~~~
1780
1781The ``DW_AT_LLVM_active_lane`` attribute on a subprogram debugger information
1782entry is used to specify the lanes that are conceptually active for a SIMT
1783thread.
1784
1785The execution mask may be modified to implement whole or quad wavefront mode
1786operations. For example, all lanes may need to temporarily be made active to
1787execute a whole wavefront operation. Such regions would save the ``EXEC`` mask,
1788update it to enable the necessary lanes, perform the operations, and then
1789restore the ``EXEC`` mask from the saved value. While executing the whole
1790wavefront region, the conceptual execution mask is the saved value, not the
1791``EXEC`` value.
1792
1793This is handled by defining an artificial variable for the active lane mask. The
1794active lane mask artificial variable would be the actual ``EXEC`` mask for
1795normal regions, and the saved execution mask for regions where the mask is
1796temporarily updated. The location list expression created for this artificial
1797variable is used to define the value of the ``DW_AT_LLVM_active_lane``
1798attribute.
1799
1800``DW_AT_LLVM_augmentation``
1801~~~~~~~~~~~~~~~~~~~~~~~~~~~
1802
1803For AMDGPU, the ``DW_AT_LLVM_augmentation`` attribute of a compilation unit
1804debugger information entry has the following value for the augmentation string:
1805
1806::
1807
1808  [amdgpu:v0.0]
1809
1810The "vX.Y" specifies the major X and minor Y version number of the AMDGPU
1811extensions used in the DWARF of the compilation unit. The version number
1812conforms to [SEMVER]_.
1813
1814Call Frame Information
1815----------------------
1816
1817DWARF Call Frame Information (CFI) describes how a consumer can virtually
1818*unwind* call frames in a running process or core dump. See DWARF Version 5
1819section 6.4 and :ref:`amdgpu-dwarf-call-frame-information`.
1820
1821For AMDGPU, the Common Information Entry (CIE) fields have the following values:
1822
18231.  ``augmentation`` string contains the following null-terminated UTF-8 string:
1824
1825    ::
1826
1827      [amd:v0.0]
1828
1829    The ``vX.Y`` specifies the major X and minor Y version number of the AMDGPU
1830    extensions used in this CIE or to the FDEs that use it. The version number
1831    conforms to [SEMVER]_.
1832
18332.  ``address_size`` for the ``Global`` address space is defined in
1834    :ref:`amdgpu-dwarf-address-space-identifier`.
1835
18363.  ``segment_selector_size`` is 0 as AMDGPU does not use a segment selector.
1837
18384.  ``code_alignment_factor`` is 4 bytes.
1839
1840    .. TODO::
1841
1842       Add to :ref:`amdgpu-processor-table` table.
1843
18445.  ``data_alignment_factor`` is 4 bytes.
1845
1846    .. TODO::
1847
1848       Add to :ref:`amdgpu-processor-table` table.
1849
18506.  ``return_address_register`` is ``PC_32`` for 32-bit processes and ``PC_64``
1851    for 64-bit processes defined in :ref:`amdgpu-dwarf-register-identifier`.
1852
18537.  ``initial_instructions`` Since a subprogram X with fewer registers can be
1854    called from subprogram Y that has more allocated, X will not change any of
1855    the extra registers as it cannot access them. Therefore, the default rule
1856    for all columns is ``same value``.
1857
1858For AMDGPU the register number follows the numbering defined in
1859:ref:`amdgpu-dwarf-register-identifier`.
1860
1861For AMDGPU the instructions are variable size. A consumer can subtract 1 from
1862the return address to get the address of a byte within the call site
1863instructions. See DWARF Version 5 section 6.4.4.
1864
1865Accelerated Access
1866------------------
1867
1868See DWARF Version 5 section 6.1.
1869
1870Lookup By Name Section Header
1871~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1872
1873See DWARF Version 5 section 6.1.1.4.1 and :ref:`amdgpu-dwarf-lookup-by-name`.
1874
1875For AMDGPU the lookup by name section header table:
1876
1877``augmentation_string_size`` (uword)
1878
1879  Set to the length of the ``augmentation_string`` value which is always a
1880  multiple of 4.
1881
1882``augmentation_string`` (sequence of UTF-8 characters)
1883
1884  Contains the following UTF-8 string null padded to a multiple of 4 bytes:
1885
1886  ::
1887
1888    [amdgpu:v0.0]
1889
1890  The "vX.Y" specifies the major X and minor Y version number of the AMDGPU
1891  extensions used in the DWARF of this index. The version number conforms to
1892  [SEMVER]_.
1893
1894  .. note::
1895
1896    This is different to the DWARF Version 5 definition that requires the first
1897    4 characters to be the vendor ID. But this is consistent with the other
1898    augmentation strings and does allow multiple vendor contributions. However,
1899    backwards compatibility may be more desirable.
1900
1901Lookup By Address Section Header
1902~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1903
1904See DWARF Version 5 section 6.1.2.
1905
1906For AMDGPU the lookup by address section header table:
1907
1908``address_size`` (ubyte)
1909
1910  Match the address size for the ``Global`` address space defined in
1911  :ref:`amdgpu-dwarf-address-space-identifier`.
1912
1913``segment_selector_size`` (ubyte)
1914
1915  AMDGPU does not use a segment selector so this is 0. The entries in the
1916  ``.debug_aranges`` do not have a segment selector.
1917
1918Line Number Information
1919-----------------------
1920
1921See DWARF Version 5 section 6.2 and :ref:`amdgpu-dwarf-line-number-information`.
1922
1923AMDGPU does not use the ``isa`` state machine registers and always sets it to 0.
1924The instruction set must be obtained from the ELF file header ``e_flags`` field
1925in the ``EF_AMDGPU_MACH`` bit position (see :ref:`ELF Header
1926<amdgpu-elf-header>`). See DWARF Version 5 section 6.2.2.
1927
1928.. TODO::
1929
1930  Should the ``isa`` state machine register be used to indicate if the code is
1931  in wavefront32 or wavefront64 mode? Or used to specify the architecture ISA?
1932
1933For AMDGPU the line number program header fields have the following values (see
1934DWARF Version 5 section 6.2.4):
1935
1936``address_size`` (ubyte)
1937  Matches the address size for the ``Global`` address space defined in
1938  :ref:`amdgpu-dwarf-address-space-identifier`.
1939
1940``segment_selector_size`` (ubyte)
1941  AMDGPU does not use a segment selector so this is 0.
1942
1943``minimum_instruction_length`` (ubyte)
1944  For GFX9-GFX10 this is 4.
1945
1946``maximum_operations_per_instruction`` (ubyte)
1947  For GFX9-GFX10 this is 1.
1948
1949Source text for online-compiled programs (for example, those compiled by the
1950OpenCL language runtime) may be embedded into the DWARF Version 5 line table.
1951See DWARF Version 5 section 6.2.4.1 which is updated by the proposal in
1952:ref:`DW_LNCT_LLVM_source
1953<amdgpu-dwarf-line-number-information-dw-lnct-llvm-source>`.
1954
1955The Clang option used to control source embedding in AMDGPU is defined in
1956:ref:`amdgpu-clang-debug-options-table`.
1957
1958  .. table:: AMDGPU Clang Debug Options
1959     :name: amdgpu-clang-debug-options-table
1960
1961     ==================== ==================================================
1962     Debug Flag           Description
1963     ==================== ==================================================
1964     -g[no-]embed-source  Enable/disable embedding source text in DWARF
1965                          debug sections. Useful for environments where
1966                          source cannot be written to disk, such as
1967                          when performing online compilation.
1968     ==================== ==================================================
1969
1970For example:
1971
1972``-gembed-source``
1973  Enable the embedded source.
1974
1975``-gno-embed-source``
1976  Disable the embedded source.
1977
197832-Bit and 64-Bit DWARF Formats
1979-------------------------------
1980
1981See DWARF Version 5 section 7.4 and
1982:ref:`amdgpu-dwarf-32-bit-and-64-bit-dwarf-formats`.
1983
1984For AMDGPU:
1985
1986* For the ``amdgcn`` target architecture only the 64-bit process address space
1987  is supported.
1988
1989* The producer can generate either 32-bit or 64-bit DWARF format. LLVM generates
1990  the 32-bit DWARF format.
1991
1992Unit Headers
1993------------
1994
1995For AMDGPU the following values apply for each of the unit headers described in
1996DWARF Version 5 sections 7.5.1.1, 7.5.1.2, and 7.5.1.3:
1997
1998``address_size`` (ubyte)
1999  Matches the address size for the ``Global`` address space defined in
2000  :ref:`amdgpu-dwarf-address-space-identifier`.
2001
2002.. _amdgpu-code-conventions:
2003
2004Code Conventions
2005================
2006
2007This section provides code conventions used for each supported target triple OS
2008(see :ref:`amdgpu-target-triples`).
2009
2010AMDHSA
2011------
2012
2013This section provides code conventions used when the target triple OS is
2014``amdhsa`` (see :ref:`amdgpu-target-triples`).
2015
2016.. _amdgpu-amdhsa-code-object-target-identification:
2017
2018Code Object Target Identification
2019~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
2020
2021The AMDHSA OS uses the following syntax to specify the code object
2022target as a single string:
2023
2024  ``<Architecture>-<Vendor>-<OS>-<Environment>-<Processor><Target Features>``
2025
2026Where:
2027
2028  - ``<Architecture>``, ``<Vendor>``, ``<OS>`` and ``<Environment>``
2029    are the same as the *Target Triple* (see
2030    :ref:`amdgpu-target-triples`).
2031
2032  - ``<Processor>`` is the same as the *Processor* (see
2033    :ref:`amdgpu-processors`).
2034
2035  - ``<Target Features>`` is a list of the enabled *Target Features*
2036    (see :ref:`amdgpu-target-features`), each prefixed by a plus, that
2037    apply to *Processor*. The list must be in the same order as listed
2038    in the table :ref:`amdgpu-target-feature-table`. Note that *Target
2039    Features* must be included in the list if they are enabled even if
2040    that is the default for *Processor*.
2041
2042For example:
2043
2044  ``"amdgcn-amd-amdhsa--gfx902+xnack"``
2045
2046.. _amdgpu-amdhsa-code-object-metadata:
2047
2048Code Object Metadata
2049~~~~~~~~~~~~~~~~~~~~
2050
2051The code object metadata specifies extensible metadata associated with the code
2052objects executed on HSA [HSA]_ compatible runtimes such as AMD's ROCm
2053[AMD-ROCm]_. The encoding and semantics of this metadata depends on the code
2054object version; see :ref:`amdgpu-amdhsa-code-object-metadata-v2` and
2055:ref:`amdgpu-amdhsa-code-object-metadata-v3`.
2056
2057Code object metadata is specified in a note record (see
2058:ref:`amdgpu-note-records`) and is required when the target triple OS is
2059``amdhsa`` (see :ref:`amdgpu-target-triples`). It must contain the minimum
2060information necessary to support the ROCM kernel queries. For example, the
2061segment sizes needed in a dispatch packet. In addition, a high-level language
2062runtime may require other information to be included. For example, the AMD
2063OpenCL runtime records kernel argument information.
2064
2065.. _amdgpu-amdhsa-code-object-metadata-v2:
2066
2067Code Object V2 Metadata (-mattr=-code-object-v3)
2068++++++++++++++++++++++++++++++++++++++++++++++++
2069
2070.. warning:: Code Object V2 is not the default code object version emitted by
2071  this version of LLVM. For a description of the metadata generated with the
2072  default configuration (Code Object V3) see
2073  :ref:`amdgpu-amdhsa-code-object-metadata-v3`.
2074
2075Code object V2 metadata is specified by the ``NT_AMD_AMDGPU_METADATA`` note
2076record (see :ref:`amdgpu-note-records-v2`).
2077
2078The metadata is specified as a YAML formatted string (see [YAML]_ and
2079:doc:`YamlIO`).
2080
2081.. TODO::
2082
2083  Is the string null terminated? It probably should not if YAML allows it to
2084  contain null characters, otherwise it should be.
2085
2086The metadata is represented as a single YAML document comprised of the mapping
2087defined in table :ref:`amdgpu-amdhsa-code-object-metadata-map-table-v2` and
2088referenced tables.
2089
2090For boolean values, the string values of ``false`` and ``true`` are used for
2091false and true respectively.
2092
2093Additional information can be added to the mappings. To avoid conflicts, any
2094non-AMD key names should be prefixed by "*vendor-name*.".
2095
2096  .. table:: AMDHSA Code Object V2 Metadata Map
2097     :name: amdgpu-amdhsa-code-object-metadata-map-table-v2
2098
2099     ========== ============== ========= =======================================
2100     String Key Value Type     Required? Description
2101     ========== ============== ========= =======================================
2102     "Version"  sequence of    Required  - The first integer is the major
2103                2 integers                 version. Currently 1.
2104                                         - The second integer is the minor
2105                                           version. Currently 0.
2106     "Printf"   sequence of              Each string is encoded information
2107                strings                  about a printf function call. The
2108                                         encoded information is organized as
2109                                         fields separated by colon (':'):
2110
2111                                         ``ID:N:S[0]:S[1]:...:S[N-1]:FormatString``
2112
2113                                         where:
2114
2115                                         ``ID``
2116                                           A 32-bit integer as a unique id for
2117                                           each printf function call
2118
2119                                         ``N``
2120                                           A 32-bit integer equal to the number
2121                                           of arguments of printf function call
2122                                           minus 1
2123
2124                                         ``S[i]`` (where i = 0, 1, ... , N-1)
2125                                           32-bit integers for the size in bytes
2126                                           of the i-th FormatString argument of
2127                                           the printf function call
2128
2129                                         FormatString
2130                                           The format string passed to the
2131                                           printf function call.
2132     "Kernels"  sequence of    Required  Sequence of the mappings for each
2133                mapping                  kernel in the code object. See
2134                                         :ref:`amdgpu-amdhsa-code-object-kernel-metadata-map-table-v2`
2135                                         for the definition of the mapping.
2136     ========== ============== ========= =======================================
2137
2138..
2139
2140  .. table:: AMDHSA Code Object V2 Kernel Metadata Map
2141     :name: amdgpu-amdhsa-code-object-kernel-metadata-map-table-v2
2142
2143     ================= ============== ========= ================================
2144     String Key        Value Type     Required? Description
2145     ================= ============== ========= ================================
2146     "Name"            string         Required  Source name of the kernel.
2147     "SymbolName"      string         Required  Name of the kernel
2148                                                descriptor ELF symbol.
2149     "Language"        string                   Source language of the kernel.
2150                                                Values include:
2151
2152                                                - "OpenCL C"
2153                                                - "OpenCL C++"
2154                                                - "HCC"
2155                                                - "OpenMP"
2156
2157     "LanguageVersion" sequence of              - The first integer is the major
2158                       2 integers                 version.
2159                                                - The second integer is the
2160                                                  minor version.
2161     "Attrs"           mapping                  Mapping of kernel attributes.
2162                                                See
2163                                                :ref:`amdgpu-amdhsa-code-object-kernel-attribute-metadata-map-table-v2`
2164                                                for the mapping definition.
2165     "Args"            sequence of              Sequence of mappings of the
2166                       mapping                  kernel arguments. See
2167                                                :ref:`amdgpu-amdhsa-code-object-kernel-argument-metadata-map-table-v2`
2168                                                for the definition of the mapping.
2169     "CodeProps"       mapping                  Mapping of properties related to
2170                                                the kernel code. See
2171                                                :ref:`amdgpu-amdhsa-code-object-kernel-code-properties-metadata-map-table-v2`
2172                                                for the mapping definition.
2173     ================= ============== ========= ================================
2174
2175..
2176
2177  .. table:: AMDHSA Code Object V2 Kernel Attribute Metadata Map
2178     :name: amdgpu-amdhsa-code-object-kernel-attribute-metadata-map-table-v2
2179
2180     =================== ============== ========= ==============================
2181     String Key          Value Type     Required? Description
2182     =================== ============== ========= ==============================
2183     "ReqdWorkGroupSize" sequence of              If not 0, 0, 0 then all values
2184                         3 integers               must be >=1 and the dispatch
2185                                                  work-group size X, Y, Z must
2186                                                  correspond to the specified
2187                                                  values. Defaults to 0, 0, 0.
2188
2189                                                  Corresponds to the OpenCL
2190                                                  ``reqd_work_group_size``
2191                                                  attribute.
2192     "WorkGroupSizeHint" sequence of              The dispatch work-group size
2193                         3 integers               X, Y, Z is likely to be the
2194                                                  specified values.
2195
2196                                                  Corresponds to the OpenCL
2197                                                  ``work_group_size_hint``
2198                                                  attribute.
2199     "VecTypeHint"       string                   The name of a scalar or vector
2200                                                  type.
2201
2202                                                  Corresponds to the OpenCL
2203                                                  ``vec_type_hint`` attribute.
2204
2205     "RuntimeHandle"     string                   The external symbol name
2206                                                  associated with a kernel.
2207                                                  OpenCL runtime allocates a
2208                                                  global buffer for the symbol
2209                                                  and saves the kernel's address
2210                                                  to it, which is used for
2211                                                  device side enqueueing. Only
2212                                                  available for device side
2213                                                  enqueued kernels.
2214     =================== ============== ========= ==============================
2215
2216..
2217
2218  .. table:: AMDHSA Code Object V2 Kernel Argument Metadata Map
2219     :name: amdgpu-amdhsa-code-object-kernel-argument-metadata-map-table-v2
2220
2221     ================= ============== ========= ================================
2222     String Key        Value Type     Required? Description
2223     ================= ============== ========= ================================
2224     "Name"            string                   Kernel argument name.
2225     "TypeName"        string                   Kernel argument type name.
2226     "Size"            integer        Required  Kernel argument size in bytes.
2227     "Align"           integer        Required  Kernel argument alignment in
2228                                                bytes. Must be a power of two.
2229     "ValueKind"       string         Required  Kernel argument kind that
2230                                                specifies how to set up the
2231                                                corresponding argument.
2232                                                Values include:
2233
2234                                                "ByValue"
2235                                                  The argument is copied
2236                                                  directly into the kernarg.
2237
2238                                                "GlobalBuffer"
2239                                                  A global address space pointer
2240                                                  to the buffer data is passed
2241                                                  in the kernarg.
2242
2243                                                "DynamicSharedPointer"
2244                                                  A group address space pointer
2245                                                  to dynamically allocated LDS
2246                                                  is passed in the kernarg.
2247
2248                                                "Sampler"
2249                                                  A global address space
2250                                                  pointer to a S# is passed in
2251                                                  the kernarg.
2252
2253                                                "Image"
2254                                                  A global address space
2255                                                  pointer to a T# is passed in
2256                                                  the kernarg.
2257
2258                                                "Pipe"
2259                                                  A global address space pointer
2260                                                  to an OpenCL pipe is passed in
2261                                                  the kernarg.
2262
2263                                                "Queue"
2264                                                  A global address space pointer
2265                                                  to an OpenCL device enqueue
2266                                                  queue is passed in the
2267                                                  kernarg.
2268
2269                                                "HiddenGlobalOffsetX"
2270                                                  The OpenCL grid dispatch
2271                                                  global offset for the X
2272                                                  dimension is passed in the
2273                                                  kernarg.
2274
2275                                                "HiddenGlobalOffsetY"
2276                                                  The OpenCL grid dispatch
2277                                                  global offset for the Y
2278                                                  dimension is passed in the
2279                                                  kernarg.
2280
2281                                                "HiddenGlobalOffsetZ"
2282                                                  The OpenCL grid dispatch
2283                                                  global offset for the Z
2284                                                  dimension is passed in the
2285                                                  kernarg.
2286
2287                                                "HiddenNone"
2288                                                  An argument that is not used
2289                                                  by the kernel. Space needs to
2290                                                  be left for it, but it does
2291                                                  not need to be set up.
2292
2293                                                "HiddenPrintfBuffer"
2294                                                  A global address space pointer
2295                                                  to the runtime printf buffer
2296                                                  is passed in kernarg.
2297
2298                                                "HiddenHostcallBuffer"
2299                                                  A global address space pointer
2300                                                  to the runtime hostcall buffer
2301                                                  is passed in kernarg.
2302
2303                                                "HiddenDefaultQueue"
2304                                                  A global address space pointer
2305                                                  to the OpenCL device enqueue
2306                                                  queue that should be used by
2307                                                  the kernel by default is
2308                                                  passed in the kernarg.
2309
2310                                                "HiddenCompletionAction"
2311                                                  A global address space pointer
2312                                                  to help link enqueued kernels into
2313                                                  the ancestor tree for determining
2314                                                  when the parent kernel has finished.
2315
2316                                                "HiddenMultiGridSyncArg"
2317                                                  A global address space pointer for
2318                                                  multi-grid synchronization is
2319                                                  passed in the kernarg.
2320
2321     "ValueType"       string                   Unused and deprecated. This should no longer
2322                                                be emitted, but is accepted for compatibility.
2323
2324
2325     "PointeeAlign"    integer                  Alignment in bytes of pointee
2326                                                type for pointer type kernel
2327                                                argument. Must be a power
2328                                                of 2. Only present if
2329                                                "ValueKind" is
2330                                                "DynamicSharedPointer".
2331     "AddrSpaceQual"   string                   Kernel argument address space
2332                                                qualifier. Only present if
2333                                                "ValueKind" is "GlobalBuffer" or
2334                                                "DynamicSharedPointer". Values
2335                                                are:
2336
2337                                                - "Private"
2338                                                - "Global"
2339                                                - "Constant"
2340                                                - "Local"
2341                                                - "Generic"
2342                                                - "Region"
2343
2344                                                .. TODO::
2345                                                   Is GlobalBuffer only Global
2346                                                   or Constant? Is
2347                                                   DynamicSharedPointer always
2348                                                   Local? Can HCC allow Generic?
2349                                                   How can Private or Region
2350                                                   ever happen?
2351     "AccQual"         string                   Kernel argument access
2352                                                qualifier. Only present if
2353                                                "ValueKind" is "Image" or
2354                                                "Pipe". Values
2355                                                are:
2356
2357                                                - "ReadOnly"
2358                                                - "WriteOnly"
2359                                                - "ReadWrite"
2360
2361                                                .. TODO::
2362                                                   Does this apply to
2363                                                   GlobalBuffer?
2364     "ActualAccQual"   string                   The actual memory accesses
2365                                                performed by the kernel on the
2366                                                kernel argument. Only present if
2367                                                "ValueKind" is "GlobalBuffer",
2368                                                "Image", or "Pipe". This may be
2369                                                more restrictive than indicated
2370                                                by "AccQual" to reflect what the
2371                                                kernel actual does. If not
2372                                                present then the runtime must
2373                                                assume what is implied by
2374                                                "AccQual" and "IsConst". Values
2375                                                are:
2376
2377                                                - "ReadOnly"
2378                                                - "WriteOnly"
2379                                                - "ReadWrite"
2380
2381     "IsConst"         boolean                  Indicates if the kernel argument
2382                                                is const qualified. Only present
2383                                                if "ValueKind" is
2384                                                "GlobalBuffer".
2385
2386     "IsRestrict"      boolean                  Indicates if the kernel argument
2387                                                is restrict qualified. Only
2388                                                present if "ValueKind" is
2389                                                "GlobalBuffer".
2390
2391     "IsVolatile"      boolean                  Indicates if the kernel argument
2392                                                is volatile qualified. Only
2393                                                present if "ValueKind" is
2394                                                "GlobalBuffer".
2395
2396     "IsPipe"          boolean                  Indicates if the kernel argument
2397                                                is pipe qualified. Only present
2398                                                if "ValueKind" is "Pipe".
2399
2400                                                .. TODO::
2401                                                   Can GlobalBuffer be pipe
2402                                                   qualified?
2403     ================= ============== ========= ================================
2404
2405..
2406
2407  .. table:: AMDHSA Code Object V2 Kernel Code Properties Metadata Map
2408     :name: amdgpu-amdhsa-code-object-kernel-code-properties-metadata-map-table-v2
2409
2410     ============================ ============== ========= =====================
2411     String Key                   Value Type     Required? Description
2412     ============================ ============== ========= =====================
2413     "KernargSegmentSize"         integer        Required  The size in bytes of
2414                                                           the kernarg segment
2415                                                           that holds the values
2416                                                           of the arguments to
2417                                                           the kernel.
2418     "GroupSegmentFixedSize"      integer        Required  The amount of group
2419                                                           segment memory
2420                                                           required by a
2421                                                           work-group in
2422                                                           bytes. This does not
2423                                                           include any
2424                                                           dynamically allocated
2425                                                           group segment memory
2426                                                           that may be added
2427                                                           when the kernel is
2428                                                           dispatched.
2429     "PrivateSegmentFixedSize"    integer        Required  The amount of fixed
2430                                                           private address space
2431                                                           memory required for a
2432                                                           work-item in
2433                                                           bytes. If the kernel
2434                                                           uses a dynamic call
2435                                                           stack then additional
2436                                                           space must be added
2437                                                           to this value for the
2438                                                           call stack.
2439     "KernargSegmentAlign"        integer        Required  The maximum byte
2440                                                           alignment of
2441                                                           arguments in the
2442                                                           kernarg segment. Must
2443                                                           be a power of 2.
2444     "WavefrontSize"              integer        Required  Wavefront size. Must
2445                                                           be a power of 2.
2446     "NumSGPRs"                   integer        Required  Number of scalar
2447                                                           registers used by a
2448                                                           wavefront for
2449                                                           GFX6-GFX10. This
2450                                                           includes the special
2451                                                           SGPRs for VCC, Flat
2452                                                           Scratch (GFX7-GFX10)
2453                                                           and XNACK (for
2454                                                           GFX8-GFX10). It does
2455                                                           not include the 16
2456                                                           SGPR added if a trap
2457                                                           handler is
2458                                                           enabled. It is not
2459                                                           rounded up to the
2460                                                           allocation
2461                                                           granularity.
2462     "NumVGPRs"                   integer        Required  Number of vector
2463                                                           registers used by
2464                                                           each work-item for
2465                                                           GFX6-GFX10
2466     "MaxFlatWorkGroupSize"       integer        Required  Maximum flat
2467                                                           work-group size
2468                                                           supported by the
2469                                                           kernel in work-items.
2470                                                           Must be >=1 and
2471                                                           consistent with
2472                                                           ReqdWorkGroupSize if
2473                                                           not 0, 0, 0.
2474     "NumSpilledSGPRs"            integer                  Number of stores from
2475                                                           a scalar register to
2476                                                           a register allocator
2477                                                           created spill
2478                                                           location.
2479     "NumSpilledVGPRs"            integer                  Number of stores from
2480                                                           a vector register to
2481                                                           a register allocator
2482                                                           created spill
2483                                                           location.
2484     ============================ ============== ========= =====================
2485
2486.. _amdgpu-amdhsa-code-object-metadata-v3:
2487
2488Code Object V3 Metadata (-mattr=+code-object-v3)
2489++++++++++++++++++++++++++++++++++++++++++++++++
2490
2491Code object V3 metadata is specified by the ``NT_AMDGPU_METADATA`` note record
2492(see :ref:`amdgpu-note-records-v3`).
2493
2494The metadata is represented as Message Pack formatted binary data (see
2495[MsgPack]_). The top level is a Message Pack map that includes the
2496keys defined in table
2497:ref:`amdgpu-amdhsa-code-object-metadata-map-table-v3` and referenced
2498tables.
2499
2500Additional information can be added to the maps. To avoid conflicts,
2501any key names should be prefixed by "*vendor-name*." where
2502``vendor-name`` can be the name of the vendor and specific vendor
2503tool that generates the information. The prefix is abbreviated to
2504simply "." when it appears within a map that has been added by the
2505same *vendor-name*.
2506
2507  .. table:: AMDHSA Code Object V3 Metadata Map
2508     :name: amdgpu-amdhsa-code-object-metadata-map-table-v3
2509
2510     ================= ============== ========= =======================================
2511     String Key        Value Type     Required? Description
2512     ================= ============== ========= =======================================
2513     "amdhsa.version"  sequence of    Required  - The first integer is the major
2514                       2 integers                 version. Currently 1.
2515                                                - The second integer is the minor
2516                                                  version. Currently 0.
2517     "amdhsa.printf"   sequence of              Each string is encoded information
2518                       strings                  about a printf function call. The
2519                                                encoded information is organized as
2520                                                fields separated by colon (':'):
2521
2522                                                ``ID:N:S[0]:S[1]:...:S[N-1]:FormatString``
2523
2524                                                where:
2525
2526                                                ``ID``
2527                                                  A 32-bit integer as a unique id for
2528                                                  each printf function call
2529
2530                                                ``N``
2531                                                  A 32-bit integer equal to the number
2532                                                  of arguments of printf function call
2533                                                  minus 1
2534
2535                                                ``S[i]`` (where i = 0, 1, ... , N-1)
2536                                                  32-bit integers for the size in bytes
2537                                                  of the i-th FormatString argument of
2538                                                  the printf function call
2539
2540                                                FormatString
2541                                                  The format string passed to the
2542                                                  printf function call.
2543     "amdhsa.kernels"  sequence of    Required  Sequence of the maps for each
2544                       map                      kernel in the code object. See
2545                                                :ref:`amdgpu-amdhsa-code-object-kernel-metadata-map-table-v3`
2546                                                for the definition of the keys included
2547                                                in that map.
2548     ================= ============== ========= =======================================
2549
2550..
2551
2552  .. table:: AMDHSA Code Object V3 Kernel Metadata Map
2553     :name: amdgpu-amdhsa-code-object-kernel-metadata-map-table-v3
2554
2555     =================================== ============== ========= ================================
2556     String Key                          Value Type     Required? Description
2557     =================================== ============== ========= ================================
2558     ".name"                             string         Required  Source name of the kernel.
2559     ".symbol"                           string         Required  Name of the kernel
2560                                                                  descriptor ELF symbol.
2561     ".language"                         string                   Source language of the kernel.
2562                                                                  Values include:
2563
2564                                                                  - "OpenCL C"
2565                                                                  - "OpenCL C++"
2566                                                                  - "HCC"
2567                                                                  - "HIP"
2568                                                                  - "OpenMP"
2569                                                                  - "Assembler"
2570
2571     ".language_version"                 sequence of              - The first integer is the major
2572                                         2 integers                 version.
2573                                                                  - The second integer is the
2574                                                                    minor version.
2575     ".args"                             sequence of              Sequence of maps of the
2576                                         map                      kernel arguments. See
2577                                                                  :ref:`amdgpu-amdhsa-code-object-kernel-argument-metadata-map-table-v3`
2578                                                                  for the definition of the keys
2579                                                                  included in that map.
2580     ".reqd_workgroup_size"              sequence of              If not 0, 0, 0 then all values
2581                                         3 integers               must be >=1 and the dispatch
2582                                                                  work-group size X, Y, Z must
2583                                                                  correspond to the specified
2584                                                                  values. Defaults to 0, 0, 0.
2585
2586                                                                  Corresponds to the OpenCL
2587                                                                  ``reqd_work_group_size``
2588                                                                  attribute.
2589     ".workgroup_size_hint"              sequence of              The dispatch work-group size
2590                                         3 integers               X, Y, Z is likely to be the
2591                                                                  specified values.
2592
2593                                                                  Corresponds to the OpenCL
2594                                                                  ``work_group_size_hint``
2595                                                                  attribute.
2596     ".vec_type_hint"                    string                   The name of a scalar or vector
2597                                                                  type.
2598
2599                                                                  Corresponds to the OpenCL
2600                                                                  ``vec_type_hint`` attribute.
2601
2602     ".device_enqueue_symbol"            string                   The external symbol name
2603                                                                  associated with a kernel.
2604                                                                  OpenCL runtime allocates a
2605                                                                  global buffer for the symbol
2606                                                                  and saves the kernel's address
2607                                                                  to it, which is used for
2608                                                                  device side enqueueing. Only
2609                                                                  available for device side
2610                                                                  enqueued kernels.
2611     ".kernarg_segment_size"             integer        Required  The size in bytes of
2612                                                                  the kernarg segment
2613                                                                  that holds the values
2614                                                                  of the arguments to
2615                                                                  the kernel.
2616     ".group_segment_fixed_size"         integer        Required  The amount of group
2617                                                                  segment memory
2618                                                                  required by a
2619                                                                  work-group in
2620                                                                  bytes. This does not
2621                                                                  include any
2622                                                                  dynamically allocated
2623                                                                  group segment memory
2624                                                                  that may be added
2625                                                                  when the kernel is
2626                                                                  dispatched.
2627     ".private_segment_fixed_size"       integer        Required  The amount of fixed
2628                                                                  private address space
2629                                                                  memory required for a
2630                                                                  work-item in
2631                                                                  bytes. If the kernel
2632                                                                  uses a dynamic call
2633                                                                  stack then additional
2634                                                                  space must be added
2635                                                                  to this value for the
2636                                                                  call stack.
2637     ".kernarg_segment_align"            integer        Required  The maximum byte
2638                                                                  alignment of
2639                                                                  arguments in the
2640                                                                  kernarg segment. Must
2641                                                                  be a power of 2.
2642     ".wavefront_size"                   integer        Required  Wavefront size. Must
2643                                                                  be a power of 2.
2644     ".sgpr_count"                       integer        Required  Number of scalar
2645                                                                  registers required by a
2646                                                                  wavefront for
2647                                                                  GFX6-GFX9. A register
2648                                                                  is required if it is
2649                                                                  used explicitly, or
2650                                                                  if a higher numbered
2651                                                                  register is used
2652                                                                  explicitly. This
2653                                                                  includes the special
2654                                                                  SGPRs for VCC, Flat
2655                                                                  Scratch (GFX7-GFX9)
2656                                                                  and XNACK (for
2657                                                                  GFX8-GFX9). It does
2658                                                                  not include the 16
2659                                                                  SGPR added if a trap
2660                                                                  handler is
2661                                                                  enabled. It is not
2662                                                                  rounded up to the
2663                                                                  allocation
2664                                                                  granularity.
2665     ".vgpr_count"                       integer        Required  Number of vector
2666                                                                  registers required by
2667                                                                  each work-item for
2668                                                                  GFX6-GFX9. A register
2669                                                                  is required if it is
2670                                                                  used explicitly, or
2671                                                                  if a higher numbered
2672                                                                  register is used
2673                                                                  explicitly.
2674     ".max_flat_workgroup_size"          integer        Required  Maximum flat
2675                                                                  work-group size
2676                                                                  supported by the
2677                                                                  kernel in work-items.
2678                                                                  Must be >=1 and
2679                                                                  consistent with
2680                                                                  ReqdWorkGroupSize if
2681                                                                  not 0, 0, 0.
2682     ".sgpr_spill_count"                 integer                  Number of stores from
2683                                                                  a scalar register to
2684                                                                  a register allocator
2685                                                                  created spill
2686                                                                  location.
2687     ".vgpr_spill_count"                 integer                  Number of stores from
2688                                                                  a vector register to
2689                                                                  a register allocator
2690                                                                  created spill
2691                                                                  location.
2692     =================================== ============== ========= ================================
2693
2694..
2695
2696  .. table:: AMDHSA Code Object V3 Kernel Argument Metadata Map
2697     :name: amdgpu-amdhsa-code-object-kernel-argument-metadata-map-table-v3
2698
2699     ====================== ============== ========= ================================
2700     String Key             Value Type     Required? Description
2701     ====================== ============== ========= ================================
2702     ".name"                string                   Kernel argument name.
2703     ".type_name"           string                   Kernel argument type name.
2704     ".size"                integer        Required  Kernel argument size in bytes.
2705     ".offset"              integer        Required  Kernel argument offset in
2706                                                     bytes. The offset must be a
2707                                                     multiple of the alignment
2708                                                     required by the argument.
2709     ".value_kind"          string         Required  Kernel argument kind that
2710                                                     specifies how to set up the
2711                                                     corresponding argument.
2712                                                     Values include:
2713
2714                                                     "by_value"
2715                                                       The argument is copied
2716                                                       directly into the kernarg.
2717
2718                                                     "global_buffer"
2719                                                       A global address space pointer
2720                                                       to the buffer data is passed
2721                                                       in the kernarg.
2722
2723                                                     "dynamic_shared_pointer"
2724                                                       A group address space pointer
2725                                                       to dynamically allocated LDS
2726                                                       is passed in the kernarg.
2727
2728                                                     "sampler"
2729                                                       A global address space
2730                                                       pointer to a S# is passed in
2731                                                       the kernarg.
2732
2733                                                     "image"
2734                                                       A global address space
2735                                                       pointer to a T# is passed in
2736                                                       the kernarg.
2737
2738                                                     "pipe"
2739                                                       A global address space pointer
2740                                                       to an OpenCL pipe is passed in
2741                                                       the kernarg.
2742
2743                                                     "queue"
2744                                                       A global address space pointer
2745                                                       to an OpenCL device enqueue
2746                                                       queue is passed in the
2747                                                       kernarg.
2748
2749                                                     "hidden_global_offset_x"
2750                                                       The OpenCL grid dispatch
2751                                                       global offset for the X
2752                                                       dimension is passed in the
2753                                                       kernarg.
2754
2755                                                     "hidden_global_offset_y"
2756                                                       The OpenCL grid dispatch
2757                                                       global offset for the Y
2758                                                       dimension is passed in the
2759                                                       kernarg.
2760
2761                                                     "hidden_global_offset_z"
2762                                                       The OpenCL grid dispatch
2763                                                       global offset for the Z
2764                                                       dimension is passed in the
2765                                                       kernarg.
2766
2767                                                     "hidden_none"
2768                                                       An argument that is not used
2769                                                       by the kernel. Space needs to
2770                                                       be left for it, but it does
2771                                                       not need to be set up.
2772
2773                                                     "hidden_printf_buffer"
2774                                                       A global address space pointer
2775                                                       to the runtime printf buffer
2776                                                       is passed in kernarg.
2777
2778                                                     "hidden_hostcall_buffer"
2779                                                       A global address space pointer
2780                                                       to the runtime hostcall buffer
2781                                                       is passed in kernarg.
2782
2783                                                     "hidden_default_queue"
2784                                                       A global address space pointer
2785                                                       to the OpenCL device enqueue
2786                                                       queue that should be used by
2787                                                       the kernel by default is
2788                                                       passed in the kernarg.
2789
2790                                                     "hidden_completion_action"
2791                                                       A global address space pointer
2792                                                       to help link enqueued kernels into
2793                                                       the ancestor tree for determining
2794                                                       when the parent kernel has finished.
2795
2796                                                     "hidden_multigrid_sync_arg"
2797                                                       A global address space pointer for
2798                                                       multi-grid synchronization is
2799                                                       passed in the kernarg.
2800
2801     ".value_type"          string                    Unused and deprecated. This should no longer
2802                                                      be emitted, but is accepted for compatibility.
2803
2804     ".pointee_align"       integer                  Alignment in bytes of pointee
2805                                                     type for pointer type kernel
2806                                                     argument. Must be a power
2807                                                     of 2. Only present if
2808                                                     ".value_kind" is
2809                                                     "dynamic_shared_pointer".
2810     ".address_space"       string                   Kernel argument address space
2811                                                     qualifier. Only present if
2812                                                     ".value_kind" is "global_buffer" or
2813                                                     "dynamic_shared_pointer". Values
2814                                                     are:
2815
2816                                                     - "private"
2817                                                     - "global"
2818                                                     - "constant"
2819                                                     - "local"
2820                                                     - "generic"
2821                                                     - "region"
2822
2823                                                     .. TODO::
2824                                                        Is "global_buffer" only "global"
2825                                                        or "constant"? Is
2826                                                        "dynamic_shared_pointer" always
2827                                                        "local"? Can HCC allow "generic"?
2828                                                        How can "private" or "region"
2829                                                        ever happen?
2830     ".access"              string                   Kernel argument access
2831                                                     qualifier. Only present if
2832                                                     ".value_kind" is "image" or
2833                                                     "pipe". Values
2834                                                     are:
2835
2836                                                     - "read_only"
2837                                                     - "write_only"
2838                                                     - "read_write"
2839
2840                                                     .. TODO::
2841                                                        Does this apply to
2842                                                        "global_buffer"?
2843     ".actual_access"       string                   The actual memory accesses
2844                                                     performed by the kernel on the
2845                                                     kernel argument. Only present if
2846                                                     ".value_kind" is "global_buffer",
2847                                                     "image", or "pipe". This may be
2848                                                     more restrictive than indicated
2849                                                     by ".access" to reflect what the
2850                                                     kernel actual does. If not
2851                                                     present then the runtime must
2852                                                     assume what is implied by
2853                                                     ".access" and ".is_const"      . Values
2854                                                     are:
2855
2856                                                     - "read_only"
2857                                                     - "write_only"
2858                                                     - "read_write"
2859
2860     ".is_const"            boolean                  Indicates if the kernel argument
2861                                                     is const qualified. Only present
2862                                                     if ".value_kind" is
2863                                                     "global_buffer".
2864
2865     ".is_restrict"         boolean                  Indicates if the kernel argument
2866                                                     is restrict qualified. Only
2867                                                     present if ".value_kind" is
2868                                                     "global_buffer".
2869
2870     ".is_volatile"         boolean                  Indicates if the kernel argument
2871                                                     is volatile qualified. Only
2872                                                     present if ".value_kind" is
2873                                                     "global_buffer".
2874
2875     ".is_pipe"             boolean                  Indicates if the kernel argument
2876                                                     is pipe qualified. Only present
2877                                                     if ".value_kind" is "pipe".
2878
2879                                                     .. TODO::
2880                                                        Can "global_buffer" be pipe
2881                                                        qualified?
2882     ====================== ============== ========= ================================
2883
2884..
2885
2886Kernel Dispatch
2887~~~~~~~~~~~~~~~
2888
2889The HSA architected queuing language (AQL) defines a user space memory
2890interface that can be used to control the dispatch of kernels, in an agent
2891independent way. An agent can have zero or more AQL queues created for it using
2892the ROCm runtime, in which AQL packets (all of which are 64 bytes) can be
2893placed. See the *HSA Platform System Architecture Specification* [HSA]_ for the
2894AQL queue mechanics and packet layouts.
2895
2896The packet processor of a kernel agent is responsible for detecting and
2897dispatching HSA kernels from the AQL queues associated with it. For AMD GPUs the
2898packet processor is implemented by the hardware command processor (CP),
2899asynchronous dispatch controller (ADC) and shader processor input controller
2900(SPI).
2901
2902The ROCm runtime can be used to allocate an AQL queue object. It uses the kernel
2903mode driver to initialize and register the AQL queue with CP.
2904
2905To dispatch a kernel the following actions are performed. This can occur in the
2906CPU host program, or from an HSA kernel executing on a GPU.
2907
29081. A pointer to an AQL queue for the kernel agent on which the kernel is to be
2909   executed is obtained.
29102. A pointer to the kernel descriptor (see
2911   :ref:`amdgpu-amdhsa-kernel-descriptor`) of the kernel to execute is obtained.
2912   It must be for a kernel that is contained in a code object that that was
2913   loaded by the ROCm runtime on the kernel agent with which the AQL queue is
2914   associated.
29153. Space is allocated for the kernel arguments using the ROCm runtime allocator
2916   for a memory region with the kernarg property for the kernel agent that will
2917   execute the kernel. It must be at least 16-byte aligned.
29184. Kernel argument values are assigned to the kernel argument memory
2919   allocation. The layout is defined in the *HSA Programmer's Language
2920   Reference* [HSA]_. For AMDGPU the kernel execution directly accesses the
2921   kernel argument memory in the same way constant memory is accessed. (Note
2922   that the HSA specification allows an implementation to copy the kernel
2923   argument contents to another location that is accessed by the kernel.)
29245. An AQL kernel dispatch packet is created on the AQL queue. The ROCm runtime
2925   api uses 64-bit atomic operations to reserve space in the AQL queue for the
2926   packet. The packet must be set up, and the final write must use an atomic
2927   store release to set the packet kind to ensure the packet contents are
2928   visible to the kernel agent. AQL defines a doorbell signal mechanism to
2929   notify the kernel agent that the AQL queue has been updated. These rules, and
2930   the layout of the AQL queue and kernel dispatch packet is defined in the *HSA
2931   System Architecture Specification* [HSA]_.
29326. A kernel dispatch packet includes information about the actual dispatch,
2933   such as grid and work-group size, together with information from the code
2934   object about the kernel, such as segment sizes. The ROCm runtime queries on
2935   the kernel symbol can be used to obtain the code object values which are
2936   recorded in the :ref:`amdgpu-amdhsa-code-object-metadata`.
29377. CP executes micro-code and is responsible for detecting and setting up the
2938   GPU to execute the wavefronts of a kernel dispatch.
29398. CP ensures that when the a wavefront starts executing the kernel machine
2940   code, the scalar general purpose registers (SGPR) and vector general purpose
2941   registers (VGPR) are set up as required by the machine code. The required
2942   setup is defined in the :ref:`amdgpu-amdhsa-kernel-descriptor`. The initial
2943   register state is defined in
2944   :ref:`amdgpu-amdhsa-initial-kernel-execution-state`.
29459. The prolog of the kernel machine code (see
2946   :ref:`amdgpu-amdhsa-kernel-prolog`) sets up the machine state as necessary
2947   before continuing executing the machine code that corresponds to the kernel.
294810. When the kernel dispatch has completed execution, CP signals the completion
2949    signal specified in the kernel dispatch packet if not 0.
2950
2951Image and Samplers
2952~~~~~~~~~~~~~~~~~~
2953
2954Image and sample handles created by the ROCm runtime are 64-bit addresses of a
2955hardware 32-byte V# and 48 byte S# object respectively. In order to support the
2956HSA ``query_sampler`` operations two extra dwords are used to store the HSA BRIG
2957enumeration values for the queries that are not trivially deducible from the S#
2958representation.
2959
2960HSA Signals
2961~~~~~~~~~~~
2962
2963HSA signal handles created by the ROCm runtime are 64-bit addresses of a
2964structure allocated in memory accessible from both the CPU and GPU. The
2965structure is defined by the ROCm runtime and subject to change between releases
2966(see [AMD-ROCm-github]_).
2967
2968.. _amdgpu-amdhsa-hsa-aql-queue:
2969
2970HSA AQL Queue
2971~~~~~~~~~~~~~
2972
2973The HSA AQL queue structure is defined by the ROCm runtime and subject to change
2974between releases (see [AMD-ROCm-github]_). For some processors it contains
2975fields needed to implement certain language features such as the flat address
2976aperture bases. It also contains fields used by CP such as managing the
2977allocation of scratch memory.
2978
2979.. _amdgpu-amdhsa-kernel-descriptor:
2980
2981Kernel Descriptor
2982~~~~~~~~~~~~~~~~~
2983
2984A kernel descriptor consists of the information needed by CP to initiate the
2985execution of a kernel, including the entry point address of the machine code
2986that implements the kernel.
2987
2988Kernel Descriptor for GFX6-GFX10
2989++++++++++++++++++++++++++++++++
2990
2991CP microcode requires the Kernel descriptor to be allocated on 64-byte
2992alignment.
2993
2994  .. table:: Kernel Descriptor for GFX6-GFX10
2995     :name: amdgpu-amdhsa-kernel-descriptor-gfx6-gfx10-table
2996
2997     ======= ======= =============================== ============================
2998     Bits    Size    Field Name                      Description
2999     ======= ======= =============================== ============================
3000     31:0    4 bytes GROUP_SEGMENT_FIXED_SIZE        The amount of fixed local
3001                                                     address space memory
3002                                                     required for a work-group
3003                                                     in bytes. This does not
3004                                                     include any dynamically
3005                                                     allocated local address
3006                                                     space memory that may be
3007                                                     added when the kernel is
3008                                                     dispatched.
3009     63:32   4 bytes PRIVATE_SEGMENT_FIXED_SIZE      The amount of fixed
3010                                                     private address space
3011                                                     memory required for a
3012                                                     work-item in bytes. If
3013                                                     is_dynamic_callstack is 1
3014                                                     then additional space must
3015                                                     be added to this value for
3016                                                     the call stack.
3017     127:64  8 bytes                                 Reserved, must be 0.
3018     191:128 8 bytes KERNEL_CODE_ENTRY_BYTE_OFFSET   Byte offset (possibly
3019                                                     negative) from base
3020                                                     address of kernel
3021                                                     descriptor to kernel's
3022                                                     entry point instruction
3023                                                     which must be 256 byte
3024                                                     aligned.
3025     351:272 20                                      Reserved, must be 0.
3026             bytes
3027     383:352 4 bytes COMPUTE_PGM_RSRC3               GFX6-9
3028                                                       Reserved, must be 0.
3029                                                     GFX10
3030                                                       Compute Shader (CS)
3031                                                       program settings used by
3032                                                       CP to set up
3033                                                       ``COMPUTE_PGM_RSRC3``
3034                                                       configuration
3035                                                       register. See
3036                                                       :ref:`amdgpu-amdhsa-compute_pgm_rsrc3-gfx10-table`.
3037     415:384 4 bytes COMPUTE_PGM_RSRC1               Compute Shader (CS)
3038                                                     program settings used by
3039                                                     CP to set up
3040                                                     ``COMPUTE_PGM_RSRC1``
3041                                                     configuration
3042                                                     register. See
3043                                                     :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
3044     447:416 4 bytes COMPUTE_PGM_RSRC2               Compute Shader (CS)
3045                                                     program settings used by
3046                                                     CP to set up
3047                                                     ``COMPUTE_PGM_RSRC2``
3048                                                     configuration
3049                                                     register. See
3050                                                     :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
3051     448     1 bit   ENABLE_SGPR_PRIVATE_SEGMENT     Enable the setup of the
3052                     _BUFFER                         SGPR user data registers
3053                                                     (see
3054                                                     :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
3055
3056                                                     The total number of SGPR
3057                                                     user data registers
3058                                                     requested must not exceed
3059                                                     16 and match value in
3060                                                     ``compute_pgm_rsrc2.user_sgpr.user_sgpr_count``.
3061                                                     Any requests beyond 16
3062                                                     will be ignored.
3063     449     1 bit   ENABLE_SGPR_DISPATCH_PTR        *see above*
3064     450     1 bit   ENABLE_SGPR_QUEUE_PTR           *see above*
3065     451     1 bit   ENABLE_SGPR_KERNARG_SEGMENT_PTR *see above*
3066     452     1 bit   ENABLE_SGPR_DISPATCH_ID         *see above*
3067     453     1 bit   ENABLE_SGPR_FLAT_SCRATCH_INIT   *see above*
3068     454     1 bit   ENABLE_SGPR_PRIVATE_SEGMENT     *see above*
3069                     _SIZE
3070     457:455 3 bits                                  Reserved, must be 0.
3071     458     1 bit   ENABLE_WAVEFRONT_SIZE32         GFX6-9
3072                                                       Reserved, must be 0.
3073                                                     GFX10
3074                                                       - If 0 execute in
3075                                                         wavefront size 64 mode.
3076                                                       - If 1 execute in
3077                                                         native wavefront size
3078                                                         32 mode.
3079     463:459 5 bits                                  Reserved, must be 0.
3080     511:464 6 bytes                                 Reserved, must be 0.
3081     512     **Total size 64 bytes.**
3082     ======= ====================================================================
3083
3084..
3085
3086  .. table:: compute_pgm_rsrc1 for GFX6-GFX10
3087     :name: amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table
3088
3089     ======= ======= =============================== ===========================================================================
3090     Bits    Size    Field Name                      Description
3091     ======= ======= =============================== ===========================================================================
3092     5:0     6 bits  GRANULATED_WORKITEM_VGPR_COUNT  Number of vector register
3093                                                     blocks used by each work-item;
3094                                                     granularity is device
3095                                                     specific:
3096
3097                                                     GFX6-GFX9
3098                                                       - vgprs_used 0..256
3099                                                       - max(0, ceil(vgprs_used / 4) - 1)
3100                                                     GFX10 (wavefront size 64)
3101                                                       - max_vgpr 1..256
3102                                                       - max(0, ceil(vgprs_used / 4) - 1)
3103                                                     GFX10 (wavefront size 32)
3104                                                       - max_vgpr 1..256
3105                                                       - max(0, ceil(vgprs_used / 8) - 1)
3106
3107                                                     Where vgprs_used is defined
3108                                                     as the highest VGPR number
3109                                                     explicitly referenced plus
3110                                                     one.
3111
3112                                                     Used by CP to set up
3113                                                     ``COMPUTE_PGM_RSRC1.VGPRS``.
3114
3115                                                     The
3116                                                     :ref:`amdgpu-assembler`
3117                                                     calculates this
3118                                                     automatically for the
3119                                                     selected processor from
3120                                                     values provided to the
3121                                                     `.amdhsa_kernel` directive
3122                                                     by the
3123                                                     `.amdhsa_next_free_vgpr`
3124                                                     nested directive (see
3125                                                     :ref:`amdhsa-kernel-directives-table`).
3126     9:6     4 bits  GRANULATED_WAVEFRONT_SGPR_COUNT Number of scalar register
3127                                                     blocks used by a wavefront;
3128                                                     granularity is device
3129                                                     specific:
3130
3131                                                     GFX6-GFX8
3132                                                       - sgprs_used 0..112
3133                                                       - max(0, ceil(sgprs_used / 8) - 1)
3134                                                     GFX9
3135                                                       - sgprs_used 0..112
3136                                                       - 2 * max(0, ceil(sgprs_used / 16) - 1)
3137                                                     GFX10
3138                                                       Reserved, must be 0.
3139                                                       (128 SGPRs always
3140                                                       allocated.)
3141
3142                                                     Where sgprs_used is
3143                                                     defined as the highest
3144                                                     SGPR number explicitly
3145                                                     referenced plus one, plus
3146                                                     a target specific number
3147                                                     of additional special
3148                                                     SGPRs for VCC,
3149                                                     FLAT_SCRATCH (GFX7+) and
3150                                                     XNACK_MASK (GFX8+), and
3151                                                     any additional
3152                                                     target specific
3153                                                     limitations. It does not
3154                                                     include the 16 SGPRs added
3155                                                     if a trap handler is
3156                                                     enabled.
3157
3158                                                     The target specific
3159                                                     limitations and special
3160                                                     SGPR layout are defined in
3161                                                     the hardware
3162                                                     documentation, which can
3163                                                     be found in the
3164                                                     :ref:`amdgpu-processors`
3165                                                     table.
3166
3167                                                     Used by CP to set up
3168                                                     ``COMPUTE_PGM_RSRC1.SGPRS``.
3169
3170                                                     The
3171                                                     :ref:`amdgpu-assembler`
3172                                                     calculates this
3173                                                     automatically for the
3174                                                     selected processor from
3175                                                     values provided to the
3176                                                     `.amdhsa_kernel` directive
3177                                                     by the
3178                                                     `.amdhsa_next_free_sgpr`
3179                                                     and `.amdhsa_reserve_*`
3180                                                     nested directives (see
3181                                                     :ref:`amdhsa-kernel-directives-table`).
3182     11:10   2 bits  PRIORITY                        Must be 0.
3183
3184                                                     Start executing wavefront
3185                                                     at the specified priority.
3186
3187                                                     CP is responsible for
3188                                                     filling in
3189                                                     ``COMPUTE_PGM_RSRC1.PRIORITY``.
3190     13:12   2 bits  FLOAT_ROUND_MODE_32             Wavefront starts execution
3191                                                     with specified rounding
3192                                                     mode for single (32
3193                                                     bit) floating point
3194                                                     precision floating point
3195                                                     operations.
3196
3197                                                     Floating point rounding
3198                                                     mode values are defined in
3199                                                     :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
3200
3201                                                     Used by CP to set up
3202                                                     ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
3203     15:14   2 bits  FLOAT_ROUND_MODE_16_64          Wavefront starts execution
3204                                                     with specified rounding
3205                                                     denorm mode for half/double (16
3206                                                     and 64-bit) floating point
3207                                                     precision floating point
3208                                                     operations.
3209
3210                                                     Floating point rounding
3211                                                     mode values are defined in
3212                                                     :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
3213
3214                                                     Used by CP to set up
3215                                                     ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
3216     17:16   2 bits  FLOAT_DENORM_MODE_32            Wavefront starts execution
3217                                                     with specified denorm mode
3218                                                     for single (32
3219                                                     bit)  floating point
3220                                                     precision floating point
3221                                                     operations.
3222
3223                                                     Floating point denorm mode
3224                                                     values are defined in
3225                                                     :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
3226
3227                                                     Used by CP to set up
3228                                                     ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
3229     19:18   2 bits  FLOAT_DENORM_MODE_16_64         Wavefront starts execution
3230                                                     with specified denorm mode
3231                                                     for half/double (16
3232                                                     and 64-bit) floating point
3233                                                     precision floating point
3234                                                     operations.
3235
3236                                                     Floating point denorm mode
3237                                                     values are defined in
3238                                                     :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
3239
3240                                                     Used by CP to set up
3241                                                     ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
3242     20      1 bit   PRIV                            Must be 0.
3243
3244                                                     Start executing wavefront
3245                                                     in privilege trap handler
3246                                                     mode.
3247
3248                                                     CP is responsible for
3249                                                     filling in
3250                                                     ``COMPUTE_PGM_RSRC1.PRIV``.
3251     21      1 bit   ENABLE_DX10_CLAMP               Wavefront starts execution
3252                                                     with DX10 clamp mode
3253                                                     enabled. Used by the vector
3254                                                     ALU to force DX10 style
3255                                                     treatment of NaN's (when
3256                                                     set, clamp NaN to zero,
3257                                                     otherwise pass NaN
3258                                                     through).
3259
3260                                                     Used by CP to set up
3261                                                     ``COMPUTE_PGM_RSRC1.DX10_CLAMP``.
3262     22      1 bit   DEBUG_MODE                      Must be 0.
3263
3264                                                     Start executing wavefront
3265                                                     in single step mode.
3266
3267                                                     CP is responsible for
3268                                                     filling in
3269                                                     ``COMPUTE_PGM_RSRC1.DEBUG_MODE``.
3270     23      1 bit   ENABLE_IEEE_MODE                Wavefront starts execution
3271                                                     with IEEE mode
3272                                                     enabled. Floating point
3273                                                     opcodes that support
3274                                                     exception flag gathering
3275                                                     will quiet and propagate
3276                                                     signaling-NaN inputs per
3277                                                     IEEE 754-2008. Min_dx10 and
3278                                                     max_dx10 become IEEE
3279                                                     754-2008 compliant due to
3280                                                     signaling-NaN propagation
3281                                                     and quieting.
3282
3283                                                     Used by CP to set up
3284                                                     ``COMPUTE_PGM_RSRC1.IEEE_MODE``.
3285     24      1 bit   BULKY                           Must be 0.
3286
3287                                                     Only one work-group allowed
3288                                                     to execute on a compute
3289                                                     unit.
3290
3291                                                     CP is responsible for
3292                                                     filling in
3293                                                     ``COMPUTE_PGM_RSRC1.BULKY``.
3294     25      1 bit   CDBG_USER                       Must be 0.
3295
3296                                                     Flag that can be used to
3297                                                     control debugging code.
3298
3299                                                     CP is responsible for
3300                                                     filling in
3301                                                     ``COMPUTE_PGM_RSRC1.CDBG_USER``.
3302     26      1 bit   FP16_OVFL                       GFX6-GFX8
3303                                                       Reserved, must be 0.
3304                                                     GFX9-GFX10
3305                                                       Wavefront starts execution
3306                                                       with specified fp16 overflow
3307                                                       mode.
3308
3309                                                       - If 0, fp16 overflow generates
3310                                                         +/-INF values.
3311                                                       - If 1, fp16 overflow that is the
3312                                                         result of an +/-INF input value
3313                                                         or divide by 0 produces a +/-INF,
3314                                                         otherwise clamps computed
3315                                                         overflow to +/-MAX_FP16 as
3316                                                         appropriate.
3317
3318                                                       Used by CP to set up
3319                                                       ``COMPUTE_PGM_RSRC1.FP16_OVFL``.
3320     28:27   2 bits                                  Reserved, must be 0.
3321     29      1 bit    WGP_MODE                       GFX6-GFX9
3322                                                       Reserved, must be 0.
3323                                                     GFX10
3324                                                       - If 0 execute work-groups in
3325                                                         CU wavefront execution mode.
3326                                                       - If 1 execute work-groups on
3327                                                         in WGP wavefront execution mode.
3328
3329                                                       See :ref:`amdgpu-amdhsa-memory-model`.
3330
3331                                                       Used by CP to set up
3332                                                       ``COMPUTE_PGM_RSRC1.WGP_MODE``.
3333     30      1 bit    MEM_ORDERED                    GFX6-9
3334                                                       Reserved, must be 0.
3335                                                     GFX10
3336                                                       Controls the behavior of the
3337                                                       waitcnt's vmcnt and vscnt
3338                                                       counters.
3339
3340                                                       - If 0 vmcnt reports completion
3341                                                         of load and atomic with return
3342                                                         out of order with sample
3343                                                         instructions, and the vscnt
3344                                                         reports the completion of
3345                                                         store and atomic without
3346                                                         return in order.
3347                                                       - If 1 vmcnt reports completion
3348                                                         of load, atomic with return
3349                                                         and sample instructions in
3350                                                         order, and the vscnt reports
3351                                                         the completion of store and
3352                                                         atomic without return in order.
3353
3354                                                       Used by CP to set up
3355                                                       ``COMPUTE_PGM_RSRC1.MEM_ORDERED``.
3356     31      1 bit    FWD_PROGRESS                   GFX6-9
3357                                                       Reserved, must be 0.
3358                                                     GFX10
3359                                                       - If 0 execute SIMD wavefronts
3360                                                         using oldest first policy.
3361                                                       - If 1 execute SIMD wavefronts to
3362                                                         ensure wavefronts will make some
3363                                                         forward progress.
3364
3365                                                       Used by CP to set up
3366                                                       ``COMPUTE_PGM_RSRC1.FWD_PROGRESS``.
3367     32      **Total size 4 bytes**
3368     ======= ===================================================================================================================
3369
3370..
3371
3372  .. table:: compute_pgm_rsrc2 for GFX6-GFX10
3373     :name: amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table
3374
3375     ======= ======= =============================== ===========================================================================
3376     Bits    Size    Field Name                      Description
3377     ======= ======= =============================== ===========================================================================
3378     0       1 bit   ENABLE_SGPR_PRIVATE_SEGMENT     Enable the setup of the
3379                     _WAVEFRONT_OFFSET               SGPR wavefront scratch offset
3380                                                     system register (see
3381                                                     :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
3382
3383                                                     Used by CP to set up
3384                                                     ``COMPUTE_PGM_RSRC2.SCRATCH_EN``.
3385     5:1     5 bits  USER_SGPR_COUNT                 The total number of SGPR
3386                                                     user data registers
3387                                                     requested. This number must
3388                                                     match the number of user
3389                                                     data registers enabled.
3390
3391                                                     Used by CP to set up
3392                                                     ``COMPUTE_PGM_RSRC2.USER_SGPR``.
3393     6       1 bit   ENABLE_TRAP_HANDLER             Must be 0.
3394
3395                                                     This bit represents
3396                                                     ``COMPUTE_PGM_RSRC2.TRAP_PRESENT``,
3397                                                     which is set by the CP if
3398                                                     the runtime has installed a
3399                                                     trap handler.
3400     7       1 bit   ENABLE_SGPR_WORKGROUP_ID_X      Enable the setup of the
3401                                                     system SGPR register for
3402                                                     the work-group id in the X
3403                                                     dimension (see
3404                                                     :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
3405
3406                                                     Used by CP to set up
3407                                                     ``COMPUTE_PGM_RSRC2.TGID_X_EN``.
3408     8       1 bit   ENABLE_SGPR_WORKGROUP_ID_Y      Enable the setup of the
3409                                                     system SGPR register for
3410                                                     the work-group id in the Y
3411                                                     dimension (see
3412                                                     :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
3413
3414                                                     Used by CP to set up
3415                                                     ``COMPUTE_PGM_RSRC2.TGID_Y_EN``.
3416     9       1 bit   ENABLE_SGPR_WORKGROUP_ID_Z      Enable the setup of the
3417                                                     system SGPR register for
3418                                                     the work-group id in the Z
3419                                                     dimension (see
3420                                                     :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
3421
3422                                                     Used by CP to set up
3423                                                     ``COMPUTE_PGM_RSRC2.TGID_Z_EN``.
3424     10      1 bit   ENABLE_SGPR_WORKGROUP_INFO      Enable the setup of the
3425                                                     system SGPR register for
3426                                                     work-group information (see
3427                                                     :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
3428
3429                                                     Used by CP to set up
3430                                                     ``COMPUTE_PGM_RSRC2.TGID_SIZE_EN``.
3431     12:11   2 bits  ENABLE_VGPR_WORKITEM_ID         Enable the setup of the
3432                                                     VGPR system registers used
3433                                                     for the work-item ID.
3434                                                     :ref:`amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table`
3435                                                     defines the values.
3436
3437                                                     Used by CP to set up
3438                                                     ``COMPUTE_PGM_RSRC2.TIDIG_CMP_CNT``.
3439     13      1 bit   ENABLE_EXCEPTION_ADDRESS_WATCH  Must be 0.
3440
3441                                                     Wavefront starts execution
3442                                                     with address watch
3443                                                     exceptions enabled which
3444                                                     are generated when L1 has
3445                                                     witnessed a thread access
3446                                                     an *address of
3447                                                     interest*.
3448
3449                                                     CP is responsible for
3450                                                     filling in the address
3451                                                     watch bit in
3452                                                     ``COMPUTE_PGM_RSRC2.EXCP_EN_MSB``
3453                                                     according to what the
3454                                                     runtime requests.
3455     14      1 bit   ENABLE_EXCEPTION_MEMORY         Must be 0.
3456
3457                                                     Wavefront starts execution
3458                                                     with memory violation
3459                                                     exceptions exceptions
3460                                                     enabled which are generated
3461                                                     when a memory violation has
3462                                                     occurred for this wavefront from
3463                                                     L1 or LDS
3464                                                     (write-to-read-only-memory,
3465                                                     mis-aligned atomic, LDS
3466                                                     address out of range,
3467                                                     illegal address, etc.).
3468
3469                                                     CP sets the memory
3470                                                     violation bit in
3471                                                     ``COMPUTE_PGM_RSRC2.EXCP_EN_MSB``
3472                                                     according to what the
3473                                                     runtime requests.
3474     23:15   9 bits  GRANULATED_LDS_SIZE             Must be 0.
3475
3476                                                     CP uses the rounded value
3477                                                     from the dispatch packet,
3478                                                     not this value, as the
3479                                                     dispatch may contain
3480                                                     dynamically allocated group
3481                                                     segment memory. CP writes
3482                                                     directly to
3483                                                     ``COMPUTE_PGM_RSRC2.LDS_SIZE``.
3484
3485                                                     Amount of group segment
3486                                                     (LDS) to allocate for each
3487                                                     work-group. Granularity is
3488                                                     device specific:
3489
3490                                                     GFX6:
3491                                                       roundup(lds-size / (64 * 4))
3492                                                     GFX7-GFX10:
3493                                                       roundup(lds-size / (128 * 4))
3494
3495     24      1 bit   ENABLE_EXCEPTION_IEEE_754_FP    Wavefront starts execution
3496                     _INVALID_OPERATION              with specified exceptions
3497                                                     enabled.
3498
3499                                                     Used by CP to set up
3500                                                     ``COMPUTE_PGM_RSRC2.EXCP_EN``
3501                                                     (set from bits 0..6).
3502
3503                                                     IEEE 754 FP Invalid
3504                                                     Operation
3505     25      1 bit   ENABLE_EXCEPTION_FP_DENORMAL    FP Denormal one or more
3506                     _SOURCE                         input operands is a
3507                                                     denormal number
3508     26      1 bit   ENABLE_EXCEPTION_IEEE_754_FP    IEEE 754 FP Division by
3509                     _DIVISION_BY_ZERO               Zero
3510     27      1 bit   ENABLE_EXCEPTION_IEEE_754_FP    IEEE 754 FP FP Overflow
3511                     _OVERFLOW
3512     28      1 bit   ENABLE_EXCEPTION_IEEE_754_FP    IEEE 754 FP Underflow
3513                     _UNDERFLOW
3514     29      1 bit   ENABLE_EXCEPTION_IEEE_754_FP    IEEE 754 FP Inexact
3515                     _INEXACT
3516     30      1 bit   ENABLE_EXCEPTION_INT_DIVIDE_BY  Integer Division by Zero
3517                     _ZERO                           (rcp_iflag_f32 instruction
3518                                                     only)
3519     31      1 bit                                   Reserved, must be 0.
3520     32      **Total size 4 bytes.**
3521     ======= ===================================================================================================================
3522
3523..
3524
3525  .. table:: compute_pgm_rsrc3 for GFX10
3526     :name: amdgpu-amdhsa-compute_pgm_rsrc3-gfx10-table
3527
3528     ======= ======= =============================== ===========================================================================
3529     Bits    Size    Field Name                      Description
3530     ======= ======= =============================== ===========================================================================
3531     3:0     4 bits  SHARED_VGPR_COUNT               Number of shared VGPRs for wavefront size 64. Granularity 8. Value 0-120.
3532                                                     compute_pgm_rsrc1.vgprs + shared_vgpr_cnt cannot exceed 64.
3533     31:4    28                                      Reserved, must be 0.
3534             bits
3535     32      **Total size 4 bytes.**
3536     ======= ===================================================================================================================
3537
3538..
3539
3540  .. table:: Floating Point Rounding Mode Enumeration Values
3541     :name: amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table
3542
3543     ====================================== ===== ==============================
3544     Enumeration Name                       Value Description
3545     ====================================== ===== ==============================
3546     FLOAT_ROUND_MODE_NEAR_EVEN             0     Round Ties To Even
3547     FLOAT_ROUND_MODE_PLUS_INFINITY         1     Round Toward +infinity
3548     FLOAT_ROUND_MODE_MINUS_INFINITY        2     Round Toward -infinity
3549     FLOAT_ROUND_MODE_ZERO                  3     Round Toward 0
3550     ====================================== ===== ==============================
3551
3552..
3553
3554  .. table:: Floating Point Denorm Mode Enumeration Values
3555     :name: amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table
3556
3557     ====================================== ===== ==============================
3558     Enumeration Name                       Value Description
3559     ====================================== ===== ==============================
3560     FLOAT_DENORM_MODE_FLUSH_SRC_DST        0     Flush Source and Destination
3561                                                  Denorms
3562     FLOAT_DENORM_MODE_FLUSH_DST            1     Flush Output Denorms
3563     FLOAT_DENORM_MODE_FLUSH_SRC            2     Flush Source Denorms
3564     FLOAT_DENORM_MODE_FLUSH_NONE           3     No Flush
3565     ====================================== ===== ==============================
3566
3567..
3568
3569  .. table:: System VGPR Work-Item ID Enumeration Values
3570     :name: amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table
3571
3572     ======================================== ===== ============================
3573     Enumeration Name                         Value Description
3574     ======================================== ===== ============================
3575     SYSTEM_VGPR_WORKITEM_ID_X                0     Set work-item X dimension
3576                                                    ID.
3577     SYSTEM_VGPR_WORKITEM_ID_X_Y              1     Set work-item X and Y
3578                                                    dimensions ID.
3579     SYSTEM_VGPR_WORKITEM_ID_X_Y_Z            2     Set work-item X, Y and Z
3580                                                    dimensions ID.
3581     SYSTEM_VGPR_WORKITEM_ID_UNDEFINED        3     Undefined.
3582     ======================================== ===== ============================
3583
3584.. _amdgpu-amdhsa-initial-kernel-execution-state:
3585
3586Initial Kernel Execution State
3587~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
3588
3589This section defines the register state that will be set up by the packet
3590processor prior to the start of execution of every wavefront. This is limited by
3591the constraints of the hardware controllers of CP/ADC/SPI.
3592
3593The order of the SGPR registers is defined, but the compiler can specify which
3594ones are actually setup in the kernel descriptor using the ``enable_sgpr_*`` bit
3595fields (see :ref:`amdgpu-amdhsa-kernel-descriptor`). The register numbers used
3596for enabled registers are dense starting at SGPR0: the first enabled register is
3597SGPR0, the next enabled register is SGPR1 etc.; disabled registers do not have
3598an SGPR number.
3599
3600The initial SGPRs comprise up to 16 User SRGPs that are set by CP and apply to
3601all wavefronts of the grid. It is possible to specify more than 16 User SGPRs
3602using the ``enable_sgpr_*`` bit fields, in which case only the first 16 are
3603actually initialized. These are then immediately followed by the System SGPRs
3604that are set up by ADC/SPI and can have different values for each wavefront of
3605the grid dispatch.
3606
3607SGPR register initial state is defined in
3608:ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
3609
3610  .. table:: SGPR Register Set Up Order
3611     :name: amdgpu-amdhsa-sgpr-register-set-up-order-table
3612
3613     ========== ========================== ====== ==============================
3614     SGPR Order Name                       Number Description
3615                (kernel descriptor enable  of
3616                field)                     SGPRs
3617     ========== ========================== ====== ==============================
3618     First      Private Segment Buffer     4      V# that can be used, together
3619                (enable_sgpr_private              with Scratch Wavefront Offset
3620                _segment_buffer)                  as an offset, to access the
3621                                                  private address space using a
3622                                                  segment address.
3623
3624                                                  CP uses the value provided by
3625                                                  the runtime.
3626     then       Dispatch Ptr               2      64-bit address of AQL dispatch
3627                (enable_sgpr_dispatch_ptr)        packet for kernel dispatch
3628                                                  actually executing.
3629     then       Queue Ptr                  2      64-bit address of amd_queue_t
3630                (enable_sgpr_queue_ptr)           object for AQL queue on which
3631                                                  the dispatch packet was
3632                                                  queued.
3633     then       Kernarg Segment Ptr        2      64-bit address of Kernarg
3634                (enable_sgpr_kernarg              segment. This is directly
3635                _segment_ptr)                     copied from the
3636                                                  kernarg_address in the kernel
3637                                                  dispatch packet.
3638
3639                                                  Having CP load it once avoids
3640                                                  loading it at the beginning of
3641                                                  every wavefront.
3642     then       Dispatch Id                2      64-bit Dispatch ID of the
3643                (enable_sgpr_dispatch_id)         dispatch packet being
3644                                                  executed.
3645     then       Flat Scratch Init          2      This is 2 SGPRs:
3646                (enable_sgpr_flat_scratch
3647                _init)                            GFX6
3648                                                    Not supported.
3649                                                  GFX7-GFX8
3650                                                    The first SGPR is a 32-bit
3651                                                    byte offset from
3652                                                    ``SH_HIDDEN_PRIVATE_BASE_VIMID``
3653                                                    to per SPI base of memory
3654                                                    for scratch for the queue
3655                                                    executing the kernel
3656                                                    dispatch. CP obtains this
3657                                                    from the runtime. (The
3658                                                    Scratch Segment Buffer base
3659                                                    address is
3660                                                    ``SH_HIDDEN_PRIVATE_BASE_VIMID``
3661                                                    plus this offset.) The value
3662                                                    of Scratch Wavefront Offset must
3663                                                    be added to this offset by
3664                                                    the kernel machine code,
3665                                                    right shifted by 8, and
3666                                                    moved to the FLAT_SCRATCH_HI
3667                                                    SGPR register.
3668                                                    FLAT_SCRATCH_HI corresponds
3669                                                    to SGPRn-4 on GFX7, and
3670                                                    SGPRn-6 on GFX8 (where SGPRn
3671                                                    is the highest numbered SGPR
3672                                                    allocated to the wavefront).
3673                                                    FLAT_SCRATCH_HI is
3674                                                    multiplied by 256 (as it is
3675                                                    in units of 256 bytes) and
3676                                                    added to
3677                                                    ``SH_HIDDEN_PRIVATE_BASE_VIMID``
3678                                                    to calculate the per wavefront
3679                                                    FLAT SCRATCH BASE in flat
3680                                                    memory instructions that
3681                                                    access the scratch
3682                                                    aperture.
3683
3684                                                    The second SGPR is 32-bit
3685                                                    byte size of a single
3686                                                    work-item's scratch memory
3687                                                    usage. CP obtains this from
3688                                                    the runtime, and it is
3689                                                    always a multiple of DWORD.
3690                                                    CP checks that the value in
3691                                                    the kernel dispatch packet
3692                                                    Private Segment Byte Size is
3693                                                    not larger and requests the
3694                                                    runtime to increase the
3695                                                    queue's scratch size if
3696                                                    necessary. The kernel code
3697                                                    must move it to
3698                                                    FLAT_SCRATCH_LO which is
3699                                                    SGPRn-3 on GFX7 and SGPRn-5
3700                                                    on GFX8. FLAT_SCRATCH_LO is
3701                                                    used as the FLAT SCRATCH
3702                                                    SIZE in flat memory
3703                                                    instructions. Having CP load
3704                                                    it once avoids loading it at
3705                                                    the beginning of every
3706                                                    wavefront.
3707                                                  GFX9-GFX10
3708                                                    This is the
3709                                                    64-bit base address of the
3710                                                    per SPI scratch backing
3711                                                    memory managed by SPI for
3712                                                    the queue executing the
3713                                                    kernel dispatch. CP obtains
3714                                                    this from the runtime (and
3715                                                    divides it if there are
3716                                                    multiple Shader Arrays each
3717                                                    with its own SPI). The value
3718                                                    of Scratch Wavefront Offset must
3719                                                    be added by the kernel
3720                                                    machine code and the result
3721                                                    moved to the FLAT_SCRATCH
3722                                                    SGPR which is SGPRn-6 and
3723                                                    SGPRn-5. It is used as the
3724                                                    FLAT SCRATCH BASE in flat
3725                                                    memory instructions.
3726     then       Private Segment Size       1      The 32-bit byte size of a
3727                                                  (enable_sgpr_private single
3728                                                  work-item's
3729                                                  scratch_segment_size) memory
3730                                                  allocation. This is the
3731                                                  value from the kernel
3732                                                  dispatch packet Private
3733                                                  Segment Byte Size rounded up
3734                                                  by CP to a multiple of
3735                                                  DWORD.
3736
3737                                                  Having CP load it once avoids
3738                                                  loading it at the beginning of
3739                                                  every wavefront.
3740
3741                                                  This is not used for
3742                                                  GFX7-GFX8 since it is the same
3743                                                  value as the second SGPR of
3744                                                  Flat Scratch Init. However, it
3745                                                  may be needed for GFX9-GFX10 which
3746                                                  changes the meaning of the
3747                                                  Flat Scratch Init value.
3748     then       Grid Work-Group Count X    1      32-bit count of the number of
3749                (enable_sgpr_grid                 work-groups in the X dimension
3750                _workgroup_count_X)               for the grid being
3751                                                  executed. Computed from the
3752                                                  fields in the kernel dispatch
3753                                                  packet as ((grid_size.x +
3754                                                  workgroup_size.x - 1) /
3755                                                  workgroup_size.x).
3756     then       Grid Work-Group Count Y    1      32-bit count of the number of
3757                (enable_sgpr_grid                 work-groups in the Y dimension
3758                _workgroup_count_Y &&             for the grid being
3759                less than 16 previous             executed. Computed from the
3760                SGPRs)                            fields in the kernel dispatch
3761                                                  packet as ((grid_size.y +
3762                                                  workgroup_size.y - 1) /
3763                                                  workgroupSize.y).
3764
3765                                                  Only initialized if <16
3766                                                  previous SGPRs initialized.
3767     then       Grid Work-Group Count Z    1      32-bit count of the number of
3768                (enable_sgpr_grid                 work-groups in the Z dimension
3769                _workgroup_count_Z &&             for the grid being
3770                less than 16 previous             executed. Computed from the
3771                SGPRs)                            fields in the kernel dispatch
3772                                                  packet as ((grid_size.z +
3773                                                  workgroup_size.z - 1) /
3774                                                  workgroupSize.z).
3775
3776                                                  Only initialized if <16
3777                                                  previous SGPRs initialized.
3778     then       Work-Group Id X            1      32-bit work-group id in X
3779                (enable_sgpr_workgroup_id         dimension of grid for
3780                _X)                               wavefront.
3781     then       Work-Group Id Y            1      32-bit work-group id in Y
3782                (enable_sgpr_workgroup_id         dimension of grid for
3783                _Y)                               wavefront.
3784     then       Work-Group Id Z            1      32-bit work-group id in Z
3785                (enable_sgpr_workgroup_id         dimension of grid for
3786                _Z)                               wavefront.
3787     then       Work-Group Info            1      {first_wavefront, 14'b0000,
3788                (enable_sgpr_workgroup            ordered_append_term[10:0],
3789                _info)                            threadgroup_size_in_wavefronts[5:0]}
3790     then       Scratch Wavefront Offset   1      32-bit byte offset from base
3791                (enable_sgpr_private              of scratch base of queue
3792                _segment_wavefront_offset)        executing the kernel
3793                                                  dispatch. Must be used as an
3794                                                  offset with Private
3795                                                  segment address when using
3796                                                  Scratch Segment Buffer. It
3797                                                  must be used to set up FLAT
3798                                                  SCRATCH for flat addressing
3799                                                  (see
3800                                                  :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`).
3801     ========== ========================== ====== ==============================
3802
3803The order of the VGPR registers is defined, but the compiler can specify which
3804ones are actually setup in the kernel descriptor using the ``enable_vgpr*`` bit
3805fields (see :ref:`amdgpu-amdhsa-kernel-descriptor`). The register numbers used
3806for enabled registers are dense starting at VGPR0: the first enabled register is
3807VGPR0, the next enabled register is VGPR1 etc.; disabled registers do not have a
3808VGPR number.
3809
3810VGPR register initial state is defined in
3811:ref:`amdgpu-amdhsa-vgpr-register-set-up-order-table`.
3812
3813  .. table:: VGPR Register Set Up Order
3814     :name: amdgpu-amdhsa-vgpr-register-set-up-order-table
3815
3816     ========== ========================== ====== ==============================
3817     VGPR Order Name                       Number Description
3818                (kernel descriptor enable  of
3819                field)                     VGPRs
3820     ========== ========================== ====== ==============================
3821     First      Work-Item Id X             1      32-bit work item id in X
3822                (Always initialized)              dimension of work-group for
3823                                                  wavefront lane.
3824     then       Work-Item Id Y             1      32-bit work item id in Y
3825                (enable_vgpr_workitem_id          dimension of work-group for
3826                > 0)                              wavefront lane.
3827     then       Work-Item Id Z             1      32-bit work item id in Z
3828                (enable_vgpr_workitem_id          dimension of work-group for
3829                > 1)                              wavefront lane.
3830     ========== ========================== ====== ==============================
3831
3832The setting of registers is done by GPU CP/ADC/SPI hardware as follows:
3833
38341. SGPRs before the Work-Group Ids are set by CP using the 16 User Data
3835   registers.
38362. Work-group Id registers X, Y, Z are set by ADC which supports any
3837   combination including none.
38383. Scratch Wavefront Offset is set by SPI in a per wavefront basis which is why
3839   its value cannot be included with the flat scratch init value which is per
3840   queue.
38414. The VGPRs are set by SPI which only supports specifying either (X), (X, Y)
3842   or (X, Y, Z).
3843
3844Flat Scratch register pair are adjacent SGPRs so they can be moved as a 64-bit
3845value to the hardware required SGPRn-3 and SGPRn-4 respectively.
3846
3847The global segment can be accessed either using buffer instructions (GFX6 which
3848has V# 64-bit address support), flat instructions (GFX7-GFX10), or global
3849instructions (GFX9-GFX10).
3850
3851If buffer operations are used, then the compiler can generate a V# with the
3852following properties:
3853
3854* base address of 0
3855* no swizzle
3856* ATC: 1 if IOMMU present (such as APU)
3857* ptr64: 1
3858* MTYPE set to support memory coherence that matches the runtime (such as CC for
3859  APU and NC for dGPU).
3860
3861.. _amdgpu-amdhsa-kernel-prolog:
3862
3863Kernel Prolog
3864~~~~~~~~~~~~~
3865
3866The compiler performs initialization in the kernel prologue depending on the
3867target and information about things like stack usage in the kernel and called
3868functions. Some of this initialization requires the compiler to request certain
3869User and System SGPRs be present in the
3870:ref:`amdgpu-amdhsa-initial-kernel-execution-state` via the
3871:ref:`amdgpu-amdhsa-kernel-descriptor`.
3872
3873.. _amdgpu-amdhsa-kernel-prolog-cfi:
3874
3875CFI
3876+++
3877
38781.  The CFI return address is undefined.
3879
38802.  The CFI CFA is defined using an expression which evaluates to a location
3881    description that comprises one memory location description for the
3882    ``DW_ASPACE_AMDGPU_private_lane`` address space address ``0``.
3883
3884.. _amdgpu-amdhsa-kernel-prolog-m0:
3885
3886M0
3887++
3888
3889GFX6-GFX8
3890  The M0 register must be initialized with a value at least the total LDS size
3891  if the kernel may access LDS via DS or flat operations. Total LDS size is
3892  available in dispatch packet. For M0, it is also possible to use maximum
3893  possible value of LDS for given target (0x7FFF for GFX6 and 0xFFFF for
3894  GFX7-GFX8).
3895GFX9-GFX10
3896  The M0 register is not used for range checking LDS accesses and so does not
3897  need to be initialized in the prolog.
3898
3899.. _amdgpu-amdhsa-kernel-prolog-stack-pointer:
3900
3901Stack Pointer
3902+++++++++++++
3903
3904If the kernel has function calls it must set up the ABI stack pointer described
3905in :ref:`amdgpu-amdhsa-function-call-convention-non-kernel-functions` by setting
3906SGPR32 to the unswizzled scratch offset of the address past the last local
3907allocation.
3908
3909.. _amdgpu-amdhsa-kernel-prolog-frame-pointer:
3910
3911Frame Pointer
3912+++++++++++++
3913
3914If the kernel needs a frame pointer for the reasons defined in
3915``SIFrameLowering`` then SGPR33 is used and is always set to ``0`` in the
3916kernel prolog. If a frame pointer is not required then all uses of the frame
3917pointer are replaced with immediate ``0`` offsets.
3918
3919.. _amdgpu-amdhsa-kernel-prolog-flat-scratch:
3920
3921Flat Scratch
3922++++++++++++
3923
3924If the kernel or any function it calls may use flat operations to access
3925scratch memory, the prolog code must set up the FLAT_SCRATCH register pair
3926(FLAT_SCRATCH_LO/FLAT_SCRATCH_HI which are in SGPRn-4/SGPRn-3). Initialization
3927uses Flat Scratch Init and Scratch Wavefront Offset SGPR registers (see
3928:ref:`amdgpu-amdhsa-initial-kernel-execution-state`):
3929
3930GFX6
3931  Flat scratch is not supported.
3932
3933GFX7-GFX8
3934
3935  1. The low word of Flat Scratch Init is 32-bit byte offset from
3936     ``SH_HIDDEN_PRIVATE_BASE_VIMID`` to the base of scratch backing memory
3937     being managed by SPI for the queue executing the kernel dispatch. This is
3938     the same value used in the Scratch Segment Buffer V# base address. The
3939     prolog must add the value of Scratch Wavefront Offset to get the
3940     wavefront's byte scratch backing memory offset from
3941     ``SH_HIDDEN_PRIVATE_BASE_VIMID``. Since FLAT_SCRATCH_LO is in units of 256
3942     bytes, the offset must be right shifted by 8 before moving into
3943     FLAT_SCRATCH_LO.
3944  2. The second word of Flat Scratch Init is 32-bit byte size of a single
3945     work-items scratch memory usage. This is directly loaded from the kernel
3946     dispatch packet Private Segment Byte Size and rounded up to a multiple of
3947     DWORD. Having CP load it once avoids loading it at the beginning of every
3948     wavefront. The prolog must move it to FLAT_SCRATCH_LO for use as FLAT
3949     SCRATCH SIZE.
3950
3951GFX9-GFX10
3952  The Flat Scratch Init is the 64-bit address of the base of scratch backing
3953  memory being managed by SPI for the queue executing the kernel dispatch. The
3954  prolog must add the value of Scratch Wavefront Offset and moved to the
3955  FLAT_SCRATCH pair for use as the flat scratch base in flat memory
3956  instructions.
3957
3958.. _amdgpu-amdhsa-kernel-prolog-private-segment-buffer:
3959
3960Private Segment Buffer
3961++++++++++++++++++++++
3962
3963A set of four SGPRs beginning at a four-aligned SGPR index are always selected
3964to serve as the scratch V# for the kernel as follows:
3965
3966  - If it is known during instruction selection that there is stack usage,
3967    SGPR0-3 is reserved for use as the scratch V#.  Stack usage is assumed if
3968    optimizations are disabled (``-O0``), if stack objects already exist (for
3969    locals, etc.), or if there are any function calls.
3970
3971  - Otherwise, four high numbered SGPRs beginning at a four-aligned SGPR index
3972    are reserved for the tentative scratch V#. These will be used if it is
3973    determined that spilling is needed.
3974
3975    - If no use is made of the tentative scratch V#, then it is unreserved,
3976      and the register count is determined ignoring it.
3977    - If use is made of the tentative scratch V#, then its register numbers
3978      are shifted to the first four-aligned SGPR index after the highest one
3979      allocated by the register allocator, and all uses are updated. The
3980      register count includes them in the shifted location.
3981    - In either case, if the processor has the SGPR allocation bug, the
3982      tentative allocation is not shifted or unreserved in order to ensure
3983      the register count is higher to workaround the bug.
3984
3985    .. note::
3986
3987      This approach of using a tentative scratch V# and shifting the register
3988      numbers if used avoids having to perform register allocation a second
3989      time if the tentative V# is eliminated. This is more efficient and
3990      avoids the problem that the second register allocation may perform
3991      spilling which will fail as there is no longer a scratch V#.
3992
3993When the kernel prolog code is being emitted it is known whether the scratch V#
3994described above is actually used. If it is, the prolog code must set it up by
3995copying the Private Segment Buffer to the scratch V# registers and then adding
3996the Private Segment Wavefront Offset to the queue base address in the V#. The
3997result is a V# with a base address pointing to the beginning of the wavefront
3998scratch backing memory.
3999
4000The Private Segment Buffer is always requested, but the Private Segment
4001Wavefront Offset is only requested if it is used (see
4002:ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
4003
4004.. _amdgpu-amdhsa-memory-model:
4005
4006Memory Model
4007~~~~~~~~~~~~
4008
4009This section describes the mapping of LLVM memory model onto AMDGPU machine code
4010(see :ref:`memmodel`).
4011
4012The AMDGPU backend supports the memory synchronization scopes specified in
4013:ref:`amdgpu-memory-scopes`.
4014
4015The code sequences used to implement the memory model are defined in table
4016:ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx10-table`.
4017
4018The sequences specify the order of instructions that a single thread must
4019execute. The ``s_waitcnt`` and ``buffer_wbinvl1_vol`` are defined with respect
4020to other memory instructions executed by the same thread. This allows them to be
4021moved earlier or later which can allow them to be combined with other instances
4022of the same instruction, or hoisted/sunk out of loops to improve
4023performance. Only the instructions related to the memory model are given;
4024additional ``s_waitcnt`` instructions are required to ensure registers are
4025defined before being used. These may be able to be combined with the memory
4026model ``s_waitcnt`` instructions as described above.
4027
4028The AMDGPU backend supports the following memory models:
4029
4030  HSA Memory Model [HSA]_
4031    The HSA memory model uses a single happens-before relation for all address
4032    spaces (see :ref:`amdgpu-address-spaces`).
4033  OpenCL Memory Model [OpenCL]_
4034    The OpenCL memory model which has separate happens-before relations for the
4035    global and local address spaces. Only a fence specifying both global and
4036    local address space, and seq_cst instructions join the relationships. Since
4037    the LLVM ``memfence`` instruction does not allow an address space to be
4038    specified the OpenCL fence has to conservatively assume both local and
4039    global address space was specified. However, optimizations can often be
4040    done to eliminate the additional ``s_waitcnt`` instructions when there are
4041    no intervening memory instructions which access the corresponding address
4042    space. The code sequences in the table indicate what can be omitted for the
4043    OpenCL memory. The target triple environment is used to determine if the
4044    source language is OpenCL (see :ref:`amdgpu-opencl`).
4045
4046``ds/flat_load/store/atomic`` instructions to local memory are termed LDS
4047operations.
4048
4049``buffer/global/flat_load/store/atomic`` instructions to global memory are
4050termed vector memory operations.
4051
4052For GFX6-GFX9:
4053
4054* Each agent has multiple shader arrays (SA).
4055* Each SA has multiple compute units (CU).
4056* Each CU has multiple SIMDs that execute wavefronts.
4057* The wavefronts for a single work-group are executed in the same CU but may be
4058  executed by different SIMDs.
4059* Each CU has a single LDS memory shared by the wavefronts of the work-groups
4060  executing on it.
4061* All LDS operations of a CU are performed as wavefront wide operations in a
4062  global order and involve no caching. Completion is reported to a wavefront in
4063  execution order.
4064* The LDS memory has multiple request queues shared by the SIMDs of a
4065  CU. Therefore, the LDS operations performed by different wavefronts of a
4066  work-group can be reordered relative to each other, which can result in
4067  reordering the visibility of vector memory operations with respect to LDS
4068  operations of other wavefronts in the same work-group. A ``s_waitcnt
4069  lgkmcnt(0)`` is required to ensure synchronization between LDS operations and
4070  vector memory operations between wavefronts of a work-group, but not between
4071  operations performed by the same wavefront.
4072* The vector memory operations are performed as wavefront wide operations and
4073  completion is reported to a wavefront in execution order. The exception is
4074  that for GFX7-GFX9 ``flat_load/store/atomic`` instructions can report out of
4075  vector memory order if they access LDS memory, and out of LDS operation order
4076  if they access global memory.
4077* The vector memory operations access a single vector L1 cache shared by all
4078  SIMDs a CU. Therefore, no special action is required for coherence between the
4079  lanes of a single wavefront, or for coherence between wavefronts in the same
4080  work-group. A ``buffer_wbinvl1_vol`` is required for coherence between
4081  wavefronts executing in different work-groups as they may be executing on
4082  different CUs.
4083* The scalar memory operations access a scalar L1 cache shared by all wavefronts
4084  on a group of CUs. The scalar and vector L1 caches are not coherent. However,
4085  scalar operations are used in a restricted way so do not impact the memory
4086  model. See :ref:`amdgpu-address-spaces`.
4087* The vector and scalar memory operations use an L2 cache shared by all CUs on
4088  the same agent.
4089* The L2 cache has independent channels to service disjoint ranges of virtual
4090  addresses.
4091* Each CU has a separate request queue per channel. Therefore, the vector and
4092  scalar memory operations performed by wavefronts executing in different
4093  work-groups (which may be executing on different CUs) of an agent can be
4094  reordered relative to each other. A ``s_waitcnt vmcnt(0)`` is required to
4095  ensure synchronization between vector memory operations of different CUs. It
4096  ensures a previous vector memory operation has completed before executing a
4097  subsequent vector memory or LDS operation and so can be used to meet the
4098  requirements of acquire and release.
4099* The L2 cache can be kept coherent with other agents on some targets, or ranges
4100  of virtual addresses can be set up to bypass it to ensure system coherence.
4101
4102For GFX10:
4103
4104* Each agent has multiple shader arrays (SA).
4105* Each SA has multiple work-group processors (WGP).
4106* Each WGP has multiple compute units (CU).
4107* Each CU has multiple SIMDs that execute wavefronts.
4108* The wavefronts for a single work-group are executed in the same
4109  WGP. In CU wavefront execution mode the wavefronts may be executed by
4110  different SIMDs in the same CU. In WGP wavefront execution mode the
4111  wavefronts may be executed by different SIMDs in different CUs in the same
4112  WGP.
4113* Each WGP has a single LDS memory shared by the wavefronts of the work-groups
4114  executing on it.
4115* All LDS operations of a WGP are performed as wavefront wide operations in a
4116  global order and involve no caching. Completion is reported to a wavefront in
4117  execution order.
4118* The LDS memory has multiple request queues shared by the SIMDs of a
4119  WGP. Therefore, the LDS operations performed by different wavefronts of a
4120  work-group can be reordered relative to each other, which can result in
4121  reordering the visibility of vector memory operations with respect to LDS
4122  operations of other wavefronts in the same work-group. A ``s_waitcnt
4123  lgkmcnt(0)`` is required to ensure synchronization between LDS operations and
4124  vector memory operations between wavefronts of a work-group, but not between
4125  operations performed by the same wavefront.
4126* The vector memory operations are performed as wavefront wide operations.
4127  Completion of load/store/sample operations are reported to a wavefront in
4128  execution order of other load/store/sample operations performed by that
4129  wavefront.
4130* The vector memory operations access a vector L0 cache. There is a single L0
4131  cache per CU. Each SIMD of a CU accesses the same L0 cache. Therefore, no
4132  special action is required for coherence between the lanes of a single
4133  wavefront. However, a ``BUFFER_GL0_INV`` is required for coherence between
4134  wavefronts executing in the same work-group as they may be executing on SIMDs
4135  of different CUs that access different L0s. A ``BUFFER_GL0_INV`` is also
4136  required for coherence between wavefronts executing in different work-groups
4137  as they may be executing on different WGPs.
4138* The scalar memory operations access a scalar L0 cache shared by all wavefronts
4139  on a WGP. The scalar and vector L0 caches are not coherent. However, scalar
4140  operations are used in a restricted way so do not impact the memory model. See
4141  :ref:`amdgpu-address-spaces`.
4142* The vector and scalar memory L0 caches use an L1 cache shared by all WGPs on
4143  the same SA. Therefore, no special action is required for coherence between
4144  the wavefronts of a single work-group. However, a ``BUFFER_GL1_INV`` is
4145  required for coherence between wavefronts executing in different work-groups
4146  as they may be executing on different SAs that access different L1s.
4147* The L1 caches have independent quadrants to service disjoint ranges of virtual
4148  addresses.
4149* Each L0 cache has a separate request queue per L1 quadrant. Therefore, the
4150  vector and scalar memory operations performed by different wavefronts, whether
4151  executing in the same or different work-groups (which may be executing on
4152  different CUs accessing different L0s), can be reordered relative to each
4153  other. A ``s_waitcnt vmcnt(0) & vscnt(0)`` is required to ensure
4154  synchronization between vector memory operations of different wavefronts. It
4155  ensures a previous vector memory operation has completed before executing a
4156  subsequent vector memory or LDS operation and so can be used to meet the
4157  requirements of acquire, release and sequential consistency.
4158* The L1 caches use an L2 cache shared by all SAs on the same agent.
4159* The L2 cache has independent channels to service disjoint ranges of virtual
4160  addresses.
4161* Each L1 quadrant of a single SA accesses a different L2 channel. Each L1
4162  quadrant has a separate request queue per L2 channel. Therefore, the vector
4163  and scalar memory operations performed by wavefronts executing in different
4164  work-groups (which may be executing on different SAs) of an agent can be
4165  reordered relative to each other. A ``s_waitcnt vmcnt(0) & vscnt(0)`` is
4166  required to ensure synchronization between vector memory operations of
4167  different SAs. It ensures a previous vector memory operation has completed
4168  before executing a subsequent vector memory and so can be used to meet the
4169  requirements of acquire, release and sequential consistency.
4170* The L2 cache can be kept coherent with other agents on some targets, or ranges
4171  of virtual addresses can be set up to bypass it to ensure system coherence.
4172
4173Private address space uses ``buffer_load/store`` using the scratch V#
4174(GFX6-GFX8), or ``scratch_load/store`` (GFX9-GFX10). Since only a single thread
4175is accessing the memory, atomic memory orderings are not meaningful, and all
4176accesses are treated as non-atomic.
4177
4178Constant address space uses ``buffer/global_load`` instructions (or equivalent
4179scalar memory instructions). Since the constant address space contents do not
4180change during the execution of a kernel dispatch it is not legal to perform
4181stores, and atomic memory orderings are not meaningful, and all access are
4182treated as non-atomic.
4183
4184A memory synchronization scope wider than work-group is not meaningful for the
4185group (LDS) address space and is treated as work-group.
4186
4187The memory model does not support the region address space which is treated as
4188non-atomic.
4189
4190Acquire memory ordering is not meaningful on store atomic instructions and is
4191treated as non-atomic.
4192
4193Release memory ordering is not meaningful on load atomic instructions and is
4194treated a non-atomic.
4195
4196Acquire-release memory ordering is not meaningful on load or store atomic
4197instructions and is treated as acquire and release respectively.
4198
4199AMDGPU backend only uses scalar memory operations to access memory that is
4200proven to not change during the execution of the kernel dispatch. This includes
4201constant address space and global address space for program scope const
4202variables. Therefore, the kernel machine code does not have to maintain the
4203scalar L1 cache to ensure it is coherent with the vector L1 cache. The scalar
4204and vector L1 caches are invalidated between kernel dispatches by CP since
4205constant address space data may change between kernel dispatch executions. See
4206:ref:`amdgpu-address-spaces`.
4207
4208The one exception is if scalar writes are used to spill SGPR registers. In this
4209case the AMDGPU backend ensures the memory location used to spill is never
4210accessed by vector memory operations at the same time. If scalar writes are used
4211then a ``s_dcache_wb`` is inserted before the ``s_endpgm`` and before a function
4212return since the locations may be used for vector memory instructions by a
4213future wavefront that uses the same scratch area, or a function call that
4214creates a frame at the same address, respectively. There is no need for a
4215``s_dcache_inv`` as all scalar writes are write-before-read in the same thread.
4216
4217For GFX6-GFX9, scratch backing memory (which is used for the private address
4218space) is accessed with MTYPE NC_NV (non-coherent non-volatile). Since the
4219private address space is only accessed by a single thread, and is always
4220write-before-read, there is never a need to invalidate these entries from the L1
4221cache. Hence all cache invalidates are done as ``*_vol`` to only invalidate the
4222volatile cache lines.
4223
4224For GFX10, scratch backing memory (which is used for the private address space)
4225is accessed with MTYPE NC (non-coherent). Since the private address space is
4226only accessed by a single thread, and is always write-before-read, there is
4227never a need to invalidate these entries from the L0 or L1 caches.
4228
4229For GFX10, wavefronts are executed in native mode with in-order reporting of
4230loads and sample instructions. In this mode vmcnt reports completion of load,
4231atomic with return and sample instructions in order, and the vscnt reports the
4232completion of store and atomic without return in order. See ``MEM_ORDERED``
4233field in :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
4234
4235In GFX10, wavefronts can be executed in WGP or CU wavefront execution mode:
4236
4237* In WGP wavefront execution mode the wavefronts of a work-group are executed
4238  on the SIMDs of both CUs of the WGP. Therefore, explicit management of the per
4239  CU L0 caches is required for work-group synchronization. Also accesses to L1
4240  at work-group scope need to be explicitly ordered as the accesses from
4241  different CUs are not ordered.
4242* In CU wavefront execution mode the wavefronts of a work-group are executed on
4243  the SIMDs of a single CU of the WGP. Therefore, all global memory access by
4244  the work-group access the same L0 which in turn ensures L1 accesses are
4245  ordered and so do not require explicit management of the caches for
4246  work-group synchronization.
4247
4248See ``WGP_MODE`` field in
4249:ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table` and
4250:ref:`amdgpu-target-features`.
4251
4252On dGPU the kernarg backing memory is accessed as UC (uncached) to avoid needing
4253to invalidate the L2 cache. For GFX6-GFX9, this also causes it to be treated as
4254non-volatile and so is not invalidated by ``*_vol``. On APU it is accessed as CC
4255(cache coherent) and so the L2 cache will be coherent with the CPU and other
4256agents.
4257
4258  .. table:: AMDHSA Memory Model Code Sequences GFX6-GFX10
4259     :name: amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx10-table
4260
4261     ============ ============ ============== ========== =============================== ==================================
4262     LLVM Instr   LLVM Memory  LLVM Memory    AMDGPU     AMDGPU Machine Code             AMDGPU Machine Code
4263                  Ordering     Sync Scope     Address    GFX6-9                          GFX10
4264                                              Space
4265     ============ ============ ============== ========== =============================== ==================================
4266     **Non-Atomic**
4267     ----------------------------------------------------------------------------------------------------------------------
4268     load         *none*       *none*         - global   - !volatile & !nontemporal      - !volatile & !nontemporal
4269                                              - generic
4270                                              - private    1. buffer/global/flat_load      1. buffer/global/flat_load
4271                                              - constant
4272                                                         - volatile & !nontemporal       - volatile & !nontemporal
4273
4274                                                           1. buffer/global/flat_load      1. buffer/global/flat_load
4275                                                              glc=1                           glc=1 dlc=1
4276
4277                                                         - nontemporal                   - nontemporal
4278
4279                                                           1. buffer/global/flat_load      1. buffer/global/flat_load
4280                                                              glc=1 slc=1                     slc=1
4281
4282     load         *none*       *none*         - local    1. ds_load                      1. ds_load
4283     store        *none*       *none*         - global   - !nontemporal                  - !nontemporal
4284                                              - generic
4285                                              - private    1. buffer/global/flat_store     1. buffer/global/flat_store
4286                                              - constant
4287                                                         - nontemporal                   - nontemporal
4288
4289                                                           1. buffer/global/flat_store      1. buffer/global/flat_store
4290                                                              glc=1 slc=1                      slc=1
4291
4292     store        *none*       *none*         - local    1. ds_store                     1. ds_store
4293     **Unordered Atomic**
4294     ----------------------------------------------------------------------------------------------------------------------
4295     load atomic  unordered    *any*          *any*      *Same as non-atomic*.           *Same as non-atomic*.
4296     store atomic unordered    *any*          *any*      *Same as non-atomic*.           *Same as non-atomic*.
4297     atomicrmw    unordered    *any*          *any*      *Same as monotonic              *Same as monotonic
4298                                                         atomic*.                        atomic*.
4299     **Monotonic Atomic**
4300     ----------------------------------------------------------------------------------------------------------------------
4301     load atomic  monotonic    - singlethread - global   1. buffer/global/flat_load      1. buffer/global/flat_load
4302                               - wavefront    - generic
4303     load atomic  monotonic    - workgroup    - global   1. buffer/global/flat_load      1. buffer/global/flat_load
4304                                              - generic                                     glc=1
4305
4306                                                                                           - If CU wavefront execution mode, omit glc=1.
4307
4308     load atomic  monotonic    - singlethread - local    1. ds_load                      1. ds_load
4309                               - wavefront
4310                               - workgroup
4311     load atomic  monotonic    - agent        - global   1. buffer/global/flat_load      1. buffer/global/flat_load
4312                               - system       - generic     glc=1                           glc=1 dlc=1
4313     store atomic monotonic    - singlethread - global   1. buffer/global/flat_store     1. buffer/global/flat_store
4314                               - wavefront    - generic
4315                               - workgroup
4316                               - agent
4317                               - system
4318     store atomic monotonic    - singlethread - local    1. ds_store                     1. ds_store
4319                               - wavefront
4320                               - workgroup
4321     atomicrmw    monotonic    - singlethread - global   1. buffer/global/flat_atomic    1. buffer/global/flat_atomic
4322                               - wavefront    - generic
4323                               - workgroup
4324                               - agent
4325                               - system
4326     atomicrmw    monotonic    - singlethread - local    1. ds_atomic                    1. ds_atomic
4327                               - wavefront
4328                               - workgroup
4329     **Acquire Atomic**
4330     ----------------------------------------------------------------------------------------------------------------------
4331     load atomic  acquire      - singlethread - global   1. buffer/global/ds/flat_load   1. buffer/global/ds/flat_load
4332                               - wavefront    - local
4333                                              - generic
4334     load atomic  acquire      - workgroup    - global   1. buffer/global/flat_load      1. buffer/global_load glc=1
4335
4336                                                                                           - If CU wavefront execution mode, omit glc=1.
4337
4338                                                                                         2. s_waitcnt vmcnt(0)
4339
4340                                                                                           - If CU wavefront execution mode, omit.
4341                                                                                           - Must happen before
4342                                                                                             the following buffer_gl0_inv
4343                                                                                             and before any following
4344                                                                                             global/generic
4345                                                                                             load/load
4346                                                                                             atomic/store/store
4347                                                                                             atomic/atomicrmw.
4348
4349                                                                                         3. buffer_gl0_inv
4350
4351                                                                                           - If CU wavefront execution mode, omit.
4352                                                                                           - Ensures that
4353                                                                                             following
4354                                                                                             loads will not see
4355                                                                                             stale data.
4356
4357     load atomic  acquire      - workgroup    - local    1. ds_load                      1. ds_load
4358                                                         2. s_waitcnt lgkmcnt(0)         2. s_waitcnt lgkmcnt(0)
4359
4360                                                           - If OpenCL, omit.              - If OpenCL, omit.
4361                                                           - Must happen before            - Must happen before
4362                                                             any following                   the following buffer_gl0_inv
4363                                                             global/generic                  and before any following
4364                                                             load/load                       global/generic load/load
4365                                                             atomic/store/store              atomic/store/store
4366                                                             atomic/atomicrmw.               atomic/atomicrmw.
4367                                                           - Ensures any                   - Ensures any
4368                                                             following global                following global
4369                                                             data read is no                 data read is no
4370                                                             older than the load             older than the load
4371                                                             atomic value being              atomic value being
4372                                                             acquired.                       acquired.
4373
4374                                                                                         3. buffer_gl0_inv
4375
4376                                                                                           - If CU wavefront execution mode, omit.
4377                                                                                           - If OpenCL, omit.
4378                                                                                           - Ensures that
4379                                                                                             following
4380                                                                                             loads will not see
4381                                                                                             stale data.
4382
4383     load atomic  acquire      - workgroup    - generic  1. flat_load                    1. flat_load glc=1
4384
4385                                                                                           - If CU wavefront execution mode, omit glc=1.
4386
4387                                                         2. s_waitcnt lgkmcnt(0)         2. s_waitcnt lgkmcnt(0) &
4388                                                                                            vmcnt(0)
4389
4390                                                                                           - If CU wavefront execution mode, omit vmcnt.
4391                                                           - If OpenCL, omit.              - If OpenCL, omit
4392                                                                                             lgkmcnt(0).
4393                                                           - Must happen before            - Must happen before
4394                                                             any following                   the following
4395                                                             global/generic                  buffer_gl0_inv and any
4396                                                             load/load                       following global/generic
4397                                                             atomic/store/store              load/load
4398                                                             atomic/atomicrmw.               atomic/store/store
4399                                                                                             atomic/atomicrmw.
4400                                                           - Ensures any                   - Ensures any
4401                                                             following global                following global
4402                                                             data read is no                 data read is no
4403                                                             older than the load             older than the load
4404                                                             atomic value being              atomic value being
4405                                                             acquired.                       acquired.
4406
4407                                                                                         3. buffer_gl0_inv
4408
4409                                                                                           - If CU wavefront execution mode, omit.
4410                                                                                           - Ensures that
4411                                                                                             following
4412                                                                                             loads will not see
4413                                                                                             stale data.
4414
4415     load atomic  acquire      - agent        - global   1. buffer/global/flat_load      1. buffer/global_load
4416                               - system                     glc=1                           glc=1 dlc=1
4417                                                         2. s_waitcnt vmcnt(0)           2. s_waitcnt vmcnt(0)
4418
4419                                                           - Must happen before            - Must happen before
4420                                                             following                       following
4421                                                             buffer_wbinvl1_vol.             buffer_gl*_inv.
4422                                                           - Ensures the load              - Ensures the load
4423                                                             has completed                   has completed
4424                                                             before invalidating             before invalidating
4425                                                             the cache.                      the caches.
4426
4427                                                         3. buffer_wbinvl1_vol           3. buffer_gl0_inv;
4428                                                                                            buffer_gl1_inv
4429
4430                                                           - Must happen before            - Must happen before
4431                                                             any following                   any following
4432                                                             global/generic                  global/generic
4433                                                             load/load                       load/load
4434                                                             atomic/atomicrmw.               atomic/atomicrmw.
4435                                                           - Ensures that                  - Ensures that
4436                                                             following                       following
4437                                                             loads will not see              loads will not see
4438                                                             stale global data.              stale global data.
4439
4440     load atomic  acquire      - agent        - generic  1. flat_load glc=1              1. flat_load glc=1 dlc=1
4441                               - system                  2. s_waitcnt vmcnt(0) &         2. s_waitcnt vmcnt(0) &
4442                                                            lgkmcnt(0)                      lgkmcnt(0)
4443
4444                                                           - If OpenCL omit                - If OpenCL omit
4445                                                             lgkmcnt(0).                     lgkmcnt(0).
4446                                                           - Must happen before            - Must happen before
4447                                                             following                       following
4448                                                             buffer_wbinvl1_vol.             buffer_gl*_invl.
4449                                                           - Ensures the flat_load         - Ensures the flat_load
4450                                                             has completed                   has completed
4451                                                             before invalidating             before invalidating
4452                                                             the cache.                      the caches.
4453
4454                                                         3. buffer_wbinvl1_vol           3. buffer_gl0_inv;
4455                                                                                            buffer_gl1_inv
4456
4457                                                           - Must happen before            - Must happen before
4458                                                             any following                   any following
4459                                                             global/generic                  global/generic
4460                                                             load/load                       load/load
4461                                                             atomic/atomicrmw.               atomic/atomicrmw.
4462                                                           - Ensures that                  - Ensures that
4463                                                             following loads                 following loads
4464                                                             will not see stale              will not see stale
4465                                                             global data.                    global data.
4466
4467     atomicrmw    acquire      - singlethread - global   1. buffer/global/ds/flat_atomic 1. buffer/global/ds/flat_atomic
4468                               - wavefront    - local
4469                                              - generic
4470     atomicrmw    acquire      - workgroup    - global   1. buffer/global/flat_atomic    1. buffer/global_atomic
4471                                                                                         2. s_waitcnt vm/vscnt(0)
4472
4473                                                                                           - If CU wavefront execution mode, omit.
4474                                                                                           - Use vmcnt if atomic with
4475                                                                                             return and vscnt if atomic
4476                                                                                             with no-return.
4477                                                                                           - Must happen before
4478                                                                                             the following buffer_gl0_inv
4479                                                                                             and before any following
4480                                                                                             global/generic
4481                                                                                             load/load
4482                                                                                             atomic/store/store
4483                                                                                             atomic/atomicrmw.
4484
4485                                                                                         3. buffer_gl0_inv
4486
4487                                                                                           - If CU wavefront execution mode, omit.
4488                                                                                           - Ensures that
4489                                                                                             following
4490                                                                                             loads will not see
4491                                                                                             stale data.
4492
4493     atomicrmw    acquire      - workgroup    - local    1. ds_atomic                    1. ds_atomic
4494                                                         2. waitcnt lgkmcnt(0)           2. waitcnt lgkmcnt(0)
4495
4496                                                           - If OpenCL, omit.              - If OpenCL, omit.
4497                                                           - Must happen before            - Must happen before
4498                                                             any following                   the following
4499                                                             global/generic                  buffer_gl0_inv.
4500                                                             load/load
4501                                                             atomic/store/store
4502                                                             atomic/atomicrmw.
4503                                                           - Ensures any                   - Ensures any
4504                                                             following global                following global
4505                                                             data read is no                 data read is no
4506                                                             older than the                  older than the
4507                                                             atomicrmw value                 atomicrmw value
4508                                                             being acquired.                 being acquired.
4509
4510                                                                                         3. buffer_gl0_inv
4511
4512                                                                                           - If OpenCL omit.
4513                                                                                           - Ensures that
4514                                                                                             following
4515                                                                                             loads will not see
4516                                                                                             stale data.
4517
4518     atomicrmw    acquire      - workgroup    - generic  1. flat_atomic                  1. flat_atomic
4519                                                         2. waitcnt lgkmcnt(0)           2. waitcnt lgkmcnt(0) &
4520                                                                                            vm/vscnt(0)
4521
4522                                                                                           - If CU wavefront execution mode, omit vm/vscnt.
4523                                                           - If OpenCL, omit.              - If OpenCL, omit
4524                                                                                             waitcnt lgkmcnt(0)..
4525                                                                                           - Use vmcnt if atomic with
4526                                                                                             return and vscnt if atomic
4527                                                                                             with no-return.
4528                                                                                             waitcnt lgkmcnt(0).
4529                                                           - Must happen before            - Must happen before
4530                                                             any following                   the following
4531                                                             global/generic                  buffer_gl0_inv.
4532                                                             load/load
4533                                                             atomic/store/store
4534                                                             atomic/atomicrmw.
4535                                                           - Ensures any                   - Ensures any
4536                                                             following global                following global
4537                                                             data read is no                 data read is no
4538                                                             older than the                  older than the
4539                                                             atomicrmw value                 atomicrmw value
4540                                                             being acquired.                 being acquired.
4541
4542                                                                                         3. buffer_gl0_inv
4543
4544                                                                                           - If CU wavefront execution mode, omit.
4545                                                                                           - Ensures that
4546                                                                                             following
4547                                                                                             loads will not see
4548                                                                                             stale data.
4549
4550     atomicrmw    acquire      - agent        - global   1. buffer/global/flat_atomic    1. buffer/global_atomic
4551                               - system                  2. s_waitcnt vmcnt(0)           2. s_waitcnt vm/vscnt(0)
4552
4553                                                                                           - Use vmcnt if atomic with
4554                                                                                             return and vscnt if atomic
4555                                                                                             with no-return.
4556                                                                                             waitcnt lgkmcnt(0).
4557                                                           - Must happen before            - Must happen before
4558                                                             following                       following
4559                                                             buffer_wbinvl1_vol.             buffer_gl*_inv.
4560                                                           - Ensures the                   - Ensures the
4561                                                             atomicrmw has                   atomicrmw has
4562                                                             completed before                completed before
4563                                                             invalidating the                invalidating the
4564                                                             cache.                          caches.
4565
4566                                                         3. buffer_wbinvl1_vol           3. buffer_gl0_inv;
4567                                                                                            buffer_gl1_inv
4568
4569                                                           - Must happen before            - Must happen before
4570                                                             any following                   any following
4571                                                             global/generic                  global/generic
4572                                                             load/load                       load/load
4573                                                             atomic/atomicrmw.               atomic/atomicrmw.
4574                                                           - Ensures that                  - Ensures that
4575                                                             following loads                 following loads
4576                                                             will not see stale              will not see stale
4577                                                             global data.                    global data.
4578
4579     atomicrmw    acquire      - agent        - generic  1. flat_atomic                  1. flat_atomic
4580                               - system                  2. s_waitcnt vmcnt(0) &         2. s_waitcnt vm/vscnt(0) &
4581                                                            lgkmcnt(0)                      lgkmcnt(0)
4582
4583                                                           - If OpenCL, omit               - If OpenCL, omit
4584                                                             lgkmcnt(0).                     lgkmcnt(0).
4585                                                                                           - Use vmcnt if atomic with
4586                                                                                             return and vscnt if atomic
4587                                                                                             with no-return.
4588                                                           - Must happen before            - Must happen before
4589                                                             following                       following
4590                                                             buffer_wbinvl1_vol.             buffer_gl*_inv.
4591                                                           - Ensures the                   - Ensures the
4592                                                             atomicrmw has                   atomicrmw has
4593                                                             completed before                completed before
4594                                                             invalidating the                invalidating the
4595                                                             cache.                          caches.
4596
4597                                                         3. buffer_wbinvl1_vol           3. buffer_gl0_inv;
4598                                                                                            buffer_gl1_inv
4599
4600                                                           - Must happen before            - Must happen before
4601                                                             any following                   any following
4602                                                             global/generic                  global/generic
4603                                                             load/load                       load/load
4604                                                             atomic/atomicrmw.               atomic/atomicrmw.
4605                                                           - Ensures that                  - Ensures that
4606                                                             following loads                 following loads
4607                                                             will not see stale              will not see stale
4608                                                             global data.                    global data.
4609
4610     fence        acquire      - singlethread *none*     *none*                          *none*
4611                               - wavefront
4612     fence        acquire      - workgroup    *none*     1. s_waitcnt lgkmcnt(0)         1. s_waitcnt lgkmcnt(0) &
4613                                                                                            vmcnt(0) & vscnt(0)
4614
4615                                                                                           - If CU wavefront execution mode, omit vmcnt and
4616                                                                                             vscnt.
4617                                                           - If OpenCL and                 - If OpenCL and
4618                                                             address space is                address space is
4619                                                             not generic, omit.              not generic, omit
4620                                                                                             lgkmcnt(0).
4621                                                                                           - If OpenCL and
4622                                                                                             address space is
4623                                                                                             local, omit
4624                                                                                             vmcnt(0) and vscnt(0).
4625                                                           - However, since LLVM           - However, since LLVM
4626                                                             currently has no                currently has no
4627                                                             address space on                address space on
4628                                                             the fence need to               the fence need to
4629                                                             conservatively                  conservatively
4630                                                             always generate. If             always generate. If
4631                                                             fence had an                    fence had an
4632                                                             address space then              address space then
4633                                                             set to address                  set to address
4634                                                             space of OpenCL                 space of OpenCL
4635                                                             fence flag, or to               fence flag, or to
4636                                                             generic if both                 generic if both
4637                                                             local and global                local and global
4638                                                             flags are                       flags are
4639                                                             specified.                      specified.
4640                                                           - Must happen after
4641                                                             any preceding
4642                                                             local/generic load
4643                                                             atomic/atomicrmw
4644                                                             with an equal or
4645                                                             wider sync scope
4646                                                             and memory ordering
4647                                                             stronger than
4648                                                             unordered (this is
4649                                                             termed the
4650                                                             fence-paired-atomic).
4651                                                           - Must happen before
4652                                                             any following
4653                                                             global/generic
4654                                                             load/load
4655                                                             atomic/store/store
4656                                                             atomic/atomicrmw.
4657                                                           - Ensures any
4658                                                             following global
4659                                                             data read is no
4660                                                             older than the
4661                                                             value read by the
4662                                                             fence-paired-atomic.
4663                                                                                           - Could be split into
4664                                                                                             separate s_waitcnt
4665                                                                                             vmcnt(0), s_waitcnt
4666                                                                                             vscnt(0) and s_waitcnt
4667                                                                                             lgkmcnt(0) to allow
4668                                                                                             them to be
4669                                                                                             independently moved
4670                                                                                             according to the
4671                                                                                             following rules.
4672                                                                                           - s_waitcnt vmcnt(0)
4673                                                                                             must happen after
4674                                                                                             any preceding
4675                                                                                             global/generic load
4676                                                                                             atomic/
4677                                                                                             atomicrmw-with-return-value
4678                                                                                             with an equal or
4679                                                                                             wider sync scope
4680                                                                                             and memory ordering
4681                                                                                             stronger than
4682                                                                                             unordered (this is
4683                                                                                             termed the
4684                                                                                             fence-paired-atomic).
4685                                                                                           - s_waitcnt vscnt(0)
4686                                                                                             must happen after
4687                                                                                             any preceding
4688                                                                                             global/generic
4689                                                                                             atomicrmw-no-return-value
4690                                                                                             with an equal or
4691                                                                                             wider sync scope
4692                                                                                             and memory ordering
4693                                                                                             stronger than
4694                                                                                             unordered (this is
4695                                                                                             termed the
4696                                                                                             fence-paired-atomic).
4697                                                                                           - s_waitcnt lgkmcnt(0)
4698                                                                                             must happen after
4699                                                                                             any preceding
4700                                                                                             local/generic load
4701                                                                                             atomic/atomicrmw
4702                                                                                             with an equal or
4703                                                                                             wider sync scope
4704                                                                                             and memory ordering
4705                                                                                             stronger than
4706                                                                                             unordered (this is
4707                                                                                             termed the
4708                                                                                             fence-paired-atomic).
4709                                                                                           - Must happen before
4710                                                                                             the following
4711                                                                                             buffer_gl0_inv.
4712                                                                                           - Ensures that the
4713                                                                                             fence-paired atomic
4714                                                                                             has completed
4715                                                                                             before invalidating
4716                                                                                             the
4717                                                                                             cache. Therefore
4718                                                                                             any following
4719                                                                                             locations read must
4720                                                                                             be no older than
4721                                                                                             the value read by
4722                                                                                             the
4723                                                                                             fence-paired-atomic.
4724
4725                                                                                         3. buffer_gl0_inv
4726
4727                                                                                           - If CU wavefront execution mode, omit.
4728                                                                                           - Ensures that
4729                                                                                             following
4730                                                                                             loads will not see
4731                                                                                             stale data.
4732
4733     fence        acquire      - agent        *none*     1. s_waitcnt lgkmcnt(0) &       1. s_waitcnt lgkmcnt(0) &
4734                               - system                     vmcnt(0)                        vmcnt(0) & vscnt(0)
4735
4736                                                           - If OpenCL and                 - If OpenCL and
4737                                                             address space is                address space is
4738                                                             not generic, omit               not generic, omit
4739                                                             lgkmcnt(0).                     lgkmcnt(0).
4740                                                                                           - If OpenCL and
4741                                                                                             address space is
4742                                                                                             local, omit
4743                                                                                             vmcnt(0) and vscnt(0).
4744                                                           - However, since LLVM           - However, since LLVM
4745                                                             currently has no                currently has no
4746                                                             address space on                address space on
4747                                                             the fence need to               the fence need to
4748                                                             conservatively                  conservatively
4749                                                             always generate                 always generate
4750                                                             (see comment for                (see comment for
4751                                                             previous fence).                previous fence).
4752                                                           - Could be split into
4753                                                             separate s_waitcnt
4754                                                             vmcnt(0) and
4755                                                             s_waitcnt
4756                                                             lgkmcnt(0) to allow
4757                                                             them to be
4758                                                             independently moved
4759                                                             according to the
4760                                                             following rules.
4761                                                           - s_waitcnt vmcnt(0)
4762                                                             must happen after
4763                                                             any preceding
4764                                                             global/generic load
4765                                                             atomic/atomicrmw
4766                                                             with an equal or
4767                                                             wider sync scope
4768                                                             and memory ordering
4769                                                             stronger than
4770                                                             unordered (this is
4771                                                             termed the
4772                                                             fence-paired-atomic).
4773                                                           - s_waitcnt lgkmcnt(0)
4774                                                             must happen after
4775                                                             any preceding
4776                                                             local/generic load
4777                                                             atomic/atomicrmw
4778                                                             with an equal or
4779                                                             wider sync scope
4780                                                             and memory ordering
4781                                                             stronger than
4782                                                             unordered (this is
4783                                                             termed the
4784                                                             fence-paired-atomic).
4785                                                           - Must happen before
4786                                                             the following
4787                                                             buffer_wbinvl1_vol.
4788                                                           - Ensures that the
4789                                                             fence-paired atomic
4790                                                             has completed
4791                                                             before invalidating
4792                                                             the
4793                                                             cache. Therefore
4794                                                             any following
4795                                                             locations read must
4796                                                             be no older than
4797                                                             the value read by
4798                                                             the
4799                                                             fence-paired-atomic.
4800                                                                                           - Could be split into
4801                                                                                             separate s_waitcnt
4802                                                                                             vmcnt(0), s_waitcnt
4803                                                                                             vscnt(0) and s_waitcnt
4804                                                                                             lgkmcnt(0) to allow
4805                                                                                             them to be
4806                                                                                             independently moved
4807                                                                                             according to the
4808                                                                                             following rules.
4809                                                                                           - s_waitcnt vmcnt(0)
4810                                                                                             must happen after
4811                                                                                             any preceding
4812                                                                                             global/generic load
4813                                                                                             atomic/
4814                                                                                             atomicrmw-with-return-value
4815                                                                                             with an equal or
4816                                                                                             wider sync scope
4817                                                                                             and memory ordering
4818                                                                                             stronger than
4819                                                                                             unordered (this is
4820                                                                                             termed the
4821                                                                                             fence-paired-atomic).
4822                                                                                           - s_waitcnt vscnt(0)
4823                                                                                             must happen after
4824                                                                                             any preceding
4825                                                                                             global/generic
4826                                                                                             atomicrmw-no-return-value
4827                                                                                             with an equal or
4828                                                                                             wider sync scope
4829                                                                                             and memory ordering
4830                                                                                             stronger than
4831                                                                                             unordered (this is
4832                                                                                             termed the
4833                                                                                             fence-paired-atomic).
4834                                                                                           - s_waitcnt lgkmcnt(0)
4835                                                                                             must happen after
4836                                                                                             any preceding
4837                                                                                             local/generic load
4838                                                                                             atomic/atomicrmw
4839                                                                                             with an equal or
4840                                                                                             wider sync scope
4841                                                                                             and memory ordering
4842                                                                                             stronger than
4843                                                                                             unordered (this is
4844                                                                                             termed the
4845                                                                                             fence-paired-atomic).
4846                                                                                           - Must happen before
4847                                                                                             the following
4848                                                                                             buffer_gl*_inv.
4849                                                                                           - Ensures that the
4850                                                                                             fence-paired atomic
4851                                                                                             has completed
4852                                                                                             before invalidating
4853                                                                                             the
4854                                                                                             caches. Therefore
4855                                                                                             any following
4856                                                                                             locations read must
4857                                                                                             be no older than
4858                                                                                             the value read by
4859                                                                                             the
4860                                                                                             fence-paired-atomic.
4861
4862                                                         2. buffer_wbinvl1_vol           2. buffer_gl0_inv;
4863                                                                                            buffer_gl1_inv
4864
4865                                                           - Must happen before any        - Must happen before any
4866                                                             following global/generic        following global/generic
4867                                                             load/load                       load/load
4868                                                             atomic/store/store              atomic/store/store
4869                                                             atomic/atomicrmw.               atomic/atomicrmw.
4870                                                           - Ensures that                  - Ensures that
4871                                                             following loads                 following loads
4872                                                             will not see stale              will not see stale
4873                                                             global data.                    global data.
4874
4875     **Release Atomic**
4876     ----------------------------------------------------------------------------------------------------------------------
4877     store atomic release      - singlethread - global   1. buffer/global/ds/flat_store  1. buffer/global/ds/flat_store
4878                               - wavefront    - local
4879                                              - generic
4880     store atomic release      - workgroup    - global   1. s_waitcnt lgkmcnt(0)         1. s_waitcnt lgkmcnt(0) &
4881                                                                                            vmcnt(0) & vscnt(0)
4882
4883                                                                                           - If CU wavefront execution mode, omit vmcnt and
4884                                                                                             vscnt.
4885                                                           - If OpenCL, omit.              - If OpenCL, omit
4886                                                                                             lgkmcnt(0).
4887                                                           - Must happen after
4888                                                             any preceding
4889                                                             local/generic
4890                                                             load/store/load
4891                                                             atomic/store
4892                                                             atomic/atomicrmw.
4893                                                                                           - Could be split into
4894                                                                                             separate s_waitcnt
4895                                                                                             vmcnt(0), s_waitcnt
4896                                                                                             vscnt(0) and s_waitcnt
4897                                                                                             lgkmcnt(0) to allow
4898                                                                                             them to be
4899                                                                                             independently moved
4900                                                                                             according to the
4901                                                                                             following rules.
4902                                                                                           - s_waitcnt vmcnt(0)
4903                                                                                             must happen after
4904                                                                                             any preceding
4905                                                                                             global/generic load/load
4906                                                                                             atomic/
4907                                                                                             atomicrmw-with-return-value.
4908                                                                                           - s_waitcnt vscnt(0)
4909                                                                                             must happen after
4910                                                                                             any preceding
4911                                                                                             global/generic
4912                                                                                             store/store
4913                                                                                             atomic/
4914                                                                                             atomicrmw-no-return-value.
4915                                                                                           - s_waitcnt lgkmcnt(0)
4916                                                                                             must happen after
4917                                                                                             any preceding
4918                                                                                             local/generic
4919                                                                                             load/store/load
4920                                                                                             atomic/store
4921                                                                                             atomic/atomicrmw.
4922                                                           - Must happen before            - Must happen before
4923                                                             the following                   the following
4924                                                             store.                          store.
4925                                                           - Ensures that all              - Ensures that all
4926                                                             memory operations               memory operations
4927                                                             to local have                   have
4928                                                             completed before                completed before
4929                                                             performing the                  performing the
4930                                                             store that is being             store that is being
4931                                                             released.                       released.
4932
4933                                                         2. buffer/global/flat_store     2. buffer/global_store
4934     store atomic release      - workgroup    - local                                    1. waitcnt vmcnt(0) & vscnt(0)
4935
4936                                                                                           - If CU wavefront execution mode, omit.
4937                                                                                           - If OpenCL, omit.
4938                                                                                           - Could be split into
4939                                                                                             separate s_waitcnt
4940                                                                                             vmcnt(0) and s_waitcnt
4941                                                                                             vscnt(0) to allow
4942                                                                                             them to be
4943                                                                                             independently moved
4944                                                                                             according to the
4945                                                                                             following rules.
4946                                                                                           - s_waitcnt vmcnt(0)
4947                                                                                             must happen after
4948                                                                                             any preceding
4949                                                                                             global/generic load/load
4950                                                                                             atomic/
4951                                                                                             atomicrmw-with-return-value.
4952                                                                                           - s_waitcnt vscnt(0)
4953                                                                                             must happen after
4954                                                                                             any preceding
4955                                                                                             global/generic
4956                                                                                             store/store atomic/
4957                                                                                             atomicrmw-no-return-value.
4958                                                                                           - Must happen before
4959                                                                                             the following
4960                                                                                             store.
4961                                                                                           - Ensures that all
4962                                                                                             global memory
4963                                                                                             operations have
4964                                                                                             completed before
4965                                                                                             performing the
4966                                                                                             store that is being
4967                                                                                             released.
4968
4969                                                         1. ds_store                     2. ds_store
4970     store atomic release      - workgroup    - generic  1. s_waitcnt lgkmcnt(0)         1. s_waitcnt lgkmcnt(0) &
4971                                                                                            vmcnt(0) & vscnt(0)
4972
4973                                                                                           - If CU wavefront execution mode, omit vmcnt and
4974                                                                                             vscnt.
4975                                                           - If OpenCL, omit.              - If OpenCL, omit
4976                                                                                             lgkmcnt(0).
4977                                                           - Must happen after
4978                                                             any preceding
4979                                                             local/generic
4980                                                             load/store/load
4981                                                             atomic/store
4982                                                             atomic/atomicrmw.
4983                                                                                           - Could be split into
4984                                                                                             separate s_waitcnt
4985                                                                                             vmcnt(0), s_waitcnt
4986                                                                                             vscnt(0) and s_waitcnt
4987                                                                                             lgkmcnt(0) to allow
4988                                                                                             them to be
4989                                                                                             independently moved
4990                                                                                             according to the
4991                                                                                             following rules.
4992                                                                                           - s_waitcnt vmcnt(0)
4993                                                                                             must happen after
4994                                                                                             any preceding
4995                                                                                             global/generic load/load
4996                                                                                             atomic/
4997                                                                                             atomicrmw-with-return-value.
4998                                                                                           - s_waitcnt vscnt(0)
4999                                                                                             must happen after
5000                                                                                             any preceding
5001                                                                                             global/generic
5002                                                                                             store/store
5003                                                                                             atomic/
5004                                                                                             atomicrmw-no-return-value.
5005                                                                                           - s_waitcnt lgkmcnt(0)
5006                                                                                             must happen after
5007                                                                                             any preceding
5008                                                                                             local/generic load/store/load
5009                                                                                             atomic/store atomic/atomicrmw.
5010                                                           - Must happen before            - Must happen before
5011                                                             the following                   the following
5012                                                             store.                          store.
5013                                                           - Ensures that all              - Ensures that all
5014                                                             memory operations               memory operations
5015                                                             to local have                   have
5016                                                             completed before                completed before
5017                                                             performing the                  performing the
5018                                                             store that is being             store that is being
5019                                                             released.                       released.
5020
5021                                                         2. flat_store                   2. flat_store
5022     store atomic release      - agent        - global   1. s_waitcnt lgkmcnt(0) &         1. s_waitcnt lgkmcnt(0) &
5023                               - system       - generic     vmcnt(0)                          vmcnt(0) & vscnt(0)
5024
5025                                                           - If OpenCL, omit               - If OpenCL, omit
5026                                                             lgkmcnt(0).                     lgkmcnt(0).
5027                                                           - Could be split into           - Could be split into
5028                                                             separate s_waitcnt              separate s_waitcnt
5029                                                             vmcnt(0) and                    vmcnt(0), s_waitcnt vscnt(0)
5030                                                             s_waitcnt                       and s_waitcnt
5031                                                             lgkmcnt(0) to allow             lgkmcnt(0) to allow
5032                                                             them to be                      them to be
5033                                                             independently moved             independently moved
5034                                                             according to the                according to the
5035                                                             following rules.                following rules.
5036                                                           - s_waitcnt vmcnt(0)            - s_waitcnt vmcnt(0)
5037                                                             must happen after               must happen after
5038                                                             any preceding                   any preceding
5039                                                             global/generic                  global/generic
5040                                                             load/store/load                 load/load
5041                                                             atomic/store                    atomic/
5042                                                             atomic/atomicrmw.               atomicrmw-with-return-value.
5043                                                                                           - s_waitcnt vscnt(0)
5044                                                                                             must happen after
5045                                                                                             any preceding
5046                                                                                             global/generic
5047                                                                                             store/store atomic/
5048                                                                                             atomicrmw-no-return-value.
5049                                                           - s_waitcnt lgkmcnt(0)          - s_waitcnt lgkmcnt(0)
5050                                                             must happen after               must happen after
5051                                                             any preceding                   any preceding
5052                                                             local/generic                   local/generic
5053                                                             load/store/load                 load/store/load
5054                                                             atomic/store                    atomic/store
5055                                                             atomic/atomicrmw.               atomic/atomicrmw.
5056                                                           - Must happen before            - Must happen before
5057                                                             the following                   the following
5058                                                             store.                          store.
5059                                                           - Ensures that all              - Ensures that all
5060                                                             memory operations               memory operations
5061                                                             to memory have                  to memory have
5062                                                             completed before                completed before
5063                                                             performing the                  performing the
5064                                                             store that is being             store that is being
5065                                                             released.                       released.
5066
5067                                                         2. buffer/global/ds/flat_store  2. buffer/global/ds/flat_store
5068     atomicrmw    release      - singlethread - global   1. buffer/global/ds/flat_atomic 1. buffer/global/ds/flat_atomic
5069                               - wavefront    - local
5070                                              - generic
5071     atomicrmw    release      - workgroup    - global   1. s_waitcnt lgkmcnt(0)         1. s_waitcnt lgkmcnt(0) &
5072                                                                                            vmcnt(0) & vscnt(0)
5073
5074                                                                                           - If CU wavefront execution mode, omit vmcnt and
5075                                                                                             vscnt.
5076                                                           - If OpenCL, omit.
5077
5078                                                           - Must happen after
5079                                                             any preceding
5080                                                             local/generic
5081                                                             load/store/load
5082                                                             atomic/store
5083                                                             atomic/atomicrmw.
5084                                                                                           - Could be split into
5085                                                                                             separate s_waitcnt
5086                                                                                             vmcnt(0), s_waitcnt
5087                                                                                             vscnt(0) and s_waitcnt
5088                                                                                             lgkmcnt(0) to allow
5089                                                                                             them to be
5090                                                                                             independently moved
5091                                                                                             according to the
5092                                                                                             following rules.
5093                                                                                           - s_waitcnt vmcnt(0)
5094                                                                                             must happen after
5095                                                                                             any preceding
5096                                                                                             global/generic load/load
5097                                                                                             atomic/
5098                                                                                             atomicrmw-with-return-value.
5099                                                                                           - s_waitcnt vscnt(0)
5100                                                                                             must happen after
5101                                                                                             any preceding
5102                                                                                             global/generic
5103                                                                                             store/store
5104                                                                                             atomic/
5105                                                                                             atomicrmw-no-return-value.
5106                                                                                           - s_waitcnt lgkmcnt(0)
5107                                                                                             must happen after
5108                                                                                             any preceding
5109                                                                                             local/generic
5110                                                                                             load/store/load
5111                                                                                             atomic/store
5112                                                                                             atomic/atomicrmw.
5113                                                           - Must happen before            - Must happen before
5114                                                             the following                   the following
5115                                                             atomicrmw.                      atomicrmw.
5116                                                           - Ensures that all              - Ensures that all
5117                                                             memory operations               memory operations
5118                                                             to local have                   have
5119                                                             completed before                completed before
5120                                                             performing the                  performing the
5121                                                             atomicrmw that is               atomicrmw that is
5122                                                             being released.                 being released.
5123
5124                                                         2. buffer/global/flat_atomic    2. buffer/global_atomic
5125     atomicrmw    release      - workgroup    - local                                    1. waitcnt vmcnt(0) & vscnt(0)
5126
5127                                                                                           - If CU wavefront execution mode, omit.
5128                                                                                           - If OpenCL, omit.
5129                                                                                           - Could be split into
5130                                                                                             separate s_waitcnt
5131                                                                                             vmcnt(0) and s_waitcnt
5132                                                                                             vscnt(0) to allow
5133                                                                                             them to be
5134                                                                                             independently moved
5135                                                                                             according to the
5136                                                                                             following rules.
5137                                                                                           - s_waitcnt vmcnt(0)
5138                                                                                             must happen after
5139                                                                                             any preceding
5140                                                                                             global/generic load/load
5141                                                                                             atomic/
5142                                                                                             atomicrmw-with-return-value.
5143                                                                                           - s_waitcnt vscnt(0)
5144                                                                                             must happen after
5145                                                                                             any preceding
5146                                                                                             global/generic
5147                                                                                             store/store atomic/
5148                                                                                             atomicrmw-no-return-value.
5149                                                                                           - Must happen before
5150                                                                                             the following
5151                                                                                             store.
5152                                                                                           - Ensures that all
5153                                                                                             global memory
5154                                                                                             operations have
5155                                                                                             completed before
5156                                                                                             performing the
5157                                                                                             store that is being
5158                                                                                             released.
5159
5160                                                         1. ds_atomic                    2. ds_atomic
5161     atomicrmw    release      - workgroup    - generic  1. s_waitcnt lgkmcnt(0)         1. s_waitcnt lgkmcnt(0) &
5162                                                                                            vmcnt(0) & vscnt(0)
5163
5164                                                                                           - If CU wavefront execution mode, omit vmcnt and
5165                                                                                             vscnt.
5166                                                           - If OpenCL, omit.              - If OpenCL, omit
5167                                                                                             waitcnt lgkmcnt(0).
5168                                                           - Must happen after
5169                                                             any preceding
5170                                                             local/generic
5171                                                             load/store/load
5172                                                             atomic/store
5173                                                             atomic/atomicrmw.
5174                                                                                           - Could be split into
5175                                                                                             separate s_waitcnt
5176                                                                                             vmcnt(0), s_waitcnt
5177                                                                                             vscnt(0) and s_waitcnt
5178                                                                                             lgkmcnt(0) to allow
5179                                                                                             them to be
5180                                                                                             independently moved
5181                                                                                             according to the
5182                                                                                             following rules.
5183                                                                                           - s_waitcnt vmcnt(0)
5184                                                                                             must happen after
5185                                                                                             any preceding
5186                                                                                             global/generic load/load
5187                                                                                             atomic/
5188                                                                                             atomicrmw-with-return-value.
5189                                                                                           - s_waitcnt vscnt(0)
5190                                                                                             must happen after
5191                                                                                             any preceding
5192                                                                                             global/generic
5193                                                                                             store/store
5194                                                                                             atomic/
5195                                                                                             atomicrmw-no-return-value.
5196                                                                                           - s_waitcnt lgkmcnt(0)
5197                                                                                             must happen after
5198                                                                                             any preceding
5199                                                                                             local/generic load/store/load
5200                                                                                             atomic/store atomic/atomicrmw.
5201                                                           - Must happen before            - Must happen before
5202                                                             the following                   the following
5203                                                             atomicrmw.                      atomicrmw.
5204                                                           - Ensures that all              - Ensures that all
5205                                                             memory operations               memory operations
5206                                                             to local have                   have
5207                                                             completed before                completed before
5208                                                             performing the                  performing the
5209                                                             atomicrmw that is               atomicrmw that is
5210                                                             being released.                 being released.
5211
5212                                                         2. flat_atomic                  2. flat_atomic
5213     atomicrmw    release      - agent        - global   1. s_waitcnt lgkmcnt(0) &       1. s_waitcnt lkkmcnt(0) &
5214                               - system       - generic     vmcnt(0)                         vmcnt(0) & vscnt(0)
5215
5216                                                           - If OpenCL, omit               - If OpenCL, omit
5217                                                             lgkmcnt(0).                     lgkmcnt(0).
5218                                                           - Could be split into           - Could be split into
5219                                                             separate s_waitcnt              separate s_waitcnt
5220                                                             vmcnt(0) and                    vmcnt(0), s_waitcnt
5221                                                             s_waitcnt                       vscnt(0) and s_waitcnt
5222                                                             lgkmcnt(0) to allow             lgkmcnt(0) to allow
5223                                                             them to be                      them to be
5224                                                             independently moved             independently moved
5225                                                             according to the                according to the
5226                                                             following rules.                following rules.
5227                                                           - s_waitcnt vmcnt(0)            - s_waitcnt vmcnt(0)
5228                                                             must happen after               must happen after
5229                                                             any preceding                   any preceding
5230                                                             global/generic                  global/generic
5231                                                             load/store/load                 load/load atomic/
5232                                                             atomic/store                    atomicrmw-with-return-value.
5233                                                             atomic/atomicrmw.
5234                                                                                           - s_waitcnt vscnt(0)
5235                                                                                             must happen after
5236                                                                                             any preceding
5237                                                                                             global/generic
5238                                                                                             store/store atomic/
5239                                                                                             atomicrmw-no-return-value.
5240                                                           - s_waitcnt lgkmcnt(0)          - s_waitcnt lgkmcnt(0)
5241                                                             must happen after               must happen after
5242                                                             any preceding                   any preceding
5243                                                             local/generic                   local/generic
5244                                                             load/store/load                 load/store/load
5245                                                             atomic/store                    atomic/store
5246                                                             atomic/atomicrmw.               atomic/atomicrmw.
5247                                                           - Must happen before            - Must happen before
5248                                                             the following                   the following
5249                                                             atomicrmw.                      atomicrmw.
5250                                                           - Ensures that all              - Ensures that all
5251                                                             memory operations               memory operations
5252                                                             to global and local             to global and local
5253                                                             have completed                  have completed
5254                                                             before performing               before performing
5255                                                             the atomicrmw that              the atomicrmw that
5256                                                             is being released.              is being released.
5257
5258                                                         2. buffer/global/ds/flat_atomic 2. buffer/global/ds/flat_atomic
5259     fence        release      - singlethread *none*     *none*                          *none*
5260                               - wavefront
5261     fence        release      - workgroup    *none*     1. s_waitcnt lgkmcnt(0)         1. s_waitcnt lgkmcnt(0) &
5262                                                                                            vmcnt(0) & vscnt(0)
5263
5264                                                                                           - If CU wavefront execution mode, omit vmcnt and
5265                                                                                             vscnt.
5266                                                           - If OpenCL and                 - If OpenCL and
5267                                                             address space is                address space is
5268                                                             not generic, omit.              not generic, omit
5269                                                                                             lgkmcnt(0).
5270                                                                                           - If OpenCL and
5271                                                                                             address space is
5272                                                                                             local, omit
5273                                                                                             vmcnt(0) and vscnt(0).
5274                                                           - However, since LLVM           - However, since LLVM
5275                                                             currently has no                currently has no
5276                                                             address space on                address space on
5277                                                             the fence need to               the fence need to
5278                                                             conservatively                  conservatively
5279                                                             always generate. If             always generate. If
5280                                                             fence had an                    fence had an
5281                                                             address space then              address space then
5282                                                             set to address                  set to address
5283                                                             space of OpenCL                 space of OpenCL
5284                                                             fence flag, or to               fence flag, or to
5285                                                             generic if both                 generic if both
5286                                                             local and global                local and global
5287                                                             flags are                       flags are
5288                                                             specified.                      specified.
5289                                                           - Must happen after
5290                                                             any preceding
5291                                                             local/generic
5292                                                             load/load
5293                                                             atomic/store/store
5294                                                             atomic/atomicrmw.
5295                                                                                           - Could be split into
5296                                                                                             separate s_waitcnt
5297                                                                                             vmcnt(0), s_waitcnt
5298                                                                                             vscnt(0) and s_waitcnt
5299                                                                                             lgkmcnt(0) to allow
5300                                                                                             them to be
5301                                                                                             independently moved
5302                                                                                             according to the
5303                                                                                             following rules.
5304                                                                                           - s_waitcnt vmcnt(0)
5305                                                                                             must happen after
5306                                                                                             any preceding
5307                                                                                             global/generic
5308                                                                                             load/load
5309                                                                                             atomic/
5310                                                                                             atomicrmw-with-return-value.
5311                                                                                           - s_waitcnt vscnt(0)
5312                                                                                             must happen after
5313                                                                                             any preceding
5314                                                                                             global/generic
5315                                                                                             store/store atomic/
5316                                                                                             atomicrmw-no-return-value.
5317                                                                                           - s_waitcnt lgkmcnt(0)
5318                                                                                             must happen after
5319                                                                                             any preceding
5320                                                                                             local/generic
5321                                                                                             load/store/load
5322                                                                                             atomic/store atomic/
5323                                                                                             atomicrmw.
5324                                                           - Must happen before            - Must happen before
5325                                                             any following store             any following store
5326                                                             atomic/atomicrmw                atomic/atomicrmw
5327                                                             with an equal or                with an equal or
5328                                                             wider sync scope                wider sync scope
5329                                                             and memory ordering             and memory ordering
5330                                                             stronger than                   stronger than
5331                                                             unordered (this is              unordered (this is
5332                                                             termed the                      termed the
5333                                                             fence-paired-atomic).           fence-paired-atomic).
5334                                                           - Ensures that all              - Ensures that all
5335                                                             memory operations               memory operations
5336                                                             to local have                   have
5337                                                             completed before                completed before
5338                                                             performing the                  performing the
5339                                                             following                       following
5340                                                             fence-paired-atomic.            fence-paired-atomic.
5341
5342     fence        release      - agent        *none*     1. s_waitcnt lgkmcnt(0) &       1. s_waitcnt lgkmcnt(0) &
5343                               - system                     vmcnt(0)                        vmcnt(0) & vscnt(0)
5344
5345                                                           - If OpenCL and                 - If OpenCL and
5346                                                             address space is                address space is
5347                                                             not generic, omit               not generic, omit
5348                                                             lgkmcnt(0).                     lgkmcnt(0).
5349                                                           - If OpenCL and                 - If OpenCL and
5350                                                             address space is                address space is
5351                                                             local, omit                     local, omit
5352                                                             vmcnt(0).                       vmcnt(0) and vscnt(0).
5353                                                           - However, since LLVM           - However, since LLVM
5354                                                             currently has no                currently has no
5355                                                             address space on                address space on
5356                                                             the fence need to               the fence need to
5357                                                             conservatively                  conservatively
5358                                                             always generate. If             always generate. If
5359                                                             fence had an                    fence had an
5360                                                             address space then              address space then
5361                                                             set to address                  set to address
5362                                                             space of OpenCL                 space of OpenCL
5363                                                             fence flag, or to               fence flag, or to
5364                                                             generic if both                 generic if both
5365                                                             local and global                local and global
5366                                                             flags are                       flags are
5367                                                             specified.                      specified.
5368                                                           - Could be split into           - Could be split into
5369                                                             separate s_waitcnt              separate s_waitcnt
5370                                                             vmcnt(0) and                    vmcnt(0), s_waitcnt
5371                                                             s_waitcnt                       vscnt(0) and s_waitcnt
5372                                                             lgkmcnt(0) to allow             lgkmcnt(0) to allow
5373                                                             them to be                      them to be
5374                                                             independently moved             independently moved
5375                                                             according to the                according to the
5376                                                             following rules.                following rules.
5377                                                           - s_waitcnt vmcnt(0)            - s_waitcnt vmcnt(0)
5378                                                             must happen after               must happen after
5379                                                             any preceding                   any preceding
5380                                                             global/generic                  global/generic
5381                                                             load/store/load                 load/load atomic/
5382                                                             atomic/store                    atomicrmw-with-return-value.
5383                                                             atomic/atomicrmw.
5384                                                                                           - s_waitcnt vscnt(0)
5385                                                                                             must happen after
5386                                                                                             any preceding
5387                                                                                             global/generic
5388                                                                                             store/store atomic/
5389                                                                                             atomicrmw-no-return-value.
5390                                                           - s_waitcnt lgkmcnt(0)          - s_waitcnt lgkmcnt(0)
5391                                                             must happen after               must happen after
5392                                                             any preceding                   any preceding
5393                                                             local/generic                   local/generic
5394                                                             load/store/load                 load/store/load
5395                                                             atomic/store                    atomic/store
5396                                                             atomic/atomicrmw.               atomic/atomicrmw.
5397                                                           - Must happen before            - Must happen before
5398                                                             any following store             any following store
5399                                                             atomic/atomicrmw                atomic/atomicrmw
5400                                                             with an equal or                with an equal or
5401                                                             wider sync scope                wider sync scope
5402                                                             and memory ordering             and memory ordering
5403                                                             stronger than                   stronger than
5404                                                             unordered (this is              unordered (this is
5405                                                             termed the                      termed the
5406                                                             fence-paired-atomic).           fence-paired-atomic).
5407                                                           - Ensures that all              - Ensures that all
5408                                                             memory operations               memory operations
5409                                                             have                            have
5410                                                             completed before                completed before
5411                                                             performing the                  performing the
5412                                                             following                       following
5413                                                             fence-paired-atomic.            fence-paired-atomic.
5414
5415     **Acquire-Release Atomic**
5416     ----------------------------------------------------------------------------------------------------------------------
5417     atomicrmw    acq_rel      - singlethread - global   1. buffer/global/ds/flat_atomic 1. buffer/global/ds/flat_atomic
5418                               - wavefront    - local
5419                                              - generic
5420     atomicrmw    acq_rel      - workgroup    - global   1. s_waitcnt lgkmcnt(0)         1. s_waitcnt lgkmcnt(0) &
5421                                                                                            vmcnt(0) & vscnt(0)
5422
5423                                                                                           - If CU wavefront execution mode, omit vmcnt and
5424                                                                                             vscnt.
5425                                                           - If OpenCL, omit.              - If OpenCL, omit
5426                                                                                             s_waitcnt lgkmcnt(0).
5427                                                           - Must happen after             - Must happen after
5428                                                             any preceding                   any preceding
5429                                                             local/generic                   local/generic
5430                                                             load/store/load                 load/store/load
5431                                                             atomic/store                    atomic/store
5432                                                             atomic/atomicrmw.               atomic/atomicrmw.
5433                                                                                           - Could be split into
5434                                                                                             separate s_waitcnt
5435                                                                                             vmcnt(0), s_waitcnt
5436                                                                                             vscnt(0) and s_waitcnt
5437                                                                                             lgkmcnt(0) to allow
5438                                                                                             them to be
5439                                                                                             independently moved
5440                                                                                             according to the
5441                                                                                             following rules.
5442                                                                                           - s_waitcnt vmcnt(0)
5443                                                                                             must happen after
5444                                                                                             any preceding
5445                                                                                             global/generic load/load
5446                                                                                             atomic/
5447                                                                                             atomicrmw-with-return-value.
5448                                                                                           - s_waitcnt vscnt(0)
5449                                                                                             must happen after
5450                                                                                             any preceding
5451                                                                                             global/generic
5452                                                                                             store/store
5453                                                                                             atomic/
5454                                                                                             atomicrmw-no-return-value.
5455                                                                                           - s_waitcnt lgkmcnt(0)
5456                                                                                             must happen after
5457                                                                                             any preceding
5458                                                                                             local/generic load/store/load
5459                                                                                             atomic/store atomic/atomicrmw.
5460                                                           - Must happen before            - Must happen before
5461                                                             the following                   the following
5462                                                             atomicrmw.                      atomicrmw.
5463                                                           - Ensures that all              - Ensures that all
5464                                                             memory operations               memory operations
5465                                                             to local have                   have
5466                                                             completed before                completed before
5467                                                             performing the                  performing the
5468                                                             atomicrmw that is               atomicrmw that is
5469                                                             being released.                 being released.
5470
5471                                                         2. buffer/global/flat_atomic    2. buffer/global_atomic
5472                                                                                         3. s_waitcnt vm/vscnt(0)
5473
5474                                                                                           - If CU wavefront execution mode, omit vm/vscnt.
5475                                                                                           - Use vmcnt if atomic with
5476                                                                                             return and vscnt if atomic
5477                                                                                             with no-return.
5478                                                                                             waitcnt lgkmcnt(0).
5479                                                                                           - Must happen before
5480                                                                                             the following
5481                                                                                             buffer_gl0_inv.
5482                                                                                           - Ensures any
5483                                                                                             following global
5484                                                                                             data read is no
5485                                                                                             older than the
5486                                                                                             atomicrmw value
5487                                                                                             being acquired.
5488
5489                                                                                         4. buffer_gl0_inv
5490
5491                                                                                           - If CU wavefront execution mode, omit.
5492                                                                                           - Ensures that
5493                                                                                             following
5494                                                                                             loads will not see
5495                                                                                             stale data.
5496
5497     atomicrmw    acq_rel      - workgroup    - local                                    1. waitcnt vmcnt(0) & vscnt(0)
5498
5499                                                                                           - If CU wavefront execution mode, omit.
5500                                                                                           - If OpenCL, omit.
5501                                                                                           - Could be split into
5502                                                                                             separate s_waitcnt
5503                                                                                             vmcnt(0) and s_waitcnt
5504                                                                                             vscnt(0) to allow
5505                                                                                             them to be
5506                                                                                             independently moved
5507                                                                                             according to the
5508                                                                                             following rules.
5509                                                                                           - s_waitcnt vmcnt(0)
5510                                                                                             must happen after
5511                                                                                             any preceding
5512                                                                                             global/generic load/load
5513                                                                                             atomic/
5514                                                                                             atomicrmw-with-return-value.
5515                                                                                           - s_waitcnt vscnt(0)
5516                                                                                             must happen after
5517                                                                                             any preceding
5518                                                                                             global/generic
5519                                                                                             store/store atomic/
5520                                                                                             atomicrmw-no-return-value.
5521                                                                                           - Must happen before
5522                                                                                             the following
5523                                                                                             store.
5524                                                                                           - Ensures that all
5525                                                                                             global memory
5526                                                                                             operations have
5527                                                                                             completed before
5528                                                                                             performing the
5529                                                                                             store that is being
5530                                                                                             released.
5531
5532                                                         1. ds_atomic                    2. ds_atomic
5533                                                         2. s_waitcnt lgkmcnt(0)         3. s_waitcnt lgkmcnt(0)
5534
5535                                                           - If OpenCL, omit.              - If OpenCL, omit.
5536                                                           - Must happen before            - Must happen before
5537                                                             any following                   the following
5538                                                             global/generic                  buffer_gl0_inv.
5539                                                             load/load
5540                                                             atomic/store/store
5541                                                             atomic/atomicrmw.
5542                                                           - Ensures any                   - Ensures any
5543                                                             following global                following global
5544                                                             data read is no                 data read is no
5545                                                             older than the load             older than the load
5546                                                             atomic value being              atomic value being
5547                                                             acquired.                       acquired.
5548
5549                                                                                         4. buffer_gl0_inv
5550
5551                                                                                           - If CU wavefront execution mode, omit.
5552                                                                                           - If OpenCL omit.
5553                                                                                           - Ensures that
5554                                                                                             following
5555                                                                                             loads will not see
5556                                                                                             stale data.
5557
5558     atomicrmw    acq_rel      - workgroup    - generic  1. s_waitcnt lgkmcnt(0)         1. s_waitcnt lgkmcnt(0) &
5559                                                                                            vmcnt(0) & vscnt(0)
5560
5561                                                                                           - If CU wavefront execution mode, omit vmcnt and
5562                                                                                             vscnt.
5563                                                           - If OpenCL, omit.              - If OpenCL, omit
5564                                                                                             waitcnt lgkmcnt(0).
5565                                                           - Must happen after
5566                                                             any preceding
5567                                                             local/generic
5568                                                             load/store/load
5569                                                             atomic/store
5570                                                             atomic/atomicrmw.
5571                                                                                           - Could be split into
5572                                                                                             separate s_waitcnt
5573                                                                                             vmcnt(0), s_waitcnt
5574                                                                                             vscnt(0) and s_waitcnt
5575                                                                                             lgkmcnt(0) to allow
5576                                                                                             them to be
5577                                                                                             independently moved
5578                                                                                             according to the
5579                                                                                             following rules.
5580                                                                                           - s_waitcnt vmcnt(0)
5581                                                                                             must happen after
5582                                                                                             any preceding
5583                                                                                             global/generic load/load
5584                                                                                             atomic/
5585                                                                                             atomicrmw-with-return-value.
5586                                                                                           - s_waitcnt vscnt(0)
5587                                                                                             must happen after
5588                                                                                             any preceding
5589                                                                                             global/generic
5590                                                                                             store/store
5591                                                                                             atomic/
5592                                                                                             atomicrmw-no-return-value.
5593                                                                                           - s_waitcnt lgkmcnt(0)
5594                                                                                             must happen after
5595                                                                                             any preceding
5596                                                                                             local/generic load/store/load
5597                                                                                             atomic/store atomic/atomicrmw.
5598                                                           - Must happen before            - Must happen before
5599                                                             the following                   the following
5600                                                             atomicrmw.                      atomicrmw.
5601                                                           - Ensures that all              - Ensures that all
5602                                                             memory operations               memory operations
5603                                                             to local have                   have
5604                                                             completed before                completed before
5605                                                             performing the                  performing the
5606                                                             atomicrmw that is               atomicrmw that is
5607                                                             being released.                 being released.
5608
5609                                                         2. flat_atomic                  2. flat_atomic
5610                                                         3. s_waitcnt lgkmcnt(0)         3. s_waitcnt lgkmcnt(0) &
5611                                                                                            vm/vscnt(0)
5612
5613                                                                                           - If CU wavefront execution mode, omit vm/vscnt.
5614                                                           - If OpenCL, omit.              - If OpenCL, omit
5615                                                                                             waitcnt lgkmcnt(0).
5616                                                           - Must happen before            - Must happen before
5617                                                             any following                   the following
5618                                                             global/generic                  buffer_gl0_inv.
5619                                                             load/load
5620                                                             atomic/store/store
5621                                                             atomic/atomicrmw.
5622                                                           - Ensures any                   - Ensures any
5623                                                             following global                following global
5624                                                             data read is no                 data read is no
5625                                                             older than the load             older than the load
5626                                                             atomic value being              atomic value being
5627                                                             acquired.                       acquired.
5628
5629                                                                                         3. buffer_gl0_inv
5630
5631                                                                                           - If CU wavefront execution mode, omit.
5632                                                                                           - Ensures that
5633                                                                                             following
5634                                                                                             loads will not see
5635                                                                                             stale data.
5636
5637     atomicrmw    acq_rel      - agent        - global   1. s_waitcnt lgkmcnt(0) &       1. s_waitcnt lgkmcnt(0) &
5638                               - system                     vmcnt(0)                        vmcnt(0) & vscnt(0)
5639
5640                                                           - If OpenCL, omit               - If OpenCL, omit
5641                                                             lgkmcnt(0).                     lgkmcnt(0).
5642                                                           - Could be split into           - Could be split into
5643                                                             separate s_waitcnt              separate s_waitcnt
5644                                                             vmcnt(0) and                    vmcnt(0), s_waitcnt
5645                                                             s_waitcnt                       vscnt(0) and s_waitcnt
5646                                                             lgkmcnt(0) to allow             lgkmcnt(0) to allow
5647                                                             them to be                      them to be
5648                                                             independently moved             independently moved
5649                                                             according to the                according to the
5650                                                             following rules.                following rules.
5651                                                           - s_waitcnt vmcnt(0)            - s_waitcnt vmcnt(0)
5652                                                             must happen after               must happen after
5653                                                             any preceding                   any preceding
5654                                                             global/generic                  global/generic
5655                                                             load/store/load                 load/load atomic/
5656                                                             atomic/store                    atomicrmw-with-return-value.
5657                                                             atomic/atomicrmw.
5658                                                                                           - s_waitcnt vscnt(0)
5659                                                                                             must happen after
5660                                                                                             any preceding
5661                                                                                             global/generic
5662                                                                                             store/store atomic/
5663                                                                                             atomicrmw-no-return-value.
5664                                                           - s_waitcnt lgkmcnt(0)          - s_waitcnt lgkmcnt(0)
5665                                                             must happen after               must happen after
5666                                                             any preceding                   any preceding
5667                                                             local/generic                   local/generic
5668                                                             load/store/load                 load/store/load
5669                                                             atomic/store                    atomic/store
5670                                                             atomic/atomicrmw.               atomic/atomicrmw.
5671                                                           - Must happen before            - Must happen before
5672                                                             the following                   the following
5673                                                             atomicrmw.                      atomicrmw.
5674                                                           - Ensures that all              - Ensures that all
5675                                                             memory operations               memory operations
5676                                                             to global have                  to global have
5677                                                             completed before                completed before
5678                                                             performing the                  performing the
5679                                                             atomicrmw that is               atomicrmw that is
5680                                                             being released.                 being released.
5681
5682                                                         2. buffer/global/flat_atomic    2. buffer/global_atomic
5683                                                         3. s_waitcnt vmcnt(0)           3. s_waitcnt vm/vscnt(0)
5684
5685                                                                                           - Use vmcnt if atomic with
5686                                                                                             return and vscnt if atomic
5687                                                                                             with no-return.
5688                                                                                             waitcnt lgkmcnt(0).
5689                                                           - Must happen before            - Must happen before
5690                                                             following                       following
5691                                                             buffer_wbinvl1_vol.             buffer_gl*_inv.
5692                                                           - Ensures the                   - Ensures the
5693                                                             atomicrmw has                   atomicrmw has
5694                                                             completed before                completed before
5695                                                             invalidating the                invalidating the
5696                                                             cache.                          caches.
5697
5698                                                         4. buffer_wbinvl1_vol           4. buffer_gl0_inv;
5699                                                                                            buffer_gl1_inv
5700
5701                                                           - Must happen before            - Must happen before
5702                                                             any following                   any following
5703                                                             global/generic                  global/generic
5704                                                             load/load                       load/load
5705                                                             atomic/atomicrmw.               atomic/atomicrmw.
5706                                                           - Ensures that                  - Ensures that
5707                                                             following loads                 following loads
5708                                                             will not see stale              will not see stale
5709                                                             global data.                    global data.
5710
5711     atomicrmw    acq_rel      - agent        - generic  1. s_waitcnt lgkmcnt(0) &       1. s_waitcnt lgkmcnt(0) &
5712                               - system                     vmcnt(0)                        vmcnt(0) & vscnt(0)
5713
5714                                                           - If OpenCL, omit               - If OpenCL, omit
5715                                                             lgkmcnt(0).                     lgkmcnt(0).
5716                                                           - Could be split into           - Could be split into
5717                                                             separate s_waitcnt              separate s_waitcnt
5718                                                             vmcnt(0) and                    vmcnt(0), s_waitcnt
5719                                                             s_waitcnt                       vscnt(0) and s_waitcnt
5720                                                             lgkmcnt(0) to allow             lgkmcnt(0) to allow
5721                                                             them to be                      them to be
5722                                                             independently moved             independently moved
5723                                                             according to the                according to the
5724                                                             following rules.                following rules.
5725                                                           - s_waitcnt vmcnt(0)            - s_waitcnt vmcnt(0)
5726                                                             must happen after               must happen after
5727                                                             any preceding                   any preceding
5728                                                             global/generic                  global/generic
5729                                                             load/store/load                 load/load atomic
5730                                                             atomic/store                    atomicrmw-with-return-value.
5731                                                             atomic/atomicrmw.
5732                                                                                           - s_waitcnt vscnt(0)
5733                                                                                             must happen after
5734                                                                                             any preceding
5735                                                                                             global/generic
5736                                                                                             store/store atomic/
5737                                                                                             atomicrmw-no-return-value.
5738                                                           - s_waitcnt lgkmcnt(0)          - s_waitcnt lgkmcnt(0)
5739                                                             must happen after               must happen after
5740                                                             any preceding                   any preceding
5741                                                             local/generic                   local/generic
5742                                                             load/store/load                 load/store/load
5743                                                             atomic/store                    atomic/store
5744                                                             atomic/atomicrmw.               atomic/atomicrmw.
5745                                                           - Must happen before            - Must happen before
5746                                                             the following                   the following
5747                                                             atomicrmw.                      atomicrmw.
5748                                                           - Ensures that all              - Ensures that all
5749                                                             memory operations               memory operations
5750                                                             to global have                  have
5751                                                             completed before                completed before
5752                                                             performing the                  performing the
5753                                                             atomicrmw that is               atomicrmw that is
5754                                                             being released.                 being released.
5755
5756                                                         2. flat_atomic                  2. flat_atomic
5757                                                         3. s_waitcnt vmcnt(0) &         3. s_waitcnt vm/vscnt(0) &
5758                                                            lgkmcnt(0)                      lgkmcnt(0)
5759
5760                                                           - If OpenCL, omit               - If OpenCL, omit
5761                                                             lgkmcnt(0).                     lgkmcnt(0).
5762                                                                                           - Use vmcnt if atomic with
5763                                                                                             return and vscnt if atomic
5764                                                                                             with no-return.
5765                                                           - Must happen before            - Must happen before
5766                                                             following                       following
5767                                                             buffer_wbinvl1_vol.             buffer_gl*_inv.
5768                                                           - Ensures the                   - Ensures the
5769                                                             atomicrmw has                   atomicrmw has
5770                                                             completed before                completed before
5771                                                             invalidating the                invalidating the
5772                                                             cache.                          caches.
5773
5774                                                         4. buffer_wbinvl1_vol           4. buffer_gl0_inv;
5775                                                                                            buffer_gl1_inv
5776
5777                                                           - Must happen before            - Must happen before
5778                                                             any following                   any following
5779                                                             global/generic                  global/generic
5780                                                             load/load                       load/load
5781                                                             atomic/atomicrmw.               atomic/atomicrmw.
5782                                                           - Ensures that                  - Ensures that
5783                                                             following loads                 following loads
5784                                                             will not see stale              will not see stale
5785                                                             global data.                    global data.
5786
5787     fence        acq_rel      - singlethread *none*     *none*                          *none*
5788                               - wavefront
5789     fence        acq_rel      - workgroup    *none*     1. s_waitcnt lgkmcnt(0)         1. s_waitcnt lgkmcnt(0) &
5790                                                                                            vmcnt(0) & vscnt(0)
5791
5792                                                                                           - If CU wavefront execution mode, omit vmcnt and
5793                                                                                             vscnt.
5794                                                           - If OpenCL and                 - If OpenCL and
5795                                                             address space is                address space is
5796                                                             not generic, omit.              not generic, omit
5797                                                                                             lgkmcnt(0).
5798                                                                                           - If OpenCL and
5799                                                                                             address space is
5800                                                                                             local, omit
5801                                                                                             vmcnt(0) and vscnt(0).
5802                                                           - However,                      - However,
5803                                                             since LLVM                      since LLVM
5804                                                             currently has no                currently has no
5805                                                             address space on                address space on
5806                                                             the fence need to               the fence need to
5807                                                             conservatively                  conservatively
5808                                                             always generate                 always generate
5809                                                             (see comment for                (see comment for
5810                                                             previous fence).                previous fence).
5811                                                           - Must happen after
5812                                                             any preceding
5813                                                             local/generic
5814                                                             load/load
5815                                                             atomic/store/store
5816                                                             atomic/atomicrmw.
5817                                                                                           - Could be split into
5818                                                                                             separate s_waitcnt
5819                                                                                             vmcnt(0), s_waitcnt
5820                                                                                             vscnt(0) and s_waitcnt
5821                                                                                             lgkmcnt(0) to allow
5822                                                                                             them to be
5823                                                                                             independently moved
5824                                                                                             according to the
5825                                                                                             following rules.
5826                                                                                           - s_waitcnt vmcnt(0)
5827                                                                                             must happen after
5828                                                                                             any preceding
5829                                                                                             global/generic
5830                                                                                             load/load
5831                                                                                             atomic/
5832                                                                                             atomicrmw-with-return-value.
5833                                                                                           - s_waitcnt vscnt(0)
5834                                                                                             must happen after
5835                                                                                             any preceding
5836                                                                                             global/generic
5837                                                                                             store/store atomic/
5838                                                                                             atomicrmw-no-return-value.
5839                                                                                           - s_waitcnt lgkmcnt(0)
5840                                                                                             must happen after
5841                                                                                             any preceding
5842                                                                                             local/generic
5843                                                                                             load/store/load
5844                                                                                             atomic/store atomic/
5845                                                                                             atomicrmw.
5846                                                           - Must happen before            - Must happen before
5847                                                             any following                   any following
5848                                                             global/generic                  global/generic
5849                                                             load/load                       load/load
5850                                                             atomic/store/store              atomic/store/store
5851                                                             atomic/atomicrmw.               atomic/atomicrmw.
5852                                                           - Ensures that all              - Ensures that all
5853                                                             memory operations               memory operations
5854                                                             to local have                   have
5855                                                             completed before                completed before
5856                                                             performing any                  performing any
5857                                                             following global                following global
5858                                                             memory operations.              memory operations.
5859                                                           - Ensures that the              - Ensures that the
5860                                                             preceding                       preceding
5861                                                             local/generic load              local/generic load
5862                                                             atomic/atomicrmw                atomic/atomicrmw
5863                                                             with an equal or                with an equal or
5864                                                             wider sync scope                wider sync scope
5865                                                             and memory ordering             and memory ordering
5866                                                             stronger than                   stronger than
5867                                                             unordered (this is              unordered (this is
5868                                                             termed the                      termed the
5869                                                             acquire-fence-paired-atomic     acquire-fence-paired-atomic
5870                                                             ) has completed                 ) has completed
5871                                                             before following                before following
5872                                                             global memory                   global memory
5873                                                             operations. This                operations. This
5874                                                             satisfies the                   satisfies the
5875                                                             requirements of                 requirements of
5876                                                             acquire.                        acquire.
5877                                                           - Ensures that all              - Ensures that all
5878                                                             previous memory                 previous memory
5879                                                             operations have                 operations have
5880                                                             completed before a              completed before a
5881                                                             following                       following
5882                                                             local/generic store             local/generic store
5883                                                             atomic/atomicrmw                atomic/atomicrmw
5884                                                             with an equal or                with an equal or
5885                                                             wider sync scope                wider sync scope
5886                                                             and memory ordering             and memory ordering
5887                                                             stronger than                   stronger than
5888                                                             unordered (this is              unordered (this is
5889                                                             termed the                      termed the
5890                                                             release-fence-paired-atomic     release-fence-paired-atomic
5891                                                             ). This satisfies the           ). This satisfies the
5892                                                             requirements of                 requirements of
5893                                                             release.                        release.
5894                                                                                           - Must happen before
5895                                                                                             the following
5896                                                                                             buffer_gl0_inv.
5897                                                                                           - Ensures that the
5898                                                                                             acquire-fence-paired
5899                                                                                             atomic has completed
5900                                                                                             before invalidating
5901                                                                                             the
5902                                                                                             cache. Therefore
5903                                                                                             any following
5904                                                                                             locations read must
5905                                                                                             be no older than
5906                                                                                             the value read by
5907                                                                                             the
5908                                                                                             acquire-fence-paired-atomic.
5909
5910                                                                                         3. buffer_gl0_inv
5911
5912                                                                                           - If CU wavefront execution mode, omit.
5913                                                                                           - Ensures that
5914                                                                                             following
5915                                                                                             loads will not see
5916                                                                                             stale data.
5917
5918     fence        acq_rel      - agent        *none*     1. s_waitcnt lgkmcnt(0) &       1. s_waitcnt lgkmcnt(0) &
5919                               - system                     vmcnt(0)                        vmcnt(0) & vscnt(0)
5920
5921                                                           - If OpenCL and                 - If OpenCL and
5922                                                             address space is                address space is
5923                                                             not generic, omit               not generic, omit
5924                                                             lgkmcnt(0).                     lgkmcnt(0).
5925                                                                                           - If OpenCL and
5926                                                                                             address space is
5927                                                                                             local, omit
5928                                                                                             vmcnt(0) and vscnt(0).
5929                                                           - However, since LLVM           - However, since LLVM
5930                                                             currently has no                currently has no
5931                                                             address space on                address space on
5932                                                             the fence need to               the fence need to
5933                                                             conservatively                  conservatively
5934                                                             always generate                 always generate
5935                                                             (see comment for                (see comment for
5936                                                             previous fence).                previous fence).
5937                                                           - Could be split into           - Could be split into
5938                                                             separate s_waitcnt              separate s_waitcnt
5939                                                             vmcnt(0) and                    vmcnt(0), s_waitcnt
5940                                                             s_waitcnt                       vscnt(0) and s_waitcnt
5941                                                             lgkmcnt(0) to allow             lgkmcnt(0) to allow
5942                                                             them to be                      them to be
5943                                                             independently moved             independently moved
5944                                                             according to the                according to the
5945                                                             following rules.                following rules.
5946                                                           - s_waitcnt vmcnt(0)            - s_waitcnt vmcnt(0)
5947                                                             must happen after               must happen after
5948                                                             any preceding                   any preceding
5949                                                             global/generic                  global/generic
5950                                                             load/store/load                 load/load
5951                                                             atomic/store                    atomic/
5952                                                             atomic/atomicrmw.               atomicrmw-with-return-value.
5953                                                                                           - s_waitcnt vscnt(0)
5954                                                                                             must happen after
5955                                                                                             any preceding
5956                                                                                             global/generic
5957                                                                                             store/store atomic/
5958                                                                                             atomicrmw-no-return-value.
5959                                                           - s_waitcnt lgkmcnt(0)          - s_waitcnt lgkmcnt(0)
5960                                                             must happen after               must happen after
5961                                                             any preceding                   any preceding
5962                                                             local/generic                   local/generic
5963                                                             load/store/load                 load/store/load
5964                                                             atomic/store                    atomic/store
5965                                                             atomic/atomicrmw.               atomic/atomicrmw.
5966                                                           - Must happen before            - Must happen before
5967                                                             the following                   the following
5968                                                             buffer_wbinvl1_vol.             buffer_gl*_inv.
5969                                                           - Ensures that the              - Ensures that the
5970                                                             preceding                       preceding
5971                                                             global/local/generic            global/local/generic
5972                                                             load                            load
5973                                                             atomic/atomicrmw                atomic/atomicrmw
5974                                                             with an equal or                with an equal or
5975                                                             wider sync scope                wider sync scope
5976                                                             and memory ordering             and memory ordering
5977                                                             stronger than                   stronger than
5978                                                             unordered (this is              unordered (this is
5979                                                             termed the                      termed the
5980                                                             acquire-fence-paired-atomic     acquire-fence-paired-atomic
5981                                                             ) has completed                 ) has completed
5982                                                             before invalidating             before invalidating
5983                                                             the cache. This                 the caches. This
5984                                                             satisfies the                   satisfies the
5985                                                             requirements of                 requirements of
5986                                                             acquire.                        acquire.
5987                                                           - Ensures that all              - Ensures that all
5988                                                             previous memory                 previous memory
5989                                                             operations have                 operations have
5990                                                             completed before a              completed before a
5991                                                             following                       following
5992                                                             global/local/generic            global/local/generic
5993                                                             store                           store
5994                                                             atomic/atomicrmw                atomic/atomicrmw
5995                                                             with an equal or                with an equal or
5996                                                             wider sync scope                wider sync scope
5997                                                             and memory ordering             and memory ordering
5998                                                             stronger than                   stronger than
5999                                                             unordered (this is              unordered (this is
6000                                                             termed the                      termed the
6001                                                             release-fence-paired-atomic     release-fence-paired-atomic
6002                                                             ). This satisfies the           ). This satisfies the
6003                                                             requirements of                 requirements of
6004                                                             release.                        release.
6005
6006                                                         2. buffer_wbinvl1_vol           2. buffer_gl0_inv;
6007                                                                                            buffer_gl1_inv
6008
6009                                                           - Must happen before            - Must happen before
6010                                                             any following                   any following
6011                                                             global/generic                  global/generic
6012                                                             load/load                       load/load
6013                                                             atomic/store/store              atomic/store/store
6014                                                             atomic/atomicrmw.               atomic/atomicrmw.
6015                                                           - Ensures that                  - Ensures that
6016                                                             following loads                 following loads
6017                                                             will not see stale              will not see stale
6018                                                             global data. This               global data. This
6019                                                             satisfies the                   satisfies the
6020                                                             requirements of                 requirements of
6021                                                             acquire.                        acquire.
6022
6023     **Sequential Consistent Atomic**
6024     ----------------------------------------------------------------------------------------------------------------------
6025     load atomic  seq_cst      - singlethread - global   *Same as corresponding          *Same as corresponding
6026                               - wavefront    - local    load atomic acquire,            load atomic acquire,
6027                                              - generic  except must generated           except must generated
6028                                                         all instructions even           all instructions even
6029                                                         for OpenCL.*                    for OpenCL.*
6030     load atomic  seq_cst      - workgroup    - global   1. s_waitcnt lgkmcnt(0)         1. s_waitcnt lgkmcnt(0) &
6031                                              - generic                                     vmcnt(0) & vscnt(0)
6032
6033                                                                                           - If CU wavefront execution mode, omit vmcnt and
6034                                                                                             vscnt.
6035                                                                                           - Could be split into
6036                                                                                             separate s_waitcnt
6037                                                                                             vmcnt(0), s_waitcnt
6038                                                                                             vscnt(0) and s_waitcnt
6039                                                                                             lgkmcnt(0) to allow
6040                                                                                             them to be
6041                                                                                             independently moved
6042                                                                                             according to the
6043                                                                                             following rules.
6044                                                           - Must                          - waitcnt lgkmcnt(0) must
6045                                                             happen after                    happen after
6046                                                             preceding                       preceding
6047                                                             global/generic load             local load
6048                                                             atomic/store                    atomic/store
6049                                                             atomic/atomicrmw                atomic/atomicrmw
6050                                                             with memory                     with memory
6051                                                             ordering of seq_cst             ordering of seq_cst
6052                                                             and with equal or               and with equal or
6053                                                             wider sync scope.               wider sync scope.
6054                                                             (Note that seq_cst              (Note that seq_cst
6055                                                             fences have their               fences have their
6056                                                             own s_waitcnt                   own s_waitcnt
6057                                                             lgkmcnt(0) and so do            lgkmcnt(0) and so do
6058                                                             not need to be                  not need to be
6059                                                             considered.)                    considered.)
6060                                                                                           - waitcnt vmcnt(0)
6061                                                                                             Must happen after
6062                                                                                             preceding
6063                                                                                             global/generic load
6064                                                                                             atomic/
6065                                                                                             atomicrmw-with-return-value
6066                                                                                             with memory
6067                                                                                             ordering of seq_cst
6068                                                                                             and with equal or
6069                                                                                             wider sync scope.
6070                                                                                             (Note that seq_cst
6071                                                                                             fences have their
6072                                                                                             own s_waitcnt
6073                                                                                             vmcnt(0) and so do
6074                                                                                             not need to be
6075                                                                                             considered.)
6076                                                                                           - waitcnt vscnt(0)
6077                                                                                             Must happen after
6078                                                                                             preceding
6079                                                                                             global/generic store
6080                                                                                             atomic/
6081                                                                                             atomicrmw-no-return-value
6082                                                                                             with memory
6083                                                                                             ordering of seq_cst
6084                                                                                             and with equal or
6085                                                                                             wider sync scope.
6086                                                                                             (Note that seq_cst
6087                                                                                             fences have their
6088                                                                                             own s_waitcnt
6089                                                                                             vscnt(0) and so do
6090                                                                                             not need to be
6091                                                                                             considered.)
6092                                                           - Ensures any                   - Ensures any
6093                                                             preceding                       preceding
6094                                                             sequential                      sequential
6095                                                             consistent local                consistent global/local
6096                                                             memory instructions             memory instructions
6097                                                             have completed                  have completed
6098                                                             before executing                before executing
6099                                                             this sequentially               this sequentially
6100                                                             consistent                      consistent
6101                                                             instruction. This               instruction. This
6102                                                             prevents reordering             prevents reordering
6103                                                             a seq_cst store                 a seq_cst store
6104                                                             followed by a                   followed by a
6105                                                             seq_cst load. (Note             seq_cst load. (Note
6106                                                             that seq_cst is                 that seq_cst is
6107                                                             stronger than                   stronger than
6108                                                             acquire/release as              acquire/release as
6109                                                             the reordering of               the reordering of
6110                                                             load acquire                    load acquire
6111                                                             followed by a store             followed by a store
6112                                                             release is                      release is
6113                                                             prevented by the                prevented by the
6114                                                             waitcnt of                      waitcnt of
6115                                                             the release, but                the release, but
6116                                                             there is nothing                there is nothing
6117                                                             preventing a store              preventing a store
6118                                                             release followed by             release followed by
6119                                                             load acquire from               load acquire from
6120                                                             competing out of                competing out of
6121                                                             order.)                         order.)
6122
6123                                                         2. *Following                   2. *Following
6124                                                            instructions same as            instructions same as
6125                                                            corresponding load              corresponding load
6126                                                            atomic acquire,                 atomic acquire,
6127                                                            except must generated           except must generated
6128                                                            all instructions even           all instructions even
6129                                                            for OpenCL.*                    for OpenCL.*
6130     load atomic  seq_cst      - workgroup    - local    *Same as corresponding
6131                                                         load atomic acquire,
6132                                                         except must generated
6133                                                         all instructions even
6134                                                         for OpenCL.*
6135
6136                                                                                         1. s_waitcnt vmcnt(0) & vscnt(0)
6137
6138                                                                                           - If CU wavefront execution mode, omit.
6139                                                                                           - Could be split into
6140                                                                                             separate s_waitcnt
6141                                                                                             vmcnt(0) and s_waitcnt
6142                                                                                             vscnt(0) to allow
6143                                                                                             them to be
6144                                                                                             independently moved
6145                                                                                             according to the
6146                                                                                             following rules.
6147                                                                                           - waitcnt vmcnt(0)
6148                                                                                             Must happen after
6149                                                                                             preceding
6150                                                                                             global/generic load
6151                                                                                             atomic/
6152                                                                                             atomicrmw-with-return-value
6153                                                                                             with memory
6154                                                                                             ordering of seq_cst
6155                                                                                             and with equal or
6156                                                                                             wider sync scope.
6157                                                                                             (Note that seq_cst
6158                                                                                             fences have their
6159                                                                                             own s_waitcnt
6160                                                                                             vmcnt(0) and so do
6161                                                                                             not need to be
6162                                                                                             considered.)
6163                                                                                           - waitcnt vscnt(0)
6164                                                                                             Must happen after
6165                                                                                             preceding
6166                                                                                             global/generic store
6167                                                                                             atomic/
6168                                                                                             atomicrmw-no-return-value
6169                                                                                             with memory
6170                                                                                             ordering of seq_cst
6171                                                                                             and with equal or
6172                                                                                             wider sync scope.
6173                                                                                             (Note that seq_cst
6174                                                                                             fences have their
6175                                                                                             own s_waitcnt
6176                                                                                             vscnt(0) and so do
6177                                                                                             not need to be
6178                                                                                             considered.)
6179                                                                                           - Ensures any
6180                                                                                             preceding
6181                                                                                             sequential
6182                                                                                             consistent global
6183                                                                                             memory instructions
6184                                                                                             have completed
6185                                                                                             before executing
6186                                                                                             this sequentially
6187                                                                                             consistent
6188                                                                                             instruction. This
6189                                                                                             prevents reordering
6190                                                                                             a seq_cst store
6191                                                                                             followed by a
6192                                                                                             seq_cst load. (Note
6193                                                                                             that seq_cst is
6194                                                                                             stronger than
6195                                                                                             acquire/release as
6196                                                                                             the reordering of
6197                                                                                             load acquire
6198                                                                                             followed by a store
6199                                                                                             release is
6200                                                                                             prevented by the
6201                                                                                             waitcnt of
6202                                                                                             the release, but
6203                                                                                             there is nothing
6204                                                                                             preventing a store
6205                                                                                             release followed by
6206                                                                                             load acquire from
6207                                                                                             competing out of
6208                                                                                             order.)
6209
6210                                                                                         2. *Following
6211                                                                                            instructions same as
6212                                                                                            corresponding load
6213                                                                                            atomic acquire,
6214                                                                                            except must generated
6215                                                                                            all instructions even
6216                                                                                            for OpenCL.*
6217
6218     load atomic  seq_cst      - agent        - global   1. s_waitcnt lgkmcnt(0) &       1. s_waitcnt lgkmcnt(0) &
6219                               - system       - generic     vmcnt(0)                        vmcnt(0) & vscnt(0)
6220
6221                                                           - Could be split into           - Could be split into
6222                                                             separate s_waitcnt              separate s_waitcnt
6223                                                             vmcnt(0)                        vmcnt(0), s_waitcnt
6224                                                             and s_waitcnt                   vscnt(0) and s_waitcnt
6225                                                             lgkmcnt(0) to allow             lgkmcnt(0) to allow
6226                                                             them to be                      them to be
6227                                                             independently moved             independently moved
6228                                                             according to the                according to the
6229                                                             following rules.                following rules.
6230                                                           - waitcnt lgkmcnt(0)            - waitcnt lgkmcnt(0)
6231                                                             must happen after               must happen after
6232                                                             preceding                       preceding
6233                                                             global/generic load             local load
6234                                                             atomic/store                    atomic/store
6235                                                             atomic/atomicrmw                atomic/atomicrmw
6236                                                             with memory                     with memory
6237                                                             ordering of seq_cst             ordering of seq_cst
6238                                                             and with equal or               and with equal or
6239                                                             wider sync scope.               wider sync scope.
6240                                                             (Note that seq_cst              (Note that seq_cst
6241                                                             fences have their               fences have their
6242                                                             own s_waitcnt                   own s_waitcnt
6243                                                             lgkmcnt(0) and so do            lgkmcnt(0) and so do
6244                                                             not need to be                  not need to be
6245                                                             considered.)                    considered.)
6246                                                           - waitcnt vmcnt(0)              - waitcnt vmcnt(0)
6247                                                             must happen after               must happen after
6248                                                             preceding                       preceding
6249                                                             global/generic load             global/generic load
6250                                                             atomic/store                    atomic/
6251                                                             atomic/atomicrmw                atomicrmw-with-return-value
6252                                                             with memory                     with memory
6253                                                             ordering of seq_cst             ordering of seq_cst
6254                                                             and with equal or               and with equal or
6255                                                             wider sync scope.               wider sync scope.
6256                                                             (Note that seq_cst              (Note that seq_cst
6257                                                             fences have their               fences have their
6258                                                             own s_waitcnt                   own s_waitcnt
6259                                                             vmcnt(0) and so do              vmcnt(0) and so do
6260                                                             not need to be                  not need to be
6261                                                             considered.)                    considered.)
6262                                                                                           - waitcnt vscnt(0)
6263                                                                                             Must happen after
6264                                                                                             preceding
6265                                                                                             global/generic store
6266                                                                                             atomic/
6267                                                                                             atomicrmw-no-return-value
6268                                                                                             with memory
6269                                                                                             ordering of seq_cst
6270                                                                                             and with equal or
6271                                                                                             wider sync scope.
6272                                                                                             (Note that seq_cst
6273                                                                                             fences have their
6274                                                                                             own s_waitcnt
6275                                                                                             vscnt(0) and so do
6276                                                                                             not need to be
6277                                                                                             considered.)
6278                                                           - Ensures any                   - Ensures any
6279                                                             preceding                       preceding
6280                                                             sequential                      sequential
6281                                                             consistent global               consistent global
6282                                                             memory instructions             memory instructions
6283                                                             have completed                  have completed
6284                                                             before executing                before executing
6285                                                             this sequentially               this sequentially
6286                                                             consistent                      consistent
6287                                                             instruction. This               instruction. This
6288                                                             prevents reordering             prevents reordering
6289                                                             a seq_cst store                 a seq_cst store
6290                                                             followed by a                   followed by a
6291                                                             seq_cst load. (Note             seq_cst load. (Note
6292                                                             that seq_cst is                 that seq_cst is
6293                                                             stronger than                   stronger than
6294                                                             acquire/release as              acquire/release as
6295                                                             the reordering of               the reordering of
6296                                                             load acquire                    load acquire
6297                                                             followed by a store             followed by a store
6298                                                             release is                      release is
6299                                                             prevented by the                prevented by the
6300                                                             waitcnt of                      waitcnt of
6301                                                             the release, but                the release, but
6302                                                             there is nothing                there is nothing
6303                                                             preventing a store              preventing a store
6304                                                             release followed by             release followed by
6305                                                             load acquire from               load acquire from
6306                                                             competing out of                competing out of
6307                                                             order.)                         order.)
6308
6309                                                         2. *Following                   2. *Following
6310                                                            instructions same as            instructions same as
6311                                                            corresponding load              corresponding load
6312                                                            atomic acquire,                 atomic acquire,
6313                                                            except must generated           except must generated
6314                                                            all instructions even           all instructions even
6315                                                            for OpenCL.*                    for OpenCL.*
6316     store atomic seq_cst      - singlethread - global   *Same as corresponding          *Same as corresponding
6317                               - wavefront    - local    store atomic release,           store atomic release,
6318                               - workgroup    - generic  except must generated           except must generated
6319                                                         all instructions even           all instructions even
6320                                                         for OpenCL.*                    for OpenCL.*
6321     store atomic seq_cst      - agent        - global   *Same as corresponding          *Same as corresponding
6322                               - system       - generic  store atomic release,           store atomic release,
6323                                                         except must generated           except must generated
6324                                                         all instructions even           all instructions even
6325                                                         for OpenCL.*                    for OpenCL.*
6326     atomicrmw    seq_cst      - singlethread - global   *Same as corresponding          *Same as corresponding
6327                               - wavefront    - local    atomicrmw acq_rel,              atomicrmw acq_rel,
6328                               - workgroup    - generic  except must generated           except must generated
6329                                                         all instructions even           all instructions even
6330                                                         for OpenCL.*                    for OpenCL.*
6331     atomicrmw    seq_cst      - agent        - global   *Same as corresponding          *Same as corresponding
6332                               - system       - generic  atomicrmw acq_rel,              atomicrmw acq_rel,
6333                                                         except must generated           except must generated
6334                                                         all instructions even           all instructions even
6335                                                         for OpenCL.*                    for OpenCL.*
6336     fence        seq_cst      - singlethread *none*     *Same as corresponding          *Same as corresponding
6337                               - wavefront               fence acq_rel,                  fence acq_rel,
6338                               - workgroup               except must generated           except must generated
6339                               - agent                   all instructions even           all instructions even
6340                               - system                  for OpenCL.*                    for OpenCL.*
6341     ============ ============ ============== ========== =============================== ==================================
6342
6343The memory order also adds the single thread optimization constrains defined in
6344table
6345:ref:`amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-gfx6-gfx10-table`.
6346
6347  .. table:: AMDHSA Memory Model Single Thread Optimization Constraints GFX6-GFX10
6348     :name: amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-gfx6-gfx10-table
6349
6350     ============ ==============================================================
6351     LLVM Memory  Optimization Constraints
6352     Ordering
6353     ============ ==============================================================
6354     unordered    *none*
6355     monotonic    *none*
6356     acquire      - If a load atomic/atomicrmw then no following load/load
6357                    atomic/store/ store atomic/atomicrmw/fence instruction can
6358                    be moved before the acquire.
6359                  - If a fence then same as load atomic, plus no preceding
6360                    associated fence-paired-atomic can be moved after the fence.
6361     release      - If a store atomic/atomicrmw then no preceding load/load
6362                    atomic/store/ store atomic/atomicrmw/fence instruction can
6363                    be moved after the release.
6364                  - If a fence then same as store atomic, plus no following
6365                    associated fence-paired-atomic can be moved before the
6366                    fence.
6367     acq_rel      Same constraints as both acquire and release.
6368     seq_cst      - If a load atomic then same constraints as acquire, plus no
6369                    preceding sequentially consistent load atomic/store
6370                    atomic/atomicrmw/fence instruction can be moved after the
6371                    seq_cst.
6372                  - If a store atomic then the same constraints as release, plus
6373                    no following sequentially consistent load atomic/store
6374                    atomic/atomicrmw/fence instruction can be moved before the
6375                    seq_cst.
6376                  - If an atomicrmw/fence then same constraints as acq_rel.
6377     ============ ==============================================================
6378
6379Trap Handler ABI
6380~~~~~~~~~~~~~~~~
6381
6382For code objects generated by AMDGPU backend for HSA [HSA]_ compatible runtimes
6383(such as ROCm [AMD-ROCm]_), the runtime installs a trap handler that supports
6384the ``s_trap`` instruction with the following usage:
6385
6386  .. table:: AMDGPU Trap Handler for AMDHSA OS
6387     :name: amdgpu-trap-handler-for-amdhsa-os-table
6388
6389     =================== =============== =============== =======================
6390     Usage               Code Sequence   Trap Handler    Description
6391                                         Inputs
6392     =================== =============== =============== =======================
6393     reserved            ``s_trap 0x00``                 Reserved by hardware.
6394     ``debugtrap(arg)``  ``s_trap 0x01`` ``SGPR0-1``:    Reserved for HSA
6395                                           ``queue_ptr`` ``debugtrap``
6396                                         ``VGPR0``:      intrinsic (not
6397                                           ``arg``       implemented).
6398     ``llvm.trap``       ``s_trap 0x02`` ``SGPR0-1``:    Causes dispatch to be
6399                                           ``queue_ptr`` terminated and its
6400                                                         associated queue put
6401                                                         into the error state.
6402     ``llvm.debugtrap``  ``s_trap 0x03``                 - If debugger not
6403                                                           installed then
6404                                                           behaves as a
6405                                                           no-operation. The
6406                                                           trap handler is
6407                                                           entered and
6408                                                           immediately returns
6409                                                           to continue
6410                                                           execution of the
6411                                                           wavefront.
6412                                                         - If the debugger is
6413                                                           installed, causes
6414                                                           the debug trap to be
6415                                                           reported by the
6416                                                           debugger and the
6417                                                           wavefront is put in
6418                                                           the halt state until
6419                                                           resumed by the
6420                                                           debugger.
6421     reserved            ``s_trap 0x04``                 Reserved.
6422     reserved            ``s_trap 0x05``                 Reserved.
6423     reserved            ``s_trap 0x06``                 Reserved.
6424     debugger breakpoint ``s_trap 0x07``                 Reserved for debugger
6425                                                         breakpoints.
6426     reserved            ``s_trap 0x08``                 Reserved.
6427     reserved            ``s_trap 0xfe``                 Reserved.
6428     reserved            ``s_trap 0xff``                 Reserved.
6429     =================== =============== =============== =======================
6430
6431.. _amdgpu-amdhsa-function-call-convention:
6432
6433Call Convention
6434~~~~~~~~~~~~~~~
6435
6436.. note::
6437
6438  This section is currently incomplete and has inakkuracies. It is WIP that will
6439  be updated as information is determined.
6440
6441See :ref:`amdgpu-dwarf-address-space-identifier` for information on swizzled
6442addresses. Unswizzled addresses are normal linear addresses.
6443
6444.. _amdgpu-amdhsa-function-call-convention-kernel-functions:
6445
6446Kernel Functions
6447++++++++++++++++
6448
6449This section describes the call convention ABI for the outer kernel function.
6450
6451See :ref:`amdgpu-amdhsa-initial-kernel-execution-state` for the kernel call
6452convention.
6453
6454The following is not part of the AMDGPU kernel calling convention but describes
6455how the AMDGPU implements function calls:
6456
64571.  Clang decides the kernarg layout to match the *HSA Programmer's Language
6458    Reference* [HSA]_.
6459
6460    - All structs are passed directly.
6461    - Lambda values are passed *TBA*.
6462
6463    .. TODO::
6464
6465      - Does this really follow HSA rules? Or are structs >16 bytes passed
6466        by-value struct?
6467      - What is ABI for lambda values?
6468
64694.  The kernel performs certain setup in its prolog, as described in
6470    :ref:`amdgpu-amdhsa-kernel-prolog`.
6471
6472.. _amdgpu-amdhsa-function-call-convention-non-kernel-functions:
6473
6474Non-Kernel Functions
6475++++++++++++++++++++
6476
6477This section describes the call convention ABI for functions other than the
6478outer kernel function.
6479
6480If a kernel has function calls then scratch is always allocated and used for
6481the call stack which grows from low address to high address using the swizzled
6482scratch address space.
6483
6484On entry to a function:
6485
64861.  SGPR0-3 contain a V# with the following properties (see
6487    :ref:`amdgpu-amdhsa-kernel-prolog-private-segment-buffer`):
6488
6489    * Base address pointing to the beginning of the wavefront scratch backing
6490      memory.
6491    * Swizzled with dword element size and stride of wavefront size elements.
6492
64932.  The FLAT_SCRATCH register pair is setup. See
6494    :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`.
64953.  GFX6-8: M0 register set to the size of LDS in bytes. See
6496    :ref:`amdgpu-amdhsa-kernel-prolog-m0`.
64974.  The EXEC register is set to the lanes active on entry to the function.
64985.  MODE register: *TBD*
64996.  VGPR0-31 and SGPR4-29 are used to pass function input arguments as described
6500    below.
65017.  SGPR30-31 return address (RA). The code address that the function must
6502    return to when it completes. The value is undefined if the function is *no
6503    return*.
65048.  SGPR32 is used for the stack pointer (SP). It is an unswizzled scratch
6505    offset relative to the beginning of the wavefront scratch backing memory.
6506
6507    The unswizzled SP can be used with buffer instructions as an unswizzled SGPR
6508    offset with the scratch V# in SGPR0-3 to access the stack in a swizzled
6509    manner.
6510
6511    The unswizzled SP value can be converted into the swizzled SP value by:
6512
6513      | swizzled SP = unswizzled SP / wavefront size
6514
6515    This may be used to obtain the private address space address of stack
6516    objects and to convert this address to a flat address by adding the flat
6517    scratch aperture base address.
6518
6519    The swizzled SP value is always 4 bytes aligned for the ``r600``
6520    architecture and 16 byte aligned for the ``amdgcn`` architecture.
6521
6522    .. note::
6523
6524      The ``amdgcn`` value is selected to avoid dynamic stack alignment for the
6525      OpenCL language which has the largest base type defined as 16 bytes.
6526
6527    On entry, the swizzled SP value is the address of the first function
6528    argument passed on the stack. Other stack passed arguments are positive
6529    offsets from the entry swizzled SP value.
6530
6531    The function may use positive offsets beyond the last stack passed argument
6532    for stack allocated local variables and register spill slots. If necessary,
6533    the function may align these to greater alignment than 16 bytes. After these
6534    the function may dynamically allocate space for such things as runtime sized
6535    ``alloca`` local allocations.
6536
6537    If the function calls another function, it will place any stack allocated
6538    arguments after the last local allocation and adjust SGPR32 to the address
6539    after the last local allocation.
6540
65419.  All other registers are unspecified.
654210. Any necessary ``waitcnt`` has been performed to ensure memory is available
6543    to the function.
6544
6545On exit from a function:
6546
65471.  VGPR0-31 and SGPR4-29 are used to pass function result arguments as
6548    described below. Any registers used are considered clobbered registers.
65492.  The following registers are preserved and have the same value as on entry:
6550
6551    * FLAT_SCRATCH
6552    * EXEC
6553    * GFX6-8: M0
6554    * All SGPR registers except the clobbered registers of SGPR4-31.
6555    * VGPR40-47
6556      VGPR56-63
6557      VGPR72-79
6558      VGPR88-95
6559      VGPR104-111
6560      VGPR120-127
6561      VGPR136-143
6562      VGPR152-159
6563      VGPR168-175
6564      VGPR184-191
6565      VGPR200-207
6566      VGPR216-223
6567      VGPR232-239
6568      VGPR248-255
6569
6570        *Except the argument registers, the VGPR cloberred and the preserved
6571        registers are intermixed at regular intervals in order to
6572        get a better occupancy.*
6573
6574      For the AMDGPU backend, an inter-procedural register allocation (IPRA)
6575      optimization may mark some of clobbered SGPR and VGPR registers as
6576      preserved if it can be determined that the called function does not change
6577      their value.
6578
65792.  The PC is set to the RA provided on entry.
65803.  MODE register: *TBD*.
65814.  All other registers are clobbered.
65825.  Any necessary ``waitcnt`` has been performed to ensure memory accessed by
6583    function is available to the caller.
6584
6585.. TODO::
6586
6587  - On gfx908 are all ACC registers clobbered?
6588
6589  - How are function results returned? The address of structured types is passed
6590    by reference, but what about other types?
6591
6592The function input arguments are made up of the formal arguments explicitly
6593declared by the source language function plus the implicit input arguments used
6594by the implementation.
6595
6596The source language input arguments are:
6597
65981. Any source language implicit ``this`` or ``self`` argument comes first as a
6599   pointer type.
66002. Followed by the function formal arguments in left to right source order.
6601
6602The source language result arguments are:
6603
66041. The function result argument.
6605
6606The source language input or result struct type arguments that are less than or
6607equal to 16 bytes, are decomposed recursively into their base type fields, and
6608each field is passed as if a separate argument. For input arguments, if the
6609called function requires the struct to be in memory, for example because its
6610address is taken, then the function body is responsible for allocating a stack
6611location and copying the field arguments into it. Clang terms this *direct
6612struct*.
6613
6614The source language input struct type arguments that are greater than 16 bytes,
6615are passed by reference. The caller is responsible for allocating a stack
6616location to make a copy of the struct value and pass the address as the input
6617argument. The called function is responsible to perform the dereference when
6618accessing the input argument. Clang terms this *by-value struct*.
6619
6620A source language result struct type argument that is greater than 16 bytes, is
6621returned by reference. The caller is responsible for allocating a stack location
6622to hold the result value and passes the address as the last input argument
6623(before the implicit input arguments). In this case there are no result
6624arguments. The called function is responsible to perform the dereference when
6625storing the result value. Clang terms this *structured return (sret)*.
6626
6627*TODO: correct the ``sret`` definition.*
6628
6629.. TODO::
6630
6631  Is this definition correct? Or is ``sret`` only used if passing in registers, and
6632  pass as non-decomposed struct as stack argument? Or something else? Is the
6633  memory location in the caller stack frame, or a stack memory argument and so
6634  no address is passed as the caller can directly write to the argument stack
6635  location? But then the stack location is still live after return. If an
6636  argument stack location is it the first stack argument or the last one?
6637
6638Lambda argument types are treated as struct types with an implementation defined
6639set of fields.
6640
6641.. TODO::
6642
6643  Need to specify the ABI for lambda types for AMDGPU.
6644
6645For AMDGPU backend all source language arguments (including the decomposed
6646struct type arguments) are passed in VGPRs unless marked ``inreg`` in which case
6647they are passed in SGPRs.
6648
6649The AMDGPU backend walks the function call graph from the leaves to determine
6650which implicit input arguments are used, propagating to each caller of the
6651function. The used implicit arguments are appended to the function arguments
6652after the source language arguments in the following order:
6653
6654.. TODO::
6655
6656  Is recursion or external functions supported?
6657
66581.  Work-Item ID (1 VGPR)
6659
6660    The X, Y and Z work-item ID are packed into a single VGRP with the following
6661    layout. Only fields actually used by the function are set. The other bits
6662    are undefined.
6663
6664    The values come from the initial kernel execution state. See
6665    :ref:`amdgpu-amdhsa-vgpr-register-set-up-order-table`.
6666
6667    .. table:: Work-item implicit argument layout
6668      :name: amdgpu-amdhsa-workitem-implicit-argument-layout-table
6669
6670      ======= ======= ==============
6671      Bits    Size    Field Name
6672      ======= ======= ==============
6673      9:0     10 bits X Work-Item ID
6674      19:10   10 bits Y Work-Item ID
6675      29:20   10 bits Z Work-Item ID
6676      31:30   2 bits  Unused
6677      ======= ======= ==============
6678
66792.  Dispatch Ptr (2 SGPRs)
6680
6681    The value comes from the initial kernel execution state. See
6682    :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
6683
66843.  Queue Ptr (2 SGPRs)
6685
6686    The value comes from the initial kernel execution state. See
6687    :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
6688
66894.  Kernarg Segment Ptr (2 SGPRs)
6690
6691    The value comes from the initial kernel execution state. See
6692    :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
6693
66945.  Dispatch id (2 SGPRs)
6695
6696    The value comes from the initial kernel execution state. See
6697    :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
6698
66996.  Work-Group ID X (1 SGPR)
6700
6701    The value comes from the initial kernel execution state. See
6702    :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
6703
67047.  Work-Group ID Y (1 SGPR)
6705
6706    The value comes from the initial kernel execution state. See
6707    :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
6708
67098.  Work-Group ID Z (1 SGPR)
6710
6711    The value comes from the initial kernel execution state. See
6712    :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
6713
67149.  Implicit Argument Ptr (2 SGPRs)
6715
6716    The value is computed by adding an offset to Kernarg Segment Ptr to get the
6717    global address space pointer to the first kernarg implicit argument.
6718
6719The input and result arguments are assigned in order in the following manner:
6720
6721.. note::
6722
6723  There are likely some errors and omissions in the following description that
6724  need correction.
6725
6726  .. TODO::
6727
6728    Check the clang source code to decipher how function arguments and return
6729    results are handled. Also see the AMDGPU specific values used.
6730
6731* VGPR arguments are assigned to consecutive VGPRs starting at VGPR0 up to
6732  VGPR31.
6733
6734  If there are more arguments than will fit in these registers, the remaining
6735  arguments are allocated on the stack in order on naturally aligned
6736  addresses.
6737
6738  .. TODO::
6739
6740    How are overly aligned structures allocated on the stack?
6741
6742* SGPR arguments are assigned to consecutive SGPRs starting at SGPR0 up to
6743  SGPR29.
6744
6745  If there are more arguments than will fit in these registers, the remaining
6746  arguments are allocated on the stack in order on naturally aligned
6747  addresses.
6748
6749Note that decomposed struct type arguments may have some fields passed in
6750registers and some in memory.
6751
6752.. TODO::
6753
6754  So, a struct which can pass some fields as decomposed register arguments, will
6755  pass the rest as decomposed stack elements? But an argument that will not start
6756  in registers will not be decomposed and will be passed as a non-decomposed
6757  stack value?
6758
6759The following is not part of the AMDGPU function calling convention but
6760describes how the AMDGPU implements function calls:
6761
67621.  SGPR33 is used as a frame pointer (FP) if necessary. Like the SP it is an
6763    unswizzled scratch address. It is only needed if runtime sized ``alloca``
6764    are used, or for the reasons defined in ``SIFrameLowering``.
67652.  Runtime stack alignment is supported. SGPR34 is used as a base pointer (BP)
6766    to access the incoming stack arguments in the function. The BP is needed
6767    only when the function requires the runtime stack alignment.
6768
67693.  Allocating SGPR arguments on the stack are not supported.
6770
67714.  No CFI is currently generated. See
6772    :ref:`amdgpu-dwarf-call-frame-information`.
6773
6774    .. note::
6775
6776      CFI will be generated that defines the CFA as the unswizzled address
6777      relative to the wave scratch base in the unswizzled private address space
6778      of the lowest address stack allocated local variable.
6779
6780      ``DW_AT_frame_base`` will be defined as the swizzled address in the
6781      swizzled private address space by dividing the CFA by the wavefront size
6782      (since CFA is always at least dword aligned which matches the scratch
6783      swizzle element size).
6784
6785      If no dynamic stack alignment was performed, the stack allocated arguments
6786      are accessed as negative offsets relative to ``DW_AT_frame_base``, and the
6787      local variables and register spill slots are accessed as positive offsets
6788      relative to ``DW_AT_frame_base``.
6789
67905.  Function argument passing is implemented by copying the input physical
6791    registers to virtual registers on entry. The register allocator can spill if
6792    necessary. These are copied back to physical registers at call sites. The
6793    net effect is that each function call can have these values in entirely
6794    distinct locations. The IPRA can help avoid shuffling argument registers.
67956.  Call sites are implemented by setting up the arguments at positive offsets
6796    from SP. Then SP is incremented to account for the known frame size before
6797    the call and decremented after the call.
6798
6799    .. note::
6800
6801      The CFI will reflect the changed calculation needed to compute the CFA
6802      from SP.
6803
68047.  4 byte spill slots are used in the stack frame. One slot is allocated for an
6805    emergency spill slot. Buffer instructions are used for stack accesses and
6806    not the ``flat_scratch`` instruction.
6807
6808    .. TODO::
6809
6810      Explain when the emergency spill slot is used.
6811
6812.. TODO::
6813
6814  Possible broken issues:
6815
6816  - Stack arguments must be aligned to required alignment.
6817  - Stack is aligned to max(16, max formal argument alignment)
6818  - Direct argument < 64 bits should check register budget.
6819  - Register budget calculation should respect ``inreg`` for SGPR.
6820  - SGPR overflow is not handled.
6821  - struct with 1 member unpeeling is not checking size of member.
6822  - ``sret`` is after ``this`` pointer.
6823  - Caller is not implementing stack realignment: need an extra pointer.
6824  - Should say AMDGPU passes FP rather than SP.
6825  - Should CFI define CFA as address of locals or arguments. Difference is
6826    apparent when have implemented dynamic alignment.
6827  - If ``SCRATCH`` instruction could allow negative offsets, then can make FP be
6828    highest address of stack frame and use negative offset for locals. Would
6829    allow SP to be the same as FP and could support signal-handler-like as now
6830    have a real SP for the top of the stack.
6831  - How is ``sret`` passed on the stack? In argument stack area? Can it overlay
6832    arguments?
6833
6834AMDPAL
6835------
6836
6837This section provides code conventions used when the target triple OS is
6838``amdpal`` (see :ref:`amdgpu-target-triples`) for passing runtime parameters
6839from the application/runtime to each invocation of a hardware shader. These
6840parameters include both generic, application-controlled parameters called
6841*user data* as well as system-generated parameters that are a product of the
6842draw or dispatch execution.
6843
6844User Data
6845~~~~~~~~~
6846
6847Each hardware stage has a set of 32-bit *user data registers* which can be
6848written from a command buffer and then loaded into SGPRs when waves are launched
6849via a subsequent dispatch or draw operation. This is the way most arguments are
6850passed from the application/runtime to a hardware shader.
6851
6852Compute User Data
6853~~~~~~~~~~~~~~~~~
6854
6855Compute shader user data mappings are simpler than graphics shaders and have a
6856fixed mapping.
6857
6858Note that there are always 10 available *user data entries* in registers -
6859entries beyond that limit must be fetched from memory (via the spill table
6860pointer) by the shader.
6861
6862  .. table:: PAL Compute Shader User Data Registers
6863     :name: pal-compute-user-data-registers
6864
6865     ============= ================================
6866     User Register Description
6867     ============= ================================
6868     0             Global Internal Table (32-bit pointer)
6869     1             Per-Shader Internal Table (32-bit pointer)
6870     2 - 11        Application-Controlled User Data (10 32-bit values)
6871     12            Spill Table (32-bit pointer)
6872     13 - 14       Thread Group Count (64-bit pointer)
6873     15            GDS Range
6874     ============= ================================
6875
6876Graphics User Data
6877~~~~~~~~~~~~~~~~~~
6878
6879Graphics pipelines support a much more flexible user data mapping:
6880
6881  .. table:: PAL Graphics Shader User Data Registers
6882     :name: pal-graphics-user-data-registers
6883
6884     ============= ================================
6885     User Register Description
6886     ============= ================================
6887     0             Global Internal Table (32-bit pointer)
6888     +             Per-Shader Internal Table (32-bit pointer)
6889     + 1-15        Application Controlled User Data
6890                   (1-15 Contiguous 32-bit Values in Registers)
6891     +             Spill Table (32-bit pointer)
6892     +             Draw Index (First Stage Only)
6893     +             Vertex Offset (First Stage Only)
6894     +             Instance Offset (First Stage Only)
6895     ============= ================================
6896
6897  The placement of the global internal table remains fixed in the first *user
6898  data SGPR register*. Otherwise all parameters are optional, and can be mapped
6899  to any desired *user data SGPR register*, with the following restrictions:
6900
6901  * Draw Index, Vertex Offset, and Instance Offset can only be used by the first
6902    active hardware stage in a graphics pipeline (i.e. where the API vertex
6903    shader runs).
6904
6905  * Application-controlled user data must be mapped into a contiguous range of
6906    user data registers.
6907
6908  * The application-controlled user data range supports compaction remapping, so
6909    only *entries* that are actually consumed by the shader must be assigned to
6910    corresponding *registers*. Note that in order to support an efficient runtime
6911    implementation, the remapping must pack *registers* in the same order as
6912    *entries*, with unused *entries* removed.
6913
6914.. _pal_global_internal_table:
6915
6916Global Internal Table
6917~~~~~~~~~~~~~~~~~~~~~
6918
6919The global internal table is a table of *shader resource descriptors* (SRDs)
6920that define how certain engine-wide, runtime-managed resources should be
6921accessed from a shader. The majority of these resources have HW-defined formats,
6922and it is up to the compiler to write/read data as required by the target
6923hardware.
6924
6925The following table illustrates the required format:
6926
6927  .. table:: PAL Global Internal Table
6928     :name: pal-git-table
6929
6930     ============= ================================
6931     Offset        Description
6932     ============= ================================
6933     0-3           Graphics Scratch SRD
6934     4-7           Compute Scratch SRD
6935     8-11          ES/GS Ring Output SRD
6936     12-15         ES/GS Ring Input SRD
6937     16-19         GS/VS Ring Output #0
6938     20-23         GS/VS Ring Output #1
6939     24-27         GS/VS Ring Output #2
6940     28-31         GS/VS Ring Output #3
6941     32-35         GS/VS Ring Input SRD
6942     36-39         Tessellation Factor Buffer SRD
6943     40-43         Off-Chip LDS Buffer SRD
6944     44-47         Off-Chip Param Cache Buffer SRD
6945     48-51         Sample Position Buffer SRD
6946     52            vaRange::ShadowDescriptorTable High Bits
6947     ============= ================================
6948
6949  The pointer to the global internal table passed to the shader as user data
6950  is a 32-bit pointer. The top 32 bits should be assumed to be the same as
6951  the top 32 bits of the pipeline, so the shader may use the program
6952  counter's top 32 bits.
6953
6954Unspecified OS
6955--------------
6956
6957This section provides code conventions used when the target triple OS is
6958empty (see :ref:`amdgpu-target-triples`).
6959
6960Trap Handler ABI
6961~~~~~~~~~~~~~~~~
6962
6963For code objects generated by AMDGPU backend for non-amdhsa OS, the runtime does
6964not install a trap handler. The ``llvm.trap`` and ``llvm.debugtrap``
6965instructions are handled as follows:
6966
6967  .. table:: AMDGPU Trap Handler for Non-AMDHSA OS
6968     :name: amdgpu-trap-handler-for-non-amdhsa-os-table
6969
6970     =============== =============== ===========================================
6971     Usage           Code Sequence   Description
6972     =============== =============== ===========================================
6973     llvm.trap       s_endpgm        Causes wavefront to be terminated.
6974     llvm.debugtrap  *none*          Compiler warning given that there is no
6975                                     trap handler installed.
6976     =============== =============== ===========================================
6977
6978Source Languages
6979================
6980
6981.. _amdgpu-opencl:
6982
6983OpenCL
6984------
6985
6986When the language is OpenCL the following differences occur:
6987
69881. The OpenCL memory model is used (see :ref:`amdgpu-amdhsa-memory-model`).
69892. The AMDGPU backend appends additional arguments to the kernel's explicit
6990   arguments for the AMDHSA OS (see
6991   :ref:`opencl-kernel-implicit-arguments-appended-for-amdhsa-os-table`).
69923. Additional metadata is generated
6993   (see :ref:`amdgpu-amdhsa-code-object-metadata`).
6994
6995  .. table:: OpenCL kernel implicit arguments appended for AMDHSA OS
6996     :name: opencl-kernel-implicit-arguments-appended-for-amdhsa-os-table
6997
6998     ======== ==== ========= ===========================================
6999     Position Byte Byte      Description
7000              Size Alignment
7001     ======== ==== ========= ===========================================
7002     1        8    8         OpenCL Global Offset X
7003     2        8    8         OpenCL Global Offset Y
7004     3        8    8         OpenCL Global Offset Z
7005     4        8    8         OpenCL address of printf buffer
7006     5        8    8         OpenCL address of virtual queue used by
7007                             enqueue_kernel.
7008     6        8    8         OpenCL address of AqlWrap struct used by
7009                             enqueue_kernel.
7010     7        8    8         Pointer argument used for Multi-gird
7011                             synchronization.
7012     ======== ==== ========= ===========================================
7013
7014.. _amdgpu-hcc:
7015
7016HCC
7017---
7018
7019When the language is HCC the following differences occur:
7020
70211. The HSA memory model is used (see :ref:`amdgpu-amdhsa-memory-model`).
7022
7023.. _amdgpu-assembler:
7024
7025Assembler
7026---------
7027
7028AMDGPU backend has LLVM-MC based assembler which is currently in development.
7029It supports AMDGCN GFX6-GFX10.
7030
7031This section describes general syntax for instructions and operands.
7032
7033Instructions
7034~~~~~~~~~~~~
7035
7036An instruction has the following :doc:`syntax<AMDGPUInstructionSyntax>`:
7037
7038  | ``<``\ *opcode*\ ``> <``\ *operand0*\ ``>, <``\ *operand1*\ ``>,...
7039    <``\ *modifier0*\ ``> <``\ *modifier1*\ ``>...``
7040
7041:doc:`Operands<AMDGPUOperandSyntax>` are comma-separated while
7042:doc:`modifiers<AMDGPUModifierSyntax>` are space-separated.
7043
7044The order of operands and modifiers is fixed.
7045Most modifiers are optional and may be omitted.
7046
7047Links to detailed instruction syntax description may be found in the following
7048table. Note that features under development are not included
7049in this description.
7050
7051    =================================== =======================================
7052    Core ISA                            ISA Extensions
7053    =================================== =======================================
7054    :doc:`GFX7<AMDGPU/AMDGPUAsmGFX7>`   \-
7055    :doc:`GFX8<AMDGPU/AMDGPUAsmGFX8>`   \-
7056    :doc:`GFX9<AMDGPU/AMDGPUAsmGFX9>`   :doc:`gfx900<AMDGPU/AMDGPUAsmGFX900>`
7057
7058                                        :doc:`gfx902<AMDGPU/AMDGPUAsmGFX900>`
7059
7060                                        :doc:`gfx904<AMDGPU/AMDGPUAsmGFX904>`
7061
7062                                        :doc:`gfx906<AMDGPU/AMDGPUAsmGFX906>`
7063
7064                                        :doc:`gfx908<AMDGPU/AMDGPUAsmGFX908>`
7065
7066                                        :doc:`gfx909<AMDGPU/AMDGPUAsmGFX900>`
7067
7068    :doc:`GFX10<AMDGPU/AMDGPUAsmGFX10>` :doc:`gfx1011<AMDGPU/AMDGPUAsmGFX1011>`
7069
7070                                        :doc:`gfx1012<AMDGPU/AMDGPUAsmGFX1011>`
7071    =================================== =======================================
7072
7073For more information about instructions, their semantics and supported
7074combinations of operands, refer to one of instruction set architecture manuals
7075[AMD-GCN-GFX6]_, [AMD-GCN-GFX7]_, [AMD-GCN-GFX8]_, [AMD-GCN-GFX9]_ and
7076[AMD-GCN-GFX10]_.
7077
7078Operands
7079~~~~~~~~
7080
7081Detailed description of operands may be found :doc:`here<AMDGPUOperandSyntax>`.
7082
7083Modifiers
7084~~~~~~~~~
7085
7086Detailed description of modifiers may be found
7087:doc:`here<AMDGPUModifierSyntax>`.
7088
7089Instruction Examples
7090~~~~~~~~~~~~~~~~~~~~
7091
7092DS
7093++
7094
7095.. code-block:: nasm
7096
7097  ds_add_u32 v2, v4 offset:16
7098  ds_write_src2_b64 v2 offset0:4 offset1:8
7099  ds_cmpst_f32 v2, v4, v6
7100  ds_min_rtn_f64 v[8:9], v2, v[4:5]
7101
7102For full list of supported instructions, refer to "LDS/GDS instructions" in ISA
7103Manual.
7104
7105FLAT
7106++++
7107
7108.. code-block:: nasm
7109
7110  flat_load_dword v1, v[3:4]
7111  flat_store_dwordx3 v[3:4], v[5:7]
7112  flat_atomic_swap v1, v[3:4], v5 glc
7113  flat_atomic_cmpswap v1, v[3:4], v[5:6] glc slc
7114  flat_atomic_fmax_x2 v[1:2], v[3:4], v[5:6] glc
7115
7116For full list of supported instructions, refer to "FLAT instructions" in ISA
7117Manual.
7118
7119MUBUF
7120+++++
7121
7122.. code-block:: nasm
7123
7124  buffer_load_dword v1, off, s[4:7], s1
7125  buffer_store_dwordx4 v[1:4], v2, ttmp[4:7], s1 offen offset:4 glc tfe
7126  buffer_store_format_xy v[1:2], off, s[4:7], s1
7127  buffer_wbinvl1
7128  buffer_atomic_inc v1, v2, s[8:11], s4 idxen offset:4 slc
7129
7130For full list of supported instructions, refer to "MUBUF Instructions" in ISA
7131Manual.
7132
7133SMRD/SMEM
7134+++++++++
7135
7136.. code-block:: nasm
7137
7138  s_load_dword s1, s[2:3], 0xfc
7139  s_load_dwordx8 s[8:15], s[2:3], s4
7140  s_load_dwordx16 s[88:103], s[2:3], s4
7141  s_dcache_inv_vol
7142  s_memtime s[4:5]
7143
7144For full list of supported instructions, refer to "Scalar Memory Operations" in
7145ISA Manual.
7146
7147SOP1
7148++++
7149
7150.. code-block:: nasm
7151
7152  s_mov_b32 s1, s2
7153  s_mov_b64 s[0:1], 0x80000000
7154  s_cmov_b32 s1, 200
7155  s_wqm_b64 s[2:3], s[4:5]
7156  s_bcnt0_i32_b64 s1, s[2:3]
7157  s_swappc_b64 s[2:3], s[4:5]
7158  s_cbranch_join s[4:5]
7159
7160For full list of supported instructions, refer to "SOP1 Instructions" in ISA
7161Manual.
7162
7163SOP2
7164++++
7165
7166.. code-block:: nasm
7167
7168  s_add_u32 s1, s2, s3
7169  s_and_b64 s[2:3], s[4:5], s[6:7]
7170  s_cselect_b32 s1, s2, s3
7171  s_andn2_b32 s2, s4, s6
7172  s_lshr_b64 s[2:3], s[4:5], s6
7173  s_ashr_i32 s2, s4, s6
7174  s_bfm_b64 s[2:3], s4, s6
7175  s_bfe_i64 s[2:3], s[4:5], s6
7176  s_cbranch_g_fork s[4:5], s[6:7]
7177
7178For full list of supported instructions, refer to "SOP2 Instructions" in ISA
7179Manual.
7180
7181SOPC
7182++++
7183
7184.. code-block:: nasm
7185
7186  s_cmp_eq_i32 s1, s2
7187  s_bitcmp1_b32 s1, s2
7188  s_bitcmp0_b64 s[2:3], s4
7189  s_setvskip s3, s5
7190
7191For full list of supported instructions, refer to "SOPC Instructions" in ISA
7192Manual.
7193
7194SOPP
7195++++
7196
7197.. code-block:: nasm
7198
7199  s_barrier
7200  s_nop 2
7201  s_endpgm
7202  s_waitcnt 0 ; Wait for all counters to be 0
7203  s_waitcnt vmcnt(0) & expcnt(0) & lgkmcnt(0) ; Equivalent to above
7204  s_waitcnt vmcnt(1) ; Wait for vmcnt counter to be 1.
7205  s_sethalt 9
7206  s_sleep 10
7207  s_sendmsg 0x1
7208  s_sendmsg sendmsg(MSG_INTERRUPT)
7209  s_trap 1
7210
7211For full list of supported instructions, refer to "SOPP Instructions" in ISA
7212Manual.
7213
7214Unless otherwise mentioned, little verification is performed on the operands
7215of SOPP Instructions, so it is up to the programmer to be familiar with the
7216range or acceptable values.
7217
7218VALU
7219++++
7220
7221For vector ALU instruction opcodes (VOP1, VOP2, VOP3, VOPC, VOP_DPP, VOP_SDWA),
7222the assembler will automatically use optimal encoding based on its operands. To
7223force specific encoding, one can add a suffix to the opcode of the instruction:
7224
7225* _e32 for 32-bit VOP1/VOP2/VOPC
7226* _e64 for 64-bit VOP3
7227* _dpp for VOP_DPP
7228* _sdwa for VOP_SDWA
7229
7230VOP1/VOP2/VOP3/VOPC examples:
7231
7232.. code-block:: nasm
7233
7234  v_mov_b32 v1, v2
7235  v_mov_b32_e32 v1, v2
7236  v_nop
7237  v_cvt_f64_i32_e32 v[1:2], v2
7238  v_floor_f32_e32 v1, v2
7239  v_bfrev_b32_e32 v1, v2
7240  v_add_f32_e32 v1, v2, v3
7241  v_mul_i32_i24_e64 v1, v2, 3
7242  v_mul_i32_i24_e32 v1, -3, v3
7243  v_mul_i32_i24_e32 v1, -100, v3
7244  v_addc_u32 v1, s[0:1], v2, v3, s[2:3]
7245  v_max_f16_e32 v1, v2, v3
7246
7247VOP_DPP examples:
7248
7249.. code-block:: nasm
7250
7251  v_mov_b32 v0, v0 quad_perm:[0,2,1,1]
7252  v_sin_f32 v0, v0 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
7253  v_mov_b32 v0, v0 wave_shl:1
7254  v_mov_b32 v0, v0 row_mirror
7255  v_mov_b32 v0, v0 row_bcast:31
7256  v_mov_b32 v0, v0 quad_perm:[1,3,0,1] row_mask:0xa bank_mask:0x1 bound_ctrl:0
7257  v_add_f32 v0, v0, |v0| row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
7258  v_max_f16 v1, v2, v3 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
7259
7260VOP_SDWA examples:
7261
7262.. code-block:: nasm
7263
7264  v_mov_b32 v1, v2 dst_sel:BYTE_0 dst_unused:UNUSED_PRESERVE src0_sel:DWORD
7265  v_min_u32 v200, v200, v1 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_1 src1_sel:DWORD
7266  v_sin_f32 v0, v0 dst_unused:UNUSED_PAD src0_sel:WORD_1
7267  v_fract_f32 v0, |v0| dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1
7268  v_cmpx_le_u32 vcc, v1, v2 src0_sel:BYTE_2 src1_sel:WORD_0
7269
7270For full list of supported instructions, refer to "Vector ALU instructions".
7271
7272.. TODO::
7273
7274  Remove once we switch to code object v3 by default.
7275
7276.. _amdgpu-amdhsa-assembler-predefined-symbols-v2:
7277
7278Code Object V2 Predefined Symbols (-mattr=-code-object-v3)
7279~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
7280
7281.. warning:: Code Object V2 is not the default code object version emitted by
7282  this version of LLVM. For a description of the predefined symbols available
7283  with the default configuration (Code Object V3) see
7284  :ref:`amdgpu-amdhsa-assembler-predefined-symbols-v3`.
7285
7286The AMDGPU assembler defines and updates some symbols automatically. These
7287symbols do not affect code generation.
7288
7289.option.machine_version_major
7290+++++++++++++++++++++++++++++
7291
7292Set to the GFX major generation number of the target being assembled for. For
7293example, when assembling for a "GFX9" target this will be set to the integer
7294value "9". The possible GFX major generation numbers are presented in
7295:ref:`amdgpu-processors`.
7296
7297.option.machine_version_minor
7298+++++++++++++++++++++++++++++
7299
7300Set to the GFX minor generation number of the target being assembled for. For
7301example, when assembling for a "GFX810" target this will be set to the integer
7302value "1". The possible GFX minor generation numbers are presented in
7303:ref:`amdgpu-processors`.
7304
7305.option.machine_version_stepping
7306++++++++++++++++++++++++++++++++
7307
7308Set to the GFX stepping generation number of the target being assembled for.
7309For example, when assembling for a "GFX704" target this will be set to the
7310integer value "4". The possible GFX stepping generation numbers are presented
7311in :ref:`amdgpu-processors`.
7312
7313.kernel.vgpr_count
7314++++++++++++++++++
7315
7316Set to zero each time a
7317:ref:`amdgpu-amdhsa-assembler-directive-amdgpu_hsa_kernel` directive is
7318encountered. At each instruction, if the current value of this symbol is less
7319than or equal to the maximum VPGR number explicitly referenced within that
7320instruction then the symbol value is updated to equal that VGPR number plus
7321one.
7322
7323.kernel.sgpr_count
7324++++++++++++++++++
7325
7326Set to zero each time a
7327:ref:`amdgpu-amdhsa-assembler-directive-amdgpu_hsa_kernel` directive is
7328encountered. At each instruction, if the current value of this symbol is less
7329than or equal to the maximum VPGR number explicitly referenced within that
7330instruction then the symbol value is updated to equal that SGPR number plus
7331one.
7332
7333.. _amdgpu-amdhsa-assembler-directives-v2:
7334
7335Code Object V2 Directives (-mattr=-code-object-v3)
7336~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
7337
7338.. warning:: Code Object V2 is not the default code object version emitted by
7339  this version of LLVM. For a description of the directives supported with
7340  the default configuration (Code Object V3) see
7341  :ref:`amdgpu-amdhsa-assembler-directives-v3`.
7342
7343AMDGPU ABI defines auxiliary data in output code object. In assembly source,
7344one can specify them with assembler directives.
7345
7346.hsa_code_object_version major, minor
7347+++++++++++++++++++++++++++++++++++++
7348
7349*major* and *minor* are integers that specify the version of the HSA code
7350object that will be generated by the assembler.
7351
7352.hsa_code_object_isa [major, minor, stepping, vendor, arch]
7353+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
7354
7355
7356*major*, *minor*, and *stepping* are all integers that describe the instruction
7357set architecture (ISA) version of the assembly program.
7358
7359*vendor* and *arch* are quoted strings. *vendor* should always be equal to
7360"AMD" and *arch* should always be equal to "AMDGPU".
7361
7362By default, the assembler will derive the ISA version, *vendor*, and *arch*
7363from the value of the -mcpu option that is passed to the assembler.
7364
7365.. _amdgpu-amdhsa-assembler-directive-amdgpu_hsa_kernel:
7366
7367.amdgpu_hsa_kernel (name)
7368+++++++++++++++++++++++++
7369
7370This directives specifies that the symbol with given name is a kernel entry
7371point (label) and the object should contain corresponding symbol of type
7372STT_AMDGPU_HSA_KERNEL.
7373
7374.amd_kernel_code_t
7375++++++++++++++++++
7376
7377This directive marks the beginning of a list of key / value pairs that are used
7378to specify the amd_kernel_code_t object that will be emitted by the assembler.
7379The list must be terminated by the *.end_amd_kernel_code_t* directive. For any
7380amd_kernel_code_t values that are unspecified a default value will be used. The
7381default value for all keys is 0, with the following exceptions:
7382
7383- *amd_code_version_major* defaults to 1.
7384- *amd_kernel_code_version_minor* defaults to 2.
7385- *amd_machine_kind* defaults to 1.
7386- *amd_machine_version_major*, *machine_version_minor*, and
7387  *amd_machine_version_stepping* are derived from the value of the -mcpu option
7388  that is passed to the assembler.
7389- *kernel_code_entry_byte_offset* defaults to 256.
7390- *wavefront_size* defaults 6 for all targets before GFX10. For GFX10 onwards
7391  defaults to 6 if target feature ``wavefrontsize64`` is enabled, otherwise 5.
7392  Note that wavefront size is specified as a power of two, so a value of **n**
7393  means a size of 2^ **n**.
7394- *call_convention* defaults to -1.
7395- *kernarg_segment_alignment*, *group_segment_alignment*, and
7396  *private_segment_alignment* default to 4. Note that alignments are specified
7397  as a power of 2, so a value of **n** means an alignment of 2^ **n**.
7398- *enable_wgp_mode* defaults to 1 if target feature ``cumode`` is disabled for
7399  GFX10 onwards.
7400- *enable_mem_ordered* defaults to 1 for GFX10 onwards.
7401
7402The *.amd_kernel_code_t* directive must be placed immediately after the
7403function label and before any instructions.
7404
7405For a full list of amd_kernel_code_t keys, refer to AMDGPU ABI document,
7406comments in lib/Target/AMDGPU/AmdKernelCodeT.h and test/CodeGen/AMDGPU/hsa.s.
7407
7408.. _amdgpu-amdhsa-assembler-example-v2:
7409
7410Code Object V2 Example Source Code (-mattr=-code-object-v3)
7411~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
7412
7413.. warning:: Code Object V2 is not the default code object version emitted by
7414  this version of LLVM. For a description of the directives supported with
7415  the default configuration (Code Object V3) see
7416  :ref:`amdgpu-amdhsa-assembler-example-v3`.
7417
7418Here is an example of a minimal assembly source file, defining one HSA kernel:
7419
7420.. code::
7421   :number-lines:
7422
7423   .hsa_code_object_version 1,0
7424   .hsa_code_object_isa
7425
7426   .hsatext
7427   .globl  hello_world
7428   .p2align 8
7429   .amdgpu_hsa_kernel hello_world
7430
7431   hello_world:
7432
7433      .amd_kernel_code_t
7434         enable_sgpr_kernarg_segment_ptr = 1
7435         is_ptr64 = 1
7436         compute_pgm_rsrc1_vgprs = 0
7437         compute_pgm_rsrc1_sgprs = 0
7438         compute_pgm_rsrc2_user_sgpr = 2
7439         compute_pgm_rsrc1_wgp_mode = 0
7440         compute_pgm_rsrc1_mem_ordered = 0
7441         compute_pgm_rsrc1_fwd_progress = 1
7442     .end_amd_kernel_code_t
7443
7444     s_load_dwordx2 s[0:1], s[0:1] 0x0
7445     v_mov_b32 v0, 3.14159
7446     s_waitcnt lgkmcnt(0)
7447     v_mov_b32 v1, s0
7448     v_mov_b32 v2, s1
7449     flat_store_dword v[1:2], v0
7450     s_endpgm
7451   .Lfunc_end0:
7452        .size   hello_world, .Lfunc_end0-hello_world
7453
7454.. _amdgpu-amdhsa-assembler-predefined-symbols-v3:
7455
7456Code Object V3 Predefined Symbols (-mattr=+code-object-v3)
7457~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
7458
7459The AMDGPU assembler defines and updates some symbols automatically. These
7460symbols do not affect code generation.
7461
7462.amdgcn.gfx_generation_number
7463+++++++++++++++++++++++++++++
7464
7465Set to the GFX major generation number of the target being assembled for. For
7466example, when assembling for a "GFX9" target this will be set to the integer
7467value "9". The possible GFX major generation numbers are presented in
7468:ref:`amdgpu-processors`.
7469
7470.amdgcn.gfx_generation_minor
7471++++++++++++++++++++++++++++
7472
7473Set to the GFX minor generation number of the target being assembled for. For
7474example, when assembling for a "GFX810" target this will be set to the integer
7475value "1". The possible GFX minor generation numbers are presented in
7476:ref:`amdgpu-processors`.
7477
7478.amdgcn.gfx_generation_stepping
7479+++++++++++++++++++++++++++++++
7480
7481Set to the GFX stepping generation number of the target being assembled for.
7482For example, when assembling for a "GFX704" target this will be set to the
7483integer value "4". The possible GFX stepping generation numbers are presented
7484in :ref:`amdgpu-processors`.
7485
7486.. _amdgpu-amdhsa-assembler-symbol-next_free_vgpr:
7487
7488.amdgcn.next_free_vgpr
7489++++++++++++++++++++++
7490
7491Set to zero before assembly begins. At each instruction, if the current value
7492of this symbol is less than or equal to the maximum VGPR number explicitly
7493referenced within that instruction then the symbol value is updated to equal
7494that VGPR number plus one.
7495
7496May be used to set the `.amdhsa_next_free_vpgr` directive in
7497:ref:`amdhsa-kernel-directives-table`.
7498
7499May be set at any time, e.g. manually set to zero at the start of each kernel.
7500
7501.. _amdgpu-amdhsa-assembler-symbol-next_free_sgpr:
7502
7503.amdgcn.next_free_sgpr
7504++++++++++++++++++++++
7505
7506Set to zero before assembly begins. At each instruction, if the current value
7507of this symbol is less than or equal the maximum SGPR number explicitly
7508referenced within that instruction then the symbol value is updated to equal
7509that SGPR number plus one.
7510
7511May be used to set the `.amdhsa_next_free_spgr` directive in
7512:ref:`amdhsa-kernel-directives-table`.
7513
7514May be set at any time, e.g. manually set to zero at the start of each kernel.
7515
7516.. _amdgpu-amdhsa-assembler-directives-v3:
7517
7518Code Object V3 Directives (-mattr=+code-object-v3)
7519~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
7520
7521Directives which begin with ``.amdgcn`` are valid for all ``amdgcn``
7522architecture processors, and are not OS-specific. Directives which begin with
7523``.amdhsa`` are specific to ``amdgcn`` architecture processors when the
7524``amdhsa`` OS is specified. See :ref:`amdgpu-target-triples` and
7525:ref:`amdgpu-processors`.
7526
7527.amdgcn_target <target>
7528+++++++++++++++++++++++
7529
7530Optional directive which declares the target supported by the containing
7531assembler source file. Valid values are described in
7532:ref:`amdgpu-amdhsa-code-object-target-identification`. Used by the assembler
7533to validate command-line options such as ``-triple``, ``-mcpu``, and those
7534which specify target features.
7535
7536.amdhsa_kernel <name>
7537+++++++++++++++++++++
7538
7539Creates a correctly aligned AMDHSA kernel descriptor and a symbol,
7540``<name>.kd``, in the current location of the current section. Only valid when
7541the OS is ``amdhsa``. ``<name>`` must be a symbol that labels the first
7542instruction to execute, and does not need to be previously defined.
7543
7544Marks the beginning of a list of directives used to generate the bytes of a
7545kernel descriptor, as described in :ref:`amdgpu-amdhsa-kernel-descriptor`.
7546Directives which may appear in this list are described in
7547:ref:`amdhsa-kernel-directives-table`. Directives may appear in any order, must
7548be valid for the target being assembled for, and cannot be repeated. Directives
7549support the range of values specified by the field they reference in
7550:ref:`amdgpu-amdhsa-kernel-descriptor`. If a directive is not specified, it is
7551assumed to have its default value, unless it is marked as "Required", in which
7552case it is an error to omit the directive. This list of directives is
7553terminated by an ``.end_amdhsa_kernel`` directive.
7554
7555  .. table:: AMDHSA Kernel Assembler Directives
7556     :name: amdhsa-kernel-directives-table
7557
7558     ======================================================== =================== ============ ===================
7559     Directive                                                Default             Supported On Description
7560     ======================================================== =================== ============ ===================
7561     ``.amdhsa_group_segment_fixed_size``                     0                   GFX6-GFX10   Controls GROUP_SEGMENT_FIXED_SIZE in
7562                                                                                               :ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx10-table`.
7563     ``.amdhsa_private_segment_fixed_size``                   0                   GFX6-GFX10   Controls PRIVATE_SEGMENT_FIXED_SIZE in
7564                                                                                               :ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx10-table`.
7565     ``.amdhsa_user_sgpr_private_segment_buffer``             0                   GFX6-GFX10   Controls ENABLE_SGPR_PRIVATE_SEGMENT_BUFFER in
7566                                                                                               :ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx10-table`.
7567     ``.amdhsa_user_sgpr_dispatch_ptr``                       0                   GFX6-GFX10   Controls ENABLE_SGPR_DISPATCH_PTR in
7568                                                                                               :ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx10-table`.
7569     ``.amdhsa_user_sgpr_queue_ptr``                          0                   GFX6-GFX10   Controls ENABLE_SGPR_QUEUE_PTR in
7570                                                                                               :ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx10-table`.
7571     ``.amdhsa_user_sgpr_kernarg_segment_ptr``                0                   GFX6-GFX10   Controls ENABLE_SGPR_KERNARG_SEGMENT_PTR in
7572                                                                                               :ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx10-table`.
7573     ``.amdhsa_user_sgpr_dispatch_id``                        0                   GFX6-GFX10   Controls ENABLE_SGPR_DISPATCH_ID in
7574                                                                                               :ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx10-table`.
7575     ``.amdhsa_user_sgpr_flat_scratch_init``                  0                   GFX6-GFX10   Controls ENABLE_SGPR_FLAT_SCRATCH_INIT in
7576                                                                                               :ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx10-table`.
7577     ``.amdhsa_user_sgpr_private_segment_size``               0                   GFX6-GFX10   Controls ENABLE_SGPR_PRIVATE_SEGMENT_SIZE in
7578                                                                                               :ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx10-table`.
7579     ``.amdhsa_wavefront_size32``                             Target              GFX10        Controls ENABLE_WAVEFRONT_SIZE32 in
7580                                                              Feature                          :ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx10-table`.
7581                                                              Specific
7582                                                              (-wavefrontsize64)
7583     ``.amdhsa_system_sgpr_private_segment_wavefront_offset`` 0                   GFX6-GFX10   Controls ENABLE_SGPR_PRIVATE_SEGMENT_WAVEFRONT_OFFSET in
7584                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
7585     ``.amdhsa_system_sgpr_workgroup_id_x``                   1                   GFX6-GFX10   Controls ENABLE_SGPR_WORKGROUP_ID_X in
7586                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
7587     ``.amdhsa_system_sgpr_workgroup_id_y``                   0                   GFX6-GFX10   Controls ENABLE_SGPR_WORKGROUP_ID_Y in
7588                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
7589     ``.amdhsa_system_sgpr_workgroup_id_z``                   0                   GFX6-GFX10   Controls ENABLE_SGPR_WORKGROUP_ID_Z in
7590                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
7591     ``.amdhsa_system_sgpr_workgroup_info``                   0                   GFX6-GFX10   Controls ENABLE_SGPR_WORKGROUP_INFO in
7592                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
7593     ``.amdhsa_system_vgpr_workitem_id``                      0                   GFX6-GFX10   Controls ENABLE_VGPR_WORKITEM_ID in
7594                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
7595                                                                                               Possible values are defined in
7596                                                                                               :ref:`amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table`.
7597     ``.amdhsa_next_free_vgpr``                               Required            GFX6-GFX10   Maximum VGPR number explicitly referenced, plus one.
7598                                                                                               Used to calculate GRANULATED_WORKITEM_VGPR_COUNT in
7599                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
7600     ``.amdhsa_next_free_sgpr``                               Required            GFX6-GFX10   Maximum SGPR number explicitly referenced, plus one.
7601                                                                                               Used to calculate GRANULATED_WAVEFRONT_SGPR_COUNT in
7602                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
7603     ``.amdhsa_reserve_vcc``                                  1                   GFX6-GFX10   Whether the kernel may use the special VCC SGPR.
7604                                                                                               Used to calculate GRANULATED_WAVEFRONT_SGPR_COUNT in
7605                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
7606     ``.amdhsa_reserve_flat_scratch``                         1                   GFX7-GFX10   Whether the kernel may use flat instructions to access
7607                                                                                               scratch memory. Used to calculate
7608                                                                                               GRANULATED_WAVEFRONT_SGPR_COUNT in
7609                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
7610     ``.amdhsa_reserve_xnack_mask``                           Target              GFX8-GFX10   Whether the kernel may trigger XNACK replay.
7611                                                              Feature                          Used to calculate GRANULATED_WAVEFRONT_SGPR_COUNT in
7612                                                              Specific                         :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
7613                                                              (+xnack)
7614     ``.amdhsa_float_round_mode_32``                          0                   GFX6-GFX10   Controls FLOAT_ROUND_MODE_32 in
7615                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
7616                                                                                               Possible values are defined in
7617                                                                                               :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
7618     ``.amdhsa_float_round_mode_16_64``                       0                   GFX6-GFX10   Controls FLOAT_ROUND_MODE_16_64 in
7619                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
7620                                                                                               Possible values are defined in
7621                                                                                               :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
7622     ``.amdhsa_float_denorm_mode_32``                         0                   GFX6-GFX10   Controls FLOAT_DENORM_MODE_32 in
7623                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
7624                                                                                               Possible values are defined in
7625                                                                                               :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
7626     ``.amdhsa_float_denorm_mode_16_64``                      3                   GFX6-GFX10   Controls FLOAT_DENORM_MODE_16_64 in
7627                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
7628                                                                                               Possible values are defined in
7629                                                                                               :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
7630     ``.amdhsa_dx10_clamp``                                   1                   GFX6-GFX10   Controls ENABLE_DX10_CLAMP in
7631                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
7632     ``.amdhsa_ieee_mode``                                    1                   GFX6-GFX10   Controls ENABLE_IEEE_MODE in
7633                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
7634     ``.amdhsa_fp16_overflow``                                0                   GFX9-GFX10   Controls FP16_OVFL in
7635                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
7636     ``.amdhsa_workgroup_processor_mode``                     Target              GFX10        Controls ENABLE_WGP_MODE in
7637                                                              Feature                          :ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx10-table`.
7638                                                              Specific
7639                                                              (-cumode)
7640     ``.amdhsa_memory_ordered``                               1                   GFX10        Controls MEM_ORDERED in
7641                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
7642     ``.amdhsa_forward_progress``                             0                   GFX10        Controls FWD_PROGRESS in
7643                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
7644     ``.amdhsa_exception_fp_ieee_invalid_op``                 0                   GFX6-GFX10   Controls ENABLE_EXCEPTION_IEEE_754_FP_INVALID_OPERATION in
7645                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
7646     ``.amdhsa_exception_fp_denorm_src``                      0                   GFX6-GFX10   Controls ENABLE_EXCEPTION_FP_DENORMAL_SOURCE in
7647                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
7648     ``.amdhsa_exception_fp_ieee_div_zero``                   0                   GFX6-GFX10   Controls ENABLE_EXCEPTION_IEEE_754_FP_DIVISION_BY_ZERO in
7649                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
7650     ``.amdhsa_exception_fp_ieee_overflow``                   0                   GFX6-GFX10   Controls ENABLE_EXCEPTION_IEEE_754_FP_OVERFLOW in
7651                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
7652     ``.amdhsa_exception_fp_ieee_underflow``                  0                   GFX6-GFX10   Controls ENABLE_EXCEPTION_IEEE_754_FP_UNDERFLOW in
7653                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
7654     ``.amdhsa_exception_fp_ieee_inexact``                    0                   GFX6-GFX10   Controls ENABLE_EXCEPTION_IEEE_754_FP_INEXACT in
7655                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
7656     ``.amdhsa_exception_int_div_zero``                       0                   GFX6-GFX10   Controls ENABLE_EXCEPTION_INT_DIVIDE_BY_ZERO in
7657                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
7658     ======================================================== =================== ============ ===================
7659
7660.amdgpu_metadata
7661++++++++++++++++
7662
7663Optional directive which declares the contents of the ``NT_AMDGPU_METADATA``
7664note record (see :ref:`amdgpu-elf-note-records-table-v3`).
7665
7666The contents must be in the [YAML]_ markup format, with the same structure and
7667semantics described in :ref:`amdgpu-amdhsa-code-object-metadata-v3`.
7668
7669This directive is terminated by an ``.end_amdgpu_metadata`` directive.
7670
7671.. _amdgpu-amdhsa-assembler-example-v3:
7672
7673Code Object V3 Example Source Code (-mattr=+code-object-v3)
7674~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
7675
7676Here is an example of a minimal assembly source file, defining one HSA kernel:
7677
7678.. code::
7679   :number-lines:
7680
7681   .amdgcn_target "amdgcn-amd-amdhsa--gfx900+xnack" // optional
7682
7683   .text
7684   .globl hello_world
7685   .p2align 8
7686   .type hello_world,@function
7687   hello_world:
7688     s_load_dwordx2 s[0:1], s[0:1] 0x0
7689     v_mov_b32 v0, 3.14159
7690     s_waitcnt lgkmcnt(0)
7691     v_mov_b32 v1, s0
7692     v_mov_b32 v2, s1
7693     flat_store_dword v[1:2], v0
7694     s_endpgm
7695   .Lfunc_end0:
7696     .size   hello_world, .Lfunc_end0-hello_world
7697
7698   .rodata
7699   .p2align 6
7700   .amdhsa_kernel hello_world
7701     .amdhsa_user_sgpr_kernarg_segment_ptr 1
7702     .amdhsa_next_free_vgpr .amdgcn.next_free_vgpr
7703     .amdhsa_next_free_sgpr .amdgcn.next_free_sgpr
7704   .end_amdhsa_kernel
7705
7706   .amdgpu_metadata
7707   ---
7708   amdhsa.version:
7709     - 1
7710     - 0
7711   amdhsa.kernels:
7712     - .name: hello_world
7713       .symbol: hello_world.kd
7714       .kernarg_segment_size: 48
7715       .group_segment_fixed_size: 0
7716       .private_segment_fixed_size: 0
7717       .kernarg_segment_align: 4
7718       .wavefront_size: 64
7719       .sgpr_count: 2
7720       .vgpr_count: 3
7721       .max_flat_workgroup_size: 256
7722   ...
7723   .end_amdgpu_metadata
7724
7725If an assembly source file contains multiple kernels and/or functions, the
7726:ref:`amdgpu-amdhsa-assembler-symbol-next_free_vgpr` and
7727:ref:`amdgpu-amdhsa-assembler-symbol-next_free_sgpr` symbols may be reset using
7728the ``.set <symbol>, <expression>`` directive. For example, in the case of two
7729kernels, where ``function1`` is only called from ``kernel1`` it is sufficient
7730to group the function with the kernel that calls it and reset the symbols
7731between the two connected components:
7732
7733.. code::
7734   :number-lines:
7735
7736   .amdgcn_target "amdgcn-amd-amdhsa--gfx900+xnack" // optional
7737
7738   // gpr tracking symbols are implicitly set to zero
7739
7740   .text
7741   .globl kern0
7742   .p2align 8
7743   .type kern0,@function
7744   kern0:
7745     // ...
7746     s_endpgm
7747   .Lkern0_end:
7748     .size   kern0, .Lkern0_end-kern0
7749
7750   .rodata
7751   .p2align 6
7752   .amdhsa_kernel kern0
7753     // ...
7754     .amdhsa_next_free_vgpr .amdgcn.next_free_vgpr
7755     .amdhsa_next_free_sgpr .amdgcn.next_free_sgpr
7756   .end_amdhsa_kernel
7757
7758   // reset symbols to begin tracking usage in func1 and kern1
7759   .set .amdgcn.next_free_vgpr, 0
7760   .set .amdgcn.next_free_sgpr, 0
7761
7762   .text
7763   .hidden func1
7764   .global func1
7765   .p2align 2
7766   .type func1,@function
7767   func1:
7768     // ...
7769     s_setpc_b64 s[30:31]
7770   .Lfunc1_end:
7771   .size func1, .Lfunc1_end-func1
7772
7773   .globl kern1
7774   .p2align 8
7775   .type kern1,@function
7776   kern1:
7777     // ...
7778     s_getpc_b64 s[4:5]
7779     s_add_u32 s4, s4, func1@rel32@lo+4
7780     s_addc_u32 s5, s5, func1@rel32@lo+4
7781     s_swappc_b64 s[30:31], s[4:5]
7782     // ...
7783     s_endpgm
7784   .Lkern1_end:
7785     .size   kern1, .Lkern1_end-kern1
7786
7787   .rodata
7788   .p2align 6
7789   .amdhsa_kernel kern1
7790     // ...
7791     .amdhsa_next_free_vgpr .amdgcn.next_free_vgpr
7792     .amdhsa_next_free_sgpr .amdgcn.next_free_sgpr
7793   .end_amdhsa_kernel
7794
7795These symbols cannot identify connected components in order to automatically
7796track the usage for each kernel. However, in some cases careful organization of
7797the kernels and functions in the source file means there is minimal additional
7798effort required to accurately calculate GPR usage.
7799
7800Additional Documentation
7801========================
7802
7803.. [AMD-GCN-GFX6] `AMD Southern Islands Series ISA <http://developer.amd.com/wordpress/media/2012/12/AMD_Southern_Islands_Instruction_Set_Architecture.pdf>`__
7804.. [AMD-GCN-GFX7] `AMD Sea Islands Series ISA <http://developer.amd.com/wordpress/media/2013/07/AMD_Sea_Islands_Instruction_Set_Architecture.pdf>`_
7805.. [AMD-GCN-GFX8] `AMD GCN3 Instruction Set Architecture <http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2013/12/AMD_GCN3_Instruction_Set_Architecture_rev1.1.pdf>`__
7806.. [AMD-GCN-GFX9] `AMD "Vega" Instruction Set Architecture <http://developer.amd.com/wordpress/media/2013/12/Vega_Shader_ISA_28July2017.pdf>`__
7807.. [AMD-GCN-GFX10] `AMD "RDNA 1.0" Instruction Set Architecture <https://gpuopen.com/wp-content/uploads/2019/08/RDNA_Shader_ISA_5August2019.pdf>`__
7808.. [AMD-RADEON-HD-2000-3000] `AMD R6xx shader ISA <http://developer.amd.com/wordpress/media/2012/10/R600_Instruction_Set_Architecture.pdf>`__
7809.. [AMD-RADEON-HD-4000] `AMD R7xx shader ISA <http://developer.amd.com/wordpress/media/2012/10/R700-Family_Instruction_Set_Architecture.pdf>`__
7810.. [AMD-RADEON-HD-5000] `AMD Evergreen shader ISA <http://developer.amd.com/wordpress/media/2012/10/AMD_Evergreen-Family_Instruction_Set_Architecture.pdf>`__
7811.. [AMD-RADEON-HD-6000] `AMD Cayman/Trinity shader ISA <http://developer.amd.com/wordpress/media/2012/10/AMD_HD_6900_Series_Instruction_Set_Architecture.pdf>`__
7812.. [AMD-ROCm] `AMD ROCm Platform <https://rocm-documentation.readthedocs.io>`__
7813.. [AMD-ROCm-github] `ROCm github <http://github.com/RadeonOpenCompute>`__
7814.. [CLANG-ATTR] `Attributes in Clang <https://clang.llvm.org/docs/AttributeReference.html>`__
7815.. [DWARF] `DWARF Debugging Information Format <http://dwarfstd.org/>`__
7816.. [ELF] `Executable and Linkable Format (ELF) <http://www.sco.com/developers/gabi/>`__
7817.. [HRF] `Heterogeneous-race-free Memory Models <http://benedictgaster.org/wp-content/uploads/2014/01/asplos269-FINAL.pdf>`__
7818.. [HSA] `Heterogeneous System Architecture (HSA) Foundation <http://www.hsafoundation.com/>`__
7819.. [MsgPack] `Message Pack <http://www.msgpack.org/>`__
7820.. [OpenCL] `The OpenCL Specification Version 2.0 <http://www.khronos.org/registry/cl/specs/opencl-2.0.pdf>`__
7821.. [SEMVER] `Semantic Versioning <https://semver.org/>`__
7822.. [YAML] `YAML Ain't Markup Language (YAML™) Version 1.2 <http://www.yaml.org/spec/1.2/spec.html>`__
7823