1=============================
2User Guide for AMDGPU Backend
3=============================
4
5.. contents::
6   :local:
7
8Introduction
9============
10
11The AMDGPU backend provides ISA code generation for AMD GPUs, starting with the
12R600 family up until the current GCN families. It lives in the
13``lib/Target/AMDGPU`` directory.
14
15LLVM
16====
17
18.. _amdgpu-target-triples:
19
20Target Triples
21--------------
22
23Use the ``clang -target <Architecture>-<Vendor>-<OS>-<Environment>`` option to
24specify the target triple:
25
26  .. table:: AMDGPU Architectures
27     :name: amdgpu-architecture-table
28
29     ============ ==============================================================
30     Architecture Description
31     ============ ==============================================================
32     ``r600``     AMD GPUs HD2XXX-HD6XXX for graphics and compute shaders.
33     ``amdgcn``   AMD GPUs GCN GFX6 onwards for graphics and compute shaders.
34     ============ ==============================================================
35
36  .. table:: AMDGPU Vendors
37     :name: amdgpu-vendor-table
38
39     ============ ==============================================================
40     Vendor       Description
41     ============ ==============================================================
42     ``amd``      Can be used for all AMD GPU usage.
43     ``mesa3d``   Can be used if the OS is ``mesa3d``.
44     ============ ==============================================================
45
46  .. table:: AMDGPU Operating Systems
47     :name: amdgpu-os-table
48
49     ============== ============================================================
50     OS             Description
51     ============== ============================================================
52     *<empty>*      Defaults to the *unknown* OS.
53     ``amdhsa``     Compute kernels executed on HSA [HSA]_ compatible runtimes
54                    such as AMD's ROCm [AMD-ROCm]_.
55     ``amdpal``     Graphic shaders and compute kernels executed on AMD PAL
56                    runtime.
57     ``mesa3d``     Graphic shaders and compute kernels executed on Mesa 3D
58                    runtime.
59     ============== ============================================================
60
61  .. table:: AMDGPU Environments
62     :name: amdgpu-environment-table
63
64     ============ ==============================================================
65     Environment  Description
66     ============ ==============================================================
67     *<empty>*    Default.
68     ============ ==============================================================
69
70.. _amdgpu-processors:
71
72Processors
73----------
74
75Use the ``clang -mcpu <Processor>`` option to specify the AMD GPU processor. The
76names from both the *Processor* and *Alternative Processor* can be used.
77
78  .. table:: AMDGPU Processors
79     :name: amdgpu-processor-table
80
81     =========== =============== ============ ===== ================= ======= ======================
82     Processor   Alternative     Target       dGPU/ Target            ROCm    Example
83                 Processor       Triple       APU   Features          Support Products
84                                 Architecture       Supported
85                                                    [Default]
86     =========== =============== ============ ===== ================= ======= ======================
87     **Radeon HD 2000/3000 Series (R600)** [AMD-RADEON-HD-2000-3000]_
88     -----------------------------------------------------------------------------------------------
89     ``r600``                    ``r600``     dGPU
90     ``r630``                    ``r600``     dGPU
91     ``rs880``                   ``r600``     dGPU
92     ``rv670``                   ``r600``     dGPU
93     **Radeon HD 4000 Series (R700)** [AMD-RADEON-HD-4000]_
94     -----------------------------------------------------------------------------------------------
95     ``rv710``                   ``r600``     dGPU
96     ``rv730``                   ``r600``     dGPU
97     ``rv770``                   ``r600``     dGPU
98     **Radeon HD 5000 Series (Evergreen)** [AMD-RADEON-HD-5000]_
99     -----------------------------------------------------------------------------------------------
100     ``cedar``                   ``r600``     dGPU
101     ``cypress``                 ``r600``     dGPU
102     ``juniper``                 ``r600``     dGPU
103     ``redwood``                 ``r600``     dGPU
104     ``sumo``                    ``r600``     dGPU
105     **Radeon HD 6000 Series (Northern Islands)** [AMD-RADEON-HD-6000]_
106     -----------------------------------------------------------------------------------------------
107     ``barts``                   ``r600``     dGPU
108     ``caicos``                  ``r600``     dGPU
109     ``cayman``                  ``r600``     dGPU
110     ``turks``                   ``r600``     dGPU
111     **GCN GFX6 (Southern Islands (SI))** [AMD-GCN-GFX6]_
112     -----------------------------------------------------------------------------------------------
113     ``gfx600``  - ``tahiti``    ``amdgcn``   dGPU
114     ``gfx601``  - ``hainan``    ``amdgcn``   dGPU
115                 - ``oland``
116                 - ``pitcairn``
117                 - ``verde``
118     **GCN GFX7 (Sea Islands (CI))** [AMD-GCN-GFX7]_
119     -----------------------------------------------------------------------------------------------
120     ``gfx700``  - ``kaveri``    ``amdgcn``   APU                             - A6-7000
121                                                                              - A6 Pro-7050B
122                                                                              - A8-7100
123                                                                              - A8 Pro-7150B
124                                                                              - A10-7300
125                                                                              - A10 Pro-7350B
126                                                                              - FX-7500
127                                                                              - A8-7200P
128                                                                              - A10-7400P
129                                                                              - FX-7600P
130     ``gfx701``  - ``hawaii``    ``amdgcn``   dGPU                    ROCm    - FirePro W8100
131                                                                              - FirePro W9100
132                                                                              - FirePro S9150
133                                                                              - FirePro S9170
134     ``gfx702``                  ``amdgcn``   dGPU                    ROCm    - Radeon R9 290
135                                                                              - Radeon R9 290x
136                                                                              - Radeon R390
137                                                                              - Radeon R390x
138     ``gfx703``  - ``kabini``    ``amdgcn``   APU                             - E1-2100
139                 - ``mullins``                                                - E1-2200
140                                                                              - E1-2500
141                                                                              - E2-3000
142                                                                              - E2-3800
143                                                                              - A4-5000
144                                                                              - A4-5100
145                                                                              - A6-5200
146                                                                              - A4 Pro-3340B
147     ``gfx704``  - ``bonaire``   ``amdgcn``   dGPU                            - Radeon HD 7790
148                                                                              - Radeon HD 8770
149                                                                              - R7 260
150                                                                              - R7 260X
151     **GCN GFX8 (Volcanic Islands (VI))** [AMD-GCN-GFX8]_
152     -----------------------------------------------------------------------------------------------
153     ``gfx801``  - ``carrizo``   ``amdgcn``   APU   - xnack                   - A6-8500P
154                                                      [on]                    - Pro A6-8500B
155                                                                              - A8-8600P
156                                                                              - Pro A8-8600B
157                                                                              - FX-8800P
158                                                                              - Pro A12-8800B
159     \                           ``amdgcn``   APU   - xnack           ROCm    - A10-8700P
160                                                      [on]                    - Pro A10-8700B
161                                                                              - A10-8780P
162     \                           ``amdgcn``   APU   - xnack                   - A10-9600P
163                                                      [on]                    - A10-9630P
164                                                                              - A12-9700P
165                                                                              - A12-9730P
166                                                                              - FX-9800P
167                                                                              - FX-9830P
168     \                           ``amdgcn``   APU   - xnack                   - E2-9010
169                                                      [on]                    - A6-9210
170                                                                              - A9-9410
171     ``gfx802``  - ``iceland``   ``amdgcn``   dGPU  - xnack           ROCm    - FirePro S7150
172                 - ``tonga``                          [off]                   - FirePro S7100
173                                                                              - FirePro W7100
174                                                                              - Radeon R285
175                                                                              - Radeon R9 380
176                                                                              - Radeon R9 385
177                                                                              - Mobile FirePro
178                                                                                M7170
179     ``gfx803``  - ``fiji``      ``amdgcn``   dGPU  - xnack           ROCm    - Radeon R9 Nano
180                                                      [off]                   - Radeon R9 Fury
181                                                                              - Radeon R9 FuryX
182                                                                              - Radeon Pro Duo
183                                                                              - FirePro S9300x2
184                                                                              - Radeon Instinct MI8
185     \           - ``polaris10`` ``amdgcn``   dGPU  - xnack           ROCm    - Radeon RX 470
186                                                      [off]                   - Radeon RX 480
187                                                                              - Radeon Instinct MI6
188     \           - ``polaris11`` ``amdgcn``   dGPU  - xnack           ROCm    - Radeon RX 460
189                                                      [off]
190     ``gfx810``  - ``stoney``    ``amdgcn``   APU   - xnack
191                                                      [on]
192     **GCN GFX9** [AMD-GCN-GFX9]_
193     -----------------------------------------------------------------------------------------------
194     ``gfx900``                  ``amdgcn``   dGPU  - xnack           ROCm    - Radeon Vega
195                                                      [off]                     Frontier Edition
196                                                                              - Radeon RX Vega 56
197                                                                              - Radeon RX Vega 64
198                                                                              - Radeon RX Vega 64
199                                                                                Liquid
200                                                                              - Radeon Instinct MI25
201     ``gfx902``                  ``amdgcn``   APU   - xnack                   - Ryzen 3 2200G
202                                                      [on]                    - Ryzen 5 2400G
203     ``gfx904``                  ``amdgcn``   dGPU  - xnack                   *TBA*
204                                                      [off]
205                                                                              .. TODO
206                                                                                 Add product
207                                                                                 names.
208     ``gfx906``                  ``amdgcn``   dGPU  - xnack                   - Radeon Instinct MI50
209                                                      [off]                   - Radeon Instinct MI60
210     ``gfx908``                  ``amdgcn``   dGPU  - xnack                   *TBA*
211                                                      [off]
212                                                      sram-ecc
213                                                      [on]
214     ``gfx909``                  ``amdgcn``   APU   - xnack                   *TBA* (Raven Ridge 2)
215                                                      [on]
216                                                                              .. TODO
217                                                                                 Add product
218                                                                                 names.
219     **GCN GFX10** [AMD-GCN-GFX10]_
220     -----------------------------------------------------------------------------------------------
221     ``gfx1010``                 ``amdgcn``   dGPU  - xnack                   *TBA*
222                                                      [off]
223                                                    - wavefrontsize64
224                                                      [off]
225                                                    - cumode
226                                                      [off]
227                                                                              .. TODO
228                                                                                 Add product
229                                                                                 names.
230     ``gfx1011``                 ``amdgcn``   dGPU  - xnack                   *TBA*
231                                                      [off]
232                                                    - wavefrontsize64
233                                                      [off]
234                                                    - cumode
235                                                      [off]
236                                                                              .. TODO
237                                                                                 Add product
238                                                                                 names.
239     ``gfx1012``                 ``amdgcn``   dGPU  - xnack                   *TBA*
240                                                      [off]
241                                                    - wavefrontsize64
242                                                      [off]
243                                                    - cumode
244                                                      [off]
245                                                                              .. TODO
246                                                                                 Add product
247                                                                                 names.
248     =========== =============== ============ ===== ================= ======= ======================
249
250.. _amdgpu-target-features:
251
252Target Features
253---------------
254
255Target features control how code is generated to support certain
256processor specific features. Not all target features are supported by
257all processors. The runtime must ensure that the features supported by
258the device used to execute the code match the features enabled when
259generating the code. A mismatch of features may result in incorrect
260execution, or a reduction in performance.
261
262The target features supported by each processor, and the default value
263used if not specified explicitly, is listed in
264:ref:`amdgpu-processor-table`.
265
266Use the ``clang -m[no-]<TargetFeature>`` option to specify the AMD GPU
267target features.
268
269For example:
270
271``-mxnack``
272  Enable the ``xnack`` feature.
273``-mno-xnack``
274  Disable the ``xnack`` feature.
275
276  .. table:: AMDGPU Target Features
277     :name: amdgpu-target-feature-table
278
279     ====================== ==================================================
280     Target Feature         Description
281     ====================== ==================================================
282     -m[no-]xnack           Enable/disable generating code that has
283                            memory clauses that are compatible with
284                            having XNACK replay enabled.
285
286                            This is used for demand paging and page
287                            migration. If XNACK replay is enabled in
288                            the device, then if a page fault occurs
289                            the code may execute incorrectly if the
290                            ``xnack`` feature is not enabled. Executing
291                            code that has the feature enabled on a
292                            device that does not have XNACK replay
293                            enabled will execute correctly, but may
294                            be less performant than code with the
295                            feature disabled.
296
297     -m[no-]sram-ecc        Enable/disable generating code that assumes SRAM
298                            ECC is enabled/disabled.
299
300     -m[no-]wavefrontsize64 Control the default wavefront size used when
301                            generating code for kernels. When disabled
302                            native wavefront size 32 is used, when enabled
303                            wavefront size 64 is used.
304
305     -m[no-]cumode          Control the default wavefront execution mode used
306                            when generating code for kernels. When disabled
307                            native WGP wavefront execution mode is used,
308                            when enabled CU wavefront execution mode is used
309                            (see :ref:`amdgpu-amdhsa-memory-model`).
310     ====================== ==================================================
311
312.. _amdgpu-address-spaces:
313
314Address Spaces
315--------------
316
317The AMDGPU backend uses the following address space mappings.
318
319The memory space names used in the table, aside from the region memory space, is
320from the OpenCL standard.
321
322LLVM Address Space number is used throughout LLVM (for example, in LLVM IR).
323
324  .. table:: Address Space Mapping
325     :name: amdgpu-address-space-mapping-table
326
327     ================== =================================
328     LLVM Address Space Memory Space
329     ================== =================================
330     0                  Generic (Flat)
331     1                  Global
332     2                  Region (GDS)
333     3                  Local (group/LDS)
334     4                  Constant
335     5                  Private (Scratch)
336     6                  Constant 32-bit
337     7                  Buffer Fat Pointer (experimental)
338     ================== =================================
339
340The buffer fat pointer is an experimental address space that is currently
341unsupported in the backend. It exposes a non-integral pointer that is in future
342intended to support the modelling of 128-bit buffer descriptors + a 32-bit
343offset into the buffer descriptor (in total encapsulating a 160-bit 'pointer'),
344allowing us to use normal LLVM load/store/atomic operations to model the buffer
345descriptors used heavily in graphics workloads targeting the backend.
346
347.. _amdgpu-memory-scopes:
348
349Memory Scopes
350-------------
351
352This section provides LLVM memory synchronization scopes supported by the AMDGPU
353backend memory model when the target triple OS is ``amdhsa`` (see
354:ref:`amdgpu-amdhsa-memory-model` and :ref:`amdgpu-target-triples`).
355
356The memory model supported is based on the HSA memory model [HSA]_ which is
357based in turn on HRF-indirect with scope inclusion [HRF]_. The happens-before
358relation is transitive over the synchonizes-with relation independent of scope,
359and synchonizes-with allows the memory scope instances to be inclusive (see
360table :ref:`amdgpu-amdhsa-llvm-sync-scopes-table`).
361
362This is different to the OpenCL [OpenCL]_ memory model which does not have scope
363inclusion and requires the memory scopes to exactly match. However, this
364is conservatively correct for OpenCL.
365
366  .. table:: AMDHSA LLVM Sync Scopes
367     :name: amdgpu-amdhsa-llvm-sync-scopes-table
368
369     ======================= ===================================================
370     LLVM Sync Scope         Description
371     ======================= ===================================================
372     *none*                  The default: ``system``.
373
374                             Synchronizes with, and participates in modification
375                             and seq_cst total orderings with, other operations
376                             (except image operations) for all address spaces
377                             (except private, or generic that accesses private)
378                             provided the other operation's sync scope is:
379
380                             - ``system``.
381                             - ``agent`` and executed by a thread on the same
382                               agent.
383                             - ``workgroup`` and executed by a thread in the
384                               same workgroup.
385                             - ``wavefront`` and executed by a thread in the
386                               same wavefront.
387
388     ``agent``               Synchronizes with, and participates in modification
389                             and seq_cst total orderings with, other operations
390                             (except image operations) for all address spaces
391                             (except private, or generic that accesses private)
392                             provided the other operation's sync scope is:
393
394                             - ``system`` or ``agent`` and executed by a thread
395                               on the same agent.
396                             - ``workgroup`` and executed by a thread in the
397                               same workgroup.
398                             - ``wavefront`` and executed by a thread in the
399                               same wavefront.
400
401     ``workgroup``           Synchronizes with, and participates in modification
402                             and seq_cst total orderings with, other operations
403                             (except image operations) for all address spaces
404                             (except private, or generic that accesses private)
405                             provided the other operation's sync scope is:
406
407                             - ``system``, ``agent`` or ``workgroup`` and
408                               executed by a thread in the same workgroup.
409                             - ``wavefront`` and executed by a thread in the
410                               same wavefront.
411
412     ``wavefront``           Synchronizes with, and participates in modification
413                             and seq_cst total orderings with, other operations
414                             (except image operations) for all address spaces
415                             (except private, or generic that accesses private)
416                             provided the other operation's sync scope is:
417
418                             - ``system``, ``agent``, ``workgroup`` or
419                               ``wavefront`` and executed by a thread in the
420                               same wavefront.
421
422     ``singlethread``        Only synchronizes with, and participates in
423                             modification and seq_cst total orderings with,
424                             other operations (except image operations) running
425                             in the same thread for all address spaces (for
426                             example, in signal handlers).
427
428     ``one-as``              Same as ``system`` but only synchronizes with other
429                             operations within the same address space.
430
431     ``agent-one-as``        Same as ``agent`` but only synchronizes with other
432                             operations within the same address space.
433
434     ``workgroup-one-as``    Same as ``workgroup`` but only synchronizes with
435                             other operations within the same address space.
436
437     ``wavefront-one-as``    Same as ``wavefront`` but only synchronizes with
438                             other operations within the same address space.
439
440     ``singlethread-one-as`` Same as ``singlethread`` but only synchronizes with
441                             other operations within the same address space.
442     ======================= ===================================================
443
444AMDGPU Intrinsics
445-----------------
446
447The AMDGPU backend implements the following LLVM IR intrinsics.
448
449*This section is WIP.*
450
451.. TODO
452   List AMDGPU intrinsics
453
454AMDGPU Attributes
455-----------------
456
457The AMDGPU backend supports the following LLVM IR attributes.
458
459  .. table:: AMDGPU LLVM IR Attributes
460     :name: amdgpu-llvm-ir-attributes-table
461
462     ======================================= ==========================================================
463     LLVM Attribute                          Description
464     ======================================= ==========================================================
465     "amdgpu-flat-work-group-size"="min,max" Specify the minimum and maximum flat work group sizes that
466                                             will be specified when the kernel is dispatched. Generated
467                                             by the ``amdgpu_flat_work_group_size`` CLANG attribute [CLANG-ATTR]_.
468     "amdgpu-implicitarg-num-bytes"="n"      Number of kernel argument bytes to add to the kernel
469                                             argument block size for the implicit arguments. This
470                                             varies by OS and language (for OpenCL see
471                                             :ref:`opencl-kernel-implicit-arguments-appended-for-amdhsa-os-table`).
472     "amdgpu-num-sgpr"="n"                   Specifies the number of SGPRs to use. Generated by
473                                             the ``amdgpu_num_sgpr`` CLANG attribute [CLANG-ATTR]_.
474     "amdgpu-num-vgpr"="n"                   Specifies the number of VGPRs to use. Generated by the
475                                             ``amdgpu_num_vgpr`` CLANG attribute [CLANG-ATTR]_.
476     "amdgpu-waves-per-eu"="m,n"             Specify the minimum and maximum number of waves per
477                                             execution unit. Generated by the ``amdgpu_waves_per_eu``
478                                             CLANG attribute [CLANG-ATTR]_.
479     "amdgpu-ieee" true/false.               Specify whether the function expects the IEEE field of the
480                                             mode register to be set on entry. Overrides the default for
481                                             the calling convention.
482     "amdgpu-dx10-clamp" true/false.         Specify whether the function expects the DX10_CLAMP field of
483                                             the mode register to be set on entry. Overrides the default
484                                             for the calling convention.
485     ======================================= ==========================================================
486
487Code Object
488===========
489
490The AMDGPU backend generates a standard ELF [ELF]_ relocatable code object that
491can be linked by ``lld`` to produce a standard ELF shared code object which can
492be loaded and executed on an AMDGPU target.
493
494Header
495------
496
497The AMDGPU backend uses the following ELF header:
498
499  .. table:: AMDGPU ELF Header
500     :name: amdgpu-elf-header-table
501
502     ========================== ===============================
503     Field                      Value
504     ========================== ===============================
505     ``e_ident[EI_CLASS]``      ``ELFCLASS64``
506     ``e_ident[EI_DATA]``       ``ELFDATA2LSB``
507     ``e_ident[EI_OSABI]``      - ``ELFOSABI_NONE``
508                                - ``ELFOSABI_AMDGPU_HSA``
509                                - ``ELFOSABI_AMDGPU_PAL``
510                                - ``ELFOSABI_AMDGPU_MESA3D``
511     ``e_ident[EI_ABIVERSION]`` - ``ELFABIVERSION_AMDGPU_HSA``
512                                - ``ELFABIVERSION_AMDGPU_PAL``
513                                - ``ELFABIVERSION_AMDGPU_MESA3D``
514     ``e_type``                 - ``ET_REL``
515                                - ``ET_DYN``
516     ``e_machine``              ``EM_AMDGPU``
517     ``e_entry``                0
518     ``e_flags``                See :ref:`amdgpu-elf-header-e_flags-table`
519     ========================== ===============================
520
521..
522
523  .. table:: AMDGPU ELF Header Enumeration Values
524     :name: amdgpu-elf-header-enumeration-values-table
525
526     =============================== =====
527     Name                            Value
528     =============================== =====
529     ``EM_AMDGPU``                   224
530     ``ELFOSABI_NONE``               0
531     ``ELFOSABI_AMDGPU_HSA``         64
532     ``ELFOSABI_AMDGPU_PAL``         65
533     ``ELFOSABI_AMDGPU_MESA3D``      66
534     ``ELFABIVERSION_AMDGPU_HSA``    1
535     ``ELFABIVERSION_AMDGPU_PAL``    0
536     ``ELFABIVERSION_AMDGPU_MESA3D`` 0
537     =============================== =====
538
539``e_ident[EI_CLASS]``
540  The ELF class is:
541
542  * ``ELFCLASS32`` for ``r600`` architecture.
543
544  * ``ELFCLASS64`` for ``amdgcn`` architecture which only supports 64
545    bit applications.
546
547``e_ident[EI_DATA]``
548  All AMDGPU targets use ``ELFDATA2LSB`` for little-endian byte ordering.
549
550``e_ident[EI_OSABI]``
551  One of the following AMD GPU architecture specific OS ABIs
552  (see :ref:`amdgpu-os-table`):
553
554  * ``ELFOSABI_NONE`` for *unknown* OS.
555
556  * ``ELFOSABI_AMDGPU_HSA`` for ``amdhsa`` OS.
557
558  * ``ELFOSABI_AMDGPU_PAL`` for ``amdpal`` OS.
559
560  * ``ELFOSABI_AMDGPU_MESA3D`` for ``mesa3D`` OS.
561
562``e_ident[EI_ABIVERSION]``
563  The ABI version of the AMD GPU architecture specific OS ABI to which the code
564  object conforms:
565
566  * ``ELFABIVERSION_AMDGPU_HSA`` is used to specify the version of AMD HSA
567    runtime ABI.
568
569  * ``ELFABIVERSION_AMDGPU_PAL`` is used to specify the version of AMD PAL
570    runtime ABI.
571
572  * ``ELFABIVERSION_AMDGPU_MESA3D`` is used to specify the version of AMD MESA
573    3D runtime ABI.
574
575``e_type``
576  Can be one of the following values:
577
578
579  ``ET_REL``
580    The type produced by the AMD GPU backend compiler as it is relocatable code
581    object.
582
583  ``ET_DYN``
584    The type produced by the linker as it is a shared code object.
585
586  The AMD HSA runtime loader requires a ``ET_DYN`` code object.
587
588``e_machine``
589  The value ``EM_AMDGPU`` is used for the machine for all processors supported
590  by the ``r600`` and ``amdgcn`` architectures (see
591  :ref:`amdgpu-processor-table`). The specific processor is specified in the
592  ``EF_AMDGPU_MACH`` bit field of the ``e_flags`` (see
593  :ref:`amdgpu-elf-header-e_flags-table`).
594
595``e_entry``
596  The entry point is 0 as the entry points for individual kernels must be
597  selected in order to invoke them through AQL packets.
598
599``e_flags``
600  The AMDGPU backend uses the following ELF header flags:
601
602  .. table:: AMDGPU ELF Header ``e_flags``
603     :name: amdgpu-elf-header-e_flags-table
604
605     ================================= ========== =============================
606     Name                              Value      Description
607     ================================= ========== =============================
608     **AMDGPU Processor Flag**                    See :ref:`amdgpu-processor-table`.
609     -------------------------------------------- -----------------------------
610     ``EF_AMDGPU_MACH``                0x000000ff AMDGPU processor selection
611                                                  mask for
612                                                  ``EF_AMDGPU_MACH_xxx`` values
613                                                  defined in
614                                                  :ref:`amdgpu-ef-amdgpu-mach-table`.
615     ``EF_AMDGPU_XNACK``               0x00000100 Indicates if the ``xnack``
616                                                  target feature is
617                                                  enabled for all code
618                                                  contained in the code object.
619                                                  If the processor
620                                                  does not support the
621                                                  ``xnack`` target
622                                                  feature then must
623                                                  be 0.
624                                                  See
625                                                  :ref:`amdgpu-target-features`.
626     ``EF_AMDGPU_SRAM_ECC``            0x00000200 Indicates if the ``sram-ecc``
627                                                  target feature is
628                                                  enabled for all code
629                                                  contained in the code object.
630                                                  If the processor
631                                                  does not support the
632                                                  ``sram-ecc`` target
633                                                  feature then must
634                                                  be 0.
635                                                  See
636                                                  :ref:`amdgpu-target-features`.
637     ================================= ========== =============================
638
639  .. table:: AMDGPU ``EF_AMDGPU_MACH`` Values
640     :name: amdgpu-ef-amdgpu-mach-table
641
642     ================================= ========== =============================
643     Name                              Value      Description (see
644                                                  :ref:`amdgpu-processor-table`)
645     ================================= ========== =============================
646     ``EF_AMDGPU_MACH_NONE``           0x000      *not specified*
647     ``EF_AMDGPU_MACH_R600_R600``      0x001      ``r600``
648     ``EF_AMDGPU_MACH_R600_R630``      0x002      ``r630``
649     ``EF_AMDGPU_MACH_R600_RS880``     0x003      ``rs880``
650     ``EF_AMDGPU_MACH_R600_RV670``     0x004      ``rv670``
651     ``EF_AMDGPU_MACH_R600_RV710``     0x005      ``rv710``
652     ``EF_AMDGPU_MACH_R600_RV730``     0x006      ``rv730``
653     ``EF_AMDGPU_MACH_R600_RV770``     0x007      ``rv770``
654     ``EF_AMDGPU_MACH_R600_CEDAR``     0x008      ``cedar``
655     ``EF_AMDGPU_MACH_R600_CYPRESS``   0x009      ``cypress``
656     ``EF_AMDGPU_MACH_R600_JUNIPER``   0x00a      ``juniper``
657     ``EF_AMDGPU_MACH_R600_REDWOOD``   0x00b      ``redwood``
658     ``EF_AMDGPU_MACH_R600_SUMO``      0x00c      ``sumo``
659     ``EF_AMDGPU_MACH_R600_BARTS``     0x00d      ``barts``
660     ``EF_AMDGPU_MACH_R600_CAICOS``    0x00e      ``caicos``
661     ``EF_AMDGPU_MACH_R600_CAYMAN``    0x00f      ``cayman``
662     ``EF_AMDGPU_MACH_R600_TURKS``     0x010      ``turks``
663     *reserved*                        0x011 -    Reserved for ``r600``
664                                       0x01f      architecture processors.
665     ``EF_AMDGPU_MACH_AMDGCN_GFX600``  0x020      ``gfx600``
666     ``EF_AMDGPU_MACH_AMDGCN_GFX601``  0x021      ``gfx601``
667     ``EF_AMDGPU_MACH_AMDGCN_GFX700``  0x022      ``gfx700``
668     ``EF_AMDGPU_MACH_AMDGCN_GFX701``  0x023      ``gfx701``
669     ``EF_AMDGPU_MACH_AMDGCN_GFX702``  0x024      ``gfx702``
670     ``EF_AMDGPU_MACH_AMDGCN_GFX703``  0x025      ``gfx703``
671     ``EF_AMDGPU_MACH_AMDGCN_GFX704``  0x026      ``gfx704``
672     *reserved*                        0x027      Reserved.
673     ``EF_AMDGPU_MACH_AMDGCN_GFX801``  0x028      ``gfx801``
674     ``EF_AMDGPU_MACH_AMDGCN_GFX802``  0x029      ``gfx802``
675     ``EF_AMDGPU_MACH_AMDGCN_GFX803``  0x02a      ``gfx803``
676     ``EF_AMDGPU_MACH_AMDGCN_GFX810``  0x02b      ``gfx810``
677     ``EF_AMDGPU_MACH_AMDGCN_GFX900``  0x02c      ``gfx900``
678     ``EF_AMDGPU_MACH_AMDGCN_GFX902``  0x02d      ``gfx902``
679     ``EF_AMDGPU_MACH_AMDGCN_GFX904``  0x02e      ``gfx904``
680     ``EF_AMDGPU_MACH_AMDGCN_GFX906``  0x02f      ``gfx906``
681     ``EF_AMDGPU_MACH_AMDGCN_GFX908``  0x030      ``gfx908``
682     ``EF_AMDGPU_MACH_AMDGCN_GFX909``  0x031      ``gfx909``
683     *reserved*                        0x032      Reserved.
684     ``EF_AMDGPU_MACH_AMDGCN_GFX1010`` 0x033      ``gfx1010``
685     ``EF_AMDGPU_MACH_AMDGCN_GFX1011`` 0x034      ``gfx1011``
686     ``EF_AMDGPU_MACH_AMDGCN_GFX1012`` 0x035      ``gfx1012``
687     ================================= ========== =============================
688
689Sections
690--------
691
692An AMDGPU target ELF code object has the standard ELF sections which include:
693
694  .. table:: AMDGPU ELF Sections
695     :name: amdgpu-elf-sections-table
696
697     ================== ================ =================================
698     Name               Type             Attributes
699     ================== ================ =================================
700     ``.bss``           ``SHT_NOBITS``   ``SHF_ALLOC`` + ``SHF_WRITE``
701     ``.data``          ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
702     ``.debug_``\ *\**  ``SHT_PROGBITS`` *none*
703     ``.dynamic``       ``SHT_DYNAMIC``  ``SHF_ALLOC``
704     ``.dynstr``        ``SHT_PROGBITS`` ``SHF_ALLOC``
705     ``.dynsym``        ``SHT_PROGBITS`` ``SHF_ALLOC``
706     ``.got``           ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
707     ``.hash``          ``SHT_HASH``     ``SHF_ALLOC``
708     ``.note``          ``SHT_NOTE``     *none*
709     ``.rela``\ *name*  ``SHT_RELA``     *none*
710     ``.rela.dyn``      ``SHT_RELA``     *none*
711     ``.rodata``        ``SHT_PROGBITS`` ``SHF_ALLOC``
712     ``.shstrtab``      ``SHT_STRTAB``   *none*
713     ``.strtab``        ``SHT_STRTAB``   *none*
714     ``.symtab``        ``SHT_SYMTAB``   *none*
715     ``.text``          ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_EXECINSTR``
716     ================== ================ =================================
717
718These sections have their standard meanings (see [ELF]_) and are only generated
719if needed.
720
721``.debug``\ *\**
722  The standard DWARF sections. See :ref:`amdgpu-dwarf` for information on the
723  DWARF produced by the AMDGPU backend.
724
725``.dynamic``, ``.dynstr``, ``.dynsym``, ``.hash``
726  The standard sections used by a dynamic loader.
727
728``.note``
729  See :ref:`amdgpu-note-records` for the note records supported by the AMDGPU
730  backend.
731
732``.rela``\ *name*, ``.rela.dyn``
733  For relocatable code objects, *name* is the name of the section that the
734  relocation records apply. For example, ``.rela.text`` is the section name for
735  relocation records associated with the ``.text`` section.
736
737  For linked shared code objects, ``.rela.dyn`` contains all the relocation
738  records from each of the relocatable code object's ``.rela``\ *name* sections.
739
740  See :ref:`amdgpu-relocation-records` for the relocation records supported by
741  the AMDGPU backend.
742
743``.text``
744  The executable machine code for the kernels and functions they call. Generated
745  as position independent code. See :ref:`amdgpu-code-conventions` for
746  information on conventions used in the isa generation.
747
748.. _amdgpu-note-records:
749
750Note Records
751------------
752
753The AMDGPU backend code object contains ELF note records in the ``.note``
754section. The set of generated notes and their semantics depend on the code
755object version; see :ref:`amdgpu-note-records-v2` and
756:ref:`amdgpu-note-records-v3`.
757
758As required by ``ELFCLASS32`` and ``ELFCLASS64``, minimal zero byte padding
759must be generated after the ``name`` field to ensure the ``desc`` field is 4
760byte aligned. In addition, minimal zero byte padding must be generated to
761ensure the ``desc`` field size is a multiple of 4 bytes. The ``sh_addralign``
762field of the ``.note`` section must be at least 4 to indicate at least 8 byte
763alignment.
764
765.. _amdgpu-note-records-v2:
766
767Code Object V2 Note Records (-mattr=-code-object-v3)
768~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
769
770.. warning:: Code Object V2 is not the default code object version emitted by
771  this version of LLVM. For a description of the notes generated with the
772  default configuration (Code Object V3) see :ref:`amdgpu-note-records-v3`.
773
774The AMDGPU backend code object uses the following ELF note record in the
775``.note`` section when compiling for Code Object V2 (-mattr=-code-object-v3).
776
777Additional note records may be present, but any which are not documented here
778are deprecated and should not be used.
779
780  .. table:: AMDGPU Code Object V2 ELF Note Records
781     :name: amdgpu-elf-note-records-table-v2
782
783     ===== ============================== ======================================
784     Name  Type                           Description
785     ===== ============================== ======================================
786     "AMD" ``NT_AMD_AMDGPU_HSA_METADATA`` <metadata null terminated string>
787     ===== ============================== ======================================
788
789..
790
791  .. table:: AMDGPU Code Object V2 ELF Note Record Enumeration Values
792     :name: amdgpu-elf-note-record-enumeration-values-table-v2
793
794     ============================== =====
795     Name                           Value
796     ============================== =====
797     *reserved*                       0-9
798     ``NT_AMD_AMDGPU_HSA_METADATA``    10
799     *reserved*                        11
800     ============================== =====
801
802``NT_AMD_AMDGPU_HSA_METADATA``
803  Specifies extensible metadata associated with the code objects executed on HSA
804  [HSA]_ compatible runtimes such as AMD's ROCm [AMD-ROCm]_. It is required when
805  the target triple OS is ``amdhsa`` (see :ref:`amdgpu-target-triples`). See
806  :ref:`amdgpu-amdhsa-code-object-metadata-v2` for the syntax of the code
807  object metadata string.
808
809.. _amdgpu-note-records-v3:
810
811Code Object V3 Note Records (-mattr=+code-object-v3)
812~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
813
814The AMDGPU backend code object uses the following ELF note record in the
815``.note`` section when compiling for Code Object V3 (-mattr=+code-object-v3).
816
817Additional note records may be present, but any which are not documented here
818are deprecated and should not be used.
819
820  .. table:: AMDGPU Code Object V3 ELF Note Records
821     :name: amdgpu-elf-note-records-table-v3
822
823     ======== ============================== ======================================
824     Name     Type                           Description
825     ======== ============================== ======================================
826     "AMDGPU" ``NT_AMDGPU_METADATA``         Metadata in Message Pack [MsgPack]_
827                                             binary format.
828     ======== ============================== ======================================
829
830..
831
832  .. table:: AMDGPU Code Object V3 ELF Note Record Enumeration Values
833     :name: amdgpu-elf-note-record-enumeration-values-table-v3
834
835     ============================== =====
836     Name                           Value
837     ============================== =====
838     *reserved*                     0-31
839     ``NT_AMDGPU_METADATA``         32
840     ============================== =====
841
842``NT_AMDGPU_METADATA``
843  Specifies extensible metadata associated with an AMDGPU code
844  object. It is encoded as a map in the Message Pack [MsgPack]_ binary
845  data format. See :ref:`amdgpu-amdhsa-code-object-metadata-v3` for the
846  map keys defined for the ``amdhsa`` OS.
847
848.. _amdgpu-symbols:
849
850Symbols
851-------
852
853Symbols include the following:
854
855  .. table:: AMDGPU ELF Symbols
856     :name: amdgpu-elf-symbols-table
857
858     ===================== ================== ================ ==================
859     Name                  Type               Section          Description
860     ===================== ================== ================ ==================
861     *link-name*           ``STT_OBJECT``     - ``.data``      Global variable
862                                              - ``.rodata``
863					      - ``.bss``
864     *link-name*\ ``.kd``  ``STT_OBJECT``     - ``.rodata``    Kernel descriptor
865     *link-name*           ``STT_FUNC``       - ``.text``      Kernel entry point
866     *link-name*           ``STT_OBJECT``     - SHN_AMDGPU_LDS Global variable in LDS
867     ===================== ================== ================ ==================
868
869Global variable
870  Global variables both used and defined by the compilation unit.
871
872  If the symbol is defined in the compilation unit then it is allocated in the
873  appropriate section according to if it has initialized data or is readonly.
874
875  If the symbol is external then its section is ``STN_UNDEF`` and the loader
876  will resolve relocations using the definition provided by another code object
877  or explicitly defined by the runtime.
878
879  If the symbol resides in local/group memory (LDS) then its section is the
880  special processor-specific section name ``SHN_AMDGPU_LDS``, and the
881  ``st_value`` field describes alignment requirements as it does for common
882  symbols.
883
884  .. TODO
885     Add description of linked shared object symbols. Seems undefined symbols
886     are marked as STT_NOTYPE.
887
888Kernel descriptor
889  Every HSA kernel has an associated kernel descriptor. It is the address of the
890  kernel descriptor that is used in the AQL dispatch packet used to invoke the
891  kernel, not the kernel entry point. The layout of the HSA kernel descriptor is
892  defined in :ref:`amdgpu-amdhsa-kernel-descriptor`.
893
894Kernel entry point
895  Every HSA kernel also has a symbol for its machine code entry point.
896
897.. _amdgpu-relocation-records:
898
899Relocation Records
900------------------
901
902AMDGPU backend generates ``Elf64_Rela`` relocation records. Supported
903relocatable fields are:
904
905``word32``
906  This specifies a 32-bit field occupying 4 bytes with arbitrary byte
907  alignment. These values use the same byte order as other word values in the
908  AMD GPU architecture.
909
910``word64``
911  This specifies a 64-bit field occupying 8 bytes with arbitrary byte
912  alignment. These values use the same byte order as other word values in the
913  AMD GPU architecture.
914
915Following notations are used for specifying relocation calculations:
916
917**A**
918  Represents the addend used to compute the value of the relocatable field.
919
920**G**
921  Represents the offset into the global offset table at which the relocation
922  entry's symbol will reside during execution.
923
924**GOT**
925  Represents the address of the global offset table.
926
927**P**
928  Represents the place (section offset for ``et_rel`` or address for ``et_dyn``)
929  of the storage unit being relocated (computed using ``r_offset``).
930
931**S**
932  Represents the value of the symbol whose index resides in the relocation
933  entry. Relocations not using this must specify a symbol index of ``STN_UNDEF``.
934
935**B**
936  Represents the base address of a loaded executable or shared object which is
937  the difference between the ELF address and the actual load address. Relocations
938  using this are only valid in executable or shared objects.
939
940The following relocation types are supported:
941
942  .. table:: AMDGPU ELF Relocation Records
943     :name: amdgpu-elf-relocation-records-table
944
945     ========================== ======= =====  ==========  ==============================
946     Relocation Type            Kind    Value  Field       Calculation
947     ========================== ======= =====  ==========  ==============================
948     ``R_AMDGPU_NONE``                  0      *none*      *none*
949     ``R_AMDGPU_ABS32_LO``      Static, 1      ``word32``  (S + A) & 0xFFFFFFFF
950                                Dynamic
951     ``R_AMDGPU_ABS32_HI``      Static, 2      ``word32``  (S + A) >> 32
952                                Dynamic
953     ``R_AMDGPU_ABS64``         Static, 3      ``word64``  S + A
954                                Dynamic
955     ``R_AMDGPU_REL32``         Static  4      ``word32``  S + A - P
956     ``R_AMDGPU_REL64``         Static  5      ``word64``  S + A - P
957     ``R_AMDGPU_ABS32``         Static, 6      ``word32``  S + A
958                                Dynamic
959     ``R_AMDGPU_GOTPCREL``      Static  7      ``word32``  G + GOT + A - P
960     ``R_AMDGPU_GOTPCREL32_LO`` Static  8      ``word32``  (G + GOT + A - P) & 0xFFFFFFFF
961     ``R_AMDGPU_GOTPCREL32_HI`` Static  9      ``word32``  (G + GOT + A - P) >> 32
962     ``R_AMDGPU_REL32_LO``      Static  10     ``word32``  (S + A - P) & 0xFFFFFFFF
963     ``R_AMDGPU_REL32_HI``      Static  11     ``word32``  (S + A - P) >> 32
964     *reserved*                         12
965     ``R_AMDGPU_RELATIVE64``    Dynamic 13     ``word64``  B + A
966     ========================== ======= =====  ==========  ==============================
967
968``R_AMDGPU_ABS32_LO`` and ``R_AMDGPU_ABS32_HI`` are only supported by
969the ``mesa3d`` OS, which does not support ``R_AMDGPU_ABS64``.
970
971There is no current OS loader support for 32 bit programs and so
972``R_AMDGPU_ABS32`` is not used.
973
974.. _amdgpu-dwarf:
975
976DWARF
977-----
978
979Standard DWARF [DWARF]_ Version 5 sections can be generated. These contain
980information that maps the code object executable code and data to the source
981language constructs. It can be used by tools such as debuggers and profilers.
982
983Address Space Mapping
984~~~~~~~~~~~~~~~~~~~~~
985
986The following address space mapping is used:
987
988  .. table:: AMDGPU DWARF Address Space Mapping
989     :name: amdgpu-dwarf-address-space-mapping-table
990
991     =================== =================
992     DWARF Address Space Memory Space
993     =================== =================
994     1                   Private (Scratch)
995     2                   Local (group/LDS)
996     *omitted*           Global
997     *omitted*           Constant
998     *omitted*           Generic (Flat)
999     *not supported*     Region (GDS)
1000     =================== =================
1001
1002See :ref:`amdgpu-address-spaces` for information on the memory space terminology
1003used in the table.
1004
1005An ``address_class`` attribute is generated on pointer type DIEs to specify the
1006DWARF address space of the value of the pointer when it is in the *private* or
1007*local* address space. Otherwise the attribute is omitted.
1008
1009An ``XDEREF`` operation is generated in location list expressions for variables
1010that are allocated in the *private* and *local* address space. Otherwise no
1011``XDREF`` is omitted.
1012
1013Register Mapping
1014~~~~~~~~~~~~~~~~
1015
1016*This section is WIP.*
1017
1018.. TODO
1019   Define DWARF register enumeration.
1020
1021   If want to present a wavefront state then should expose vector registers as
1022   64 wide (rather than per work-item view that LLVM uses). Either as separate
1023   registers, or a 64x4 byte single register. In either case use a new LANE op
1024   (akin to XDREF) to select the current lane usage in a location
1025   expression. This would also allow scalar register spilling to vector register
1026   lanes to be expressed (currently no debug information is being generated for
1027   spilling). If choose a wide single register approach then use LANE in
1028   conjunction with PIECE operation to select the dword part of the register for
1029   the current lane. If the separate register approach then use LANE to select
1030   the register.
1031
1032Source Text
1033~~~~~~~~~~~
1034
1035Source text for online-compiled programs (e.g. those compiled by the OpenCL
1036runtime) may be embedded into the DWARF v5 line table using the ``clang
1037-gembed-source`` option, described in table :ref:`amdgpu-debug-options`.
1038
1039For example:
1040
1041``-gembed-source``
1042  Enable the embedded source DWARF v5 extension.
1043``-gno-embed-source``
1044  Disable the embedded source DWARF v5 extension.
1045
1046  .. table:: AMDGPU Debug Options
1047     :name: amdgpu-debug-options
1048
1049     ==================== ==================================================
1050     Debug Flag           Description
1051     ==================== ==================================================
1052     -g[no-]embed-source  Enable/disable embedding source text in DWARF
1053                          debug sections. Useful for environments where
1054                          source cannot be written to disk, such as
1055                          when performing online compilation.
1056     ==================== ==================================================
1057
1058This option enables one extended content types in the DWARF v5 Line Number
1059Program Header, which is used to encode embedded source.
1060
1061  .. table:: AMDGPU DWARF Line Number Program Header Extended Content Types
1062     :name: amdgpu-dwarf-extended-content-types
1063
1064     ============================  ======================
1065     Content Type                  Form
1066     ============================  ======================
1067     ``DW_LNCT_LLVM_source``       ``DW_FORM_line_strp``
1068     ============================  ======================
1069
1070The source field will contain the UTF-8 encoded, null-terminated source text
1071with ``'\n'`` line endings. When the source field is present, consumers can use
1072the embedded source instead of attempting to discover the source on disk. When
1073the source field is absent, consumers can access the file to get the source
1074text.
1075
1076The above content type appears in the ``file_name_entry_format`` field of the
1077line table prologue, and its corresponding value appear in the ``file_names``
1078field. The current encoding of the content type is documented in table
1079:ref:`amdgpu-dwarf-extended-content-types-encoding`
1080
1081  .. table:: AMDGPU DWARF Line Number Program Header Extended Content Types Encoding
1082     :name: amdgpu-dwarf-extended-content-types-encoding
1083
1084     ============================  ====================
1085     Content Type                  Value
1086     ============================  ====================
1087     ``DW_LNCT_LLVM_source``       0x2001
1088     ============================  ====================
1089
1090.. _amdgpu-code-conventions:
1091
1092Code Conventions
1093================
1094
1095This section provides code conventions used for each supported target triple OS
1096(see :ref:`amdgpu-target-triples`).
1097
1098AMDHSA
1099------
1100
1101This section provides code conventions used when the target triple OS is
1102``amdhsa`` (see :ref:`amdgpu-target-triples`).
1103
1104.. _amdgpu-amdhsa-code-object-target-identification:
1105
1106Code Object Target Identification
1107~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1108
1109The AMDHSA OS uses the following syntax to specify the code object
1110target as a single string:
1111
1112  ``<Architecture>-<Vendor>-<OS>-<Environment>-<Processor><Target Features>``
1113
1114Where:
1115
1116  - ``<Architecture>``, ``<Vendor>``, ``<OS>`` and ``<Environment>``
1117    are the same as the *Target Triple* (see
1118    :ref:`amdgpu-target-triples`).
1119
1120  - ``<Processor>`` is the same as the *Processor* (see
1121    :ref:`amdgpu-processors`).
1122
1123  - ``<Target Features>`` is a list of the enabled *Target Features*
1124    (see :ref:`amdgpu-target-features`), each prefixed by a plus, that
1125    apply to *Processor*. The list must be in the same order as listed
1126    in the table :ref:`amdgpu-target-feature-table`. Note that *Target
1127    Features* must be included in the list if they are enabled even if
1128    that is the default for *Processor*.
1129
1130For example:
1131
1132  ``"amdgcn-amd-amdhsa--gfx902+xnack"``
1133
1134.. _amdgpu-amdhsa-code-object-metadata:
1135
1136Code Object Metadata
1137~~~~~~~~~~~~~~~~~~~~
1138
1139The code object metadata specifies extensible metadata associated with the code
1140objects executed on HSA [HSA]_ compatible runtimes such as AMD's ROCm
1141[AMD-ROCm]_. The encoding and semantics of this metadata depends on the code
1142object version; see :ref:`amdgpu-amdhsa-code-object-metadata-v2` and
1143:ref:`amdgpu-amdhsa-code-object-metadata-v3`.
1144
1145Code object metadata is specified in a note record (see
1146:ref:`amdgpu-note-records`) and is required when the target triple OS is
1147``amdhsa`` (see :ref:`amdgpu-target-triples`). It must contain the minimum
1148information necessary to support the ROCM kernel queries. For example, the
1149segment sizes needed in a dispatch packet. In addition, a high level language
1150runtime may require other information to be included. For example, the AMD
1151OpenCL runtime records kernel argument information.
1152
1153.. _amdgpu-amdhsa-code-object-metadata-v2:
1154
1155Code Object V2 Metadata (-mattr=-code-object-v3)
1156++++++++++++++++++++++++++++++++++++++++++++++++
1157
1158.. warning:: Code Object V2 is not the default code object version emitted by
1159  this version of LLVM. For a description of the metadata generated with the
1160  default configuration (Code Object V3) see
1161  :ref:`amdgpu-amdhsa-code-object-metadata-v3`.
1162
1163Code object V2 metadata is specified by the ``NT_AMD_AMDGPU_METADATA`` note
1164record (see :ref:`amdgpu-note-records-v2`).
1165
1166The metadata is specified as a YAML formatted string (see [YAML]_ and
1167:doc:`YamlIO`).
1168
1169.. TODO
1170   Is the string null terminated? It probably should not if YAML allows it to
1171   contain null characters, otherwise it should be.
1172
1173The metadata is represented as a single YAML document comprised of the mapping
1174defined in table :ref:`amdgpu-amdhsa-code-object-metadata-map-table-v2` and
1175referenced tables.
1176
1177For boolean values, the string values of ``false`` and ``true`` are used for
1178false and true respectively.
1179
1180Additional information can be added to the mappings. To avoid conflicts, any
1181non-AMD key names should be prefixed by "*vendor-name*.".
1182
1183  .. table:: AMDHSA Code Object V2 Metadata Map
1184     :name: amdgpu-amdhsa-code-object-metadata-map-table-v2
1185
1186     ========== ============== ========= =======================================
1187     String Key Value Type     Required? Description
1188     ========== ============== ========= =======================================
1189     "Version"  sequence of    Required  - The first integer is the major
1190                2 integers                 version. Currently 1.
1191                                         - The second integer is the minor
1192                                           version. Currently 0.
1193     "Printf"   sequence of              Each string is encoded information
1194                strings                  about a printf function call. The
1195                                         encoded information is organized as
1196                                         fields separated by colon (':'):
1197
1198                                         ``ID:N:S[0]:S[1]:...:S[N-1]:FormatString``
1199
1200                                         where:
1201
1202                                         ``ID``
1203                                           A 32 bit integer as a unique id for
1204                                           each printf function call
1205
1206                                         ``N``
1207                                           A 32 bit integer equal to the number
1208                                           of arguments of printf function call
1209                                           minus 1
1210
1211                                         ``S[i]`` (where i = 0, 1, ... , N-1)
1212                                           32 bit integers for the size in bytes
1213                                           of the i-th FormatString argument of
1214                                           the printf function call
1215
1216                                         FormatString
1217                                           The format string passed to the
1218                                           printf function call.
1219     "Kernels"  sequence of    Required  Sequence of the mappings for each
1220                mapping                  kernel in the code object. See
1221                                         :ref:`amdgpu-amdhsa-code-object-kernel-metadata-map-table-v2`
1222                                         for the definition of the mapping.
1223     ========== ============== ========= =======================================
1224
1225..
1226
1227  .. table:: AMDHSA Code Object V2 Kernel Metadata Map
1228     :name: amdgpu-amdhsa-code-object-kernel-metadata-map-table-v2
1229
1230     ================= ============== ========= ================================
1231     String Key        Value Type     Required? Description
1232     ================= ============== ========= ================================
1233     "Name"            string         Required  Source name of the kernel.
1234     "SymbolName"      string         Required  Name of the kernel
1235                                                descriptor ELF symbol.
1236     "Language"        string                   Source language of the kernel.
1237                                                Values include:
1238
1239                                                - "OpenCL C"
1240                                                - "OpenCL C++"
1241                                                - "HCC"
1242                                                - "OpenMP"
1243
1244     "LanguageVersion" sequence of              - The first integer is the major
1245                       2 integers                 version.
1246                                                - The second integer is the
1247                                                  minor version.
1248     "Attrs"           mapping                  Mapping of kernel attributes.
1249                                                See
1250                                                :ref:`amdgpu-amdhsa-code-object-kernel-attribute-metadata-map-table-v2`
1251                                                for the mapping definition.
1252     "Args"            sequence of              Sequence of mappings of the
1253                       mapping                  kernel arguments. See
1254                                                :ref:`amdgpu-amdhsa-code-object-kernel-argument-metadata-map-table-v2`
1255                                                for the definition of the mapping.
1256     "CodeProps"       mapping                  Mapping of properties related to
1257                                                the kernel code. See
1258                                                :ref:`amdgpu-amdhsa-code-object-kernel-code-properties-metadata-map-table-v2`
1259                                                for the mapping definition.
1260     ================= ============== ========= ================================
1261
1262..
1263
1264  .. table:: AMDHSA Code Object V2 Kernel Attribute Metadata Map
1265     :name: amdgpu-amdhsa-code-object-kernel-attribute-metadata-map-table-v2
1266
1267     =================== ============== ========= ==============================
1268     String Key          Value Type     Required? Description
1269     =================== ============== ========= ==============================
1270     "ReqdWorkGroupSize" sequence of              If not 0, 0, 0 then all values
1271                         3 integers               must be >=1 and the dispatch
1272                                                  work-group size X, Y, Z must
1273                                                  correspond to the specified
1274                                                  values. Defaults to 0, 0, 0.
1275
1276                                                  Corresponds to the OpenCL
1277                                                  ``reqd_work_group_size``
1278                                                  attribute.
1279     "WorkGroupSizeHint" sequence of              The dispatch work-group size
1280                         3 integers               X, Y, Z is likely to be the
1281                                                  specified values.
1282
1283                                                  Corresponds to the OpenCL
1284                                                  ``work_group_size_hint``
1285                                                  attribute.
1286     "VecTypeHint"       string                   The name of a scalar or vector
1287                                                  type.
1288
1289                                                  Corresponds to the OpenCL
1290                                                  ``vec_type_hint`` attribute.
1291
1292     "RuntimeHandle"     string                   The external symbol name
1293                                                  associated with a kernel.
1294                                                  OpenCL runtime allocates a
1295                                                  global buffer for the symbol
1296                                                  and saves the kernel's address
1297                                                  to it, which is used for
1298                                                  device side enqueueing. Only
1299                                                  available for device side
1300                                                  enqueued kernels.
1301     =================== ============== ========= ==============================
1302
1303..
1304
1305  .. table:: AMDHSA Code Object V2 Kernel Argument Metadata Map
1306     :name: amdgpu-amdhsa-code-object-kernel-argument-metadata-map-table-v2
1307
1308     ================= ============== ========= ================================
1309     String Key        Value Type     Required? Description
1310     ================= ============== ========= ================================
1311     "Name"            string                   Kernel argument name.
1312     "TypeName"        string                   Kernel argument type name.
1313     "Size"            integer        Required  Kernel argument size in bytes.
1314     "Align"           integer        Required  Kernel argument alignment in
1315                                                bytes. Must be a power of two.
1316     "ValueKind"       string         Required  Kernel argument kind that
1317                                                specifies how to set up the
1318                                                corresponding argument.
1319                                                Values include:
1320
1321                                                "ByValue"
1322                                                  The argument is copied
1323                                                  directly into the kernarg.
1324
1325                                                "GlobalBuffer"
1326                                                  A global address space pointer
1327                                                  to the buffer data is passed
1328                                                  in the kernarg.
1329
1330                                                "DynamicSharedPointer"
1331                                                  A group address space pointer
1332                                                  to dynamically allocated LDS
1333                                                  is passed in the kernarg.
1334
1335                                                "Sampler"
1336                                                  A global address space
1337                                                  pointer to a S# is passed in
1338                                                  the kernarg.
1339
1340                                                "Image"
1341                                                  A global address space
1342                                                  pointer to a T# is passed in
1343                                                  the kernarg.
1344
1345                                                "Pipe"
1346                                                  A global address space pointer
1347                                                  to an OpenCL pipe is passed in
1348                                                  the kernarg.
1349
1350                                                "Queue"
1351                                                  A global address space pointer
1352                                                  to an OpenCL device enqueue
1353                                                  queue is passed in the
1354                                                  kernarg.
1355
1356                                                "HiddenGlobalOffsetX"
1357                                                  The OpenCL grid dispatch
1358                                                  global offset for the X
1359                                                  dimension is passed in the
1360                                                  kernarg.
1361
1362                                                "HiddenGlobalOffsetY"
1363                                                  The OpenCL grid dispatch
1364                                                  global offset for the Y
1365                                                  dimension is passed in the
1366                                                  kernarg.
1367
1368                                                "HiddenGlobalOffsetZ"
1369                                                  The OpenCL grid dispatch
1370                                                  global offset for the Z
1371                                                  dimension is passed in the
1372                                                  kernarg.
1373
1374                                                "HiddenNone"
1375                                                  An argument that is not used
1376                                                  by the kernel. Space needs to
1377                                                  be left for it, but it does
1378                                                  not need to be set up.
1379
1380                                                "HiddenPrintfBuffer"
1381                                                  A global address space pointer
1382                                                  to the runtime printf buffer
1383                                                  is passed in kernarg.
1384
1385                                                "HiddenDefaultQueue"
1386                                                  A global address space pointer
1387                                                  to the OpenCL device enqueue
1388                                                  queue that should be used by
1389                                                  the kernel by default is
1390                                                  passed in the kernarg.
1391
1392                                                "HiddenCompletionAction"
1393                                                  A global address space pointer
1394                                                  to help link enqueued kernels into
1395                                                  the ancestor tree for determining
1396                                                  when the parent kernel has finished.
1397
1398                                                "HiddenMultiGridSyncArg"
1399                                                  A global address space pointer for
1400                                                  multi-grid synchronization is
1401                                                  passed in the kernarg.
1402
1403     "ValueType"       string         Required  Kernel argument value type. Only
1404                                                present if "ValueKind" is
1405                                                "ByValue". For vector data
1406                                                types, the value is for the
1407                                                element type. Values include:
1408
1409                                                - "Struct"
1410                                                - "I8"
1411                                                - "U8"
1412                                                - "I16"
1413                                                - "U16"
1414                                                - "F16"
1415                                                - "I32"
1416                                                - "U32"
1417                                                - "F32"
1418                                                - "I64"
1419                                                - "U64"
1420                                                - "F64"
1421
1422                                                .. TODO
1423                                                   How can it be determined if a
1424                                                   vector type, and what size
1425                                                   vector?
1426     "PointeeAlign"    integer                  Alignment in bytes of pointee
1427                                                type for pointer type kernel
1428                                                argument. Must be a power
1429                                                of 2. Only present if
1430                                                "ValueKind" is
1431                                                "DynamicSharedPointer".
1432     "AddrSpaceQual"   string                   Kernel argument address space
1433                                                qualifier. Only present if
1434                                                "ValueKind" is "GlobalBuffer" or
1435                                                "DynamicSharedPointer". Values
1436                                                are:
1437
1438                                                - "Private"
1439                                                - "Global"
1440                                                - "Constant"
1441                                                - "Local"
1442                                                - "Generic"
1443                                                - "Region"
1444
1445                                                .. TODO
1446                                                   Is GlobalBuffer only Global
1447                                                   or Constant? Is
1448                                                   DynamicSharedPointer always
1449                                                   Local? Can HCC allow Generic?
1450                                                   How can Private or Region
1451                                                   ever happen?
1452     "AccQual"         string                   Kernel argument access
1453                                                qualifier. Only present if
1454                                                "ValueKind" is "Image" or
1455                                                "Pipe". Values
1456                                                are:
1457
1458                                                - "ReadOnly"
1459                                                - "WriteOnly"
1460                                                - "ReadWrite"
1461
1462                                                .. TODO
1463                                                   Does this apply to
1464                                                   GlobalBuffer?
1465     "ActualAccQual"   string                   The actual memory accesses
1466                                                performed by the kernel on the
1467                                                kernel argument. Only present if
1468                                                "ValueKind" is "GlobalBuffer",
1469                                                "Image", or "Pipe". This may be
1470                                                more restrictive than indicated
1471                                                by "AccQual" to reflect what the
1472                                                kernel actual does. If not
1473                                                present then the runtime must
1474                                                assume what is implied by
1475                                                "AccQual" and "IsConst". Values
1476                                                are:
1477
1478                                                - "ReadOnly"
1479                                                - "WriteOnly"
1480                                                - "ReadWrite"
1481
1482     "IsConst"         boolean                  Indicates if the kernel argument
1483                                                is const qualified. Only present
1484                                                if "ValueKind" is
1485                                                "GlobalBuffer".
1486
1487     "IsRestrict"      boolean                  Indicates if the kernel argument
1488                                                is restrict qualified. Only
1489                                                present if "ValueKind" is
1490                                                "GlobalBuffer".
1491
1492     "IsVolatile"      boolean                  Indicates if the kernel argument
1493                                                is volatile qualified. Only
1494                                                present if "ValueKind" is
1495                                                "GlobalBuffer".
1496
1497     "IsPipe"          boolean                  Indicates if the kernel argument
1498                                                is pipe qualified. Only present
1499                                                if "ValueKind" is "Pipe".
1500
1501                                                .. TODO
1502                                                   Can GlobalBuffer be pipe
1503                                                   qualified?
1504     ================= ============== ========= ================================
1505
1506..
1507
1508  .. table:: AMDHSA Code Object V2 Kernel Code Properties Metadata Map
1509     :name: amdgpu-amdhsa-code-object-kernel-code-properties-metadata-map-table-v2
1510
1511     ============================ ============== ========= =====================
1512     String Key                   Value Type     Required? Description
1513     ============================ ============== ========= =====================
1514     "KernargSegmentSize"         integer        Required  The size in bytes of
1515                                                           the kernarg segment
1516                                                           that holds the values
1517                                                           of the arguments to
1518                                                           the kernel.
1519     "GroupSegmentFixedSize"      integer        Required  The amount of group
1520                                                           segment memory
1521                                                           required by a
1522                                                           work-group in
1523                                                           bytes. This does not
1524                                                           include any
1525                                                           dynamically allocated
1526                                                           group segment memory
1527                                                           that may be added
1528                                                           when the kernel is
1529                                                           dispatched.
1530     "PrivateSegmentFixedSize"    integer        Required  The amount of fixed
1531                                                           private address space
1532                                                           memory required for a
1533                                                           work-item in
1534                                                           bytes. If the kernel
1535                                                           uses a dynamic call
1536                                                           stack then additional
1537                                                           space must be added
1538                                                           to this value for the
1539                                                           call stack.
1540     "KernargSegmentAlign"        integer        Required  The maximum byte
1541                                                           alignment of
1542                                                           arguments in the
1543                                                           kernarg segment. Must
1544                                                           be a power of 2.
1545     "WavefrontSize"              integer        Required  Wavefront size. Must
1546                                                           be a power of 2.
1547     "NumSGPRs"                   integer        Required  Number of scalar
1548                                                           registers used by a
1549                                                           wavefront for
1550                                                           GFX6-GFX10. This
1551                                                           includes the special
1552                                                           SGPRs for VCC, Flat
1553                                                           Scratch (GFX7-GFX10)
1554                                                           and XNACK (for
1555                                                           GFX8-GFX10). It does
1556                                                           not include the 16
1557                                                           SGPR added if a trap
1558                                                           handler is
1559                                                           enabled. It is not
1560                                                           rounded up to the
1561                                                           allocation
1562                                                           granularity.
1563     "NumVGPRs"                   integer        Required  Number of vector
1564                                                           registers used by
1565                                                           each work-item for
1566                                                           GFX6-GFX10
1567     "MaxFlatWorkGroupSize"       integer        Required  Maximum flat
1568                                                           work-group size
1569                                                           supported by the
1570                                                           kernel in work-items.
1571                                                           Must be >=1 and
1572                                                           consistent with
1573                                                           ReqdWorkGroupSize if
1574                                                           not 0, 0, 0.
1575     "NumSpilledSGPRs"            integer                  Number of stores from
1576                                                           a scalar register to
1577                                                           a register allocator
1578                                                           created spill
1579                                                           location.
1580     "NumSpilledVGPRs"            integer                  Number of stores from
1581                                                           a vector register to
1582                                                           a register allocator
1583                                                           created spill
1584                                                           location.
1585     ============================ ============== ========= =====================
1586
1587.. _amdgpu-amdhsa-code-object-metadata-v3:
1588
1589Code Object V3 Metadata (-mattr=+code-object-v3)
1590++++++++++++++++++++++++++++++++++++++++++++++++
1591
1592Code object V3 metadata is specified by the ``NT_AMDGPU_METADATA`` note record
1593(see :ref:`amdgpu-note-records-v3`).
1594
1595The metadata is represented as Message Pack formatted binary data (see
1596[MsgPack]_). The top level is a Message Pack map that includes the
1597keys defined in table
1598:ref:`amdgpu-amdhsa-code-object-metadata-map-table-v3` and referenced
1599tables.
1600
1601Additional information can be added to the maps. To avoid conflicts,
1602any key names should be prefixed by "*vendor-name*." where
1603``vendor-name`` can be the the name of the vendor and specific vendor
1604tool that generates the information. The prefix is abbreviated to
1605simply "." when it appears within a map that has been added by the
1606same *vendor-name*.
1607
1608  .. table:: AMDHSA Code Object V3 Metadata Map
1609     :name: amdgpu-amdhsa-code-object-metadata-map-table-v3
1610
1611     ================= ============== ========= =======================================
1612     String Key        Value Type     Required? Description
1613     ================= ============== ========= =======================================
1614     "amdhsa.version"  sequence of    Required  - The first integer is the major
1615                       2 integers                 version. Currently 1.
1616                                                - The second integer is the minor
1617                                                  version. Currently 0.
1618     "amdhsa.printf"   sequence of              Each string is encoded information
1619                       strings                  about a printf function call. The
1620                                                encoded information is organized as
1621                                                fields separated by colon (':'):
1622
1623                                                ``ID:N:S[0]:S[1]:...:S[N-1]:FormatString``
1624
1625                                                where:
1626
1627                                                ``ID``
1628                                                  A 32 bit integer as a unique id for
1629                                                  each printf function call
1630
1631                                                ``N``
1632                                                  A 32 bit integer equal to the number
1633                                                  of arguments of printf function call
1634                                                  minus 1
1635
1636                                                ``S[i]`` (where i = 0, 1, ... , N-1)
1637                                                  32 bit integers for the size in bytes
1638                                                  of the i-th FormatString argument of
1639                                                  the printf function call
1640
1641                                                FormatString
1642                                                  The format string passed to the
1643                                                  printf function call.
1644     "amdhsa.kernels"  sequence of    Required  Sequence of the maps for each
1645                       map                      kernel in the code object. See
1646                                                :ref:`amdgpu-amdhsa-code-object-kernel-metadata-map-table-v3`
1647                                                for the definition of the keys included
1648                                                in that map.
1649     ================= ============== ========= =======================================
1650
1651..
1652
1653  .. table:: AMDHSA Code Object V3 Kernel Metadata Map
1654     :name: amdgpu-amdhsa-code-object-kernel-metadata-map-table-v3
1655
1656     =================================== ============== ========= ================================
1657     String Key                          Value Type     Required? Description
1658     =================================== ============== ========= ================================
1659     ".name"                             string         Required  Source name of the kernel.
1660     ".symbol"                           string         Required  Name of the kernel
1661                                                                  descriptor ELF symbol.
1662     ".language"                         string                   Source language of the kernel.
1663                                                                  Values include:
1664
1665                                                                  - "OpenCL C"
1666                                                                  - "OpenCL C++"
1667                                                                  - "HCC"
1668                                                                  - "HIP"
1669                                                                  - "OpenMP"
1670                                                                  - "Assembler"
1671
1672     ".language_version"                 sequence of              - The first integer is the major
1673                                         2 integers                 version.
1674                                                                  - The second integer is the
1675                                                                    minor version.
1676     ".args"                             sequence of              Sequence of maps of the
1677                                         map                      kernel arguments. See
1678                                                                  :ref:`amdgpu-amdhsa-code-object-kernel-argument-metadata-map-table-v3`
1679                                                                  for the definition of the keys
1680                                                                  included in that map.
1681     ".reqd_workgroup_size"              sequence of              If not 0, 0, 0 then all values
1682                                         3 integers               must be >=1 and the dispatch
1683                                                                  work-group size X, Y, Z must
1684                                                                  correspond to the specified
1685                                                                  values. Defaults to 0, 0, 0.
1686
1687                                                                  Corresponds to the OpenCL
1688                                                                  ``reqd_work_group_size``
1689                                                                  attribute.
1690     ".workgroup_size_hint"              sequence of              The dispatch work-group size
1691                                         3 integers               X, Y, Z is likely to be the
1692                                                                  specified values.
1693
1694                                                                  Corresponds to the OpenCL
1695                                                                  ``work_group_size_hint``
1696                                                                  attribute.
1697     ".vec_type_hint"                    string                   The name of a scalar or vector
1698                                                                  type.
1699
1700                                                                  Corresponds to the OpenCL
1701                                                                  ``vec_type_hint`` attribute.
1702
1703     ".device_enqueue_symbol"            string                   The external symbol name
1704                                                                  associated with a kernel.
1705                                                                  OpenCL runtime allocates a
1706                                                                  global buffer for the symbol
1707                                                                  and saves the kernel's address
1708                                                                  to it, which is used for
1709                                                                  device side enqueueing. Only
1710                                                                  available for device side
1711                                                                  enqueued kernels.
1712     ".kernarg_segment_size"             integer        Required  The size in bytes of
1713                                                                  the kernarg segment
1714                                                                  that holds the values
1715                                                                  of the arguments to
1716                                                                  the kernel.
1717     ".group_segment_fixed_size"         integer        Required  The amount of group
1718                                                                  segment memory
1719                                                                  required by a
1720                                                                  work-group in
1721                                                                  bytes. This does not
1722                                                                  include any
1723                                                                  dynamically allocated
1724                                                                  group segment memory
1725                                                                  that may be added
1726                                                                  when the kernel is
1727                                                                  dispatched.
1728     ".private_segment_fixed_size"       integer        Required  The amount of fixed
1729                                                                  private address space
1730                                                                  memory required for a
1731                                                                  work-item in
1732                                                                  bytes. If the kernel
1733                                                                  uses a dynamic call
1734                                                                  stack then additional
1735                                                                  space must be added
1736                                                                  to this value for the
1737                                                                  call stack.
1738     ".kernarg_segment_align"            integer        Required  The maximum byte
1739                                                                  alignment of
1740                                                                  arguments in the
1741                                                                  kernarg segment. Must
1742                                                                  be a power of 2.
1743     ".wavefront_size"                   integer        Required  Wavefront size. Must
1744                                                                  be a power of 2.
1745     ".sgpr_count"                       integer        Required  Number of scalar
1746                                                                  registers required by a
1747                                                                  wavefront for
1748                                                                  GFX6-GFX9. A register
1749                                                                  is required if it is
1750                                                                  used explicitly, or
1751                                                                  if a higher numbered
1752                                                                  register is used
1753                                                                  explicitly. This
1754                                                                  includes the special
1755                                                                  SGPRs for VCC, Flat
1756                                                                  Scratch (GFX7-GFX9)
1757                                                                  and XNACK (for
1758                                                                  GFX8-GFX9). It does
1759                                                                  not include the 16
1760                                                                  SGPR added if a trap
1761                                                                  handler is
1762                                                                  enabled. It is not
1763                                                                  rounded up to the
1764                                                                  allocation
1765                                                                  granularity.
1766     ".vgpr_count"                       integer        Required  Number of vector
1767                                                                  registers required by
1768                                                                  each work-item for
1769                                                                  GFX6-GFX9. A register
1770                                                                  is required if it is
1771                                                                  used explicitly, or
1772                                                                  if a higher numbered
1773                                                                  register is used
1774                                                                  explicitly.
1775     ".max_flat_workgroup_size"          integer        Required  Maximum flat
1776                                                                  work-group size
1777                                                                  supported by the
1778                                                                  kernel in work-items.
1779                                                                  Must be >=1 and
1780                                                                  consistent with
1781                                                                  ReqdWorkGroupSize if
1782                                                                  not 0, 0, 0.
1783     ".sgpr_spill_count"                 integer                  Number of stores from
1784                                                                  a scalar register to
1785                                                                  a register allocator
1786                                                                  created spill
1787                                                                  location.
1788     ".vgpr_spill_count"                 integer                  Number of stores from
1789                                                                  a vector register to
1790                                                                  a register allocator
1791                                                                  created spill
1792                                                                  location.
1793     =================================== ============== ========= ================================
1794
1795..
1796
1797  .. table:: AMDHSA Code Object V3 Kernel Argument Metadata Map
1798     :name: amdgpu-amdhsa-code-object-kernel-argument-metadata-map-table-v3
1799
1800     ====================== ============== ========= ================================
1801     String Key             Value Type     Required? Description
1802     ====================== ============== ========= ================================
1803     ".name"                string                   Kernel argument name.
1804     ".type_name"           string                   Kernel argument type name.
1805     ".size"                integer        Required  Kernel argument size in bytes.
1806     ".offset"              integer        Required  Kernel argument offset in
1807                                                     bytes. The offset must be a
1808                                                     multiple of the alignment
1809                                                     required by the argument.
1810     ".value_kind"          string         Required  Kernel argument kind that
1811                                                     specifies how to set up the
1812                                                     corresponding argument.
1813                                                     Values include:
1814
1815                                                     "by_value"
1816                                                       The argument is copied
1817                                                       directly into the kernarg.
1818
1819                                                     "global_buffer"
1820                                                       A global address space pointer
1821                                                       to the buffer data is passed
1822                                                       in the kernarg.
1823
1824                                                     "dynamic_shared_pointer"
1825                                                       A group address space pointer
1826                                                       to dynamically allocated LDS
1827                                                       is passed in the kernarg.
1828
1829                                                     "sampler"
1830                                                       A global address space
1831                                                       pointer to a S# is passed in
1832                                                       the kernarg.
1833
1834                                                     "image"
1835                                                       A global address space
1836                                                       pointer to a T# is passed in
1837                                                       the kernarg.
1838
1839                                                     "pipe"
1840                                                       A global address space pointer
1841                                                       to an OpenCL pipe is passed in
1842                                                       the kernarg.
1843
1844                                                     "queue"
1845                                                       A global address space pointer
1846                                                       to an OpenCL device enqueue
1847                                                       queue is passed in the
1848                                                       kernarg.
1849
1850                                                     "hidden_global_offset_x"
1851                                                       The OpenCL grid dispatch
1852                                                       global offset for the X
1853                                                       dimension is passed in the
1854                                                       kernarg.
1855
1856                                                     "hidden_global_offset_y"
1857                                                       The OpenCL grid dispatch
1858                                                       global offset for the Y
1859                                                       dimension is passed in the
1860                                                       kernarg.
1861
1862                                                     "hidden_global_offset_z"
1863                                                       The OpenCL grid dispatch
1864                                                       global offset for the Z
1865                                                       dimension is passed in the
1866                                                       kernarg.
1867
1868                                                     "hidden_none"
1869                                                       An argument that is not used
1870                                                       by the kernel. Space needs to
1871                                                       be left for it, but it does
1872                                                       not need to be set up.
1873
1874                                                     "hidden_printf_buffer"
1875                                                       A global address space pointer
1876                                                       to the runtime printf buffer
1877                                                       is passed in kernarg.
1878
1879                                                     "hidden_default_queue"
1880                                                       A global address space pointer
1881                                                       to the OpenCL device enqueue
1882                                                       queue that should be used by
1883                                                       the kernel by default is
1884                                                       passed in the kernarg.
1885
1886                                                     "hidden_completion_action"
1887                                                       A global address space pointer
1888                                                       to help link enqueued kernels into
1889                                                       the ancestor tree for determining
1890                                                       when the parent kernel has finished.
1891
1892                                                     "hidden_multigrid_sync_arg"
1893                                                       A global address space pointer for
1894                                                       multi-grid synchronization is
1895                                                       passed in the kernarg.
1896
1897     ".value_type"          string         Required  Kernel argument value type. Only
1898                                                     present if ".value_kind" is
1899                                                     "by_value". For vector data
1900                                                     types, the value is for the
1901                                                     element type. Values include:
1902
1903                                                     - "struct"
1904                                                     - "i8"
1905                                                     - "u8"
1906                                                     - "i16"
1907                                                     - "u16"
1908                                                     - "f16"
1909                                                     - "i32"
1910                                                     - "u32"
1911                                                     - "f32"
1912                                                     - "i64"
1913                                                     - "u64"
1914                                                     - "f64"
1915
1916                                                     .. TODO
1917                                                        How can it be determined if a
1918                                                        vector type, and what size
1919                                                        vector?
1920     ".pointee_align"       integer                  Alignment in bytes of pointee
1921                                                     type for pointer type kernel
1922                                                     argument. Must be a power
1923                                                     of 2. Only present if
1924                                                     ".value_kind" is
1925                                                     "dynamic_shared_pointer".
1926     ".address_space"       string                   Kernel argument address space
1927                                                     qualifier. Only present if
1928                                                     ".value_kind" is "global_buffer" or
1929                                                     "dynamic_shared_pointer". Values
1930                                                     are:
1931
1932                                                     - "private"
1933                                                     - "global"
1934                                                     - "constant"
1935                                                     - "local"
1936                                                     - "generic"
1937                                                     - "region"
1938
1939                                                     .. TODO
1940                                                        Is "global_buffer" only "global"
1941                                                        or "constant"? Is
1942                                                        "dynamic_shared_pointer" always
1943                                                        "local"? Can HCC allow "generic"?
1944                                                        How can "private" or "region"
1945                                                        ever happen?
1946     ".access"              string                   Kernel argument access
1947                                                     qualifier. Only present if
1948                                                     ".value_kind" is "image" or
1949                                                     "pipe". Values
1950                                                     are:
1951
1952                                                     - "read_only"
1953                                                     - "write_only"
1954                                                     - "read_write"
1955
1956                                                     .. TODO
1957                                                        Does this apply to
1958                                                        "global_buffer"?
1959     ".actual_access"       string                   The actual memory accesses
1960                                                     performed by the kernel on the
1961                                                     kernel argument. Only present if
1962                                                     ".value_kind" is "global_buffer",
1963                                                     "image", or "pipe". This may be
1964                                                     more restrictive than indicated
1965                                                     by ".access" to reflect what the
1966                                                     kernel actual does. If not
1967                                                     present then the runtime must
1968                                                     assume what is implied by
1969                                                     ".access" and ".is_const"      . Values
1970                                                     are:
1971
1972                                                     - "read_only"
1973                                                     - "write_only"
1974                                                     - "read_write"
1975
1976     ".is_const"            boolean                  Indicates if the kernel argument
1977                                                     is const qualified. Only present
1978                                                     if ".value_kind" is
1979                                                     "global_buffer".
1980
1981     ".is_restrict"         boolean                  Indicates if the kernel argument
1982                                                     is restrict qualified. Only
1983                                                     present if ".value_kind" is
1984                                                     "global_buffer".
1985
1986     ".is_volatile"         boolean                  Indicates if the kernel argument
1987                                                     is volatile qualified. Only
1988                                                     present if ".value_kind" is
1989                                                     "global_buffer".
1990
1991     ".is_pipe"             boolean                  Indicates if the kernel argument
1992                                                     is pipe qualified. Only present
1993                                                     if ".value_kind" is "pipe".
1994
1995                                                     .. TODO
1996                                                        Can "global_buffer" be pipe
1997                                                        qualified?
1998     ====================== ============== ========= ================================
1999
2000..
2001
2002Kernel Dispatch
2003~~~~~~~~~~~~~~~
2004
2005The HSA architected queuing language (AQL) defines a user space memory interface
2006that can be used to control the dispatch of kernels, in an agent independent
2007way. An agent can have zero or more AQL queues created for it using the ROCm
2008runtime, in which AQL packets (all of which are 64 bytes) can be placed. See the
2009*HSA Platform System Architecture Specification* [HSA]_ for the AQL queue
2010mechanics and packet layouts.
2011
2012The packet processor of a kernel agent is responsible for detecting and
2013dispatching HSA kernels from the AQL queues associated with it. For AMD GPUs the
2014packet processor is implemented by the hardware command processor (CP),
2015asynchronous dispatch controller (ADC) and shader processor input controller
2016(SPI).
2017
2018The ROCm runtime can be used to allocate an AQL queue object. It uses the kernel
2019mode driver to initialize and register the AQL queue with CP.
2020
2021To dispatch a kernel the following actions are performed. This can occur in the
2022CPU host program, or from an HSA kernel executing on a GPU.
2023
20241. A pointer to an AQL queue for the kernel agent on which the kernel is to be
2025   executed is obtained.
20262. A pointer to the kernel descriptor (see
2027   :ref:`amdgpu-amdhsa-kernel-descriptor`) of the kernel to execute is
2028   obtained. It must be for a kernel that is contained in a code object that that
2029   was loaded by the ROCm runtime on the kernel agent with which the AQL queue is
2030   associated.
20313. Space is allocated for the kernel arguments using the ROCm runtime allocator
2032   for a memory region with the kernarg property for the kernel agent that will
2033   execute the kernel. It must be at least 16 byte aligned.
20344. Kernel argument values are assigned to the kernel argument memory
2035   allocation. The layout is defined in the *HSA Programmer's Language Reference*
2036   [HSA]_. For AMDGPU the kernel execution directly accesses the kernel argument
2037   memory in the same way constant memory is accessed. (Note that the HSA
2038   specification allows an implementation to copy the kernel argument contents to
2039   another location that is accessed by the kernel.)
20405. An AQL kernel dispatch packet is created on the AQL queue. The ROCm runtime
2041   api uses 64 bit atomic operations to reserve space in the AQL queue for the
2042   packet. The packet must be set up, and the final write must use an atomic
2043   store release to set the packet kind to ensure the packet contents are
2044   visible to the kernel agent. AQL defines a doorbell signal mechanism to
2045   notify the kernel agent that the AQL queue has been updated. These rules, and
2046   the layout of the AQL queue and kernel dispatch packet is defined in the *HSA
2047   System Architecture Specification* [HSA]_.
20486. A kernel dispatch packet includes information about the actual dispatch,
2049   such as grid and work-group size, together with information from the code
2050   object about the kernel, such as segment sizes. The ROCm runtime queries on
2051   the kernel symbol can be used to obtain the code object values which are
2052   recorded in the :ref:`amdgpu-amdhsa-code-object-metadata`.
20537. CP executes micro-code and is responsible for detecting and setting up the
2054   GPU to execute the wavefronts of a kernel dispatch.
20558. CP ensures that when the a wavefront starts executing the kernel machine
2056   code, the scalar general purpose registers (SGPR) and vector general purpose
2057   registers (VGPR) are set up as required by the machine code. The required
2058   setup is defined in the :ref:`amdgpu-amdhsa-kernel-descriptor`. The initial
2059   register state is defined in
2060   :ref:`amdgpu-amdhsa-initial-kernel-execution-state`.
20619. The prolog of the kernel machine code (see
2062   :ref:`amdgpu-amdhsa-kernel-prolog`) sets up the machine state as necessary
2063   before continuing executing the machine code that corresponds to the kernel.
206410. When the kernel dispatch has completed execution, CP signals the completion
2065    signal specified in the kernel dispatch packet if not 0.
2066
2067.. _amdgpu-amdhsa-memory-spaces:
2068
2069Memory Spaces
2070~~~~~~~~~~~~~
2071
2072The memory space properties are:
2073
2074  .. table:: AMDHSA Memory Spaces
2075     :name: amdgpu-amdhsa-memory-spaces-table
2076
2077     ================= =========== ======== ======= ==================
2078     Memory Space Name HSA Segment Hardware Address NULL Value
2079                       Name        Name     Size
2080     ================= =========== ======== ======= ==================
2081     Private           private     scratch  32      0x00000000
2082     Local             group       LDS      32      0xFFFFFFFF
2083     Global            global      global   64      0x0000000000000000
2084     Constant          constant    *same as 64      0x0000000000000000
2085                                   global*
2086     Generic           flat        flat     64      0x0000000000000000
2087     Region            N/A         GDS      32      *not implemented
2088                                                    for AMDHSA*
2089     ================= =========== ======== ======= ==================
2090
2091The global and constant memory spaces both use global virtual addresses, which
2092are the same virtual address space used by the CPU. However, some virtual
2093addresses may only be accessible to the CPU, some only accessible by the GPU,
2094and some by both.
2095
2096Using the constant memory space indicates that the data will not change during
2097the execution of the kernel. This allows scalar read instructions to be
2098used. The vector and scalar L1 caches are invalidated of volatile data before
2099each kernel dispatch execution to allow constant memory to change values between
2100kernel dispatches.
2101
2102The local memory space uses the hardware Local Data Store (LDS) which is
2103automatically allocated when the hardware creates work-groups of wavefronts, and
2104freed when all the wavefronts of a work-group have terminated. The data store
2105(DS) instructions can be used to access it.
2106
2107The private memory space uses the hardware scratch memory support. If the kernel
2108uses scratch, then the hardware allocates memory that is accessed using
2109wavefront lane dword (4 byte) interleaving. The mapping used from private
2110address to physical address is:
2111
2112  ``wavefront-scratch-base +
2113  (private-address * wavefront-size * 4) +
2114  (wavefront-lane-id * 4)``
2115
2116There are different ways that the wavefront scratch base address is determined
2117by a wavefront (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). This
2118memory can be accessed in an interleaved manner using buffer instruction with
2119the scratch buffer descriptor and per wavefront scratch offset, by the scratch
2120instructions, or by flat instructions. If each lane of a wavefront accesses the
2121same private address, the interleaving results in adjacent dwords being accessed
2122and hence requires fewer cache lines to be fetched. Multi-dword access is not
2123supported except by flat and scratch instructions in GFX9-GFX10.
2124
2125The generic address space uses the hardware flat address support available in
2126GFX7-GFX10. This uses two fixed ranges of virtual addresses (the private and
2127local appertures), that are outside the range of addressible global memory, to
2128map from a flat address to a private or local address.
2129
2130FLAT instructions can take a flat address and access global, private (scratch)
2131and group (LDS) memory depending in if the address is within one of the
2132apperture ranges. Flat access to scratch requires hardware aperture setup and
2133setup in the kernel prologue (see :ref:`amdgpu-amdhsa-flat-scratch`). Flat
2134access to LDS requires hardware aperture setup and M0 (GFX7-GFX8) register setup
2135(see :ref:`amdgpu-amdhsa-m0`).
2136
2137To convert between a segment address and a flat address the base address of the
2138appertures address can be used. For GFX7-GFX8 these are available in the
2139:ref:`amdgpu-amdhsa-hsa-aql-queue` the address of which can be obtained with
2140Queue Ptr SGPR (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). For
2141GFX9-GFX10 the appature base addresses are directly available as inline constant
2142registers ``SRC_SHARED_BASE/LIMIT`` and ``SRC_PRIVATE_BASE/LIMIT``. In 64 bit
2143address mode the apperture sizes are 2^32 bytes and the base is aligned to 2^32
2144which makes it easier to convert from flat to segment or segment to flat.
2145
2146Image and Samplers
2147~~~~~~~~~~~~~~~~~~
2148
2149Image and sample handles created by the ROCm runtime are 64 bit addresses of a
2150hardware 32 byte V# and 48 byte S# object respectively. In order to support the
2151HSA ``query_sampler`` operations two extra dwords are used to store the HSA BRIG
2152enumeration values for the queries that are not trivially deducible from the S#
2153representation.
2154
2155HSA Signals
2156~~~~~~~~~~~
2157
2158HSA signal handles created by the ROCm runtime are 64 bit addresses of a
2159structure allocated in memory accessible from both the CPU and GPU. The
2160structure is defined by the ROCm runtime and subject to change between releases
2161(see [AMD-ROCm-github]_).
2162
2163.. _amdgpu-amdhsa-hsa-aql-queue:
2164
2165HSA AQL Queue
2166~~~~~~~~~~~~~
2167
2168The HSA AQL queue structure is defined by the ROCm runtime and subject to change
2169between releases (see [AMD-ROCm-github]_). For some processors it contains
2170fields needed to implement certain language features such as the flat address
2171aperture bases. It also contains fields used by CP such as managing the
2172allocation of scratch memory.
2173
2174.. _amdgpu-amdhsa-kernel-descriptor:
2175
2176Kernel Descriptor
2177~~~~~~~~~~~~~~~~~
2178
2179A kernel descriptor consists of the information needed by CP to initiate the
2180execution of a kernel, including the entry point address of the machine code
2181that implements the kernel.
2182
2183Kernel Descriptor for GFX6-GFX10
2184++++++++++++++++++++++++++++++++
2185
2186CP microcode requires the Kernel descriptor to be allocated on 64 byte
2187alignment.
2188
2189  .. table:: Kernel Descriptor for GFX6-GFX10
2190     :name: amdgpu-amdhsa-kernel-descriptor-gfx6-gfx10-table
2191
2192     ======= ======= =============================== ============================
2193     Bits    Size    Field Name                      Description
2194     ======= ======= =============================== ============================
2195     31:0    4 bytes GROUP_SEGMENT_FIXED_SIZE        The amount of fixed local
2196                                                     address space memory
2197                                                     required for a work-group
2198                                                     in bytes. This does not
2199                                                     include any dynamically
2200                                                     allocated local address
2201                                                     space memory that may be
2202                                                     added when the kernel is
2203                                                     dispatched.
2204     63:32   4 bytes PRIVATE_SEGMENT_FIXED_SIZE      The amount of fixed
2205                                                     private address space
2206                                                     memory required for a
2207                                                     work-item in bytes. If
2208                                                     is_dynamic_callstack is 1
2209                                                     then additional space must
2210                                                     be added to this value for
2211                                                     the call stack.
2212     127:64  8 bytes                                 Reserved, must be 0.
2213     191:128 8 bytes KERNEL_CODE_ENTRY_BYTE_OFFSET   Byte offset (possibly
2214                                                     negative) from base
2215                                                     address of kernel
2216                                                     descriptor to kernel's
2217                                                     entry point instruction
2218                                                     which must be 256 byte
2219                                                     aligned.
2220     351:272 20                                      Reserved, must be 0.
2221             bytes
2222     383:352 4 bytes COMPUTE_PGM_RSRC3               GFX6-9
2223                                                       Reserved, must be 0.
2224                                                     GFX10
2225                                                       Compute Shader (CS)
2226                                                       program settings used by
2227                                                       CP to set up
2228                                                       ``COMPUTE_PGM_RSRC3``
2229                                                       configuration
2230                                                       register. See
2231                                                       :ref:`amdgpu-amdhsa-compute_pgm_rsrc3-gfx10-table`.
2232     415:384 4 bytes COMPUTE_PGM_RSRC1               Compute Shader (CS)
2233                                                     program settings used by
2234                                                     CP to set up
2235                                                     ``COMPUTE_PGM_RSRC1``
2236                                                     configuration
2237                                                     register. See
2238                                                     :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
2239     447:416 4 bytes COMPUTE_PGM_RSRC2               Compute Shader (CS)
2240                                                     program settings used by
2241                                                     CP to set up
2242                                                     ``COMPUTE_PGM_RSRC2``
2243                                                     configuration
2244                                                     register. See
2245                                                     :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
2246     448     1 bit   ENABLE_SGPR_PRIVATE_SEGMENT     Enable the setup of the
2247                     _BUFFER                         SGPR user data registers
2248                                                     (see
2249                                                     :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
2250
2251                                                     The total number of SGPR
2252                                                     user data registers
2253                                                     requested must not exceed
2254                                                     16 and match value in
2255                                                     ``compute_pgm_rsrc2.user_sgpr.user_sgpr_count``.
2256                                                     Any requests beyond 16
2257                                                     will be ignored.
2258     449     1 bit   ENABLE_SGPR_DISPATCH_PTR        *see above*
2259     450     1 bit   ENABLE_SGPR_QUEUE_PTR           *see above*
2260     451     1 bit   ENABLE_SGPR_KERNARG_SEGMENT_PTR *see above*
2261     452     1 bit   ENABLE_SGPR_DISPATCH_ID         *see above*
2262     453     1 bit   ENABLE_SGPR_FLAT_SCRATCH_INIT   *see above*
2263     454     1 bit   ENABLE_SGPR_PRIVATE_SEGMENT     *see above*
2264                     _SIZE
2265     457:455 3 bits                                  Reserved, must be 0.
2266     458     1 bit   ENABLE_WAVEFRONT_SIZE32         GFX6-9
2267                                                       Reserved, must be 0.
2268                                                     GFX10
2269                                                       - If 0 execute in
2270                                                         wavefront size 64 mode.
2271                                                       - If 1 execute in
2272                                                         native wavefront size
2273                                                         32 mode.
2274     463:459 5 bits                                  Reserved, must be 0.
2275     511:464 6 bytes                                 Reserved, must be 0.
2276     512     **Total size 64 bytes.**
2277     ======= ====================================================================
2278
2279..
2280
2281  .. table:: compute_pgm_rsrc1 for GFX6-GFX10
2282     :name: amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table
2283
2284     ======= ======= =============================== ===========================================================================
2285     Bits    Size    Field Name                      Description
2286     ======= ======= =============================== ===========================================================================
2287     5:0     6 bits  GRANULATED_WORKITEM_VGPR_COUNT  Number of vector register
2288                                                     blocks used by each work-item;
2289                                                     granularity is device
2290                                                     specific:
2291
2292                                                     GFX6-GFX9
2293                                                       - vgprs_used 0..256
2294                                                       - max(0, ceil(vgprs_used / 4) - 1)
2295                                                     GFX10 (wavefront size 64)
2296                                                       - max_vgpr 1..256
2297                                                       - max(0, ceil(vgprs_used / 4) - 1)
2298                                                     GFX10 (wavefront size 32)
2299                                                       - max_vgpr 1..256
2300                                                       - max(0, ceil(vgprs_used / 8) - 1)
2301
2302                                                     Where vgprs_used is defined
2303                                                     as the highest VGPR number
2304                                                     explicitly referenced plus
2305                                                     one.
2306
2307                                                     Used by CP to set up
2308                                                     ``COMPUTE_PGM_RSRC1.VGPRS``.
2309
2310                                                     The
2311                                                     :ref:`amdgpu-assembler`
2312                                                     calculates this
2313                                                     automatically for the
2314                                                     selected processor from
2315                                                     values provided to the
2316                                                     `.amdhsa_kernel` directive
2317                                                     by the
2318                                                     `.amdhsa_next_free_vgpr`
2319                                                     nested directive (see
2320                                                     :ref:`amdhsa-kernel-directives-table`).
2321     9:6     4 bits  GRANULATED_WAVEFRONT_SGPR_COUNT Number of scalar register
2322                                                     blocks used by a wavefront;
2323                                                     granularity is device
2324                                                     specific:
2325
2326                                                     GFX6-GFX8
2327                                                       - sgprs_used 0..112
2328                                                       - max(0, ceil(sgprs_used / 8) - 1)
2329                                                     GFX9
2330                                                       - sgprs_used 0..112
2331                                                       - 2 * max(0, ceil(sgprs_used / 16) - 1)
2332                                                     GFX10
2333                                                       Reserved, must be 0.
2334                                                       (128 SGPRs always
2335                                                       allocated.)
2336
2337                                                     Where sgprs_used is
2338                                                     defined as the highest
2339                                                     SGPR number explicitly
2340                                                     referenced plus one, plus
2341                                                     a target-specific number
2342                                                     of additional special
2343                                                     SGPRs for VCC,
2344                                                     FLAT_SCRATCH (GFX7+) and
2345                                                     XNACK_MASK (GFX8+), and
2346                                                     any additional
2347                                                     target-specific
2348                                                     limitations. It does not
2349                                                     include the 16 SGPRs added
2350                                                     if a trap handler is
2351                                                     enabled.
2352
2353                                                     The target-specific
2354                                                     limitations and special
2355                                                     SGPR layout are defined in
2356                                                     the hardware
2357                                                     documentation, which can
2358                                                     be found in the
2359                                                     :ref:`amdgpu-processors`
2360                                                     table.
2361
2362                                                     Used by CP to set up
2363                                                     ``COMPUTE_PGM_RSRC1.SGPRS``.
2364
2365                                                     The
2366                                                     :ref:`amdgpu-assembler`
2367                                                     calculates this
2368                                                     automatically for the
2369                                                     selected processor from
2370                                                     values provided to the
2371                                                     `.amdhsa_kernel` directive
2372                                                     by the
2373                                                     `.amdhsa_next_free_sgpr`
2374                                                     and `.amdhsa_reserve_*`
2375                                                     nested directives (see
2376                                                     :ref:`amdhsa-kernel-directives-table`).
2377     11:10   2 bits  PRIORITY                        Must be 0.
2378
2379                                                     Start executing wavefront
2380                                                     at the specified priority.
2381
2382                                                     CP is responsible for
2383                                                     filling in
2384                                                     ``COMPUTE_PGM_RSRC1.PRIORITY``.
2385     13:12   2 bits  FLOAT_ROUND_MODE_32             Wavefront starts execution
2386                                                     with specified rounding
2387                                                     mode for single (32
2388                                                     bit) floating point
2389                                                     precision floating point
2390                                                     operations.
2391
2392                                                     Floating point rounding
2393                                                     mode values are defined in
2394                                                     :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
2395
2396                                                     Used by CP to set up
2397                                                     ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
2398     15:14   2 bits  FLOAT_ROUND_MODE_16_64          Wavefront starts execution
2399                                                     with specified rounding
2400                                                     denorm mode for half/double (16
2401                                                     and 64 bit) floating point
2402                                                     precision floating point
2403                                                     operations.
2404
2405                                                     Floating point rounding
2406                                                     mode values are defined in
2407                                                     :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
2408
2409                                                     Used by CP to set up
2410                                                     ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
2411     17:16   2 bits  FLOAT_DENORM_MODE_32            Wavefront starts execution
2412                                                     with specified denorm mode
2413                                                     for single (32
2414                                                     bit)  floating point
2415                                                     precision floating point
2416                                                     operations.
2417
2418                                                     Floating point denorm mode
2419                                                     values are defined in
2420                                                     :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
2421
2422                                                     Used by CP to set up
2423                                                     ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
2424     19:18   2 bits  FLOAT_DENORM_MODE_16_64         Wavefront starts execution
2425                                                     with specified denorm mode
2426                                                     for half/double (16
2427                                                     and 64 bit) floating point
2428                                                     precision floating point
2429                                                     operations.
2430
2431                                                     Floating point denorm mode
2432                                                     values are defined in
2433                                                     :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
2434
2435                                                     Used by CP to set up
2436                                                     ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
2437     20      1 bit   PRIV                            Must be 0.
2438
2439                                                     Start executing wavefront
2440                                                     in privilege trap handler
2441                                                     mode.
2442
2443                                                     CP is responsible for
2444                                                     filling in
2445                                                     ``COMPUTE_PGM_RSRC1.PRIV``.
2446     21      1 bit   ENABLE_DX10_CLAMP               Wavefront starts execution
2447                                                     with DX10 clamp mode
2448                                                     enabled. Used by the vector
2449                                                     ALU to force DX10 style
2450                                                     treatment of NaN's (when
2451                                                     set, clamp NaN to zero,
2452                                                     otherwise pass NaN
2453                                                     through).
2454
2455                                                     Used by CP to set up
2456                                                     ``COMPUTE_PGM_RSRC1.DX10_CLAMP``.
2457     22      1 bit   DEBUG_MODE                      Must be 0.
2458
2459                                                     Start executing wavefront
2460                                                     in single step mode.
2461
2462                                                     CP is responsible for
2463                                                     filling in
2464                                                     ``COMPUTE_PGM_RSRC1.DEBUG_MODE``.
2465     23      1 bit   ENABLE_IEEE_MODE                Wavefront starts execution
2466                                                     with IEEE mode
2467                                                     enabled. Floating point
2468                                                     opcodes that support
2469                                                     exception flag gathering
2470                                                     will quiet and propagate
2471                                                     signaling-NaN inputs per
2472                                                     IEEE 754-2008. Min_dx10 and
2473                                                     max_dx10 become IEEE
2474                                                     754-2008 compliant due to
2475                                                     signaling-NaN propagation
2476                                                     and quieting.
2477
2478                                                     Used by CP to set up
2479                                                     ``COMPUTE_PGM_RSRC1.IEEE_MODE``.
2480     24      1 bit   BULKY                           Must be 0.
2481
2482                                                     Only one work-group allowed
2483                                                     to execute on a compute
2484                                                     unit.
2485
2486                                                     CP is responsible for
2487                                                     filling in
2488                                                     ``COMPUTE_PGM_RSRC1.BULKY``.
2489     25      1 bit   CDBG_USER                       Must be 0.
2490
2491                                                     Flag that can be used to
2492                                                     control debugging code.
2493
2494                                                     CP is responsible for
2495                                                     filling in
2496                                                     ``COMPUTE_PGM_RSRC1.CDBG_USER``.
2497     26      1 bit   FP16_OVFL                       GFX6-GFX8
2498                                                       Reserved, must be 0.
2499                                                     GFX9-GFX10
2500                                                       Wavefront starts execution
2501                                                       with specified fp16 overflow
2502                                                       mode.
2503
2504                                                       - If 0, fp16 overflow generates
2505                                                         +/-INF values.
2506                                                       - If 1, fp16 overflow that is the
2507                                                         result of an +/-INF input value
2508                                                         or divide by 0 produces a +/-INF,
2509                                                         otherwise clamps computed
2510                                                         overflow to +/-MAX_FP16 as
2511                                                         appropriate.
2512
2513                                                       Used by CP to set up
2514                                                       ``COMPUTE_PGM_RSRC1.FP16_OVFL``.
2515     28:27   2 bits                                  Reserved, must be 0.
2516     29      1 bit    WGP_MODE                       GFX6-GFX9
2517                                                       Reserved, must be 0.
2518                                                     GFX10
2519                                                       - If 0 execute work-groups in
2520                                                         CU wavefront execution mode.
2521                                                       - If 1 execute work-groups on
2522                                                         in WGP wavefront execution mode.
2523
2524                                                       See :ref:`amdgpu-amdhsa-memory-model`.
2525
2526                                                       Used by CP to set up
2527                                                       ``COMPUTE_PGM_RSRC1.WGP_MODE``.
2528     30      1 bit    MEM_ORDERED                    GFX6-9
2529                                                       Reserved, must be 0.
2530                                                     GFX10
2531                                                       Controls the behavior of the
2532                                                       waitcnt's vmcnt and vscnt
2533                                                       counters.
2534
2535                                                       - If 0 vmcnt reports completion
2536                                                         of load and atomic with return
2537                                                         out of order with sample
2538                                                         instructions, and the vscnt
2539                                                         reports the completion of
2540                                                         store and atomic without
2541                                                         return in order.
2542                                                       - If 1 vmcnt reports completion
2543                                                         of load, atomic with return
2544                                                         and sample instructions in
2545                                                         order, and the vscnt reports
2546                                                         the completion of store and
2547                                                         atomic without return in order.
2548
2549                                                       Used by CP to set up
2550                                                       ``COMPUTE_PGM_RSRC1.MEM_ORDERED``.
2551     31      1 bit    FWD_PROGRESS                   GFX6-9
2552                                                       Reserved, must be 0.
2553                                                     GFX10
2554                                                       - If 0 execute SIMD wavefronts
2555                                                         using oldest first policy.
2556                                                       - If 1 execute SIMD wavefronts to
2557                                                         ensure wavefronts will make some
2558                                                         forward progress.
2559
2560                                                       Used by CP to set up
2561                                                       ``COMPUTE_PGM_RSRC1.FWD_PROGRESS``.
2562     32      **Total size 4 bytes**
2563     ======= ===================================================================================================================
2564
2565..
2566
2567  .. table:: compute_pgm_rsrc2 for GFX6-GFX10
2568     :name: amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table
2569
2570     ======= ======= =============================== ===========================================================================
2571     Bits    Size    Field Name                      Description
2572     ======= ======= =============================== ===========================================================================
2573     0       1 bit   ENABLE_SGPR_PRIVATE_SEGMENT     Enable the setup of the
2574                     _WAVEFRONT_OFFSET               SGPR wavefront scratch offset
2575                                                     system register (see
2576                                                     :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
2577
2578                                                     Used by CP to set up
2579                                                     ``COMPUTE_PGM_RSRC2.SCRATCH_EN``.
2580     5:1     5 bits  USER_SGPR_COUNT                 The total number of SGPR
2581                                                     user data registers
2582                                                     requested. This number must
2583                                                     match the number of user
2584                                                     data registers enabled.
2585
2586                                                     Used by CP to set up
2587                                                     ``COMPUTE_PGM_RSRC2.USER_SGPR``.
2588     6       1 bit   ENABLE_TRAP_HANDLER             Must be 0.
2589
2590                                                     This bit represents
2591                                                     ``COMPUTE_PGM_RSRC2.TRAP_PRESENT``,
2592                                                     which is set by the CP if
2593                                                     the runtime has installed a
2594                                                     trap handler.
2595     7       1 bit   ENABLE_SGPR_WORKGROUP_ID_X      Enable the setup of the
2596                                                     system SGPR register for
2597                                                     the work-group id in the X
2598                                                     dimension (see
2599                                                     :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
2600
2601                                                     Used by CP to set up
2602                                                     ``COMPUTE_PGM_RSRC2.TGID_X_EN``.
2603     8       1 bit   ENABLE_SGPR_WORKGROUP_ID_Y      Enable the setup of the
2604                                                     system SGPR register for
2605                                                     the work-group id in the Y
2606                                                     dimension (see
2607                                                     :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
2608
2609                                                     Used by CP to set up
2610                                                     ``COMPUTE_PGM_RSRC2.TGID_Y_EN``.
2611     9       1 bit   ENABLE_SGPR_WORKGROUP_ID_Z      Enable the setup of the
2612                                                     system SGPR register for
2613                                                     the work-group id in the Z
2614                                                     dimension (see
2615                                                     :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
2616
2617                                                     Used by CP to set up
2618                                                     ``COMPUTE_PGM_RSRC2.TGID_Z_EN``.
2619     10      1 bit   ENABLE_SGPR_WORKGROUP_INFO      Enable the setup of the
2620                                                     system SGPR register for
2621                                                     work-group information (see
2622                                                     :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
2623
2624                                                     Used by CP to set up
2625                                                     ``COMPUTE_PGM_RSRC2.TGID_SIZE_EN``.
2626     12:11   2 bits  ENABLE_VGPR_WORKITEM_ID         Enable the setup of the
2627                                                     VGPR system registers used
2628                                                     for the work-item ID.
2629                                                     :ref:`amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table`
2630                                                     defines the values.
2631
2632                                                     Used by CP to set up
2633                                                     ``COMPUTE_PGM_RSRC2.TIDIG_CMP_CNT``.
2634     13      1 bit   ENABLE_EXCEPTION_ADDRESS_WATCH  Must be 0.
2635
2636                                                     Wavefront starts execution
2637                                                     with address watch
2638                                                     exceptions enabled which
2639                                                     are generated when L1 has
2640                                                     witnessed a thread access
2641                                                     an *address of
2642                                                     interest*.
2643
2644                                                     CP is responsible for
2645                                                     filling in the address
2646                                                     watch bit in
2647                                                     ``COMPUTE_PGM_RSRC2.EXCP_EN_MSB``
2648                                                     according to what the
2649                                                     runtime requests.
2650     14      1 bit   ENABLE_EXCEPTION_MEMORY         Must be 0.
2651
2652                                                     Wavefront starts execution
2653                                                     with memory violation
2654                                                     exceptions exceptions
2655                                                     enabled which are generated
2656                                                     when a memory violation has
2657                                                     occurred for this wavefront from
2658                                                     L1 or LDS
2659                                                     (write-to-read-only-memory,
2660                                                     mis-aligned atomic, LDS
2661                                                     address out of range,
2662                                                     illegal address, etc.).
2663
2664                                                     CP sets the memory
2665                                                     violation bit in
2666                                                     ``COMPUTE_PGM_RSRC2.EXCP_EN_MSB``
2667                                                     according to what the
2668                                                     runtime requests.
2669     23:15   9 bits  GRANULATED_LDS_SIZE             Must be 0.
2670
2671                                                     CP uses the rounded value
2672                                                     from the dispatch packet,
2673                                                     not this value, as the
2674                                                     dispatch may contain
2675                                                     dynamically allocated group
2676                                                     segment memory. CP writes
2677                                                     directly to
2678                                                     ``COMPUTE_PGM_RSRC2.LDS_SIZE``.
2679
2680                                                     Amount of group segment
2681                                                     (LDS) to allocate for each
2682                                                     work-group. Granularity is
2683                                                     device specific:
2684
2685                                                     GFX6:
2686                                                       roundup(lds-size / (64 * 4))
2687                                                     GFX7-GFX10:
2688                                                       roundup(lds-size / (128 * 4))
2689
2690     24      1 bit   ENABLE_EXCEPTION_IEEE_754_FP    Wavefront starts execution
2691                     _INVALID_OPERATION              with specified exceptions
2692                                                     enabled.
2693
2694                                                     Used by CP to set up
2695                                                     ``COMPUTE_PGM_RSRC2.EXCP_EN``
2696                                                     (set from bits 0..6).
2697
2698                                                     IEEE 754 FP Invalid
2699                                                     Operation
2700     25      1 bit   ENABLE_EXCEPTION_FP_DENORMAL    FP Denormal one or more
2701                     _SOURCE                         input operands is a
2702                                                     denormal number
2703     26      1 bit   ENABLE_EXCEPTION_IEEE_754_FP    IEEE 754 FP Division by
2704                     _DIVISION_BY_ZERO               Zero
2705     27      1 bit   ENABLE_EXCEPTION_IEEE_754_FP    IEEE 754 FP FP Overflow
2706                     _OVERFLOW
2707     28      1 bit   ENABLE_EXCEPTION_IEEE_754_FP    IEEE 754 FP Underflow
2708                     _UNDERFLOW
2709     29      1 bit   ENABLE_EXCEPTION_IEEE_754_FP    IEEE 754 FP Inexact
2710                     _INEXACT
2711     30      1 bit   ENABLE_EXCEPTION_INT_DIVIDE_BY  Integer Division by Zero
2712                     _ZERO                           (rcp_iflag_f32 instruction
2713                                                     only)
2714     31      1 bit                                   Reserved, must be 0.
2715     32      **Total size 4 bytes.**
2716     ======= ===================================================================================================================
2717
2718..
2719
2720  .. table:: compute_pgm_rsrc3 for GFX10
2721     :name: amdgpu-amdhsa-compute_pgm_rsrc3-gfx10-table
2722
2723     ======= ======= =============================== ===========================================================================
2724     Bits    Size    Field Name                      Description
2725     ======= ======= =============================== ===========================================================================
2726     3:0     4 bits  SHARED_VGPR_COUNT               Number of shared VGPRs for wavefront size 64. Granularity 8. Value 0-120.
2727                                                     compute_pgm_rsrc1.vgprs + shared_vgpr_cnt cannot exceed 64.
2728     31:4    28                                      Reserved, must be 0.
2729             bits
2730     32      **Total size 4 bytes.**
2731     ======= ===================================================================================================================
2732
2733..
2734
2735  .. table:: Floating Point Rounding Mode Enumeration Values
2736     :name: amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table
2737
2738     ====================================== ===== ==============================
2739     Enumeration Name                       Value Description
2740     ====================================== ===== ==============================
2741     FLOAT_ROUND_MODE_NEAR_EVEN             0     Round Ties To Even
2742     FLOAT_ROUND_MODE_PLUS_INFINITY         1     Round Toward +infinity
2743     FLOAT_ROUND_MODE_MINUS_INFINITY        2     Round Toward -infinity
2744     FLOAT_ROUND_MODE_ZERO                  3     Round Toward 0
2745     ====================================== ===== ==============================
2746
2747..
2748
2749  .. table:: Floating Point Denorm Mode Enumeration Values
2750     :name: amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table
2751
2752     ====================================== ===== ==============================
2753     Enumeration Name                       Value Description
2754     ====================================== ===== ==============================
2755     FLOAT_DENORM_MODE_FLUSH_SRC_DST        0     Flush Source and Destination
2756                                                  Denorms
2757     FLOAT_DENORM_MODE_FLUSH_DST            1     Flush Output Denorms
2758     FLOAT_DENORM_MODE_FLUSH_SRC            2     Flush Source Denorms
2759     FLOAT_DENORM_MODE_FLUSH_NONE           3     No Flush
2760     ====================================== ===== ==============================
2761
2762..
2763
2764  .. table:: System VGPR Work-Item ID Enumeration Values
2765     :name: amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table
2766
2767     ======================================== ===== ============================
2768     Enumeration Name                         Value Description
2769     ======================================== ===== ============================
2770     SYSTEM_VGPR_WORKITEM_ID_X                0     Set work-item X dimension
2771                                                    ID.
2772     SYSTEM_VGPR_WORKITEM_ID_X_Y              1     Set work-item X and Y
2773                                                    dimensions ID.
2774     SYSTEM_VGPR_WORKITEM_ID_X_Y_Z            2     Set work-item X, Y and Z
2775                                                    dimensions ID.
2776     SYSTEM_VGPR_WORKITEM_ID_UNDEFINED        3     Undefined.
2777     ======================================== ===== ============================
2778
2779.. _amdgpu-amdhsa-initial-kernel-execution-state:
2780
2781Initial Kernel Execution State
2782~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
2783
2784This section defines the register state that will be set up by the packet
2785processor prior to the start of execution of every wavefront. This is limited by
2786the constraints of the hardware controllers of CP/ADC/SPI.
2787
2788The order of the SGPR registers is defined, but the compiler can specify which
2789ones are actually setup in the kernel descriptor using the ``enable_sgpr_*`` bit
2790fields (see :ref:`amdgpu-amdhsa-kernel-descriptor`). The register numbers used
2791for enabled registers are dense starting at SGPR0: the first enabled register is
2792SGPR0, the next enabled register is SGPR1 etc.; disabled registers do not have
2793an SGPR number.
2794
2795The initial SGPRs comprise up to 16 User SRGPs that are set by CP and apply to
2796all wavefronts of the grid. It is possible to specify more than 16 User SGPRs using
2797the ``enable_sgpr_*`` bit fields, in which case only the first 16 are actually
2798initialized. These are then immediately followed by the System SGPRs that are
2799set up by ADC/SPI and can have different values for each wavefront of the grid
2800dispatch.
2801
2802SGPR register initial state is defined in
2803:ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
2804
2805  .. table:: SGPR Register Set Up Order
2806     :name: amdgpu-amdhsa-sgpr-register-set-up-order-table
2807
2808     ========== ========================== ====== ==============================
2809     SGPR Order Name                       Number Description
2810                (kernel descriptor enable  of
2811                field)                     SGPRs
2812     ========== ========================== ====== ==============================
2813     First      Private Segment Buffer     4      V# that can be used, together
2814                (enable_sgpr_private              with Scratch Wavefront Offset
2815                _segment_buffer)                  as an offset, to access the
2816                                                  private memory space using a
2817                                                  segment address.
2818
2819                                                  CP uses the value provided by
2820                                                  the runtime.
2821     then       Dispatch Ptr               2      64 bit address of AQL dispatch
2822                (enable_sgpr_dispatch_ptr)        packet for kernel dispatch
2823                                                  actually executing.
2824     then       Queue Ptr                  2      64 bit address of amd_queue_t
2825                (enable_sgpr_queue_ptr)           object for AQL queue on which
2826                                                  the dispatch packet was
2827                                                  queued.
2828     then       Kernarg Segment Ptr        2      64 bit address of Kernarg
2829                (enable_sgpr_kernarg              segment. This is directly
2830                _segment_ptr)                     copied from the
2831                                                  kernarg_address in the kernel
2832                                                  dispatch packet.
2833
2834                                                  Having CP load it once avoids
2835                                                  loading it at the beginning of
2836                                                  every wavefront.
2837     then       Dispatch Id                2      64 bit Dispatch ID of the
2838                (enable_sgpr_dispatch_id)         dispatch packet being
2839                                                  executed.
2840     then       Flat Scratch Init          2      This is 2 SGPRs:
2841                (enable_sgpr_flat_scratch
2842                _init)                            GFX6
2843                                                    Not supported.
2844                                                  GFX7-GFX8
2845                                                    The first SGPR is a 32 bit
2846                                                    byte offset from
2847                                                    ``SH_HIDDEN_PRIVATE_BASE_VIMID``
2848                                                    to per SPI base of memory
2849                                                    for scratch for the queue
2850                                                    executing the kernel
2851                                                    dispatch. CP obtains this
2852                                                    from the runtime. (The
2853                                                    Scratch Segment Buffer base
2854                                                    address is
2855                                                    ``SH_HIDDEN_PRIVATE_BASE_VIMID``
2856                                                    plus this offset.) The value
2857                                                    of Scratch Wavefront Offset must
2858                                                    be added to this offset by
2859                                                    the kernel machine code,
2860                                                    right shifted by 8, and
2861                                                    moved to the FLAT_SCRATCH_HI
2862                                                    SGPR register.
2863                                                    FLAT_SCRATCH_HI corresponds
2864                                                    to SGPRn-4 on GFX7, and
2865                                                    SGPRn-6 on GFX8 (where SGPRn
2866                                                    is the highest numbered SGPR
2867                                                    allocated to the wavefront).
2868                                                    FLAT_SCRATCH_HI is
2869                                                    multiplied by 256 (as it is
2870                                                    in units of 256 bytes) and
2871                                                    added to
2872                                                    ``SH_HIDDEN_PRIVATE_BASE_VIMID``
2873                                                    to calculate the per wavefront
2874                                                    FLAT SCRATCH BASE in flat
2875                                                    memory instructions that
2876                                                    access the scratch
2877                                                    apperture.
2878
2879                                                    The second SGPR is 32 bit
2880                                                    byte size of a single
2881                                                    work-item's scratch memory
2882                                                    usage. CP obtains this from
2883                                                    the runtime, and it is
2884                                                    always a multiple of DWORD.
2885                                                    CP checks that the value in
2886                                                    the kernel dispatch packet
2887                                                    Private Segment Byte Size is
2888                                                    not larger, and requests the
2889                                                    runtime to increase the
2890                                                    queue's scratch size if
2891                                                    necessary. The kernel code
2892                                                    must move it to
2893                                                    FLAT_SCRATCH_LO which is
2894                                                    SGPRn-3 on GFX7 and SGPRn-5
2895                                                    on GFX8. FLAT_SCRATCH_LO is
2896                                                    used as the FLAT SCRATCH
2897                                                    SIZE in flat memory
2898                                                    instructions. Having CP load
2899                                                    it once avoids loading it at
2900                                                    the beginning of every
2901                                                    wavefront.
2902                                                  GFX9-GFX10
2903                                                    This is the
2904                                                    64 bit base address of the
2905                                                    per SPI scratch backing
2906                                                    memory managed by SPI for
2907                                                    the queue executing the
2908                                                    kernel dispatch. CP obtains
2909                                                    this from the runtime (and
2910                                                    divides it if there are
2911                                                    multiple Shader Arrays each
2912                                                    with its own SPI). The value
2913                                                    of Scratch Wavefront Offset must
2914                                                    be added by the kernel
2915                                                    machine code and the result
2916                                                    moved to the FLAT_SCRATCH
2917                                                    SGPR which is SGPRn-6 and
2918                                                    SGPRn-5. It is used as the
2919                                                    FLAT SCRATCH BASE in flat
2920                                                    memory instructions.
2921     then       Private Segment Size       1      The 32 bit byte size of a
2922                                                  (enable_sgpr_private single
2923                                                  work-item's
2924                                                  scratch_segment_size) memory
2925                                                  allocation. This is the
2926                                                  value from the kernel
2927                                                  dispatch packet Private
2928                                                  Segment Byte Size rounded up
2929                                                  by CP to a multiple of
2930                                                  DWORD.
2931
2932                                                  Having CP load it once avoids
2933                                                  loading it at the beginning of
2934                                                  every wavefront.
2935
2936                                                  This is not used for
2937                                                  GFX7-GFX8 since it is the same
2938                                                  value as the second SGPR of
2939                                                  Flat Scratch Init. However, it
2940                                                  may be needed for GFX9-GFX10 which
2941                                                  changes the meaning of the
2942                                                  Flat Scratch Init value.
2943     then       Grid Work-Group Count X    1      32 bit count of the number of
2944                (enable_sgpr_grid                 work-groups in the X dimension
2945                _workgroup_count_X)               for the grid being
2946                                                  executed. Computed from the
2947                                                  fields in the kernel dispatch
2948                                                  packet as ((grid_size.x +
2949                                                  workgroup_size.x - 1) /
2950                                                  workgroup_size.x).
2951     then       Grid Work-Group Count Y    1      32 bit count of the number of
2952                (enable_sgpr_grid                 work-groups in the Y dimension
2953                _workgroup_count_Y &&             for the grid being
2954                less than 16 previous             executed. Computed from the
2955                SGPRs)                            fields in the kernel dispatch
2956                                                  packet as ((grid_size.y +
2957                                                  workgroup_size.y - 1) /
2958                                                  workgroupSize.y).
2959
2960                                                  Only initialized if <16
2961                                                  previous SGPRs initialized.
2962     then       Grid Work-Group Count Z    1      32 bit count of the number of
2963                (enable_sgpr_grid                 work-groups in the Z dimension
2964                _workgroup_count_Z &&             for the grid being
2965                less than 16 previous             executed. Computed from the
2966                SGPRs)                            fields in the kernel dispatch
2967                                                  packet as ((grid_size.z +
2968                                                  workgroup_size.z - 1) /
2969                                                  workgroupSize.z).
2970
2971                                                  Only initialized if <16
2972                                                  previous SGPRs initialized.
2973     then       Work-Group Id X            1      32 bit work-group id in X
2974                (enable_sgpr_workgroup_id         dimension of grid for
2975                _X)                               wavefront.
2976     then       Work-Group Id Y            1      32 bit work-group id in Y
2977                (enable_sgpr_workgroup_id         dimension of grid for
2978                _Y)                               wavefront.
2979     then       Work-Group Id Z            1      32 bit work-group id in Z
2980                (enable_sgpr_workgroup_id         dimension of grid for
2981                _Z)                               wavefront.
2982     then       Work-Group Info            1      {first_wavefront, 14'b0000,
2983                (enable_sgpr_workgroup            ordered_append_term[10:0],
2984                _info)                            threadgroup_size_in_wavefronts[5:0]}
2985     then       Scratch Wavefront Offset   1      32 bit byte offset from base
2986                (enable_sgpr_private              of scratch base of queue
2987                _segment_wavefront_offset)        executing the kernel
2988                                                  dispatch. Must be used as an
2989                                                  offset with Private
2990                                                  segment address when using
2991                                                  Scratch Segment Buffer. It
2992                                                  must be used to set up FLAT
2993                                                  SCRATCH for flat addressing
2994                                                  (see
2995                                                  :ref:`amdgpu-amdhsa-flat-scratch`).
2996     ========== ========================== ====== ==============================
2997
2998The order of the VGPR registers is defined, but the compiler can specify which
2999ones are actually setup in the kernel descriptor using the ``enable_vgpr*`` bit
3000fields (see :ref:`amdgpu-amdhsa-kernel-descriptor`). The register numbers used
3001for enabled registers are dense starting at VGPR0: the first enabled register is
3002VGPR0, the next enabled register is VGPR1 etc.; disabled registers do not have a
3003VGPR number.
3004
3005VGPR register initial state is defined in
3006:ref:`amdgpu-amdhsa-vgpr-register-set-up-order-table`.
3007
3008  .. table:: VGPR Register Set Up Order
3009     :name: amdgpu-amdhsa-vgpr-register-set-up-order-table
3010
3011     ========== ========================== ====== ==============================
3012     VGPR Order Name                       Number Description
3013                (kernel descriptor enable  of
3014                field)                     VGPRs
3015     ========== ========================== ====== ==============================
3016     First      Work-Item Id X             1      32 bit work item id in X
3017                (Always initialized)              dimension of work-group for
3018                                                  wavefront lane.
3019     then       Work-Item Id Y             1      32 bit work item id in Y
3020                (enable_vgpr_workitem_id          dimension of work-group for
3021                > 0)                              wavefront lane.
3022     then       Work-Item Id Z             1      32 bit work item id in Z
3023                (enable_vgpr_workitem_id          dimension of work-group for
3024                > 1)                              wavefront lane.
3025     ========== ========================== ====== ==============================
3026
3027The setting of registers is done by GPU CP/ADC/SPI hardware as follows:
3028
30291. SGPRs before the Work-Group Ids are set by CP using the 16 User Data
3030   registers.
30312. Work-group Id registers X, Y, Z are set by ADC which supports any
3032   combination including none.
30333. Scratch Wavefront Offset is set by SPI in a per wavefront basis which is why
3034   its value cannot included with the flat scratch init value which is per queue.
30354. The VGPRs are set by SPI which only supports specifying either (X), (X, Y)
3036   or (X, Y, Z).
3037
3038Flat Scratch register pair are adjacent SGRRs so they can be moved as a 64 bit
3039value to the hardware required SGPRn-3 and SGPRn-4 respectively.
3040
3041The global segment can be accessed either using buffer instructions (GFX6 which
3042has V# 64 bit address support), flat instructions (GFX7-GFX10), or global
3043instructions (GFX9-GFX10).
3044
3045If buffer operations are used then the compiler can generate a V# with the
3046following properties:
3047
3048* base address of 0
3049* no swizzle
3050* ATC: 1 if IOMMU present (such as APU)
3051* ptr64: 1
3052* MTYPE set to support memory coherence that matches the runtime (such as CC for
3053  APU and NC for dGPU).
3054
3055.. _amdgpu-amdhsa-kernel-prolog:
3056
3057Kernel Prolog
3058~~~~~~~~~~~~~
3059
3060.. _amdgpu-amdhsa-m0:
3061
3062M0
3063++
3064
3065GFX6-GFX8
3066  The M0 register must be initialized with a value at least the total LDS size
3067  if the kernel may access LDS via DS or flat operations. Total LDS size is
3068  available in dispatch packet. For M0, it is also possible to use maximum
3069  possible value of LDS for given target (0x7FFF for GFX6 and 0xFFFF for
3070  GFX7-GFX8).
3071GFX9-GFX10
3072  The M0 register is not used for range checking LDS accesses and so does not
3073  need to be initialized in the prolog.
3074
3075.. _amdgpu-amdhsa-flat-scratch:
3076
3077Flat Scratch
3078++++++++++++
3079
3080If the kernel may use flat operations to access scratch memory, the prolog code
3081must set up FLAT_SCRATCH register pair (FLAT_SCRATCH_LO/FLAT_SCRATCH_HI which
3082are in SGPRn-4/SGPRn-3). Initialization uses Flat Scratch Init and Scratch Wavefront
3083Offset SGPR registers (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`):
3084
3085GFX6
3086  Flat scratch is not supported.
3087
3088GFX7-GFX8
3089  1. The low word of Flat Scratch Init is 32 bit byte offset from
3090     ``SH_HIDDEN_PRIVATE_BASE_VIMID`` to the base of scratch backing memory
3091     being managed by SPI for the queue executing the kernel dispatch. This is
3092     the same value used in the Scratch Segment Buffer V# base address. The
3093     prolog must add the value of Scratch Wavefront Offset to get the wavefront's byte
3094     scratch backing memory offset from ``SH_HIDDEN_PRIVATE_BASE_VIMID``. Since
3095     FLAT_SCRATCH_LO is in units of 256 bytes, the offset must be right shifted
3096     by 8 before moving into FLAT_SCRATCH_LO.
3097  2. The second word of Flat Scratch Init is 32 bit byte size of a single
3098     work-items scratch memory usage. This is directly loaded from the kernel
3099     dispatch packet Private Segment Byte Size and rounded up to a multiple of
3100     DWORD. Having CP load it once avoids loading it at the beginning of every
3101     wavefront. The prolog must move it to FLAT_SCRATCH_LO for use as FLAT SCRATCH
3102     SIZE.
3103
3104GFX9-GFX10
3105  The Flat Scratch Init is the 64 bit address of the base of scratch backing
3106  memory being managed by SPI for the queue executing the kernel dispatch. The
3107  prolog must add the value of Scratch Wavefront Offset and moved to the FLAT_SCRATCH
3108  pair for use as the flat scratch base in flat memory instructions.
3109
3110.. _amdgpu-amdhsa-memory-model:
3111
3112Memory Model
3113~~~~~~~~~~~~
3114
3115This section describes the mapping of LLVM memory model onto AMDGPU machine code
3116(see :ref:`memmodel`). *The implementation is WIP.*
3117
3118.. TODO
3119   Update when implementation complete.
3120
3121The AMDGPU backend supports the memory synchronization scopes specified in
3122:ref:`amdgpu-memory-scopes`.
3123
3124The code sequences used to implement the memory model are defined in table
3125:ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx10-table`.
3126
3127The sequences specify the order of instructions that a single thread must
3128execute. The ``s_waitcnt`` and ``buffer_wbinvl1_vol`` are defined with respect
3129to other memory instructions executed by the same thread. This allows them to be
3130moved earlier or later which can allow them to be combined with other instances
3131of the same instruction, or hoisted/sunk out of loops to improve
3132performance. Only the instructions related to the memory model are given;
3133additional ``s_waitcnt`` instructions are required to ensure registers are
3134defined before being used. These may be able to be combined with the memory
3135model ``s_waitcnt`` instructions as described above.
3136
3137The AMDGPU backend supports the following memory models:
3138
3139  HSA Memory Model [HSA]_
3140    The HSA memory model uses a single happens-before relation for all address
3141    spaces (see :ref:`amdgpu-address-spaces`).
3142  OpenCL Memory Model [OpenCL]_
3143    The OpenCL memory model which has separate happens-before relations for the
3144    global and local address spaces. Only a fence specifying both global and
3145    local address space, and seq_cst instructions join the relationships. Since
3146    the LLVM ``memfence`` instruction does not allow an address space to be
3147    specified the OpenCL fence has to convervatively assume both local and
3148    global address space was specified. However, optimizations can often be
3149    done to eliminate the additional ``s_waitcnt`` instructions when there are
3150    no intervening memory instructions which access the corresponding address
3151    space. The code sequences in the table indicate what can be omitted for the
3152    OpenCL memory. The target triple environment is used to determine if the
3153    source language is OpenCL (see :ref:`amdgpu-opencl`).
3154
3155``ds/flat_load/store/atomic`` instructions to local memory are termed LDS
3156operations.
3157
3158``buffer/global/flat_load/store/atomic`` instructions to global memory are
3159termed vector memory operations.
3160
3161For GFX6-GFX9:
3162
3163* Each agent has multiple shader arrays (SA).
3164* Each SA has multiple compute units (CU).
3165* Each CU has multiple SIMDs that execute wavefronts.
3166* The wavefronts for a single work-group are executed in the same CU but may be
3167  executed by different SIMDs.
3168* Each CU has a single LDS memory shared by the wavefronts of the work-groups
3169  executing on it.
3170* All LDS operations of a CU are performed as wavefront wide operations in a
3171  global order and involve no caching. Completion is reported to a wavefront in
3172  execution order.
3173* The LDS memory has multiple request queues shared by the SIMDs of a
3174  CU. Therefore, the LDS operations performed by different wavefronts of a work-group
3175  can be reordered relative to each other, which can result in reordering the
3176  visibility of vector memory operations with respect to LDS operations of other
3177  wavefronts in the same work-group. A ``s_waitcnt lgkmcnt(0)`` is required to
3178  ensure synchronization between LDS operations and vector memory operations
3179  between wavefronts of a work-group, but not between operations performed by the
3180  same wavefront.
3181* The vector memory operations are performed as wavefront wide operations and
3182  completion is reported to a wavefront in execution order. The exception is
3183  that for GFX7-GFX9 ``flat_load/store/atomic`` instructions can report out of
3184  vector memory order if they access LDS memory, and out of LDS operation order
3185  if they access global memory.
3186* The vector memory operations access a single vector L1 cache shared by all
3187  SIMDs a CU. Therefore, no special action is required for coherence between the
3188  lanes of a single wavefront, or for coherence between wavefronts in the same
3189  work-group. A ``buffer_wbinvl1_vol`` is required for coherence between wavefronts
3190  executing in different work-groups as they may be executing on different CUs.
3191* The scalar memory operations access a scalar L1 cache shared by all wavefronts
3192  on a group of CUs. The scalar and vector L1 caches are not coherent. However,
3193  scalar operations are used in a restricted way so do not impact the memory
3194  model. See :ref:`amdgpu-amdhsa-memory-spaces`.
3195* The vector and scalar memory operations use an L2 cache shared by all CUs on
3196  the same agent.
3197* The L2 cache has independent channels to service disjoint ranges of virtual
3198  addresses.
3199* Each CU has a separate request queue per channel. Therefore, the vector and
3200  scalar memory operations performed by wavefronts executing in different work-groups
3201  (which may be executing on different CUs) of an agent can be reordered
3202  relative to each other. A ``s_waitcnt vmcnt(0)`` is required to ensure
3203  synchronization between vector memory operations of different CUs. It ensures a
3204  previous vector memory operation has completed before executing a subsequent
3205  vector memory or LDS operation and so can be used to meet the requirements of
3206  acquire and release.
3207* The L2 cache can be kept coherent with other agents on some targets, or ranges
3208  of virtual addresses can be set up to bypass it to ensure system coherence.
3209
3210For GFX10:
3211
3212* Each agent has multiple shader arrays (SA).
3213* Each SA has multiple work-group processors (WGP).
3214* Each WGP has multiple compute units (CU).
3215* Each CU has multiple SIMDs that execute wavefronts.
3216* The wavefronts for a single work-group are executed in the same
3217  WGP. In CU wavefront execution mode the wavefronts may be executed by
3218  different SIMDs in the same CU. In WGP wavefront execution mode the
3219  wavefronts may be executed by different SIMDs in different CUs in the same
3220  WGP.
3221* Each WGP has a single LDS memory shared by the wavefronts of the work-groups
3222  executing on it.
3223* All LDS operations of a WGP are performed as wavefront wide operations in a
3224  global order and involve no caching. Completion is reported to a wavefront in
3225  execution order.
3226* The LDS memory has multiple request queues shared by the SIMDs of a
3227  WGP. Therefore, the LDS operations performed by different wavefronts of a work-group
3228  can be reordered relative to each other, which can result in reordering the
3229  visibility of vector memory operations with respect to LDS operations of other
3230  wavefronts in the same work-group. A ``s_waitcnt lgkmcnt(0)`` is required to
3231  ensure synchronization between LDS operations and vector memory operations
3232  between wavefronts of a work-group, but not between operations performed by the
3233  same wavefront.
3234* The vector memory operations are performed as wavefront wide operations.
3235  Completion of load/store/sample operations are reported to a wavefront in
3236  execution order of other load/store/sample operations performed by that
3237  wavefront.
3238* The vector memory operations access a vector L0 cache. There is a single L0
3239  cache per CU. Each SIMD of a CU accesses the same L0 cache.
3240  Therefore, no special action is required for coherence between the lanes of a
3241  single wavefront. However, a ``BUFFER_GL0_INV`` is required for coherence
3242  between wavefronts executing in the same work-group as they may be executing on
3243  SIMDs of different CUs that access different L0s. A ``BUFFER_GL0_INV`` is also
3244  required for coherence between wavefronts executing in different work-groups as
3245  they may be executing on different WGPs.
3246* The scalar memory operations access a scalar L0 cache shared by all wavefronts
3247  on a WGP. The scalar and vector L0 caches are not coherent. However, scalar
3248  operations are used in a restricted way so do not impact the memory model. See
3249  :ref:`amdgpu-amdhsa-memory-spaces`.
3250* The vector and scalar memory L0 caches use an L1 cache shared by all WGPs on
3251  the same SA. Therefore, no special action is required for coherence between
3252  the wavefronts of a single work-group. However, a ``BUFFER_GL1_INV`` is
3253  required for coherence between wavefronts executing in different work-groups as
3254  they may be executing on different SAs that access different L1s.
3255* The L1 caches have independent quadrants to service disjoint ranges of virtual
3256  addresses.
3257* Each L0 cache has a separate request queue per L1 quadrant. Therefore, the
3258  vector and scalar memory operations performed by different wavefronts, whether
3259  executing in the same or different work-groups (which may be executing on
3260  different CUs accessing different L0s), can be reordered relative to each
3261  other. A ``s_waitcnt vmcnt(0) & vscnt(0)`` is required to ensure synchronization
3262  between vector memory operations of different wavefronts. It ensures a previous
3263  vector memory operation has completed before executing a subsequent vector
3264  memory or LDS operation and so can be used to meet the requirements of acquire,
3265  release and sequential consistency.
3266* The L1 caches use an L2 cache shared by all SAs on the same agent.
3267* The L2 cache has independent channels to service disjoint ranges of virtual
3268  addresses.
3269* Each L1 quadrant of a single SA accesses a different L2 channel. Each L1
3270  quadrant has a separate request queue per L2 channel. Therefore, the vector
3271  and scalar memory operations performed by wavefronts executing in different
3272  work-groups (which may be executing on different SAs) of an agent can be
3273  reordered relative to each other. A ``s_waitcnt vmcnt(0) & vscnt(0)`` is
3274  required to ensure synchronization between vector memory operations of
3275  different SAs. It ensures a previous vector memory operation has completed
3276  before executing a subsequent vector memory and so can be used to meet the
3277  requirements of acquire, release and sequential consistency.
3278* The L2 cache can be kept coherent with other agents on some targets, or ranges
3279  of virtual addresses can be set up to bypass it to ensure system coherence.
3280
3281Private address space uses ``buffer_load/store`` using the scratch V# (GFX6-GFX8),
3282or ``scratch_load/store`` (GFX9-GFX10). Since only a single thread is accessing the
3283memory, atomic memory orderings are not meaningful and all accesses are treated
3284as non-atomic.
3285
3286Constant address space uses ``buffer/global_load`` instructions (or equivalent
3287scalar memory instructions). Since the constant address space contents do not
3288change during the execution of a kernel dispatch it is not legal to perform
3289stores, and atomic memory orderings are not meaningful and all access are
3290treated as non-atomic.
3291
3292A memory synchronization scope wider than work-group is not meaningful for the
3293group (LDS) address space and is treated as work-group.
3294
3295The memory model does not support the region address space which is treated as
3296non-atomic.
3297
3298Acquire memory ordering is not meaningful on store atomic instructions and is
3299treated as non-atomic.
3300
3301Release memory ordering is not meaningful on load atomic instructions and is
3302treated a non-atomic.
3303
3304Acquire-release memory ordering is not meaningful on load or store atomic
3305instructions and is treated as acquire and release respectively.
3306
3307AMDGPU backend only uses scalar memory operations to access memory that is
3308proven to not change during the execution of the kernel dispatch. This includes
3309constant address space and global address space for program scope const
3310variables. Therefore the kernel machine code does not have to maintain the
3311scalar L1 cache to ensure it is coherent with the vector L1 cache. The scalar
3312and vector L1 caches are invalidated between kernel dispatches by CP since
3313constant address space data may change between kernel dispatch executions. See
3314:ref:`amdgpu-amdhsa-memory-spaces`.
3315
3316The one execption is if scalar writes are used to spill SGPR registers. In this
3317case the AMDGPU backend ensures the memory location used to spill is never
3318accessed by vector memory operations at the same time. If scalar writes are used
3319then a ``s_dcache_wb`` is inserted before the ``s_endpgm`` and before a function
3320return since the locations may be used for vector memory instructions by a
3321future wavefront that uses the same scratch area, or a function call that creates a
3322frame at the same address, respectively. There is no need for a ``s_dcache_inv``
3323as all scalar writes are write-before-read in the same thread.
3324
3325For GFX6-GFX9, scratch backing memory (which is used for the private address space)
3326is accessed with MTYPE NC_NV (non-coherenent non-volatile). Since the private
3327address space is only accessed by a single thread, and is always
3328write-before-read, there is never a need to invalidate these entries from the L1
3329cache. Hence all cache invalidates are done as ``*_vol`` to only invalidate the
3330volatile cache lines.
3331
3332For GFX10, scratch backing memory (which is used for the private address space)
3333is accessed with MTYPE NC (non-coherenent). Since the private address space is
3334only accessed by a single thread, and is always write-before-read, there is
3335never a need to invalidate these entries from the L0 or L1 caches.
3336
3337For GFX10, wavefronts are executed in native mode with in-order reporting of loads
3338and sample instructions. In this mode vmcnt reports completion of load, atomic
3339with return and sample instructions in order, and the vscnt reports the
3340completion of store and atomic without return in order. See ``MEM_ORDERED`` field
3341in :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
3342
3343In GFX10, wavefronts can be executed in WGP or CU wavefront execution mode:
3344
3345* In WGP wavefront execution mode the wavefronts of a work-group are executed
3346  on the SIMDs of both CUs of the WGP. Therefore, explicit management of the per
3347  CU L0 caches is required for work-group synchronization. Also accesses to L1 at
3348  work-group scope need to be expicitly ordered as the accesses from different
3349  CUs are not ordered.
3350* In CU wavefront execution mode the wavefronts of a work-group are executed on
3351  the SIMDs of a single CU of the WGP. Therefore, all global memory access by
3352  the work-group access the same L0 which in turn ensures L1 accesses are
3353  ordered and so do not require explicit management of the caches for
3354  work-group synchronization.
3355
3356See ``WGP_MODE`` field in :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`
3357and :ref:`amdgpu-target-features`.
3358
3359On dGPU the kernarg backing memory is accessed as UC (uncached) to avoid needing
3360to invalidate the L2 cache. For GFX6-GFX9, this also causes it to be treated as
3361non-volatile and so is not invalidated by ``*_vol``. On APU it is accessed as CC
3362(cache coherent) and so the L2 cache will be coherent with the CPU and other
3363agents.
3364
3365  .. table:: AMDHSA Memory Model Code Sequences GFX6-GFX10
3366     :name: amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx10-table
3367
3368     ============ ============ ============== ========== =============================== ==================================
3369     LLVM Instr   LLVM Memory  LLVM Memory    AMDGPU     AMDGPU Machine Code             AMDGPU Machine Code
3370                  Ordering     Sync Scope     Address    GFX6-9                          GFX10
3371                                              Space
3372     ============ ============ ============== ========== =============================== ==================================
3373     **Non-Atomic**
3374     ----------------------------------------------------------------------------------------------------------------------
3375     load         *none*       *none*         - global   - !volatile & !nontemporal      - !volatile & !nontemporal
3376                                              - generic
3377                                              - private    1. buffer/global/flat_load      1. buffer/global/flat_load
3378                                              - constant
3379                                                         - volatile & !nontemporal       - volatile & !nontemporal
3380
3381                                                           1. buffer/global/flat_load      1. buffer/global/flat_load
3382                                                              glc=1                           glc=1 dlc=1
3383
3384                                                         - nontemporal                   - nontemporal
3385
3386                                                           1. buffer/global/flat_load      1. buffer/global/flat_load
3387                                                              glc=1 slc=1                     slc=1
3388
3389     load         *none*       *none*         - local    1. ds_load                      1. ds_load
3390     store        *none*       *none*         - global   - !nontemporal                  - !nontemporal
3391                                              - generic
3392                                              - private    1. buffer/global/flat_store     1. buffer/global/flat_store
3393                                              - constant
3394                                                         - nontemporal                   - nontemporal
3395
3396                                                           1. buffer/global/flat_stote      1. buffer/global/flat_store
3397                                                              glc=1 slc=1                      slc=1
3398
3399     store        *none*       *none*         - local    1. ds_store                     1. ds_store
3400     **Unordered Atomic**
3401     ----------------------------------------------------------------------------------------------------------------------
3402     load atomic  unordered    *any*          *any*      *Same as non-atomic*.           *Same as non-atomic*.
3403     store atomic unordered    *any*          *any*      *Same as non-atomic*.           *Same as non-atomic*.
3404     atomicrmw    unordered    *any*          *any*      *Same as monotonic              *Same as monotonic
3405                                                         atomic*.                        atomic*.
3406     **Monotonic Atomic**
3407     ----------------------------------------------------------------------------------------------------------------------
3408     load atomic  monotonic    - singlethread - global   1. buffer/global/flat_load      1. buffer/global/flat_load
3409                               - wavefront    - generic
3410     load atomic  monotonic    - workgroup    - global   1. buffer/global/flat_load      1. buffer/global/flat_load
3411                                              - generic                                     glc=1
3412
3413                                                                                           - If CU wavefront execution mode, omit glc=1.
3414
3415     load atomic  monotonic    - singlethread - local    1. ds_load                      1. ds_load
3416                               - wavefront
3417                               - workgroup
3418     load atomic  monotonic    - agent        - global   1. buffer/global/flat_load      1. buffer/global/flat_load
3419                               - system       - generic     glc=1                           glc=1 dlc=1
3420     store atomic monotonic    - singlethread - global   1. buffer/global/flat_store     1. buffer/global/flat_store
3421                               - wavefront    - generic
3422                               - workgroup
3423                               - agent
3424                               - system
3425     store atomic monotonic    - singlethread - local    1. ds_store                     1. ds_store
3426                               - wavefront
3427                               - workgroup
3428     atomicrmw    monotonic    - singlethread - global   1. buffer/global/flat_atomic    1. buffer/global/flat_atomic
3429                               - wavefront    - generic
3430                               - workgroup
3431                               - agent
3432                               - system
3433     atomicrmw    monotonic    - singlethread - local    1. ds_atomic                    1. ds_atomic
3434                               - wavefront
3435                               - workgroup
3436     **Acquire Atomic**
3437     ----------------------------------------------------------------------------------------------------------------------
3438     load atomic  acquire      - singlethread - global   1. buffer/global/ds/flat_load   1. buffer/global/ds/flat_load
3439                               - wavefront    - local
3440                                              - generic
3441     load atomic  acquire      - workgroup    - global   1. buffer/global/flat_load      1. buffer/global_load glc=1
3442
3443                                                                                           - If CU wavefront execution mode, omit glc=1.
3444
3445                                                                                         2. s_waitcnt vmcnt(0)
3446
3447                                                                                           - If CU wavefront execution mode, omit.
3448                                                                                           - Must happen before
3449                                                                                             the following buffer_gl0_inv
3450                                                                                             and before any following
3451                                                                                             global/generic
3452                                                                                             load/load
3453                                                                                             atomic/stote/store
3454                                                                                             atomic/atomicrmw.
3455
3456                                                                                         3. buffer_gl0_inv
3457
3458                                                                                           - If CU wavefront execution mode, omit.
3459                                                                                           - Ensures that
3460                                                                                             following
3461                                                                                             loads will not see
3462                                                                                             stale data.
3463
3464     load atomic  acquire      - workgroup    - local    1. ds_load                      1. ds_load
3465                                                         2. s_waitcnt lgkmcnt(0)         2. s_waitcnt lgkmcnt(0)
3466
3467                                                           - If OpenCL, omit.              - If OpenCL, omit.
3468                                                           - Must happen before            - Must happen before
3469                                                             any following                   the following buffer_gl0_inv
3470                                                             global/generic                  and before any following
3471                                                             load/load                       global/generic load/load
3472                                                             atomic/store/store              atomic/store/store
3473                                                             atomic/atomicrmw.               atomic/atomicrmw.
3474                                                           - Ensures any                   - Ensures any
3475                                                             following global                following global
3476                                                             data read is no                 data read is no
3477                                                             older than the load             older than the load
3478                                                             atomic value being              atomic value being
3479                                                             acquired.                       acquired.
3480
3481                                                                                         3. buffer_gl0_inv
3482
3483                                                                                           - If CU wavefront execution mode, omit.
3484                                                                                           - If OpenCL, omit.
3485                                                                                           - Ensures that
3486                                                                                             following
3487                                                                                             loads will not see
3488                                                                                             stale data.
3489
3490     load atomic  acquire      - workgroup    - generic  1. flat_load                    1. flat_load glc=1
3491
3492                                                                                           - If CU wavefront execution mode, omit glc=1.
3493
3494                                                         2. s_waitcnt lgkmcnt(0)         2. s_waitcnt lgkmcnt(0) &
3495                                                                                            vmcnt(0)
3496
3497                                                                                           - If CU wavefront execution mode, omit vmcnt.
3498                                                           - If OpenCL, omit.              - If OpenCL, omit
3499                                                                                             lgkmcnt(0).
3500                                                           - Must happen before            - Must happen before
3501                                                             any following                   the following
3502                                                             global/generic                  buffer_gl0_inv and any
3503                                                             load/load                       following global/generic
3504                                                             atomic/store/store              load/load
3505                                                             atomic/atomicrmw.               atomic/store/store
3506                                                                                             atomic/atomicrmw.
3507                                                           - Ensures any                   - Ensures any
3508                                                             following global                following global
3509                                                             data read is no                 data read is no
3510                                                             older than the load             older than the load
3511                                                             atomic value being              atomic value being
3512                                                             acquired.                       acquired.
3513
3514                                                                                         3. buffer_gl0_inv
3515
3516                                                                                           - If CU wavefront execution mode, omit.
3517                                                                                           - Ensures that
3518                                                                                             following
3519                                                                                             loads will not see
3520                                                                                             stale data.
3521
3522     load atomic  acquire      - agent        - global   1. buffer/global/flat_load      1. buffer/global_load
3523                               - system                     glc=1                           glc=1 dlc=1
3524                                                         2. s_waitcnt vmcnt(0)           2. s_waitcnt vmcnt(0)
3525
3526                                                           - Must happen before            - Must happen before
3527                                                             following                       following
3528                                                             buffer_wbinvl1_vol.             buffer_gl*_inv.
3529                                                           - Ensures the load              - Ensures the load
3530                                                             has completed                   has completed
3531                                                             before invalidating             before invalidating
3532                                                             the cache.                      the caches.
3533
3534                                                         3. buffer_wbinvl1_vol           3. buffer_gl0_inv;
3535                                                                                            buffer_gl1_inv
3536
3537                                                           - Must happen before            - Must happen before
3538                                                             any following                   any following
3539                                                             global/generic                  global/generic
3540                                                             load/load                       load/load
3541                                                             atomic/atomicrmw.               atomic/atomicrmw.
3542                                                           - Ensures that                  - Ensures that
3543                                                             following                       following
3544                                                             loads will not see              loads will not see
3545                                                             stale global data.              stale global data.
3546
3547     load atomic  acquire      - agent        - generic  1. flat_load glc=1              1. flat_load glc=1 dlc=1
3548                               - system                  2. s_waitcnt vmcnt(0) &         2. s_waitcnt vmcnt(0) &
3549                                                            lgkmcnt(0)                      lgkmcnt(0)
3550
3551                                                           - If OpenCL omit                - If OpenCL omit
3552                                                             lgkmcnt(0).                     lgkmcnt(0).
3553                                                           - Must happen before            - Must happen before
3554                                                             following                       following
3555                                                             buffer_wbinvl1_vol.             buffer_gl*_invl.
3556                                                           - Ensures the flat_load         - Ensures the flat_load
3557                                                             has completed                   has completed
3558                                                             before invalidating             before invalidating
3559                                                             the cache.                      the caches.
3560
3561                                                         3. buffer_wbinvl1_vol           3. buffer_gl0_inv;
3562                                                                                            buffer_gl1_inv
3563
3564                                                           - Must happen before            - Must happen before
3565                                                             any following                   any following
3566                                                             global/generic                  global/generic
3567                                                             load/load                       load/load
3568                                                             atomic/atomicrmw.               atomic/atomicrmw.
3569                                                           - Ensures that                  - Ensures that
3570                                                             following loads                 following loads
3571                                                             will not see stale              will not see stale
3572                                                             global data.                    global data.
3573
3574     atomicrmw    acquire      - singlethread - global   1. buffer/global/ds/flat_atomic 1. buffer/global/ds/flat_atomic
3575                               - wavefront    - local
3576                                              - generic
3577     atomicrmw    acquire      - workgroup    - global   1. buffer/global/flat_atomic    1. buffer/global_atomic
3578                                                                                         2. s_waitcnt vm/vscnt(0)
3579
3580                                                                                           - If CU wavefront execution mode, omit.
3581                                                                                           - Use vmcnt if atomic with
3582                                                                                             return and vscnt if atomic
3583                                                                                             with no-return.
3584                                                                                           - Must happen before
3585                                                                                             the following buffer_gl0_inv
3586                                                                                             and before any following
3587                                                                                             global/generic
3588                                                                                             load/load
3589                                                                                             atomic/stote/store
3590                                                                                             atomic/atomicrmw.
3591
3592                                                                                         3. buffer_gl0_inv
3593
3594                                                                                           - If CU wavefront execution mode, omit.
3595                                                                                           - Ensures that
3596                                                                                             following
3597                                                                                             loads will not see
3598                                                                                             stale data.
3599
3600     atomicrmw    acquire      - workgroup    - local    1. ds_atomic                    1. ds_atomic
3601                                                         2. waitcnt lgkmcnt(0)           2. waitcnt lgkmcnt(0)
3602
3603                                                           - If OpenCL, omit.              - If OpenCL, omit.
3604                                                           - Must happen before            - Must happen before
3605                                                             any following                   the following
3606                                                             global/generic                  buffer_gl0_inv.
3607                                                             load/load
3608                                                             atomic/store/store
3609                                                             atomic/atomicrmw.
3610                                                           - Ensures any                   - Ensures any
3611                                                             following global                following global
3612                                                             data read is no                 data read is no
3613                                                             older than the                  older than the
3614                                                             atomicrmw value                 atomicrmw value
3615                                                             being acquired.                 being acquired.
3616
3617                                                                                         3. buffer_gl0_inv
3618
3619                                                                                           - If OpenCL omit.
3620                                                                                           - Ensures that
3621                                                                                             following
3622                                                                                             loads will not see
3623                                                                                             stale data.
3624
3625     atomicrmw    acquire      - workgroup    - generic  1. flat_atomic                  1. flat_atomic
3626                                                         2. waitcnt lgkmcnt(0)           2. waitcnt lgkmcnt(0) &
3627                                                                                            vm/vscnt(0)
3628
3629                                                                                           - If CU wavefront execution mode, omit vm/vscnt.
3630                                                           - If OpenCL, omit.              - If OpenCL, omit
3631                                                                                             waitcnt lgkmcnt(0)..
3632                                                                                           - Use vmcnt if atomic with
3633                                                                                             return and vscnt if atomic
3634                                                                                             with no-return.
3635                                                                                             waitcnt lgkmcnt(0).
3636                                                           - Must happen before            - Must happen before
3637                                                             any following                   the following
3638                                                             global/generic                  buffer_gl0_inv.
3639                                                             load/load
3640                                                             atomic/store/store
3641                                                             atomic/atomicrmw.
3642                                                           - Ensures any                   - Ensures any
3643                                                             following global                following global
3644                                                             data read is no                 data read is no
3645                                                             older than the                  older than the
3646                                                             atomicrmw value                 atomicrmw value
3647                                                             being acquired.                 being acquired.
3648
3649                                                                                         3. buffer_gl0_inv
3650
3651                                                                                           - If CU wavefront execution mode, omit.
3652                                                                                           - Ensures that
3653                                                                                             following
3654                                                                                             loads will not see
3655                                                                                             stale data.
3656
3657     atomicrmw    acquire      - agent        - global   1. buffer/global/flat_atomic    1. buffer/global_atomic
3658                               - system                  2. s_waitcnt vmcnt(0)           2. s_waitcnt vm/vscnt(0)
3659
3660                                                                                           - Use vmcnt if atomic with
3661                                                                                             return and vscnt if atomic
3662                                                                                             with no-return.
3663                                                                                             waitcnt lgkmcnt(0).
3664                                                           - Must happen before            - Must happen before
3665                                                             following                       following
3666                                                             buffer_wbinvl1_vol.             buffer_gl*_inv.
3667                                                           - Ensures the                   - Ensures the
3668                                                             atomicrmw has                   atomicrmw has
3669                                                             completed before                completed before
3670                                                             invalidating the                invalidating the
3671                                                             cache.                          caches.
3672
3673                                                         3. buffer_wbinvl1_vol           3. buffer_gl0_inv;
3674                                                                                            buffer_gl1_inv
3675
3676                                                           - Must happen before            - Must happen before
3677                                                             any following                   any following
3678                                                             global/generic                  global/generic
3679                                                             load/load                       load/load
3680                                                             atomic/atomicrmw.               atomic/atomicrmw.
3681                                                           - Ensures that                  - Ensures that
3682                                                             following loads                 following loads
3683                                                             will not see stale              will not see stale
3684                                                             global data.                    global data.
3685
3686     atomicrmw    acquire      - agent        - generic  1. flat_atomic                  1. flat_atomic
3687                               - system                  2. s_waitcnt vmcnt(0) &         2. s_waitcnt vm/vscnt(0) &
3688                                                            lgkmcnt(0)                      lgkmcnt(0)
3689
3690                                                           - If OpenCL, omit               - If OpenCL, omit
3691                                                             lgkmcnt(0).                     lgkmcnt(0).
3692                                                                                           - Use vmcnt if atomic with
3693                                                                                             return and vscnt if atomic
3694                                                                                             with no-return.
3695                                                           - Must happen before            - Must happen before
3696                                                             following                       following
3697                                                             buffer_wbinvl1_vol.             buffer_gl*_inv.
3698                                                           - Ensures the                   - Ensures the
3699                                                             atomicrmw has                   atomicrmw has
3700                                                             completed before                completed before
3701                                                             invalidating the                invalidating the
3702                                                             cache.                          caches.
3703
3704                                                         3. buffer_wbinvl1_vol           3. buffer_gl0_inv;
3705                                                                                            buffer_gl1_inv
3706
3707                                                           - Must happen before            - Must happen before
3708                                                             any following                   any following
3709                                                             global/generic                  global/generic
3710                                                             load/load                       load/load
3711                                                             atomic/atomicrmw.               atomic/atomicrmw.
3712                                                           - Ensures that                  - Ensures that
3713                                                             following loads                 following loads
3714                                                             will not see stale              will not see stale
3715                                                             global data.                    global data.
3716
3717     fence        acquire      - singlethread *none*     *none*                          *none*
3718                               - wavefront
3719     fence        acquire      - workgroup    *none*     1. s_waitcnt lgkmcnt(0)         1. s_waitcnt lgkmcnt(0) &
3720                                                                                            vmcnt(0) & vscnt(0)
3721
3722                                                                                           - If CU wavefront execution mode, omit vmcnt and
3723                                                                                             vscnt.
3724                                                           - If OpenCL and                 - If OpenCL and
3725                                                             address space is                address space is
3726                                                             not generic, omit.              not generic, omit
3727                                                                                             lgkmcnt(0).
3728                                                                                           - If OpenCL and
3729                                                                                             address space is
3730                                                                                             local, omit
3731                                                                                             vmcnt(0) and vscnt(0).
3732                                                           - However, since LLVM           - However, since LLVM
3733                                                             currently has no                currently has no
3734                                                             address space on                address space on
3735                                                             the fence need to               the fence need to
3736                                                             conservatively                  conservatively
3737                                                             always generate. If             always generate. If
3738                                                             fence had an                    fence had an
3739                                                             address space then              address space then
3740                                                             set to address                  set to address
3741                                                             space of OpenCL                 space of OpenCL
3742                                                             fence flag, or to               fence flag, or to
3743                                                             generic if both                 generic if both
3744                                                             local and global                local and global
3745                                                             flags are                       flags are
3746                                                             specified.                      specified.
3747                                                           - Must happen after
3748                                                             any preceding
3749                                                             local/generic load
3750                                                             atomic/atomicrmw
3751                                                             with an equal or
3752                                                             wider sync scope
3753                                                             and memory ordering
3754                                                             stronger than
3755                                                             unordered (this is
3756                                                             termed the
3757                                                             fence-paired-atomic).
3758                                                           - Must happen before
3759                                                             any following
3760                                                             global/generic
3761                                                             load/load
3762                                                             atomic/store/store
3763                                                             atomic/atomicrmw.
3764                                                           - Ensures any
3765                                                             following global
3766                                                             data read is no
3767                                                             older than the
3768                                                             value read by the
3769                                                             fence-paired-atomic.
3770                                                                                           - Could be split into
3771                                                                                             separate s_waitcnt
3772                                                                                             vmcnt(0), s_waitcnt
3773                                                                                             vscnt(0) and s_waitcnt
3774                                                                                             lgkmcnt(0) to allow
3775                                                                                             them to be
3776                                                                                             independently moved
3777                                                                                             according to the
3778                                                                                             following rules.
3779                                                                                           - s_waitcnt vmcnt(0)
3780                                                                                             must happen after
3781                                                                                             any preceding
3782                                                                                             global/generic load
3783                                                                                             atomic/
3784                                                                                             atomicrmw-with-return-value
3785                                                                                             with an equal or
3786                                                                                             wider sync scope
3787                                                                                             and memory ordering
3788                                                                                             stronger than
3789                                                                                             unordered (this is
3790                                                                                             termed the
3791                                                                                             fence-paired-atomic).
3792                                                                                           - s_waitcnt vscnt(0)
3793                                                                                             must happen after
3794                                                                                             any preceding
3795                                                                                             global/generic
3796                                                                                             atomicrmw-no-return-value
3797                                                                                             with an equal or
3798                                                                                             wider sync scope
3799                                                                                             and memory ordering
3800                                                                                             stronger than
3801                                                                                             unordered (this is
3802                                                                                             termed the
3803                                                                                             fence-paired-atomic).
3804                                                                                           - s_waitcnt lgkmcnt(0)
3805                                                                                             must happen after
3806                                                                                             any preceding
3807                                                                                             local/generic load
3808                                                                                             atomic/atomicrmw
3809                                                                                             with an equal or
3810                                                                                             wider sync scope
3811                                                                                             and memory ordering
3812                                                                                             stronger than
3813                                                                                             unordered (this is
3814                                                                                             termed the
3815                                                                                             fence-paired-atomic).
3816                                                                                           - Must happen before
3817                                                                                             the following
3818                                                                                             buffer_gl0_inv.
3819                                                                                           - Ensures that the
3820                                                                                             fence-paired atomic
3821                                                                                             has completed
3822                                                                                             before invalidating
3823                                                                                             the
3824                                                                                             cache. Therefore
3825                                                                                             any following
3826                                                                                             locations read must
3827                                                                                             be no older than
3828                                                                                             the value read by
3829                                                                                             the
3830                                                                                             fence-paired-atomic.
3831
3832                                                                                         3. buffer_gl0_inv
3833
3834                                                                                           - If CU wavefront execution mode, omit.
3835                                                                                           - Ensures that
3836                                                                                             following
3837                                                                                             loads will not see
3838                                                                                             stale data.
3839
3840     fence        acquire      - agent        *none*     1. s_waitcnt lgkmcnt(0) &       1. s_waitcnt lgkmcnt(0) &
3841                               - system                     vmcnt(0)                        vmcnt(0) & vscnt(0)
3842
3843                                                           - If OpenCL and                 - If OpenCL and
3844                                                             address space is                address space is
3845                                                             not generic, omit               not generic, omit
3846                                                             lgkmcnt(0).                     lgkmcnt(0).
3847                                                                                           - If OpenCL and
3848                                                                                             address space is
3849                                                                                             local, omit
3850                                                                                             vmcnt(0) and vscnt(0).
3851                                                           - However, since LLVM           - However, since LLVM
3852                                                             currently has no                currently has no
3853                                                             address space on                address space on
3854                                                             the fence need to               the fence need to
3855                                                             conservatively                  conservatively
3856                                                             always generate                 always generate
3857                                                             (see comment for                (see comment for
3858                                                             previous fence).                previous fence).
3859                                                           - Could be split into
3860                                                             separate s_waitcnt
3861                                                             vmcnt(0) and
3862                                                             s_waitcnt
3863                                                             lgkmcnt(0) to allow
3864                                                             them to be
3865                                                             independently moved
3866                                                             according to the
3867                                                             following rules.
3868                                                           - s_waitcnt vmcnt(0)
3869                                                             must happen after
3870                                                             any preceding
3871                                                             global/generic load
3872                                                             atomic/atomicrmw
3873                                                             with an equal or
3874                                                             wider sync scope
3875                                                             and memory ordering
3876                                                             stronger than
3877                                                             unordered (this is
3878                                                             termed the
3879                                                             fence-paired-atomic).
3880                                                           - s_waitcnt lgkmcnt(0)
3881                                                             must happen after
3882                                                             any preceding
3883                                                             local/generic load
3884                                                             atomic/atomicrmw
3885                                                             with an equal or
3886                                                             wider sync scope
3887                                                             and memory ordering
3888                                                             stronger than
3889                                                             unordered (this is
3890                                                             termed the
3891                                                             fence-paired-atomic).
3892                                                           - Must happen before
3893                                                             the following
3894                                                             buffer_wbinvl1_vol.
3895                                                           - Ensures that the
3896                                                             fence-paired atomic
3897                                                             has completed
3898                                                             before invalidating
3899                                                             the
3900                                                             cache. Therefore
3901                                                             any following
3902                                                             locations read must
3903                                                             be no older than
3904                                                             the value read by
3905                                                             the
3906                                                             fence-paired-atomic.
3907                                                                                           - Could be split into
3908                                                                                             separate s_waitcnt
3909                                                                                             vmcnt(0), s_waitcnt
3910                                                                                             vscnt(0) and s_waitcnt
3911                                                                                             lgkmcnt(0) to allow
3912                                                                                             them to be
3913                                                                                             independently moved
3914                                                                                             according to the
3915                                                                                             following rules.
3916                                                                                           - s_waitcnt vmcnt(0)
3917                                                                                             must happen after
3918                                                                                             any preceding
3919                                                                                             global/generic load
3920                                                                                             atomic/
3921                                                                                             atomicrmw-with-return-value
3922                                                                                             with an equal or
3923                                                                                             wider sync scope
3924                                                                                             and memory ordering
3925                                                                                             stronger than
3926                                                                                             unordered (this is
3927                                                                                             termed the
3928                                                                                             fence-paired-atomic).
3929                                                                                           - s_waitcnt vscnt(0)
3930                                                                                             must happen after
3931                                                                                             any preceding
3932                                                                                             global/generic
3933                                                                                             atomicrmw-no-return-value
3934                                                                                             with an equal or
3935                                                                                             wider sync scope
3936                                                                                             and memory ordering
3937                                                                                             stronger than
3938                                                                                             unordered (this is
3939                                                                                             termed the
3940                                                                                             fence-paired-atomic).
3941                                                                                           - s_waitcnt lgkmcnt(0)
3942                                                                                             must happen after
3943                                                                                             any preceding
3944                                                                                             local/generic load
3945                                                                                             atomic/atomicrmw
3946                                                                                             with an equal or
3947                                                                                             wider sync scope
3948                                                                                             and memory ordering
3949                                                                                             stronger than
3950                                                                                             unordered (this is
3951                                                                                             termed the
3952                                                                                             fence-paired-atomic).
3953                                                                                           - Must happen before
3954                                                                                             the following
3955                                                                                             buffer_gl*_inv.
3956                                                                                           - Ensures that the
3957                                                                                             fence-paired atomic
3958                                                                                             has completed
3959                                                                                             before invalidating
3960                                                                                             the
3961                                                                                             caches. Therefore
3962                                                                                             any following
3963                                                                                             locations read must
3964                                                                                             be no older than
3965                                                                                             the value read by
3966                                                                                             the
3967                                                                                             fence-paired-atomic.
3968
3969                                                         2. buffer_wbinvl1_vol           2. buffer_gl0_inv;
3970                                                                                            buffer_gl1_inv
3971
3972                                                           - Must happen before any        - Must happen before any
3973                                                             following global/generic        following global/generic
3974                                                             load/load                       load/load
3975                                                             atomic/store/store              atomic/store/store
3976                                                             atomic/atomicrmw.               atomic/atomicrmw.
3977                                                           - Ensures that                  - Ensures that
3978                                                             following loads                 following loads
3979                                                             will not see stale              will not see stale
3980                                                             global data.                    global data.
3981
3982     **Release Atomic**
3983     ----------------------------------------------------------------------------------------------------------------------
3984     store atomic release      - singlethread - global   1. buffer/global/ds/flat_store  1. buffer/global/ds/flat_store
3985                               - wavefront    - local
3986                                              - generic
3987     store atomic release      - workgroup    - global   1. s_waitcnt lgkmcnt(0)         1. s_waitcnt lgkmcnt(0) &
3988                                                                                            vmcnt(0) & vscnt(0)
3989
3990                                                                                           - If CU wavefront execution mode, omit vmcnt and
3991                                                                                             vscnt.
3992                                                           - If OpenCL, omit.              - If OpenCL, omit
3993                                                                                             lgkmcnt(0).
3994                                                           - Must happen after
3995                                                             any preceding
3996                                                             local/generic
3997                                                             load/store/load
3998                                                             atomic/store
3999                                                             atomic/atomicrmw.
4000                                                                                           - Could be split into
4001                                                                                             separate s_waitcnt
4002                                                                                             vmcnt(0), s_waitcnt
4003                                                                                             vscnt(0) and s_waitcnt
4004                                                                                             lgkmcnt(0) to allow
4005                                                                                             them to be
4006                                                                                             independently moved
4007                                                                                             according to the
4008                                                                                             following rules.
4009                                                                                           - s_waitcnt vmcnt(0)
4010                                                                                             must happen after
4011                                                                                             any preceding
4012                                                                                             global/generic load/load
4013                                                                                             atomic/
4014                                                                                             atomicrmw-with-return-value.
4015                                                                                           - s_waitcnt vscnt(0)
4016                                                                                             must happen after
4017                                                                                             any preceding
4018                                                                                             global/generic
4019                                                                                             store/store
4020                                                                                             atomic/
4021                                                                                             atomicrmw-no-return-value.
4022                                                                                           - s_waitcnt lgkmcnt(0)
4023                                                                                             must happen after
4024                                                                                             any preceding
4025                                                                                             local/generic
4026                                                                                             load/store/load
4027                                                                                             atomic/store
4028                                                                                             atomic/atomicrmw.
4029                                                           - Must happen before            - Must happen before
4030                                                             the following                   the following
4031                                                             store.                          store.
4032                                                           - Ensures that all              - Ensures that all
4033                                                             memory operations               memory operations
4034                                                             to local have                   have
4035                                                             completed before                completed before
4036                                                             performing the                  performing the
4037                                                             store that is being             store that is being
4038                                                             released.                       released.
4039
4040                                                         2. buffer/global/flat_store     2. buffer/global_store
4041     store atomic release      - workgroup    - local                                    1. waitcnt vmcnt(0) & vscnt(0)
4042
4043                                                                                           - If CU wavefront execution mode, omit.
4044                                                                                           - If OpenCL, omit.
4045                                                                                           - Could be split into
4046                                                                                             separate s_waitcnt
4047                                                                                             vmcnt(0) and s_waitcnt
4048                                                                                             vscnt(0) to allow
4049                                                                                             them to be
4050                                                                                             independently moved
4051                                                                                             according to the
4052                                                                                             following rules.
4053                                                                                           - s_waitcnt vmcnt(0)
4054                                                                                             must happen after
4055                                                                                             any preceding
4056                                                                                             global/generic load/load
4057                                                                                             atomic/
4058                                                                                             atomicrmw-with-return-value.
4059                                                                                           - s_waitcnt vscnt(0)
4060                                                                                             must happen after
4061                                                                                             any preceding
4062                                                                                             global/generic
4063                                                                                             store/store atomic/
4064                                                                                             atomicrmw-no-return-value.
4065                                                                                           - Must happen before
4066                                                                                             the following
4067                                                                                             store.
4068                                                                                           - Ensures that all
4069                                                                                             global memory
4070                                                                                             operations have
4071                                                                                             completed before
4072                                                                                             performing the
4073                                                                                             store that is being
4074                                                                                             released.
4075
4076                                                         1. ds_store                     2. ds_store
4077     store atomic release      - workgroup    - generic  1. s_waitcnt lgkmcnt(0)         1. s_waitcnt lgkmcnt(0) &
4078                                                                                            vmcnt(0) & vscnt(0)
4079
4080                                                                                           - If CU wavefront execution mode, omit vmcnt and
4081                                                                                             vscnt.
4082                                                           - If OpenCL, omit.              - If OpenCL, omit
4083                                                                                             lgkmcnt(0).
4084                                                           - Must happen after
4085                                                             any preceding
4086                                                             local/generic
4087                                                             load/store/load
4088                                                             atomic/store
4089                                                             atomic/atomicrmw.
4090                                                                                           - Could be split into
4091                                                                                             separate s_waitcnt
4092                                                                                             vmcnt(0), s_waitcnt
4093                                                                                             vscnt(0) and s_waitcnt
4094                                                                                             lgkmcnt(0) to allow
4095                                                                                             them to be
4096                                                                                             independently moved
4097                                                                                             according to the
4098                                                                                             following rules.
4099                                                                                           - s_waitcnt vmcnt(0)
4100                                                                                             must happen after
4101                                                                                             any preceding
4102                                                                                             global/generic load/load
4103                                                                                             atomic/
4104                                                                                             atomicrmw-with-return-value.
4105                                                                                           - s_waitcnt vscnt(0)
4106                                                                                             must happen after
4107                                                                                             any preceding
4108                                                                                             global/generic
4109                                                                                             store/store
4110                                                                                             atomic/
4111                                                                                             atomicrmw-no-return-value.
4112                                                                                           - s_waitcnt lgkmcnt(0)
4113                                                                                             must happen after
4114                                                                                             any preceding
4115                                                                                             local/generic load/store/load
4116                                                                                             atomic/store atomic/atomicrmw.
4117                                                           - Must happen before            - Must happen before
4118                                                             the following                   the following
4119                                                             store.                          store.
4120                                                           - Ensures that all              - Ensures that all
4121                                                             memory operations               memory operations
4122                                                             to local have                   have
4123                                                             completed before                completed before
4124                                                             performing the                  performing the
4125                                                             store that is being             store that is being
4126                                                             released.                       released.
4127
4128                                                         2. flat_store                   2. flat_store
4129     store atomic release      - agent        - global   1. s_waitcnt lgkmcnt(0) &         1. s_waitcnt lgkmcnt(0) &
4130                               - system       - generic     vmcnt(0)                          vmcnt(0) & vscnt(0)
4131
4132                                                           - If OpenCL, omit               - If OpenCL, omit
4133                                                             lgkmcnt(0).                     lgkmcnt(0).
4134                                                           - Could be split into           - Could be split into
4135                                                             separate s_waitcnt              separate s_waitcnt
4136                                                             vmcnt(0) and                    vmcnt(0), s_waitcnt vscnt(0)
4137                                                             s_waitcnt                       and s_waitcnt
4138                                                             lgkmcnt(0) to allow             lgkmcnt(0) to allow
4139                                                             them to be                      them to be
4140                                                             independently moved             independently moved
4141                                                             according to the                according to the
4142                                                             following rules.                following rules.
4143                                                           - s_waitcnt vmcnt(0)            - s_waitcnt vmcnt(0)
4144                                                             must happen after               must happen after
4145                                                             any preceding                   any preceding
4146                                                             global/generic                  global/generic
4147                                                             load/store/load                 load/load
4148                                                             atomic/store                    atomic/
4149                                                             atomic/atomicrmw.               atomicrmw-with-return-value.
4150                                                                                           - s_waitcnt vscnt(0)
4151                                                                                             must happen after
4152                                                                                             any preceding
4153                                                                                             global/generic
4154                                                                                             store/store atomic/
4155                                                                                             atomicrmw-no-return-value.
4156                                                           - s_waitcnt lgkmcnt(0)          - s_waitcnt lgkmcnt(0)
4157                                                             must happen after               must happen after
4158                                                             any preceding                   any preceding
4159                                                             local/generic                   local/generic
4160                                                             load/store/load                 load/store/load
4161                                                             atomic/store                    atomic/store
4162                                                             atomic/atomicrmw.               atomic/atomicrmw.
4163                                                           - Must happen before            - Must happen before
4164                                                             the following                   the following
4165                                                             store.                          store.
4166                                                           - Ensures that all              - Ensures that all
4167                                                             memory operations               memory operations
4168                                                             to memory have                  to memory have
4169                                                             completed before                completed before
4170                                                             performing the                  performing the
4171                                                             store that is being             store that is being
4172                                                             released.                       released.
4173
4174                                                         2. buffer/global/ds/flat_store  2. buffer/global/ds/flat_store
4175     atomicrmw    release      - singlethread - global   1. buffer/global/ds/flat_atomic 1. buffer/global/ds/flat_atomic
4176                               - wavefront    - local
4177                                              - generic
4178     atomicrmw    release      - workgroup    - global   1. s_waitcnt lgkmcnt(0)         1. s_waitcnt lgkmcnt(0) &
4179                                                                                            vmcnt(0) & vscnt(0)
4180
4181                                                                                           - If CU wavefront execution mode, omit vmcnt and
4182                                                                                             vscnt.
4183                                                           - If OpenCL, omit.
4184
4185                                                           - Must happen after
4186                                                             any preceding
4187                                                             local/generic
4188                                                             load/store/load
4189                                                             atomic/store
4190                                                             atomic/atomicrmw.
4191                                                                                           - Could be split into
4192                                                                                             separate s_waitcnt
4193                                                                                             vmcnt(0), s_waitcnt
4194                                                                                             vscnt(0) and s_waitcnt
4195                                                                                             lgkmcnt(0) to allow
4196                                                                                             them to be
4197                                                                                             independently moved
4198                                                                                             according to the
4199                                                                                             following rules.
4200                                                                                           - s_waitcnt vmcnt(0)
4201                                                                                             must happen after
4202                                                                                             any preceding
4203                                                                                             global/generic load/load
4204                                                                                             atomic/
4205                                                                                             atomicrmw-with-return-value.
4206                                                                                           - s_waitcnt vscnt(0)
4207                                                                                             must happen after
4208                                                                                             any preceding
4209                                                                                             global/generic
4210                                                                                             store/store
4211                                                                                             atomic/
4212                                                                                             atomicrmw-no-return-value.
4213                                                                                           - s_waitcnt lgkmcnt(0)
4214                                                                                             must happen after
4215                                                                                             any preceding
4216                                                                                             local/generic
4217                                                                                             load/store/load
4218                                                                                             atomic/store
4219                                                                                             atomic/atomicrmw.
4220                                                           - Must happen before            - Must happen before
4221                                                             the following                   the following
4222                                                             atomicrmw.                      atomicrmw.
4223                                                           - Ensures that all              - Ensures that all
4224                                                             memory operations               memory operations
4225                                                             to local have                   have
4226                                                             completed before                completed before
4227                                                             performing the                  performing the
4228                                                             atomicrmw that is               atomicrmw that is
4229                                                             being released.                 being released.
4230
4231                                                         2. buffer/global/flat_atomic    2. buffer/global_atomic
4232     atomicrmw    release      - workgroup    - local                                    1. waitcnt vmcnt(0) & vscnt(0)
4233
4234                                                                                           - If CU wavefront execution mode, omit.
4235                                                                                           - If OpenCL, omit.
4236                                                                                           - Could be split into
4237                                                                                             separate s_waitcnt
4238                                                                                             vmcnt(0) and s_waitcnt
4239                                                                                             vscnt(0) to allow
4240                                                                                             them to be
4241                                                                                             independently moved
4242                                                                                             according to the
4243                                                                                             following rules.
4244                                                                                           - s_waitcnt vmcnt(0)
4245                                                                                             must happen after
4246                                                                                             any preceding
4247                                                                                             global/generic load/load
4248                                                                                             atomic/
4249                                                                                             atomicrmw-with-return-value.
4250                                                                                           - s_waitcnt vscnt(0)
4251                                                                                             must happen after
4252                                                                                             any preceding
4253                                                                                             global/generic
4254                                                                                             store/store atomic/
4255                                                                                             atomicrmw-no-return-value.
4256                                                                                           - Must happen before
4257                                                                                             the following
4258                                                                                             store.
4259                                                                                           - Ensures that all
4260                                                                                             global memory
4261                                                                                             operations have
4262                                                                                             completed before
4263                                                                                             performing the
4264                                                                                             store that is being
4265                                                                                             released.
4266
4267                                                         1. ds_atomic                    2. ds_atomic
4268     atomicrmw    release      - workgroup    - generic  1. s_waitcnt lgkmcnt(0)         1. s_waitcnt lgkmcnt(0) &
4269                                                                                            vmcnt(0) & vscnt(0)
4270
4271                                                                                           - If CU wavefront execution mode, omit vmcnt and
4272                                                                                             vscnt.
4273                                                           - If OpenCL, omit.              - If OpenCL, omit
4274                                                                                             waitcnt lgkmcnt(0).
4275                                                           - Must happen after
4276                                                             any preceding
4277                                                             local/generic
4278                                                             load/store/load
4279                                                             atomic/store
4280                                                             atomic/atomicrmw.
4281                                                                                           - Could be split into
4282                                                                                             separate s_waitcnt
4283                                                                                             vmcnt(0), s_waitcnt
4284                                                                                             vscnt(0) and s_waitcnt
4285                                                                                             lgkmcnt(0) to allow
4286                                                                                             them to be
4287                                                                                             independently moved
4288                                                                                             according to the
4289                                                                                             following rules.
4290                                                                                           - s_waitcnt vmcnt(0)
4291                                                                                             must happen after
4292                                                                                             any preceding
4293                                                                                             global/generic load/load
4294                                                                                             atomic/
4295                                                                                             atomicrmw-with-return-value.
4296                                                                                           - s_waitcnt vscnt(0)
4297                                                                                             must happen after
4298                                                                                             any preceding
4299                                                                                             global/generic
4300                                                                                             store/store
4301                                                                                             atomic/
4302                                                                                             atomicrmw-no-return-value.
4303                                                                                           - s_waitcnt lgkmcnt(0)
4304                                                                                             must happen after
4305                                                                                             any preceding
4306                                                                                             local/generic load/store/load
4307                                                                                             atomic/store atomic/atomicrmw.
4308                                                           - Must happen before            - Must happen before
4309                                                             the following                   the following
4310                                                             atomicrmw.                      atomicrmw.
4311                                                           - Ensures that all              - Ensures that all
4312                                                             memory operations               memory operations
4313                                                             to local have                   have
4314                                                             completed before                completed before
4315                                                             performing the                  performing the
4316                                                             atomicrmw that is               atomicrmw that is
4317                                                             being released.                 being released.
4318
4319                                                         2. flat_atomic                  2. flat_atomic
4320     atomicrmw    release      - agent        - global   1. s_waitcnt lgkmcnt(0) &       1. s_waitcnt lkkmcnt(0) &
4321                               - system       - generic     vmcnt(0)                         vmcnt(0) & vscnt(0)
4322
4323                                                           - If OpenCL, omit               - If OpenCL, omit
4324                                                             lgkmcnt(0).                     lgkmcnt(0).
4325                                                           - Could be split into           - Could be split into
4326                                                             separate s_waitcnt              separate s_waitcnt
4327                                                             vmcnt(0) and                    vmcnt(0), s_waitcnt
4328                                                             s_waitcnt                       vscnt(0) and s_waitcnt
4329                                                             lgkmcnt(0) to allow             lgkmcnt(0) to allow
4330                                                             them to be                      them to be
4331                                                             independently moved             independently moved
4332                                                             according to the                according to the
4333                                                             following rules.                following rules.
4334                                                           - s_waitcnt vmcnt(0)            - s_waitcnt vmcnt(0)
4335                                                             must happen after               must happen after
4336                                                             any preceding                   any preceding
4337                                                             global/generic                  global/generic
4338                                                             load/store/load                 load/load atomic/
4339                                                             atomic/store                    atomicrmw-with-return-value.
4340                                                             atomic/atomicrmw.
4341                                                                                           - s_waitcnt vscnt(0)
4342                                                                                             must happen after
4343                                                                                             any preceding
4344                                                                                             global/generic
4345                                                                                             store/store atomic/
4346                                                                                             atomicrmw-no-return-value.
4347                                                           - s_waitcnt lgkmcnt(0)          - s_waitcnt lgkmcnt(0)
4348                                                             must happen after               must happen after
4349                                                             any preceding                   any preceding
4350                                                             local/generic                   local/generic
4351                                                             load/store/load                 load/store/load
4352                                                             atomic/store                    atomic/store
4353                                                             atomic/atomicrmw.               atomic/atomicrmw.
4354                                                           - Must happen before            - Must happen before
4355                                                             the following                   the following
4356                                                             atomicrmw.                      atomicrmw.
4357                                                           - Ensures that all              - Ensures that all
4358                                                             memory operations               memory operations
4359                                                             to global and local             to global and local
4360                                                             have completed                  have completed
4361                                                             before performing               before performing
4362                                                             the atomicrmw that              the atomicrmw that
4363                                                             is being released.              is being released.
4364
4365                                                         2. buffer/global/ds/flat_atomic 2. buffer/global/ds/flat_atomic
4366     fence        release      - singlethread *none*     *none*                          *none*
4367                               - wavefront
4368     fence        release      - workgroup    *none*     1. s_waitcnt lgkmcnt(0)         1. s_waitcnt lgkmcnt(0) &
4369                                                                                            vmcnt(0) & vscnt(0)
4370
4371                                                                                           - If CU wavefront execution mode, omit vmcnt and
4372                                                                                             vscnt.
4373                                                           - If OpenCL and                 - If OpenCL and
4374                                                             address space is                address space is
4375                                                             not generic, omit.              not generic, omit
4376                                                                                             lgkmcnt(0).
4377                                                                                           - If OpenCL and
4378                                                                                             address space is
4379                                                                                             local, omit
4380                                                                                             vmcnt(0) and vscnt(0).
4381                                                           - However, since LLVM           - However, since LLVM
4382                                                             currently has no                currently has no
4383                                                             address space on                address space on
4384                                                             the fence need to               the fence need to
4385                                                             conservatively                  conservatively
4386                                                             always generate. If             always generate. If
4387                                                             fence had an                    fence had an
4388                                                             address space then              address space then
4389                                                             set to address                  set to address
4390                                                             space of OpenCL                 space of OpenCL
4391                                                             fence flag, or to               fence flag, or to
4392                                                             generic if both                 generic if both
4393                                                             local and global                local and global
4394                                                             flags are                       flags are
4395                                                             specified.                      specified.
4396                                                           - Must happen after
4397                                                             any preceding
4398                                                             local/generic
4399                                                             load/load
4400                                                             atomic/store/store
4401                                                             atomic/atomicrmw.
4402                                                                                           - Could be split into
4403                                                                                             separate s_waitcnt
4404                                                                                             vmcnt(0), s_waitcnt
4405                                                                                             vscnt(0) and s_waitcnt
4406                                                                                             lgkmcnt(0) to allow
4407                                                                                             them to be
4408                                                                                             independently moved
4409                                                                                             according to the
4410                                                                                             following rules.
4411                                                                                           - s_waitcnt vmcnt(0)
4412                                                                                             must happen after
4413                                                                                             any preceding
4414                                                                                             global/generic
4415                                                                                             load/load
4416                                                                                             atomic/
4417                                                                                             atomicrmw-with-return-value.
4418                                                                                           - s_waitcnt vscnt(0)
4419                                                                                             must happen after
4420                                                                                             any preceding
4421                                                                                             global/generic
4422                                                                                             store/store atomic/
4423                                                                                             atomicrmw-no-return-value.
4424                                                                                           - s_waitcnt lgkmcnt(0)
4425                                                                                             must happen after
4426                                                                                             any preceding
4427                                                                                             local/generic
4428                                                                                             load/store/load
4429                                                                                             atomic/store atomic/
4430                                                                                             atomicrmw.
4431                                                           - Must happen before            - Must happen before
4432                                                             any following store             any following store
4433                                                             atomic/atomicrmw                atomic/atomicrmw
4434                                                             with an equal or                with an equal or
4435                                                             wider sync scope                wider sync scope
4436                                                             and memory ordering             and memory ordering
4437                                                             stronger than                   stronger than
4438                                                             unordered (this is              unordered (this is
4439                                                             termed the                      termed the
4440                                                             fence-paired-atomic).           fence-paired-atomic).
4441                                                           - Ensures that all              - Ensures that all
4442                                                             memory operations               memory operations
4443                                                             to local have                   have
4444                                                             completed before                completed before
4445                                                             performing the                  performing the
4446                                                             following                       following
4447                                                             fence-paired-atomic.            fence-paired-atomic.
4448
4449     fence        release      - agent        *none*     1. s_waitcnt lgkmcnt(0) &       1. s_waitcnt lgkmcnt(0) &
4450                               - system                     vmcnt(0)                        vmcnt(0) & vscnt(0)
4451
4452                                                           - If OpenCL and                 - If OpenCL and
4453                                                             address space is                address space is
4454                                                             not generic, omit               not generic, omit
4455                                                             lgkmcnt(0).                     lgkmcnt(0).
4456                                                           - If OpenCL and                 - If OpenCL and
4457                                                             address space is                address space is
4458                                                             local, omit                     local, omit
4459                                                             vmcnt(0).                       vmcnt(0) and vscnt(0).
4460                                                           - However, since LLVM           - However, since LLVM
4461                                                             currently has no                currently has no
4462                                                             address space on                address space on
4463                                                             the fence need to               the fence need to
4464                                                             conservatively                  conservatively
4465                                                             always generate. If             always generate. If
4466                                                             fence had an                    fence had an
4467                                                             address space then              address space then
4468                                                             set to address                  set to address
4469                                                             space of OpenCL                 space of OpenCL
4470                                                             fence flag, or to               fence flag, or to
4471                                                             generic if both                 generic if both
4472                                                             local and global                local and global
4473                                                             flags are                       flags are
4474                                                             specified.                      specified.
4475                                                           - Could be split into           - Could be split into
4476                                                             separate s_waitcnt              separate s_waitcnt
4477                                                             vmcnt(0) and                    vmcnt(0), s_waitcnt
4478                                                             s_waitcnt                       vscnt(0) and s_waitcnt
4479                                                             lgkmcnt(0) to allow             lgkmcnt(0) to allow
4480                                                             them to be                      them to be
4481                                                             independently moved             independently moved
4482                                                             according to the                according to the
4483                                                             following rules.                following rules.
4484                                                           - s_waitcnt vmcnt(0)            - s_waitcnt vmcnt(0)
4485                                                             must happen after               must happen after
4486                                                             any preceding                   any preceding
4487                                                             global/generic                  global/generic
4488                                                             load/store/load                 load/load atomic/
4489                                                             atomic/store                    atomicrmw-with-return-value.
4490                                                             atomic/atomicrmw.
4491                                                                                           - s_waitcnt vscnt(0)
4492                                                                                             must happen after
4493                                                                                             any preceding
4494                                                                                             global/generic
4495                                                                                             store/store atomic/
4496                                                                                             atomicrmw-no-return-value.
4497                                                           - s_waitcnt lgkmcnt(0)          - s_waitcnt lgkmcnt(0)
4498                                                             must happen after               must happen after
4499                                                             any preceding                   any preceding
4500                                                             local/generic                   local/generic
4501                                                             load/store/load                 load/store/load
4502                                                             atomic/store                    atomic/store
4503                                                             atomic/atomicrmw.               atomic/atomicrmw.
4504                                                           - Must happen before            - Must happen before
4505                                                             any following store             any following store
4506                                                             atomic/atomicrmw                atomic/atomicrmw
4507                                                             with an equal or                with an equal or
4508                                                             wider sync scope                wider sync scope
4509                                                             and memory ordering             and memory ordering
4510                                                             stronger than                   stronger than
4511                                                             unordered (this is              unordered (this is
4512                                                             termed the                      termed the
4513                                                             fence-paired-atomic).           fence-paired-atomic).
4514                                                           - Ensures that all              - Ensures that all
4515                                                             memory operations               memory operations
4516                                                             have                            have
4517                                                             completed before                completed before
4518                                                             performing the                  performing the
4519                                                             following                       following
4520                                                             fence-paired-atomic.            fence-paired-atomic.
4521
4522     **Acquire-Release Atomic**
4523     ----------------------------------------------------------------------------------------------------------------------
4524     atomicrmw    acq_rel      - singlethread - global   1. buffer/global/ds/flat_atomic 1. buffer/global/ds/flat_atomic
4525                               - wavefront    - local
4526                                              - generic
4527     atomicrmw    acq_rel      - workgroup    - global   1. s_waitcnt lgkmcnt(0)         1. s_waitcnt lgkmcnt(0) &
4528                                                                                            vmcnt(0) & vscnt(0)
4529
4530                                                                                           - If CU wavefront execution mode, omit vmcnt and
4531                                                                                             vscnt.
4532                                                           - If OpenCL, omit.              - If OpenCL, omit
4533                                                                                             s_waitcnt lgkmcnt(0).
4534                                                           - Must happen after             - Must happen after
4535                                                             any preceding                   any preceding
4536                                                             local/generic                   local/generic
4537                                                             load/store/load                 load/store/load
4538                                                             atomic/store                    atomic/store
4539                                                             atomic/atomicrmw.               atomic/atomicrmw.
4540                                                                                           - Could be split into
4541                                                                                             separate s_waitcnt
4542                                                                                             vmcnt(0), s_waitcnt
4543                                                                                             vscnt(0) and s_waitcnt
4544                                                                                             lgkmcnt(0) to allow
4545                                                                                             them to be
4546                                                                                             independently moved
4547                                                                                             according to the
4548                                                                                             following rules.
4549                                                                                           - s_waitcnt vmcnt(0)
4550                                                                                             must happen after
4551                                                                                             any preceding
4552                                                                                             global/generic load/load
4553                                                                                             atomic/
4554                                                                                             atomicrmw-with-return-value.
4555                                                                                           - s_waitcnt vscnt(0)
4556                                                                                             must happen after
4557                                                                                             any preceding
4558                                                                                             global/generic
4559                                                                                             store/store
4560                                                                                             atomic/
4561                                                                                             atomicrmw-no-return-value.
4562                                                                                           - s_waitcnt lgkmcnt(0)
4563                                                                                             must happen after
4564                                                                                             any preceding
4565                                                                                             local/generic load/store/load
4566                                                                                             atomic/store atomic/atomicrmw.
4567                                                           - Must happen before            - Must happen before
4568                                                             the following                   the following
4569                                                             atomicrmw.                      atomicrmw.
4570                                                           - Ensures that all              - Ensures that all
4571                                                             memory operations               memory operations
4572                                                             to local have                   have
4573                                                             completed before                completed before
4574                                                             performing the                  performing the
4575                                                             atomicrmw that is               atomicrmw that is
4576                                                             being released.                 being released.
4577
4578                                                         2. buffer/global/flat_atomic    2. buffer/global_atomic
4579                                                                                         3. s_waitcnt vm/vscnt(0)
4580
4581                                                                                           - If CU wavefront execution mode, omit vm/vscnt.
4582                                                                                           - Use vmcnt if atomic with
4583                                                                                             return and vscnt if atomic
4584                                                                                             with no-return.
4585                                                                                             waitcnt lgkmcnt(0).
4586                                                                                           - Must happen before
4587                                                                                             the following
4588                                                                                             buffer_gl0_inv.
4589                                                                                           - Ensures any
4590                                                                                             following global
4591                                                                                             data read is no
4592                                                                                             older than the
4593                                                                                             atomicrmw value
4594                                                                                             being acquired.
4595
4596                                                                                         4. buffer_gl0_inv
4597
4598                                                                                           - If CU wavefront execution mode, omit.
4599                                                                                           - Ensures that
4600                                                                                             following
4601                                                                                             loads will not see
4602                                                                                             stale data.
4603
4604     atomicrmw    acq_rel      - workgroup    - local                                    1. waitcnt vmcnt(0) & vscnt(0)
4605
4606                                                                                           - If CU wavefront execution mode, omit.
4607                                                                                           - If OpenCL, omit.
4608                                                                                           - Could be split into
4609                                                                                             separate s_waitcnt
4610                                                                                             vmcnt(0) and s_waitcnt
4611                                                                                             vscnt(0) to allow
4612                                                                                             them to be
4613                                                                                             independently moved
4614                                                                                             according to the
4615                                                                                             following rules.
4616                                                                                           - s_waitcnt vmcnt(0)
4617                                                                                             must happen after
4618                                                                                             any preceding
4619                                                                                             global/generic load/load
4620                                                                                             atomic/
4621                                                                                             atomicrmw-with-return-value.
4622                                                                                           - s_waitcnt vscnt(0)
4623                                                                                             must happen after
4624                                                                                             any preceding
4625                                                                                             global/generic
4626                                                                                             store/store atomic/
4627                                                                                             atomicrmw-no-return-value.
4628                                                                                           - Must happen before
4629                                                                                             the following
4630                                                                                             store.
4631                                                                                           - Ensures that all
4632                                                                                             global memory
4633                                                                                             operations have
4634                                                                                             completed before
4635                                                                                             performing the
4636                                                                                             store that is being
4637                                                                                             released.
4638
4639                                                         1. ds_atomic                    2. ds_atomic
4640                                                         2. s_waitcnt lgkmcnt(0)         3. s_waitcnt lgkmcnt(0)
4641
4642                                                           - If OpenCL, omit.              - If OpenCL, omit.
4643                                                           - Must happen before            - Must happen before
4644                                                             any following                   the following
4645                                                             global/generic                  buffer_gl0_inv.
4646                                                             load/load
4647                                                             atomic/store/store
4648                                                             atomic/atomicrmw.
4649                                                           - Ensures any                   - Ensures any
4650                                                             following global                following global
4651                                                             data read is no                 data read is no
4652                                                             older than the load             older than the load
4653                                                             atomic value being              atomic value being
4654                                                             acquired.                       acquired.
4655
4656                                                                                         4. buffer_gl0_inv
4657
4658                                                                                           - If CU wavefront execution mode, omit.
4659                                                                                           - If OpenCL omit.
4660                                                                                           - Ensures that
4661                                                                                             following
4662                                                                                             loads will not see
4663                                                                                             stale data.
4664
4665     atomicrmw    acq_rel      - workgroup    - generic  1. s_waitcnt lgkmcnt(0)         1. s_waitcnt lgkmcnt(0) &
4666                                                                                            vmcnt(0) & vscnt(0)
4667
4668                                                                                           - If CU wavefront execution mode, omit vmcnt and
4669                                                                                             vscnt.
4670                                                           - If OpenCL, omit.              - If OpenCL, omit
4671                                                                                             waitcnt lgkmcnt(0).
4672                                                           - Must happen after
4673                                                             any preceding
4674                                                             local/generic
4675                                                             load/store/load
4676                                                             atomic/store
4677                                                             atomic/atomicrmw.
4678                                                                                           - Could be split into
4679                                                                                             separate s_waitcnt
4680                                                                                             vmcnt(0), s_waitcnt
4681                                                                                             vscnt(0) and s_waitcnt
4682                                                                                             lgkmcnt(0) to allow
4683                                                                                             them to be
4684                                                                                             independently moved
4685                                                                                             according to the
4686                                                                                             following rules.
4687                                                                                           - s_waitcnt vmcnt(0)
4688                                                                                             must happen after
4689                                                                                             any preceding
4690                                                                                             global/generic load/load
4691                                                                                             atomic/
4692                                                                                             atomicrmw-with-return-value.
4693                                                                                           - s_waitcnt vscnt(0)
4694                                                                                             must happen after
4695                                                                                             any preceding
4696                                                                                             global/generic
4697                                                                                             store/store
4698                                                                                             atomic/
4699                                                                                             atomicrmw-no-return-value.
4700                                                                                           - s_waitcnt lgkmcnt(0)
4701                                                                                             must happen after
4702                                                                                             any preceding
4703                                                                                             local/generic load/store/load
4704                                                                                             atomic/store atomic/atomicrmw.
4705                                                           - Must happen before            - Must happen before
4706                                                             the following                   the following
4707                                                             atomicrmw.                      atomicrmw.
4708                                                           - Ensures that all              - Ensures that all
4709                                                             memory operations               memory operations
4710                                                             to local have                   have
4711                                                             completed before                completed before
4712                                                             performing the                  performing the
4713                                                             atomicrmw that is               atomicrmw that is
4714                                                             being released.                 being released.
4715
4716                                                         2. flat_atomic                  2. flat_atomic
4717                                                         3. s_waitcnt lgkmcnt(0)         3. s_waitcnt lgkmcnt(0) &
4718                                                                                            vm/vscnt(0)
4719
4720                                                                                           - If CU wavefront execution mode, omit vm/vscnt.
4721                                                           - If OpenCL, omit.              - If OpenCL, omit
4722                                                                                             waitcnt lgkmcnt(0).
4723                                                           - Must happen before            - Must happen before
4724                                                             any following                   the following
4725                                                             global/generic                  buffer_gl0_inv.
4726                                                             load/load
4727                                                             atomic/store/store
4728                                                             atomic/atomicrmw.
4729                                                           - Ensures any                   - Ensures any
4730                                                             following global                following global
4731                                                             data read is no                 data read is no
4732                                                             older than the load             older than the load
4733                                                             atomic value being              atomic value being
4734                                                             acquired.                       acquired.
4735
4736                                                                                         3. buffer_gl0_inv
4737
4738                                                                                           - If CU wavefront execution mode, omit.
4739                                                                                           - Ensures that
4740                                                                                             following
4741                                                                                             loads will not see
4742                                                                                             stale data.
4743
4744     atomicrmw    acq_rel      - agent        - global   1. s_waitcnt lgkmcnt(0) &       1. s_waitcnt lgkmcnt(0) &
4745                               - system                     vmcnt(0)                        vmcnt(0) & vscnt(0)
4746
4747                                                           - If OpenCL, omit               - If OpenCL, omit
4748                                                             lgkmcnt(0).                     lgkmcnt(0).
4749                                                           - Could be split into           - Could be split into
4750                                                             separate s_waitcnt              separate s_waitcnt
4751                                                             vmcnt(0) and                    vmcnt(0), s_waitcnt
4752                                                             s_waitcnt                       vscnt(0) and s_waitcnt
4753                                                             lgkmcnt(0) to allow             lgkmcnt(0) to allow
4754                                                             them to be                      them to be
4755                                                             independently moved             independently moved
4756                                                             according to the                according to the
4757                                                             following rules.                following rules.
4758                                                           - s_waitcnt vmcnt(0)            - s_waitcnt vmcnt(0)
4759                                                             must happen after               must happen after
4760                                                             any preceding                   any preceding
4761                                                             global/generic                  global/generic
4762                                                             load/store/load                 load/load atomic/
4763                                                             atomic/store                    atomicrmw-with-return-value.
4764                                                             atomic/atomicrmw.
4765                                                                                           - s_waitcnt vscnt(0)
4766                                                                                             must happen after
4767                                                                                             any preceding
4768                                                                                             global/generic
4769                                                                                             store/store atomic/
4770                                                                                             atomicrmw-no-return-value.
4771                                                           - s_waitcnt lgkmcnt(0)          - s_waitcnt lgkmcnt(0)
4772                                                             must happen after               must happen after
4773                                                             any preceding                   any preceding
4774                                                             local/generic                   local/generic
4775                                                             load/store/load                 load/store/load
4776                                                             atomic/store                    atomic/store
4777                                                             atomic/atomicrmw.               atomic/atomicrmw.
4778                                                           - Must happen before            - Must happen before
4779                                                             the following                   the following
4780                                                             atomicrmw.                      atomicrmw.
4781                                                           - Ensures that all              - Ensures that all
4782                                                             memory operations               memory operations
4783                                                             to global have                  to global have
4784                                                             completed before                completed before
4785                                                             performing the                  performing the
4786                                                             atomicrmw that is               atomicrmw that is
4787                                                             being released.                 being released.
4788
4789                                                         2. buffer/global/flat_atomic    2. buffer/global_atomic
4790                                                         3. s_waitcnt vmcnt(0)           3. s_waitcnt vm/vscnt(0)
4791
4792                                                                                           - Use vmcnt if atomic with
4793                                                                                             return and vscnt if atomic
4794                                                                                             with no-return.
4795                                                                                             waitcnt lgkmcnt(0).
4796                                                           - Must happen before            - Must happen before
4797                                                             following                       following
4798                                                             buffer_wbinvl1_vol.             buffer_gl*_inv.
4799                                                           - Ensures the                   - Ensures the
4800                                                             atomicrmw has                   atomicrmw has
4801                                                             completed before                completed before
4802                                                             invalidating the                invalidating the
4803                                                             cache.                          caches.
4804
4805                                                         4. buffer_wbinvl1_vol           4. buffer_gl0_inv;
4806                                                                                            buffer_gl1_inv
4807
4808                                                           - Must happen before            - Must happen before
4809                                                             any following                   any following
4810                                                             global/generic                  global/generic
4811                                                             load/load                       load/load
4812                                                             atomic/atomicrmw.               atomic/atomicrmw.
4813                                                           - Ensures that                  - Ensures that
4814                                                             following loads                 following loads
4815                                                             will not see stale              will not see stale
4816                                                             global data.                    global data.
4817
4818     atomicrmw    acq_rel      - agent        - generic  1. s_waitcnt lgkmcnt(0) &       1. s_waitcnt lgkmcnt(0) &
4819                               - system                     vmcnt(0)                        vmcnt(0) & vscnt(0)
4820
4821                                                           - If OpenCL, omit               - If OpenCL, omit
4822                                                             lgkmcnt(0).                     lgkmcnt(0).
4823                                                           - Could be split into           - Could be split into
4824                                                             separate s_waitcnt              separate s_waitcnt
4825                                                             vmcnt(0) and                    vmcnt(0), s_waitcnt
4826                                                             s_waitcnt                       vscnt(0) and s_waitcnt
4827                                                             lgkmcnt(0) to allow             lgkmcnt(0) to allow
4828                                                             them to be                      them to be
4829                                                             independently moved             independently moved
4830                                                             according to the                according to the
4831                                                             following rules.                following rules.
4832                                                           - s_waitcnt vmcnt(0)            - s_waitcnt vmcnt(0)
4833                                                             must happen after               must happen after
4834                                                             any preceding                   any preceding
4835                                                             global/generic                  global/generic
4836                                                             load/store/load                 load/load atomic
4837                                                             atomic/store                    atomicrmw-with-return-value.
4838                                                             atomic/atomicrmw.
4839                                                                                           - s_waitcnt vscnt(0)
4840                                                                                             must happen after
4841                                                                                             any preceding
4842                                                                                             global/generic
4843                                                                                             store/store atomic/
4844                                                                                             atomicrmw-no-return-value.
4845                                                           - s_waitcnt lgkmcnt(0)          - s_waitcnt lgkmcnt(0)
4846                                                             must happen after               must happen after
4847                                                             any preceding                   any preceding
4848                                                             local/generic                   local/generic
4849                                                             load/store/load                 load/store/load
4850                                                             atomic/store                    atomic/store
4851                                                             atomic/atomicrmw.               atomic/atomicrmw.
4852                                                           - Must happen before            - Must happen before
4853                                                             the following                   the following
4854                                                             atomicrmw.                      atomicrmw.
4855                                                           - Ensures that all              - Ensures that all
4856                                                             memory operations               memory operations
4857                                                             to global have                  have
4858                                                             completed before                completed before
4859                                                             performing the                  performing the
4860                                                             atomicrmw that is               atomicrmw that is
4861                                                             being released.                 being released.
4862
4863                                                         2. flat_atomic                  2. flat_atomic
4864                                                         3. s_waitcnt vmcnt(0) &         3. s_waitcnt vm/vscnt(0) &
4865                                                            lgkmcnt(0)                      lgkmcnt(0)
4866
4867                                                           - If OpenCL, omit               - If OpenCL, omit
4868                                                             lgkmcnt(0).                     lgkmcnt(0).
4869                                                                                           - Use vmcnt if atomic with
4870                                                                                             return and vscnt if atomic
4871                                                                                             with no-return.
4872                                                           - Must happen before            - Must happen before
4873                                                             following                       following
4874                                                             buffer_wbinvl1_vol.             buffer_gl*_inv.
4875                                                           - Ensures the                   - Ensures the
4876                                                             atomicrmw has                   atomicrmw has
4877                                                             completed before                completed before
4878                                                             invalidating the                invalidating the
4879                                                             cache.                          caches.
4880
4881                                                         4. buffer_wbinvl1_vol           4. buffer_gl0_inv;
4882                                                                                            buffer_gl1_inv
4883
4884                                                           - Must happen before            - Must happen before
4885                                                             any following                   any following
4886                                                             global/generic                  global/generic
4887                                                             load/load                       load/load
4888                                                             atomic/atomicrmw.               atomic/atomicrmw.
4889                                                           - Ensures that                  - Ensures that
4890                                                             following loads                 following loads
4891                                                             will not see stale              will not see stale
4892                                                             global data.                    global data.
4893
4894     fence        acq_rel      - singlethread *none*     *none*                          *none*
4895                               - wavefront
4896     fence        acq_rel      - workgroup    *none*     1. s_waitcnt lgkmcnt(0)         1. s_waitcnt lgkmcnt(0) &
4897                                                                                            vmcnt(0) & vscnt(0)
4898
4899                                                                                           - If CU wavefront execution mode, omit vmcnt and
4900                                                                                             vscnt.
4901                                                           - If OpenCL and                 - If OpenCL and
4902                                                             address space is                address space is
4903                                                             not generic, omit.              not generic, omit
4904                                                                                             lgkmcnt(0).
4905                                                                                           - If OpenCL and
4906                                                                                             address space is
4907                                                                                             local, omit
4908                                                                                             vmcnt(0) and vscnt(0).
4909                                                           - However,                      - However,
4910                                                             since LLVM                      since LLVM
4911                                                             currently has no                currently has no
4912                                                             address space on                address space on
4913                                                             the fence need to               the fence need to
4914                                                             conservatively                  conservatively
4915                                                             always generate                 always generate
4916                                                             (see comment for                (see comment for
4917                                                             previous fence).                previous fence).
4918                                                           - Must happen after
4919                                                             any preceding
4920                                                             local/generic
4921                                                             load/load
4922                                                             atomic/store/store
4923                                                             atomic/atomicrmw.
4924                                                                                           - Could be split into
4925                                                                                             separate s_waitcnt
4926                                                                                             vmcnt(0), s_waitcnt
4927                                                                                             vscnt(0) and s_waitcnt
4928                                                                                             lgkmcnt(0) to allow
4929                                                                                             them to be
4930                                                                                             independently moved
4931                                                                                             according to the
4932                                                                                             following rules.
4933                                                                                           - s_waitcnt vmcnt(0)
4934                                                                                             must happen after
4935                                                                                             any preceding
4936                                                                                             global/generic
4937                                                                                             load/load
4938                                                                                             atomic/
4939                                                                                             atomicrmw-with-return-value.
4940                                                                                           - s_waitcnt vscnt(0)
4941                                                                                             must happen after
4942                                                                                             any preceding
4943                                                                                             global/generic
4944                                                                                             store/store atomic/
4945                                                                                             atomicrmw-no-return-value.
4946                                                                                           - s_waitcnt lgkmcnt(0)
4947                                                                                             must happen after
4948                                                                                             any preceding
4949                                                                                             local/generic
4950                                                                                             load/store/load
4951                                                                                             atomic/store atomic/
4952                                                                                             atomicrmw.
4953                                                           - Must happen before            - Must happen before
4954                                                             any following                   any following
4955                                                             global/generic                  global/generic
4956                                                             load/load                       load/load
4957                                                             atomic/store/store              atomic/store/store
4958                                                             atomic/atomicrmw.               atomic/atomicrmw.
4959                                                           - Ensures that all              - Ensures that all
4960                                                             memory operations               memory operations
4961                                                             to local have                   have
4962                                                             completed before                completed before
4963                                                             performing any                  performing any
4964                                                             following global                following global
4965                                                             memory operations.              memory operations.
4966                                                           - Ensures that the              - Ensures that the
4967                                                             preceding                       preceding
4968                                                             local/generic load              local/generic load
4969                                                             atomic/atomicrmw                atomic/atomicrmw
4970                                                             with an equal or                with an equal or
4971                                                             wider sync scope                wider sync scope
4972                                                             and memory ordering             and memory ordering
4973                                                             stronger than                   stronger than
4974                                                             unordered (this is              unordered (this is
4975                                                             termed the                      termed the
4976                                                             acquire-fence-paired-atomic     acquire-fence-paired-atomic
4977                                                             ) has completed                 ) has completed
4978                                                             before following                before following
4979                                                             global memory                   global memory
4980                                                             operations. This                operations. This
4981                                                             satisfies the                   satisfies the
4982                                                             requirements of                 requirements of
4983                                                             acquire.                        acquire.
4984                                                           - Ensures that all              - Ensures that all
4985                                                             previous memory                 previous memory
4986                                                             operations have                 operations have
4987                                                             completed before a              completed before a
4988                                                             following                       following
4989                                                             local/generic store             local/generic store
4990                                                             atomic/atomicrmw                atomic/atomicrmw
4991                                                             with an equal or                with an equal or
4992                                                             wider sync scope                wider sync scope
4993                                                             and memory ordering             and memory ordering
4994                                                             stronger than                   stronger than
4995                                                             unordered (this is              unordered (this is
4996                                                             termed the                      termed the
4997                                                             release-fence-paired-atomic     release-fence-paired-atomic
4998                                                             ). This satisfies the           ). This satisfies the
4999                                                             requirements of                 requirements of
5000                                                             release.                        release.
5001                                                                                           - Must happen before
5002                                                                                             the following
5003                                                                                             buffer_gl0_inv.
5004                                                                                           - Ensures that the
5005                                                                                             acquire-fence-paired
5006                                                                                             atomic has completed
5007                                                                                             before invalidating
5008                                                                                             the
5009                                                                                             cache. Therefore
5010                                                                                             any following
5011                                                                                             locations read must
5012                                                                                             be no older than
5013                                                                                             the value read by
5014                                                                                             the
5015                                                                                             acquire-fence-paired-atomic.
5016
5017                                                                                         3. buffer_gl0_inv
5018
5019                                                                                           - If CU wavefront execution mode, omit.
5020                                                                                           - Ensures that
5021                                                                                             following
5022                                                                                             loads will not see
5023                                                                                             stale data.
5024
5025     fence        acq_rel      - agent        *none*     1. s_waitcnt lgkmcnt(0) &       1. s_waitcnt lgkmcnt(0) &
5026                               - system                     vmcnt(0)                        vmcnt(0) & vscnt(0)
5027
5028                                                           - If OpenCL and                 - If OpenCL and
5029                                                             address space is                address space is
5030                                                             not generic, omit               not generic, omit
5031                                                             lgkmcnt(0).                     lgkmcnt(0).
5032                                                                                           - If OpenCL and
5033                                                                                             address space is
5034                                                                                             local, omit
5035                                                                                             vmcnt(0) and vscnt(0).
5036                                                           - However, since LLVM           - However, since LLVM
5037                                                             currently has no                currently has no
5038                                                             address space on                address space on
5039                                                             the fence need to               the fence need to
5040                                                             conservatively                  conservatively
5041                                                             always generate                 always generate
5042                                                             (see comment for                (see comment for
5043                                                             previous fence).                previous fence).
5044                                                           - Could be split into           - Could be split into
5045                                                             separate s_waitcnt              separate s_waitcnt
5046                                                             vmcnt(0) and                    vmcnt(0), s_waitcnt
5047                                                             s_waitcnt                       vscnt(0) and s_waitcnt
5048                                                             lgkmcnt(0) to allow             lgkmcnt(0) to allow
5049                                                             them to be                      them to be
5050                                                             independently moved             independently moved
5051                                                             according to the                according to the
5052                                                             following rules.                following rules.
5053                                                           - s_waitcnt vmcnt(0)            - s_waitcnt vmcnt(0)
5054                                                             must happen after               must happen after
5055                                                             any preceding                   any preceding
5056                                                             global/generic                  global/generic
5057                                                             load/store/load                 load/load
5058                                                             atomic/store                    atomic/
5059                                                             atomic/atomicrmw.               atomicrmw-with-return-value.
5060                                                                                           - s_waitcnt vscnt(0)
5061                                                                                             must happen after
5062                                                                                             any preceding
5063                                                                                             global/generic
5064                                                                                             store/store atomic/
5065                                                                                             atomicrmw-no-return-value.
5066                                                           - s_waitcnt lgkmcnt(0)          - s_waitcnt lgkmcnt(0)
5067                                                             must happen after               must happen after
5068                                                             any preceding                   any preceding
5069                                                             local/generic                   local/generic
5070                                                             load/store/load                 load/store/load
5071                                                             atomic/store                    atomic/store
5072                                                             atomic/atomicrmw.               atomic/atomicrmw.
5073                                                           - Must happen before            - Must happen before
5074                                                             the following                   the following
5075                                                             buffer_wbinvl1_vol.             buffer_gl*_inv.
5076                                                           - Ensures that the              - Ensures that the
5077                                                             preceding                       preceding
5078                                                             global/local/generic            global/local/generic
5079                                                             load                            load
5080                                                             atomic/atomicrmw                atomic/atomicrmw
5081                                                             with an equal or                with an equal or
5082                                                             wider sync scope                wider sync scope
5083                                                             and memory ordering             and memory ordering
5084                                                             stronger than                   stronger than
5085                                                             unordered (this is              unordered (this is
5086                                                             termed the                      termed the
5087                                                             acquire-fence-paired-atomic     acquire-fence-paired-atomic
5088                                                             ) has completed                 ) has completed
5089                                                             before invalidating             before invalidating
5090                                                             the cache. This                 the caches. This
5091                                                             satisfies the                   satisfies the
5092                                                             requirements of                 requirements of
5093                                                             acquire.                        acquire.
5094                                                           - Ensures that all              - Ensures that all
5095                                                             previous memory                 previous memory
5096                                                             operations have                 operations have
5097                                                             completed before a              completed before a
5098                                                             following                       following
5099                                                             global/local/generic            global/local/generic
5100                                                             store                           store
5101                                                             atomic/atomicrmw                atomic/atomicrmw
5102                                                             with an equal or                with an equal or
5103                                                             wider sync scope                wider sync scope
5104                                                             and memory ordering             and memory ordering
5105                                                             stronger than                   stronger than
5106                                                             unordered (this is              unordered (this is
5107                                                             termed the                      termed the
5108                                                             release-fence-paired-atomic     release-fence-paired-atomic
5109                                                             ). This satisfies the           ). This satisfies the
5110                                                             requirements of                 requirements of
5111                                                             release.                        release.
5112
5113                                                         2. buffer_wbinvl1_vol           2. buffer_gl0_inv;
5114                                                                                            buffer_gl1_inv
5115
5116                                                           - Must happen before            - Must happen before
5117                                                             any following                   any following
5118                                                             global/generic                  global/generic
5119                                                             load/load                       load/load
5120                                                             atomic/store/store              atomic/store/store
5121                                                             atomic/atomicrmw.               atomic/atomicrmw.
5122                                                           - Ensures that                  - Ensures that
5123                                                             following loads                 following loads
5124                                                             will not see stale              will not see stale
5125                                                             global data. This               global data. This
5126                                                             satisfies the                   satisfies the
5127                                                             requirements of                 requirements of
5128                                                             acquire.                        acquire.
5129
5130     **Sequential Consistent Atomic**
5131     ----------------------------------------------------------------------------------------------------------------------
5132     load atomic  seq_cst      - singlethread - global   *Same as corresponding          *Same as corresponding
5133                               - wavefront    - local    load atomic acquire,            load atomic acquire,
5134                                              - generic  except must generated           except must generated
5135                                                         all instructions even           all instructions even
5136                                                         for OpenCL.*                    for OpenCL.*
5137     load atomic  seq_cst      - workgroup    - global   1. s_waitcnt lgkmcnt(0)         1. s_waitcnt lgkmcnt(0) &
5138                                              - generic                                     vmcnt(0) & vscnt(0)
5139
5140                                                                                           - If CU wavefront execution mode, omit vmcnt and
5141                                                                                             vscnt.
5142                                                                                           - Could be split into
5143                                                                                             separate s_waitcnt
5144                                                                                             vmcnt(0), s_waitcnt
5145                                                                                             vscnt(0) and s_waitcnt
5146                                                                                             lgkmcnt(0) to allow
5147                                                                                             them to be
5148                                                                                             independently moved
5149                                                                                             according to the
5150                                                                                             following rules.
5151                                                           - Must                          - waitcnt lgkmcnt(0) must
5152                                                             happen after                    happen after
5153                                                             preceding                       preceding
5154                                                             global/generic load             local load
5155                                                             atomic/store                    atomic/store
5156                                                             atomic/atomicrmw                atomic/atomicrmw
5157                                                             with memory                     with memory
5158                                                             ordering of seq_cst             ordering of seq_cst
5159                                                             and with equal or               and with equal or
5160                                                             wider sync scope.               wider sync scope.
5161                                                             (Note that seq_cst              (Note that seq_cst
5162                                                             fences have their               fences have their
5163                                                             own s_waitcnt                   own s_waitcnt
5164                                                             lgkmcnt(0) and so do            lgkmcnt(0) and so do
5165                                                             not need to be                  not need to be
5166                                                             considered.)                    considered.)
5167                                                                                           - waitcnt vmcnt(0)
5168                                                                                             Must happen after
5169                                                                                             preceding
5170                                                                                             global/generic load
5171                                                                                             atomic/
5172                                                                                             atomicrmw-with-return-value
5173                                                                                             with memory
5174                                                                                             ordering of seq_cst
5175                                                                                             and with equal or
5176                                                                                             wider sync scope.
5177                                                                                             (Note that seq_cst
5178                                                                                             fences have their
5179                                                                                             own s_waitcnt
5180                                                                                             vmcnt(0) and so do
5181                                                                                             not need to be
5182                                                                                             considered.)
5183                                                                                           - waitcnt vscnt(0)
5184                                                                                             Must happen after
5185                                                                                             preceding
5186                                                                                             global/generic store
5187                                                                                             atomic/
5188                                                                                             atomicrmw-no-return-value
5189                                                                                             with memory
5190                                                                                             ordering of seq_cst
5191                                                                                             and with equal or
5192                                                                                             wider sync scope.
5193                                                                                             (Note that seq_cst
5194                                                                                             fences have their
5195                                                                                             own s_waitcnt
5196                                                                                             vscnt(0) and so do
5197                                                                                             not need to be
5198                                                                                             considered.)
5199                                                           - Ensures any                   - Ensures any
5200                                                             preceding                       preceding
5201                                                             sequential                      sequential
5202                                                             consistent local                consistent global/local
5203                                                             memory instructions             memory instructions
5204                                                             have completed                  have completed
5205                                                             before executing                before executing
5206                                                             this sequentially               this sequentially
5207                                                             consistent                      consistent
5208                                                             instruction. This               instruction. This
5209                                                             prevents reordering             prevents reordering
5210                                                             a seq_cst store                 a seq_cst store
5211                                                             followed by a                   followed by a
5212                                                             seq_cst load. (Note             seq_cst load. (Note
5213                                                             that seq_cst is                 that seq_cst is
5214                                                             stronger than                   stronger than
5215                                                             acquire/release as              acquire/release as
5216                                                             the reordering of               the reordering of
5217                                                             load acquire                    load acquire
5218                                                             followed by a store             followed by a store
5219                                                             release is                      release is
5220                                                             prevented by the                prevented by the
5221                                                             waitcnt of                      waitcnt of
5222                                                             the release, but                the release, but
5223                                                             there is nothing                there is nothing
5224                                                             preventing a store              preventing a store
5225                                                             release followed by             release followed by
5226                                                             load acquire from               load acquire from
5227                                                             competing out of                competing out of
5228                                                             order.)                         order.)
5229
5230                                                         2. *Following                   2. *Following
5231                                                            instructions same as            instructions same as
5232                                                            corresponding load              corresponding load
5233                                                            atomic acquire,                 atomic acquire,
5234                                                            except must generated           except must generated
5235                                                            all instructions even           all instructions even
5236                                                            for OpenCL.*                    for OpenCL.*
5237     load atomic  seq_cst      - workgroup    - local    *Same as corresponding
5238                                                         load atomic acquire,
5239                                                         except must generated
5240                                                         all instructions even
5241                                                         for OpenCL.*
5242
5243                                                                                         1. s_waitcnt vmcnt(0) & vscnt(0)
5244
5245                                                                                           - If CU wavefront execution mode, omit.
5246                                                                                           - Could be split into
5247                                                                                             separate s_waitcnt
5248                                                                                             vmcnt(0) and s_waitcnt
5249                                                                                             vscnt(0) to allow
5250                                                                                             them to be
5251                                                                                             independently moved
5252                                                                                             according to the
5253                                                                                             following rules.
5254                                                                                           - waitcnt vmcnt(0)
5255                                                                                             Must happen after
5256                                                                                             preceding
5257                                                                                             global/generic load
5258                                                                                             atomic/
5259                                                                                             atomicrmw-with-return-value
5260                                                                                             with memory
5261                                                                                             ordering of seq_cst
5262                                                                                             and with equal or
5263                                                                                             wider sync scope.
5264                                                                                             (Note that seq_cst
5265                                                                                             fences have their
5266                                                                                             own s_waitcnt
5267                                                                                             vmcnt(0) and so do
5268                                                                                             not need to be
5269                                                                                             considered.)
5270                                                                                           - waitcnt vscnt(0)
5271                                                                                             Must happen after
5272                                                                                             preceding
5273                                                                                             global/generic store
5274                                                                                             atomic/
5275                                                                                             atomicrmw-no-return-value
5276                                                                                             with memory
5277                                                                                             ordering of seq_cst
5278                                                                                             and with equal or
5279                                                                                             wider sync scope.
5280                                                                                             (Note that seq_cst
5281                                                                                             fences have their
5282                                                                                             own s_waitcnt
5283                                                                                             vscnt(0) and so do
5284                                                                                             not need to be
5285                                                                                             considered.)
5286                                                                                           - Ensures any
5287                                                                                             preceding
5288                                                                                             sequential
5289                                                                                             consistent global
5290                                                                                             memory instructions
5291                                                                                             have completed
5292                                                                                             before executing
5293                                                                                             this sequentially
5294                                                                                             consistent
5295                                                                                             instruction. This
5296                                                                                             prevents reordering
5297                                                                                             a seq_cst store
5298                                                                                             followed by a
5299                                                                                             seq_cst load. (Note
5300                                                                                             that seq_cst is
5301                                                                                             stronger than
5302                                                                                             acquire/release as
5303                                                                                             the reordering of
5304                                                                                             load acquire
5305                                                                                             followed by a store
5306                                                                                             release is
5307                                                                                             prevented by the
5308                                                                                             waitcnt of
5309                                                                                             the release, but
5310                                                                                             there is nothing
5311                                                                                             preventing a store
5312                                                                                             release followed by
5313                                                                                             load acquire from
5314                                                                                             competing out of
5315                                                                                             order.)
5316
5317                                                                                         2. *Following
5318                                                                                            instructions same as
5319                                                                                            corresponding load
5320                                                                                            atomic acquire,
5321                                                                                            except must generated
5322                                                                                            all instructions even
5323                                                                                            for OpenCL.*
5324
5325     load atomic  seq_cst      - agent        - global   1. s_waitcnt lgkmcnt(0) &       1. s_waitcnt lgkmcnt(0) &
5326                               - system       - generic     vmcnt(0)                        vmcnt(0) & vscnt(0)
5327
5328                                                           - Could be split into           - Could be split into
5329                                                             separate s_waitcnt              separate s_waitcnt
5330                                                             vmcnt(0)                        vmcnt(0), s_waitcnt
5331                                                             and s_waitcnt                   vscnt(0) and s_waitcnt
5332                                                             lgkmcnt(0) to allow             lgkmcnt(0) to allow
5333                                                             them to be                      them to be
5334                                                             independently moved             independently moved
5335                                                             according to the                according to the
5336                                                             following rules.                following rules.
5337                                                           - waitcnt lgkmcnt(0)            - waitcnt lgkmcnt(0)
5338                                                             must happen after               must happen after
5339                                                             preceding                       preceding
5340                                                             global/generic load             local load
5341                                                             atomic/store                    atomic/store
5342                                                             atomic/atomicrmw                atomic/atomicrmw
5343                                                             with memory                     with memory
5344                                                             ordering of seq_cst             ordering of seq_cst
5345                                                             and with equal or               and with equal or
5346                                                             wider sync scope.               wider sync scope.
5347                                                             (Note that seq_cst              (Note that seq_cst
5348                                                             fences have their               fences have their
5349                                                             own s_waitcnt                   own s_waitcnt
5350                                                             lgkmcnt(0) and so do            lgkmcnt(0) and so do
5351                                                             not need to be                  not need to be
5352                                                             considered.)                    considered.)
5353                                                           - waitcnt vmcnt(0)              - waitcnt vmcnt(0)
5354                                                             must happen after               must happen after
5355                                                             preceding                       preceding
5356                                                             global/generic load             global/generic load
5357                                                             atomic/store                    atomic/
5358                                                             atomic/atomicrmw                atomicrmw-with-return-value
5359                                                             with memory                     with memory
5360                                                             ordering of seq_cst             ordering of seq_cst
5361                                                             and with equal or               and with equal or
5362                                                             wider sync scope.               wider sync scope.
5363                                                             (Note that seq_cst              (Note that seq_cst
5364                                                             fences have their               fences have their
5365                                                             own s_waitcnt                   own s_waitcnt
5366                                                             vmcnt(0) and so do              vmcnt(0) and so do
5367                                                             not need to be                  not need to be
5368                                                             considered.)                    considered.)
5369                                                                                           - waitcnt vscnt(0)
5370                                                                                             Must happen after
5371                                                                                             preceding
5372                                                                                             global/generic store
5373                                                                                             atomic/
5374                                                                                             atomicrmw-no-return-value
5375                                                                                             with memory
5376                                                                                             ordering of seq_cst
5377                                                                                             and with equal or
5378                                                                                             wider sync scope.
5379                                                                                             (Note that seq_cst
5380                                                                                             fences have their
5381                                                                                             own s_waitcnt
5382                                                                                             vscnt(0) and so do
5383                                                                                             not need to be
5384                                                                                             considered.)
5385                                                           - Ensures any                   - Ensures any
5386                                                             preceding                       preceding
5387                                                             sequential                      sequential
5388                                                             consistent global               consistent global
5389                                                             memory instructions             memory instructions
5390                                                             have completed                  have completed
5391                                                             before executing                before executing
5392                                                             this sequentially               this sequentially
5393                                                             consistent                      consistent
5394                                                             instruction. This               instruction. This
5395                                                             prevents reordering             prevents reordering
5396                                                             a seq_cst store                 a seq_cst store
5397                                                             followed by a                   followed by a
5398                                                             seq_cst load. (Note             seq_cst load. (Note
5399                                                             that seq_cst is                 that seq_cst is
5400                                                             stronger than                   stronger than
5401                                                             acquire/release as              acquire/release as
5402                                                             the reordering of               the reordering of
5403                                                             load acquire                    load acquire
5404                                                             followed by a store             followed by a store
5405                                                             release is                      release is
5406                                                             prevented by the                prevented by the
5407                                                             waitcnt of                      waitcnt of
5408                                                             the release, but                the release, but
5409                                                             there is nothing                there is nothing
5410                                                             preventing a store              preventing a store
5411                                                             release followed by             release followed by
5412                                                             load acquire from               load acquire from
5413                                                             competing out of                competing out of
5414                                                             order.)                         order.)
5415
5416                                                         2. *Following                   2. *Following
5417                                                            instructions same as            instructions same as
5418                                                            corresponding load              corresponding load
5419                                                            atomic acquire,                 atomic acquire,
5420                                                            except must generated           except must generated
5421                                                            all instructions even           all instructions even
5422                                                            for OpenCL.*                    for OpenCL.*
5423     store atomic seq_cst      - singlethread - global   *Same as corresponding          *Same as corresponding
5424                               - wavefront    - local    store atomic release,           store atomic release,
5425                               - workgroup    - generic  except must generated           except must generated
5426                                                         all instructions even           all instructions even
5427                                                         for OpenCL.*                    for OpenCL.*
5428     store atomic seq_cst      - agent        - global   *Same as corresponding          *Same as corresponding
5429                               - system       - generic  store atomic release,           store atomic release,
5430                                                         except must generated           except must generated
5431                                                         all instructions even           all instructions even
5432                                                         for OpenCL.*                    for OpenCL.*
5433     atomicrmw    seq_cst      - singlethread - global   *Same as corresponding          *Same as corresponding
5434                               - wavefront    - local    atomicrmw acq_rel,              atomicrmw acq_rel,
5435                               - workgroup    - generic  except must generated           except must generated
5436                                                         all instructions even           all instructions even
5437                                                         for OpenCL.*                    for OpenCL.*
5438     atomicrmw    seq_cst      - agent        - global   *Same as corresponding          *Same as corresponding
5439                               - system       - generic  atomicrmw acq_rel,              atomicrmw acq_rel,
5440                                                         except must generated           except must generated
5441                                                         all instructions even           all instructions even
5442                                                         for OpenCL.*                    for OpenCL.*
5443     fence        seq_cst      - singlethread *none*     *Same as corresponding          *Same as corresponding
5444                               - wavefront               fence acq_rel,                  fence acq_rel,
5445                               - workgroup               except must generated           except must generated
5446                               - agent                   all instructions even           all instructions even
5447                               - system                  for OpenCL.*                    for OpenCL.*
5448     ============ ============ ============== ========== =============================== ==================================
5449
5450The memory order also adds the single thread optimization constrains defined in
5451table
5452:ref:`amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-gfx6-gfx10-table`.
5453
5454  .. table:: AMDHSA Memory Model Single Thread Optimization Constraints GFX6-GFX10
5455     :name: amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-gfx6-gfx10-table
5456
5457     ============ ==============================================================
5458     LLVM Memory  Optimization Constraints
5459     Ordering
5460     ============ ==============================================================
5461     unordered    *none*
5462     monotonic    *none*
5463     acquire      - If a load atomic/atomicrmw then no following load/load
5464                    atomic/store/ store atomic/atomicrmw/fence instruction can
5465                    be moved before the acquire.
5466                  - If a fence then same as load atomic, plus no preceding
5467                    associated fence-paired-atomic can be moved after the fence.
5468     release      - If a store atomic/atomicrmw then no preceding load/load
5469                    atomic/store/ store atomic/atomicrmw/fence instruction can
5470                    be moved after the release.
5471                  - If a fence then same as store atomic, plus no following
5472                    associated fence-paired-atomic can be moved before the
5473                    fence.
5474     acq_rel      Same constraints as both acquire and release.
5475     seq_cst      - If a load atomic then same constraints as acquire, plus no
5476                    preceding sequentially consistent load atomic/store
5477                    atomic/atomicrmw/fence instruction can be moved after the
5478                    seq_cst.
5479                  - If a store atomic then the same constraints as release, plus
5480                    no following sequentially consistent load atomic/store
5481                    atomic/atomicrmw/fence instruction can be moved before the
5482                    seq_cst.
5483                  - If an atomicrmw/fence then same constraints as acq_rel.
5484     ============ ==============================================================
5485
5486Trap Handler ABI
5487~~~~~~~~~~~~~~~~
5488
5489For code objects generated by AMDGPU backend for HSA [HSA]_ compatible runtimes
5490(such as ROCm [AMD-ROCm]_), the runtime installs a trap handler that supports
5491the ``s_trap`` instruction with the following usage:
5492
5493  .. table:: AMDGPU Trap Handler for AMDHSA OS
5494     :name: amdgpu-trap-handler-for-amdhsa-os-table
5495
5496     =================== =============== =============== =======================
5497     Usage               Code Sequence   Trap Handler    Description
5498                                         Inputs
5499     =================== =============== =============== =======================
5500     reserved            ``s_trap 0x00``                 Reserved by hardware.
5501     ``debugtrap(arg)``  ``s_trap 0x01`` ``SGPR0-1``:    Reserved for HSA
5502                                           ``queue_ptr`` ``debugtrap``
5503                                         ``VGPR0``:      intrinsic (not
5504                                           ``arg``       implemented).
5505     ``llvm.trap``       ``s_trap 0x02`` ``SGPR0-1``:    Causes dispatch to be
5506                                           ``queue_ptr`` terminated and its
5507                                                         associated queue put
5508                                                         into the error state.
5509     ``llvm.debugtrap``  ``s_trap 0x03``                 - If debugger not
5510                                                           installed then
5511                                                           behaves as a
5512                                                           no-operation. The
5513                                                           trap handler is
5514                                                           entered and
5515                                                           immediately returns
5516                                                           to continue
5517                                                           execution of the
5518                                                           wavefront.
5519                                                         - If the debugger is
5520                                                           installed, causes
5521                                                           the debug trap to be
5522                                                           reported by the
5523                                                           debugger and the
5524                                                           wavefront is put in
5525                                                           the halt state until
5526                                                           resumed by the
5527                                                           debugger.
5528     reserved            ``s_trap 0x04``                 Reserved.
5529     reserved            ``s_trap 0x05``                 Reserved.
5530     reserved            ``s_trap 0x06``                 Reserved.
5531     debugger breakpoint ``s_trap 0x07``                 Reserved for debugger
5532                                                         breakpoints.
5533     reserved            ``s_trap 0x08``                 Reserved.
5534     reserved            ``s_trap 0xfe``                 Reserved.
5535     reserved            ``s_trap 0xff``                 Reserved.
5536     =================== =============== =============== =======================
5537
5538AMDPAL
5539------
5540
5541This section provides code conventions used when the target triple OS is
5542``amdpal`` (see :ref:`amdgpu-target-triples`) for passing runtime parameters
5543from the application/runtime to each invocation of a hardware shader. These
5544parameters include both generic, application-controlled parameters called
5545*user data* as well as system-generated parameters that are a product of the
5546draw or dispatch execution.
5547
5548User Data
5549~~~~~~~~~
5550
5551Each hardware stage has a set of 32-bit *user data registers* which can be
5552written from a command buffer and then loaded into SGPRs when waves are launched
5553via a subsequent dispatch or draw operation. This is the way most arguments are
5554passed from the application/runtime to a hardware shader.
5555
5556Compute User Data
5557~~~~~~~~~~~~~~~~~
5558
5559Compute shader user data mappings are simpler than graphics shaders, and have a
5560fixed mapping.
5561
5562Note that there are always 10 available *user data entries* in registers -
5563entries beyond that limit must be fetched from memory (via the spill table
5564pointer) by the shader.
5565
5566  .. table:: PAL Compute Shader User Data Registers
5567     :name: pal-compute-user-data-registers
5568
5569     ============= ================================
5570     User Register Description
5571     ============= ================================
5572     0             Global Internal Table (32-bit pointer)
5573     1             Per-Shader Internal Table (32-bit pointer)
5574     2 - 11        Application-Controlled User Data (10 32-bit values)
5575     12            Spill Table (32-bit pointer)
5576     13 - 14       Thread Group Count (64-bit pointer)
5577     15            GDS Range
5578     ============= ================================
5579
5580Graphics User Data
5581~~~~~~~~~~~~~~~~~~
5582
5583Graphics pipelines support a much more flexible user data mapping:
5584
5585  .. table:: PAL Graphics Shader User Data Registers
5586     :name: pal-graphics-user-data-registers
5587
5588     ============= ================================
5589     User Register Description
5590     ============= ================================
5591     0             Global Internal Table (32-bit pointer)
5592     +             Per-Shader Internal Table (32-bit pointer)
5593     + 1-15        Application Controlled User Data
5594                   (1-15 Contiguous 32-bit Values in Registers)
5595     +             Spill Table (32-bit pointer)
5596     +             Draw Index (First Stage Only)
5597     +             Vertex Offset (First Stage Only)
5598     +             Instance Offset (First Stage Only)
5599     ============= ================================
5600
5601  The placement of the global internal table remains fixed in the first *user
5602  data SGPR register*. Otherwise all parameters are optional, and can be mapped
5603  to any desired *user data SGPR register*, with the following regstrictions:
5604
5605  * Draw Index, Vertex Offset, and Instance Offset can only be used by the first
5606    activehardware stage in a graphics pipeline (i.e. where the API vertex
5607    shader runs).
5608
5609  * Application-controlled user data must be mapped into a contiguous range of
5610    user data registers.
5611
5612  * The application-controlled user data range supports compaction remapping, so
5613    only *entries* that are actually consumed by the shader must be assigned to
5614    corresponding *registers*. Note that in order to support an efficient runtime
5615    implementation, the remapping must pack *registers* in the same order as
5616    *entries*, with unused *entries* removed.
5617
5618.. _pal_global_internal_table:
5619
5620Global Internal Table
5621~~~~~~~~~~~~~~~~~~~~~
5622
5623The global internal table is a table of *shader resource descriptors* (SRDs) that
5624define how certain engine-wide, runtime-managed resources should be accessed
5625from a shader. The majority of these resources have HW-defined formats, and it
5626is up to the compiler to write/read data as required by the target hardware.
5627
5628The following table illustrates the required format:
5629
5630  .. table:: PAL Global Internal Table
5631     :name: pal-git-table
5632
5633     ============= ================================
5634     Offset        Description
5635     ============= ================================
5636     0-3           Graphics Scratch SRD
5637     4-7           Compute Scratch SRD
5638     8-11          ES/GS Ring Output SRD
5639     12-15         ES/GS Ring Input SRD
5640     16-19         GS/VS Ring Output #0
5641     20-23         GS/VS Ring Output #1
5642     24-27         GS/VS Ring Output #2
5643     28-31         GS/VS Ring Output #3
5644     32-35         GS/VS Ring Input SRD
5645     36-39         Tessellation Factor Buffer SRD
5646     40-43         Off-Chip LDS Buffer SRD
5647     44-47         Off-Chip Param Cache Buffer SRD
5648     48-51         Sample Position Buffer SRD
5649     52            vaRange::ShadowDescriptorTable High Bits
5650     ============= ================================
5651
5652  The pointer to the global internal table passed to the shader as user data
5653  is a 32-bit pointer. The top 32 bits should be assumed to be the same as
5654  the top 32 bits of the pipeline, so the shader may use the program
5655  counter's top 32 bits.
5656
5657Unspecified OS
5658--------------
5659
5660This section provides code conventions used when the target triple OS is
5661empty (see :ref:`amdgpu-target-triples`).
5662
5663Trap Handler ABI
5664~~~~~~~~~~~~~~~~
5665
5666For code objects generated by AMDGPU backend for non-amdhsa OS, the runtime does
5667not install a trap handler. The ``llvm.trap`` and ``llvm.debugtrap``
5668instructions are handled as follows:
5669
5670  .. table:: AMDGPU Trap Handler for Non-AMDHSA OS
5671     :name: amdgpu-trap-handler-for-non-amdhsa-os-table
5672
5673     =============== =============== ===========================================
5674     Usage           Code Sequence   Description
5675     =============== =============== ===========================================
5676     llvm.trap       s_endpgm        Causes wavefront to be terminated.
5677     llvm.debugtrap  *none*          Compiler warning given that there is no
5678                                     trap handler installed.
5679     =============== =============== ===========================================
5680
5681Source Languages
5682================
5683
5684.. _amdgpu-opencl:
5685
5686OpenCL
5687------
5688
5689When the language is OpenCL the following differences occur:
5690
56911. The OpenCL memory model is used (see :ref:`amdgpu-amdhsa-memory-model`).
56922. The AMDGPU backend appends additional arguments to the kernel's explicit
5693   arguments for the AMDHSA OS (see
5694   :ref:`opencl-kernel-implicit-arguments-appended-for-amdhsa-os-table`).
56953. Additional metadata is generated
5696   (see :ref:`amdgpu-amdhsa-code-object-metadata`).
5697
5698  .. table:: OpenCL kernel implicit arguments appended for AMDHSA OS
5699     :name: opencl-kernel-implicit-arguments-appended-for-amdhsa-os-table
5700
5701     ======== ==== ========= ===========================================
5702     Position Byte Byte      Description
5703              Size Alignment
5704     ======== ==== ========= ===========================================
5705     1        8    8         OpenCL Global Offset X
5706     2        8    8         OpenCL Global Offset Y
5707     3        8    8         OpenCL Global Offset Z
5708     4        8    8         OpenCL address of printf buffer
5709     5        8    8         OpenCL address of virtual queue used by
5710                             enqueue_kernel.
5711     6        8    8         OpenCL address of AqlWrap struct used by
5712                             enqueue_kernel.
5713     7        8    8         Pointer argument used for Multi-gird
5714                             synchronization.
5715     ======== ==== ========= ===========================================
5716
5717.. _amdgpu-hcc:
5718
5719HCC
5720---
5721
5722When the language is HCC the following differences occur:
5723
57241. The HSA memory model is used (see :ref:`amdgpu-amdhsa-memory-model`).
5725
5726.. _amdgpu-assembler:
5727
5728Assembler
5729---------
5730
5731AMDGPU backend has LLVM-MC based assembler which is currently in development.
5732It supports AMDGCN GFX6-GFX10.
5733
5734This section describes general syntax for instructions and operands.
5735
5736Instructions
5737~~~~~~~~~~~~
5738
5739.. toctree::
5740   :hidden:
5741
5742   AMDGPU/AMDGPUAsmGFX7
5743   AMDGPU/AMDGPUAsmGFX8
5744   AMDGPU/AMDGPUAsmGFX9
5745   AMDGPU/AMDGPUAsmGFX10
5746   AMDGPUModifierSyntax
5747   AMDGPUOperandSyntax
5748   AMDGPUInstructionSyntax
5749   AMDGPUInstructionNotation
5750
5751An instruction has the following :doc:`syntax<AMDGPUInstructionSyntax>`:
5752
5753    ``<``\ *opcode*\ ``>    <``\ *operand0*\ ``>, <``\ *operand1*\ ``>,...    <``\ *modifier0*\ ``> <``\ *modifier1*\ ``>...``
5754
5755:doc:`Operands<AMDGPUOperandSyntax>` are normally comma-separated while
5756:doc:`modifiers<AMDGPUModifierSyntax>` are space-separated.
5757
5758The order of *operands* and *modifiers* is fixed.
5759Most *modifiers* are optional and may be omitted.
5760
5761See detailed instruction syntax description for :doc:`GFX7<AMDGPU/AMDGPUAsmGFX7>`,
5762:doc:`GFX8<AMDGPU/AMDGPUAsmGFX8>`, :doc:`GFX9<AMDGPU/AMDGPUAsmGFX9>`
5763and :doc:`GFX10<AMDGPU/AMDGPUAsmGFX10>`.
5764
5765Note that features under development are not included in this description.
5766
5767For more information about instructions, their semantics and supported combinations of
5768operands, refer to one of instruction set architecture manuals
5769[AMD-GCN-GFX6]_, [AMD-GCN-GFX7]_, [AMD-GCN-GFX8]_, [AMD-GCN-GFX9]_ and
5770[AMD-GCN-GFX10]_.
5771
5772Operands
5773~~~~~~~~
5774
5775Detailed description of operands may be found :doc:`here<AMDGPUOperandSyntax>`.
5776
5777Modifiers
5778~~~~~~~~~
5779
5780Detailed description of modifiers may be found :doc:`here<AMDGPUModifierSyntax>`.
5781
5782Instruction Examples
5783~~~~~~~~~~~~~~~~~~~~
5784
5785DS
5786++
5787
5788.. code-block:: nasm
5789
5790  ds_add_u32 v2, v4 offset:16
5791  ds_write_src2_b64 v2 offset0:4 offset1:8
5792  ds_cmpst_f32 v2, v4, v6
5793  ds_min_rtn_f64 v[8:9], v2, v[4:5]
5794
5795
5796For full list of supported instructions, refer to "LDS/GDS instructions" in ISA Manual.
5797
5798FLAT
5799++++
5800
5801.. code-block:: nasm
5802
5803  flat_load_dword v1, v[3:4]
5804  flat_store_dwordx3 v[3:4], v[5:7]
5805  flat_atomic_swap v1, v[3:4], v5 glc
5806  flat_atomic_cmpswap v1, v[3:4], v[5:6] glc slc
5807  flat_atomic_fmax_x2 v[1:2], v[3:4], v[5:6] glc
5808
5809For full list of supported instructions, refer to "FLAT instructions" in ISA Manual.
5810
5811MUBUF
5812+++++
5813
5814.. code-block:: nasm
5815
5816  buffer_load_dword v1, off, s[4:7], s1
5817  buffer_store_dwordx4 v[1:4], v2, ttmp[4:7], s1 offen offset:4 glc tfe
5818  buffer_store_format_xy v[1:2], off, s[4:7], s1
5819  buffer_wbinvl1
5820  buffer_atomic_inc v1, v2, s[8:11], s4 idxen offset:4 slc
5821
5822For full list of supported instructions, refer to "MUBUF Instructions" in ISA Manual.
5823
5824SMRD/SMEM
5825+++++++++
5826
5827.. code-block:: nasm
5828
5829  s_load_dword s1, s[2:3], 0xfc
5830  s_load_dwordx8 s[8:15], s[2:3], s4
5831  s_load_dwordx16 s[88:103], s[2:3], s4
5832  s_dcache_inv_vol
5833  s_memtime s[4:5]
5834
5835For full list of supported instructions, refer to "Scalar Memory Operations" in ISA Manual.
5836
5837SOP1
5838++++
5839
5840.. code-block:: nasm
5841
5842  s_mov_b32 s1, s2
5843  s_mov_b64 s[0:1], 0x80000000
5844  s_cmov_b32 s1, 200
5845  s_wqm_b64 s[2:3], s[4:5]
5846  s_bcnt0_i32_b64 s1, s[2:3]
5847  s_swappc_b64 s[2:3], s[4:5]
5848  s_cbranch_join s[4:5]
5849
5850For full list of supported instructions, refer to "SOP1 Instructions" in ISA Manual.
5851
5852SOP2
5853++++
5854
5855.. code-block:: nasm
5856
5857  s_add_u32 s1, s2, s3
5858  s_and_b64 s[2:3], s[4:5], s[6:7]
5859  s_cselect_b32 s1, s2, s3
5860  s_andn2_b32 s2, s4, s6
5861  s_lshr_b64 s[2:3], s[4:5], s6
5862  s_ashr_i32 s2, s4, s6
5863  s_bfm_b64 s[2:3], s4, s6
5864  s_bfe_i64 s[2:3], s[4:5], s6
5865  s_cbranch_g_fork s[4:5], s[6:7]
5866
5867For full list of supported instructions, refer to "SOP2 Instructions" in ISA Manual.
5868
5869SOPC
5870++++
5871
5872.. code-block:: nasm
5873
5874  s_cmp_eq_i32 s1, s2
5875  s_bitcmp1_b32 s1, s2
5876  s_bitcmp0_b64 s[2:3], s4
5877  s_setvskip s3, s5
5878
5879For full list of supported instructions, refer to "SOPC Instructions" in ISA Manual.
5880
5881SOPP
5882++++
5883
5884.. code-block:: nasm
5885
5886  s_barrier
5887  s_nop 2
5888  s_endpgm
5889  s_waitcnt 0 ; Wait for all counters to be 0
5890  s_waitcnt vmcnt(0) & expcnt(0) & lgkmcnt(0) ; Equivalent to above
5891  s_waitcnt vmcnt(1) ; Wait for vmcnt counter to be 1.
5892  s_sethalt 9
5893  s_sleep 10
5894  s_sendmsg 0x1
5895  s_sendmsg sendmsg(MSG_INTERRUPT)
5896  s_trap 1
5897
5898For full list of supported instructions, refer to "SOPP Instructions" in ISA Manual.
5899
5900Unless otherwise mentioned, little verification is performed on the operands
5901of SOPP Instructions, so it is up to the programmer to be familiar with the
5902range or acceptable values.
5903
5904VALU
5905++++
5906
5907For vector ALU instruction opcodes (VOP1, VOP2, VOP3, VOPC, VOP_DPP, VOP_SDWA),
5908the assembler will automatically use optimal encoding based on its operands.
5909To force specific encoding, one can add a suffix to the opcode of the instruction:
5910
5911* _e32 for 32-bit VOP1/VOP2/VOPC
5912* _e64 for 64-bit VOP3
5913* _dpp for VOP_DPP
5914* _sdwa for VOP_SDWA
5915
5916VOP1/VOP2/VOP3/VOPC examples:
5917
5918.. code-block:: nasm
5919
5920  v_mov_b32 v1, v2
5921  v_mov_b32_e32 v1, v2
5922  v_nop
5923  v_cvt_f64_i32_e32 v[1:2], v2
5924  v_floor_f32_e32 v1, v2
5925  v_bfrev_b32_e32 v1, v2
5926  v_add_f32_e32 v1, v2, v3
5927  v_mul_i32_i24_e64 v1, v2, 3
5928  v_mul_i32_i24_e32 v1, -3, v3
5929  v_mul_i32_i24_e32 v1, -100, v3
5930  v_addc_u32 v1, s[0:1], v2, v3, s[2:3]
5931  v_max_f16_e32 v1, v2, v3
5932
5933VOP_DPP examples:
5934
5935.. code-block:: nasm
5936
5937  v_mov_b32 v0, v0 quad_perm:[0,2,1,1]
5938  v_sin_f32 v0, v0 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
5939  v_mov_b32 v0, v0 wave_shl:1
5940  v_mov_b32 v0, v0 row_mirror
5941  v_mov_b32 v0, v0 row_bcast:31
5942  v_mov_b32 v0, v0 quad_perm:[1,3,0,1] row_mask:0xa bank_mask:0x1 bound_ctrl:0
5943  v_add_f32 v0, v0, |v0| row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
5944  v_max_f16 v1, v2, v3 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
5945
5946VOP_SDWA examples:
5947
5948.. code-block:: nasm
5949
5950  v_mov_b32 v1, v2 dst_sel:BYTE_0 dst_unused:UNUSED_PRESERVE src0_sel:DWORD
5951  v_min_u32 v200, v200, v1 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_1 src1_sel:DWORD
5952  v_sin_f32 v0, v0 dst_unused:UNUSED_PAD src0_sel:WORD_1
5953  v_fract_f32 v0, |v0| dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1
5954  v_cmpx_le_u32 vcc, v1, v2 src0_sel:BYTE_2 src1_sel:WORD_0
5955
5956For full list of supported instructions, refer to "Vector ALU instructions".
5957
5958.. TODO
5959   Remove once we switch to code object v3 by default.
5960
5961.. _amdgpu-amdhsa-assembler-predefined-symbols-v2:
5962
5963Code Object V2 Predefined Symbols (-mattr=-code-object-v3)
5964~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
5965
5966.. warning:: Code Object V2 is not the default code object version emitted by
5967  this version of LLVM. For a description of the predefined symbols available
5968  with the default configuration (Code Object V3) see
5969  :ref:`amdgpu-amdhsa-assembler-predefined-symbols-v3`.
5970
5971The AMDGPU assembler defines and updates some symbols automatically. These
5972symbols do not affect code generation.
5973
5974.option.machine_version_major
5975+++++++++++++++++++++++++++++
5976
5977Set to the GFX major generation number of the target being assembled for. For
5978example, when assembling for a "GFX9" target this will be set to the integer
5979value "9". The possible GFX major generation numbers are presented in
5980:ref:`amdgpu-processors`.
5981
5982.option.machine_version_minor
5983+++++++++++++++++++++++++++++
5984
5985Set to the GFX minor generation number of the target being assembled for. For
5986example, when assembling for a "GFX810" target this will be set to the integer
5987value "1". The possible GFX minor generation numbers are presented in
5988:ref:`amdgpu-processors`.
5989
5990.option.machine_version_stepping
5991++++++++++++++++++++++++++++++++
5992
5993Set to the GFX stepping generation number of the target being assembled for.
5994For example, when assembling for a "GFX704" target this will be set to the
5995integer value "4". The possible GFX stepping generation numbers are presented
5996in :ref:`amdgpu-processors`.
5997
5998.kernel.vgpr_count
5999++++++++++++++++++
6000
6001Set to zero each time a
6002:ref:`amdgpu-amdhsa-assembler-directive-amdgpu_hsa_kernel` directive is
6003encountered. At each instruction, if the current value of this symbol is less
6004than or equal to the maximum VPGR number explicitly referenced within that
6005instruction then the symbol value is updated to equal that VGPR number plus
6006one.
6007
6008.kernel.sgpr_count
6009++++++++++++++++++
6010
6011Set to zero each time a
6012:ref:`amdgpu-amdhsa-assembler-directive-amdgpu_hsa_kernel` directive is
6013encountered. At each instruction, if the current value of this symbol is less
6014than or equal to the maximum VPGR number explicitly referenced within that
6015instruction then the symbol value is updated to equal that SGPR number plus
6016one.
6017
6018.. _amdgpu-amdhsa-assembler-directives-v2:
6019
6020Code Object V2 Directives (-mattr=-code-object-v3)
6021~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
6022
6023.. warning:: Code Object V2 is not the default code object version emitted by
6024  this version of LLVM. For a description of the directives supported with
6025  the default configuration (Code Object V3) see
6026  :ref:`amdgpu-amdhsa-assembler-directives-v3`.
6027
6028AMDGPU ABI defines auxiliary data in output code object. In assembly source,
6029one can specify them with assembler directives.
6030
6031.hsa_code_object_version major, minor
6032+++++++++++++++++++++++++++++++++++++
6033
6034*major* and *minor* are integers that specify the version of the HSA code
6035object that will be generated by the assembler.
6036
6037.hsa_code_object_isa [major, minor, stepping, vendor, arch]
6038+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
6039
6040
6041*major*, *minor*, and *stepping* are all integers that describe the instruction
6042set architecture (ISA) version of the assembly program.
6043
6044*vendor* and *arch* are quoted strings.  *vendor* should always be equal to
6045"AMD" and *arch* should always be equal to "AMDGPU".
6046
6047By default, the assembler will derive the ISA version, *vendor*, and *arch*
6048from the value of the -mcpu option that is passed to the assembler.
6049
6050.. _amdgpu-amdhsa-assembler-directive-amdgpu_hsa_kernel:
6051
6052.amdgpu_hsa_kernel (name)
6053+++++++++++++++++++++++++
6054
6055This directives specifies that the symbol with given name is a kernel entry point
6056(label) and the object should contain corresponding symbol of type STT_AMDGPU_HSA_KERNEL.
6057
6058.amd_kernel_code_t
6059++++++++++++++++++
6060
6061This directive marks the beginning of a list of key / value pairs that are used
6062to specify the amd_kernel_code_t object that will be emitted by the assembler.
6063The list must be terminated by the *.end_amd_kernel_code_t* directive.  For
6064any amd_kernel_code_t values that are unspecified a default value will be
6065used.  The default value for all keys is 0, with the following exceptions:
6066
6067- *amd_code_version_major* defaults to 1.
6068- *amd_kernel_code_version_minor* defaults to 2.
6069- *amd_machine_kind* defaults to 1.
6070- *amd_machine_version_major*, *machine_version_minor*, and
6071  *amd_machine_version_stepping* are derived from the value of the -mcpu option
6072  that is passed to the assembler.
6073- *kernel_code_entry_byte_offset* defaults to 256.
6074- *wavefront_size* defaults 6 for all targets before GFX10. For GFX10 onwards
6075  defaults to 6 if target feature ``wavefrontsize64`` is enabled, otherwise 5.
6076  Note that wavefront size is specified as a power of two, so a value of **n**
6077  means a size of 2^ **n**.
6078- *call_convention* defaults to -1.
6079- *kernarg_segment_alignment*, *group_segment_alignment*, and
6080  *private_segment_alignment* default to 4. Note that alignments are specified
6081  as a power of 2, so a value of **n** means an alignment of 2^ **n**.
6082- *enable_wgp_mode* defaults to 1 if target feature ``cumode`` is disabled for
6083  GFX10 onwards.
6084- *enable_mem_ordered* defaults to 1 for GFX10 onwards.
6085
6086The *.amd_kernel_code_t* directive must be placed immediately after the
6087function label and before any instructions.
6088
6089For a full list of amd_kernel_code_t keys, refer to AMDGPU ABI document,
6090comments in lib/Target/AMDGPU/AmdKernelCodeT.h and test/CodeGen/AMDGPU/hsa.s.
6091
6092.. _amdgpu-amdhsa-assembler-example-v2:
6093
6094Code Object V2 Example Source Code (-mattr=-code-object-v3)
6095~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
6096
6097.. warning:: Code Object V2 is not the default code object version emitted by
6098  this version of LLVM. For a description of the directives supported with
6099  the default configuration (Code Object V3) see
6100  :ref:`amdgpu-amdhsa-assembler-example-v3`.
6101
6102Here is an example of a minimal assembly source file, defining one HSA kernel:
6103
6104.. code-block:: none
6105
6106   .hsa_code_object_version 1,0
6107   .hsa_code_object_isa
6108
6109   .hsatext
6110   .globl  hello_world
6111   .p2align 8
6112   .amdgpu_hsa_kernel hello_world
6113
6114   hello_world:
6115
6116      .amd_kernel_code_t
6117         enable_sgpr_kernarg_segment_ptr = 1
6118         is_ptr64 = 1
6119         compute_pgm_rsrc1_vgprs = 0
6120         compute_pgm_rsrc1_sgprs = 0
6121         compute_pgm_rsrc2_user_sgpr = 2
6122         compute_pgm_rsrc1_wgp_mode = 0
6123         compute_pgm_rsrc1_mem_ordered = 0
6124         compute_pgm_rsrc1_fwd_progress = 1
6125     .end_amd_kernel_code_t
6126
6127     s_load_dwordx2 s[0:1], s[0:1] 0x0
6128     v_mov_b32 v0, 3.14159
6129     s_waitcnt lgkmcnt(0)
6130     v_mov_b32 v1, s0
6131     v_mov_b32 v2, s1
6132     flat_store_dword v[1:2], v0
6133     s_endpgm
6134   .Lfunc_end0:
6135        .size   hello_world, .Lfunc_end0-hello_world
6136
6137.. _amdgpu-amdhsa-assembler-predefined-symbols-v3:
6138
6139Code Object V3 Predefined Symbols (-mattr=+code-object-v3)
6140~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
6141
6142The AMDGPU assembler defines and updates some symbols automatically. These
6143symbols do not affect code generation.
6144
6145.amdgcn.gfx_generation_number
6146+++++++++++++++++++++++++++++
6147
6148Set to the GFX major generation number of the target being assembled for. For
6149example, when assembling for a "GFX9" target this will be set to the integer
6150value "9". The possible GFX major generation numbers are presented in
6151:ref:`amdgpu-processors`.
6152
6153.amdgcn.gfx_generation_minor
6154++++++++++++++++++++++++++++
6155
6156Set to the GFX minor generation number of the target being assembled for. For
6157example, when assembling for a "GFX810" target this will be set to the integer
6158value "1". The possible GFX minor generation numbers are presented in
6159:ref:`amdgpu-processors`.
6160
6161.amdgcn.gfx_generation_stepping
6162+++++++++++++++++++++++++++++++
6163
6164Set to the GFX stepping generation number of the target being assembled for.
6165For example, when assembling for a "GFX704" target this will be set to the
6166integer value "4". The possible GFX stepping generation numbers are presented
6167in :ref:`amdgpu-processors`.
6168
6169.. _amdgpu-amdhsa-assembler-symbol-next_free_vgpr:
6170
6171.amdgcn.next_free_vgpr
6172++++++++++++++++++++++
6173
6174Set to zero before assembly begins. At each instruction, if the current value
6175of this symbol is less than or equal to the maximum VGPR number explicitly
6176referenced within that instruction then the symbol value is updated to equal
6177that VGPR number plus one.
6178
6179May be used to set the `.amdhsa_next_free_vpgr` directive in
6180:ref:`amdhsa-kernel-directives-table`.
6181
6182May be set at any time, e.g. manually set to zero at the start of each kernel.
6183
6184.. _amdgpu-amdhsa-assembler-symbol-next_free_sgpr:
6185
6186.amdgcn.next_free_sgpr
6187++++++++++++++++++++++
6188
6189Set to zero before assembly begins. At each instruction, if the current value
6190of this symbol is less than or equal the maximum SGPR number explicitly
6191referenced within that instruction then the symbol value is updated to equal
6192that SGPR number plus one.
6193
6194May be used to set the `.amdhsa_next_free_spgr` directive in
6195:ref:`amdhsa-kernel-directives-table`.
6196
6197May be set at any time, e.g. manually set to zero at the start of each kernel.
6198
6199.. _amdgpu-amdhsa-assembler-directives-v3:
6200
6201Code Object V3 Directives (-mattr=+code-object-v3)
6202~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
6203
6204Directives which begin with ``.amdgcn`` are valid for all ``amdgcn``
6205architecture processors, and are not OS-specific. Directives which begin with
6206``.amdhsa`` are specific to ``amdgcn`` architecture processors when the
6207``amdhsa`` OS is specified. See :ref:`amdgpu-target-triples` and
6208:ref:`amdgpu-processors`.
6209
6210.amdgcn_target <target>
6211+++++++++++++++++++++++
6212
6213Optional directive which declares the target supported by the containing
6214assembler source file. Valid values are described in
6215:ref:`amdgpu-amdhsa-code-object-target-identification`. Used by the assembler
6216to validate command-line options such as ``-triple``, ``-mcpu``, and those
6217which specify target features.
6218
6219.amdhsa_kernel <name>
6220+++++++++++++++++++++
6221
6222Creates a correctly aligned AMDHSA kernel descriptor and a symbol,
6223``<name>.kd``, in the current location of the current section. Only valid when
6224the OS is ``amdhsa``. ``<name>`` must be a symbol that labels the first
6225instruction to execute, and does not need to be previously defined.
6226
6227Marks the beginning of a list of directives used to generate the bytes of a
6228kernel descriptor, as described in :ref:`amdgpu-amdhsa-kernel-descriptor`.
6229Directives which may appear in this list are described in
6230:ref:`amdhsa-kernel-directives-table`. Directives may appear in any order, must
6231be valid for the target being assembled for, and cannot be repeated. Directives
6232support the range of values specified by the field they reference in
6233:ref:`amdgpu-amdhsa-kernel-descriptor`. If a directive is not specified, it is
6234assumed to have its default value, unless it is marked as "Required", in which
6235case it is an error to omit the directive. This list of directives is
6236terminated by an ``.end_amdhsa_kernel`` directive.
6237
6238  .. table:: AMDHSA Kernel Assembler Directives
6239     :name: amdhsa-kernel-directives-table
6240
6241     ======================================================== =================== ============ ===================
6242     Directive                                                Default             Supported On Description
6243     ======================================================== =================== ============ ===================
6244     ``.amdhsa_group_segment_fixed_size``                     0                   GFX6-GFX10   Controls GROUP_SEGMENT_FIXED_SIZE in
6245                                                                                               :ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx10-table`.
6246     ``.amdhsa_private_segment_fixed_size``                   0                   GFX6-GFX10   Controls PRIVATE_SEGMENT_FIXED_SIZE in
6247                                                                                               :ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx10-table`.
6248     ``.amdhsa_user_sgpr_private_segment_buffer``             0                   GFX6-GFX10   Controls ENABLE_SGPR_PRIVATE_SEGMENT_BUFFER in
6249                                                                                               :ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx10-table`.
6250     ``.amdhsa_user_sgpr_dispatch_ptr``                       0                   GFX6-GFX10   Controls ENABLE_SGPR_DISPATCH_PTR in
6251                                                                                               :ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx10-table`.
6252     ``.amdhsa_user_sgpr_queue_ptr``                          0                   GFX6-GFX10   Controls ENABLE_SGPR_QUEUE_PTR in
6253                                                                                               :ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx10-table`.
6254     ``.amdhsa_user_sgpr_kernarg_segment_ptr``                0                   GFX6-GFX10   Controls ENABLE_SGPR_KERNARG_SEGMENT_PTR in
6255                                                                                               :ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx10-table`.
6256     ``.amdhsa_user_sgpr_dispatch_id``                        0                   GFX6-GFX10   Controls ENABLE_SGPR_DISPATCH_ID in
6257                                                                                               :ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx10-table`.
6258     ``.amdhsa_user_sgpr_flat_scratch_init``                  0                   GFX6-GFX10   Controls ENABLE_SGPR_FLAT_SCRATCH_INIT in
6259                                                                                               :ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx10-table`.
6260     ``.amdhsa_user_sgpr_private_segment_size``               0                   GFX6-GFX10   Controls ENABLE_SGPR_PRIVATE_SEGMENT_SIZE in
6261                                                                                               :ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx10-table`.
6262     ``.amdhsa_wavefront_size32``                             Target              GFX10        Controls ENABLE_WAVEFRONT_SIZE32 in
6263                                                              Feature                          :ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx10-table`.
6264                                                              Specific
6265                                                              (-wavefrontsize64)
6266     ``.amdhsa_system_sgpr_private_segment_wavefront_offset`` 0                   GFX6-GFX10   Controls ENABLE_SGPR_PRIVATE_SEGMENT_WAVEFRONT_OFFSET in
6267                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
6268     ``.amdhsa_system_sgpr_workgroup_id_x``                   1                   GFX6-GFX10   Controls ENABLE_SGPR_WORKGROUP_ID_X in
6269                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
6270     ``.amdhsa_system_sgpr_workgroup_id_y``                   0                   GFX6-GFX10   Controls ENABLE_SGPR_WORKGROUP_ID_Y in
6271                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
6272     ``.amdhsa_system_sgpr_workgroup_id_z``                   0                   GFX6-GFX10   Controls ENABLE_SGPR_WORKGROUP_ID_Z in
6273                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
6274     ``.amdhsa_system_sgpr_workgroup_info``                   0                   GFX6-GFX10   Controls ENABLE_SGPR_WORKGROUP_INFO in
6275                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
6276     ``.amdhsa_system_vgpr_workitem_id``                      0                   GFX6-GFX10   Controls ENABLE_VGPR_WORKITEM_ID in
6277                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
6278                                                                                               Possible values are defined in
6279                                                                                               :ref:`amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table`.
6280     ``.amdhsa_next_free_vgpr``                               Required            GFX6-GFX10   Maximum VGPR number explicitly referenced, plus one.
6281                                                                                               Used to calculate GRANULATED_WORKITEM_VGPR_COUNT in
6282                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
6283     ``.amdhsa_next_free_sgpr``                               Required            GFX6-GFX10   Maximum SGPR number explicitly referenced, plus one.
6284                                                                                               Used to calculate GRANULATED_WAVEFRONT_SGPR_COUNT in
6285                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
6286     ``.amdhsa_reserve_vcc``                                  1                   GFX6-GFX10   Whether the kernel may use the special VCC SGPR.
6287                                                                                               Used to calculate GRANULATED_WAVEFRONT_SGPR_COUNT in
6288                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
6289     ``.amdhsa_reserve_flat_scratch``                         1                   GFX7-GFX10   Whether the kernel may use flat instructions to access
6290                                                                                               scratch memory. Used to calculate
6291                                                                                               GRANULATED_WAVEFRONT_SGPR_COUNT in
6292                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
6293     ``.amdhsa_reserve_xnack_mask``                           Target              GFX8-GFX10   Whether the kernel may trigger XNACK replay.
6294                                                              Feature                          Used to calculate GRANULATED_WAVEFRONT_SGPR_COUNT in
6295                                                              Specific                         :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
6296                                                              (+xnack)
6297     ``.amdhsa_float_round_mode_32``                          0                   GFX6-GFX10   Controls FLOAT_ROUND_MODE_32 in
6298                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
6299                                                                                               Possible values are defined in
6300                                                                                               :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
6301     ``.amdhsa_float_round_mode_16_64``                       0                   GFX6-GFX10   Controls FLOAT_ROUND_MODE_16_64 in
6302                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
6303                                                                                               Possible values are defined in
6304                                                                                               :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
6305     ``.amdhsa_float_denorm_mode_32``                         0                   GFX6-GFX10   Controls FLOAT_DENORM_MODE_32 in
6306                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
6307                                                                                               Possible values are defined in
6308                                                                                               :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
6309     ``.amdhsa_float_denorm_mode_16_64``                      3                   GFX6-GFX10   Controls FLOAT_DENORM_MODE_16_64 in
6310                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
6311                                                                                               Possible values are defined in
6312                                                                                               :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
6313     ``.amdhsa_dx10_clamp``                                   1                   GFX6-GFX10   Controls ENABLE_DX10_CLAMP in
6314                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
6315     ``.amdhsa_ieee_mode``                                    1                   GFX6-GFX10   Controls ENABLE_IEEE_MODE in
6316                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
6317     ``.amdhsa_fp16_overflow``                                0                   GFX9-GFX10   Controls FP16_OVFL in
6318                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
6319     ``.amdhsa_workgroup_processor_mode``                     Target              GFX10        Controls ENABLE_WGP_MODE in
6320                                                              Feature                          :ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx10-table`.
6321                                                              Specific
6322                                                              (-cumode)
6323     ``.amdhsa_memory_ordered``                               1                   GFX10        Controls MEM_ORDERED in
6324                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
6325     ``.amdhsa_forward_progress``                             0                   GFX10        Controls FWD_PROGRESS in
6326                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
6327     ``.amdhsa_exception_fp_ieee_invalid_op``                 0                   GFX6-GFX10   Controls ENABLE_EXCEPTION_IEEE_754_FP_INVALID_OPERATION in
6328                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
6329     ``.amdhsa_exception_fp_denorm_src``                      0                   GFX6-GFX10   Controls ENABLE_EXCEPTION_FP_DENORMAL_SOURCE in
6330                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
6331     ``.amdhsa_exception_fp_ieee_div_zero``                   0                   GFX6-GFX10   Controls ENABLE_EXCEPTION_IEEE_754_FP_DIVISION_BY_ZERO in
6332                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
6333     ``.amdhsa_exception_fp_ieee_overflow``                   0                   GFX6-GFX10   Controls ENABLE_EXCEPTION_IEEE_754_FP_OVERFLOW in
6334                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
6335     ``.amdhsa_exception_fp_ieee_underflow``                  0                   GFX6-GFX10   Controls ENABLE_EXCEPTION_IEEE_754_FP_UNDERFLOW in
6336                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
6337     ``.amdhsa_exception_fp_ieee_inexact``                    0                   GFX6-GFX10   Controls ENABLE_EXCEPTION_IEEE_754_FP_INEXACT in
6338                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
6339     ``.amdhsa_exception_int_div_zero``                       0                   GFX6-GFX10   Controls ENABLE_EXCEPTION_INT_DIVIDE_BY_ZERO in
6340                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
6341     ======================================================== =================== ============ ===================
6342
6343.amdgpu_metadata
6344++++++++++++++++
6345
6346Optional directive which declares the contents of the ``NT_AMDGPU_METADATA``
6347note record (see :ref:`amdgpu-elf-note-records-table-v3`).
6348
6349The contents must be in the [YAML]_ markup format, with the same structure and
6350semantics described in :ref:`amdgpu-amdhsa-code-object-metadata-v3`.
6351
6352This directive is terminated by an ``.end_amdgpu_metadata`` directive.
6353
6354.. _amdgpu-amdhsa-assembler-example-v3:
6355
6356Code Object V3 Example Source Code (-mattr=+code-object-v3)
6357~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
6358
6359Here is an example of a minimal assembly source file, defining one HSA kernel:
6360
6361.. code-block:: none
6362
6363  .amdgcn_target "amdgcn-amd-amdhsa--gfx900+xnack" // optional
6364
6365  .text
6366  .globl hello_world
6367  .p2align 8
6368  .type hello_world,@function
6369  hello_world:
6370    s_load_dwordx2 s[0:1], s[0:1] 0x0
6371    v_mov_b32 v0, 3.14159
6372    s_waitcnt lgkmcnt(0)
6373    v_mov_b32 v1, s0
6374    v_mov_b32 v2, s1
6375    flat_store_dword v[1:2], v0
6376    s_endpgm
6377  .Lfunc_end0:
6378    .size   hello_world, .Lfunc_end0-hello_world
6379
6380  .rodata
6381  .p2align 6
6382  .amdhsa_kernel hello_world
6383    .amdhsa_user_sgpr_kernarg_segment_ptr 1
6384    .amdhsa_next_free_vgpr .amdgcn.next_free_vgpr
6385    .amdhsa_next_free_sgpr .amdgcn.next_free_sgpr
6386  .end_amdhsa_kernel
6387
6388  .amdgpu_metadata
6389  ---
6390  amdhsa.version:
6391    - 1
6392    - 0
6393  amdhsa.kernels:
6394    - .name: hello_world
6395      .symbol: hello_world.kd
6396      .kernarg_segment_size: 48
6397      .group_segment_fixed_size: 0
6398      .private_segment_fixed_size: 0
6399      .kernarg_segment_align: 4
6400      .wavefront_size: 64
6401      .sgpr_count: 2
6402      .vgpr_count: 3
6403      .max_flat_workgroup_size: 256
6404  ...
6405  .end_amdgpu_metadata
6406
6407If an assembly source file contains multiple kernels and/or functions, the
6408:ref:`amdgpu-amdhsa-assembler-symbol-next_free_vgpr` and
6409:ref:`amdgpu-amdhsa-assembler-symbol-next_free_sgpr` symbols may be reset using
6410the ``.set <symbol>, <expression>`` directive. For example, in the case of two
6411kernels, where ``function1`` is only called from ``kernel1`` it is sufficient
6412to group the function with the kernel that calls it and reset the symbols
6413between the two connected components:
6414
6415.. code-block:: none
6416
6417  .amdgcn_target "amdgcn-amd-amdhsa--gfx900+xnack" // optional
6418
6419  // gpr tracking symbols are implicitly set to zero
6420
6421  .text
6422  .globl kern0
6423  .p2align 8
6424  .type kern0,@function
6425  kern0:
6426    // ...
6427    s_endpgm
6428  .Lkern0_end:
6429    .size   kern0, .Lkern0_end-kern0
6430
6431  .rodata
6432  .p2align 6
6433  .amdhsa_kernel kern0
6434    // ...
6435    .amdhsa_next_free_vgpr .amdgcn.next_free_vgpr
6436    .amdhsa_next_free_sgpr .amdgcn.next_free_sgpr
6437  .end_amdhsa_kernel
6438
6439  // reset symbols to begin tracking usage in func1 and kern1
6440  .set .amdgcn.next_free_vgpr, 0
6441  .set .amdgcn.next_free_sgpr, 0
6442
6443  .text
6444  .hidden func1
6445  .global func1
6446  .p2align 2
6447  .type func1,@function
6448  func1:
6449    // ...
6450    s_setpc_b64 s[30:31]
6451  .Lfunc1_end:
6452  .size func1, .Lfunc1_end-func1
6453
6454  .globl kern1
6455  .p2align 8
6456  .type kern1,@function
6457  kern1:
6458    // ...
6459    s_getpc_b64 s[4:5]
6460    s_add_u32 s4, s4, func1@rel32@lo+4
6461    s_addc_u32 s5, s5, func1@rel32@lo+4
6462    s_swappc_b64 s[30:31], s[4:5]
6463    // ...
6464    s_endpgm
6465  .Lkern1_end:
6466    .size   kern1, .Lkern1_end-kern1
6467
6468  .rodata
6469  .p2align 6
6470  .amdhsa_kernel kern1
6471    // ...
6472    .amdhsa_next_free_vgpr .amdgcn.next_free_vgpr
6473    .amdhsa_next_free_sgpr .amdgcn.next_free_sgpr
6474  .end_amdhsa_kernel
6475
6476These symbols cannot identify connected components in order to automatically
6477track the usage for each kernel. However, in some cases careful organization of
6478the kernels and functions in the source file means there is minimal additional
6479effort required to accurately calculate GPR usage.
6480
6481Additional Documentation
6482========================
6483
6484.. [AMD-RADEON-HD-2000-3000] `AMD R6xx shader ISA <http://developer.amd.com/wordpress/media/2012/10/R600_Instruction_Set_Architecture.pdf>`__
6485.. [AMD-RADEON-HD-4000] `AMD R7xx shader ISA <http://developer.amd.com/wordpress/media/2012/10/R700-Family_Instruction_Set_Architecture.pdf>`__
6486.. [AMD-RADEON-HD-5000] `AMD Evergreen shader ISA <http://developer.amd.com/wordpress/media/2012/10/AMD_Evergreen-Family_Instruction_Set_Architecture.pdf>`__
6487.. [AMD-RADEON-HD-6000] `AMD Cayman/Trinity shader ISA <http://developer.amd.com/wordpress/media/2012/10/AMD_HD_6900_Series_Instruction_Set_Architecture.pdf>`__
6488.. [AMD-GCN-GFX6] `AMD Southern Islands Series ISA <http://developer.amd.com/wordpress/media/2012/12/AMD_Southern_Islands_Instruction_Set_Architecture.pdf>`__
6489.. [AMD-GCN-GFX7] `AMD Sea Islands Series ISA <http://developer.amd.com/wordpress/media/2013/07/AMD_Sea_Islands_Instruction_Set_Architecture.pdf>`_
6490.. [AMD-GCN-GFX8] `AMD GCN3 Instruction Set Architecture <http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2013/12/AMD_GCN3_Instruction_Set_Architecture_rev1.1.pdf>`__
6491.. [AMD-GCN-GFX9] `AMD "Vega" Instruction Set Architecture <http://developer.amd.com/wordpress/media/2013/12/Vega_Shader_ISA_28July2017.pdf>`__
6492.. [AMD-GCN-GFX10] AMD "Navi" Instruction Set Architecture *TBA*
6493.. TODO
6494   ttye Add link when made public.
6495.. [AMD-ROCm] `ROCm: Open Platform for Development, Discovery and Education Around GPU Computing <http://gpuopen.com/compute-product/rocm/>`__
6496.. [AMD-ROCm-github] `ROCm github <http://github.com/RadeonOpenCompute>`__
6497.. [HSA] `Heterogeneous System Architecture (HSA) Foundation <http://www.hsafoundation.com/>`__
6498.. [ELF] `Executable and Linkable Format (ELF) <http://www.sco.com/developers/gabi/>`__
6499.. [DWARF] `DWARF Debugging Information Format <http://dwarfstd.org/>`__
6500.. [YAML] `YAML Ain't Markup Language (YAML™) Version 1.2 <http://www.yaml.org/spec/1.2/spec.html>`__
6501.. [MsgPack] `Message Pack <http://www.msgpack.org/>`__
6502.. [OpenCL] `The OpenCL Specification Version 2.0 <http://www.khronos.org/registry/cl/specs/opencl-2.0.pdf>`__
6503.. [HRF] `Heterogeneous-race-free Memory Models <http://benedictgaster.org/wp-content/uploads/2014/01/asplos269-FINAL.pdf>`__
6504.. [CLANG-ATTR] `Attributes in Clang <http://clang.llvm.org/docs/AttributeReference.html>`__
6505