=============================
User Guide for AMDGPU Backend
=============================
 
.. contents::
   :local:

.. toctree::
   :hidden:

   AMDGPU/AMDGPUAsmGFX7
   AMDGPU/AMDGPUAsmGFX8
   AMDGPU/AMDGPUAsmGFX9
   AMDGPU/AMDGPUAsmGFX900
   AMDGPU/AMDGPUAsmGFX904
   AMDGPU/AMDGPUAsmGFX906
   AMDGPU/AMDGPUAsmGFX908
   AMDGPU/AMDGPUAsmGFX90a
   AMDGPU/AMDGPUAsmGFX10
   AMDGPU/AMDGPUAsmGFX1011
   AMDGPUModifierSyntax
   AMDGPUOperandSyntax
   AMDGPUInstructionSyntax
   AMDGPUInstructionNotation
   AMDGPUDwarfExtensionsForHeterogeneousDebugging
 
Introduction
============

The AMDGPU backend provides ISA code generation for AMD GPUs, from the R600
family through the current GCN families. It lives in the
``llvm/lib/Target/AMDGPU`` directory.
 
LLVM
====

.. _amdgpu-target-triples:

Target Triples
--------------

Use the Clang option ``-target <Architecture>-<Vendor>-<OS>-<Environment>``
to specify the target triple:
 
  .. table:: AMDGPU Architectures
     :name: amdgpu-architecture-table

     ============ ==============================================================
     Architecture Description
     ============ ==============================================================
     ``r600``     AMD GPUs HD2XXX-HD6XXX for graphics and compute shaders.
     ``amdgcn``   AMD GPUs GCN GFX6 onwards for graphics and compute shaders.
     ============ ==============================================================
 
  .. table:: AMDGPU Vendors
     :name: amdgpu-vendor-table

     ============ ==============================================================
     Vendor       Description
     ============ ==============================================================
     ``amd``      Can be used for all AMD GPU usage.
     ``mesa3d``   Can be used if the OS is ``mesa3d``.
     ============ ==============================================================
 
  .. table:: AMDGPU Operating Systems
     :name: amdgpu-os

     ============== ============================================================
     OS             Description
     ============== ============================================================
     *<empty>*      Defaults to the *unknown* OS.
     ``amdhsa``     Compute kernels executed on HSA [HSA]_ compatible runtimes
                    such as:

                    - AMD's ROCm™ runtime [AMD-ROCm]_ using the *rocm-amdhsa*
                      loader on Linux. See *AMD ROCm Platform Release Notes*
                      [AMD-ROCm-Release-Notes]_ for supported hardware and
                      software.
                    - AMD's PAL runtime using the *pal-amdhsa* loader on
                      Windows.

     ``amdpal``     Graphics shaders and compute kernels executed on AMD's PAL
                    runtime using the *pal-amdpal* loader on Windows and Linux
                    Pro.
     ``mesa3d``     Graphics shaders and compute kernels executed on AMD's Mesa
                    3D runtime using the *mesa-mesa3d* loader on Linux.
     ============== ============================================================
 
  .. table:: AMDGPU Environments
     :name: amdgpu-environment-table

     ============ ==============================================================
     Environment  Description
     ============ ==============================================================
     *<empty>*    Default.
     ============ ==============================================================
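 
As a rough illustration of how the four components combine, the following
Python sketch joins them into a triple string and rejects values that the
tables above do not list. The helper name ``make_triple``, the choice to drop
trailing empty components, and the reading of the vendor table as "the
``mesa3d`` vendor requires the ``mesa3d`` OS" are assumptions of this sketch;
none of it is part of Clang or LLVM.

.. code-block:: python

   # Values copied from the tables above; "" stands for *<empty>*.
   ARCHITECTURES = ("r600", "amdgcn")
   VENDORS = ("amd", "mesa3d")
   OPERATING_SYSTEMS = ("", "amdhsa", "amdpal", "mesa3d")
   ENVIRONMENTS = ("",)


   def make_triple(arch, vendor="amd", os_name="", environment=""):
       """Build <Architecture>-<Vendor>-<OS>-<Environment> (hypothetical helper)."""
       checks = (
           ("architecture", arch, ARCHITECTURES),
           ("vendor", vendor, VENDORS),
           ("OS", os_name, OPERATING_SYSTEMS),
           ("environment", environment, ENVIRONMENTS),
       )
       for kind, value, allowed in checks:
           if value not in allowed:
               raise ValueError(f"unlisted {kind}: {value!r}")
       # Assumption: the vendor table is read as "mesa3d only with the mesa3d OS".
       if vendor == "mesa3d" and os_name != "mesa3d":
           raise ValueError("vendor 'mesa3d' is only listed for OS 'mesa3d'")
       parts = [arch, vendor, os_name, environment]
       # Assumption: trailing empty components are dropped instead of being
       # written as dangling hyphens.
       while parts and parts[-1] == "":
           parts.pop()
       return "-".join(parts)


   print(make_triple("amdgcn", "amd", "amdhsa"))  # amdgcn-amd-amdhsa

The resulting string is what would follow ``-target`` on the Clang command
line.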
 
.. _amdgpu-processors:

Processors
----------

Use the Clang options ``-mcpu=<target-id>`` or ``--offload-arch=<target-id>`` to
specify the AMDGPU processor together with optional target features. See
:ref:`amdgpu-target-id` and :ref:`amdgpu-target-features` for AMD GPU
target-specific information.

Every processor supports every OS ABI (see :ref:`amdgpu-os`) with the following
exception:

* ``amdhsa`` is not supported on the ``r600`` architecture (see
  :ref:`amdgpu-architecture-table`).
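 
The rule above can be sketched in Python as follows. The ``PROCESSOR_ARCH``
mapping is a small, hand-copied excerpt of :ref:`amdgpu-processor-table`, and
the helper names ``os_supported`` and ``clang_target_args`` are hypothetical.
The sketch passes a bare processor name as the target ID and only applies the
single architecture-level exception; per-processor OS support still has to be
checked against the table and the runtime release notes.

.. code-block:: python

   # Hand-copied excerpt of the AMDGPU Processors table: processor -> architecture.
   PROCESSOR_ARCH = {
       "cayman": "r600",
       "gfx803": "amdgcn",
       "gfx906": "amdgcn",
       "gfx1030": "amdgcn",
   }


   def os_supported(arch, os_name):
       """Apply the one stated exception: r600 does not support amdhsa."""
       return not (arch == "r600" and os_name == "amdhsa")


   def clang_target_args(processor, os_name, vendor="amd"):
       """Return -target/-mcpu arguments for a processor (hypothetical helper)."""
       arch = PROCESSOR_ARCH[processor]
       if not os_supported(arch, os_name):
           raise ValueError(
               f"{os_name!r} is not supported on the {arch!r} architecture"
           )
       return ["-target", f"{arch}-{vendor}-{os_name}", f"-mcpu={processor}"]


   print(clang_target_args("gfx906", "amdhsa"))
   # ['-target', 'amdgcn-amd-amdhsa', '-mcpu=gfx906']

Requesting ``amdhsa`` for an ``r600`` processor such as ``cayman`` raises a
``ValueError`` in this sketch.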
 
  .. table:: AMDGPU Processors
     :name: amdgpu-processor-table

     =========== =============== ============ ===== ================= =============== =============== ======================
     Processor   Alternative     Target       dGPU/ Target            Target          OS Support      Example
                 Processor       Triple       APU   Features          Properties      *(see*          Products
                                 Architecture       Supported                         `amdgpu-os`_
                                                                                      *and
                                                                                      corresponding
                                                                                      runtime release
                                                                                      notes for
                                                                                      current
                                                                                      information and
                                                                                      level of
                                                                                      support)*
     =========== =============== ============ ===== ================= =============== =============== ======================
     **Radeon HD 2000/3000 Series (R600)** [AMD-RADEON-HD-2000-3000]_
     -----------------------------------------------------------------------------------------------------------------------
     ``r600``                    ``r600``     dGPU                    - Does not
                                                                        support
                                                                        generic
                                                                        address
                                                                        space
     ``r630``                    ``r600``     dGPU                    - Does not
                                                                        support
                                                                        generic
                                                                        address
                                                                        space
     ``rs880``                   ``r600``     dGPU                    - Does not
                                                                        support
                                                                        generic
                                                                        address
                                                                        space
     ``rv670``                   ``r600``     dGPU                    - Does not
                                                                        support
                                                                        generic
                                                                        address
                                                                        space
     **Radeon HD 4000 Series (R700)** [AMD-RADEON-HD-4000]_
     -----------------------------------------------------------------------------------------------------------------------
     ``rv710``                   ``r600``     dGPU                    - Does not
                                                                        support
                                                                        generic
                                                                        address
                                                                        space
     ``rv730``                   ``r600``     dGPU                    - Does not
                                                                        support
                                                                        generic
                                                                        address
                                                                        space
     ``rv770``                   ``r600``     dGPU                    - Does not
                                                                        support
                                                                        generic
                                                                        address
                                                                        space
     **Radeon HD 5000 Series (Evergreen)** [AMD-RADEON-HD-5000]_
     -----------------------------------------------------------------------------------------------------------------------
     ``cedar``                   ``r600``     dGPU                    - Does not
                                                                        support
                                                                        generic
                                                                        address
                                                                        space
     ``cypress``                 ``r600``     dGPU                    - Does not
                                                                        support
                                                                        generic
                                                                        address
                                                                        space
     ``juniper``                 ``r600``     dGPU                    - Does not
                                                                        support
                                                                        generic
                                                                        address
                                                                        space
     ``redwood``                 ``r600``     dGPU                    - Does not
                                                                        support
                                                                        generic
                                                                        address
                                                                        space
     ``sumo``                    ``r600``     dGPU                    - Does not
                                                                        support
                                                                        generic
                                                                        address
                                                                        space
     **Radeon HD 6000 Series (Northern Islands)** [AMD-RADEON-HD-6000]_
     -----------------------------------------------------------------------------------------------------------------------
     ``barts``                   ``r600``     dGPU                    - Does not
                                                                        support
                                                                        generic
                                                                        address
                                                                        space
     ``caicos``                  ``r600``     dGPU                    - Does not
                                                                        support
                                                                        generic
                                                                        address
                                                                        space
     ``cayman``                  ``r600``     dGPU                    - Does not
                                                                        support
                                                                        generic
                                                                        address
                                                                        space
     ``turks``                   ``r600``     dGPU                    - Does not
                                                                        support
                                                                        generic
                                                                        address
                                                                        space
     **GCN GFX6 (Southern Islands (SI))** [AMD-GCN-GFX6]_
     -----------------------------------------------------------------------------------------------------------------------
     ``gfx600``  - ``tahiti``    ``amdgcn``   dGPU                    - Does not      - *pal-amdpal*
                                                                        support
                                                                        generic
                                                                        address
                                                                        space
     ``gfx601``  - ``pitcairn``  ``amdgcn``   dGPU                    - Does not      - *pal-amdpal*
                 - ``verde``                                            support
                                                                        generic
                                                                        address
                                                                        space
     ``gfx602``  - ``hainan``    ``amdgcn``   dGPU                    - Does not      - *pal-amdpal*
                 - ``oland``                                            support
                                                                        generic
                                                                        address
                                                                        space
     **GCN GFX7 (Sea Islands (CI))** [AMD-GCN-GFX7]_
     -----------------------------------------------------------------------------------------------------------------------
     ``gfx700``  - ``kaveri``    ``amdgcn``   APU                     - Offset        - *rocm-amdhsa* - A6-7000
                                                                        flat          - *pal-amdhsa*  - A6 Pro-7050B
                                                                        scratch       - *pal-amdpal*  - A8-7100
                                                                                                      - A8 Pro-7150B
                                                                                                      - A10-7300
                                                                                                      - A10 Pro-7350B
                                                                                                      - FX-7500
                                                                                                      - A8-7200P
                                                                                                      - A10-7400P
                                                                                                      - FX-7600P
     ``gfx701``  - ``hawaii``    ``amdgcn``   dGPU                    - Offset        - *rocm-amdhsa* - FirePro W8100
                                                                        flat          - *pal-amdhsa*  - FirePro W9100
                                                                        scratch       - *pal-amdpal*  - FirePro S9150
                                                                                                      - FirePro S9170
     ``gfx702``                  ``amdgcn``   dGPU                    - Offset        - *rocm-amdhsa* - Radeon R9 290
                                                                        flat          - *pal-amdhsa*  - Radeon R9 290x
                                                                        scratch       - *pal-amdpal*  - Radeon R390
                                                                                                      - Radeon R390x
     ``gfx703``  - ``kabini``    ``amdgcn``   APU                     - Offset        - *pal-amdhsa*  - E1-2100
                 - ``mullins``                                          flat          - *pal-amdpal*  - E1-2200
                                                                        scratch                       - E1-2500
                                                                                                      - E2-3000
                                                                                                      - E2-3800
                                                                                                      - A4-5000
                                                                                                      - A4-5100
                                                                                                      - A6-5200
                                                                                                      - A4 Pro-3340B
     ``gfx704``  - ``bonaire``   ``amdgcn``   dGPU                    - Offset        - *pal-amdhsa*  - Radeon HD 7790
                                                                        flat          - *pal-amdpal*  - Radeon HD 8770
                                                                        scratch                       - R7 260
                                                                                                      - R7 260X
     ``gfx705``                  ``amdgcn``   APU                     - Offset        - *pal-amdhsa*  *TBA*
                                                                        flat          - *pal-amdpal*
                                                                        scratch                       .. TODO::

                                                                                                        Add product
                                                                                                        names.
 
     **GCN GFX8 (Volcanic Islands (VI))** [AMD-GCN-GFX8]_
     -----------------------------------------------------------------------------------------------------------------------
     ``gfx801``  - ``carrizo``   ``amdgcn``   APU   - xnack           - Offset        - *rocm-amdhsa* - A6-8500P
                                                                        flat          - *pal-amdhsa*  - Pro A6-8500B
                                                                        scratch       - *pal-amdpal*  - A8-8600P
                                                                                                      - Pro A8-8600B
                                                                                                      - FX-8800P
                                                                                                      - Pro A12-8800B
                                                                                                      - A10-8700P
                                                                                                      - Pro A10-8700B
                                                                                                      - A10-8780P
                                                                                                      - A10-9600P
                                                                                                      - A10-9630P
                                                                                                      - A12-9700P
                                                                                                      - A12-9730P
                                                                                                      - FX-9800P
                                                                                                      - FX-9830P
                                                                                                      - E2-9010
                                                                                                      - A6-9210
                                                                                                      - A9-9410
     ``gfx802``  - ``iceland``   ``amdgcn``   dGPU                    - Offset        - *rocm-amdhsa* - Radeon R9 285
                 - ``tonga``                                            flat          - *pal-amdhsa*  - Radeon R9 380
                                                                        scratch       - *pal-amdpal*  - Radeon R9 385
     ``gfx803``  - ``fiji``      ``amdgcn``   dGPU                                    - *rocm-amdhsa* - Radeon R9 Nano
                                                                                      - *pal-amdhsa*  - Radeon R9 Fury
                                                                                      - *pal-amdpal*  - Radeon R9 FuryX
                                                                                                      - Radeon Pro Duo
                                                                                                      - FirePro S9300x2
                                                                                                      - Radeon Instinct MI8
     \           - ``polaris10`` ``amdgcn``   dGPU                    - Offset        - *rocm-amdhsa* - Radeon RX 470
                                                                        flat          - *pal-amdhsa*  - Radeon RX 480
                                                                        scratch       - *pal-amdpal*  - Radeon Instinct MI6
     \           - ``polaris11`` ``amdgcn``   dGPU                    - Offset        - *rocm-amdhsa* - Radeon RX 460
                                                                        flat          - *pal-amdhsa*
                                                                        scratch       - *pal-amdpal*
     ``gfx805``  - ``tongapro``  ``amdgcn``   dGPU                    - Offset        - *rocm-amdhsa* - FirePro S7150
                                                                        flat          - *pal-amdhsa*  - FirePro S7100
                                                                        scratch       - *pal-amdpal*  - FirePro W7100
                                                                                                      - Mobile FirePro
                                                                                                        M7170
     ``gfx810``  - ``stoney``    ``amdgcn``   APU   - xnack           - Offset        - *rocm-amdhsa* *TBA*
                                                                        flat          - *pal-amdhsa*
                                                                        scratch       - *pal-amdpal*  .. TODO::

                                                                                                        Add product
                                                                                                        names.
 
     **GCN GFX9 (Vega)** [AMD-GCN-GFX900-GFX904-VEGA]_ [AMD-GCN-GFX906-VEGA7NM]_ [AMD-GCN-GFX908-CDNA1]_
     -----------------------------------------------------------------------------------------------------------------------
     ``gfx900``                  ``amdgcn``   dGPU  - xnack           - Absolute      - *rocm-amdhsa* - Radeon Vega
                                                                        flat          - *pal-amdhsa*    Frontier Edition
                                                                        scratch       - *pal-amdpal*  - Radeon RX Vega 56
                                                                                                      - Radeon RX Vega 64
                                                                                                      - Radeon RX Vega 64
                                                                                                        Liquid
                                                                                                      - Radeon Instinct MI25
     ``gfx902``                  ``amdgcn``   APU   - xnack           - Absolute      - *rocm-amdhsa* - Ryzen 3 2200G
                                                                        flat          - *pal-amdhsa*  - Ryzen 5 2400G
                                                                        scratch       - *pal-amdpal*
     ``gfx904``                  ``amdgcn``   dGPU  - xnack                           - *rocm-amdhsa* *TBA*
                                                                                      - *pal-amdhsa*
                                                                                      - *pal-amdpal*  .. TODO::

                                                                                                        Add product
                                                                                                        names.

     ``gfx906``                  ``amdgcn``   dGPU  - sramecc         - Absolute      - *rocm-amdhsa* - Radeon Instinct MI50
                                                    - xnack             flat          - *pal-amdhsa*  - Radeon Instinct MI60
                                                                        scratch       - *pal-amdpal*  - Radeon VII
                                                                                                      - Radeon Pro VII
     ``gfx908``                  ``amdgcn``   dGPU  - sramecc                         - *rocm-amdhsa* - AMD Instinct MI100 Accelerator
                                                    - xnack           - Absolute
                                                                        flat
                                                                        scratch
     ``gfx909``                  ``amdgcn``   APU   - xnack           - Absolute      - *pal-amdpal*  *TBA*
                                                                        flat
                                                                        scratch                       .. TODO::

                                                                                                        Add product
                                                                                                        names.

     ``gfx90a``                  ``amdgcn``   dGPU  - sramecc         - Absolute      - *rocm-amdhsa* *TBA*
                                                    - tgsplit           flat
                                                    - xnack             scratch                       .. TODO::
                                                                      - Packed
                                                                        work-item                       Add product
                                                                        IDs                             names.

     ``gfx90c``                  ``amdgcn``   APU   - xnack           - Absolute      - *pal-amdpal*  - Ryzen 7 4700G
                                                                        flat                          - Ryzen 7 4700GE
                                                                        scratch                       - Ryzen 5 4600G
                                                                                                      - Ryzen 5 4600GE
                                                                                                      - Ryzen 3 4300G
                                                                                                      - Ryzen 3 4300GE
                                                                                                      - Ryzen Pro 4000G
                                                                                                      - Ryzen 7 Pro 4700G
                                                                                                      - Ryzen 7 Pro 4750GE
                                                                                                      - Ryzen 5 Pro 4650G
                                                                                                      - Ryzen 5 Pro 4650GE
                                                                                                      - Ryzen 3 Pro 4350G
                                                                                                      - Ryzen 3 Pro 4350GE
 
     **GCN GFX10 (RDNA 1)** [AMD-GCN-GFX10-RDNA1]_
     -----------------------------------------------------------------------------------------------------------------------
     ``gfx1010``                 ``amdgcn``   dGPU  - cumode          - Absolute      - *rocm-amdhsa* - Radeon RX 5700
                                                    - wavefrontsize64   flat          - *pal-amdhsa*  - Radeon RX 5700 XT
                                                    - xnack             scratch       - *pal-amdpal*  - Radeon Pro 5600 XT
                                                                                                      - Radeon Pro 5600M
     ``gfx1011``                 ``amdgcn``   dGPU  - cumode                          - *rocm-amdhsa* - Radeon Pro V520
                                                    - wavefrontsize64 - Absolute      - *pal-amdhsa*
                                                    - xnack             flat          - *pal-amdpal*
                                                                        scratch
     ``gfx1012``                 ``amdgcn``   dGPU  - cumode          - Absolute      - *rocm-amdhsa* - Radeon RX 5500
                                                    - wavefrontsize64   flat          - *pal-amdhsa*  - Radeon RX 5500 XT
                                                    - xnack             scratch       - *pal-amdpal*
     ``gfx1013``                 ``amdgcn``   APU   - cumode          - Absolute      - *rocm-amdhsa* *TBA*
                                                    - wavefrontsize64   flat          - *pal-amdhsa*
                                                    - xnack             scratch       - *pal-amdpal*  .. TODO::

                                                                                                        Add product
                                                                                                        names.

     **GCN GFX10 (RDNA 2)** [AMD-GCN-GFX10-RDNA2]_
     -----------------------------------------------------------------------------------------------------------------------
     ``gfx1030``                 ``amdgcn``   dGPU  - cumode          - Absolute      - *rocm-amdhsa* - Radeon RX 6800
                                                    - wavefrontsize64   flat          - *pal-amdhsa*  - Radeon RX 6800 XT
                                                                        scratch       - *pal-amdpal*  - Radeon RX 6900 XT
     ``gfx1031``                 ``amdgcn``   dGPU  - cumode          - Absolute      - *rocm-amdhsa* - Radeon RX 6700 XT
                                                    - wavefrontsize64   flat          - *pal-amdhsa*
                                                                        scratch       - *pal-amdpal*
     ``gfx1032``                 ``amdgcn``   dGPU  - cumode          - Absolute      - *rocm-amdhsa* *TBA*
                                                    - wavefrontsize64   flat          - *pal-amdhsa*
                                                                        scratch       - *pal-amdpal*  .. TODO::

                                                                                                        Add product
                                                                                                        names.

     ``gfx1033``                 ``amdgcn``   APU   - cumode          - Absolute      - *pal-amdpal*  *TBA*
                                                    - wavefrontsize64   flat
                                                                        scratch                       .. TODO::

                                                                                                        Add product
                                                                                                        names.
     ``gfx1034``                 ``amdgcn``   dGPU  - cumode          - Absolute      - *pal-amdpal*  *TBA*
                                                    - wavefrontsize64   flat
419                                                                        scratch                       .. TODO::
420
421                                                                                                        Add product
422                                                                                                        names.
423
424     ``gfx1035``                 ``amdgcn``   APU   - cumode          - Absolute      - *pal-amdpal*  *TBA*
425                                                    - wavefrontsize64   flat
426                                                                        scratch                       .. TODO::
427                                                                                                        Add product
428                                                                                                        names.
429
430     =========== =============== ============ ===== ================= =============== =============== ======================
431
432.. _amdgpu-target-features:
433
434Target Features
435---------------
436
Target features control how code is generated to support certain
processor-specific features. Not all target features are supported by
all processors. The runtime must ensure that the features supported by
the device used to execute the code match the features enabled when
generating the code. A mismatch of features may result in incorrect
execution or reduced performance.
443
The target features supported by each processor are listed in
:ref:`amdgpu-processor-table`.
446
447Target features are controlled by exactly one of the following Clang
448options:
449
450``-mcpu=<target-id>`` or ``--offload-arch=<target-id>``
451
  The ``-mcpu`` and ``--offload-arch`` options can specify target features as
  optional components of the target ID. If a target feature is omitted, it has
  the ``any`` value. See :ref:`amdgpu-target-id`.
455
456``-m[no-]<target-feature>``
457
458  Target features not specified by the target ID are specified using a
459  separate option. These target features can have an ``on`` or ``off``
460  value.  ``on`` is specified by omitting the ``no-`` prefix, and
461  ``off`` is specified by including the ``no-`` prefix. The default
462  if not specified is ``off``.
463
464For example:
465
466``-mcpu=gfx908:xnack+``
467  Enable the ``xnack`` feature.
468``-mcpu=gfx908:xnack-``
469  Disable the ``xnack`` feature.
470``-mcumode``
471  Enable the ``cumode`` feature.
472``-mno-cumode``
473  Disable the ``cumode`` feature.
474
475  .. table:: AMDGPU Target Features
476     :name: amdgpu-target-features-table
477
478     =============== ============================ ==================================================
479     Target Feature  Clang Option to Control      Description
480     Name
481     =============== ============================ ==================================================
482     cumode          - ``-m[no-]cumode``          Control the wavefront execution mode used
483                                                  when generating code for kernels. When disabled
484                                                  native WGP wavefront execution mode is used,
485                                                  when enabled CU wavefront execution mode is used
486                                                  (see :ref:`amdgpu-amdhsa-memory-model`).
487
488     sramecc         - ``-mcpu``                  If specified, generate code that can only be
489                     - ``--offload-arch``         loaded and executed in a process that has a
490                                                  matching setting for SRAMECC.
491
492                                                  If not specified for code object V2 to V3, generate
493                                                  code that can be loaded and executed in a process
494                                                  with SRAMECC enabled.
495
496                                                  If not specified for code object V4, generate
497                                                  code that can be loaded and executed in a process
498                                                  with either setting of SRAMECC.
499
500     tgsplit           ``-m[no-]tgsplit``         Enable/disable generating code that assumes
501                                                  work-groups are launched in threadgroup split mode.
502                                                  When enabled the waves of a work-group may be
503                                                  launched in different CUs.
504
505     wavefrontsize64 - ``-m[no-]wavefrontsize64`` Control the wavefront size used when
506                                                  generating code for kernels. When disabled
507                                                  native wavefront size 32 is used, when enabled
508                                                  wavefront size 64 is used.
509
510     xnack           - ``-mcpu``                  If specified, generate code that can only be
511                     - ``--offload-arch``         loaded and executed in a process that has a
512                                                  matching setting for XNACK replay.
513
514                                                  If not specified for code object V2 to V3, generate
515                                                  code that can be loaded and executed in a process
516                                                  with XNACK replay enabled.
517
518                                                  If not specified for code object V4, generate
519                                                  code that can be loaded and executed in a process
520                                                  with either setting of XNACK replay.
521
522                                                  XNACK replay can be used for demand paging and
523                                                  page migration. If enabled in the device, then if
524                                                  a page fault occurs the code may execute
525                                                  incorrectly unless generated with XNACK replay
526                                                  enabled, or generated for code object V4 without
527                                                  specifying XNACK replay. Executing code that was
528                                                  generated with XNACK replay enabled, or generated
529                                                  for code object V4 without specifying XNACK replay,
530                                                  on a device that does not have XNACK replay
531                                                  enabled will execute correctly but may be less
532                                                  performant than code generated for XNACK replay
533                                                  disabled.
534     =============== ============================ ==================================================
535
536.. _amdgpu-target-id:
537
538Target ID
539---------
540
541AMDGPU supports target IDs. See `Clang Offload Bundler
542<https://clang.llvm.org/docs/ClangOffloadBundler.html>`_ for a general
543description. The AMDGPU target specific information is:
544
545**processor**
  Is an AMDGPU processor or alternative processor name specified in
  :ref:`amdgpu-processor-table`. The non-canonical form target ID allows both
  the primary processor and alternative processor names. The canonical form
  target ID only allows the primary processor name.
550
551**target-feature**
  Is a target feature name specified in :ref:`amdgpu-target-features-table` that
  is supported by the processor. The target features supported by each processor
  are specified in :ref:`amdgpu-processor-table`. Those that can be specified in
  a target ID are marked as being controlled by ``-mcpu`` and
  ``--offload-arch``. Each target feature must appear at most once in a target
  ID. The non-canonical form target ID allows the target features to be
  specified in any order. The canonical form target ID requires the target
  features to be specified in alphabetic order.
560
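The canonical form rules above can be sketched in Python. This is only an
illustration, not the Clang implementation: the function name
``canonicalize_target_id`` and the ``alternative_names`` mapping are
hypothetical, and the ``<processor>:<feature>+`` and ``<feature>-`` spelling is
taken from the ``-mcpu`` examples in :ref:`amdgpu-target-features`.

```python
def canonicalize_target_id(target_id, alternative_names=None):
    """Return the canonical form of a target ID such as 'gfx908:xnack+'."""
    # Hypothetical stand-in for the processor table: maps an alternative
    # processor name to its primary processor name.
    alternative_names = alternative_names or {}

    processor, *features = target_id.split(":")
    if not processor:
        raise ValueError("missing processor name")
    # The canonical form only allows the primary processor name.
    processor = alternative_names.get(processor, processor)

    settings = {}
    for feature in features:
        name, sign = feature[:-1], feature[-1:]
        if not name or sign not in ("+", "-"):
            raise ValueError(f"malformed target feature: {feature!r}")
        # Each target feature must appear at most once.
        if name in settings:
            raise ValueError(f"target feature specified twice: {name!r}")
        settings[name] = sign

    # The canonical form lists target features in alphabetic order.
    ordered = [name + settings[name] for name in sorted(settings)]
    return ":".join([processor] + ordered)
```

With this sketch, ``canonicalize_target_id("gfx908:xnack+:sramecc-")`` returns
``gfx908:sramecc-:xnack+``.
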
561.. _amdgpu-target-id-v2-v3:
562
563Code Object V2 to V3 Target ID
564~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
565
566The target ID syntax for code object V2 to V3 is the same as defined in `Clang
567Offload Bundler <https://clang.llvm.org/docs/ClangOffloadBundler.html>`_ except
568when used in the :ref:`amdgpu-assembler-directive-amdgcn-target` assembler
569directive and the bundle entry ID. In those cases it has the following BNF
570syntax:
571
572.. code::
573
  <target-id> ::= <processor> ( "+" <target-feature> )*
575
576Where a target feature is omitted if *Off* and present if *On* or *Any*.
577
578.. note::
579
580  The code object V2 to V3 cannot represent *Any* and treats it the same as
581  *On*.
582
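As a rough illustration of the code object V2 to V3 syntax, the following
Python sketch builds the string from a processor name and per-feature settings.
The function name and the ``"on"``/``"off"``/``"any"`` encoding are assumptions
made for this example. The grammar does not state an order for the target
features, so the sketch sorts them only to make the output deterministic.

```python
def to_v2_v3_target_id(processor, features):
    """Format a target ID using the code object V2 to V3 syntax.

    `features` maps a target feature name to "on", "off", or "any"
    (a hypothetical encoding chosen for this example).
    """
    parts = [processor]
    for name in sorted(features):
        value = features[name]
        if value not in ("on", "off", "any"):
            raise ValueError(f"unknown setting for {name!r}: {value!r}")
        # A feature is present if On or Any and omitted if Off, so Any
        # cannot be told apart from On in the result.
        if value in ("on", "any"):
            parts.append("+" + name)
    return "".join(parts)
```

For example, ``to_v2_v3_target_id("gfx908", {"xnack": "any", "sramecc": "off"})``
returns ``gfx908+xnack``, the same string that an ``xnack`` setting of ``"on"``
would produce.
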
583.. _amdgpu-embedding-bundled-objects:
584
585Embedding Bundled Code Objects
586------------------------------
587
588AMDGPU supports the HIP and OpenMP languages that perform code object embedding
589as described in `Clang Offload Bundler
590<https://clang.llvm.org/docs/ClangOffloadBundler.html>`_.
591
592.. note::
593
594  The target ID syntax used for code object V2 to V3 for a bundle entry ID
595  differs from that used elsewhere. See :ref:`amdgpu-target-id-v2-v3`.
596
597.. _amdgpu-address-spaces:
598
599Address Spaces
600--------------
601
602The AMDGPU architecture supports a number of memory address spaces. The address
603space names use the OpenCL standard names, with some additions.
604
605The AMDGPU address spaces correspond to target architecture specific LLVM
606address space numbers used in LLVM IR.
607
608The AMDGPU address spaces are described in
609:ref:`amdgpu-address-spaces-table`. Only 64-bit process address spaces are
610supported for the ``amdgcn`` target.
611
612  .. table:: AMDGPU Address Spaces
613     :name: amdgpu-address-spaces-table
614
615     ================================= =============== =========== ================ ======= ============================
616     ..                                                                                     64-Bit Process Address Space
617     --------------------------------- --------------- ----------- ---------------- ------------------------------------
618     Address Space Name                LLVM IR Address HSA Segment Hardware         Address NULL Value
619                                       Space Number    Name        Name             Size
620     ================================= =============== =========== ================ ======= ============================
621     Generic                           0               flat        flat             64      0x0000000000000000
622     Global                            1               global      global           64      0x0000000000000000
623     Region                            2               N/A         GDS              32      *not implemented for AMDHSA*
624     Local                             3               group       LDS              32      0xFFFFFFFF
625     Constant                          4               constant    *same as global* 64      0x0000000000000000
626     Private                           5               private     scratch          32      0xFFFFFFFF
627     Constant 32-bit                   6               *TODO*                               0x00000000
628     Buffer Fat Pointer (experimental) 7               *TODO*
629     ================================= =============== =========== ================ ======= ============================
630
631**Generic**
632  The generic address space is supported unless the *Target Properties* column
633  of :ref:`amdgpu-processor-table` specifies *Does not support generic address
634  space*.
635
  The generic address space uses the hardware flat address support for two fixed
  ranges of virtual addresses (the private and local apertures), which are
  outside the range of addressable global memory, to map from a flat address to
  a private or local address. This uses FLAT instructions that can take a flat
  address and access global, private (scratch), and group (LDS) memory depending
  on whether the address is within one of the aperture ranges.
642
643  Flat access to scratch requires hardware aperture setup and setup in the
644  kernel prologue (see :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`). Flat
645  access to LDS requires hardware aperture setup and M0 (GFX7-GFX8) register
646  setup (see :ref:`amdgpu-amdhsa-kernel-prolog-m0`).
647
  To convert between a private or group address space address (termed a segment
  address) and a flat address, the base address of the corresponding aperture
  can be used. For GFX7-GFX8 these are available in the
  :ref:`amdgpu-amdhsa-hsa-aql-queue`, the address of which can be obtained with
  the Queue Ptr SGPR (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
  For GFX9-GFX10 the aperture base addresses are directly available as inline
  constant registers ``SRC_SHARED_BASE/LIMIT`` and ``SRC_PRIVATE_BASE/LIMIT``.
  In 64-bit address mode the aperture sizes are 2^32 bytes and the base is
  aligned to 2^32, which makes it easier to convert from flat to segment or
  segment to flat.
658
659  A global address space address has the same value when used as a flat address
660  so no conversion is needed.
661
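The aperture arithmetic described for the generic address space can be
sketched in Python. This is a simplified model, assuming 64-bit address mode
with a 2^32-byte aperture whose base is aligned to 2^32. The function names are
hypothetical, and the sketch does not translate the different NULL values listed
in :ref:`amdgpu-address-spaces-table`.

```python
APERTURE_SIZE = 1 << 32  # aperture size in 64-bit address mode


def in_aperture(aperture_base, flat_address):
    """Return True if the flat address falls inside the given aperture."""
    # The base is 2^32 aligned, so only the upper 32 bits need comparing.
    return (flat_address >> 32) == (aperture_base >> 32)


def segment_to_flat(aperture_base, segment_address):
    """Convert a 32-bit private or local address to a flat address."""
    if aperture_base % APERTURE_SIZE != 0:
        raise ValueError("aperture base must be aligned to 2^32")
    if not 0 <= segment_address < APERTURE_SIZE:
        raise ValueError("segment address must fit in 32 bits")
    # With an aligned base, OR-ing is equivalent to adding the offset.
    return aperture_base | segment_address


def flat_to_segment(aperture_base, flat_address):
    """Convert a flat address inside an aperture back to a segment address."""
    if not in_aperture(aperture_base, flat_address):
        raise ValueError("flat address is not inside this aperture")
    return flat_address & (APERTURE_SIZE - 1)
```

Because the base is aligned to the aperture size, the conversion in either
direction reduces to setting or masking the upper 32 bits. A flat address that
lies in neither aperture is a global address and is used unchanged.
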
662**Global and Constant**
663  The global and constant address spaces both use global virtual addresses,
664  which are the same virtual address space used by the CPU. However, some
665  virtual addresses may only be accessible to the CPU, some only accessible
666  by the GPU, and some by both.
667
  Using the constant address space indicates that the data will not change
  during the execution of the kernel. This allows scalar read instructions to
  be used. Because the constant address space can only be modified on the host
  side, a generic pointer loaded from the constant address space can safely be
  assumed to be a global pointer, since only the device global memory is visible
  and managed on the host side. The vector and scalar L1 caches are invalidated
  of volatile data before each kernel dispatch execution to allow constant
  memory to change values between kernel dispatches.
676
677**Region**
  The region address space uses the hardware Global Data Store (GDS). All
  wavefronts executing on the same device will access the same memory for any
  given region address. However, the same region address accessed by wavefronts
  executing on different devices will access different memory. It has higher
  performance than global memory. It is allocated by the runtime. The data
  store (DS) instructions can be used to access it.
684
685**Local**
  The local address space uses the hardware Local Data Store (LDS) which is
  automatically allocated when the hardware creates the wavefronts of a
  work-group, and freed when all the wavefronts of a work-group have
  terminated. All wavefronts belonging to the same work-group will access the
  same memory for any given local address. However, the same local address
  accessed by wavefronts belonging to different work-groups will access
  different memory. It has higher performance than global memory. The data store
  (DS) instructions can be used to access it.
694
695**Private**
  The private address space uses the hardware scratch memory support which
  automatically allocates memory when it creates a wavefront and frees it when
  a wavefront terminates. The memory accessed by a lane of a wavefront for any
  given private address will be different to the memory accessed by another lane
  of the same or different wavefront for the same private address.
701
702  If a kernel dispatch uses scratch, then the hardware allocates memory from a
703  pool of backing memory allocated by the runtime for each wavefront. The lanes
704  of the wavefront access this using dword (4 byte) interleaving. The mapping
705  used from private address to backing memory address is:
706
707    ``wavefront-scratch-base +
708    ((private-address / 4) * wavefront-size * 4) +
709    (wavefront-lane-id * 4) + (private-address % 4)``
710
711  If each lane of a wavefront accesses the same private address, the
712  interleaving results in adjacent dwords being accessed and hence requires
713  fewer cache lines to be fetched.
714
715  There are different ways that the wavefront scratch base address is
716  determined by a wavefront (see
717  :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
718
719  Scratch memory can be accessed in an interleaved manner using buffer
720  instructions with the scratch buffer descriptor and per wavefront scratch
721  offset, by the scratch instructions, or by flat instructions. Multi-dword
722  access is not supported except by flat and scratch instructions in
723  GFX9-GFX10.
724
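The private address mapping given above can be written out as a small Python
function. The function and parameter names are chosen for this example only;
the arithmetic follows the formula term by term.

```python
def private_to_backing(wavefront_scratch_base, private_address,
                       wavefront_lane_id, wavefront_size):
    """Map a lane's private address to its backing memory address."""
    # Split the private address into a dword index and a byte within it.
    dword_index, byte_in_dword = divmod(private_address, 4)
    return (wavefront_scratch_base
            + dword_index * wavefront_size * 4  # one dword row per wavefront
            + wavefront_lane_id * 4             # this lane's dword in the row
            + byte_in_dword)
```

With a wavefront size of 64 and a scratch base of ``0x1000``, lanes 0, 1 and 2
accessing private address 8 map to ``0x1200``, ``0x1204`` and ``0x1208``, which
shows the adjacent dwords produced by the interleaving.
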
725**Constant 32-bit**
726  *TODO*
727
728**Buffer Fat Pointer**
729  The buffer fat pointer is an experimental address space that is currently
730  unsupported in the backend. It exposes a non-integral pointer that is in
731  the future intended to support the modelling of 128-bit buffer descriptors
732  plus a 32-bit offset into the buffer (in total encapsulating a 160-bit
733  *pointer*), allowing normal LLVM load/store/atomic operations to be used to
734  model the buffer descriptors used heavily in graphics workloads targeting
735  the backend.
736
737.. _amdgpu-memory-scopes:
738
739Memory Scopes
740-------------
741
This section describes the LLVM memory synchronization scopes supported by the
AMDGPU backend memory model when the target triple OS is ``amdhsa`` (see
:ref:`amdgpu-amdhsa-memory-model` and :ref:`amdgpu-target-triples`).
745
746The memory model supported is based on the HSA memory model [HSA]_ which is
747based in turn on HRF-indirect with scope inclusion [HRF]_. The happens-before
748relation is transitive over the synchronizes-with relation independent of scope
749and synchronizes-with allows the memory scope instances to be inclusive (see
750table :ref:`amdgpu-amdhsa-llvm-sync-scopes-table`).
751
752This is different to the OpenCL [OpenCL]_ memory model which does not have scope
753inclusion and requires the memory scopes to exactly match. However, this
754is conservatively correct for OpenCL.
755
756  .. table:: AMDHSA LLVM Sync Scopes
757     :name: amdgpu-amdhsa-llvm-sync-scopes-table
758
759     ======================= ===================================================
760     LLVM Sync Scope         Description
761     ======================= ===================================================
762     *none*                  The default: ``system``.
763
764                             Synchronizes with, and participates in modification
765                             and seq_cst total orderings with, other operations
766                             (except image operations) for all address spaces
767                             (except private, or generic that accesses private)
768                             provided the other operation's sync scope is:
769
770                             - ``system``.
771                             - ``agent`` and executed by a thread on the same
772                               agent.
773                             - ``workgroup`` and executed by a thread in the
774                               same work-group.
775                             - ``wavefront`` and executed by a thread in the
776                               same wavefront.
777
778     ``agent``               Synchronizes with, and participates in modification
779                             and seq_cst total orderings with, other operations
780                             (except image operations) for all address spaces
781                             (except private, or generic that accesses private)
782                             provided the other operation's sync scope is:
783
784                             - ``system`` or ``agent`` and executed by a thread
785                               on the same agent.
786                             - ``workgroup`` and executed by a thread in the
787                               same work-group.
788                             - ``wavefront`` and executed by a thread in the
789                               same wavefront.
790
791     ``workgroup``           Synchronizes with, and participates in modification
792                             and seq_cst total orderings with, other operations
793                             (except image operations) for all address spaces
794                             (except private, or generic that accesses private)
795                             provided the other operation's sync scope is:
796
797                             - ``system``, ``agent`` or ``workgroup`` and
798                               executed by a thread in the same work-group.
799                             - ``wavefront`` and executed by a thread in the
800                               same wavefront.
801
802     ``wavefront``           Synchronizes with, and participates in modification
803                             and seq_cst total orderings with, other operations
804                             (except image operations) for all address spaces
805                             (except private, or generic that accesses private)
806                             provided the other operation's sync scope is:
807
808                             - ``system``, ``agent``, ``workgroup`` or
809                               ``wavefront`` and executed by a thread in the
810                               same wavefront.
811
812     ``singlethread``        Only synchronizes with and participates in
813                             modification and seq_cst total orderings with,
814                             other operations (except image operations) running
815                             in the same thread for all address spaces (for
816                             example, in signal handlers).
817
818     ``one-as``              Same as ``system`` but only synchronizes with other
819                             operations within the same address space.
820
821     ``agent-one-as``        Same as ``agent`` but only synchronizes with other
822                             operations within the same address space.
823
824     ``workgroup-one-as``    Same as ``workgroup`` but only synchronizes with
825                             other operations within the same address space.
826
827     ``wavefront-one-as``    Same as ``wavefront`` but only synchronizes with
828                             other operations within the same address space.
829
830     ``singlethread-one-as`` Same as ``singlethread`` but only synchronizes with
831                             other operations within the same address space.
832     ======================= ===================================================
833
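One way to read the scope inclusion rows of the table is that two operations
can synchronize when their threads belong to the same instance of the narrower
of the two sync scopes. The Python sketch below encodes that reading. It is an
interpretation of the table, not the backend's implementation: the names are
hypothetical, and the ``-one-as`` variants, the address space exceptions and
the image operation exception are deliberately left out.

```python
from dataclasses import dataclass

# Base sync scopes ordered from narrowest to widest. An empty string stands
# for the *none* sync scope, which defaults to ``system``.
SCOPE_ORDER = ["singlethread", "wavefront", "workgroup", "agent", "system"]


@dataclass(frozen=True)
class ThreadId:
    agent: int
    workgroup: int  # unique within the agent
    wavefront: int  # unique within the work-group
    lane: int       # unique within the wavefront


def _scope_instance(scope, thread):
    """Return a key naming the scope instance that contains the thread."""
    key = (thread.agent, thread.workgroup, thread.wavefront, thread.lane)
    depth = len(SCOPE_ORDER) - 1 - SCOPE_ORDER.index(scope)
    return key[:depth]  # system -> (), agent -> (agent,), and so on


def scopes_allow_sync(scope_a, thread_a, scope_b, thread_b):
    """Check whether two operations' scopes let them synchronize."""
    scope_a = scope_a or "system"
    scope_b = scope_b or "system"
    narrower = min(scope_a, scope_b, key=SCOPE_ORDER.index)
    return (_scope_instance(narrower, thread_a)
            == _scope_instance(narrower, thread_b))
```

Under this reading, an ``agent`` scope operation and a ``workgroup`` scope
operation synchronize only if both threads are in the same work-group, which
matches the ``agent`` row of the table.
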
834LLVM IR Intrinsics
835------------------
836
837The AMDGPU backend implements the following LLVM IR intrinsics.
838
839*This section is WIP.*
840
841.. TODO::
842
843   List AMDGPU intrinsics.
844
845LLVM IR Attributes
846------------------
847
848The AMDGPU backend supports the following LLVM IR attributes.
849
850  .. table:: AMDGPU LLVM IR Attributes
851     :name: amdgpu-llvm-ir-attributes-table
852
853     ======================================= ==========================================================
854     LLVM Attribute                          Description
855     ======================================= ==========================================================
856     "amdgpu-flat-work-group-size"="min,max" Specify the minimum and maximum flat work group sizes that
857                                             will be specified when the kernel is dispatched. Generated
858                                             by the ``amdgpu_flat_work_group_size`` CLANG attribute [CLANG-ATTR]_.
859     "amdgpu-implicitarg-num-bytes"="n"      Number of kernel argument bytes to add to the kernel
860                                             argument block size for the implicit arguments. This
861                                             varies by OS and language (for OpenCL see
862                                             :ref:`opencl-kernel-implicit-arguments-appended-for-amdhsa-os-table`).
863     "amdgpu-num-sgpr"="n"                   Specifies the number of SGPRs to use. Generated by
864                                             the ``amdgpu_num_sgpr`` CLANG attribute [CLANG-ATTR]_.
865     "amdgpu-num-vgpr"="n"                   Specifies the number of VGPRs to use. Generated by the
866                                             ``amdgpu_num_vgpr`` CLANG attribute [CLANG-ATTR]_.
867     "amdgpu-waves-per-eu"="m,n"             Specify the minimum and maximum number of waves per
868                                             execution unit. Generated by the ``amdgpu_waves_per_eu``
869                                             CLANG attribute [CLANG-ATTR]_.
870     "amdgpu-ieee" true/false.               Specify whether the function expects the IEEE field of the
871                                             mode register to be set on entry. Overrides the default for
872                                             the calling convention.
873     "amdgpu-dx10-clamp" true/false.         Specify whether the function expects the DX10_CLAMP field of
874                                             the mode register to be set on entry. Overrides the default
875                                             for the calling convention.
876     ======================================= ==========================================================
877
878.. _amdgpu-elf-code-object:
879
880ELF Code Object
881===============
882
883The AMDGPU backend generates a standard ELF [ELF]_ relocatable code object that
884can be linked by ``lld`` to produce a standard ELF shared code object which can
885be loaded and executed on an AMDGPU target.
886
887.. _amdgpu-elf-header:
888
889Header
890------
891
892The AMDGPU backend uses the following ELF header:
893
894  .. table:: AMDGPU ELF Header
895     :name: amdgpu-elf-header-table
896
897     ========================== ===============================
898     Field                      Value
899     ========================== ===============================
900     ``e_ident[EI_CLASS]``      ``ELFCLASS64``
901     ``e_ident[EI_DATA]``       ``ELFDATA2LSB``
902     ``e_ident[EI_OSABI]``      - ``ELFOSABI_NONE``
903                                - ``ELFOSABI_AMDGPU_HSA``
904                                - ``ELFOSABI_AMDGPU_PAL``
905                                - ``ELFOSABI_AMDGPU_MESA3D``
906     ``e_ident[EI_ABIVERSION]`` - ``ELFABIVERSION_AMDGPU_HSA_V2``
907                                - ``ELFABIVERSION_AMDGPU_HSA_V3``
908                                - ``ELFABIVERSION_AMDGPU_HSA_V4``
909                                - ``ELFABIVERSION_AMDGPU_PAL``
910                                - ``ELFABIVERSION_AMDGPU_MESA3D``
911     ``e_type``                 - ``ET_REL``
912                                - ``ET_DYN``
913     ``e_machine``              ``EM_AMDGPU``
914     ``e_entry``                0
915     ``e_flags``                See :ref:`amdgpu-elf-header-e_flags-v2-table`,
916                                :ref:`amdgpu-elf-header-e_flags-table-v3`,
917                                and :ref:`amdgpu-elf-header-e_flags-table-v4`
918     ========================== ===============================
919
920..
921
922  .. table:: AMDGPU ELF Header Enumeration Values
923     :name: amdgpu-elf-header-enumeration-values-table
924
925     =============================== =====
926     Name                            Value
927     =============================== =====
928     ``EM_AMDGPU``                   224
929     ``ELFOSABI_NONE``               0
930     ``ELFOSABI_AMDGPU_HSA``         64
931     ``ELFOSABI_AMDGPU_PAL``         65
932     ``ELFOSABI_AMDGPU_MESA3D``      66
933     ``ELFABIVERSION_AMDGPU_HSA_V2`` 0
934     ``ELFABIVERSION_AMDGPU_HSA_V3`` 1
935     ``ELFABIVERSION_AMDGPU_HSA_V4`` 2
936     ``ELFABIVERSION_AMDGPU_PAL``    0
937     ``ELFABIVERSION_AMDGPU_MESA3D`` 0
938     =============================== =====
939
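As an illustration of how the header fields and enumeration values fit
together, the following Python sketch decodes the leading bytes of an
``amdgcn`` code object. The function name and the returned dictionary are
assumptions made for this example. The field offsets and the ``ELFCLASS64``,
``ELFDATA2LSB``, ``ET_REL`` and ``ET_DYN`` values come from the generic 64-bit
ELF layout, and the AMDGPU values come from the tables above. The ``e_flags``
value is returned without being decoded.

```python
import struct

EM_AMDGPU = 224
ELFCLASS64 = 2    # generic ELF value
ELFDATA2LSB = 1   # generic ELF value
E_TYPES = {1: "ET_REL", 3: "ET_DYN"}  # generic ELF values
OSABIS = {
    0: "ELFOSABI_NONE",
    64: "ELFOSABI_AMDGPU_HSA",
    65: "ELFOSABI_AMDGPU_PAL",
    66: "ELFOSABI_AMDGPU_MESA3D",
}
# ABI version names keyed by (EI_OSABI, EI_ABIVERSION).
ABI_VERSIONS = {
    (64, 0): "ELFABIVERSION_AMDGPU_HSA_V2",
    (64, 1): "ELFABIVERSION_AMDGPU_HSA_V3",
    (64, 2): "ELFABIVERSION_AMDGPU_HSA_V4",
    (65, 0): "ELFABIVERSION_AMDGPU_PAL",
    (66, 0): "ELFABIVERSION_AMDGPU_MESA3D",
}


def describe_amdgcn_elf_header(data):
    """Decode the AMDGPU-relevant fields of a 64-bit ELF header."""
    if len(data) < 52 or data[:4] != b"\x7fELF":
        raise ValueError("not an ELF header")
    # e_ident bytes 4..8: class, data encoding, version, OS ABI, ABI version.
    ei_class, ei_data, _ei_version, ei_osabi, ei_abiversion = data[4:9]
    if ei_class != ELFCLASS64 or ei_data != ELFDATA2LSB:
        raise ValueError("expected ELFCLASS64 and ELFDATA2LSB")
    # Fields after e_ident: e_type, e_machine, e_version, e_entry, e_phoff,
    # e_shoff, e_flags (little-endian, no padding).
    e_type, e_machine, _, e_entry, _, _, e_flags = struct.unpack_from(
        "<HHIQQQI", data, 16)
    if e_machine != EM_AMDGPU:
        raise ValueError("e_machine is not EM_AMDGPU")
    return {
        "osabi": OSABIS.get(ei_osabi),
        "abiversion": ABI_VERSIONS.get((ei_osabi, ei_abiversion)),
        "e_type": E_TYPES.get(e_type),
        "e_entry": e_entry,
        "e_flags": e_flags,
    }
```

Only the first 52 bytes of the file are needed, so the function can be applied
to a short read from the start of a code object.
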
940``e_ident[EI_CLASS]``
941  The ELF class is:
942
943  * ``ELFCLASS32`` for ``r600`` architecture.
944
945  * ``ELFCLASS64`` for ``amdgcn`` architecture which only supports 64-bit
946    process address space applications.
947
948``e_ident[EI_DATA]``
949  All AMDGPU targets use ``ELFDATA2LSB`` for little-endian byte ordering.
950
951``e_ident[EI_OSABI]``
952  One of the following AMDGPU target architecture specific OS ABIs
953  (see :ref:`amdgpu-os`):
954
955  * ``ELFOSABI_NONE`` for *unknown* OS.
956
957  * ``ELFOSABI_AMDGPU_HSA`` for ``amdhsa`` OS.
958
959  * ``ELFOSABI_AMDGPU_PAL`` for ``amdpal`` OS.
960
  * ``ELFOSABI_AMDGPU_MESA3D`` for ``mesa3d`` OS.
962
963``e_ident[EI_ABIVERSION]``
964  The ABI version of the AMDGPU target architecture specific OS ABI to which the code
965  object conforms:
966
967  * ``ELFABIVERSION_AMDGPU_HSA_V2`` is used to specify the version of AMD HSA
968    runtime ABI for code object V2. Specify using the Clang option
969    ``-mcode-object-version=2``.
970
971  * ``ELFABIVERSION_AMDGPU_HSA_V3`` is used to specify the version of AMD HSA
972    runtime ABI for code object V3. Specify using the Clang option
973    ``-mcode-object-version=3``. This is the default code object
974    version if not specified.
975
976  * ``ELFABIVERSION_AMDGPU_HSA_V4`` is used to specify the version of AMD HSA
977    runtime ABI for code object V4. Specify using the Clang option
978    ``-mcode-object-version=4``.
979
980  * ``ELFABIVERSION_AMDGPU_PAL`` is used to specify the version of AMD PAL
981    runtime ABI.
982
983  * ``ELFABIVERSION_AMDGPU_MESA3D`` is used to specify the version of AMD MESA
984    3D runtime ABI.
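
  As a minimal sketch (not taken from the LLVM sources), the identification
  bytes described above could be checked as follows. The function names and the
  locally defined enumerators are assumptions made for illustration; the values
  come from :ref:`amdgpu-elf-header-enumeration-values-table` and the generic
  ELF specification.

  .. code:: c

    #include <stdbool.h>
    #include <stdint.h>

    /* Generic ELF e_ident indices and values (from the ELF specification). */
    enum { EI_CLASS = 4, EI_DATA = 5, EI_OSABI = 7, EI_ABIVERSION = 8 };
    enum { ELFCLASS32 = 1, ELFCLASS64 = 2, ELFDATA2LSB = 1 };

    /* AMDGPU-specific values from the enumeration values table. */
    enum {
      ELFOSABI_NONE = 0,
      ELFOSABI_AMDGPU_HSA = 64,
      ELFOSABI_AMDGPU_PAL = 65,
      ELFOSABI_AMDGPU_MESA3D = 66
    };

    /* Hypothetical helper: returns true if the e_ident bytes are consistent
       with an amdgcn code object, that is 64-bit class, little-endian data,
       and one of the listed OS ABIs. */
    bool amdgpu_amdgcn_ident_ok(const uint8_t *e_ident) {
      if (e_ident[EI_CLASS] != ELFCLASS64 || e_ident[EI_DATA] != ELFDATA2LSB)
        return false;
      switch (e_ident[EI_OSABI]) {
      case ELFOSABI_NONE:
      case ELFOSABI_AMDGPU_HSA:
      case ELFOSABI_AMDGPU_PAL:
      case ELFOSABI_AMDGPU_MESA3D:
        return true;
      default:
        return false;
      }
    }

    /* Hypothetical helper: for the amdhsa OS ABI, maps EI_ABIVERSION to the
       code object version (0 -> V2, 1 -> V3, 2 -> V4). Returns 0 for any
       other OS ABI or an unlisted ABI version. */
    int amdgpu_hsa_code_object_version(const uint8_t *e_ident) {
      if (e_ident[EI_OSABI] != ELFOSABI_AMDGPU_HSA ||
          e_ident[EI_ABIVERSION] > 2)
        return 0;
      return e_ident[EI_ABIVERSION] + 2;
    }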
985
986``e_type``
987  Can be one of the following values:
988
989
990  ``ET_REL``
    The type produced by the AMDGPU backend compiler, as it is a relocatable
    code object.
993
994  ``ET_DYN``
995    The type produced by the linker as it is a shared code object.
996
  The AMD HSA runtime loader requires an ``ET_DYN`` code object.
998
999``e_machine``
1000  The value ``EM_AMDGPU`` is used for the machine for all processors supported
1001  by the ``r600`` and ``amdgcn`` architectures (see
1002  :ref:`amdgpu-processor-table`). The specific processor is specified in the
1003  ``NT_AMD_HSA_ISA_VERSION`` note record for code object V2 (see
1004  :ref:`amdgpu-note-records-v2`) and in the ``EF_AMDGPU_MACH`` bit field of the
1005  ``e_flags`` for code object V3 to V4 (see
1006  :ref:`amdgpu-elf-header-e_flags-table-v3` and
1007  :ref:`amdgpu-elf-header-e_flags-table-v4`).
1008
1009``e_entry``
1010  The entry point is 0 as the entry points for individual kernels must be
1011  selected in order to invoke them through AQL packets.
1012
1013``e_flags``
1014  The AMDGPU backend uses the following ELF header flags:
1015
1016  .. table:: AMDGPU ELF Header ``e_flags`` for Code Object V2
1017     :name: amdgpu-elf-header-e_flags-v2-table
1018
1019     ===================================== ===== =============================
1020     Name                                  Value Description
1021     ===================================== ===== =============================
1022     ``EF_AMDGPU_FEATURE_XNACK_V2``        0x01  Indicates if the ``xnack``
1023                                                 target feature is
1024                                                 enabled for all code
1025                                                 contained in the code object.
1026                                                 If the processor
1027                                                 does not support the
1028                                                 ``xnack`` target
1029                                                 feature then must
1030                                                 be 0.
1031                                                 See
1032                                                 :ref:`amdgpu-target-features`.
1033     ``EF_AMDGPU_FEATURE_TRAP_HANDLER_V2`` 0x02  Indicates if the trap
1034                                                 handler is enabled for all
1035                                                 code contained in the code
1036                                                 object. If the processor
1037                                                 does not support a trap
1038                                                 handler then must be 0.
1039                                                 See
1040                                                 :ref:`amdgpu-target-features`.
1041     ===================================== ===== =============================
1042
1043  .. table:: AMDGPU ELF Header ``e_flags`` for Code Object V3
1044     :name: amdgpu-elf-header-e_flags-table-v3
1045
1046     ================================= ===== =============================
1047     Name                              Value Description
1048     ================================= ===== =============================
1049     ``EF_AMDGPU_MACH``                0x0ff AMDGPU processor selection
1050                                             mask for
1051                                             ``EF_AMDGPU_MACH_xxx`` values
1052                                             defined in
1053                                             :ref:`amdgpu-ef-amdgpu-mach-table`.
1054     ``EF_AMDGPU_FEATURE_XNACK_V3``    0x100 Indicates if the ``xnack``
1055                                             target feature is
1056                                             enabled for all code
1057                                             contained in the code object.
1058                                             If the processor
1059                                             does not support the
1060                                             ``xnack`` target
1061                                             feature then must
1062                                             be 0.
1063                                             See
1064                                             :ref:`amdgpu-target-features`.
1065     ``EF_AMDGPU_FEATURE_SRAMECC_V3``  0x200 Indicates if the ``sramecc``
1066                                             target feature is
1067                                             enabled for all code
1068                                             contained in the code object.
1069                                             If the processor
1070                                             does not support the
1071                                             ``sramecc`` target
1072                                             feature then must
1073                                             be 0.
1074                                             See
1075                                             :ref:`amdgpu-target-features`.
1076     ================================= ===== =============================
1077
1078  .. table:: AMDGPU ELF Header ``e_flags`` for Code Object V4
1079     :name: amdgpu-elf-header-e_flags-table-v4
1080
1081     ============================================ ===== ===================================
     Name                                         Value Description
1083     ============================================ ===== ===================================
1084     ``EF_AMDGPU_MACH``                           0x0ff AMDGPU processor selection
1085                                                        mask for
1086                                                        ``EF_AMDGPU_MACH_xxx`` values
1087                                                        defined in
1088                                                        :ref:`amdgpu-ef-amdgpu-mach-table`.
1089     ``EF_AMDGPU_FEATURE_XNACK_V4``               0x300 XNACK selection mask for
1090                                                        ``EF_AMDGPU_FEATURE_XNACK_*_V4``
1091                                                        values.
     ``EF_AMDGPU_FEATURE_XNACK_UNSUPPORTED_V4``   0x000 XNACK unsupported.
1093     ``EF_AMDGPU_FEATURE_XNACK_ANY_V4``           0x100 XNACK can have any value.
1094     ``EF_AMDGPU_FEATURE_XNACK_OFF_V4``           0x200 XNACK disabled.
1095     ``EF_AMDGPU_FEATURE_XNACK_ON_V4``            0x300 XNACK enabled.
1096     ``EF_AMDGPU_FEATURE_SRAMECC_V4``             0xc00 SRAMECC selection mask for
1097                                                        ``EF_AMDGPU_FEATURE_SRAMECC_*_V4``
1098                                                        values.
     ``EF_AMDGPU_FEATURE_SRAMECC_UNSUPPORTED_V4`` 0x000 SRAMECC unsupported.
     ``EF_AMDGPU_FEATURE_SRAMECC_ANY_V4``         0x400 SRAMECC can have any value.
     ``EF_AMDGPU_FEATURE_SRAMECC_OFF_V4``         0x800 SRAMECC disabled.
1102     ``EF_AMDGPU_FEATURE_SRAMECC_ON_V4``          0xc00 SRAMECC enabled.
1103     ============================================ ===== ===================================
1104
1105  .. table:: AMDGPU ``EF_AMDGPU_MACH`` Values
1106     :name: amdgpu-ef-amdgpu-mach-table
1107
1108     ==================================== ========== =============================
1109     Name                                 Value      Description (see
1110                                                     :ref:`amdgpu-processor-table`)
1111     ==================================== ========== =============================
1112     ``EF_AMDGPU_MACH_NONE``              0x000      *not specified*
1113     ``EF_AMDGPU_MACH_R600_R600``         0x001      ``r600``
1114     ``EF_AMDGPU_MACH_R600_R630``         0x002      ``r630``
1115     ``EF_AMDGPU_MACH_R600_RS880``        0x003      ``rs880``
1116     ``EF_AMDGPU_MACH_R600_RV670``        0x004      ``rv670``
1117     ``EF_AMDGPU_MACH_R600_RV710``        0x005      ``rv710``
1118     ``EF_AMDGPU_MACH_R600_RV730``        0x006      ``rv730``
1119     ``EF_AMDGPU_MACH_R600_RV770``        0x007      ``rv770``
1120     ``EF_AMDGPU_MACH_R600_CEDAR``        0x008      ``cedar``
1121     ``EF_AMDGPU_MACH_R600_CYPRESS``      0x009      ``cypress``
1122     ``EF_AMDGPU_MACH_R600_JUNIPER``      0x00a      ``juniper``
1123     ``EF_AMDGPU_MACH_R600_REDWOOD``      0x00b      ``redwood``
1124     ``EF_AMDGPU_MACH_R600_SUMO``         0x00c      ``sumo``
1125     ``EF_AMDGPU_MACH_R600_BARTS``        0x00d      ``barts``
1126     ``EF_AMDGPU_MACH_R600_CAICOS``       0x00e      ``caicos``
1127     ``EF_AMDGPU_MACH_R600_CAYMAN``       0x00f      ``cayman``
1128     ``EF_AMDGPU_MACH_R600_TURKS``        0x010      ``turks``
1129     *reserved*                           0x011 -    Reserved for ``r600``
1130                                          0x01f      architecture processors.
1131     ``EF_AMDGPU_MACH_AMDGCN_GFX600``     0x020      ``gfx600``
1132     ``EF_AMDGPU_MACH_AMDGCN_GFX601``     0x021      ``gfx601``
1133     ``EF_AMDGPU_MACH_AMDGCN_GFX700``     0x022      ``gfx700``
1134     ``EF_AMDGPU_MACH_AMDGCN_GFX701``     0x023      ``gfx701``
1135     ``EF_AMDGPU_MACH_AMDGCN_GFX702``     0x024      ``gfx702``
1136     ``EF_AMDGPU_MACH_AMDGCN_GFX703``     0x025      ``gfx703``
1137     ``EF_AMDGPU_MACH_AMDGCN_GFX704``     0x026      ``gfx704``
1138     *reserved*                           0x027      Reserved.
1139     ``EF_AMDGPU_MACH_AMDGCN_GFX801``     0x028      ``gfx801``
1140     ``EF_AMDGPU_MACH_AMDGCN_GFX802``     0x029      ``gfx802``
1141     ``EF_AMDGPU_MACH_AMDGCN_GFX803``     0x02a      ``gfx803``
1142     ``EF_AMDGPU_MACH_AMDGCN_GFX810``     0x02b      ``gfx810``
1143     ``EF_AMDGPU_MACH_AMDGCN_GFX900``     0x02c      ``gfx900``
1144     ``EF_AMDGPU_MACH_AMDGCN_GFX902``     0x02d      ``gfx902``
1145     ``EF_AMDGPU_MACH_AMDGCN_GFX904``     0x02e      ``gfx904``
1146     ``EF_AMDGPU_MACH_AMDGCN_GFX906``     0x02f      ``gfx906``
1147     ``EF_AMDGPU_MACH_AMDGCN_GFX908``     0x030      ``gfx908``
1148     ``EF_AMDGPU_MACH_AMDGCN_GFX909``     0x031      ``gfx909``
1149     ``EF_AMDGPU_MACH_AMDGCN_GFX90C``     0x032      ``gfx90c``
1150     ``EF_AMDGPU_MACH_AMDGCN_GFX1010``    0x033      ``gfx1010``
1151     ``EF_AMDGPU_MACH_AMDGCN_GFX1011``    0x034      ``gfx1011``
1152     ``EF_AMDGPU_MACH_AMDGCN_GFX1012``    0x035      ``gfx1012``
1153     ``EF_AMDGPU_MACH_AMDGCN_GFX1030``    0x036      ``gfx1030``
1154     ``EF_AMDGPU_MACH_AMDGCN_GFX1031``    0x037      ``gfx1031``
1155     ``EF_AMDGPU_MACH_AMDGCN_GFX1032``    0x038      ``gfx1032``
1156     ``EF_AMDGPU_MACH_AMDGCN_GFX1033``    0x039      ``gfx1033``
1157     ``EF_AMDGPU_MACH_AMDGCN_GFX602``     0x03a      ``gfx602``
1158     ``EF_AMDGPU_MACH_AMDGCN_GFX705``     0x03b      ``gfx705``
1159     ``EF_AMDGPU_MACH_AMDGCN_GFX805``     0x03c      ``gfx805``
1160     ``EF_AMDGPU_MACH_AMDGCN_GFX1035``    0x03d      ``gfx1035``
1161     ``EF_AMDGPU_MACH_AMDGCN_GFX1034``    0x03e      ``gfx1034``
1162     ``EF_AMDGPU_MACH_AMDGCN_GFX90A``     0x03f      ``gfx90a``
1163     *reserved*                           0x040      Reserved.
1164     *reserved*                           0x041      Reserved.
1165     ``EF_AMDGPU_MACH_AMDGCN_GFX1013``    0x042      ``gfx1013``
1166     *reserved*                           0x043      Reserved.
1167     *reserved*                           0x044      Reserved.
1168     *reserved*                           0x045      Reserved.
1169     ==================================== ========== =============================
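
The code object V4 ``e_flags`` layout above can be decoded with a few masks and
shifts. The following is a sketch only: the function names are hypothetical, and
the shift amounts (8 and 10) are derived from the mask values ``0x300`` and
``0xc00`` in :ref:`amdgpu-elf-header-e_flags-table-v4`.

.. code:: c

  #include <stdint.h>

  /* Masks from the code object V4 e_flags table. */
  enum {
    EF_AMDGPU_MACH = 0x0ff,
    EF_AMDGPU_FEATURE_XNACK_V4 = 0x300,
    EF_AMDGPU_FEATURE_SRAMECC_V4 = 0xc00
  };

  /* Extracts the EF_AMDGPU_MACH_xxx processor value. */
  uint32_t amdgpu_v4_mach(uint32_t e_flags) {
    return e_flags & EF_AMDGPU_MACH;
  }

  /* Extracts the XNACK selection shifted down to bits 0-1:
     0 = unsupported, 1 = any, 2 = off, 3 = on. */
  uint32_t amdgpu_v4_xnack(uint32_t e_flags) {
    return (e_flags & EF_AMDGPU_FEATURE_XNACK_V4) >> 8;
  }

  /* Extracts the SRAMECC selection using the same 0-3 encoding. */
  uint32_t amdgpu_v4_sramecc(uint32_t e_flags) {
    return (e_flags & EF_AMDGPU_FEATURE_SRAMECC_V4) >> 10;
  }

  /* Maps a 0-3 selection to a printable name. */
  const char *amdgpu_v4_setting_name(uint32_t selection) {
    static const char *const names[] = {"unsupported", "any", "off", "on"};
    return names[selection & 3u];
  }

For example, an ``e_flags`` value of ``0xb3f`` decodes to processor value
``0x03f`` (``gfx90a``) with XNACK enabled and SRAMECC disabled.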
1170
1171Sections
1172--------
1173
1174An AMDGPU target ELF code object has the standard ELF sections which include:
1175
1176  .. table:: AMDGPU ELF Sections
1177     :name: amdgpu-elf-sections-table
1178
1179     ================== ================ =================================
1180     Name               Type             Attributes
1181     ================== ================ =================================
1182     ``.bss``           ``SHT_NOBITS``   ``SHF_ALLOC`` + ``SHF_WRITE``
1183     ``.data``          ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
1184     ``.debug_``\ *\**  ``SHT_PROGBITS`` *none*
1185     ``.dynamic``       ``SHT_DYNAMIC``  ``SHF_ALLOC``
1186     ``.dynstr``        ``SHT_PROGBITS`` ``SHF_ALLOC``
1187     ``.dynsym``        ``SHT_PROGBITS`` ``SHF_ALLOC``
1188     ``.got``           ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
1189     ``.hash``          ``SHT_HASH``     ``SHF_ALLOC``
1190     ``.note``          ``SHT_NOTE``     *none*
1191     ``.rela``\ *name*  ``SHT_RELA``     *none*
1192     ``.rela.dyn``      ``SHT_RELA``     *none*
1193     ``.rodata``        ``SHT_PROGBITS`` ``SHF_ALLOC``
1194     ``.shstrtab``      ``SHT_STRTAB``   *none*
1195     ``.strtab``        ``SHT_STRTAB``   *none*
1196     ``.symtab``        ``SHT_SYMTAB``   *none*
1197     ``.text``          ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_EXECINSTR``
1198     ================== ================ =================================
1199
1200These sections have their standard meanings (see [ELF]_) and are only generated
1201if needed.
1202
``.debug_``\ *\**
1204  The standard DWARF sections. See :ref:`amdgpu-dwarf-debug-information` for
1205  information on the DWARF produced by the AMDGPU backend.
1206
1207``.dynamic``, ``.dynstr``, ``.dynsym``, ``.hash``
1208  The standard sections used by a dynamic loader.
1209
1210``.note``
1211  See :ref:`amdgpu-note-records` for the note records supported by the AMDGPU
1212  backend.
1213
1214``.rela``\ *name*, ``.rela.dyn``
  For relocatable code objects, *name* is the name of the section to which the
  relocation records apply. For example, ``.rela.text`` is the section name for
  relocation records associated with the ``.text`` section.
1218
1219  For linked shared code objects, ``.rela.dyn`` contains all the relocation
1220  records from each of the relocatable code object's ``.rela``\ *name* sections.
1221
1222  See :ref:`amdgpu-relocation-records` for the relocation records supported by
1223  the AMDGPU backend.
1224
1225``.text``
  The executable machine code for the kernels and the functions they call.
  Generated as position independent code. See :ref:`amdgpu-code-conventions`
  for information on conventions used in the ISA generation.
1229
1230.. _amdgpu-note-records:
1231
1232Note Records
1233------------
1234
1235The AMDGPU backend code object contains ELF note records in the ``.note``
1236section. The set of generated notes and their semantics depend on the code
1237object version; see :ref:`amdgpu-note-records-v2` and
1238:ref:`amdgpu-note-records-v3-v4`.
1239
As required by ``ELFCLASS32`` and ``ELFCLASS64``, minimal zero-byte padding
must be generated after the ``name`` field to ensure the ``desc`` field is 4
byte aligned. In addition, minimal zero-byte padding must be generated to
ensure the ``desc`` field size is a multiple of 4 bytes. The ``sh_addralign``
field of the ``.note`` section must be at least 4 to indicate at least 4 byte
alignment.
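
The padding rules can be expressed as a small size computation. This is an
illustrative sketch: the helper names are hypothetical, and it assumes the
standard ELF note layout in which each record starts with three 4-byte words
(``namesz``, ``descsz``, and ``type``) and ``namesz`` counts the terminating
NUL of the name.

.. code:: c

  #include <stdint.h>

  /* Rounds n up to the next multiple of 4. */
  static uint32_t align4(uint32_t n) { return (n + 3u) & ~3u; }

  /* Offset of the desc field from the start of a note record: 12 bytes of
     header followed by the name padded to a 4 byte boundary. */
  uint32_t note_desc_offset(uint32_t namesz) {
    return 12u + align4(namesz);
  }

  /* Total size of a note record, including the zero-byte padding after
     both the name and the desc fields. */
  uint32_t note_record_size(uint32_t namesz, uint32_t descsz) {
    return note_desc_offset(namesz) + align4(descsz);
  }

With the name "AMD" (``namesz`` of 4) the ``desc`` field starts at offset 16,
and with the name "AMDGPU" (``namesz`` of 7) it starts at offset 20.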
1246
1247.. _amdgpu-note-records-v2:
1248
1249Code Object V2 Note Records
1250~~~~~~~~~~~~~~~~~~~~~~~~~~~
1251
1252.. warning::
1253  Code object V2 is not the default code object version emitted by
1254  this version of LLVM.
1255
The AMDGPU backend code object uses the following ELF note records in the
``.note`` section when compiling for code object V2.
1258
1259The note record vendor field is "AMD".
1260
1261Additional note records may be present, but any which are not documented here
1262are deprecated and should not be used.
1263
1264  .. table:: AMDGPU Code Object V2 ELF Note Records
1265     :name: amdgpu-elf-note-records-v2-table
1266
1267     ===== ===================================== ======================================
1268     Name  Type                                  Description
1269     ===== ===================================== ======================================
1270     "AMD" ``NT_AMD_HSA_CODE_OBJECT_VERSION``    Code object version.
1271     "AMD" ``NT_AMD_HSA_HSAIL``                  HSAIL properties generated by the HSAIL
1272                                                 Finalizer and not the LLVM compiler.
1273     "AMD" ``NT_AMD_HSA_ISA_VERSION``            Target ISA version.
1274     "AMD" ``NT_AMD_HSA_METADATA``               Metadata null terminated string in
1275                                                 YAML [YAML]_ textual format.
1276     "AMD" ``NT_AMD_HSA_ISA_NAME``               Target ISA name.
1277     ===== ===================================== ======================================
1278
1279..
1280
1281  .. table:: AMDGPU Code Object V2 ELF Note Record Enumeration Values
1282     :name: amdgpu-elf-note-record-enumeration-values-v2-table
1283
1284     ===================================== =====
1285     Name                                  Value
1286     ===================================== =====
1287     ``NT_AMD_HSA_CODE_OBJECT_VERSION``    1
1288     ``NT_AMD_HSA_HSAIL``                  2
1289     ``NT_AMD_HSA_ISA_VERSION``            3
1290     *reserved*                            4-9
1291     ``NT_AMD_HSA_METADATA``               10
1292     ``NT_AMD_HSA_ISA_NAME``               11
1293     ===================================== =====
1294
1295``NT_AMD_HSA_CODE_OBJECT_VERSION``
1296  Specifies the code object version number. The description field has the
1297  following layout:
1298
1299  .. code:: c
1300
1301    struct amdgpu_hsa_note_code_object_version_s {
1302      uint32_t major_version;
1303      uint32_t minor_version;
1304    };
1305
1306  The ``major_version`` has a value less than or equal to 2.
1307
1308``NT_AMD_HSA_HSAIL``
1309  Specifies the HSAIL properties used by the HSAIL Finalizer. The description
1310  field has the following layout:
1311
1312  .. code:: c
1313
1314    struct amdgpu_hsa_note_hsail_s {
1315      uint32_t hsail_major_version;
1316      uint32_t hsail_minor_version;
1317      uint8_t profile;
1318      uint8_t machine_model;
1319      uint8_t default_float_round;
1320    };
1321
1322``NT_AMD_HSA_ISA_VERSION``
1323  Specifies the target ISA version. The description field has the following layout:
1324
1325  .. code:: c
1326
1327    struct amdgpu_hsa_note_isa_s {
1328      uint16_t vendor_name_size;
1329      uint16_t architecture_name_size;
1330      uint32_t major;
1331      uint32_t minor;
1332      uint32_t stepping;
1333      char vendor_and_architecture_name[1];
1334    };
1335
  ``vendor_name_size`` and ``architecture_name_size`` are the lengths of the
  vendor and architecture names respectively, including the NUL character.

  ``vendor_and_architecture_name`` contains the NUL terminated string for the
  vendor, immediately followed by the NUL terminated string for the
  architecture.
1342
1343  This note record is used by the HSA runtime loader.
1344
1345  Code object V2 only supports a limited number of processors and has fixed
1346  settings for target features. See
1347  :ref:`amdgpu-elf-note-record-supported_processors-v2-table` for a list of
1348  processors and the corresponding target ID. In the table the note record ISA
1349  name is a concatenation of the vendor name, architecture name, major, minor,
1350  and stepping separated by a ":".
1351
1352  The target ID column shows the processor name and fixed target features used
1353  by the LLVM compiler. The LLVM compiler does not generate a
1354  ``NT_AMD_HSA_HSAIL`` note record.
1355
  A code object generated by the Finalizer also uses code object V2 and always
  generates a ``NT_AMD_HSA_HSAIL`` note record. The processor name and
  ``sramecc`` target feature are as shown in
  :ref:`amdgpu-elf-note-record-supported_processors-v2-table` but the ``xnack``
  target feature is specified by the ``EF_AMDGPU_FEATURE_XNACK_V2`` ``e_flags``
  bit.
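
  The concatenation that forms the note record ISA name can be sketched as
  below. The function name is hypothetical, and the code assumes a
  little-endian host so that the fields can be copied directly from the
  ``ELFDATA2LSB`` encoded description bytes.

  .. code:: c

    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    /* Formats "vendor:architecture:major:minor:stepping" from the raw desc
       bytes of an NT_AMD_HSA_ISA_VERSION note. Returns 0 on success, or -1
       if desc is truncated, a name is not NUL terminated, or out is too
       small. Assumes a little-endian host. */
    int amdgpu_isa_name_from_desc(const uint8_t *desc, size_t desc_size,
                                  char *out, size_t out_size) {
      const size_t header_size = 16; /* 2 + 2 + 4 + 4 + 4 bytes */
      uint16_t vendor_size, arch_size;
      uint32_t major, minor, stepping;

      if (desc_size < header_size)
        return -1;
      memcpy(&vendor_size, desc, 2);
      memcpy(&arch_size, desc + 2, 2);
      memcpy(&major, desc + 4, 4);
      memcpy(&minor, desc + 8, 4);
      memcpy(&stepping, desc + 12, 4);

      /* Both sizes include the NUL, so neither can be zero. */
      if (vendor_size == 0 || arch_size == 0 ||
          desc_size - header_size < (size_t)vendor_size + arch_size)
        return -1;

      const char *vendor = (const char *)desc + header_size;
      const char *arch = vendor + vendor_size;
      if (vendor[vendor_size - 1] != '\0' || arch[arch_size - 1] != '\0')
        return -1;

      int n = snprintf(out, out_size, "%s:%s:%u:%u:%u", vendor, arch,
                       (unsigned)major, (unsigned)minor, (unsigned)stepping);
      return (n < 0 || (size_t)n >= out_size) ? -1 : 0;
    }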
1362
1363``NT_AMD_HSA_ISA_NAME``
1364  Specifies the target ISA name as a non-NUL terminated string.
1365
1366  This note record is not used by the HSA runtime loader.
1367
1368  See the ``NT_AMD_HSA_ISA_VERSION`` note record description of the code object
1369  V2's limited support of processors and fixed settings for target features.
1370
  See :ref:`amdgpu-elf-note-record-supported_processors-v2-table` for a mapping
  from the string to the corresponding target ID. If the ``xnack`` target
  feature is supported and enabled, the string produced by the LLVM compiler
  may have ``+xnack`` appended. The Finalizer did not append it and instead
  used the ``EF_AMDGPU_FEATURE_XNACK_V2`` ``e_flags`` bit.
1376
1377``NT_AMD_HSA_METADATA``
1378  Specifies extensible metadata associated with the code objects executed on HSA
1379  [HSA]_ compatible runtimes (see :ref:`amdgpu-os`). It is required when the
1380  target triple OS is ``amdhsa`` (see :ref:`amdgpu-target-triples`). See
1381  :ref:`amdgpu-amdhsa-code-object-metadata-v2` for the syntax of the code object
1382  metadata string.
1383
1384  .. table:: AMDGPU Code Object V2 Supported Processors and Fixed Target Feature Settings
1385     :name: amdgpu-elf-note-record-supported_processors-v2-table
1386
1387     ===================== ==========================
1388     Note Record ISA Name  Target ID
1389     ===================== ==========================
1390     ``AMD:AMDGPU:6:0:0``  ``gfx600``
1391     ``AMD:AMDGPU:6:0:1``  ``gfx601``
1392     ``AMD:AMDGPU:6:0:2``  ``gfx602``
1393     ``AMD:AMDGPU:7:0:0``  ``gfx700``
1394     ``AMD:AMDGPU:7:0:1``  ``gfx701``
1395     ``AMD:AMDGPU:7:0:2``  ``gfx702``
1396     ``AMD:AMDGPU:7:0:3``  ``gfx703``
1397     ``AMD:AMDGPU:7:0:4``  ``gfx704``
1398     ``AMD:AMDGPU:7:0:5``  ``gfx705``
1399     ``AMD:AMDGPU:8:0:0``  ``gfx802``
1400     ``AMD:AMDGPU:8:0:1``  ``gfx801:xnack+``
1401     ``AMD:AMDGPU:8:0:2``  ``gfx802``
1402     ``AMD:AMDGPU:8:0:3``  ``gfx803``
1403     ``AMD:AMDGPU:8:0:4``  ``gfx803``
1404     ``AMD:AMDGPU:8:0:5``  ``gfx805``
1405     ``AMD:AMDGPU:8:1:0``  ``gfx810:xnack+``
1406     ``AMD:AMDGPU:9:0:0``  ``gfx900:xnack-``
1407     ``AMD:AMDGPU:9:0:1``  ``gfx900:xnack+``
1408     ``AMD:AMDGPU:9:0:2``  ``gfx902:xnack-``
1409     ``AMD:AMDGPU:9:0:3``  ``gfx902:xnack+``
1410     ``AMD:AMDGPU:9:0:4``  ``gfx904:xnack-``
1411     ``AMD:AMDGPU:9:0:5``  ``gfx904:xnack+``
1412     ``AMD:AMDGPU:9:0:6``  ``gfx906:sramecc-:xnack-``
1413     ``AMD:AMDGPU:9:0:7``  ``gfx906:sramecc-:xnack+``
1414     ``AMD:AMDGPU:9:0:12`` ``gfx90c:xnack-``
1415     ===================== ==========================
1416
1417.. _amdgpu-note-records-v3-v4:
1418
1419Code Object V3 to V4 Note Records
1420~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1421
1422The AMDGPU backend code object uses the following ELF note record in the
1423``.note`` section when compiling for code object V3 to V4.
1424
1425The note record vendor field is "AMDGPU".
1426
1427Additional note records may be present, but any which are not documented here
1428are deprecated and should not be used.
1429
1430  .. table:: AMDGPU Code Object V3 to V4 ELF Note Records
1431     :name: amdgpu-elf-note-records-table-v3-v4
1432
1433     ======== ============================== ======================================
1434     Name     Type                           Description
1435     ======== ============================== ======================================
1436     "AMDGPU" ``NT_AMDGPU_METADATA``         Metadata in Message Pack [MsgPack]_
1437                                             binary format.
1438     ======== ============================== ======================================
1439
1440..
1441
1442  .. table:: AMDGPU Code Object V3 to V4 ELF Note Record Enumeration Values
1443     :name: amdgpu-elf-note-record-enumeration-values-table-v3-v4
1444
1445     ============================== =====
1446     Name                           Value
1447     ============================== =====
1448     *reserved*                     0-31
1449     ``NT_AMDGPU_METADATA``         32
1450     ============================== =====
1451
1452``NT_AMDGPU_METADATA``
1453  Specifies extensible metadata associated with an AMDGPU code object. It is
1454  encoded as a map in the Message Pack [MsgPack]_ binary data format. See
1455  :ref:`amdgpu-amdhsa-code-object-metadata-v3` and
1456  :ref:`amdgpu-amdhsa-code-object-metadata-v4` for the map keys defined for the
1457  ``amdhsa`` OS.
1458
1459.. _amdgpu-symbols:
1460
1461Symbols
1462-------
1463
1464Symbols include the following:
1465
1466  .. table:: AMDGPU ELF Symbols
1467     :name: amdgpu-elf-symbols-table
1468
1469     ===================== ================== ================ ==================
1470     Name                  Type               Section          Description
1471     ===================== ================== ================ ==================
1472     *link-name*           ``STT_OBJECT``     - ``.data``      Global variable
1473                                              - ``.rodata``
1474                                              - ``.bss``
1475     *link-name*\ ``.kd``  ``STT_OBJECT``     - ``.rodata``    Kernel descriptor
1476     *link-name*           ``STT_FUNC``       - ``.text``      Kernel entry point
1477     *link-name*           ``STT_OBJECT``     - SHN_AMDGPU_LDS Global variable in LDS
1478     ===================== ================== ================ ==================
1479
1480Global variable
1481  Global variables both used and defined by the compilation unit.
1482
  If the symbol is defined in the compilation unit then it is allocated in the
  appropriate section according to whether it has initialized data or is
  read-only.
1485
  If the symbol is external then its section is ``SHN_UNDEF`` and the loader
  will resolve relocations using the definition provided by another code object
  or explicitly defined by the runtime.
1489
  If the symbol resides in local/group memory (LDS) then its section is the
  special processor-specific section index ``SHN_AMDGPU_LDS``, and the
  ``st_value`` field describes alignment requirements as it does for common
  symbols.
1494
1495  .. TODO::
1496
1497     Add description of linked shared object symbols. Seems undefined symbols
1498     are marked as STT_NOTYPE.
1499
1500Kernel descriptor
1501  Every HSA kernel has an associated kernel descriptor. It is the address of the
1502  kernel descriptor that is used in the AQL dispatch packet used to invoke the
1503  kernel, not the kernel entry point. The layout of the HSA kernel descriptor is
1504  defined in :ref:`amdgpu-amdhsa-kernel-descriptor`.
1505
1506Kernel entry point
1507  Every HSA kernel also has a symbol for its machine code entry point.
1508
1509.. _amdgpu-relocation-records:
1510
1511Relocation Records
1512------------------
1513
The AMDGPU backend generates ``Elf64_Rela`` relocation records. Supported
relocatable fields are:
1516
1517``word32``
1518  This specifies a 32-bit field occupying 4 bytes with arbitrary byte
1519  alignment. These values use the same byte order as other word values in the
1520  AMDGPU architecture.
1521
1522``word64``
1523  This specifies a 64-bit field occupying 8 bytes with arbitrary byte
1524  alignment. These values use the same byte order as other word values in the
1525  AMDGPU architecture.
1526
The following notations are used for specifying relocation calculations:
1528
1529**A**
1530  Represents the addend used to compute the value of the relocatable field.
1531
1532**G**
1533  Represents the offset into the global offset table at which the relocation
1534  entry's symbol will reside during execution.
1535
1536**GOT**
1537  Represents the address of the global offset table.
1538
1539**P**
  Represents the place (section offset for ``ET_REL`` or address for ``ET_DYN``)
  of the storage unit being relocated (computed using ``r_offset``).
1542
1543**S**
1544  Represents the value of the symbol whose index resides in the relocation
1545  entry. Relocations not using this must specify a symbol index of
1546  ``STN_UNDEF``.
1547
1548**B**
1549  Represents the base address of a loaded executable or shared object which is
1550  the difference between the ELF address and the actual load address.
1551  Relocations using this are only valid in executable or shared objects.
1552
1553The following relocation types are supported:
1554
1555  .. table:: AMDGPU ELF Relocation Records
1556     :name: amdgpu-elf-relocation-records-table
1557
1558     ========================== ======= =====  ==========  ==============================
1559     Relocation Type            Kind    Value  Field       Calculation
1560     ========================== ======= =====  ==========  ==============================
1561     ``R_AMDGPU_NONE``                  0      *none*      *none*
1562     ``R_AMDGPU_ABS32_LO``      Static, 1      ``word32``  (S + A) & 0xFFFFFFFF
1563                                Dynamic
1564     ``R_AMDGPU_ABS32_HI``      Static, 2      ``word32``  (S + A) >> 32
1565                                Dynamic
1566     ``R_AMDGPU_ABS64``         Static, 3      ``word64``  S + A
1567                                Dynamic
1568     ``R_AMDGPU_REL32``         Static  4      ``word32``  S + A - P
1569     ``R_AMDGPU_REL64``         Static  5      ``word64``  S + A - P
1570     ``R_AMDGPU_ABS32``         Static, 6      ``word32``  S + A
1571                                Dynamic
1572     ``R_AMDGPU_GOTPCREL``      Static  7      ``word32``  G + GOT + A - P
1573     ``R_AMDGPU_GOTPCREL32_LO`` Static  8      ``word32``  (G + GOT + A - P) & 0xFFFFFFFF
1574     ``R_AMDGPU_GOTPCREL32_HI`` Static  9      ``word32``  (G + GOT + A - P) >> 32
1575     ``R_AMDGPU_REL32_LO``      Static  10     ``word32``  (S + A - P) & 0xFFFFFFFF
1576     ``R_AMDGPU_REL32_HI``      Static  11     ``word32``  (S + A - P) >> 32
1577     *reserved*                         12
1578     ``R_AMDGPU_RELATIVE64``    Dynamic 13     ``word64``  B + A
1579     ``R_AMDGPU_REL16``         Static  14     ``word16``  ((S + A - P) - 4) / 4
1580     ========================== ======= =====  ==========  ==============================
1581
1582``R_AMDGPU_ABS32_LO`` and ``R_AMDGPU_ABS32_HI`` are only supported by
1583the ``mesa3d`` OS, which does not support ``R_AMDGPU_ABS64``.
1584
1585There is no current OS loader support for 32-bit programs and so
1586``R_AMDGPU_ABS32`` is not used.
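
The calculations in :ref:`amdgpu-elf-relocation-records-table` can be written
out directly. The following sketch covers a few representative relocation
types; the function names are hypothetical, **S**, **A**, and **P** follow the
notation above, and ``word16`` is assumed to be a 16-bit field by analogy with
``word32`` and ``word64``.

.. code:: c

  #include <stdint.h>

  /* R_AMDGPU_ABS32_LO: (S + A) & 0xFFFFFFFF */
  uint32_t r_amdgpu_abs32_lo(uint64_t S, int64_t A) {
    return (uint32_t)((S + (uint64_t)A) & 0xFFFFFFFFu);
  }

  /* R_AMDGPU_ABS32_HI: (S + A) >> 32 */
  uint32_t r_amdgpu_abs32_hi(uint64_t S, int64_t A) {
    return (uint32_t)((S + (uint64_t)A) >> 32);
  }

  /* R_AMDGPU_REL32_LO: (S + A - P) & 0xFFFFFFFF. Unsigned arithmetic wraps,
     so a backward reference yields the low half of the two's complement
     result. */
  uint32_t r_amdgpu_rel32_lo(uint64_t S, int64_t A, uint64_t P) {
    return (uint32_t)((S + (uint64_t)A - P) & 0xFFFFFFFFu);
  }

  /* R_AMDGPU_REL32_HI: (S + A - P) >> 32 */
  uint32_t r_amdgpu_rel32_hi(uint64_t S, int64_t A, uint64_t P) {
    return (uint32_t)((S + (uint64_t)A - P) >> 32);
  }

  /* R_AMDGPU_REL16: ((S + A - P) - 4) / 4, computed as a signed value. */
  int16_t r_amdgpu_rel16(uint64_t S, int64_t A, uint64_t P) {
    int64_t delta = (int64_t)(S + (uint64_t)A - P);
    return (int16_t)((delta - 4) / 4);
  }

The ``_LO`` and ``_HI`` pairs together reconstruct the full 64-bit value, which
is how a 64-bit address or offset can be materialized with two 32-bit fields.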
1587
1588.. _amdgpu-loaded-code-object-path-uniform-resource-identifier:
1589
1590Loaded Code Object Path Uniform Resource Identifier (URI)
1591---------------------------------------------------------
1592
The AMD GPU code object loader represents the path of the ELF shared object from
which the code object was loaded as a textual Uniform Resource Identifier (URI).
Note that the code object is the in-memory, loaded, relocated form of the ELF
shared object.  Multiple code objects may be loaded at different memory
addresses in the same process from the same ELF shared object.
1598
1599The loaded code object path URI syntax is defined by the following BNF syntax:
1600
1601.. code::
1602
  code_object_uri ::= file_uri | memory_uri
  file_uri        ::= "file://" file_path [ range_specifier ]
  memory_uri      ::= "memory://" process_id range_specifier
  range_specifier ::= [ "#" | "?" ] "offset=" number "&" "size=" number
1607  file_path       ::== URI_ENCODED_OS_FILE_PATH
1608  process_id      ::== DECIMAL_NUMBER
1609  number          ::== HEX_NUMBER | DECIMAL_NUMBER | OCTAL_NUMBER
1610
1611**number**
1612  Is a C integral literal where hexadecimal values are prefixed by "0x" or "0X",
1613  and octal values by "0".
1614
1615**file_path**
  Is the file's path specified as a URI encoded UTF-8 string. In URI encoding,
  every character that does not match the regular expression
  ``[a-zA-Z0-9/_.~-]`` is encoded as two uppercase hexadecimal digits preceded
  by "%".  Directories in the path are separated by "/".
1620
1621**offset**
1622  Is a 0-based byte offset to the start of the code object.  For a file URI, it
1623  is from the start of the file specified by the ``file_path``, and if omitted
1624  defaults to 0. For a memory URI, it is the memory address and is required.
1625
1626**size**
1627  Is the number of bytes in the code object.  For a file URI, if omitted it
1628  defaults to the size of the file.  It is required for a memory URI.
1629
1630**process_id**
1631  Is the identity of the process owning the memory.  For Linux it is the C
1632  unsigned integral decimal literal for the process ID (PID).
1633
1634For example:
1635
1636.. code::
1637
1638  file:///dir1/dir2/file1
1639  file:///dir3/dir4/file2#offset=0x2000&size=3000
1640  memory://1234#offset=0x20000&size=3000
1641
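The syntax above can be exercised with a short Python sketch. It is only an
illustration of the grammar and field descriptions: the function names and the
dictionary layout of the result are hypothetical, and a real loader or tool may
parse these URIs differently.

.. code:: python

  import re
  from urllib.parse import unquote

  _SAFE = re.compile(r"[a-zA-Z0-9/_.~-]")
  _URI = re.compile(
      r"(?P<scheme>file|memory)://(?P<body>[^#?]*)"
      r"(?:[#?]offset=(?P<offset>[0-9a-fA-FxX]+)&size=(?P<size>[0-9a-fA-FxX]+))?")

  def encode_file_path(path):
      """URI-encode a path: UTF-8 bytes outside [a-zA-Z0-9/_.~-] become %XX."""
      out = []
      for byte in path.encode("utf-8"):
          ch = chr(byte)
          out.append(ch if _SAFE.fullmatch(ch) else "%{:02X}".format(byte))
      return "".join(out)

  def parse_number(text):
      """Parse a C integral literal: 0x/0X hex, leading-0 octal, or decimal."""
      if text[:2] in ("0x", "0X"):
          return int(text[2:], 16)
      if len(text) > 1 and text[0] == "0":
          return int(text[1:], 8)
      return int(text, 10)

  def parse_code_object_uri(uri):
      """Split a loaded code object path URI into its components."""
      m = _URI.fullmatch(uri)
      if m is None:
          raise ValueError("not a code object URI: " + uri)
      offset = parse_number(m["offset"]) if m["offset"] else None
      size = parse_number(m["size"]) if m["size"] else None
      if m["scheme"] == "memory":
          if offset is None:
              raise ValueError("a memory URI requires offset and size")
          return {"kind": "memory", "process_id": int(m["body"]),
                  "offset": offset, "size": size}
      # For a file URI the offset defaults to 0; a size of None stands for
      # "the size of the file".
      return {"kind": "file", "path": unquote(m["body"]),
              "offset": offset or 0, "size": size}

For instance, ``parse_code_object_uri("memory://1234#offset=0x20000&size=3000")``
yields a process ID of 1234, an offset of 0x20000 and a size of 3000.
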
1642.. _amdgpu-dwarf-debug-information:
1643
1644DWARF Debug Information
1645=======================
1646
1647.. warning::
1648
1649   This section describes **provisional support** for AMDGPU DWARF [DWARF]_ that
1650   is not currently fully implemented and is subject to change.
1651
1652AMDGPU generates DWARF [DWARF]_ debugging information ELF sections (see
1653:ref:`amdgpu-elf-code-object`) which contain information that maps the code
1654object executable code and data to the source language constructs. It can be
1655used by tools such as debuggers and profilers. It uses features defined in
1656:doc:`AMDGPUDwarfExtensionsForHeterogeneousDebugging` that are made available in
1657DWARF Version 4 and DWARF Version 5 as an LLVM vendor extension.
1658
1659This section defines the AMDGPU target architecture specific DWARF mappings.
1660
1661.. _amdgpu-dwarf-register-identifier:
1662
1663Register Identifier
1664-------------------
1665
1666This section defines the AMDGPU target architecture register numbers used in
1667DWARF operation expressions (see DWARF Version 5 section 2.5 and
1668:ref:`amdgpu-dwarf-operation-expressions`) and Call Frame Information
1669instructions (see DWARF Version 5 section 6.4 and
1670:ref:`amdgpu-dwarf-call-frame-information`).
1671
1672A single code object can contain code for kernels that have different wavefront
1673sizes. The vector registers and some scalar registers are based on the wavefront
1674size. AMDGPU defines distinct DWARF registers for each wavefront size. This
1675simplifies the consumer of the DWARF so that each register has a fixed size,
1676rather than being dynamic according to the wavefront size mode. Similarly,
1677distinct DWARF registers are defined for those registers that vary in size
1678according to the process address size. This allows a consumer to treat a
1679specific AMDGPU processor as a single architecture regardless of how it is
1680configured at run time. The compiler explicitly specifies the DWARF registers
1681that match the mode in which the code it is generating will be executed.
1682
1683DWARF registers are encoded as numbers, which are mapped to architecture
1684registers. The mapping for AMDGPU is defined in
1685:ref:`amdgpu-dwarf-register-mapping-table`. All AMDGPU targets use the same
1686mapping.
1687
1688.. table:: AMDGPU DWARF Register Mapping
1689   :name: amdgpu-dwarf-register-mapping-table
1690
1691   ============== ================= ======== ==================================
1692   DWARF Register AMDGPU Register   Bit Size Description
1693   ============== ================= ======== ==================================
1694   0              PC_32             32       Program Counter (PC) when
1695                                             executing in a 32-bit process
1696                                             address space. Used in the CFI to
1697                                             describe the PC of the calling
1698                                             frame.
1699   1              EXEC_MASK_32      32       Execution Mask Register when
1700                                             executing in wavefront 32 mode.
1701   2-15           *Reserved*                 *Reserved for highly accessed
1702                                             registers using DWARF shortcut.*
1703   16             PC_64             64       Program Counter (PC) when
1704                                             executing in a 64-bit process
1705                                             address space. Used in the CFI to
1706                                             describe the PC of the calling
1707                                             frame.
1708   17             EXEC_MASK_64      64       Execution Mask Register when
1709                                             executing in wavefront 64 mode.
1710   18-31          *Reserved*                 *Reserved for highly accessed
1711                                             registers using DWARF shortcut.*
1712   32-95          SGPR0-SGPR63      32       Scalar General Purpose
1713                                             Registers.
1714   96-127         *Reserved*                 *Reserved for frequently accessed
1715                                             registers using DWARF 1-byte ULEB.*
1716   128            STATUS            32       Status Register.
1717   129-511        *Reserved*                 *Reserved for future Scalar
1718                                             Architectural Registers.*
1719   512            VCC_32            32       Vector Condition Code Register
1720                                             when executing in wavefront 32
1721                                             mode.
   513-767        *Reserved*                 *Reserved for future Vector
                                             Architectural Registers when
                                             executing in wavefront 32 mode.*
1725   768            VCC_64            64       Vector Condition Code Register
1726                                             when executing in wavefront 64
1727                                             mode.
1728   769-1023       *Reserved*                 *Reserved for future Vector
1729                                             Architectural Registers when
1730                                             executing in wavefront 64 mode.*
1731   1024-1087      *Reserved*                 *Reserved for padding.*
1732   1088-1129      SGPR64-SGPR105    32       Scalar General Purpose Registers.
1733   1130-1535      *Reserved*                 *Reserved for future Scalar
1734                                             General Purpose Registers.*
1735   1536-1791      VGPR0-VGPR255     32*32    Vector General Purpose Registers
1736                                             when executing in wavefront 32
1737                                             mode.
1738   1792-2047      *Reserved*                 *Reserved for future Vector
1739                                             General Purpose Registers when
1740                                             executing in wavefront 32 mode.*
1741   2048-2303      AGPR0-AGPR255     32*32    Vector Accumulation Registers
1742                                             when executing in wavefront 32
1743                                             mode.
1744   2304-2559      *Reserved*                 *Reserved for future Vector
1745                                             Accumulation Registers when
1746                                             executing in wavefront 32 mode.*
1747   2560-2815      VGPR0-VGPR255     64*32    Vector General Purpose Registers
1748                                             when executing in wavefront 64
1749                                             mode.
1750   2816-3071      *Reserved*                 *Reserved for future Vector
1751                                             General Purpose Registers when
1752                                             executing in wavefront 64 mode.*
1753   3072-3327      AGPR0-AGPR255     64*32    Vector Accumulation Registers
1754                                             when executing in wavefront 64
1755                                             mode.
1756   3328-3583      *Reserved*                 *Reserved for future Vector
1757                                             Accumulation Registers when
1758                                             executing in wavefront 64 mode.*
1759   ============== ================= ======== ==================================
1760
The vector registers are represented as the full size for the wavefront. They
are organized as consecutive dwords (32 bits), one per lane, with the dword at
the least significant bit position corresponding to lane 0 and so forth. DWARF
location expressions involving the ``DW_OP_LLVM_offset`` and
``DW_OP_LLVM_push_lane`` operations are used to select the part of the vector
register corresponding to the lane that is executing the current thread of
execution in languages that are implemented using a SIMD or SIMT execution
model.
1769
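The lane layout can be pictured with a small Python sketch. It models a whole
vector register as a single integer whose least significant 32 bits belong to
lane 0; the function name and this integer model are assumptions made for
illustration and do not correspond to any DWARF operation.

.. code:: python

  def lane_dword(register_value, lane):
      """Return the 32-bit dword that belongs to the given lane."""
      return (register_value >> (32 * lane)) & 0xFFFFFFFF

In a wavefront 64 register the dword for lane 5 therefore starts at bit offset
160 of the 2048-bit register.
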
1770If the wavefront size is 32 lanes then the wavefront 32 mode register
1771definitions are used. If the wavefront size is 64 lanes then the wavefront 64
1772mode register definitions are used. Some AMDGPU targets support executing in
1773both wavefront 32 and wavefront 64 mode. The register definitions corresponding
1774to the wavefront mode of the generated code will be used.
1775
1776If code is generated to execute in a 32-bit process address space, then the
177732-bit process address space register definitions are used. If code is generated
1778to execute in a 64-bit process address space, then the 64-bit process address
1779space register definitions are used. The ``amdgcn`` target only supports the
178064-bit process address space.
1781
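The mapping table can be transcribed into a small lookup helper, sketched below
in Python. The function name, the string register names and the
``wavefront_size`` parameter are hypothetical conveniences for illustration;
only the numeric ranges are taken from the table.

.. code:: python

  _FIXED = {"PC_32": 0, "EXEC_MASK_32": 1, "PC_64": 16, "EXEC_MASK_64": 17,
            "STATUS": 128, "VCC_32": 512, "VCC_64": 768}
  _VECTOR_BASE = {("VGPR", 32): 1536, ("AGPR", 32): 2048,
                  ("VGPR", 64): 2560, ("AGPR", 64): 3072}

  def dwarf_register(name, index=0, wavefront_size=64):
      """Return the DWARF register number for an AMDGPU register."""
      if name in _FIXED:
          return _FIXED[name]
      if name == "SGPR" and 0 <= index <= 105:
          # SGPR0-SGPR63 and SGPR64-SGPR105 occupy two separate ranges.
          return 32 + index if index < 64 else 1088 + (index - 64)
      if (name, wavefront_size) in _VECTOR_BASE and 0 <= index <= 255:
          return _VECTOR_BASE[(name, wavefront_size)] + index
      raise ValueError("no DWARF register for %s%d" % (name, index))

Note that the same architectural register, for example VGPR3, maps to a
different DWARF register number depending on the wavefront size.
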
1782.. _amdgpu-dwarf-address-class-identifier:
1783
1784Address Class Identifier
1785------------------------
1786
1787The DWARF address class represents the source language memory space. See DWARF
1788Version 5 section 2.12 which is updated by the *DWARF Extensions For
1789Heterogeneous Debugging* section :ref:`amdgpu-dwarf-segment_addresses`.
1790
1791The DWARF address class mapping used for AMDGPU is defined in
1792:ref:`amdgpu-dwarf-address-class-mapping-table`.
1793
1794.. table:: AMDGPU DWARF Address Class Mapping
1795   :name: amdgpu-dwarf-address-class-mapping-table
1796
1797   ========================= ====== =================
1798   DWARF                            AMDGPU
1799   -------------------------------- -----------------
1800   Address Class Name        Value  Address Space
1801   ========================= ====== =================
1802   ``DW_ADDR_none``          0x0000 Generic (Flat)
1803   ``DW_ADDR_LLVM_global``   0x0001 Global
1804   ``DW_ADDR_LLVM_constant`` 0x0002 Global
1805   ``DW_ADDR_LLVM_group``    0x0003 Local (group/LDS)
1806   ``DW_ADDR_LLVM_private``  0x0004 Private (Scratch)
1807   ``DW_ADDR_AMDGPU_region`` 0x8000 Region (GDS)
1808   ========================= ====== =================
1809
1810The DWARF address class values defined in the *DWARF Extensions For
1811Heterogeneous Debugging* section :ref:`amdgpu-dwarf-segment_addresses` are used.
1812
In addition, ``DW_ADDR_AMDGPU_region`` is encoded as a vendor extension. It is
available to the AMD extension for accessing the hardware GDS memory, which is
scratchpad memory allocated per device.
1816
1817For AMDGPU if no ``DW_AT_address_class`` attribute is present, then the default
1818address class of ``DW_ADDR_none`` is used.
1819
1820See :ref:`amdgpu-dwarf-address-space-identifier` for information on the AMDGPU
1821mapping of DWARF address classes to DWARF address spaces, including address size
1822and NULL value.
1823
1824.. _amdgpu-dwarf-address-space-identifier:
1825
1826Address Space Identifier
1827------------------------
1828
1829DWARF address spaces correspond to target architecture specific linear
1830addressable memory areas. See DWARF Version 5 section 2.12 and *DWARF Extensions
1831For Heterogeneous Debugging* section :ref:`amdgpu-dwarf-segment_addresses`.
1832
1833The DWARF address space mapping used for AMDGPU is defined in
1834:ref:`amdgpu-dwarf-address-space-mapping-table`.
1835
1836.. table:: AMDGPU DWARF Address Space Mapping
1837   :name: amdgpu-dwarf-address-space-mapping-table
1838
1839   ======================================= ===== ======= ======== ================= =======================
1840   DWARF                                                          AMDGPU            Notes
1841   --------------------------------------- ----- ---------------- ----------------- -----------------------
1842   Address Space Name                      Value Address Bit Size Address Space
1843   --------------------------------------- ----- ------- -------- ----------------- -----------------------
1844   ..                                            64-bit  32-bit
1845                                                 process process
1846                                                 address address
1847                                                 space   space
1848   ======================================= ===== ======= ======== ================= =======================
1849   ``DW_ASPACE_none``                      0x00  64      32       Global            *default address space*
1850   ``DW_ASPACE_AMDGPU_generic``            0x01  64      32       Generic (Flat)
1851   ``DW_ASPACE_AMDGPU_region``             0x02  32      32       Region (GDS)
1852   ``DW_ASPACE_AMDGPU_local``              0x03  32      32       Local (group/LDS)
1853   *Reserved*                              0x04
1854   ``DW_ASPACE_AMDGPU_private_lane``       0x05  32      32       Private (Scratch) *focused lane*
1855   ``DW_ASPACE_AMDGPU_private_wave``       0x06  32      32       Private (Scratch) *unswizzled wavefront*
1856   ======================================= ===== ======= ======== ================= =======================
1857
1858See :ref:`amdgpu-address-spaces` for information on the AMDGPU address spaces
1859including address size and NULL value.
1860
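For a DWARF consumer, the address space table could be captured roughly as in
the following Python sketch. The class and function names are hypothetical;
the values and bit sizes are copied from the table above.

.. code:: python

  from enum import IntEnum

  class DwarfAddressSpace(IntEnum):
      NONE = 0x00
      AMDGPU_GENERIC = 0x01
      AMDGPU_REGION = 0x02
      AMDGPU_LOCAL = 0x03
      AMDGPU_PRIVATE_LANE = 0x05
      AMDGPU_PRIVATE_WAVE = 0x06

  # Address spaces whose address size follows the process address size.
  _PROCESS_SIZED = {DwarfAddressSpace.NONE, DwarfAddressSpace.AMDGPU_GENERIC}

  def address_bit_size(aspace, process_address_bits=64):
      """Return the address size in bits for a DWARF address space value."""
      if process_address_bits not in (32, 64):
          raise ValueError("process address size must be 32 or 64 bits")
      aspace = DwarfAddressSpace(aspace)  # ValueError for reserved values
      return process_address_bits if aspace in _PROCESS_SIZED else 32

The reserved value 0x04 is deliberately absent from the enumeration, so looking
it up raises ``ValueError``.
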
1861The ``DW_ASPACE_none`` address space is the default target architecture address
1862space used in DWARF operations that do not specify an address space. It
1863therefore has to map to the global address space so that the ``DW_OP_addr*`` and
1864related operations can refer to addresses in the program code.
1865
1866The ``DW_ASPACE_AMDGPU_generic`` address space allows location expressions to
1867specify the flat address space. If the address corresponds to an address in the
1868local address space, then it corresponds to the wavefront that is executing the
1869focused thread of execution. If the address corresponds to an address in the
1870private address space, then it corresponds to the lane that is executing the
1871focused thread of execution for languages that are implemented using a SIMD or
1872SIMT execution model.
1873
1874.. note::
1875
1876  CUDA-like languages such as HIP that do not have address spaces in the
1877  language type system, but do allow variables to be allocated in different
1878  address spaces, need to explicitly specify the ``DW_ASPACE_AMDGPU_generic``
1879  address space in the DWARF expression operations as the default address space
1880  is the global address space.
1881
1882The ``DW_ASPACE_AMDGPU_local`` address space allows location expressions to
1883specify the local address space corresponding to the wavefront that is executing
1884the focused thread of execution.
1885
1886The ``DW_ASPACE_AMDGPU_private_lane`` address space allows location expressions
1887to specify the private address space corresponding to the lane that is executing
1888the focused thread of execution for languages that are implemented using a SIMD
1889or SIMT execution model.
1890
1891The ``DW_ASPACE_AMDGPU_private_wave`` address space allows location expressions
1892to specify the unswizzled private address space corresponding to the wavefront
1893that is executing the focused thread of execution. The wavefront view of private
1894memory is the per wavefront unswizzled backing memory layout defined in
1895:ref:`amdgpu-address-spaces`, such that address 0 corresponds to the first
1896location for the backing memory of the wavefront (namely the address is not
1897offset by ``wavefront-scratch-base``). The following formula can be used to
1898convert from a ``DW_ASPACE_AMDGPU_private_lane`` address to a
1899``DW_ASPACE_AMDGPU_private_wave`` address:
1900
1901::
1902
1903  private-address-wavefront =
1904    ((private-address-lane / 4) * wavefront-size * 4) +
1905    (wavefront-lane-id * 4) + (private-address-lane % 4)
1906
1907If the ``DW_ASPACE_AMDGPU_private_lane`` address is dword aligned, and the start
1908of the dwords for each lane starting with lane 0 is required, then this
1909simplifies to:
1910
1911::
1912
1913  private-address-wavefront =
1914    private-address-lane * wavefront-size
1915
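The two formulas above can be checked with a short Python sketch. The function
name is hypothetical; the parameter names mirror the terms used in the
formulas.

.. code:: python

  def private_lane_to_wave(private_address_lane, wavefront_lane_id,
                           wavefront_size):
      """Convert a private lane address to a private wavefront address."""
      dword_index, byte_in_dword = divmod(private_address_lane, 4)
      return (dword_index * wavefront_size * 4
              + wavefront_lane_id * 4
              + byte_in_dword)

For a dword aligned lane address and lane 0 the result equals
``private_address_lane * wavefront_size``, matching the simplified formula.
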
1916A compiler can use the ``DW_ASPACE_AMDGPU_private_wave`` address space to read a
1917complete spilled vector register back into a complete vector register in the
1918CFI. The frame pointer can be a private lane address which is dword aligned,
1919which can be shifted to multiply by the wavefront size, and then used to form a
1920private wavefront address that gives a location for a contiguous set of dwords,
1921one per lane, where the vector register dwords are spilled. The compiler knows
1922the wavefront size since it generates the code. Note that the type of the
1923address may have to be converted as the size of a
1924``DW_ASPACE_AMDGPU_private_lane`` address may be smaller than the size of a
1925``DW_ASPACE_AMDGPU_private_wave`` address.
1926
1927.. _amdgpu-dwarf-lane-identifier:
1928
Lane Identifier
---------------
1931
DWARF lane identifiers specify a target architecture lane position for hardware
that executes in a SIMD or SIMT manner, and on which a source language maps its
threads of execution onto those lanes. The DWARF lane identifier is pushed by
the ``DW_OP_LLVM_push_lane`` DWARF expression operation. See DWARF Version 5
section 2.5 which is updated by *DWARF Extensions For Heterogeneous Debugging*
section :ref:`amdgpu-dwarf-operation-expressions`.
1938
1939For AMDGPU, the lane identifier corresponds to the hardware lane ID of a
1940wavefront. It is numbered from 0 to the wavefront size minus 1.
1941
1942Operation Expressions
1943---------------------
1944
1945DWARF expressions are used to compute program values and the locations of
1946program objects. See DWARF Version 5 section 2.5 and
1947:ref:`amdgpu-dwarf-operation-expressions`.
1948
1949DWARF location descriptions describe how to access storage which includes memory
1950and registers. When accessing storage on AMDGPU, bytes are ordered with least
1951significant bytes first, and bits are ordered within bytes with least
1952significant bits first.
1953
1954For AMDGPU CFI expressions, ``DW_OP_LLVM_select_bit_piece`` is used to describe
1955unwinding vector registers that are spilled under the execution mask to memory:
1956the zero-single location description is the vector register, and the one-single
1957location description is the spilled memory location description. The
1958``DW_OP_LLVM_form_aspace_address`` is used to specify the address space of the
1959memory location description.
1960
1961In AMDGPU expressions, ``DW_OP_LLVM_select_bit_piece`` is used by the
1962``DW_AT_LLVM_lane_pc`` attribute expression where divergent control flow is
1963controlled by the execution mask. An undefined location description together
1964with ``DW_OP_LLVM_extend`` is used to indicate the lane was not active on entry
1965to the subprogram. See :ref:`amdgpu-dwarf-dw-at-llvm-lane-pc` for an example.
1966
1967Debugger Information Entry Attributes
1968-------------------------------------
1969
1970This section describes how certain debugger information entry attributes are
1971used by AMDGPU. See the sections in DWARF Version 5 section 2 which are updated
1972by *DWARF Extensions For Heterogeneous Debugging* section
1973:ref:`amdgpu-dwarf-debugging-information-entry-attributes`.
1974
1975.. _amdgpu-dwarf-dw-at-llvm-lane-pc:
1976
1977``DW_AT_LLVM_lane_pc``
1978~~~~~~~~~~~~~~~~~~~~~~
1979
1980For AMDGPU, the ``DW_AT_LLVM_lane_pc`` attribute is used to specify the program
1981location of the separate lanes of a SIMT thread.
1982
1983If the lane is an active lane then this will be the same as the current program
1984location.
1985
If the lane is inactive, but was active on entry to the subprogram, then this is
the program location in the subprogram at which execution of the lane is
conceptually positioned.
1989
If the lane was not active on entry to the subprogram, then this will be the
undefined location. A client debugger can check if the lane is part of a valid
work-group by checking that the lane is in the range of the associated
work-group within the grid, accounting for partial work-groups. If it is not,
then the debugger can omit any information for the lane. Otherwise, the debugger
may repeatedly unwind the stack and inspect the ``DW_AT_LLVM_lane_pc`` of the
calling subprogram until it finds a non-undefined location. Conceptually, the
lane only has the call frames for which it has a non-undefined
``DW_AT_LLVM_lane_pc``.
1999
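The search through calling frames might look like the following Python sketch.
It is an illustration only: the ``frame_lane_pcs`` structure (one per-lane list
for each call frame, innermost frame first, with ``None`` standing for the
undefined location) and the function name are hypothetical, and the work-group
validity check is assumed to have been done already.

.. code:: python

  def first_defined_lane_pc(frame_lane_pcs, lane):
      """Return (frame depth, pc) of the innermost frame in which the lane
      has a defined program location, or None if there is no such frame."""
      for depth, lane_pcs in enumerate(frame_lane_pcs):
          pc = lane_pcs[lane]
          if pc is not None:
              return depth, pc
      return None

A lane that was inactive on entry to the two innermost subprograms would thus
be reported at depth 2, which is the first call frame it conceptually has.
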
2000The following example illustrates how the AMDGPU backend can generate a DWARF
2001location list expression for the nested ``IF/THEN/ELSE`` structures of the
2002following subprogram pseudo code for a target with 64 lanes per wavefront.
2003
2004.. code::
2005  :number-lines:
2006
2007  SUBPROGRAM X
2008  BEGIN
2009    a;
2010    IF (c1) THEN
2011      b;
2012      IF (c2) THEN
2013        c;
2014      ELSE
2015        d;
2016      ENDIF
2017      e;
2018    ELSE
2019      f;
2020    ENDIF
2021    g;
2022  END
2023
The AMDGPU backend may generate the following pseudo LLVM MIR to manipulate the
execution mask (``EXEC``) to linearize the control flow. The condition is
evaluated to make a mask of the lanes for which the condition evaluates to true.
First the ``THEN`` region is executed by setting the ``EXEC`` mask to the
logical ``AND`` of the current ``EXEC`` mask with the condition mask. Then the
``ELSE`` region is executed by setting the ``EXEC`` mask to the logical ``AND``
of the negated ``EXEC`` mask with the ``EXEC`` mask saved at the start of the
region. After the ``IF/THEN/ELSE`` region the ``EXEC`` mask is restored to the
value it had at the beginning of the region. This is shown below. Other
approaches are possible, but the basic concept is the same.
2034
2035.. code::
2036  :number-lines:
2037
2038  $lex_start:
2039    a;
2040    %1 = EXEC
2041    %2 = c1
2042  $lex_1_start:
2043    EXEC = %1 & %2
  $lex_1_then:
2045      b;
2046      %3 = EXEC
2047      %4 = c2
2048  $lex_1_1_start:
2049      EXEC = %3 & %4
2050  $lex_1_1_then:
2051        c;
2052      EXEC = ~EXEC & %3
2053  $lex_1_1_else:
2054        d;
2055      EXEC = %3
2056  $lex_1_1_end:
2057      e;
2058    EXEC = ~EXEC & %1
2059  $lex_1_else:
2060      f;
2061    EXEC = %1
2062  $lex_1_end:
2063    g;
2064  $lex_end:
2065
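The effect of this mask manipulation can be traced with a small Python
simulation. It is a sketch under simplifying assumptions: the conditions ``c1``
and ``c2`` are supplied as precomputed lane masks, the statements ``a`` to
``g`` are represented only by the ``EXEC`` mask in effect when they run, and
the function name is hypothetical.

.. code:: python

  def run_linearized(c1, c2, exec_mask):
      """Return the EXEC mask in effect for each statement a..g."""
      trace = {"a": exec_mask}
      saved_1 = exec_mask                 # %1 = EXEC
      exec_mask = saved_1 & c1            # enter outer THEN
      trace["b"] = exec_mask
      saved_2 = exec_mask                 # %3 = EXEC
      exec_mask = saved_2 & c2            # enter inner THEN
      trace["c"] = exec_mask
      exec_mask = ~exec_mask & saved_2    # enter inner ELSE
      trace["d"] = exec_mask
      exec_mask = saved_2                 # leave inner IF/THEN/ELSE
      trace["e"] = exec_mask
      exec_mask = ~exec_mask & saved_1    # enter outer ELSE
      trace["f"] = exec_mask
      exec_mask = saved_1                 # leave outer IF/THEN/ELSE
      trace["g"] = exec_mask
      return trace

With four lanes all active on entry, ``c1 = 0b0011`` and ``c2 = 0b0101``, the
statements ``c``, ``d`` and ``f`` run with masks ``0b0001``, ``0b0010`` and
``0b1100`` respectively, so every lane executes exactly one leaf region.
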
2066To create the DWARF location list expression that defines the location
2067description of a vector of lane program locations, the LLVM MIR ``DBG_VALUE``
2068pseudo instruction can be used to annotate the linearized control flow. This can
2069be done by defining an artificial variable for the lane PC. The DWARF location
2070list expression created for it is used as the value of the
2071``DW_AT_LLVM_lane_pc`` attribute on the subprogram's debugger information entry.
2072
2073A DWARF procedure is defined for each well nested structured control flow region
2074which provides the conceptual lane program location for a lane if it is not
2075active (namely it is divergent). The DWARF operation expression for each region
2076conceptually inherits the value of the immediately enclosing region and modifies
2077it according to the semantics of the region.
2078
2079For an ``IF/THEN/ELSE`` region the divergent program location is at the start of
2080the region for the ``THEN`` region since it is executed first. For the ``ELSE``
2081region the divergent program location is at the end of the ``IF/THEN/ELSE``
2082region since the ``THEN`` region has completed.
2083
2084The lane PC artificial variable is assigned at each region transition. It uses
2085the immediately enclosing region's DWARF procedure to compute the program
2086location for each lane assuming they are divergent, and then modifies the result
2087by inserting the current program location for each lane that the ``EXEC`` mask
2088indicates is active.
2089
2090By having separate DWARF procedures for each region, they can be reused to
2091define the value for any nested region. This reduces the total size of the DWARF
2092operation expressions.
2093
2094The following provides an example using pseudo LLVM MIR.
2095
2096.. code::
2097  :number-lines:
2098
2099  $lex_start:
2100    DEFINE_DWARF %__uint_64 = DW_TAG_base_type[
2101      DW_AT_name = "__uint64";
2102      DW_AT_byte_size = 8;
2103      DW_AT_encoding = DW_ATE_unsigned;
2104    ];
2105    DEFINE_DWARF %__active_lane_pc = DW_TAG_dwarf_procedure[
2106      DW_AT_name = "__active_lane_pc";
2107      DW_AT_location = [
2108        DW_OP_regx PC;
2109        DW_OP_LLVM_extend 64, 64;
        DW_OP_regval_type EXEC, %__uint_64;
2111        DW_OP_LLVM_select_bit_piece 64, 64;
2112      ];
2113    ];
2114    DEFINE_DWARF %__divergent_lane_pc = DW_TAG_dwarf_procedure[
2115      DW_AT_name = "__divergent_lane_pc";
2116      DW_AT_location = [
2117        DW_OP_LLVM_undefined;
2118        DW_OP_LLVM_extend 64, 64;
2119      ];
2120    ];
2121    DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
2122      DW_OP_call_ref %__divergent_lane_pc;
2123      DW_OP_call_ref %__active_lane_pc;
2124    ];
2125    a;
2126    %1 = EXEC;
2127    DBG_VALUE %1, $noreg, %__lex_1_save_exec;
2128    %2 = c1;
2129  $lex_1_start:
2130    EXEC = %1 & %2;
2131  $lex_1_then:
2132      DEFINE_DWARF %__divergent_lane_pc_1_then = DW_TAG_dwarf_procedure[
2133        DW_AT_name = "__divergent_lane_pc_1_then";
2134        DW_AT_location = DIExpression[
2135          DW_OP_call_ref %__divergent_lane_pc;
2136          DW_OP_addrx &lex_1_start;
2137          DW_OP_stack_value;
2138          DW_OP_LLVM_extend 64, 64;
2139          DW_OP_call_ref %__lex_1_save_exec;
2140          DW_OP_deref_type 64, %__uint_64;
2141          DW_OP_LLVM_select_bit_piece 64, 64;
2142        ];
2143      ];
2144      DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
2145        DW_OP_call_ref %__divergent_lane_pc_1_then;
2146        DW_OP_call_ref %__active_lane_pc;
2147      ];
2148      b;
2149      %3 = EXEC;
      DBG_VALUE %3, $noreg, %__lex_1_1_save_exec;
2151      %4 = c2;
2152  $lex_1_1_start:
2153      EXEC = %3 & %4;
2154  $lex_1_1_then:
2155        DEFINE_DWARF %__divergent_lane_pc_1_1_then = DW_TAG_dwarf_procedure[
2156          DW_AT_name = "__divergent_lane_pc_1_1_then";
2157          DW_AT_location = DIExpression[
2158            DW_OP_call_ref %__divergent_lane_pc_1_then;
2159            DW_OP_addrx &lex_1_1_start;
2160            DW_OP_stack_value;
2161            DW_OP_LLVM_extend 64, 64;
2162            DW_OP_call_ref %__lex_1_1_save_exec;
2163            DW_OP_deref_type 64, %__uint_64;
2164            DW_OP_LLVM_select_bit_piece 64, 64;
2165          ];
2166        ];
2167        DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
2168          DW_OP_call_ref %__divergent_lane_pc_1_1_then;
2169          DW_OP_call_ref %__active_lane_pc;
2170        ];
2171        c;
2172      EXEC = ~EXEC & %3;
2173  $lex_1_1_else:
2174        DEFINE_DWARF %__divergent_lane_pc_1_1_else = DW_TAG_dwarf_procedure[
2175          DW_AT_name = "__divergent_lane_pc_1_1_else";
2176          DW_AT_location = DIExpression[
2177            DW_OP_call_ref %__divergent_lane_pc_1_then;
2178            DW_OP_addrx &lex_1_1_end;
2179            DW_OP_stack_value;
2180            DW_OP_LLVM_extend 64, 64;
2181            DW_OP_call_ref %__lex_1_1_save_exec;
2182            DW_OP_deref_type 64, %__uint_64;
2183            DW_OP_LLVM_select_bit_piece 64, 64;
2184          ];
2185        ];
2186        DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
2187          DW_OP_call_ref %__divergent_lane_pc_1_1_else;
2188          DW_OP_call_ref %__active_lane_pc;
2189        ];
2190        d;
2191      EXEC = %3;
2192  $lex_1_1_end:
2193      DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
        DW_OP_call_ref %__divergent_lane_pc_1_then;
2195        DW_OP_call_ref %__active_lane_pc;
2196      ];
2197      e;
2198    EXEC = ~EXEC & %1;
2199  $lex_1_else:
2200      DEFINE_DWARF %__divergent_lane_pc_1_else = DW_TAG_dwarf_procedure[
2201        DW_AT_name = "__divergent_lane_pc_1_else";
2202        DW_AT_location = DIExpression[
2203          DW_OP_call_ref %__divergent_lane_pc;
2204          DW_OP_addrx &lex_1_end;
2205          DW_OP_stack_value;
2206          DW_OP_LLVM_extend 64, 64;
2207          DW_OP_call_ref %__lex_1_save_exec;
2208          DW_OP_deref_type 64, %__uint_64;
2209          DW_OP_LLVM_select_bit_piece 64, 64;
2210        ];
2211      ];
2212      DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
2213        DW_OP_call_ref %__divergent_lane_pc_1_else;
2214        DW_OP_call_ref %__active_lane_pc;
2215      ];
2216      f;
2217    EXEC = %1;
2218  $lex_1_end:
    DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
2220      DW_OP_call_ref %__divergent_lane_pc;
2221      DW_OP_call_ref %__active_lane_pc;
2222    ];
2223    g;
2224  $lex_end:
2225
The DWARF procedure ``%__active_lane_pc`` is used to update the lane PC elements
that are active with the current program location.
2228
Artificial variables ``%__lex_1_save_exec`` and ``%__lex_1_1_save_exec`` are
created for the execution masks saved on entry to a region. Using the
``DBG_VALUE`` pseudo instruction, location list entries will be created that
describe where the artificial variables are allocated at any given program
location. The compiler may allocate them to registers or spill them to memory.
2234
The DWARF procedures for each region use the values of the saved execution mask
artificial variables to only update the lanes that are active on entry to the
region. All other lanes retain the value of the enclosing region where they were
last active. If they were not active on entry to the subprogram, then they will
have the undefined location description.
2240
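The layering of the DWARF procedures can be modelled loosely in Python. This is
a conceptual sketch rather than a DWARF evaluator: per-lane locations are held
in a list, ``None`` stands for the undefined location, and the helper names
(including the explicit argument order of ``select_bit_piece``) are assumptions
made for illustration.

.. code:: python

  def extend(value, lanes):
      """Replicate one location for every lane."""
      return [value] * lanes

  def select_bit_piece(mask, one_location, zero_location):
      """Per lane, take one_location if the mask bit is set, else
      zero_location."""
      return [one if (mask >> lane) & 1 else zero
              for lane, (one, zero)
              in enumerate(zip(one_location, zero_location))]

  def lane_pcs_in_then(saved_exec, exec_mask, region_start, current_pc, lanes):
      """Lane PCs while executing the THEN region of an outermost IF."""
      # Subprogram level: every lane starts out undefined.
      divergent = extend(None, lanes)
      # Region level: lanes active on entry to the region wait at its start.
      divergent = select_bit_piece(saved_exec, extend(region_start, lanes),
                                   divergent)
      # Finally, lanes currently active are at the current program location.
      return select_bit_piece(exec_mask, extend(current_pc, lanes), divergent)

With four lanes, a saved mask of ``0b0111`` and a current ``EXEC`` of
``0b0011``, lanes 0 and 1 report the current location, lane 2 reports the start
of the region, and lane 3 remains undefined.
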
2241Other structured control flow regions can be handled similarly. For example,
2242loops would set the divergent program location for the region at the end of the
2243loop. Any lanes active will be in the loop, and any lanes not active must have
2244exited the loop.
2245
2246An ``IF/THEN/ELSEIF/ELSEIF/...`` region can be treated as a nest of
2247``IF/THEN/ELSE`` regions.
2248
2249The DWARF procedures can use the active lane artificial variable described in
2250:ref:`amdgpu-dwarf-amdgpu-dw-at-llvm-active-lane` rather than the actual
2251``EXEC`` mask in order to support whole or quad wavefront mode.
2252
2253.. _amdgpu-dwarf-amdgpu-dw-at-llvm-active-lane:
2254
2255``DW_AT_LLVM_active_lane``
2256~~~~~~~~~~~~~~~~~~~~~~~~~~
2257
2258The ``DW_AT_LLVM_active_lane`` attribute on a subprogram debugger information
2259entry is used to specify the lanes that are conceptually active for a SIMT
2260thread.
2261
2262The execution mask may be modified to implement whole or quad wavefront mode
2263operations. For example, all lanes may need to temporarily be made active to
2264execute a whole wavefront operation. Such regions would save the ``EXEC`` mask,
2265update it to enable the necessary lanes, perform the operations, and then
2266restore the ``EXEC`` mask from the saved value. While executing the whole
2267wavefront region, the conceptual execution mask is the saved value, not the
2268``EXEC`` value.
2269
2270This is handled by defining an artificial variable for the active lane mask. The
2271active lane mask artificial variable would be the actual ``EXEC`` mask for
2272normal regions, and the saved execution mask for regions where the mask is
2273temporarily updated. The location list expression created for this artificial
2274variable is used to define the value of the ``DW_AT_LLVM_active_lane``
2275attribute.
2276
2277``DW_AT_LLVM_augmentation``
2278~~~~~~~~~~~~~~~~~~~~~~~~~~~
2279
2280For AMDGPU, the ``DW_AT_LLVM_augmentation`` attribute of a compilation unit
2281debugger information entry has the following value for the augmentation string:
2282
2283::
2284
2285  [amdgpu:v0.0]
2286
2287The "vX.Y" specifies the major X and minor Y version number of the AMDGPU
2288extensions used in the DWARF of the compilation unit. The version number
2289conforms to [SEMVER]_.
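
As an illustration, a consumer could extract the version from such a string
roughly as follows. This is a minimal sketch: the helper name
``parse_augmentation`` is hypothetical, and the pattern assumes the string has
exactly the ``[<vendor>:v<major>.<minor>]`` shape of the example above.

.. code-block:: python

   import re

   # Assumed shape, inferred from the example: "[<vendor>:v<major>.<minor>]".
   _AUGMENTATION = re.compile(
       r"\[(?P<vendor>[^:\]]+):v(?P<major>\d+)\.(?P<minor>\d+)\]"
   )


   def parse_augmentation(text):
       """Return (vendor, major, minor) or raise ValueError on a mismatch."""
       match = _AUGMENTATION.fullmatch(text)
       if match is None:
           raise ValueError(f"unrecognised augmentation string: {text!r}")
       return match["vendor"], int(match["major"]), int(match["minor"])


   print(parse_augmentation("[amdgpu:v0.0]"))

A consumer would typically compare the major number against the versions it
understands before interpreting any AMDGPU extensions.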
2290
2291Call Frame Information
2292----------------------
2293
2294DWARF Call Frame Information (CFI) describes how a consumer can virtually
2295*unwind* call frames in a running process or core dump. See DWARF Version 5
2296section 6.4 and :ref:`amdgpu-dwarf-call-frame-information`.
2297
2298For AMDGPU, the Common Information Entry (CIE) fields have the following values:
2299
23001.  ``augmentation`` string contains the following null-terminated UTF-8 string:
2301
2302    ::
2303
2304      [amd:v0.0]
2305
    The ``vX.Y`` specifies the major X and minor Y version number of the AMDGPU
    extensions used in this CIE or in the FDEs that use it. The version number
    conforms to [SEMVER]_.
2309
23102.  ``address_size`` for the ``Global`` address space is defined in
2311    :ref:`amdgpu-dwarf-address-space-identifier`.
2312
23133.  ``segment_selector_size`` is 0 as AMDGPU does not use a segment selector.
2314
23154.  ``code_alignment_factor`` is 4 bytes.
2316
2317    .. TODO::
2318
2319       Add to :ref:`amdgpu-processor-table` table.
2320
23215.  ``data_alignment_factor`` is 4 bytes.
2322
2323    .. TODO::
2324
2325       Add to :ref:`amdgpu-processor-table` table.
2326
23276.  ``return_address_register`` is ``PC_32`` for 32-bit processes and ``PC_64``
2328    for 64-bit processes defined in :ref:`amdgpu-dwarf-register-identifier`.
2329
7.  ``initial_instructions``: Since a subprogram X with fewer registers can be
    called from a subprogram Y that has more allocated, X will not change any
    of the extra registers as it cannot access them. Therefore, the default
    rule for all columns is ``same value``.
2334
2335For AMDGPU the register number follows the numbering defined in
2336:ref:`amdgpu-dwarf-register-identifier`.
2337
2338For AMDGPU the instructions are variable size. A consumer can subtract 1 from
2339the return address to get the address of a byte within the call site
2340instructions. See DWARF Version 5 section 6.4.4.
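
The subtraction matters because the return address is the address of the
instruction *after* the call. The sketch below shows one way a consumer might
map a return address back to its call site instruction. The function name and
the instruction layout are hypothetical and are not taken from any real
debugger.

.. code-block:: python

   from bisect import bisect_right


   def call_site_instruction(instruction_starts, return_address):
       """Return the start of the instruction containing return_address - 1.

       instruction_starts must be sorted in ascending order.
       """
       index = bisect_right(instruction_starts, return_address - 1) - 1
       if index < 0:
           raise ValueError("return address precedes the first instruction")
       return instruction_starts[index]


   # Hypothetical layout: a 4-byte instruction at 0x100, an 8-byte call at
   # 0x104, and the instruction that follows the call at 0x10C.
   starts = [0x100, 0x104, 0x10C]
   print(hex(call_site_instruction(starts, 0x10C)))

Looking up ``0x10C`` directly would select the instruction following the call,
whereas ``0x10C - 1`` falls inside the call instruction itself.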
2341
2342Accelerated Access
2343------------------
2344
2345See DWARF Version 5 section 6.1.
2346
2347Lookup By Name Section Header
2348~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
2349
2350See DWARF Version 5 section 6.1.1.4.1 and :ref:`amdgpu-dwarf-lookup-by-name`.
2351
For AMDGPU the lookup by name section header table has the following values:
2353
2354``augmentation_string_size`` (uword)
2355
2356  Set to the length of the ``augmentation_string`` value which is always a
2357  multiple of 4.
2358
2359``augmentation_string`` (sequence of UTF-8 characters)
2360
2361  Contains the following UTF-8 string null padded to a multiple of 4 bytes:
2362
2363  ::
2364
2365    [amdgpu:v0.0]
2366
2367  The "vX.Y" specifies the major X and minor Y version number of the AMDGPU
2368  extensions used in the DWARF of this index. The version number conforms to
2369  [SEMVER]_.
2370
2371  .. note::
2372
2373    This is different to the DWARF Version 5 definition that requires the first
2374    4 characters to be the vendor ID. But this is consistent with the other
2375    augmentation strings and does allow multiple vendor contributions. However,
2376    backwards compatibility may be more desirable.
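
The padding rule can be sketched as follows. The helper name
``encode_augmentation`` is hypothetical; the sketch only illustrates rounding
the encoded length up to a multiple of 4 and filling the remainder with null
bytes.

.. code-block:: python

   def encode_augmentation(text):
       """Return (augmentation_string_size, augmentation_string) as bytes."""
       raw = text.encode("utf-8")
       padded_size = (len(raw) + 3) // 4 * 4  # round up to a multiple of 4
       return padded_size, raw.ljust(padded_size, b"\0")


   size, data = encode_augmentation("[amdgpu:v0.0]")
   print(size, data)

The example string is 13 bytes long, so it is padded with 3 null bytes to give
an ``augmentation_string_size`` of 16.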
2377
2378Lookup By Address Section Header
2379~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
2380
2381See DWARF Version 5 section 6.1.2.
2382
For AMDGPU the lookup by address section header table has the following values:
2384
2385``address_size`` (ubyte)
2386
2387  Match the address size for the ``Global`` address space defined in
2388  :ref:`amdgpu-dwarf-address-space-identifier`.
2389
2390``segment_selector_size`` (ubyte)
2391
2392  AMDGPU does not use a segment selector so this is 0. The entries in the
2393  ``.debug_aranges`` do not have a segment selector.
2394
2395Line Number Information
2396-----------------------
2397
2398See DWARF Version 5 section 6.2 and :ref:`amdgpu-dwarf-line-number-information`.
2399
AMDGPU does not use the ``isa`` state machine register and always sets it to 0.
2401The instruction set must be obtained from the ELF file header ``e_flags`` field
2402in the ``EF_AMDGPU_MACH`` bit position (see :ref:`ELF Header
2403<amdgpu-elf-header>`). See DWARF Version 5 section 6.2.2.
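
A consumer might recover the machine value along the following lines. This is a
sketch under stated assumptions: it handles only little-endian ELF64 headers,
it takes ``EF_AMDGPU_MACH`` to be the mask ``0x0ff`` as listed in the ELF header
section, and the function name and the synthetic header are made up for the
example.

.. code-block:: python

   import struct

   EF_AMDGPU_MACH = 0x0FF  # assumed mask of the machine field within e_flags


   def amdgpu_mach(elf_header):
       """Return the EF_AMDGPU_MACH bits from the first 64 bytes of an ELF64."""
       if elf_header[:4] != b"\x7fELF":
           raise ValueError("not an ELF file")
       if elf_header[4] != 2 or elf_header[5] != 1:
           raise ValueError("sketch only handles little-endian ELF64")
       # In an ELF64 header, e_flags is a 4-byte field at byte offset 48.
       (e_flags,) = struct.unpack_from("<I", elf_header, 48)
       return e_flags & EF_AMDGPU_MACH


   # Synthetic header with e_flags = 0x12C; the value is arbitrary.
   header = bytearray(64)
   header[:6] = b"\x7fELF\x02\x01"
   struct.pack_into("<I", header, 48, 0x12C)
   print(hex(amdgpu_mach(bytes(header))))

The resulting value would then be looked up in the machine table of the ELF
header section to identify the instruction set.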
2404
2405.. TODO::
2406
2407  Should the ``isa`` state machine register be used to indicate if the code is
2408  in wavefront32 or wavefront64 mode? Or used to specify the architecture ISA?
2409
2410For AMDGPU the line number program header fields have the following values (see
2411DWARF Version 5 section 6.2.4):
2412
2413``address_size`` (ubyte)
2414  Matches the address size for the ``Global`` address space defined in
2415  :ref:`amdgpu-dwarf-address-space-identifier`.
2416
2417``segment_selector_size`` (ubyte)
2418  AMDGPU does not use a segment selector so this is 0.
2419
2420``minimum_instruction_length`` (ubyte)
2421  For GFX9-GFX10 this is 4.
2422
2423``maximum_operations_per_instruction`` (ubyte)
2424  For GFX9-GFX10 this is 1.
2425
2426Source text for online-compiled programs (for example, those compiled by the
2427OpenCL language runtime) may be embedded into the DWARF Version 5 line table.
2428See DWARF Version 5 section 6.2.4.1 which is updated by *DWARF Extensions For
2429Heterogeneous Debugging* section :ref:`DW_LNCT_LLVM_source
2430<amdgpu-dwarf-line-number-information-dw-lnct-llvm-source>`.
2431
2432The Clang option used to control source embedding in AMDGPU is defined in
2433:ref:`amdgpu-clang-debug-options-table`.
2434
2435  .. table:: AMDGPU Clang Debug Options
2436     :name: amdgpu-clang-debug-options-table
2437
2438     ==================== ==================================================
2439     Debug Flag           Description
2440     ==================== ==================================================
2441     -g[no-]embed-source  Enable/disable embedding source text in DWARF
2442                          debug sections. Useful for environments where
2443                          source cannot be written to disk, such as
2444                          when performing online compilation.
2445     ==================== ==================================================
2446
2447For example:
2448
2449``-gembed-source``
2450  Enable the embedded source.
2451
2452``-gno-embed-source``
2453  Disable the embedded source.
2454
245532-Bit and 64-Bit DWARF Formats
2456-------------------------------
2457
2458See DWARF Version 5 section 7.4 and
2459:ref:`amdgpu-dwarf-32-bit-and-64-bit-dwarf-formats`.
2460
2461For AMDGPU:
2462
2463* For the ``amdgcn`` target architecture only the 64-bit process address space
2464  is supported.
2465
2466* The producer can generate either 32-bit or 64-bit DWARF format. LLVM generates
2467  the 32-bit DWARF format.
2468
2469Unit Headers
2470------------
2471
2472For AMDGPU the following values apply for each of the unit headers described in
2473DWARF Version 5 sections 7.5.1.1, 7.5.1.2, and 7.5.1.3:
2474
2475``address_size`` (ubyte)
2476  Matches the address size for the ``Global`` address space defined in
2477  :ref:`amdgpu-dwarf-address-space-identifier`.
2478
2479.. _amdgpu-code-conventions:
2480
2481Code Conventions
2482================
2483
2484This section provides code conventions used for each supported target triple OS
2485(see :ref:`amdgpu-target-triples`).
2486
2487AMDHSA
2488------
2489
2490This section provides code conventions used when the target triple OS is
2491``amdhsa`` (see :ref:`amdgpu-target-triples`).
2492
2493.. _amdgpu-amdhsa-code-object-metadata:
2494
2495Code Object Metadata
2496~~~~~~~~~~~~~~~~~~~~
2497
2498The code object metadata specifies extensible metadata associated with the code
objects executed on HSA [HSA]_ compatible runtimes (see :ref:`amdgpu-os`). The
encoding and semantics of this metadata depend on the code object version; see
2501:ref:`amdgpu-amdhsa-code-object-metadata-v2`,
2502:ref:`amdgpu-amdhsa-code-object-metadata-v3`, and
2503:ref:`amdgpu-amdhsa-code-object-metadata-v4`.
2504
2505Code object metadata is specified in a note record (see
2506:ref:`amdgpu-note-records`) and is required when the target triple OS is
2507``amdhsa`` (see :ref:`amdgpu-target-triples`). It must contain the minimum
2508information necessary to support the HSA compatible runtime kernel queries. For
2509example, the segment sizes needed in a dispatch packet. In addition, a
2510high-level language runtime may require other information to be included. For
2511example, the AMD OpenCL runtime records kernel argument information.
2512
2513.. _amdgpu-amdhsa-code-object-metadata-v2:
2514
2515Code Object V2 Metadata
2516+++++++++++++++++++++++
2517
2518.. warning::
2519  Code object V2 is not the default code object version emitted by this version
2520  of LLVM.
2521
2522Code object V2 metadata is specified by the ``NT_AMD_HSA_METADATA`` note record
2523(see :ref:`amdgpu-note-records-v2`).
2524
2525The metadata is specified as a YAML formatted string (see [YAML]_ and
2526:doc:`YamlIO`).
2527
2528.. TODO::
2529
  Is the string null terminated? It probably should not be if YAML allows it to
  contain null characters; otherwise it should be.
2532
2533The metadata is represented as a single YAML document comprised of the mapping
2534defined in table :ref:`amdgpu-amdhsa-code-object-metadata-map-v2-table` and
2535referenced tables.
2536
2537For boolean values, the string values of ``false`` and ``true`` are used for
2538false and true respectively.
2539
2540Additional information can be added to the mappings. To avoid conflicts, any
2541non-AMD key names should be prefixed by "*vendor-name*.".
2542
2543  .. table:: AMDHSA Code Object V2 Metadata Map
2544     :name: amdgpu-amdhsa-code-object-metadata-map-v2-table
2545
2546     ========== ============== ========= =======================================
2547     String Key Value Type     Required? Description
2548     ========== ============== ========= =======================================
2549     "Version"  sequence of    Required  - The first integer is the major
2550                2 integers                 version. Currently 1.
2551                                         - The second integer is the minor
2552                                           version. Currently 0.
2553     "Printf"   sequence of              Each string is encoded information
2554                strings                  about a printf function call. The
2555                                         encoded information is organized as
2556                                         fields separated by colon (':'):
2557
2558                                         ``ID:N:S[0]:S[1]:...:S[N-1]:FormatString``
2559
2560                                         where:
2561
2562                                         ``ID``
2563                                           A 32-bit integer as a unique id for
2564                                           each printf function call
2565
2566                                         ``N``
2567                                           A 32-bit integer equal to the number
2568                                           of arguments of printf function call
2569                                           minus 1
2570
2571                                         ``S[i]`` (where i = 0, 1, ... , N-1)
2572                                           32-bit integers for the size in bytes
2573                                           of the i-th FormatString argument of
2574                                           the printf function call
2575
2576                                         FormatString
2577                                           The format string passed to the
2578                                           printf function call.
2579     "Kernels"  sequence of    Required  Sequence of the mappings for each
2580                mapping                  kernel in the code object. See
2581                                         :ref:`amdgpu-amdhsa-code-object-kernel-metadata-map-v2-table`
2582                                         for the definition of the mapping.
2583     ========== ============== ========= =======================================
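
The "Printf" encoding in the table above could be decoded along the following
lines. This is a sketch rather than the runtime's implementation: the function
name ``parse_printf_entry`` and the sample entries are hypothetical. Because
the format string may itself contain colons, the sketch splits off only the
leading ``ID``, ``N``, and ``S[i]`` fields.

.. code-block:: python

   def parse_printf_entry(entry):
       """Split "ID:N:S[0]:...:S[N-1]:FormatString" into (id, sizes, format)."""
       id_field, count_field, rest = entry.split(":", 2)
       count = int(count_field)
       # Split off exactly N size fields; the remainder is the format string.
       parts = rest.split(":", count)
       if len(parts) != count + 1:
           raise ValueError(f"expected {count} argument sizes in {entry!r}")
       sizes = [int(size) for size in parts[:count]]
       return int(id_field), sizes, parts[count]


   print(parse_printf_entry("1:2:4:4:%d %d"))
   print(parse_printf_entry("2:1:8:value: %s"))

The second entry shows a format string containing a colon, which is preserved
because the number of splits is bounded by ``N``.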
2584
2585..
2586
2587  .. table:: AMDHSA Code Object V2 Kernel Metadata Map
2588     :name: amdgpu-amdhsa-code-object-kernel-metadata-map-v2-table
2589
2590     ================= ============== ========= ================================
2591     String Key        Value Type     Required? Description
2592     ================= ============== ========= ================================
2593     "Name"            string         Required  Source name of the kernel.
2594     "SymbolName"      string         Required  Name of the kernel
2595                                                descriptor ELF symbol.
2596     "Language"        string                   Source language of the kernel.
2597                                                Values include:
2598
2599                                                - "OpenCL C"
2600                                                - "OpenCL C++"
2601                                                - "HCC"
2602                                                - "OpenMP"
2603
2604     "LanguageVersion" sequence of              - The first integer is the major
2605                       2 integers                 version.
2606                                                - The second integer is the
2607                                                  minor version.
2608     "Attrs"           mapping                  Mapping of kernel attributes.
2609                                                See
2610                                                :ref:`amdgpu-amdhsa-code-object-kernel-attribute-metadata-map-v2-table`
2611                                                for the mapping definition.
2612     "Args"            sequence of              Sequence of mappings of the
2613                       mapping                  kernel arguments. See
2614                                                :ref:`amdgpu-amdhsa-code-object-kernel-argument-metadata-map-v2-table`
2615                                                for the definition of the mapping.
2616     "CodeProps"       mapping                  Mapping of properties related to
2617                                                the kernel code. See
2618                                                :ref:`amdgpu-amdhsa-code-object-kernel-code-properties-metadata-map-v2-table`
2619                                                for the mapping definition.
2620     ================= ============== ========= ================================
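
Once the YAML string has been loaded into native mappings and sequences, the
"Required" columns of the two tables above can be checked mechanically. The
sketch below assumes the metadata is already a Python ``dict``. The function
name, the kernel names, and the ``foo@kd`` symbol name are hypothetical.

.. code-block:: python

   TOP_LEVEL_REQUIRED = ("Version", "Kernels")
   KERNEL_REQUIRED = ("Name", "SymbolName")


   def missing_keys(metadata):
       """Return readable paths of the required keys that are absent."""
       missing = [key for key in TOP_LEVEL_REQUIRED if key not in metadata]
       for index, kernel in enumerate(metadata.get("Kernels", [])):
           missing += [
               f"Kernels[{index}].{key}"
               for key in KERNEL_REQUIRED
               if key not in kernel
           ]
       return missing


   metadata = {
       "Version": [1, 0],
       "Kernels": [
           {"Name": "foo", "SymbolName": "foo@kd", "Language": "OpenCL C"},
           {"Name": "bar"},  # "SymbolName" deliberately omitted
       ],
   }
   print(missing_keys(metadata))

Only the second kernel is reported, since optional keys such as "Language" and
"Printf" may be absent.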
2621
2622..
2623
2624  .. table:: AMDHSA Code Object V2 Kernel Attribute Metadata Map
2625     :name: amdgpu-amdhsa-code-object-kernel-attribute-metadata-map-v2-table
2626
2627     =================== ============== ========= ==============================
2628     String Key          Value Type     Required? Description
2629     =================== ============== ========= ==============================
2630     "ReqdWorkGroupSize" sequence of              If not 0, 0, 0 then all values
2631                         3 integers               must be >=1 and the dispatch
2632                                                  work-group size X, Y, Z must
2633                                                  correspond to the specified
2634                                                  values. Defaults to 0, 0, 0.
2635
2636                                                  Corresponds to the OpenCL
2637                                                  ``reqd_work_group_size``
2638                                                  attribute.
2639     "WorkGroupSizeHint" sequence of              The dispatch work-group size
2640                         3 integers               X, Y, Z is likely to be the
2641                                                  specified values.
2642
2643                                                  Corresponds to the OpenCL
2644                                                  ``work_group_size_hint``
2645                                                  attribute.
2646     "VecTypeHint"       string                   The name of a scalar or vector
2647                                                  type.
2648
2649                                                  Corresponds to the OpenCL
2650                                                  ``vec_type_hint`` attribute.
2651
2652     "RuntimeHandle"     string                   The external symbol name
2653                                                  associated with a kernel.
2654                                                  OpenCL runtime allocates a
2655                                                  global buffer for the symbol
2656                                                  and saves the kernel's address
2657                                                  to it, which is used for
2658                                                  device side enqueueing. Only
2659                                                  available for device side
2660                                                  enqueued kernels.
2661     =================== ============== ========= ==============================
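
The "ReqdWorkGroupSize" rule can be expressed as a small check. This sketch
interprets "must correspond to" as exact equality in each dimension, which is
an assumption, and the function name ``dispatch_size_allowed`` is hypothetical.

.. code-block:: python

   def dispatch_size_allowed(reqd_work_group_size, dispatch_size):
       """Return True if dispatch_size satisfies the required size."""
       reqd = tuple(reqd_work_group_size)
       if reqd == (0, 0, 0):
           return True  # default value: no requirement is imposed
       if any(value < 1 for value in reqd):
           raise ValueError("values must all be >= 1 unless all are 0")
       return tuple(dispatch_size) == reqd


   print(dispatch_size_allowed([0, 0, 0], [64, 1, 1]))
   print(dispatch_size_allowed([256, 1, 1], [64, 1, 1]))

"WorkGroupSizeHint" is only a hint, so no comparable check applies to it.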
2662
2663..
2664
2665  .. table:: AMDHSA Code Object V2 Kernel Argument Metadata Map
2666     :name: amdgpu-amdhsa-code-object-kernel-argument-metadata-map-v2-table
2667
2668     ================= ============== ========= ================================
2669     String Key        Value Type     Required? Description
2670     ================= ============== ========= ================================
2671     "Name"            string                   Kernel argument name.
2672     "TypeName"        string                   Kernel argument type name.
2673     "Size"            integer        Required  Kernel argument size in bytes.
2674     "Align"           integer        Required  Kernel argument alignment in
2675                                                bytes. Must be a power of two.
2676     "ValueKind"       string         Required  Kernel argument kind that
2677                                                specifies how to set up the
2678                                                corresponding argument.
2679                                                Values include:
2680
2681                                                "ByValue"
2682                                                  The argument is copied
2683                                                  directly into the kernarg.
2684
2685                                                "GlobalBuffer"
2686                                                  A global address space pointer
2687                                                  to the buffer data is passed
2688                                                  in the kernarg.
2689
2690                                                "DynamicSharedPointer"
2691                                                  A group address space pointer
2692                                                  to dynamically allocated LDS
2693                                                  is passed in the kernarg.
2694
2695                                                "Sampler"
2696                                                  A global address space
2697                                                  pointer to a S# is passed in
2698                                                  the kernarg.
2699
2700                                                "Image"
2701                                                  A global address space
2702                                                  pointer to a T# is passed in
2703                                                  the kernarg.
2704
2705                                                "Pipe"
2706                                                  A global address space pointer
2707                                                  to an OpenCL pipe is passed in
2708                                                  the kernarg.
2709
2710                                                "Queue"
2711                                                  A global address space pointer
2712                                                  to an OpenCL device enqueue
2713                                                  queue is passed in the
2714                                                  kernarg.
2715
2716                                                "HiddenGlobalOffsetX"
2717                                                  The OpenCL grid dispatch
2718                                                  global offset for the X
2719                                                  dimension is passed in the
2720                                                  kernarg.
2721
2722                                                "HiddenGlobalOffsetY"
2723                                                  The OpenCL grid dispatch
2724                                                  global offset for the Y
2725                                                  dimension is passed in the
2726                                                  kernarg.
2727
2728                                                "HiddenGlobalOffsetZ"
2729                                                  The OpenCL grid dispatch
2730                                                  global offset for the Z
2731                                                  dimension is passed in the
2732                                                  kernarg.
2733
2734                                                "HiddenNone"
2735                                                  An argument that is not used
2736                                                  by the kernel. Space needs to
2737                                                  be left for it, but it does
2738                                                  not need to be set up.
2739
2740                                                "HiddenPrintfBuffer"
2741                                                  A global address space pointer
2742                                                  to the runtime printf buffer
2743                                                  is passed in kernarg.
2744
2745                                                "HiddenHostcallBuffer"
2746                                                  A global address space pointer
2747                                                  to the runtime hostcall buffer
2748                                                  is passed in kernarg.
2749
2750                                                "HiddenDefaultQueue"
2751                                                  A global address space pointer
2752                                                  to the OpenCL device enqueue
2753                                                  queue that should be used by
2754                                                  the kernel by default is
2755                                                  passed in the kernarg.
2756
2757                                                "HiddenCompletionAction"
2758                                                  A global address space pointer
2759                                                  to help link enqueued kernels into
2760                                                  the ancestor tree for determining
2761                                                  when the parent kernel has finished.
2762
2763                                                "HiddenMultiGridSyncArg"
2764                                                  A global address space pointer for
2765                                                  multi-grid synchronization is
2766                                                  passed in the kernarg.
2767
2768     "ValueType"       string                   Unused and deprecated. This should no longer
2769                                                be emitted, but is accepted for compatibility.
2770
2771
2772     "PointeeAlign"    integer                  Alignment in bytes of pointee
2773                                                type for pointer type kernel
2774                                                argument. Must be a power
2775                                                of 2. Only present if
2776                                                "ValueKind" is
2777                                                "DynamicSharedPointer".
2778     "AddrSpaceQual"   string                   Kernel argument address space
2779                                                qualifier. Only present if
2780                                                "ValueKind" is "GlobalBuffer" or
2781                                                "DynamicSharedPointer". Values
2782                                                are:
2783
2784                                                - "Private"
2785                                                - "Global"
2786                                                - "Constant"
2787                                                - "Local"
2788                                                - "Generic"
2789                                                - "Region"
2790
2791                                                .. TODO::
2792
2793                                                   Is GlobalBuffer only Global
2794                                                   or Constant? Is
2795                                                   DynamicSharedPointer always
2796                                                   Local? Can HCC allow Generic?
2797                                                   How can Private or Region
2798                                                   ever happen?
2799
2800     "AccQual"         string                   Kernel argument access
2801                                                qualifier. Only present if
2802                                                "ValueKind" is "Image" or
2803                                                "Pipe". Values
2804                                                are:
2805
2806                                                - "ReadOnly"
2807                                                - "WriteOnly"
2808                                                - "ReadWrite"
2809
2810                                                .. TODO::
2811
2812                                                   Does this apply to
2813                                                   GlobalBuffer?
2814
2815     "ActualAccQual"   string                   The actual memory accesses
2816                                                performed by the kernel on the
2817                                                kernel argument. Only present if
2818                                                "ValueKind" is "GlobalBuffer",
2819                                                "Image", or "Pipe". This may be
2820                                                more restrictive than indicated
2821                                                by "AccQual" to reflect what the
                                                kernel actually does. If not
2823                                                present then the runtime must
2824                                                assume what is implied by
2825                                                "AccQual" and "IsConst". Values
2826                                                are:
2827
2828                                                - "ReadOnly"
2829                                                - "WriteOnly"
2830                                                - "ReadWrite"
2831
2832     "IsConst"         boolean                  Indicates if the kernel argument
2833                                                is const qualified. Only present
2834                                                if "ValueKind" is
2835                                                "GlobalBuffer".
2836
2837     "IsRestrict"      boolean                  Indicates if the kernel argument
2838                                                is restrict qualified. Only
2839                                                present if "ValueKind" is
2840                                                "GlobalBuffer".
2841
2842     "IsVolatile"      boolean                  Indicates if the kernel argument
2843                                                is volatile qualified. Only
2844                                                present if "ValueKind" is
2845                                                "GlobalBuffer".
2846
2847     "IsPipe"          boolean                  Indicates if the kernel argument
2848                                                is pipe qualified. Only present
2849                                                if "ValueKind" is "Pipe".
2850
2851                                                .. TODO::
2852
2853                                                   Can GlobalBuffer be pipe
2854                                                   qualified?
2855
2856     ================= ============== ========= ================================
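
Several constraints in the kernel argument table lend themselves to a simple
consistency check. The sketch below covers only a subset of them (required
keys, power-of-two alignments, and the "ValueKind" conditions for
"PointeeAlign" and "AccQual"). The function names are hypothetical, and the
input is assumed to be an already-parsed mapping.

.. code-block:: python

   REQUIRED = ("Size", "Align", "ValueKind")


   def is_power_of_two(value):
       return value > 0 and (value & (value - 1)) == 0


   def check_argument(arg):
       """Return a list of problems found in one kernel argument mapping."""
       problems = [f"missing {key}" for key in REQUIRED if key not in arg]
       if "Align" in arg and not is_power_of_two(arg["Align"]):
           problems.append("Align is not a power of two")
       kind = arg.get("ValueKind")
       if "PointeeAlign" in arg:
           if kind != "DynamicSharedPointer":
               problems.append("PointeeAlign requires DynamicSharedPointer")
           if not is_power_of_two(arg["PointeeAlign"]):
               problems.append("PointeeAlign is not a power of two")
       if "AccQual" in arg and kind not in ("Image", "Pipe"):
           problems.append("AccQual requires Image or Pipe")
       return problems


   good = {"Size": 8, "Align": 8, "ValueKind": "GlobalBuffer"}
   bad = {"Size": 4, "Align": 3, "ValueKind": "ByValue", "PointeeAlign": 4}
   print(check_argument(good))
   print(check_argument(bad))

The first mapping passes, while the second is reported for its alignment and
for carrying "PointeeAlign" on a "ByValue" argument.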
2857
2858..
2859
2860  .. table:: AMDHSA Code Object V2 Kernel Code Properties Metadata Map
2861     :name: amdgpu-amdhsa-code-object-kernel-code-properties-metadata-map-v2-table
2862
2863     ============================ ============== ========= =====================
2864     String Key                   Value Type     Required? Description
2865     ============================ ============== ========= =====================
2866     "KernargSegmentSize"         integer        Required  The size in bytes of
2867                                                           the kernarg segment
2868                                                           that holds the values
2869                                                           of the arguments to
2870                                                           the kernel.
2871     "GroupSegmentFixedSize"      integer        Required  The amount of group
2872                                                           segment memory
2873                                                           required by a
2874                                                           work-group in
2875                                                           bytes. This does not
2876                                                           include any
2877                                                           dynamically allocated
2878                                                           group segment memory
2879                                                           that may be added
2880                                                           when the kernel is
2881                                                           dispatched.
2882     "PrivateSegmentFixedSize"    integer        Required  The amount of fixed
2883                                                           private address space
2884                                                           memory required for a
2885                                                           work-item in
2886                                                           bytes. If the kernel
2887                                                           uses a dynamic call
2888                                                           stack then additional
2889                                                           space must be added
2890                                                           to this value for the
2891                                                           call stack.
2892     "KernargSegmentAlign"        integer        Required  The maximum byte
2893                                                           alignment of
2894                                                           arguments in the
2895                                                           kernarg segment. Must
2896                                                           be a power of 2.
2897     "WavefrontSize"              integer        Required  Wavefront size. Must
2898                                                           be a power of 2.
2899     "NumSGPRs"                   integer        Required  Number of scalar
2900                                                           registers used by a
2901                                                           wavefront for
2902                                                           GFX6-GFX10. This
2903                                                           includes the special
2904                                                           SGPRs for VCC, Flat
2905                                                           Scratch (GFX7-GFX10)
2906                                                           and XNACK (for
2907                                                           GFX8-GFX10). It does
2908                                                           not include the 16
2909                                                           SGPRs added if a trap
2910                                                           handler is
2911                                                           enabled. It is not
2912                                                           rounded up to the
2913                                                           allocation
2914                                                           granularity.
2915     "NumVGPRs"                   integer        Required  Number of vector
2916                                                           registers used by
2917                                                           each work-item for
2918                                                           GFX6-GFX10.
2919     "MaxFlatWorkGroupSize"       integer        Required  Maximum flat
2920                                                           work-group size
2921                                                           supported by the
2922                                                           kernel in work-items.
2923                                                           Must be >=1 and
2924                                                           consistent with
2925                                                           ReqdWorkGroupSize if
2926                                                           not 0, 0, 0.
2927     "NumSpilledSGPRs"            integer                  Number of stores from
2928                                                           a scalar register to
2929                                                           a register allocator
2930                                                           created spill
2931                                                           location.
2932     "NumSpilledVGPRs"            integer                  Number of stores from
2933                                                           a vector register to
2934                                                           a register allocator
2935                                                           created spill
2936                                                           location.
2937     ============================ ============== ========= =====================
2938
2939.. _amdgpu-amdhsa-code-object-metadata-v3:
2940
2941Code Object V3 Metadata
2942+++++++++++++++++++++++
2943
2944Code object V3 to V4 metadata is specified by the ``NT_AMDGPU_METADATA`` note
2945record (see :ref:`amdgpu-note-records-v3-v4`).
2946
2947The metadata is represented as Message Pack formatted binary data (see
2948[MsgPack]_). The top level is a Message Pack map that includes the
2949keys defined in table
2950:ref:`amdgpu-amdhsa-code-object-metadata-map-table-v3` and referenced
2951tables.
2952
2953Additional information can be added to the maps. To avoid conflicts,
2954any key names should be prefixed by "*vendor-name*." where
2955*vendor-name* can be the name of the vendor and the specific vendor
2956tool that generates the information. The prefix is abbreviated to
2957simply "." when it appears within a map that has been added by the
2958same *vendor-name*.
2959
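The prefix rule above can be sketched in Python. This is only one reading of the rule, not part of the specification: the helper name ``expand_keys`` is hypothetical, and it assumes that an abbreviated key (one starting with ".") belongs to the vendor named by the nearest enclosing prefixed key.

```python
def expand_keys(node, vendor=""):
    """Return a copy of `node` with abbreviated "." keys spelled out in full.

    Hypothetical helper: `vendor` is the vendor-name of the enclosing map,
    or "" at the top level where no vendor is known yet.
    """
    if isinstance(node, dict):
        expanded = {}
        for key, value in node.items():
            if key.startswith("."):
                # Abbreviated key: inherit the vendor of the enclosing map.
                if not vendor:
                    raise ValueError(f"abbreviated key {key!r} has no enclosing vendor")
                full_key, inner_vendor = vendor + key, vendor
            elif "." in key:
                # Fully prefixed key: its prefix becomes the vendor for nested maps.
                full_key, inner_vendor = key, key.split(".", 1)[0]
            else:
                # Unprefixed key: keep it and the current vendor unchanged.
                full_key, inner_vendor = key, vendor
            expanded[full_key] = expand_keys(value, inner_vendor)
        return expanded
    if isinstance(node, list):
        return [expand_keys(item, vendor) for item in node]
    return node
```

For example, a ".name" key inside a map stored under "amdhsa.kernels" expands to "amdhsa.name" under this reading.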
2960  .. table:: AMDHSA Code Object V3 Metadata Map
2961     :name: amdgpu-amdhsa-code-object-metadata-map-table-v3
2962
2963     ================= ============== ========= =======================================
2964     String Key        Value Type     Required? Description
2965     ================= ============== ========= =======================================
2966     "amdhsa.version"  sequence of    Required  - The first integer is the major
2967                       2 integers                 version. Currently 1.
2968                                                - The second integer is the minor
2969                                                  version. Currently 0.
2970     "amdhsa.printf"   sequence of              Each string is encoded information
2971                       strings                  about a printf function call. The
2972                                                encoded information is organized as
2973                                                fields separated by colon (':'):
2974
2975                                                ``ID:N:S[0]:S[1]:...:S[N-1]:FormatString``
2976
2977                                                where:
2978
2979                                                ``ID``
2980                                                  A 32-bit integer as a unique id for
2981                                                  each printf function call.
2982
2983                                                ``N``
2984                                                  A 32-bit integer equal to the number
2985                                                  of arguments of the printf function
2986                                                  call minus 1.
2987
2988                                                ``S[i]`` (where i = 0, 1, ... , N-1)
2989                                                  32-bit integers for the size in bytes
2990                                                  of the i-th FormatString argument of
2991                                                  the printf function call.
2992
2993                                                ``FormatString``
2994                                                  The format string passed to the
2995                                                  printf function call.
2996     "amdhsa.kernels"  sequence of    Required  Sequence of the maps for each
2997                       map                      kernel in the code object. See
2998                                                :ref:`amdgpu-amdhsa-code-object-kernel-metadata-map-table-v3`
2999                                                for the definition of the keys included
3000                                                in that map.
3001     ================= ============== ========= =======================================
3002
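As an illustration of the "amdhsa.printf" string layout described in the table above, the following Python sketch splits one entry into its fields. The names ``PrintfInfo`` and ``parse_printf_entry`` are hypothetical. The sketch assumes the numeric fields are written in decimal and that a format string may itself contain colons, so only the first ``N + 2`` colons are treated as separators.

```python
from typing import List, NamedTuple


class PrintfInfo(NamedTuple):
    printf_id: int         # the ID field
    arg_sizes: List[int]   # S[0] .. S[N-1], sizes in bytes
    format_string: str     # the FormatString field


def parse_printf_entry(entry: str) -> PrintfInfo:
    """Parse one 'ID:N:S[0]:...:S[N-1]:FormatString' entry (illustrative only)."""
    # ID and N are the first two colon-separated fields.
    id_field, n_field, rest = entry.split(":", 2)
    n = int(n_field)
    # Split off exactly N size fields; whatever remains is the format string.
    fields = rest.split(":", n)
    if len(fields) != n + 1:
        raise ValueError(f"expected {n} size fields in {entry!r}")
    sizes = [int(field) for field in fields[:n]]
    return PrintfInfo(int(id_field), sizes, fields[n])
```

Under these assumptions, ``parse_printf_entry("1:2:4:8:x=%d, t=%ld:done")`` yields ID 1, argument sizes ``[4, 8]``, and the format string ``"x=%d, t=%ld:done"``.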
3003..
3004
3005  .. table:: AMDHSA Code Object V3 Kernel Metadata Map
3006     :name: amdgpu-amdhsa-code-object-kernel-metadata-map-table-v3
3007
3008     =================================== ============== ========= ================================
3009     String Key                          Value Type     Required? Description
3010     =================================== ============== ========= ================================
3011     ".name"                             string         Required  Source name of the kernel.
3012     ".symbol"                           string         Required  Name of the kernel
3013                                                                  descriptor ELF symbol.
3014     ".language"                         string                   Source language of the kernel.
3015                                                                  Values include:
3016
3017                                                                  - "OpenCL C"
3018                                                                  - "OpenCL C++"
3019                                                                  - "HCC"
3020                                                                  - "HIP"
3021                                                                  - "OpenMP"
3022                                                                  - "Assembler"
3023
3024     ".language_version"                 sequence of              - The first integer is the major
3025                                         2 integers                 version.
3026                                                                  - The second integer is the
3027                                                                    minor version.
3028     ".args"                             sequence of              Sequence of maps of the
3029                                         map                      kernel arguments. See
3030                                                                  :ref:`amdgpu-amdhsa-code-object-kernel-argument-metadata-map-table-v3`
3031                                                                  for the definition of the keys
3032                                                                  included in that map.
3033     ".reqd_workgroup_size"              sequence of              If not 0, 0, 0 then all values
3034                                         3 integers               must be >=1 and the dispatch
3035                                                                  work-group size X, Y, Z must
3036                                                                  correspond to the specified
3037                                                                  values. Defaults to 0, 0, 0.
3038
3039                                                                  Corresponds to the OpenCL
3040                                                                  ``reqd_work_group_size``
3041                                                                  attribute.
3042     ".workgroup_size_hint"              sequence of              The dispatch work-group size
3043                                         3 integers               X, Y, Z is likely to be the
3044                                                                  specified values.
3045
3046                                                                  Corresponds to the OpenCL
3047                                                                  ``work_group_size_hint``
3048                                                                  attribute.
3049     ".vec_type_hint"                    string                   The name of a scalar or vector
3050                                                                  type.
3051
3052                                                                  Corresponds to the OpenCL
3053                                                                  ``vec_type_hint`` attribute.
3054
3055     ".device_enqueue_symbol"            string                   The external symbol name
3056                                                                  associated with a kernel.
3057                                                                  OpenCL runtime allocates a
3058                                                                  global buffer for the symbol
3059                                                                  and saves the kernel's address
3060                                                                  to it, which is used for
3061                                                                  device side enqueueing. Only
3062                                                                  available for device side
3063                                                                  enqueued kernels.
3064     ".kernarg_segment_size"             integer        Required  The size in bytes of
3065                                                                  the kernarg segment
3066                                                                  that holds the values
3067                                                                  of the arguments to
3068                                                                  the kernel.
3069     ".group_segment_fixed_size"         integer        Required  The amount of group
3070                                                                  segment memory
3071                                                                  required by a
3072                                                                  work-group in
3073                                                                  bytes. This does not
3074                                                                  include any
3075                                                                  dynamically allocated
3076                                                                  group segment memory
3077                                                                  that may be added
3078                                                                  when the kernel is
3079                                                                  dispatched.
3080     ".private_segment_fixed_size"       integer        Required  The amount of fixed
3081                                                                  private address space
3082                                                                  memory required for a
3083                                                                  work-item in
3084                                                                  bytes. If the kernel
3085                                                                  uses a dynamic call
3086                                                                  stack then additional
3087                                                                  space must be added
3088                                                                  to this value for the
3089                                                                  call stack.
3090     ".kernarg_segment_align"            integer        Required  The maximum byte
3091                                                                  alignment of
3092                                                                  arguments in the
3093                                                                  kernarg segment. Must
3094                                                                  be a power of 2.
3095     ".wavefront_size"                   integer        Required  Wavefront size. Must
3096                                                                  be a power of 2.
3097     ".sgpr_count"                       integer        Required  Number of scalar
3098                                                                  registers required by a
3099                                                                  wavefront for
3100                                                                  GFX6-GFX9. A register
3101                                                                  is required if it is
3102                                                                  used explicitly, or
3103                                                                  if a higher numbered
3104                                                                  register is used
3105                                                                  explicitly. This
3106                                                                  includes the special
3107                                                                  SGPRs for VCC, Flat
3108                                                                  Scratch (GFX7-GFX9)
3109                                                                  and XNACK (for
3110                                                                  GFX8-GFX9). It does
3111                                                                  not include the 16
3112                                                                  SGPRs added if a trap
3113                                                                  handler is
3114                                                                  enabled. It is not
3115                                                                  rounded up to the
3116                                                                  allocation
3117                                                                  granularity.
3118     ".vgpr_count"                       integer        Required  Number of vector
3119                                                                  registers required by
3120                                                                  each work-item for
3121                                                                  GFX6-GFX9. A register
3122                                                                  is required if it is
3123                                                                  used explicitly, or
3124                                                                  if a higher numbered
3125                                                                  register is used
3126                                                                  explicitly.
3127     ".max_flat_workgroup_size"          integer        Required  Maximum flat
3128                                                                  work-group size
3129                                                                  supported by the
3130                                                                  kernel in work-items.
3131                                                                  Must be >=1 and
3132                                                                  consistent with
3133                                                                  ".reqd_workgroup_size"
3134                                                                  if not 0, 0, 0.
3135     ".sgpr_spill_count"                 integer                  Number of stores from
3136                                                                  a scalar register to
3137                                                                  a register allocator
3138                                                                  created spill
3139                                                                  location.
3140     ".vgpr_spill_count"                 integer                  Number of stores from
3141                                                                  a vector register to
3142                                                                  a register allocator
3143                                                                  created spill
3144                                                                  location.
3145     =================================== ============== ========= ================================
3146
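The constraints in the kernel metadata map above can be checked mechanically. The Python sketch below is not the validation performed by any actual tool; the function names are hypothetical. It also assumes that "consistent with" the required work-group size means the product X * Y * Z does not exceed ".max_flat_workgroup_size", which the table does not state explicitly.

```python
# Keys marked "Required" in the kernel metadata map table.
REQUIRED_KERNEL_KEYS = (
    ".name", ".symbol", ".kernarg_segment_size", ".group_segment_fixed_size",
    ".private_segment_fixed_size", ".kernarg_segment_align", ".wavefront_size",
    ".sgpr_count", ".vgpr_count", ".max_flat_workgroup_size",
)


def is_power_of_two(value):
    return isinstance(value, int) and value > 0 and (value & (value - 1)) == 0


def check_kernel_metadata(kernel):
    """Return a list of problems found in one kernel metadata map (a dict)."""
    problems = [f"missing required key {key}"
                for key in REQUIRED_KERNEL_KEYS if key not in kernel]
    for key in (".kernarg_segment_align", ".wavefront_size"):
        if key in kernel and not is_power_of_two(kernel[key]):
            problems.append(f"{key} must be a power of 2")
    max_flat = kernel.get(".max_flat_workgroup_size")
    if max_flat is not None and max_flat < 1:
        problems.append(".max_flat_workgroup_size must be >= 1")
    # ".reqd_workgroup_size" defaults to 0, 0, 0, meaning "not specified".
    reqd = list(kernel.get(".reqd_workgroup_size", [0, 0, 0]))
    if reqd != [0, 0, 0]:
        if len(reqd) != 3 or any(dim < 1 for dim in reqd):
            problems.append(".reqd_workgroup_size needs 3 values, all >= 1")
        elif max_flat is not None and reqd[0] * reqd[1] * reqd[2] > max_flat:
            # Assumed meaning of "consistent with"; see the lead-in.
            problems.append(".reqd_workgroup_size exceeds .max_flat_workgroup_size")
    return problems
```

A map that satisfies every rule produces an empty list, so a caller can treat any non-empty result as a rejection.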
3147..
3148
3149  .. table:: AMDHSA Code Object V3 Kernel Argument Metadata Map
3150     :name: amdgpu-amdhsa-code-object-kernel-argument-metadata-map-table-v3
3151
3152     ====================== ============== ========= ================================
3153     String Key             Value Type     Required? Description
3154     ====================== ============== ========= ================================
3155     ".name"                string                   Kernel argument name.
3156     ".type_name"           string                   Kernel argument type name.
3157     ".size"                integer        Required  Kernel argument size in bytes.
3158     ".offset"              integer        Required  Kernel argument offset in
3159                                                     bytes. The offset must be a
3160                                                     multiple of the alignment
3161                                                     required by the argument.
3162     ".value_kind"          string         Required  Kernel argument kind that
3163                                                     specifies how to set up the
3164                                                     corresponding argument.
3165                                                     Values include:
3166
3167                                                     "by_value"
3168                                                       The argument is copied
3169                                                       directly into the kernarg.
3170
3171                                                     "global_buffer"
3172                                                       A global address space pointer
3173                                                       to the buffer data is passed
3174                                                       in the kernarg.
3175
3176                                                     "dynamic_shared_pointer"
3177                                                       A group address space pointer
3178                                                       to dynamically allocated LDS
3179                                                       is passed in the kernarg.
3180
3181                                                     "sampler"
3182                                                       A global address space
3183                                                       pointer to an S# is passed in
3184                                                       the kernarg.
3185
3186                                                     "image"
3187                                                       A global address space
3188                                                       pointer to a T# is passed in
3189                                                       the kernarg.
3190
3191                                                     "pipe"
3192                                                       A global address space pointer
3193                                                       to an OpenCL pipe is passed in
3194                                                       the kernarg.
3195
3196                                                     "queue"
3197                                                       A global address space pointer
3198                                                       to an OpenCL device enqueue
3199                                                       queue is passed in the
3200                                                       kernarg.
3201
3202                                                     "hidden_global_offset_x"
3203                                                       The OpenCL grid dispatch
3204                                                       global offset for the X
3205                                                       dimension is passed in the
3206                                                       kernarg.
3207
3208                                                     "hidden_global_offset_y"
3209                                                       The OpenCL grid dispatch
3210                                                       global offset for the Y
3211                                                       dimension is passed in the
3212                                                       kernarg.
3213
3214                                                     "hidden_global_offset_z"
3215                                                       The OpenCL grid dispatch
3216                                                       global offset for the Z
3217                                                       dimension is passed in the
3218                                                       kernarg.
3219
3220                                                     "hidden_none"
3221                                                       An argument that is not used
3222                                                       by the kernel. Space needs to
3223                                                       be left for it, but it does
3224                                                       not need to be set up.
3225
3226                                                     "hidden_printf_buffer"
3227                                                       A global address space pointer
3228                                                       to the runtime printf buffer
3229                                                       is passed in the kernarg.
3230
3231                                                     "hidden_hostcall_buffer"
3232                                                       A global address space pointer
3233                                                       to the runtime hostcall buffer
3234                                                       is passed in the kernarg.
3235
3236                                                     "hidden_default_queue"
3237                                                       A global address space pointer
3238                                                       to the OpenCL device enqueue
3239                                                       queue that should be used by
3240                                                       the kernel by default is
3241                                                       passed in the kernarg.
3242
3243                                                     "hidden_completion_action"
3244                                                       A global address space pointer
3245                                                       to help link enqueued kernels into
3246                                                       the ancestor tree for determining
3247                                                       when the parent kernel has finished.
3248
3249                                                     "hidden_multigrid_sync_arg"
3250                                                       A global address space pointer for
3251                                                       multi-grid synchronization is
3252                                                       passed in the kernarg.
3253
3254     ".value_type"          string                   Unused and deprecated. This should no longer
3255                                                     be emitted, but is accepted for compatibility.
3256
3257     ".pointee_align"       integer                  Alignment in bytes of pointee
3258                                                     type for pointer type kernel
3259                                                     argument. Must be a power
3260                                                     of 2. Only present if
3261                                                     ".value_kind" is
3262                                                     "dynamic_shared_pointer".
3263     ".address_space"       string                   Kernel argument address space
3264                                                     qualifier. Only present if
3265                                                     ".value_kind" is "global_buffer" or
3266                                                     "dynamic_shared_pointer". Values
3267                                                     are:
3268
3269                                                     - "private"
3270                                                     - "global"
3271                                                     - "constant"
3272                                                     - "local"
3273                                                     - "generic"
3274                                                     - "region"
3275
3276                                                     .. TODO::
3277
3278                                                        Is "global_buffer" only "global"
3279                                                        or "constant"? Is
3280                                                        "dynamic_shared_pointer" always
3281                                                        "local"? Can HCC allow "generic"?
3282                                                        How can "private" or "region"
3283                                                        ever happen?
3284
3285     ".access"              string                   Kernel argument access
3286                                                     qualifier. Only present if
3287                                                     ".value_kind" is "image" or
3288                                                     "pipe". Values
3289                                                     are:
3290
3291                                                     - "read_only"
3292                                                     - "write_only"
3293                                                     - "read_write"
3294
3295                                                     .. TODO::
3296
3297                                                        Does this apply to
3298                                                        "global_buffer"?
3299
3300     ".actual_access"       string                   The actual memory accesses
3301                                                     performed by the kernel on the
3302                                                     kernel argument. Only present if
3303                                                     ".value_kind" is "global_buffer",
3304                                                     "image", or "pipe". This may be
3305                                                     more restrictive than indicated
3306                                                     by ".access" to reflect what the
3307                                                     kernel actually does. If not
3308                                                     present then the runtime must
3309                                                     assume what is implied by
3310                                                     ".access" and ".is_const". Values
3311                                                     are:
3312
3313                                                     - "read_only"
3314                                                     - "write_only"
3315                                                     - "read_write"
3316
3317     ".is_const"            boolean                  Indicates if the kernel argument
3318                                                     is const qualified. Only present
3319                                                     if ".value_kind" is
3320                                                     "global_buffer".
3321
3322     ".is_restrict"         boolean                  Indicates if the kernel argument
3323                                                     is restrict qualified. Only
3324                                                     present if ".value_kind" is
3325                                                     "global_buffer".
3326
3327     ".is_volatile"         boolean                  Indicates if the kernel argument
3328                                                     is volatile qualified. Only
3329                                                     present if ".value_kind" is
3330                                                     "global_buffer".
3331
3332     ".is_pipe"             boolean                  Indicates if the kernel argument
3333                                                     is pipe qualified. Only present
3334                                                     if ".value_kind" is "pipe".
3335
3336                                                     .. TODO::
3337
3338                                                        Can "global_buffer" be pipe
3339                                                        qualified?
3340
3341     ====================== ============== ========= ================================
3342
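The presence rules in the table above can be checked mechanically. The
following is a minimal sketch, assuming the argument metadata has already been
decoded into a plain Python dictionary. The helper name
``check_arg_qualifiers`` and the sample dictionaries are illustrative only and
are not part of any LLVM or runtime API.

```python
# Which ".value_kind" values permit each optional key, transcribed from the
# kernel argument metadata table. Keys not listed here are not checked.
ALLOWED_VALUE_KINDS = {
    ".access": {"image", "pipe"},
    ".actual_access": {"global_buffer", "image", "pipe"},
    ".is_const": {"global_buffer"},
    ".is_restrict": {"global_buffer"},
    ".is_volatile": {"global_buffer"},
    ".is_pipe": {"pipe"},
}

ACCESS_VALUES = {"read_only", "write_only", "read_write"}


def check_arg_qualifiers(arg):
    """Return a list of rule violations for one decoded argument map."""
    problems = []
    kind = arg.get(".value_kind")
    for key, kinds in ALLOWED_VALUE_KINDS.items():
        if key in arg and kind not in kinds:
            problems.append(f"{key} is not allowed for value kind {kind!r}")
    for key in (".access", ".actual_access"):
        if key in arg and arg[key] not in ACCESS_VALUES:
            problems.append(f"{key} has unknown value {arg[key]!r}")
    return problems


# Hypothetical decoded entries, for illustration only.
good = {".value_kind": "global_buffer", ".is_const": True,
        ".actual_access": "read_only"}
bad = {".value_kind": "by_value", ".access": "read_only"}

print(check_arg_qualifiers(good))
print(check_arg_qualifiers(bad))
```

An empty list means the entry is consistent with the table. The sketch does not
attempt to model what the runtime assumes when ".actual_access" is absent.
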
3343.. _amdgpu-amdhsa-code-object-metadata-v4:
3344
3345Code Object V4 Metadata
3346+++++++++++++++++++++++
3347
3348.. warning::
3349  Code object V4 is not the default code object version emitted by this version
3350  of LLVM.
3351
3352Code object V4 metadata is the same as
3353:ref:`amdgpu-amdhsa-code-object-metadata-v3` with the changes and additions
defined in table :ref:`amdgpu-amdhsa-code-object-metadata-map-table-v4`.
3355
3356  .. table:: AMDHSA Code Object V4 Metadata Map Changes from :ref:`amdgpu-amdhsa-code-object-metadata-v3`
3357     :name: amdgpu-amdhsa-code-object-metadata-map-table-v4
3358
3359     ================= ============== ========= =======================================
3360     String Key        Value Type     Required? Description
3361     ================= ============== ========= =======================================
3362     "amdhsa.version"  sequence of    Required  - The first integer is the major
3363                       2 integers                 version. Currently 1.
3364                                                - The second integer is the minor
3365                                                  version. Currently 1.
3366     "amdhsa.target"   string         Required  The target name of the code using the syntax:
3367
3368                                                .. code::
3369
3370                                                  <target-triple> [ "-" <target-id> ]
3371
3372                                                A canonical target ID must be
3373                                                used. See :ref:`amdgpu-target-triples`
3374                                                and :ref:`amdgpu-target-id`.
3375     ================= ============== ========= =======================================
3376
3377..
3378
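The "amdhsa.target" syntax is a plain concatenation, which the following
minimal sketch illustrates. The function name ``make_amdhsa_target`` and the
example triple and target ID strings are assumptions chosen for illustration.
The sketch does not check that the target ID is canonical.

```python
def make_amdhsa_target(target_triple, target_id=None):
    """Join a target triple and an optional target ID with a "-" separator.

    This follows the syntax <target-triple> [ "-" <target-id> ]. It does not
    check that the target ID is canonical; that is left to the caller.
    """
    if target_id is None:
        return target_triple
    return f"{target_triple}-{target_id}"


# Example inputs are assumptions chosen for illustration. A triple with an
# empty environment component ends in "-", so the result contains "--".
print(make_amdhsa_target("amdgcn-amd-amdhsa-", "gfx900"))

# Code object V4 metadata version, per the table: major 1, minor 1.
metadata = {
    "amdhsa.version": [1, 1],
    "amdhsa.target": make_amdhsa_target("amdgcn-amd-amdhsa-", "gfx900"),
}
print(metadata)
```
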
3379Kernel Dispatch
3380~~~~~~~~~~~~~~~
3381
3382The HSA architected queuing language (AQL) defines a user space memory interface
3383that can be used to control the dispatch of kernels, in an agent independent
3384way. An agent can have zero or more AQL queues created for it using an HSA
3385compatible runtime (see :ref:`amdgpu-os`), in which AQL packets (all of which
3386are 64 bytes) can be placed. See the *HSA Platform System Architecture
3387Specification* [HSA]_ for the AQL queue mechanics and packet layouts.
3388
3389The packet processor of a kernel agent is responsible for detecting and
3390dispatching HSA kernels from the AQL queues associated with it. For AMD GPUs the
3391packet processor is implemented by the hardware command processor (CP),
3392asynchronous dispatch controller (ADC) and shader processor input controller
3393(SPI).
3394
3395An HSA compatible runtime can be used to allocate an AQL queue object. It uses
3396the kernel mode driver to initialize and register the AQL queue with CP.
3397
3398To dispatch a kernel the following actions are performed. This can occur in the
3399CPU host program, or from an HSA kernel executing on a GPU.
3400
34011. A pointer to an AQL queue for the kernel agent on which the kernel is to be
3402   executed is obtained.
34032. A pointer to the kernel descriptor (see
3404   :ref:`amdgpu-amdhsa-kernel-descriptor`) of the kernel to execute is obtained.
   It must be for a kernel that is contained in a code object that was
3406   loaded by an HSA compatible runtime on the kernel agent with which the AQL
3407   queue is associated.
34083. Space is allocated for the kernel arguments using the HSA compatible runtime
3409   allocator for a memory region with the kernarg property for the kernel agent
3410   that will execute the kernel. It must be at least 16-byte aligned.
34114. Kernel argument values are assigned to the kernel argument memory
3412   allocation. The layout is defined in the *HSA Programmer's Language
3413   Reference* [HSA]_. For AMDGPU the kernel execution directly accesses the
3414   kernel argument memory in the same way constant memory is accessed. (Note
3415   that the HSA specification allows an implementation to copy the kernel
3416   argument contents to another location that is accessed by the kernel.)
5. An AQL kernel dispatch packet is created on the AQL queue. The HSA compatible
   runtime API uses 64-bit atomic operations to reserve space in the AQL queue
   for the packet. The packet must be set up, and the final write must use an
   atomic store release to set the packet kind to ensure the packet contents are
   visible to the kernel agent. AQL defines a doorbell signal mechanism to
   notify the kernel agent that the AQL queue has been updated. These rules, and
   the layout of the AQL queue and kernel dispatch packet, are defined in the
   *HSA System Architecture Specification* [HSA]_.
34256. A kernel dispatch packet includes information about the actual dispatch,
3426   such as grid and work-group size, together with information from the code
3427   object about the kernel, such as segment sizes. The HSA compatible runtime
3428   queries on the kernel symbol can be used to obtain the code object values
3429   which are recorded in the :ref:`amdgpu-amdhsa-code-object-metadata`.
34307. CP executes micro-code and is responsible for detecting and setting up the
3431   GPU to execute the wavefronts of a kernel dispatch.
8. CP ensures that when a wavefront starts executing the kernel machine
3433   code, the scalar general purpose registers (SGPR) and vector general purpose
3434   registers (VGPR) are set up as required by the machine code. The required
3435   setup is defined in the :ref:`amdgpu-amdhsa-kernel-descriptor`. The initial
3436   register state is defined in
3437   :ref:`amdgpu-amdhsa-initial-kernel-execution-state`.
34389. The prolog of the kernel machine code (see
3439   :ref:`amdgpu-amdhsa-kernel-prolog`) sets up the machine state as necessary
   before continuing to execute the machine code that corresponds to the kernel.
344110. When the kernel dispatch has completed execution, CP signals the completion
3442    signal specified in the kernel dispatch packet if not 0.
3443
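The ordering constraint in step 5 (reserve a slot, fill in the packet body,
write the packet kind last, then ring the doorbell) can be illustrated with a
toy, single-threaded model. This is a sketch under stated assumptions and is
not the HSA runtime API: the class ``ToyAqlQueue``, its methods, the packet
field names and the packet kind values are all hypothetical. Python does not
model atomic operations or store-release semantics, so those are only marked
in comments, and queue-full handling is omitted.

```python
PACKET_SIZE = 64          # all AQL packets are 64 bytes
KERNARG_ALIGNMENT = 16    # minimum kernarg alignment from step 3

# Placeholder packet kinds for this toy model only; the real encodings are
# defined by the HSA specification, not here.
KIND_INVALID = 0
KIND_KERNEL_DISPATCH = 1


class ToyAqlQueue:
    """Single-threaded stand-in for an AQL queue ring buffer."""

    def __init__(self, num_packets):
        self.num_packets = num_packets
        self.slots = [{"kind": KIND_INVALID} for _ in range(num_packets)]
        self.write_index = 0      # a real runtime updates this atomically
        self.doorbell = None      # last packet index signalled to the agent

    def reserve(self):
        """Reserve the next slot and return its packet index."""
        index = self.write_index
        self.write_index += 1
        return index

    def publish(self, index, body):
        """Write the packet body first and the packet kind last."""
        slot = self.slots[index % self.num_packets]
        slot.update(body)                      # set up the packet contents
        slot["kind"] = KIND_KERNEL_DISPATCH    # stands in for store-release
        self.doorbell = index                  # notify the kernel agent


def dispatch(queue, kernel_descriptor_address, kernarg_address, grid, group):
    """Model steps 1 to 5 for one kernel dispatch."""
    if kernarg_address % KERNARG_ALIGNMENT != 0:
        raise ValueError("kernarg memory must be at least 16-byte aligned")
    index = queue.reserve()
    queue.publish(index, {
        "kernel_object": kernel_descriptor_address,
        "kernarg_address": kernarg_address,
        "grid_size": grid,
        "workgroup_size": group,
    })
    return index


queue = ToyAqlQueue(num_packets=4)
index = dispatch(queue, 0x1000, 0x2000, grid=(256, 1, 1), group=(64, 1, 1))
print(index, queue.slots[index]["kind"], queue.doorbell)
```

Writing the kind last matters because the packet processor treats a slot as
ready only once its kind has been set, so the body must already be visible at
that point.
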
3444.. _amdgpu-amdhsa-memory-spaces:
3445
3446Memory Spaces
3447~~~~~~~~~~~~~
3448
3449The memory space properties are:
3450
3451  .. table:: AMDHSA Memory Spaces
3452     :name: amdgpu-amdhsa-memory-spaces-table
3453
3454     ================= =========== ======== ======= ==================
3455     Memory Space Name HSA Segment Hardware Address NULL Value
3456                       Name        Name     Size
3457     ================= =========== ======== ======= ==================
3458     Private           private     scratch  32      0x00000000
3459     Local             group       LDS      32      0xFFFFFFFF
3460     Global            global      global   64      0x0000000000000000
3461     Constant          constant    *same as 64      0x0000000000000000
3462                                   global*
3463     Generic           flat        flat     64      0x0000000000000000
3464     Region            N/A         GDS      32      *not implemented
3465                                                    for AMDHSA*
3466     ================= =========== ======== ======= ==================
3467
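For tools that need these properties programmatically, the table can be
captured as a small lookup. The following is a minimal sketch. The names
``MEMORY_SPACES`` and ``is_null`` are hypothetical, and the region space is
left out because the table lists it as not implemented for AMDHSA.

```python
# (address size in bits, NULL value) per memory space, from the table above.
# Region is omitted because it is not implemented for AMDHSA.
MEMORY_SPACES = {
    "private": (32, 0x00000000),
    "local": (32, 0xFFFFFFFF),
    "global": (64, 0x0000000000000000),
    "constant": (64, 0x0000000000000000),
    "generic": (64, 0x0000000000000000),
}


def is_null(space, address):
    """Report whether an address is the NULL value of the given space."""
    bits, null_value = MEMORY_SPACES[space]
    if not 0 <= address < (1 << bits):
        raise ValueError(f"address does not fit in {bits} bits")
    return address == null_value


print(is_null("local", 0xFFFFFFFF))   # the local NULL value is all ones
print(is_null("local", 0))            # 0 is not the local NULL value
```
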
3468The global and constant memory spaces both use global virtual addresses, which
3469are the same virtual address space used by the CPU. However, some virtual
3470addresses may only be accessible to the CPU, some only accessible by the GPU,
3471and some by both.
3472
3473Using the constant memory space indicates that the data will not change during
3474the execution of the kernel. This allows scalar read instructions to be
3475used. The vector and scalar L1 caches are invalidated of volatile data before
3476each kernel dispatch execution to allow constant memory to change values between
3477kernel dispatches.
3478
3479The local memory space uses the hardware Local Data Store (LDS) which is
3480automatically allocated when the hardware creates work-groups of wavefronts, and
3481freed when all the wavefronts of a work-group have terminated. The data store
3482(DS) instructions can be used to access it.
3483
3484The private memory space uses the hardware scratch memory support. If the kernel
3485uses scratch, then the hardware allocates memory that is accessed using
3486wavefront lane dword (4 byte) interleaving. The mapping used from private
3487address to physical address is:
3488
3489  ``wavefront-scratch-base +
3490  (private-address * wavefront-size * 4) +
3491  (wavefront-lane-id * 4)``
3492
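The mapping can be evaluated directly. The following minimal sketch
transcribes the expression exactly as written above. The function name, the
default wavefront size of 64 and the base address are assumptions made for
illustration.

```python
def scratch_physical_address(wavefront_scratch_base, private_address,
                             wavefront_lane_id, wavefront_size=64):
    """Evaluate the private-to-physical mapping exactly as written above."""
    return (wavefront_scratch_base
            + (private_address * wavefront_size * 4)
            + (wavefront_lane_id * 4))


# With the same private address in every lane, consecutive lanes land on
# consecutive dwords (4 bytes apart).
base = 0x10000  # hypothetical wavefront scratch base
addresses = [scratch_physical_address(base, 3, lane) for lane in range(4)]
print([hex(a) for a in addresses])
```

The 4-byte stride between lanes is the dword interleaving: lanes that use the
same private address touch adjacent dwords.
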
3493There are different ways that the wavefront scratch base address is determined
3494by a wavefront (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). This
memory can be accessed in an interleaved manner using buffer instructions with
3496the scratch buffer descriptor and per wavefront scratch offset, by the scratch
3497instructions, or by flat instructions. If each lane of a wavefront accesses the
3498same private address, the interleaving results in adjacent dwords being accessed
3499and hence requires fewer cache lines to be fetched. Multi-dword access is not
3500supported except by flat and scratch instructions in GFX9-GFX10.
3501
3502The generic address space uses the hardware flat address support available in
3503GFX7-GFX10. This uses two fixed ranges of virtual addresses (the private and
local apertures), that are outside the range of addressable global memory, to
3505map from a flat address to a private or local address.
3506
3507FLAT instructions can take a flat address and access global, private (scratch)
and group (LDS) memory depending on whether the address is within one of the
3509aperture ranges. Flat access to scratch requires hardware aperture setup and
3510setup in the kernel prologue (see
3511:ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`). Flat access to LDS requires
3512hardware aperture setup and M0 (GFX7-GFX8) register setup (see
3513:ref:`amdgpu-amdhsa-kernel-prolog-m0`).
3514
To convert between a segment address and a flat address, the base addresses of
the apertures can be used. For GFX7-GFX8 these are available in the
:ref:`amdgpu-amdhsa-hsa-aql-queue`, the address of which can be obtained with
the Queue Ptr SGPR (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). For
GFX9-GFX10 the aperture base addresses are directly available as inline constant
registers ``SRC_SHARED_BASE/LIMIT`` and ``SRC_PRIVATE_BASE/LIMIT``. In 64-bit
address mode the aperture sizes are 2^32 bytes and the base is aligned to 2^32,
which makes it easier to convert from flat to segment or segment to flat.
3523
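Because each aperture is 2^32 bytes and its base is 2^32 aligned, the
conversions reduce to simple bit operations. The following is a minimal sketch
under that assumption. The function names and the aperture base value are
hypothetical, and handling of the NULL values from
:ref:`amdgpu-amdhsa-memory-spaces-table` is deliberately left out.

```python
APERTURE_SIZE = 1 << 32          # 2^32 bytes in 64-bit address mode
OFFSET_MASK = APERTURE_SIZE - 1


def segment_to_flat(aperture_base, segment_address):
    """Place a 32-bit segment address inside its aperture."""
    assert aperture_base % APERTURE_SIZE == 0, "base must be 2^32 aligned"
    assert 0 <= segment_address < APERTURE_SIZE
    return aperture_base | segment_address


def flat_to_segment(flat_address):
    """Recover the 32-bit segment address by dropping the upper 32 bits."""
    return flat_address & OFFSET_MASK


def in_aperture(aperture_base, flat_address):
    """Check whether a flat address falls inside the given aperture."""
    return (flat_address & ~OFFSET_MASK) == aperture_base


# Hypothetical aperture base, chosen only so the example runs.
SHARED_BASE = 0x0001_0000_0000_0000
flat = segment_to_flat(SHARED_BASE, 0x1234)
print(hex(flat), hex(flat_to_segment(flat)), in_aperture(SHARED_BASE, flat))
```

A flat address that is in neither aperture is treated as a global address, so
no conversion is needed in that case.
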
3524Image and Samplers
3525~~~~~~~~~~~~~~~~~~
3526
Image and sampler handles created by an HSA compatible runtime (see
:ref:`amdgpu-os`) are 64-bit addresses of a hardware 32-byte V# and 48-byte S#
3529object respectively. In order to support the HSA ``query_sampler`` operations
3530two extra dwords are used to store the HSA BRIG enumeration values for the
3531queries that are not trivially deducible from the S# representation.
3532
3533HSA Signals
3534~~~~~~~~~~~
3535
3536HSA signal handles created by an HSA compatible runtime (see :ref:`amdgpu-os`)
3537are 64-bit addresses of a structure allocated in memory accessible from both the
3538CPU and GPU. The structure is defined by the runtime and subject to change
3539between releases. For example, see [AMD-ROCm-github]_.
3540
3541.. _amdgpu-amdhsa-hsa-aql-queue:
3542
3543HSA AQL Queue
3544~~~~~~~~~~~~~
3545
3546The HSA AQL queue structure is defined by an HSA compatible runtime (see
3547:ref:`amdgpu-os`) and subject to change between releases. For example, see
3548[AMD-ROCm-github]_. For some processors it contains fields needed to implement
certain language features such as the flat address aperture bases. It also
contains fields used by CP, for example to manage the allocation of scratch
memory.
3551
3552.. _amdgpu-amdhsa-kernel-descriptor:
3553
3554Kernel Descriptor
3555~~~~~~~~~~~~~~~~~
3556
3557A kernel descriptor consists of the information needed by CP to initiate the
3558execution of a kernel, including the entry point address of the machine code
3559that implements the kernel.
3560
3561Code Object V3 Kernel Descriptor
3562++++++++++++++++++++++++++++++++
3563
CP microcode requires the kernel descriptor to be allocated on 64-byte
3565alignment.
3566
3567The fields used by CP for code objects before V3 also match those specified in
3568:ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
3569
3570  .. table:: Code Object V3 Kernel Descriptor
3571     :name: amdgpu-amdhsa-kernel-descriptor-v3-table
3572
3573     ======= ======= =============================== ============================
3574     Bits    Size    Field Name                      Description
3575     ======= ======= =============================== ============================
3576     31:0    4 bytes GROUP_SEGMENT_FIXED_SIZE        The amount of fixed local
3577                                                     address space memory
3578                                                     required for a work-group
3579                                                     in bytes. This does not
3580                                                     include any dynamically
3581                                                     allocated local address
3582                                                     space memory that may be
3583                                                     added when the kernel is
3584                                                     dispatched.
3585     63:32   4 bytes PRIVATE_SEGMENT_FIXED_SIZE      The amount of fixed
3586                                                     private address space
3587                                                     memory required for a
3588                                                     work-item in bytes.
3589                                                     Additional space may need to
3590                                                     be added to this value if
3591                                                     the call stack has
3592                                                     non-inlined function calls.
3593     95:64   4 bytes KERNARG_SIZE                    The size of the kernarg
3594                                                     memory pointed to by the
3595                                                     AQL dispatch packet. The
3596                                                     kernarg memory is used to
3597                                                     pass arguments to the
3598                                                     kernel.
3599
3600                                                     * If the kernarg pointer in
3601                                                       the dispatch packet is NULL
3602                                                       then there are no kernel
3603                                                       arguments.
3604                                                     * If the kernarg pointer in
3605                                                       the dispatch packet is
3606                                                       not NULL and this value
3607                                                       is 0 then the kernarg
3608                                                       memory size is
3609                                                       unspecified.
3610                                                     * If the kernarg pointer in
3611                                                       the dispatch packet is
3612                                                       not NULL and this value
3613                                                       is not 0 then the value
3614                                                       specifies the kernarg
3615                                                       memory size in bytes. It
3616                                                       is recommended to provide
3617                                                       a value as it may be used
3618                                                       by CP to optimize making
3619                                                       the kernarg memory
3620                                                       visible to the kernel
3621                                                       code.
3622
3623     127:96  4 bytes                                 Reserved, must be 0.
3624     191:128 8 bytes KERNEL_CODE_ENTRY_BYTE_OFFSET   Byte offset (possibly
3625                                                     negative) from base
3626                                                     address of kernel
3627                                                     descriptor to kernel's
3628                                                     entry point instruction
3629                                                     which must be 256 byte
3630                                                     aligned.
     351:192 20                                      Reserved, must be 0.
3632             bytes
3633     383:352 4 bytes COMPUTE_PGM_RSRC3               GFX6-GFX9
3634                                                       Reserved, must be 0.
3635                                                     GFX90A
3636                                                       Compute Shader (CS)
3637                                                       program settings used by
3638                                                       CP to set up
3639                                                       ``COMPUTE_PGM_RSRC3``
3640                                                       configuration
3641                                                       register. See
3642                                                       :ref:`amdgpu-amdhsa-compute_pgm_rsrc3-gfx90a-table`.
3643                                                     GFX10
3644                                                       Compute Shader (CS)
3645                                                       program settings used by
3646                                                       CP to set up
3647                                                       ``COMPUTE_PGM_RSRC3``
3648                                                       configuration
3649                                                       register. See
3650                                                       :ref:`amdgpu-amdhsa-compute_pgm_rsrc3-gfx10-table`.
3651     415:384 4 bytes COMPUTE_PGM_RSRC1               Compute Shader (CS)
3652                                                     program settings used by
3653                                                     CP to set up
3654                                                     ``COMPUTE_PGM_RSRC1``
3655                                                     configuration
3656                                                     register. See
3657                                                     :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
3658     447:416 4 bytes COMPUTE_PGM_RSRC2               Compute Shader (CS)
3659                                                     program settings used by
3660                                                     CP to set up
3661                                                     ``COMPUTE_PGM_RSRC2``
3662                                                     configuration
3663                                                     register. See
3664                                                     :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
     454:448 7 bits  *See separate bits below.*      Enable the setup of the
3666                                                     SGPR user data registers
3667                                                     (see
3668                                                     :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
3669
3670                                                     The total number of SGPR
3671                                                     user data registers
3672                                                     requested must not exceed
                                                     16 and must match the value in
3674                                                     ``compute_pgm_rsrc2.user_sgpr.user_sgpr_count``.
3675                                                     Any requests beyond 16
3676                                                     will be ignored.
3677     >448    1 bit   ENABLE_SGPR_PRIVATE_SEGMENT     If the *Target Properties*
3678                     _BUFFER                         column of
3679                                                     :ref:`amdgpu-processor-table`
3680                                                     specifies *Architected flat
3681                                                     scratch* then not supported
                                                     and must be 0.
3683     >449    1 bit   ENABLE_SGPR_DISPATCH_PTR
3684     >450    1 bit   ENABLE_SGPR_QUEUE_PTR
3685     >451    1 bit   ENABLE_SGPR_KERNARG_SEGMENT_PTR
3686     >452    1 bit   ENABLE_SGPR_DISPATCH_ID
3687     >453    1 bit   ENABLE_SGPR_FLAT_SCRATCH_INIT   If the *Target Properties*
3688                                                     column of
3689                                                     :ref:`amdgpu-processor-table`
3690                                                     specifies *Architected flat
3691                                                     scratch* then not supported
                                                     and must be 0.
3693     >454    1 bit   ENABLE_SGPR_PRIVATE_SEGMENT
3694                     _SIZE
3695     457:455 3 bits                                  Reserved, must be 0.
3696     458     1 bit   ENABLE_WAVEFRONT_SIZE32         GFX6-GFX9
3697                                                       Reserved, must be 0.
3698                                                     GFX10
3699                                                       - If 0 execute in
3700                                                         wavefront size 64 mode.
3701                                                       - If 1 execute in
3702                                                         native wavefront size
3703                                                         32 mode.
     463:459 5 bits                                  Reserved, must be 0.
3705     464     1 bit   RESERVED_464                    Deprecated, must be 0.
3706     467:465 3 bits                                  Reserved, must be 0.
3707     468     1 bit   RESERVED_468                    Deprecated, must be 0.
     471:469 3 bits                                  Reserved, must be 0.
3709     511:472 5 bytes                                 Reserved, must be 0.
3710     512     **Total size 64 bytes.**
3711     ======= ====================================================================
3712
3713..
3714
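The layout in :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table` can be decoded
with a fixed-size structure. The following is a minimal sketch, assuming
little-endian byte order and taking each byte offset as the low bit number in
the Bits column divided by 8. The names ``KD_FORMAT``, ``KD_FIELDS``,
``properties`` and ``decode_kernel_descriptor``, as well as the sample values,
are hypothetical and not part of LLVM or any runtime.

```python
import struct

# Little-endian layout of the 64-byte kernel descriptor; byte offsets are the
# Bits column of the table divided by 8. Reserved ranges are padding ("x").
KD_FORMAT = "<III4xq20xIIIH6x"
KD_FIELDS = (
    "group_segment_fixed_size",
    "private_segment_fixed_size",
    "kernarg_size",
    "kernel_code_entry_byte_offset",
    "compute_pgm_rsrc3",
    "compute_pgm_rsrc1",
    "compute_pgm_rsrc2",
    "properties",           # bits 463:448, hypothetical name
)
assert struct.calcsize(KD_FORMAT) == 64

# Single-bit enables, numbered relative to bit 448.
PROPERTY_BITS = {
    "ENABLE_SGPR_PRIVATE_SEGMENT_BUFFER": 0,
    "ENABLE_SGPR_DISPATCH_PTR": 1,
    "ENABLE_SGPR_QUEUE_PTR": 2,
    "ENABLE_SGPR_KERNARG_SEGMENT_PTR": 3,
    "ENABLE_SGPR_DISPATCH_ID": 4,
    "ENABLE_SGPR_FLAT_SCRATCH_INIT": 5,
    "ENABLE_SGPR_PRIVATE_SEGMENT_SIZE": 6,
    "ENABLE_WAVEFRONT_SIZE32": 10,
}


def decode_kernel_descriptor(raw, descriptor_address):
    """Unpack a descriptor and compute the kernel entry point address."""
    fields = dict(zip(KD_FIELDS, struct.unpack(KD_FORMAT, raw)))
    fields["enabled"] = sorted(
        name for name, bit in PROPERTY_BITS.items()
        if (fields["properties"] >> bit) & 1)
    # The offset is signed and relative to the descriptor base address.
    entry = descriptor_address + fields["kernel_code_entry_byte_offset"]
    if entry % 256 != 0:
        raise ValueError("kernel entry point must be 256-byte aligned")
    fields["entry_point"] = entry
    return fields


# Hypothetical descriptor: 16 bytes of kernarg, entry point 0x40 bytes before
# the descriptor, dispatch pointer and kernarg segment pointer enabled.
raw = struct.pack(KD_FORMAT, 0, 0, 16, -0x40, 0, 0, 0, 0b1010)
decoded = decode_kernel_descriptor(raw, descriptor_address=0x2040)
print(hex(decoded["entry_point"]), decoded["enabled"])
```

The sketch does not validate that reserved bits are 0 and does not interpret
the ``COMPUTE_PGM_RSRC`` words.
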
3715  .. table:: compute_pgm_rsrc1 for GFX6-GFX10
3716     :name: amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table
3717
3718     ======= ======= =============================== ===========================================================================
3719     Bits    Size    Field Name                      Description
3720     ======= ======= =============================== ===========================================================================
3721     5:0     6 bits  GRANULATED_WORKITEM_VGPR_COUNT  Number of vector register
3722                                                     blocks used by each work-item;
3723                                                     granularity is device
3724                                                     specific:
3725
3726                                                     GFX6-GFX9
3727                                                       - vgprs_used 0..256
3728                                                       - max(0, ceil(vgprs_used / 4) - 1)
3729                                                     GFX90A
3730                                                       - vgprs_used 0..512
3731                                                       - vgprs_used = align(arch_vgprs, 4)
3732                                                                      + acc_vgprs
3733                                                       - max(0, ceil(vgprs_used / 8) - 1)
3734                                                     GFX10 (wavefront size 64)
                                                       - vgprs_used 1..256
3736                                                       - max(0, ceil(vgprs_used / 4) - 1)
3737                                                     GFX10 (wavefront size 32)
                                                       - vgprs_used 1..256
3739                                                       - max(0, ceil(vgprs_used / 8) - 1)
3740
3741                                                     Where vgprs_used is defined
3742                                                     as the highest VGPR number
3743                                                     explicitly referenced plus
3744                                                     one.
3745
3746                                                     Used by CP to set up
3747                                                     ``COMPUTE_PGM_RSRC1.VGPRS``.
3748
3749                                                     The
3750                                                     :ref:`amdgpu-assembler`
3751                                                     calculates this
3752                                                     automatically for the
3753                                                     selected processor from
3754                                                     values provided to the
                                                     ``.amdhsa_kernel`` directive
                                                     by the
                                                     ``.amdhsa_next_free_vgpr``
3758                                                     nested directive (see
3759                                                     :ref:`amdhsa-kernel-directives-table`).
3760     9:6     4 bits  GRANULATED_WAVEFRONT_SGPR_COUNT Number of scalar register
3761                                                     blocks used by a wavefront;
3762                                                     granularity is device
3763                                                     specific:
3764
3765                                                     GFX6-GFX8
3766                                                       - sgprs_used 0..112
3767                                                       - max(0, ceil(sgprs_used / 8) - 1)
3768                                                     GFX9
3769                                                       - sgprs_used 0..112
3770                                                       - 2 * max(0, ceil(sgprs_used / 16) - 1)
3771                                                     GFX10
3772                                                       Reserved, must be 0.
3773                                                       (128 SGPRs always
3774                                                       allocated.)
3775
3776                                                     Where sgprs_used is
3777                                                     defined as the highest
3778                                                     SGPR number explicitly
3779                                                     referenced plus one, plus
3780                                                     a target specific number
3781                                                     of additional special
3782                                                     SGPRs for VCC,
3783                                                     FLAT_SCRATCH (GFX7+) and
3784                                                     XNACK_MASK (GFX8+), and
3785                                                     any additional
3786                                                     target specific
3787                                                     limitations. It does not
3788                                                     include the 16 SGPRs added
3789                                                     if a trap handler is
3790                                                     enabled.
3791
3792                                                     The target specific
3793                                                     limitations and special
3794                                                     SGPR layout are defined in
3795                                                     the hardware
3796                                                     documentation, which can
3797                                                     be found in the
3798                                                     :ref:`amdgpu-processors`
3799                                                     table.
3800
3801                                                     Used by CP to set up
3802                                                     ``COMPUTE_PGM_RSRC1.SGPRS``.
3803
3804                                                     The
3805                                                     :ref:`amdgpu-assembler`
3806                                                     calculates this
3807                                                     automatically for the
3808                                                     selected processor from
3809                                                     values provided to the
3810                                                     ``.amdhsa_kernel`` directive
3811                                                     by the
3812                                                     ``.amdhsa_next_free_sgpr``
3813                                                     and ``.amdhsa_reserve_*``
3814                                                     nested directives (see
3815                                                     :ref:`amdhsa-kernel-directives-table`).
3816     11:10   2 bits  PRIORITY                        Must be 0.
3817
3818                                                     Start executing wavefront
3819                                                     at the specified priority.
3820
3821                                                     CP is responsible for
3822                                                     filling in
3823                                                     ``COMPUTE_PGM_RSRC1.PRIORITY``.
3824     13:12   2 bits  FLOAT_ROUND_MODE_32             Wavefront starts execution
3825                                                     with specified rounding
3826                                                     mode for single
3827                                                     precision (32-bit)
3828                                                     floating point
3829                                                     operations.
3830
3831                                                     Floating point rounding
3832                                                     mode values are defined in
3833                                                     :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
3834
3835                                                     Used by CP to set up
3836                                                     ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
3837     15:14   2 bits  FLOAT_ROUND_MODE_16_64          Wavefront starts execution
3838                                                     with specified rounding
3839                                                     mode for half/double
3840                                                     precision (16 and 64-bit)
3841                                                     floating point
3842                                                     operations.
3843
3844                                                     Floating point rounding
3845                                                     mode values are defined in
3846                                                     :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
3847
3848                                                     Used by CP to set up
3849                                                     ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
3850     17:16   2 bits  FLOAT_DENORM_MODE_32            Wavefront starts execution
3851                                                     with specified denorm mode
3852                                                     for single
3853                                                     precision (32-bit)
3854                                                     floating point
3855                                                     operations.
3856
3857                                                     Floating point denorm mode
3858                                                     values are defined in
3859                                                     :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
3860
3861                                                     Used by CP to set up
3862                                                     ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
3863     19:18   2 bits  FLOAT_DENORM_MODE_16_64         Wavefront starts execution
3864                                                     with specified denorm mode
3865                                                     for half/double
3866                                                     precision (16 and 64-bit)
3867                                                     floating point
3868                                                     operations.
3869
3870                                                     Floating point denorm mode
3871                                                     values are defined in
3872                                                     :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
3873
3874                                                     Used by CP to set up
3875                                                     ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
3876     20      1 bit   PRIV                            Must be 0.
3877
3878                                                     Start executing wavefront
3879                                                     in privileged trap handler
3880                                                     mode.
3881
3882                                                     CP is responsible for
3883                                                     filling in
3884                                                     ``COMPUTE_PGM_RSRC1.PRIV``.
3885     21      1 bit   ENABLE_DX10_CLAMP               Wavefront starts execution
3886                                                     with DX10 clamp mode
3887                                                     enabled. Used by the vector
3888                                                     ALU to force DX10 style
3889                                                     treatment of NaNs (when
3890                                                     set, clamp NaN to zero,
3891                                                     otherwise pass NaN
3892                                                     through).
3893
3894                                                     Used by CP to set up
3895                                                     ``COMPUTE_PGM_RSRC1.DX10_CLAMP``.
3896     22      1 bit   DEBUG_MODE                      Must be 0.
3897
3898                                                     Start executing wavefront
3899                                                     in single step mode.
3900
3901                                                     CP is responsible for
3902                                                     filling in
3903                                                     ``COMPUTE_PGM_RSRC1.DEBUG_MODE``.
3904     23      1 bit   ENABLE_IEEE_MODE                Wavefront starts execution
3905                                                     with IEEE mode
3906                                                     enabled. Floating point
3907                                                     opcodes that support
3908                                                     exception flag gathering
3909                                                     will quiet and propagate
3910                                                     signaling-NaN inputs per
3911                                                     IEEE 754-2008. Min_dx10 and
3912                                                     max_dx10 become IEEE
3913                                                     754-2008 compliant due to
3914                                                     signaling-NaN propagation
3915                                                     and quieting.
3916
3917                                                     Used by CP to set up
3918                                                     ``COMPUTE_PGM_RSRC1.IEEE_MODE``.
3919     24      1 bit   BULKY                           Must be 0.
3920
3921                                                     Only one work-group allowed
3922                                                     to execute on a compute
3923                                                     unit.
3924
3925                                                     CP is responsible for
3926                                                     filling in
3927                                                     ``COMPUTE_PGM_RSRC1.BULKY``.
3928     25      1 bit   CDBG_USER                       Must be 0.
3929
3930                                                     Flag that can be used to
3931                                                     control debugging code.
3932
3933                                                     CP is responsible for
3934                                                     filling in
3935                                                     ``COMPUTE_PGM_RSRC1.CDBG_USER``.
3936     26      1 bit   FP16_OVFL                       GFX6-GFX8
3937                                                       Reserved, must be 0.
3938                                                     GFX9-GFX10
3939                                                       Wavefront starts execution
3940                                                       with specified fp16 overflow
3941                                                       mode.
3942
3943                                                       - If 0, fp16 overflow generates
3944                                                         +/-INF values.
3945                                                       - If 1, fp16 overflow that is the
3946                                                         result of an +/-INF input value
3947                                                         or divide by 0 produces a +/-INF,
3948                                                         otherwise clamps computed
3949                                                         overflow to +/-MAX_FP16 as
3950                                                         appropriate.
3951
3952                                                       Used by CP to set up
3953                                                       ``COMPUTE_PGM_RSRC1.FP16_OVFL``.
3954     28:27   2 bits                                  Reserved, must be 0.
3955     29      1 bit   WGP_MODE                        GFX6-GFX9
3956                                                       Reserved, must be 0.
3957                                                     GFX10
3958                                                       - If 0 execute work-groups in
3959                                                         CU wavefront execution mode.
3960                                                       - If 1 execute work-groups in
3961                                                         WGP wavefront execution mode.
3962
3963                                                       See :ref:`amdgpu-amdhsa-memory-model`.
3964
3965                                                       Used by CP to set up
3966                                                       ``COMPUTE_PGM_RSRC1.WGP_MODE``.
3967     30      1 bit   MEM_ORDERED                     GFX6-GFX9
3968                                                       Reserved, must be 0.
3969                                                     GFX10
3970                                                       Controls the behavior of the
3971                                                       s_waitcnt's vmcnt and vscnt
3972                                                       counters.
3973
3974                                                       - If 0 vmcnt reports completion
3975                                                         of load and atomic with return
3976                                                         out of order with sample
3977                                                         instructions, and the vscnt
3978                                                         reports the completion of
3979                                                         store and atomic without
3980                                                         return in order.
3981                                                       - If 1 vmcnt reports completion
3982                                                         of load, atomic with return
3983                                                         and sample instructions in
3984                                                         order, and the vscnt reports
3985                                                         the completion of store and
3986                                                         atomic without return in order.
3987
3988                                                       Used by CP to set up
3989                                                       ``COMPUTE_PGM_RSRC1.MEM_ORDERED``.
3990     31      1 bit   FWD_PROGRESS                    GFX6-GFX9
3991                                                       Reserved, must be 0.
3992                                                     GFX10
3993                                                       - If 0 execute SIMD wavefronts
3994                                                         using oldest first policy.
3995                                                       - If 1 execute SIMD wavefronts to
3996                                                         ensure wavefronts will make some
3997                                                         forward progress.
3998
3999                                                       Used by CP to set up
4000                                                       ``COMPUTE_PGM_RSRC1.FWD_PROGRESS``.
4001     32      **Total size 4 bytes.**
4002     ======= ===================================================================================================================
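
The bit positions in the ``compute_pgm_rsrc1`` table above can be turned into a
small decoder. The following is a minimal sketch, not part of LLVM: the names
``RSRC1_FIELDS``, ``MUST_BE_ZERO``, ``decode_rsrc1`` and
``nonzero_must_be_zero_fields`` are hypothetical, and only the fields at bits
31:10 are listed (bits 9:0 are left out of the sketch).

.. code-block:: python

  # (high bit, low bit) of each field, copied from the table above.
  # Bits 9:0 and the reserved bits 28:27 are intentionally not listed.
  RSRC1_FIELDS = {
      "PRIORITY": (11, 10),
      "FLOAT_ROUND_MODE_32": (13, 12),
      "FLOAT_ROUND_MODE_16_64": (15, 14),
      "FLOAT_DENORM_MODE_32": (17, 16),
      "FLOAT_DENORM_MODE_16_64": (19, 18),
      "PRIV": (20, 20),
      "ENABLE_DX10_CLAMP": (21, 21),
      "DEBUG_MODE": (22, 22),
      "ENABLE_IEEE_MODE": (23, 23),
      "BULKY": (24, 24),
      "CDBG_USER": (25, 25),
      "FP16_OVFL": (26, 26),
      "WGP_MODE": (29, 29),
      "MEM_ORDERED": (30, 30),
      "FWD_PROGRESS": (31, 31),
  }

  # Fields the table marks as "Must be 0" on every target.
  MUST_BE_ZERO = ("PRIORITY", "PRIV", "DEBUG_MODE", "BULKY", "CDBG_USER")


  def decode_rsrc1(word):
      """Split a 32-bit compute_pgm_rsrc1 value into the fields listed above."""
      fields = {}
      for name, (hi, lo) in RSRC1_FIELDS.items():
          width = hi - lo + 1
          fields[name] = (word >> lo) & ((1 << width) - 1)
      return fields


  def nonzero_must_be_zero_fields(word):
      """Return the names of "Must be 0" fields that are set in word."""
      fields = decode_rsrc1(word)
      return [name for name in MUST_BE_ZERO if fields[name] != 0]


  # Both denorm modes set to 3, DX10 clamp and IEEE mode enabled.
  example = (3 << 16) | (3 << 18) | (1 << 21) | (1 << 23)
  decoded = decode_rsrc1(example)

A check such as ``nonzero_must_be_zero_fields`` is one way a tool could flag a
descriptor that sets bits CP is expected to fill in itself.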
4003
4004..
4005
4006  .. table:: compute_pgm_rsrc2 for GFX6-GFX10
4007     :name: amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table
4008
4009     ======= ======= =============================== ===========================================================================
4010     Bits    Size    Field Name                      Description
4011     ======= ======= =============================== ===========================================================================
4012     0       1 bit   ENABLE_PRIVATE_SEGMENT          * Enable the setup of the
4013                                                       private segment.
4014                                                     * If the *Target Properties*
4015                                                       column of
4016                                                       :ref:`amdgpu-processor-table`
4017                                                       does not specify
4018                                                       *Architected flat
4019                                                       scratch* then enable the
4020                                                       setup of the SGPR
4021                                                       wavefront scratch offset
4022                                                       system register (see
4023                                                       :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
4024                                                     * If the *Target Properties*
4025                                                       column of
4026                                                       :ref:`amdgpu-processor-table`
4027                                                       specifies *Architected
4028                                                       flat scratch* then enable
4029                                                       the setup of the
4030                                                       FLAT_SCRATCH register
4031                                                       pair (see
4032                                                       :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
4033
4034                                                     Used by CP to set up
4035                                                     ``COMPUTE_PGM_RSRC2.SCRATCH_EN``.
4036     5:1     5 bits  USER_SGPR_COUNT                 The total number of SGPR
4037                                                     user data registers
4038                                                     requested. This number must
4039                                                     match the number of user
4040                                                     data registers enabled.
4041
4042                                                     Used by CP to set up
4043                                                     ``COMPUTE_PGM_RSRC2.USER_SGPR``.
4044     6       1 bit   ENABLE_TRAP_HANDLER             Must be 0.
4045
4046                                                     This bit represents
4047                                                     ``COMPUTE_PGM_RSRC2.TRAP_PRESENT``,
4048                                                     which is set by the CP if
4049                                                     the runtime has installed a
4050                                                     trap handler.
4051     7       1 bit   ENABLE_SGPR_WORKGROUP_ID_X      Enable the setup of the
4052                                                     system SGPR register for
4053                                                     the work-group id in the X
4054                                                     dimension (see
4055                                                     :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
4056
4057                                                     Used by CP to set up
4058                                                     ``COMPUTE_PGM_RSRC2.TGID_X_EN``.
4059     8       1 bit   ENABLE_SGPR_WORKGROUP_ID_Y      Enable the setup of the
4060                                                     system SGPR register for
4061                                                     the work-group id in the Y
4062                                                     dimension (see
4063                                                     :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
4064
4065                                                     Used by CP to set up
4066                                                     ``COMPUTE_PGM_RSRC2.TGID_Y_EN``.
4067     9       1 bit   ENABLE_SGPR_WORKGROUP_ID_Z      Enable the setup of the
4068                                                     system SGPR register for
4069                                                     the work-group id in the Z
4070                                                     dimension (see
4071                                                     :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
4072
4073                                                     Used by CP to set up
4074                                                     ``COMPUTE_PGM_RSRC2.TGID_Z_EN``.
4075     10      1 bit   ENABLE_SGPR_WORKGROUP_INFO      Enable the setup of the
4076                                                     system SGPR register for
4077                                                     work-group information (see
4078                                                     :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
4079
4080                                                     Used by CP to set up
4081                                                     ``COMPUTE_PGM_RSRC2.TGID_SIZE_EN``.
4082     12:11   2 bits  ENABLE_VGPR_WORKITEM_ID         Enable the setup of the
4083                                                     VGPR system registers used
4084                                                     for the work-item ID.
4085                                                     :ref:`amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table`
4086                                                     defines the values.
4087
4088                                                     Used by CP to set up
4089                                                     ``COMPUTE_PGM_RSRC2.TIDIG_CMP_CNT``.
4090     13      1 bit   ENABLE_EXCEPTION_ADDRESS_WATCH  Must be 0.
4091
4092                                                     Wavefront starts execution
4093                                                     with address watch
4094                                                     exceptions enabled which
4095                                                     are generated when L1 has
4096                                                     witnessed a thread access
4097                                                     an *address of
4098                                                     interest*.
4099
4100                                                     CP is responsible for
4101                                                     filling in the address
4102                                                     watch bit in
4103                                                     ``COMPUTE_PGM_RSRC2.EXCP_EN_MSB``
4104                                                     according to what the
4105                                                     runtime requests.
4106     14      1 bit   ENABLE_EXCEPTION_MEMORY         Must be 0.
4107
4108                                                     Wavefront starts execution
4109                                                     with memory violation
4110                                                     exceptions
4111                                                     enabled which are generated
4112                                                     when a memory violation has
4113                                                     occurred for this wavefront from
4114                                                     L1 or LDS
4115                                                     (write-to-read-only-memory,
4116                                                     mis-aligned atomic, LDS
4117                                                     address out of range,
4118                                                     illegal address, etc.).
4119
4120                                                     CP sets the memory
4121                                                     violation bit in
4122                                                     ``COMPUTE_PGM_RSRC2.EXCP_EN_MSB``
4123                                                     according to what the
4124                                                     runtime requests.
4125     23:15   9 bits  GRANULATED_LDS_SIZE             Must be 0.
4126
4127                                                     CP uses the rounded value
4128                                                     from the dispatch packet,
4129                                                     not this value, as the
4130                                                     dispatch may contain
4131                                                     dynamically allocated group
4132                                                     segment memory. CP writes
4133                                                     directly to
4134                                                     ``COMPUTE_PGM_RSRC2.LDS_SIZE``.
4135
4136                                                     Amount of group segment
4137                                                     (LDS) to allocate for each
4138                                                     work-group. Granularity is
4139                                                     device specific:
4140
4141                                                     GFX6
4142                                                       roundup(lds-size / (64 * 4))
4143                                                     GFX7-GFX10
4144                                                       roundup(lds-size / (128 * 4))
4145
4146     24      1 bit   ENABLE_EXCEPTION_IEEE_754_FP    Wavefront starts execution
4147                     _INVALID_OPERATION              with specified exceptions
4148                                                     enabled.
4149
4150                                                     Used by CP to set up
4151                                                     ``COMPUTE_PGM_RSRC2.EXCP_EN``
4152                                                     (set from bits 0..6).
4153
4154                                                     IEEE 754 FP Invalid
4155                                                     Operation
4156     25      1 bit   ENABLE_EXCEPTION_FP_DENORMAL    FP Denormal: one or more
4157                     _SOURCE                         input operands is a
4158                                                     denormal number
4159     26      1 bit   ENABLE_EXCEPTION_IEEE_754_FP    IEEE 754 FP Division by
4160                     _DIVISION_BY_ZERO               Zero
4161     27      1 bit   ENABLE_EXCEPTION_IEEE_754_FP    IEEE 754 FP Overflow
4162                     _OVERFLOW
4163     28      1 bit   ENABLE_EXCEPTION_IEEE_754_FP    IEEE 754 FP Underflow
4164                     _UNDERFLOW
4165     29      1 bit   ENABLE_EXCEPTION_IEEE_754_FP    IEEE 754 FP Inexact
4166                     _INEXACT
4167     30      1 bit   ENABLE_EXCEPTION_INT_DIVIDE_BY  Integer Division by Zero
4168                     _ZERO                           (rcp_iflag_f32 instruction
4169                                                     only)
4170     31      1 bit                                   Reserved, must be 0.
4171     32      **Total size 4 bytes.**
4172     ======= ===================================================================================================================
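
The ``GRANULATED_LDS_SIZE`` granularity rule in the table above can be written
out as follows. This is a sketch only: ``granulated_lds_size`` is a hypothetical
helper, and it assumes that ``lds-size`` is a byte count and that ``roundup``
means rounding up to the next integer, neither of which the table states
explicitly.

.. code-block:: python

  def granulated_lds_size(lds_size, gfx_major):
      """Apply roundup(lds-size / (N * 4)), N = 64 on GFX6, 128 on GFX7-GFX10."""
      if not 6 <= gfx_major <= 10:
          raise ValueError("the table only covers GFX6-GFX10")
      if lds_size < 0:
          raise ValueError("lds_size must not be negative")
      granule = (64 if gfx_major == 6 else 128) * 4
      # Integer ceiling division, so a partial granule counts as a whole one.
      return (lds_size + granule - 1) // granule

As the table notes, the kernel descriptor field itself must be 0; the formula
only describes the granularity in which the LDS size is expressed.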
4173
4174..
4175
4176  .. table:: compute_pgm_rsrc3 for GFX90A
4177     :name: amdgpu-amdhsa-compute_pgm_rsrc3-gfx90a-table
4178
4179     ======= ======= =============================== ===========================================================================
4180     Bits    Size    Field Name                      Description
4181     ======= ======= =============================== ===========================================================================
4182     5:0     6 bits  ACCUM_OFFSET                    Offset of the first AccVGPR in the unified register file. Granularity 4.
4183                                                     Value 0-63. 0 - accum-offset = 4, 1 - accum-offset = 8, ...,
4184                                                     63 - accum-offset = 256.
4185     15:6    10                                      Reserved, must be 0.
4186             bits
4187     16      1 bit   TG_SPLIT                        - If 0 the waves of a work-group are
4188                                                       launched in the same CU.
4189                                                     - If 1 the waves of a work-group can be
4190                                                       launched in different CUs. The waves
4191                                                       cannot use S_BARRIER or LDS.
4192     31:17   15                                      Reserved, must be 0.
4193             bits
4194     32      **Total size 4 bytes.**
4195     ======= ===================================================================================================================
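
The ``ACCUM_OFFSET`` encoding listed above (0 means an offset of 4, 1 means 8,
up to 63 meaning 256) amounts to ``offset / 4 - 1``. A minimal sketch, with the
hypothetical helper names ``encode_accum_offset`` and ``decode_accum_offset``:

.. code-block:: python

  def encode_accum_offset(accum_offset):
      """Map an AccVGPR offset (4, 8, ..., 256) to the 6-bit field value."""
      if accum_offset % 4 != 0 or not 4 <= accum_offset <= 256:
          raise ValueError("accum_offset must be a multiple of 4 in 4..256")
      return accum_offset // 4 - 1


  def decode_accum_offset(field):
      """Map a 6-bit ACCUM_OFFSET field value (0..63) back to the offset."""
      if not 0 <= field <= 63:
          raise ValueError("ACCUM_OFFSET is a 6-bit field")
      return (field + 1) * 4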
4196
4197..
4198
4199  .. table:: compute_pgm_rsrc3 for GFX10
4200     :name: amdgpu-amdhsa-compute_pgm_rsrc3-gfx10-table
4201
4202     ======= ======= =============================== ===========================================================================
4203     Bits    Size    Field Name                      Description
4204     ======= ======= =============================== ===========================================================================
4205     3:0     4 bits  SHARED_VGPR_COUNT               Number of shared VGPRs for wavefront size 64. Granularity 8. Value 0-120.
4206                                                     compute_pgm_rsrc1.vgprs + shared_vgpr_cnt cannot exceed 64.
4207     31:4    28                                      Reserved, must be 0.
4208             bits
4209     32      **Total size 4 bytes.**
4210     ======= ===================================================================================================================
4211
4212..
4213
4214  .. table:: Floating Point Rounding Mode Enumeration Values
4215     :name: amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table
4216
4217     ====================================== ===== ==============================
4218     Enumeration Name                       Value Description
4219     ====================================== ===== ==============================
4220     FLOAT_ROUND_MODE_NEAR_EVEN             0     Round Ties To Even
4221     FLOAT_ROUND_MODE_PLUS_INFINITY         1     Round Toward +infinity
4222     FLOAT_ROUND_MODE_MINUS_INFINITY        2     Round Toward -infinity
4223     FLOAT_ROUND_MODE_ZERO                  3     Round Toward 0
4224     ====================================== ===== ==============================
4225
4226..
4227
4228  .. table:: Floating Point Denorm Mode Enumeration Values
4229     :name: amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table
4230
4231     ====================================== ===== ==============================
4232     Enumeration Name                       Value Description
4233     ====================================== ===== ==============================
4234     FLOAT_DENORM_MODE_FLUSH_SRC_DST        0     Flush Source and Destination
4235                                                  Denorms
4236     FLOAT_DENORM_MODE_FLUSH_DST            1     Flush Output Denorms
4237     FLOAT_DENORM_MODE_FLUSH_SRC            2     Flush Source Denorms
4238     FLOAT_DENORM_MODE_FLUSH_NONE           3     No Flush
4239     ====================================== ===== ==============================
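
To illustrate how the two enumerations above combine with the
``compute_pgm_rsrc1`` bit positions (13:12, 15:14, 17:16 and 19:18), the sketch
below packs the four 2-bit modes into their places. The class and function
names (``FloatRoundMode``, ``FloatDenormMode``, ``float_mode_bits``) are
hypothetical and are not taken from LLVM.

.. code-block:: python

  from enum import IntEnum


  class FloatRoundMode(IntEnum):
      NEAR_EVEN = 0
      PLUS_INFINITY = 1
      MINUS_INFINITY = 2
      ZERO = 3


  class FloatDenormMode(IntEnum):
      FLUSH_SRC_DST = 0
      FLUSH_DST = 1
      FLUSH_SRC = 2
      FLUSH_NONE = 3


  def float_mode_bits(round_32, round_16_64, denorm_32, denorm_16_64):
      """Place the four 2-bit modes at bits 13:12, 15:14, 17:16 and 19:18."""
      # Converting through the enum rejects values outside 0..3.
      return (
          (FloatRoundMode(round_32) << 12)
          | (FloatRoundMode(round_16_64) << 14)
          | (FloatDenormMode(denorm_32) << 16)
          | (FloatDenormMode(denorm_16_64) << 18)
      )

For example, round-to-nearest-even with no denorm flushing for all precisions
sets only the two denorm fields, giving ``0xF0000``.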
4240
4241..
4242
4243  .. table:: System VGPR Work-Item ID Enumeration Values
4244     :name: amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table
4245
4246     ======================================== ===== ============================
4247     Enumeration Name                         Value Description
4248     ======================================== ===== ============================
4249     SYSTEM_VGPR_WORKITEM_ID_X                0     Set work-item X dimension
4250                                                    ID.
4251     SYSTEM_VGPR_WORKITEM_ID_X_Y              1     Set work-item X and Y
4252                                                    dimensions ID.
4253     SYSTEM_VGPR_WORKITEM_ID_X_Y_Z            2     Set work-item X, Y and Z
4254                                                    dimensions ID.
4255     SYSTEM_VGPR_WORKITEM_ID_UNDEFINED        3     Undefined.
4256     ======================================== ===== ============================
4257
4258.. _amdgpu-amdhsa-initial-kernel-execution-state:
4259
4260Initial Kernel Execution State
4261~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
4262
4263This section defines the register state that will be set up by the packet
4264processor prior to the start of execution of every wavefront. This is limited by
4265the constraints of the hardware controllers of CP/ADC/SPI.
4266
4267The order of the SGPR registers is defined, but the compiler can specify which
4268ones are actually set up in the kernel descriptor using the ``enable_sgpr_*`` bit
4269fields (see :ref:`amdgpu-amdhsa-kernel-descriptor`). The register numbers used
4270for enabled registers are dense starting at SGPR0: the first enabled register is
4271SGPR0, the next enabled register is SGPR1 etc.; disabled registers do not have
4272an SGPR number.
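
The dense numbering rule can be sketched as below. The example is illustrative
only: ``USER_SGPR_ORDER`` lists just the first four rows of the *SGPR Register
Set Up Order* table that follows (the table has further rows), and the helper
names ``assign_sgprs`` and ``user_sgpr_count`` are hypothetical.

.. code-block:: python

  # (name, kernel descriptor enable field, number of SGPRs), in set-up order.
  # Only the first four rows of the set-up order table are listed here.
  USER_SGPR_ORDER = [
      ("Private Segment Buffer", "enable_sgpr_private_segment_buffer", 4),
      ("Dispatch Ptr", "enable_sgpr_dispatch_ptr", 2),
      ("Queue Ptr", "enable_sgpr_queue_ptr", 2),
      ("Kernarg Segment Ptr", "enable_sgpr_kernarg_segment_ptr", 2),
  ]


  def assign_sgprs(enabled):
      """Return {name: (first_sgpr, last_sgpr)} for the enabled registers."""
      layout = {}
      next_sgpr = 0
      for name, field, count in USER_SGPR_ORDER:
          if field in enabled:
              # Enabled registers are packed densely; disabled ones get no number.
              layout[name] = (next_sgpr, next_sgpr + count - 1)
              next_sgpr += count
      return layout


  def user_sgpr_count(enabled):
      """Total number of SGPRs requested by the enabled fields."""
      return sum(count for _, field, count in USER_SGPR_ORDER if field in enabled)

With only ``enable_sgpr_dispatch_ptr`` and ``enable_sgpr_kernarg_segment_ptr``
set, the Dispatch Ptr occupies SGPR0-SGPR1 and the Kernarg Segment Ptr occupies
SGPR2-SGPR3.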
4273
4274The initial SGPRs comprise up to 16 User SGPRs that are set by CP and apply to
4275all wavefronts of the grid. It is possible to specify more than 16 User SGPRs
4276using the ``enable_sgpr_*`` bit fields, in which case only the first 16 are
4277actually initialized. These are then immediately followed by the System SGPRs
4278that are set up by ADC/SPI and can have different values for each wavefront of
4279the grid dispatch.
4280
4281SGPR register initial state is defined in
4282:ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
4283
4284  .. table:: SGPR Register Set Up Order
4285     :name: amdgpu-amdhsa-sgpr-register-set-up-order-table
4286
4287     ========== ========================== ====== ==============================
4288     SGPR Order Name                       Number Description
4289                (kernel descriptor enable  of
4290                field)                     SGPRs
4291     ========== ========================== ====== ==============================
4292     First      Private Segment Buffer     4      See
4293                (enable_sgpr_private              :ref:`amdgpu-amdhsa-kernel-prolog-private-segment-buffer`.
4294                _segment_buffer)
4295     then       Dispatch Ptr               2      64-bit address of AQL dispatch
4296                (enable_sgpr_dispatch_ptr)        packet for kernel dispatch
4297                                                  actually executing.
4298     then       Queue Ptr                  2      64-bit address of amd_queue_t
4299                (enable_sgpr_queue_ptr)           object for AQL queue on which
4300                                                  the dispatch packet was
4301                                                  queued.
4302     then       Kernarg Segment Ptr        2      64-bit address of Kernarg
4303                (enable_sgpr_kernarg              segment. This is directly
4304                _segment_ptr)                     copied from the
4305                                                  kernarg_address in the kernel
4306                                                  dispatch packet.
4307
4308                                                  Having CP load it once avoids
4309                                                  loading it at the beginning of
4310                                                  every wavefront.
4311     then       Dispatch Id                2      64-bit Dispatch ID of the
4312                (enable_sgpr_dispatch_id)         dispatch packet being
4313                                                  executed.
4314     then       Flat Scratch Init          2      See
4315                (enable_sgpr_flat_scratch         :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`.
4316                _init)
4317     then       Private Segment Size       1      The 32-bit byte size of a
4318                (enable_sgpr_private              single work-item's memory
4319                _segment_size)                    allocation. This is the
4320                                                  value from the kernel
4321                                                  dispatch packet Private
4322                                                  Segment Byte Size rounded up
4323                                                  by CP to a multiple of
4324                                                  DWORD.
4325
4326                                                  Having CP load it once avoids
4327                                                  loading it at the beginning of
4328                                                  every wavefront.
4329
4330                                                  This is not used for
4331                                                  GFX7-GFX8 since it is the same
4332                                                  value as the second SGPR of
4333                                                  Flat Scratch Init. However, it
4334                                                  may be needed for GFX9-GFX10 which
4335                                                  changes the meaning of the
4336                                                  Flat Scratch Init value.
4337     then       Work-Group Id X            1      32-bit work-group id in X
4338                (enable_sgpr_workgroup_id         dimension of grid for
4339                _X)                               wavefront.
4340     then       Work-Group Id Y            1      32-bit work-group id in Y
4341                (enable_sgpr_workgroup_id         dimension of grid for
4342                _Y)                               wavefront.
4343     then       Work-Group Id Z            1      32-bit work-group id in Z
4344                (enable_sgpr_workgroup_id         dimension of grid for
4345                _Z)                               wavefront.
4346     then       Work-Group Info            1      {first_wavefront, 14'b0000,
4347                (enable_sgpr_workgroup            ordered_append_term[10:0],
4348                _info)                            threadgroup_size_in_wavefronts[5:0]}
4349     then       Scratch Wavefront Offset   1      See
4350                (enable_sgpr_private              :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`.
4351                _segment_wavefront_offset)        and
4352                                                  :ref:`amdgpu-amdhsa-kernel-prolog-private-segment-buffer`.
4353     ========== ========================== ====== ==============================
4354
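The dense numbering rule can be illustrated with a small sketch. It is not part
of the backend: the helper ``assign_sgprs`` and the shortened field names are
hypothetical, and only the order and SGPR counts are taken from the table above.
The sketch also does not model the limit of 16 User SGPRs.

.. code-block:: python

   # Order and SGPR counts follow the "SGPR Register Set Up Order" table.
   # The names are shortened forms of the enable_sgpr_* fields (illustrative).
   SGPR_SETUP_ORDER = [
       ("private_segment_buffer", 4),
       ("dispatch_ptr", 2),
       ("queue_ptr", 2),
       ("kernarg_segment_ptr", 2),
       ("dispatch_id", 2),
       ("flat_scratch_init", 2),
       ("private_segment_size", 1),
       ("workgroup_id_x", 1),
       ("workgroup_id_y", 1),
       ("workgroup_id_z", 1),
       ("workgroup_info", 1),
       ("private_segment_wavefront_offset", 1),
   ]


   def assign_sgprs(enabled):
       """Map each enabled field to (first SGPR number, SGPR count).

       Disabled fields get no entry because they have no SGPR number.
       """
       assignment = {}
       next_sgpr = 0
       for name, count in SGPR_SETUP_ORDER:
           if name in enabled:
               assignment[name] = (next_sgpr, count)
               next_sgpr += count
       return assignment

For example, enabling only the Kernarg Segment Ptr and Work-Group Id X places
the former in SGPR0-SGPR1 and the latter in SGPR2.
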
4355The order of the VGPR registers is defined, but the compiler can specify which
ones are actually set up in the kernel descriptor using the ``enable_vgpr*`` bit
4357fields (see :ref:`amdgpu-amdhsa-kernel-descriptor`). The register numbers used
4358for enabled registers are dense starting at VGPR0: the first enabled register is
4359VGPR0, the next enabled register is VGPR1 etc.; disabled registers do not have a
4360VGPR number.
4361
4362There are different methods used for the VGPR initial state:
4363
4364* Unless the *Target Properties* column of :ref:`amdgpu-processor-table`
4365  specifies otherwise, a separate VGPR register is used per work-item ID. The
4366  VGPR register initial state for this method is defined in
4367  :ref:`amdgpu-amdhsa-vgpr-register-set-up-order-for-unpacked-work-item-id-method-table`.
4368* If *Target Properties* column of :ref:`amdgpu-processor-table`
4369  specifies *Packed work-item IDs*, the initial value of VGPR0 register is used
4370  for all work-item IDs. The register layout for this method is defined in
4371  :ref:`amdgpu-amdhsa-register-layout-for-packed-work-item-id-method-table`.
4372
4373  .. table:: VGPR Register Set Up Order for Unpacked Work-Item ID Method
4374     :name: amdgpu-amdhsa-vgpr-register-set-up-order-for-unpacked-work-item-id-method-table
4375
4376     ========== ========================== ====== ==============================
4377     VGPR Order Name                       Number Description
4378                (kernel descriptor enable  of
4379                field)                     VGPRs
4380     ========== ========================== ====== ==============================
4381     First      Work-Item Id X             1      32-bit work-item id in X
4382                (Always initialized)              dimension of work-group for
4383                                                  wavefront lane.
4384     then       Work-Item Id Y             1      32-bit work-item id in Y
4385                (enable_vgpr_workitem_id          dimension of work-group for
4386                > 0)                              wavefront lane.
4387     then       Work-Item Id Z             1      32-bit work-item id in Z
4388                (enable_vgpr_workitem_id          dimension of work-group for
4389                > 1)                              wavefront lane.
4390     ========== ========================== ====== ==============================
4391
4392..
4393
4394  .. table:: Register Layout for Packed Work-Item ID Method
4395     :name: amdgpu-amdhsa-register-layout-for-packed-work-item-id-method-table
4396
4397     ======= ======= ================ =========================================
4398     Bits    Size    Field Name       Description
4399     ======= ======= ================ =========================================
4400     0:9     10 bits Work-Item Id X   Work-item id in X
4401                                      dimension of work-group for
4402                                      wavefront lane.
4403
4404                                      Always initialized.
4405
4406     10:19   10 bits Work-Item Id Y   Work-item id in Y
4407                                      dimension of work-group for
4408                                      wavefront lane.
4409
4410                                      Initialized if enable_vgpr_workitem_id >
4411                                      0, otherwise set to 0.
4412     20:29   10 bits Work-Item Id Z   Work-item id in Z
4413                                      dimension of work-group for
4414                                      wavefront lane.
4415
4416                                      Initialized if enable_vgpr_workitem_id >
4417                                      1, otherwise set to 0.
4418     30:31   2 bits                   Reserved, set to 0.
4419     ======= ======= ================ =========================================
4420
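The two work-item ID methods can be sketched as follows. The helper names are
hypothetical and the code is only an illustration of the two tables above. It
assumes ``enable_vgpr_workitem_id`` holds one of the defined values 0, 1 or 2.

.. code-block:: python

   def unpacked_work_item_vgprs(enable_vgpr_workitem_id):
       """Unpacked method: map each initialized work-item ID to its VGPR number."""
       names = ["workitem_id_x"]  # X is always initialized
       if enable_vgpr_workitem_id > 0:
           names.append("workitem_id_y")
       if enable_vgpr_workitem_id > 1:
           names.append("workitem_id_z")
       return {name: vgpr for vgpr, name in enumerate(names)}


   def unpack_work_item_ids(vgpr0):
       """Packed method: split VGPR0 into the (X, Y, Z) work-item IDs.

       X is in bits 0:9, Y in bits 10:19 and Z in bits 20:29.
       """
       mask = 0x3FF  # each field is 10 bits wide
       return (vgpr0 & mask, (vgpr0 >> 10) & mask, (vgpr0 >> 20) & mask)

In the packed method a disabled Y or Z field simply reads back as 0, so the same
extraction works regardless of the ``enable_vgpr_workitem_id`` setting.
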
4421The setting of registers is done by GPU CP/ADC/SPI hardware as follows:
4422
44231. SGPRs before the Work-Group Ids are set by CP using the 16 User Data
4424   registers.
44252. Work-group Id registers X, Y, Z are set by ADC which supports any
4426   combination including none.
3. Scratch Wavefront Offset is set by SPI on a per-wavefront basis, which is why
   its value cannot be included with the flat scratch init value which is per
   queue (see :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`).
44304. The VGPRs are set by SPI which only supports specifying either (X), (X, Y)
4431   or (X, Y, Z).
44325. Flat Scratch register pair initialization is described in
4433   :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`.
4434
4435The global segment can be accessed either using buffer instructions (GFX6 which
4436has V# 64-bit address support), flat instructions (GFX7-GFX10), or global
4437instructions (GFX9-GFX10).
4438
4439If buffer operations are used, then the compiler can generate a V# with the
4440following properties:
4441
4442* base address of 0
4443* no swizzle
4444* ATC: 1 if IOMMU present (such as APU)
4445* ptr64: 1
4446* MTYPE set to support memory coherence that matches the runtime (such as CC for
4447  APU and NC for dGPU).
4448
4449.. _amdgpu-amdhsa-kernel-prolog:
4450
4451Kernel Prolog
4452~~~~~~~~~~~~~
4453
4454The compiler performs initialization in the kernel prologue depending on the
4455target and information about things like stack usage in the kernel and called
4456functions. Some of this initialization requires the compiler to request certain
4457User and System SGPRs be present in the
4458:ref:`amdgpu-amdhsa-initial-kernel-execution-state` via the
4459:ref:`amdgpu-amdhsa-kernel-descriptor`.
4460
4461.. _amdgpu-amdhsa-kernel-prolog-cfi:
4462
4463CFI
4464+++
4465
44661.  The CFI return address is undefined.
4467
44682.  The CFI CFA is defined using an expression which evaluates to a location
4469    description that comprises one memory location description for the
4470    ``DW_ASPACE_AMDGPU_private_lane`` address space address ``0``.
4471
4472.. _amdgpu-amdhsa-kernel-prolog-m0:
4473
4474M0
4475++
4476
4477GFX6-GFX8
4478  The M0 register must be initialized with a value at least the total LDS size
  if the kernel may access LDS via DS or flat operations. The total LDS size is
  available in the dispatch packet. For M0, it is also possible to use the
  maximum possible value of LDS for the given target (0x7FFF for GFX6 and 0xFFFF
  for GFX7-GFX8).
4483GFX9-GFX10
4484  The M0 register is not used for range checking LDS accesses and so does not
4485  need to be initialized in the prolog.
4486
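As an illustration only, the conservative M0 values mentioned above can be
tabulated by generation. The helper ``m0_conservative_value`` and its
``gfx_major`` parameter are hypothetical names used for this sketch.

.. code-block:: python

   def m0_conservative_value(gfx_major):
       """Return the maximum LDS value usable for M0, or None if M0 is unused.

       gfx_major is the major GFX generation number (6 for GFX6, and so on).
       """
       if gfx_major == 6:
           return 0x7FFF
       if gfx_major in (7, 8):
           return 0xFFFF
       if gfx_major in (9, 10):
           return None  # M0 is not used for LDS range checking
       raise ValueError("generation not covered by this section")

A prolog that uses these values does not need to read the total LDS size from
the dispatch packet.
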
4487.. _amdgpu-amdhsa-kernel-prolog-stack-pointer:
4488
4489Stack Pointer
4490+++++++++++++
4491
4492If the kernel has function calls it must set up the ABI stack pointer described
4493in :ref:`amdgpu-amdhsa-function-call-convention-non-kernel-functions` by setting
4494SGPR32 to the unswizzled scratch offset of the address past the last local
4495allocation.
4496
4497.. _amdgpu-amdhsa-kernel-prolog-frame-pointer:
4498
4499Frame Pointer
4500+++++++++++++
4501
4502If the kernel needs a frame pointer for the reasons defined in
4503``SIFrameLowering`` then SGPR33 is used and is always set to ``0`` in the
4504kernel prolog. If a frame pointer is not required then all uses of the frame
4505pointer are replaced with immediate ``0`` offsets.
4506
4507.. _amdgpu-amdhsa-kernel-prolog-flat-scratch:
4508
4509Flat Scratch
4510++++++++++++
4511
4512There are different methods used for initializing flat scratch:
4513
4514* If the *Target Properties* column of :ref:`amdgpu-processor-table`
4515  specifies *Does not support generic address space*:
4516
4517  Flat scratch is not supported and there is no flat scratch register pair.
4518
4519* If the *Target Properties* column of :ref:`amdgpu-processor-table`
4520  specifies *Offset flat scratch*:
4521
4522  If the kernel or any function it calls may use flat operations to access
4523  scratch memory, the prolog code must set up the FLAT_SCRATCH register pair
4524  (FLAT_SCRATCH_LO/FLAT_SCRATCH_HI). Initialization uses Flat Scratch Init and
4525  Scratch Wavefront Offset SGPR registers (see
4526  :ref:`amdgpu-amdhsa-initial-kernel-execution-state`):
4527
4528  1. The low word of Flat Scratch Init is the 32-bit byte offset from
4529     ``SH_HIDDEN_PRIVATE_BASE_VIMID`` to the base of scratch backing memory
4530     being managed by SPI for the queue executing the kernel dispatch. This is
4531     the same value used in the Scratch Segment Buffer V# base address.
4532
4533     CP obtains this from the runtime. (The Scratch Segment Buffer base address
4534     is ``SH_HIDDEN_PRIVATE_BASE_VIMID`` plus this offset.)
4535
4536     The prolog must add the value of Scratch Wavefront Offset to get the
4537     wavefront's byte scratch backing memory offset from
4538     ``SH_HIDDEN_PRIVATE_BASE_VIMID``.
4539
4540     The Scratch Wavefront Offset must also be used as an offset with Private
4541     segment address when using the Scratch Segment Buffer.
4542
     Since FLAT_SCRATCH_HI is in units of 256 bytes, the offset must be right
     shifted by 8 before moving into FLAT_SCRATCH_HI.
4545
4546     FLAT_SCRATCH_HI corresponds to SGPRn-4 on GFX7, and SGPRn-6 on GFX8 (where
4547     SGPRn is the highest numbered SGPR allocated to the wavefront).
4548     FLAT_SCRATCH_HI is multiplied by 256 (as it is in units of 256 bytes) and
4549     added to ``SH_HIDDEN_PRIVATE_BASE_VIMID`` to calculate the per wavefront
4550     FLAT SCRATCH BASE in flat memory instructions that access the scratch
4551     aperture.
  2. The second word of Flat Scratch Init is the 32-bit byte size of a single
     work-item's scratch memory usage.
4554
4555     CP obtains this from the runtime, and it is always a multiple of DWORD. CP
4556     checks that the value in the kernel dispatch packet Private Segment Byte
4557     Size is not larger and requests the runtime to increase the queue's scratch
4558     size if necessary.
4559
4560     CP directly loads from the kernel dispatch packet Private Segment Byte Size
4561     field and rounds up to a multiple of DWORD. Having CP load it once avoids
4562     loading it at the beginning of every wavefront.
4563
4564     The kernel prolog code must move it to FLAT_SCRATCH_LO which is SGPRn-3 on
4565     GFX7 and SGPRn-5 on GFX8. FLAT_SCRATCH_LO is used as the FLAT SCRATCH SIZE
4566     in flat memory instructions.
4567
4568* If the *Target Properties* column of :ref:`amdgpu-processor-table`
4569  specifies *Absolute flat scratch*:
4570
4571  If the kernel or any function it calls may use flat operations to access
4572  scratch memory, the prolog code must set up the FLAT_SCRATCH register pair
4573  (FLAT_SCRATCH_LO/FLAT_SCRATCH_HI which are in SGPRn-4/SGPRn-3). Initialization
4574  uses Flat Scratch Init and Scratch Wavefront Offset SGPR registers (see
4575  :ref:`amdgpu-amdhsa-initial-kernel-execution-state`):
4576
4577  The Flat Scratch Init is the 64-bit address of the base of scratch backing
4578  memory being managed by SPI for the queue executing the kernel dispatch.
4579
4580  CP obtains this from the runtime.
4581
4582  The kernel prolog must add the value of the wave's Scratch Wavefront Offset
4583  and move the result as a 64-bit value to the FLAT_SCRATCH SGPR register pair
4584  which is SGPRn-6 and SGPRn-5. It is used as the FLAT SCRATCH BASE in flat
4585  memory instructions.
4586
4587  The Scratch Wavefront Offset must also be used as an offset with Private
4588  segment address when using the Scratch Segment Buffer (see
4589  :ref:`amdgpu-amdhsa-kernel-prolog-private-segment-buffer`).
4590
4591* If the *Target Properties* column of :ref:`amdgpu-processor-table`
4592  specifies *Architected flat scratch*:
4593
4594  If ENABLE_PRIVATE_SEGMENT is enabled in
4595  :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table` then the FLAT_SCRATCH
4596  register pair will be initialized to the 64-bit address of the base of scratch
4597  backing memory being managed by SPI for the queue executing the kernel
4598  dispatch plus the value of the wave's Scratch Wavefront Offset for use as the
4599  flat scratch base in flat memory instructions.
4600
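The arithmetic of the *Offset flat scratch* and *Absolute flat scratch* methods
can be sketched as below. This is an illustration under stated assumptions, not
the code emitted by the backend: the function names are hypothetical, additions
are assumed to wrap at the register width, and for the absolute method the low
32 bits of the result are assumed to go in FLAT_SCRATCH_LO.

.. code-block:: python

   MASK32 = 0xFFFFFFFF
   MASK64 = 0xFFFFFFFFFFFFFFFF


   def offset_flat_scratch(init_low_word, init_second_word, wavefront_offset):
       """Offset method: return (FLAT_SCRATCH_LO, FLAT_SCRATCH_HI).

       init_low_word is the queue's byte offset from the private base and
       init_second_word is the per work-item scratch size in bytes.
       """
       wave_byte_offset = (init_low_word + wavefront_offset) & MASK32
       flat_scratch_hi = wave_byte_offset >> 8  # units of 256 bytes
       flat_scratch_lo = init_second_word       # used as the scratch size
       return flat_scratch_lo, flat_scratch_hi


   def absolute_flat_scratch(flat_scratch_init, wavefront_offset):
       """Absolute method: return (FLAT_SCRATCH_LO, FLAT_SCRATCH_HI).

       flat_scratch_init is the 64-bit base address of the queue's scratch.
       """
       base = (flat_scratch_init + wavefront_offset) & MASK64
       return base & MASK32, base >> 32  # assumed low/high split

For the *Architected flat scratch* method no such computation is needed in the
prolog, because the register pair is already initialized.
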
4601.. _amdgpu-amdhsa-kernel-prolog-private-segment-buffer:
4602
4603Private Segment Buffer
4604++++++++++++++++++++++
4605
4606If the *Target Properties* column of :ref:`amdgpu-processor-table` specifies
4607*Architected flat scratch* then a Private Segment Buffer is not supported.
4608Instead the flat SCRATCH instructions are used.
4609
Otherwise, the Private Segment Buffer SGPR register is used to initialize 4 SGPRs
4611that are used as a V# to access scratch. CP uses the value provided by the
4612runtime. It is used, together with Scratch Wavefront Offset as an offset, to
4613access the private memory space using a segment address. See
4614:ref:`amdgpu-amdhsa-initial-kernel-execution-state`.
4615
The scratch V# occupies four SGPRs starting at a four-aligned SGPR index and is
always selected for the kernel as follows:
4618
4619  - If it is known during instruction selection that there is stack usage,
4620    SGPR0-3 is reserved for use as the scratch V#.  Stack usage is assumed if
4621    optimizations are disabled (``-O0``), if stack objects already exist (for
4622    locals, etc.), or if there are any function calls.
4623
4624  - Otherwise, four high numbered SGPRs beginning at a four-aligned SGPR index
4625    are reserved for the tentative scratch V#. These will be used if it is
4626    determined that spilling is needed.
4627
4628    - If no use is made of the tentative scratch V#, then it is unreserved,
4629      and the register count is determined ignoring it.
4630    - If use is made of the tentative scratch V#, then its register numbers
4631      are shifted to the first four-aligned SGPR index after the highest one
4632      allocated by the register allocator, and all uses are updated. The
4633      register count includes them in the shifted location.
4634    - In either case, if the processor has the SGPR allocation bug, the
4635      tentative allocation is not shifted or unreserved in order to ensure
      the register count is higher to work around the bug.
4637
4638    .. note::
4639
4640      This approach of using a tentative scratch V# and shifting the register
4641      numbers if used avoids having to perform register allocation a second
4642      time if the tentative V# is eliminated. This is more efficient and
4643      avoids the problem that the second register allocation may perform
4644      spilling which will fail as there is no longer a scratch V#.
4645
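The shifting of a used tentative scratch V# can be sketched as follows. The
helper ``shifted_scratch_vsharp`` is a hypothetical name. It only illustrates
the "first four-aligned SGPR index after the highest one allocated" rule and
ignores the SGPR allocation bug case.

.. code-block:: python

   def shifted_scratch_vsharp(highest_allocated_sgpr):
       """Return the four SGPR numbers of the shifted scratch V#.

       highest_allocated_sgpr is the highest SGPR index used by the register
       allocator, not counting the tentative V# itself.
       """
       # First multiple of 4 strictly greater than the highest allocated index.
       first = (highest_allocated_sgpr // 4 + 1) * 4
       return list(range(first, first + 4))

For instance, if the highest allocated register is SGPR9, the V# moves to
SGPR12-SGPR15.
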
4646When the kernel prolog code is being emitted it is known whether the scratch V#
4647described above is actually used. If it is, the prolog code must set it up by
4648copying the Private Segment Buffer to the scratch V# registers and then adding
4649the Private Segment Wavefront Offset to the queue base address in the V#. The
4650result is a V# with a base address pointing to the beginning of the wavefront
4651scratch backing memory.
4652
4653The Private Segment Buffer is always requested, but the Private Segment
4654Wavefront Offset is only requested if it is used (see
4655:ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
4656
4657.. _amdgpu-amdhsa-memory-model:
4658
4659Memory Model
4660~~~~~~~~~~~~
4661
4662This section describes the mapping of the LLVM memory model onto AMDGPU machine
4663code (see :ref:`memmodel`).
4664
4665The AMDGPU backend supports the memory synchronization scopes specified in
4666:ref:`amdgpu-memory-scopes`.
4667
4668The code sequences used to implement the memory model specify the order of
4669instructions that a single thread must execute. The ``s_waitcnt`` and cache
4670management instructions such as ``buffer_wbinvl1_vol`` are defined with respect
4671to other memory instructions executed by the same thread. This allows them to be
4672moved earlier or later which can allow them to be combined with other instances
4673of the same instruction, or hoisted/sunk out of loops to improve performance.
4674Only the instructions related to the memory model are given; additional
4675``s_waitcnt`` instructions are required to ensure registers are defined before
4676being used. These may be able to be combined with the memory model ``s_waitcnt``
4677instructions as described above.
4678
4679The AMDGPU backend supports the following memory models:
4680
4681  HSA Memory Model [HSA]_
4682    The HSA memory model uses a single happens-before relation for all address
4683    spaces (see :ref:`amdgpu-address-spaces`).
4684  OpenCL Memory Model [OpenCL]_
    The OpenCL memory model has separate happens-before relations for the
    global and local address spaces. Only a fence specifying both global and
    local address space, and seq_cst instructions join the relationships. Since
    the LLVM ``fence`` instruction does not allow an address space to be
    specified the OpenCL fence has to conservatively assume both local and
    global address spaces were specified. However, optimizations can often be
    done to eliminate the additional ``s_waitcnt`` instructions when there are
    no intervening memory instructions which access the corresponding address
    space. The code sequences in the table indicate what can be omitted for the
    OpenCL memory model. The target triple environment is used to determine if
    the source language is OpenCL (see :ref:`amdgpu-opencl`).
4696
4697``ds/flat_load/store/atomic`` instructions to local memory are termed LDS
4698operations.
4699
4700``buffer/global/flat_load/store/atomic`` instructions to global memory are
4701termed vector memory operations.
4702
4703Private address space uses ``buffer_load/store`` using the scratch V#
4704(GFX6-GFX8), or ``scratch_load/store`` (GFX9-GFX10). Since only a single thread
4705is accessing the memory, atomic memory orderings are not meaningful, and all
4706accesses are treated as non-atomic.
4707
4708Constant address space uses ``buffer/global_load`` instructions (or equivalent
4709scalar memory instructions). Since the constant address space contents do not
4710change during the execution of a kernel dispatch it is not legal to perform
4711stores, and atomic memory orderings are not meaningful, and all accesses are
4712treated as non-atomic.
4713
4714A memory synchronization scope wider than work-group is not meaningful for the
4715group (LDS) address space and is treated as work-group.
4716
4717The memory model does not support the region address space which is treated as
4718non-atomic.
4719
4720Acquire memory ordering is not meaningful on store atomic instructions and is
4721treated as non-atomic.
4722
4723Release memory ordering is not meaningful on load atomic instructions and is
treated as non-atomic.
4725
4726Acquire-release memory ordering is not meaningful on load or store atomic
4727instructions and is treated as acquire and release respectively.
4728
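The three rules above for load and store atomic instructions can be summarized
in a small sketch. The function ``effective_ordering`` is a hypothetical helper
written for illustration. It is not a backend API.

.. code-block:: python

   def effective_ordering(kind, ordering):
       """Return the ordering actually applied to a load or store atomic.

       kind is "load" or "store"; ordering is an LLVM memory ordering name.
       """
       if kind not in ("load", "store"):
           raise ValueError("only load and store atomics are covered here")
       if kind == "store" and ordering == "acquire":
           return "non-atomic"
       if kind == "load" and ordering == "release":
           return "non-atomic"
       if ordering == "acq_rel":
           return "acquire" if kind == "load" else "release"
       return ordering

All other combinations, such as an acquire load or a release store, keep the
ordering that was specified.
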
4729The memory order also adds the single thread optimization constraints defined in
4730table
4731:ref:`amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-table`.
4732
4733  .. table:: AMDHSA Memory Model Single Thread Optimization Constraints
4734     :name: amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-table
4735
4736     ============ ==============================================================
4737     LLVM Memory  Optimization Constraints
4738     Ordering
4739     ============ ==============================================================
4740     unordered    *none*
4741     monotonic    *none*
4742     acquire      - If a load atomic/atomicrmw then no following load/load
4743                    atomic/store/store atomic/atomicrmw/fence instruction can be
4744                    moved before the acquire.
4745                  - If a fence then same as load atomic, plus no preceding
4746                    associated fence-paired-atomic can be moved after the fence.
4747     release      - If a store atomic/atomicrmw then no preceding load/load
4748                    atomic/store/store atomic/atomicrmw/fence instruction can be
4749                    moved after the release.
4750                  - If a fence then same as store atomic, plus no following
4751                    associated fence-paired-atomic can be moved before the
4752                    fence.
4753     acq_rel      Same constraints as both acquire and release.
4754     seq_cst      - If a load atomic then same constraints as acquire, plus no
4755                    preceding sequentially consistent load atomic/store
4756                    atomic/atomicrmw/fence instruction can be moved after the
4757                    seq_cst.
4758                  - If a store atomic then the same constraints as release, plus
4759                    no following sequentially consistent load atomic/store
4760                    atomic/atomicrmw/fence instruction can be moved before the
4761                    seq_cst.
4762                  - If an atomicrmw/fence then same constraints as acq_rel.
4763     ============ ==============================================================
4764
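One possible reading of the constraints table is sketched below as a predicate.
It is a simplification with hypothetical names: it assumes the ordering is
meaningful for the instruction kind, treats every load, store, atomic and fence
as the "other" instruction, and does not model the fence-paired-atomic clauses.

.. code-block:: python

   def may_move_across(kind, ordering, other_ordering, other_follows):
       """Return True if the table permits moving another memory instruction
       across the atomic or fence described by (kind, ordering).

       kind is "load", "store", "atomicrmw" or "fence". other_ordering is the
       ordering of the other instruction (None if it is non-atomic).
       other_follows is True when the other instruction comes later in program
       order and would be moved before, False when it comes earlier and would
       be moved after.
       """
       acquire_side = ordering in ("acquire", "acq_rel") or (
           ordering == "seq_cst" and kind in ("load", "atomicrmw", "fence"))
       release_side = ordering in ("release", "acq_rel") or (
           ordering == "seq_cst" and kind in ("store", "atomicrmw", "fence"))
       other_is_seq_cst = other_ordering == "seq_cst"
       if other_follows:
           if acquire_side:
               return False
           # A seq_cst store also pins following seq_cst instructions.
           return not (ordering == "seq_cst" and kind == "store"
                       and other_is_seq_cst)
       if release_side:
           return False
       # A seq_cst load also pins preceding seq_cst instructions.
       return not (ordering == "seq_cst" and kind == "load"
                   and other_is_seq_cst)

For example, a non-atomic store that precedes an acquire load may still be moved
after it, while nothing that follows the acquire load may be moved before it.
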
4765The code sequences used to implement the memory model are defined in the
4766following sections:
4767
4768* :ref:`amdgpu-amdhsa-memory-model-gfx6-gfx9`
4769* :ref:`amdgpu-amdhsa-memory-model-gfx90a`
4770* :ref:`amdgpu-amdhsa-memory-model-gfx10`
4771
4772.. _amdgpu-amdhsa-memory-model-gfx6-gfx9:
4773
4774Memory Model GFX6-GFX9
4775++++++++++++++++++++++
4776
4777For GFX6-GFX9:
4778
4779* Each agent has multiple shader arrays (SA).
4780* Each SA has multiple compute units (CU).
4781* Each CU has multiple SIMDs that execute wavefronts.
4782* The wavefronts for a single work-group are executed in the same CU but may be
4783  executed by different SIMDs.
4784* Each CU has a single LDS memory shared by the wavefronts of the work-groups
4785  executing on it.
4786* All LDS operations of a CU are performed as wavefront wide operations in a
4787  global order and involve no caching. Completion is reported to a wavefront in
4788  execution order.
4789* The LDS memory has multiple request queues shared by the SIMDs of a
4790  CU. Therefore, the LDS operations performed by different wavefronts of a
4791  work-group can be reordered relative to each other, which can result in
4792  reordering the visibility of vector memory operations with respect to LDS
4793  operations of other wavefronts in the same work-group. A ``s_waitcnt
4794  lgkmcnt(0)`` is required to ensure synchronization between LDS operations and
4795  vector memory operations between wavefronts of a work-group, but not between
4796  operations performed by the same wavefront.
4797* The vector memory operations are performed as wavefront wide operations and
4798  completion is reported to a wavefront in execution order. The exception is
4799  that for GFX7-GFX9 ``flat_load/store/atomic`` instructions can report out of
4800  vector memory order if they access LDS memory, and out of LDS operation order
4801  if they access global memory.
4802* The vector memory operations access a single vector L1 cache shared by all
  SIMDs of a CU. Therefore, no special action is required for coherence between the
4804  lanes of a single wavefront, or for coherence between wavefronts in the same
4805  work-group. A ``buffer_wbinvl1_vol`` is required for coherence between
4806  wavefronts executing in different work-groups as they may be executing on
4807  different CUs.
4808* The scalar memory operations access a scalar L1 cache shared by all wavefronts
4809  on a group of CUs. The scalar and vector L1 caches are not coherent. However,
4810  scalar operations are used in a restricted way so do not impact the memory
4811  model. See :ref:`amdgpu-amdhsa-memory-spaces`.
4812* The vector and scalar memory operations use an L2 cache shared by all CUs on
4813  the same agent.
4814* The L2 cache has independent channels to service disjoint ranges of virtual
4815  addresses.
4816* Each CU has a separate request queue per channel. Therefore, the vector and
4817  scalar memory operations performed by wavefronts executing in different
4818  work-groups (which may be executing on different CUs) of an agent can be
4819  reordered relative to each other. A ``s_waitcnt vmcnt(0)`` is required to
4820  ensure synchronization between vector memory operations of different CUs. It
4821  ensures a previous vector memory operation has completed before executing a
4822  subsequent vector memory or LDS operation and so can be used to meet the
4823  requirements of acquire and release.
4824* The L2 cache can be kept coherent with other agents on some targets, or ranges
4825  of virtual addresses can be set up to bypass it to ensure system coherence.
4826
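The vector L1 cache point above can be restated as a small lookup. This is only
a sketch: the helper name is hypothetical, and the scope names are assumed to
match those in :ref:`amdgpu-memory-scopes`, with ``"system"`` standing for the
default scope.

.. code-block:: python

   # Scopes whose participants all run on one CU and so share one vector L1.
   SAME_CU_SCOPES = ("singlethread", "wavefront", "workgroup")
   # Scopes whose participants may run on different CUs.
   CROSS_CU_SCOPES = ("agent", "system")


   def vector_l1_invalidate_needed(scope):
       """Return True if coherence at this scope needs a vector L1 invalidate
       such as buffer_wbinvl1_vol on GFX6-GFX9."""
       if scope in SAME_CU_SCOPES:
           return False
       if scope in CROSS_CU_SCOPES:
           return True
       raise ValueError("unknown synchronization scope: " + scope)

The lookup only covers cache invalidation. The ``s_waitcnt`` requirements listed
above still apply separately.
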
4827Scalar memory operations are only used to access memory that is proven to not
4828change during the execution of the kernel dispatch. This includes constant
4829address space and global address space for program scope ``const`` variables.
4830Therefore, the kernel machine code does not have to maintain the scalar cache to
4831ensure it is coherent with the vector caches. The scalar and vector caches are
4832invalidated between kernel dispatches by CP since constant address space data
4833may change between kernel dispatch executions. See
4834:ref:`amdgpu-amdhsa-memory-spaces`.
4835
4836The one exception is if scalar writes are used to spill SGPR registers. In this
4837case the AMDGPU backend ensures the memory location used to spill is never
4838accessed by vector memory operations at the same time. If scalar writes are used
4839then a ``s_dcache_wb`` is inserted before the ``s_endpgm`` and before a function
4840return since the locations may be used for vector memory instructions by a
4841future wavefront that uses the same scratch area, or a function call that
4842creates a frame at the same address, respectively. There is no need for a
4843``s_dcache_inv`` as all scalar writes are write-before-read in the same thread.
4844
4845For kernarg backing memory:
4846
4847* CP invalidates the L1 cache at the start of each kernel dispatch.
4848* On dGPU the kernarg backing memory is allocated in host memory accessed as
4849  MTYPE UC (uncached) to avoid needing to invalidate the L2 cache. This also
4850  causes it to be treated as non-volatile and so is not invalidated by
4851  ``*_vol``.
* On APU the kernarg backing memory is accessed as MTYPE CC (cache coherent)
4853  and so the L2 cache will be coherent with the CPU and other agents.
4854
4855Scratch backing memory (which is used for the private address space) is accessed
4856with MTYPE NC_NV (non-coherent non-volatile). Since the private address space is
4857only accessed by a single thread, and is always write-before-read, there is
4858never a need to invalidate these entries from the L1 cache. Hence all cache
4859invalidates are done as ``*_vol`` to only invalidate the volatile cache lines.
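
The effect of a ``*_vol`` invalidate can be pictured with a toy model of L1
cache lines. The Python below is conceptual and does not describe the hardware:
the ``CacheLine`` type and the ``invalidate_vol`` helper are hypothetical, and
the ``NC`` label and volatile flag used for the global data line are
assumptions of the example. Only the non-volatile treatment of scratch
(``NC_NV``) comes from the text above.

.. code-block:: python

   from dataclasses import dataclass


   @dataclass(frozen=True)
   class CacheLine:
       address: int
       mtype: str      # label only, e.g. "NC_NV" for scratch
       volatile: bool  # whether a *_vol invalidate applies to this line


   def invalidate_vol(l1_lines):
       """Model a *_vol invalidate: drop volatile lines, keep the rest."""
       return [line for line in l1_lines if not line.volatile]


   l1 = [
       CacheLine(0x1000, "NC_NV", volatile=False),  # scratch (private) data
       CacheLine(0x2000, "NC", volatile=True),      # assumed: global data
   ]
   print(invalidate_vol(l1))

In this model the scratch line survives the invalidate, which is safe because
private data is only accessed by a single thread and is write-before-read.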
4860
4861The code sequences used to implement the memory model for GFX6-GFX9 are defined
4862in table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx9-table`.
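
As a reading aid, the ``load atomic`` acquire rows of that table can be
transcribed into a small lookup. The Python below is only a sketch: the
function name ``load_atomic_acquire`` and the string spellings are
illustrative, it covers just those rows, and combinations that the table does
not list raise an error.

.. code-block:: python

   def load_atomic_acquire(scope, addr_space, opencl=False):
       """Return the GFX6-GFX9 sequence for a 'load atomic acquire' row.

       scope: "singlethread", "wavefront", "workgroup", "agent" or "system".
       addr_space: "global", "local" or "generic".
       opencl: True when the OpenCL relaxation noted in the table applies.
       """
       if addr_space not in ("global", "local", "generic"):
           raise ValueError(f"unknown address space {addr_space!r}")
       if scope in ("singlethread", "wavefront"):
           # No extra synchronization is needed at these scopes.
           return ["buffer/global/ds/flat_load"]
       if scope == "workgroup":
           if addr_space == "global":
               return ["buffer/global_load"]
           seq = ["ds/flat_load"]
           if not opencl:
               # Keeps later global reads from being older than the
               # local value being acquired.
               seq.append("s_waitcnt lgkmcnt(0)")
           return seq
       if scope in ("agent", "system"):
           if addr_space == "global":
               return ["buffer/global_load glc=1",
                       "s_waitcnt vmcnt(0)",
                       "buffer_wbinvl1_vol"]
           if addr_space == "generic":
               wait = ("s_waitcnt vmcnt(0)" if opencl
                       else "s_waitcnt vmcnt(0) & lgkmcnt(0)")
               return ["flat_load glc=1", wait, "buffer_wbinvl1_vol"]
       raise ValueError(
           f"no table row for scope={scope!r}, addr_space={addr_space!r}")


   print(load_atomic_acquire("agent", "global"))

The ordering constraints attached to each step (for example that the
``s_waitcnt`` must precede the ``buffer_wbinvl1_vol``) are carried only by the
list order here; the table remains the authoritative description.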
4863
4864  .. table:: AMDHSA Memory Model Code Sequences GFX6-GFX9
4865     :name: amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx9-table
4866
4867     ============ ============ ============== ========== ================================
4868     LLVM Instr   LLVM Memory  LLVM Memory    AMDGPU     AMDGPU Machine Code
4869                  Ordering     Sync Scope     Address    GFX6-GFX9
4870                                              Space
4871     ============ ============ ============== ========== ================================
4872     **Non-Atomic**
4873     ------------------------------------------------------------------------------------
4874     load         *none*       *none*         - global   - !volatile & !nontemporal
4875                                              - generic
4876                                              - private    1. buffer/global/flat_load
4877                                              - constant
4878                                                         - !volatile & nontemporal
4879
4880                                                           1. buffer/global/flat_load
4881                                                              glc=1 slc=1
4882
4883                                                         - volatile
4884
4885                                                           1. buffer/global/flat_load
4886                                                              glc=1
4887                                                           2. s_waitcnt vmcnt(0)
4888
4889                                                            - Must happen before
4890                                                              any following volatile
4891                                                              global/generic
4892                                                              load/store.
4893                                                            - Ensures that
4894                                                              volatile
4895                                                              operations to
4896                                                              different
4897                                                              addresses will not
4898                                                              be reordered by
4899                                                              hardware.
4900
4901     load         *none*       *none*         - local    1. ds_load
4902     store        *none*       *none*         - global   - !volatile & !nontemporal
4903                                              - generic
4904                                              - private    1. buffer/global/flat_store
4905                                              - constant
4906                                                         - !volatile & nontemporal
4907
4908                                                           1. buffer/global/flat_store
4909                                                              glc=1 slc=1
4910
4911                                                         - volatile
4912
4913                                                           1. buffer/global/flat_store
4914                                                           2. s_waitcnt vmcnt(0)
4915
4916                                                            - Must happen before
4917                                                              any following volatile
4918                                                              global/generic
4919                                                              load/store.
4920                                                            - Ensures that
4921                                                              volatile
4922                                                              operations to
4923                                                              different
4924                                                              addresses will not
4925                                                              be reordered by
4926                                                              hardware.
4927
4928     store        *none*       *none*         - local    1. ds_store
4929     **Unordered Atomic**
4930     ------------------------------------------------------------------------------------
4931     load atomic  unordered    *any*          *any*      *Same as non-atomic*.
4932     store atomic unordered    *any*          *any*      *Same as non-atomic*.
4933     atomicrmw    unordered    *any*          *any*      *Same as monotonic atomic*.
4934     **Monotonic Atomic**
4935     ------------------------------------------------------------------------------------
4936     load atomic  monotonic    - singlethread - global   1. buffer/global/ds/flat_load
4937                               - wavefront    - local
4938                               - workgroup    - generic
4939     load atomic  monotonic    - agent        - global   1. buffer/global/flat_load
4940                               - system       - generic     glc=1
4941     store atomic monotonic    - singlethread - global   1. buffer/global/flat_store
4942                               - wavefront    - generic
4943                               - workgroup
4944                               - agent
4945                               - system
4946     store atomic monotonic    - singlethread - local    1. ds_store
4947                               - wavefront
4948                               - workgroup
4949     atomicrmw    monotonic    - singlethread - global   1. buffer/global/flat_atomic
4950                               - wavefront    - generic
4951                               - workgroup
4952                               - agent
4953                               - system
4954     atomicrmw    monotonic    - singlethread - local    1. ds_atomic
4955                               - wavefront
4956                               - workgroup
4957     **Acquire Atomic**
4958     ------------------------------------------------------------------------------------
4959     load atomic  acquire      - singlethread - global   1. buffer/global/ds/flat_load
4960                               - wavefront    - local
4961                                              - generic
4962     load atomic  acquire      - workgroup    - global   1. buffer/global_load
4963     load atomic  acquire      - workgroup    - local    1. ds/flat_load
4964                                              - generic  2. s_waitcnt lgkmcnt(0)
4965
4966                                                           - If OpenCL, omit.
4967                                                           - Must happen before
4968                                                             any following
4969                                                             global/generic
4970                                                             load/load
4971                                                             atomic/store/store
4972                                                             atomic/atomicrmw.
4973                                                           - Ensures any
4974                                                             following global
4975                                                             data read is no
4976                                                             older than a local load
4977                                                             atomic value being
4978                                                             acquired.
4979
4980     load atomic  acquire      - agent        - global   1. buffer/global_load
4981                               - system                     glc=1
4982                                                         2. s_waitcnt vmcnt(0)
4983
4984                                                           - Must happen before
4985                                                             following
4986                                                             buffer_wbinvl1_vol.
4987                                                           - Ensures the load
4988                                                             has completed
4989                                                             before invalidating
4990                                                             the cache.
4991
4992                                                         3. buffer_wbinvl1_vol
4993
4994                                                           - Must happen before
4995                                                             any following
4996                                                             global/generic
4997                                                             load/load
4998                                                             atomic/atomicrmw.
4999                                                           - Ensures that
5000                                                             following
5001                                                             loads will not see
5002                                                             stale global data.
5003
5004     load atomic  acquire      - agent        - generic  1. flat_load glc=1
5005                               - system                  2. s_waitcnt vmcnt(0) &
5006                                                            lgkmcnt(0)
5007
5008                                                           - If OpenCL, omit
5009                                                             lgkmcnt(0).
5010                                                           - Must happen before
5011                                                             following
5012                                                             buffer_wbinvl1_vol.
5013                                                           - Ensures the flat_load
5014                                                             has completed
5015                                                             before invalidating
5016                                                             the cache.
5017
5018                                                         3. buffer_wbinvl1_vol
5019
5020                                                           - Must happen before
5021                                                             any following
5022                                                             global/generic
5023                                                             load/load
5024                                                             atomic/atomicrmw.
5025                                                           - Ensures that
5026                                                             following loads
5027                                                             will not see stale
5028                                                             global data.
5029
5030     atomicrmw    acquire      - singlethread - global   1. buffer/global/ds/flat_atomic
5031                               - wavefront    - local
5032                                              - generic
5033     atomicrmw    acquire      - workgroup    - global   1. buffer/global_atomic
5034     atomicrmw    acquire      - workgroup    - local    1. ds/flat_atomic
5035                                              - generic  2. s_waitcnt lgkmcnt(0)
5036
5037                                                           - If OpenCL, omit.
5038                                                           - Must happen before
5039                                                             any following
5040                                                             global/generic
5041                                                             load/load
5042                                                             atomic/store/store
5043                                                             atomic/atomicrmw.
5044                                                           - Ensures any
5045                                                             following global
5046                                                             data read is no
5047                                                             older than a local
5048                                                             atomicrmw value
5049                                                             being acquired.
5050
5051     atomicrmw    acquire      - agent        - global   1. buffer/global_atomic
5052                               - system                  2. s_waitcnt vmcnt(0)
5053
5054                                                           - Must happen before
5055                                                             following
5056                                                             buffer_wbinvl1_vol.
5057                                                           - Ensures the
5058                                                             atomicrmw has
5059                                                             completed before
5060                                                             invalidating the
5061                                                             cache.
5062
5063                                                         3. buffer_wbinvl1_vol
5064
5065                                                           - Must happen before
5066                                                             any following
5067                                                             global/generic
5068                                                             load/load
5069                                                             atomic/atomicrmw.
5070                                                           - Ensures that
5071                                                             following loads
5072                                                             will not see stale
5073                                                             global data.
5074
5075     atomicrmw    acquire      - agent        - generic  1. flat_atomic
5076                               - system                  2. s_waitcnt vmcnt(0) &
5077                                                            lgkmcnt(0)
5078
5079                                                           - If OpenCL, omit
5080                                                             lgkmcnt(0).
5081                                                           - Must happen before
5082                                                             following
5083                                                             buffer_wbinvl1_vol.
5084                                                           - Ensures the
5085                                                             atomicrmw has
5086                                                             completed before
5087                                                             invalidating the
5088                                                             cache.
5089
5090                                                         3. buffer_wbinvl1_vol
5091
5092                                                           - Must happen before
5093                                                             any following
5094                                                             global/generic
5095                                                             load/load
5096                                                             atomic/atomicrmw.
5097                                                           - Ensures that
5098                                                             following loads
5099                                                             will not see stale
5100                                                             global data.
5101
5102     fence        acquire      - singlethread *none*     *none*
5103                               - wavefront
5104     fence        acquire      - workgroup    *none*     1. s_waitcnt lgkmcnt(0)
5105
5106                                                           - If OpenCL and
5107                                                             address space is
5108                                                             not generic, omit.
5109                                                           - However, since LLVM
5110                                                             currently has no
5111                                                             address space on
5112                                                             the fence, it must
5113                                                             conservatively
5114                                                             always be generated. If
5115                                                             fence had an
5116                                                             address space then
5117                                                             set to address
5118                                                             space of OpenCL
5119                                                             fence flag, or to
5120                                                             generic if both
5121                                                             local and global
5122                                                             flags are
5123                                                             specified.
5124                                                           - Must happen after
5125                                                             any preceding
5126                                                             local/generic load
5127                                                             atomic/atomicrmw
5128                                                             with an equal or
5129                                                             wider sync scope
5130                                                             and memory ordering
5131                                                             stronger than
5132                                                             unordered (this is
5133                                                             termed the
5134                                                             fence-paired-atomic).
5135                                                           - Must happen before
5136                                                             any following
5137                                                             global/generic
5138                                                             load/load
5139                                                             atomic/store/store
5140                                                             atomic/atomicrmw.
5141                                                           - Ensures any
5142                                                             following global
5143                                                             data read is no
5144                                                             older than the
5145                                                             value read by the
5146                                                             fence-paired-atomic.
5147
5148     fence        acquire      - agent        *none*     1. s_waitcnt lgkmcnt(0) &
5149                               - system                     vmcnt(0)
5150
5151                                                           - If OpenCL and
5152                                                             address space is
5153                                                             not generic, omit
5154                                                             lgkmcnt(0).
5155                                                           - However, since LLVM
5156                                                             currently has no
5157                                                             address space on
5158                                                             the fence, it must
5159                                                             conservatively
5160                                                             always be generated
5161                                                             (see comment for
5162                                                             previous fence).
5163                                                           - Could be split into
5164                                                             separate s_waitcnt
5165                                                             vmcnt(0) and
5166                                                             s_waitcnt
5167                                                             lgkmcnt(0) to allow
5168                                                             them to be
5169                                                             independently moved
5170                                                             according to the
5171                                                             following rules.
5172                                                           - s_waitcnt vmcnt(0)
5173                                                             must happen after
5174                                                             any preceding
5175                                                             global/generic load
5176                                                             atomic/atomicrmw
5177                                                             with an equal or
5178                                                             wider sync scope
5179                                                             and memory ordering
5180                                                             stronger than
5181                                                             unordered (this is
5182                                                             termed the
5183                                                             fence-paired-atomic).
5184                                                           - s_waitcnt lgkmcnt(0)
5185                                                             must happen after
5186                                                             any preceding
5187                                                             local/generic load
5188                                                             atomic/atomicrmw
5189                                                             with an equal or
5190                                                             wider sync scope
5191                                                             and memory ordering
5192                                                             stronger than
5193                                                             unordered (this is
5194                                                             termed the
5195                                                             fence-paired-atomic).
5196                                                           - Must happen before
5197                                                             the following
5198                                                             buffer_wbinvl1_vol.
5199                                                           - Ensures that the
5200                                                             fence-paired-atomic
5201                                                             has completed
5202                                                             before invalidating
5203                                                             the
5204                                                             cache. Therefore
5205                                                             any following
5206                                                             locations read must
5207                                                             be no older than
5208                                                             the value read by
5209                                                             the
5210                                                             fence-paired-atomic.
5211
5212                                                         2. buffer_wbinvl1_vol
5213
5214                                                           - Must happen before any
5215                                                             following global/generic
5216                                                             load/load
5217                                                             atomic/store/store
5218                                                             atomic/atomicrmw.
5219                                                           - Ensures that
5220                                                             following loads
5221                                                             will not see stale
5222                                                             global data.
5223
5224     **Release Atomic**
5225     ------------------------------------------------------------------------------------
5226     store atomic release      - singlethread - global   1. buffer/global/ds/flat_store
5227                               - wavefront    - local
5228                                              - generic
5229     store atomic release      - workgroup    - global   1. s_waitcnt lgkmcnt(0)
5230                                              - generic
5231                                                           - If OpenCL, omit.
5232                                                           - Must happen after
5233                                                             any preceding
5234                                                             local/generic
5235                                                             load/store/load
5236                                                             atomic/store
5237                                                             atomic/atomicrmw.
5238                                                           - Must happen before
5239                                                             the following
5240                                                             store.
5241                                                           - Ensures that all
5242                                                             memory operations
5243                                                             to local have
5244                                                             completed before
5245                                                             performing the
5246                                                             store that is being
5247                                                             released.
5248
5249                                                         2. buffer/global/flat_store
5250     store atomic release      - workgroup    - local    1. ds_store
5251     store atomic release      - agent        - global   1. s_waitcnt lgkmcnt(0) &
5252                               - system       - generic     vmcnt(0)
5253
5254                                                           - If OpenCL and
5255                                                             address space is
5256                                                             not generic, omit
5257                                                             lgkmcnt(0).
5258                                                           - Could be split into
5259                                                             separate s_waitcnt
5260                                                             vmcnt(0) and
5261                                                             s_waitcnt
5262                                                             lgkmcnt(0) to allow
5263                                                             them to be
5264                                                             independently moved
5265                                                             according to the
5266                                                             following rules.
5267                                                           - s_waitcnt vmcnt(0)
5268                                                             must happen after
5269                                                             any preceding
5270                                                             global/generic
5271                                                             load/store/load
5272                                                             atomic/store
5273                                                             atomic/atomicrmw.
5274                                                           - s_waitcnt lgkmcnt(0)
5275                                                             must happen after
5276                                                             any preceding
5277                                                             local/generic
5278                                                             load/store/load
5279                                                             atomic/store
5280                                                             atomic/atomicrmw.
5281                                                           - Must happen before
5282                                                             the following
5283                                                             store.
5284                                                           - Ensures that all
5285                                                             memory operations
5286                                                             to memory have
5287                                                             completed before
5288                                                             performing the
5289                                                             store that is being
5290                                                             released.
5291
5292                                                         2. buffer/global/flat_store
5293     atomicrmw    release      - singlethread - global   1. buffer/global/ds/flat_atomic
5294                               - wavefront    - local
5295                                              - generic
5296     atomicrmw    release      - workgroup    - global   1. s_waitcnt lgkmcnt(0)
5297                                              - generic
5298                                                           - If OpenCL, omit.
5299                                                           - Must happen after
5300                                                             any preceding
5301                                                             local/generic
5302                                                             load/store/load
5303                                                             atomic/store
5304                                                             atomic/atomicrmw.
5305                                                           - Must happen before
5306                                                             the following
5307                                                             atomicrmw.
5308                                                           - Ensures that all
5309                                                             memory operations
5310                                                             to local have
5311                                                             completed before
5312                                                             performing the
5313                                                             atomicrmw that is
5314                                                             being released.
5315
5316                                                         2. buffer/global/flat_atomic
5317     atomicrmw    release      - workgroup    - local    1. ds_atomic
5318     atomicrmw    release      - agent        - global   1. s_waitcnt lgkmcnt(0) &
5319                               - system       - generic     vmcnt(0)
5320
5321                                                           - If OpenCL, omit
5322                                                             lgkmcnt(0).
5323                                                           - Could be split into
5324                                                             separate s_waitcnt
5325                                                             vmcnt(0) and
5326                                                             s_waitcnt
5327                                                             lgkmcnt(0) to allow
5328                                                             them to be
5329                                                             independently moved
5330                                                             according to the
5331                                                             following rules.
5332                                                           - s_waitcnt vmcnt(0)
5333                                                             must happen after
5334                                                             any preceding
5335                                                             global/generic
5336                                                             load/store/load
5337                                                             atomic/store
5338                                                             atomic/atomicrmw.
5339                                                           - s_waitcnt lgkmcnt(0)
5340                                                             must happen after
5341                                                             any preceding
5342                                                             local/generic
5343                                                             load/store/load
5344                                                             atomic/store
5345                                                             atomic/atomicrmw.
5346                                                           - Must happen before
5347                                                             the following
5348                                                             atomicrmw.
5349                                                           - Ensures that all
5350                                                             memory operations
5351                                                             to global and local
5352                                                             have completed
5353                                                             before performing
5354                                                             the atomicrmw that
5355                                                             is being released.
5356
5357                                                         2. buffer/global/flat_atomic
5358     fence        release      - singlethread *none*     *none*
5359                               - wavefront
5360     fence        release      - workgroup    *none*     1. s_waitcnt lgkmcnt(0)
5361
5362                                                           - If OpenCL and
5363                                                             address space is
5364                                                             not generic, omit.
5365                                                           - However, since LLVM
5366                                                             currently has no
5367                                                             address space on
5368                                                             the fence, this
5369                                                             must conservatively
5370                                                             always be generated.
5371                                                             If the fence had an
5372                                                             address space, it
5373                                                             would be set to the
5374                                                             address space of
5375                                                             the OpenCL fence
5376                                                             flag, or to generic
5377                                                             if both local and
5378                                                             global flags are
5379                                                             specified.
5380                                                           - Must happen after
5381                                                             any preceding
5382                                                             local/generic
5383                                                             load/load
5384                                                             atomic/store/store
5385                                                             atomic/atomicrmw.
5386                                                           - Must happen before
5387                                                             any following store
5388                                                             atomic/atomicrmw
5389                                                             with an equal or
5390                                                             wider sync scope
5391                                                             and memory ordering
5392                                                             stronger than
5393                                                             unordered (this is
5394                                                             termed the
5395                                                             fence-paired-atomic).
5396                                                           - Ensures that all
5397                                                             memory operations
5398                                                             to local have
5399                                                             completed before
5400                                                             performing the
5401                                                             following
5402                                                             fence-paired-atomic.
5403
5404     fence        release      - agent        *none*     1. s_waitcnt lgkmcnt(0) &
5405                               - system                     vmcnt(0)
5406
5407                                                           - If OpenCL and
5408                                                             address space is
5409                                                             not generic, omit
5410                                                             lgkmcnt(0).
5411                                                           - If OpenCL and
5412                                                             address space is
5413                                                             local, omit
5414                                                             vmcnt(0).
5415                                                           - However, since LLVM
5416                                                             currently has no
5417                                                             address space on
5418                                                             the fence, this
5419                                                             must conservatively
5420                                                             always be generated.
5421                                                             If the fence had an
5422                                                             address space, it
5423                                                             would be set to the
5424                                                             address space of
5425                                                             the OpenCL fence
5426                                                             flag, or to generic
5427                                                             if both local and
5428                                                             global flags are
5429                                                             specified.
5430                                                           - Could be split into
5431                                                             separate s_waitcnt
5432                                                             vmcnt(0) and
5433                                                             s_waitcnt
5434                                                             lgkmcnt(0) to allow
5435                                                             them to be
5436                                                             independently moved
5437                                                             according to the
5438                                                             following rules.
5439                                                           - s_waitcnt vmcnt(0)
5440                                                             must happen after
5441                                                             any preceding
5442                                                             global/generic
5443                                                             load/store/load
5444                                                             atomic/store
5445                                                             atomic/atomicrmw.
5446                                                           - s_waitcnt lgkmcnt(0)
5447                                                             must happen after
5448                                                             any preceding
5449                                                             local/generic
5450                                                             load/store/load
5451                                                             atomic/store
5452                                                             atomic/atomicrmw.
5453                                                           - Must happen before
5454                                                             any following store
5455                                                             atomic/atomicrmw
5456                                                             with an equal or
5457                                                             wider sync scope
5458                                                             and memory ordering
5459                                                             stronger than
5460                                                             unordered (this is
5461                                                             termed the
5462                                                             fence-paired-atomic).
5463                                                           - Ensures that all
5464                                                             memory operations
5465                                                             have
5466                                                             completed before
5467                                                             performing the
5468                                                             following
5469                                                             fence-paired-atomic.
5470
5471     **Acquire-Release Atomic**
5472     ------------------------------------------------------------------------------------
5473     atomicrmw    acq_rel      - singlethread - global   1. buffer/global/ds/flat_atomic
5474                               - wavefront    - local
5475                                              - generic
5476     atomicrmw    acq_rel      - workgroup    - global   1. s_waitcnt lgkmcnt(0)
5477
5478                                                           - If OpenCL, omit.
5479                                                           - Must happen after
5480                                                             any preceding
5481                                                             local/generic
5482                                                             load/store/load
5483                                                             atomic/store
5484                                                             atomic/atomicrmw.
5485                                                           - Must happen before
5486                                                             the following
5487                                                             atomicrmw.
5488                                                           - Ensures that all
5489                                                             memory operations
5490                                                             to local have
5491                                                             completed before
5492                                                             performing the
5493                                                             atomicrmw that is
5494                                                             being released.
5495
5496                                                         2. buffer/global_atomic
5497
5498     atomicrmw    acq_rel      - workgroup    - local    1. ds_atomic
5499                                                         2. s_waitcnt lgkmcnt(0)
5500
5501                                                           - If OpenCL, omit.
5502                                                           - Must happen before
5503                                                             any following
5504                                                             global/generic
5505                                                             load/load
5506                                                             atomic/store/store
5507                                                             atomic/atomicrmw.
5508                                                           - Ensures any
5509                                                             following global
5510                                                             data read is no
5511                                                             older than the local
5512                                                             atomicrmw value being
5513                                                             acquired.
5514
5515     atomicrmw    acq_rel      - workgroup    - generic  1. s_waitcnt lgkmcnt(0)
5516
5517                                                           - If OpenCL, omit.
5518                                                           - Must happen after
5519                                                             any preceding
5520                                                             local/generic
5521                                                             load/store/load
5522                                                             atomic/store
5523                                                             atomic/atomicrmw.
5524                                                           - Must happen before
5525                                                             the following
5526                                                             atomicrmw.
5527                                                           - Ensures that all
5528                                                             memory operations
5529                                                             to local have
5530                                                             completed before
5531                                                             performing the
5532                                                             atomicrmw that is
5533                                                             being released.
5534
5535                                                         2. flat_atomic
5536                                                         3. s_waitcnt lgkmcnt(0)
5537
5538                                                           - If OpenCL, omit.
5539                                                           - Must happen before
5540                                                             any following
5541                                                             global/generic
5542                                                             load/load
5543                                                             atomic/store/store
5544                                                             atomic/atomicrmw.
5545                                                           - Ensures any
5546                                                             following global
5547                                                             data read is no
5548                                                             older than a local
5549                                                             atomicrmw value being
5550                                                             acquired.
5551
5552     atomicrmw    acq_rel      - agent        - global   1. s_waitcnt lgkmcnt(0) &
5553                               - system                     vmcnt(0)
5554
5555                                                           - If OpenCL, omit
5556                                                             lgkmcnt(0).
5557                                                           - Could be split into
5558                                                             separate s_waitcnt
5559                                                             vmcnt(0) and
5560                                                             s_waitcnt
5561                                                             lgkmcnt(0) to allow
5562                                                             them to be
5563                                                             independently moved
5564                                                             according to the
5565                                                             following rules.
5566                                                           - s_waitcnt vmcnt(0)
5567                                                             must happen after
5568                                                             any preceding
5569                                                             global/generic
5570                                                             load/store/load
5571                                                             atomic/store
5572                                                             atomic/atomicrmw.
5573                                                           - s_waitcnt lgkmcnt(0)
5574                                                             must happen after
5575                                                             any preceding
5576                                                             local/generic
5577                                                             load/store/load
5578                                                             atomic/store
5579                                                             atomic/atomicrmw.
5580                                                           - Must happen before
5581                                                             the following
5582                                                             atomicrmw.
5583                                                           - Ensures that all
5584                                                             memory operations
5585                                                             to global and local
5586                                                             have completed
5587                                                             before performing
5588                                                             the atomicrmw that
5589                                                             is being released.
5590
5591                                                         2. buffer/global_atomic
5592                                                         3. s_waitcnt vmcnt(0)
5593
5594                                                           - Must happen before
5595                                                             following
5596                                                             buffer_wbinvl1_vol.
5597                                                           - Ensures the
5598                                                             atomicrmw has
5599                                                             completed before
5600                                                             invalidating the
5601                                                             cache.
5602
5603                                                         4. buffer_wbinvl1_vol
5604
5605                                                           - Must happen before
5606                                                             any following
5607                                                             global/generic
5608                                                             load/load
5609                                                             atomic/atomicrmw.
5610                                                           - Ensures that
5611                                                             following loads
5612                                                             will not see stale
5613                                                             global data.
5614
5615     atomicrmw    acq_rel      - agent        - generic  1. s_waitcnt lgkmcnt(0) &
5616                               - system                     vmcnt(0)
5617
5618                                                           - If OpenCL, omit
5619                                                             lgkmcnt(0).
5620                                                           - Could be split into
5621                                                             separate s_waitcnt
5622                                                             vmcnt(0) and
5623                                                             s_waitcnt
5624                                                             lgkmcnt(0) to allow
5625                                                             them to be
5626                                                             independently moved
5627                                                             according to the
5628                                                             following rules.
5629                                                           - s_waitcnt vmcnt(0)
5630                                                             must happen after
5631                                                             any preceding
5632                                                             global/generic
5633                                                             load/store/load
5634                                                             atomic/store
5635                                                             atomic/atomicrmw.
5636                                                           - s_waitcnt lgkmcnt(0)
5637                                                             must happen after
5638                                                             any preceding
5639                                                             local/generic
5640                                                             load/store/load
5641                                                             atomic/store
5642                                                             atomic/atomicrmw.
5643                                                           - Must happen before
5644                                                             the following
5645                                                             atomicrmw.
5646                                                           - Ensures that all
5647                                                             memory operations
5648                                                             to global and local
5649                                                             have completed
5650                                                             before performing
5651                                                             the atomicrmw that
5652                                                             is being released.
5653
5654                                                         2. flat_atomic
5655                                                         3. s_waitcnt vmcnt(0) &
5656                                                            lgkmcnt(0)
5657
5658                                                           - If OpenCL, omit
5659                                                             lgkmcnt(0).
5660                                                           - Must happen before
5661                                                             following
5662                                                             buffer_wbinvl1_vol.
5663                                                           - Ensures the
5664                                                             atomicrmw has
5665                                                             completed before
5666                                                             invalidating the
5667                                                             cache.
5668
5669                                                         4. buffer_wbinvl1_vol
5670
5671                                                           - Must happen before
5672                                                             any following
5673                                                             global/generic
5674                                                             load/load
5675                                                             atomic/atomicrmw.
5676                                                           - Ensures that
5677                                                             following loads
5678                                                             will not see stale
5679                                                             global data.
5680
5681     fence        acq_rel      - singlethread *none*     *none*
5682                               - wavefront
5683     fence        acq_rel      - workgroup    *none*     1. s_waitcnt lgkmcnt(0)
5684
5685                                                           - If OpenCL and
5686                                                             address space is
5687                                                             not generic, omit.
5688                                                           - However,
5689                                                             since LLVM
5690                                                             currently has no
5691                                                             address space on
5692                                                             the fence, this
5693                                                             must conservatively
5694                                                             always be generated
5695                                                             (see comment for
5696                                                             previous fence).
5697                                                           - Must happen after
5698                                                             any preceding
5699                                                             local/generic
5700                                                             load/load
5701                                                             atomic/store/store
5702                                                             atomic/atomicrmw.
5703                                                           - Must happen before
5704                                                             any following
5705                                                             global/generic
5706                                                             load/load
5707                                                             atomic/store/store
5708                                                             atomic/atomicrmw.
5709                                                           - Ensures that all
5710                                                             memory operations
5711                                                             to local have
5712                                                             completed before
5713                                                             performing any
5714                                                             following global
5715                                                             memory operations.
5716                                                           - Ensures that the
5717                                                             preceding
5718                                                             local/generic load
5719                                                             atomic/atomicrmw
5720                                                             with an equal or
5721                                                             wider sync scope
5722                                                             and memory ordering
5723                                                             stronger than
5724                                                             unordered (this is
5725                                                             termed the
5726                                                             acquire-fence-paired-atomic)
5727                                                             has completed
5728                                                             before following
5729                                                             global memory
5730                                                             operations. This
5731                                                             satisfies the
5732                                                             requirements of
5733                                                             acquire.
5734                                                           - Ensures that all
5735                                                             previous memory
5736                                                             operations have
5737                                                             completed before a
5738                                                             following
5739                                                             local/generic store
5740                                                             atomic/atomicrmw
5741                                                             with an equal or
5742                                                             wider sync scope
5743                                                             and memory ordering
5744                                                             stronger than
5745                                                             unordered (this is
5746                                                             termed the
5747                                                             release-fence-paired-atomic).
5748                                                             This satisfies the
5749                                                             requirements of
5750                                                             release.
5751
5752     fence        acq_rel      - agent        *none*     1. s_waitcnt lgkmcnt(0) &
5753                               - system                     vmcnt(0)
5754
5755                                                           - If OpenCL and
5756                                                             address space is
5757                                                             not generic, omit
5758                                                             lgkmcnt(0).
5759                                                           - However, since LLVM
5760                                                             currently has no
5761                                                             address space on
5762                                                             the fence, it must
5763                                                             conservatively
5764                                                             always be generated
5765                                                             (see comment for
5766                                                             previous fence).
5767                                                           - Could be split into
5768                                                             separate s_waitcnt
5769                                                             vmcnt(0) and
5770                                                             s_waitcnt
5771                                                             lgkmcnt(0) to allow
5772                                                             them to be
5773                                                             independently moved
5774                                                             according to the
5775                                                             following rules.
5776                                                           - s_waitcnt vmcnt(0)
5777                                                             must happen after
5778                                                             any preceding
5779                                                             global/generic
5780                                                             load/store/load
5781                                                             atomic/store
5782                                                             atomic/atomicrmw.
5783                                                           - s_waitcnt lgkmcnt(0)
5784                                                             must happen after
5785                                                             any preceding
5786                                                             local/generic
5787                                                             load/store/load
5788                                                             atomic/store
5789                                                             atomic/atomicrmw.
5790                                                           - Must happen before
5791                                                             the following
5792                                                             buffer_wbinvl1_vol.
5793                                                           - Ensures that the
5794                                                             preceding
5795                                                             global/local/generic
5796                                                             load
5797                                                             atomic/atomicrmw
5798                                                             with an equal or
5799                                                             wider sync scope
5800                                                             and memory ordering
5801                                                             stronger than
5802                                                             unordered (this is
5803                                                             termed the
5804                                                             acquire-fence-paired-atomic)
5805                                                             has completed
5806                                                             before invalidating
5807                                                             the cache. This
5808                                                             satisfies the
5809                                                             requirements of
5810                                                             acquire.
5811                                                           - Ensures that all
5812                                                             previous memory
5813                                                             operations have
5814                                                             completed before a
5815                                                             following
5816                                                             global/local/generic
5817                                                             store
5818                                                             atomic/atomicrmw
5819                                                             with an equal or
5820                                                             wider sync scope
5821                                                             and memory ordering
5822                                                             stronger than
5823                                                             unordered (this is
5824                                                             termed the
5825                                                             release-fence-paired-atomic).
5826                                                             This satisfies the
5827                                                             requirements of
5828                                                             release.
5829
5830                                                         2. buffer_wbinvl1_vol
5831
5832                                                           - Must happen before
5833                                                             any following
5834                                                             global/generic
5835                                                             load/load
5836                                                             atomic/store/store
5837                                                             atomic/atomicrmw.
5838                                                           - Ensures that
5839                                                             following loads
5840                                                             will not see stale
5841                                                             global data. This
5842                                                             satisfies the
5843                                                             requirements of
5844                                                             acquire.
5845
5846     **Sequentially Consistent Atomic**
5847     ------------------------------------------------------------------------------------
5848     load atomic  seq_cst      - singlethread - global   *Same as corresponding
5849                               - wavefront    - local    load atomic acquire,
5850                                              - generic  except must generate
5851                                                         all instructions even
5852                                                         for OpenCL.*
5853     load atomic  seq_cst      - workgroup    - global   1. s_waitcnt lgkmcnt(0)
5854                                              - generic
5855
5856                                                           - Must
5857                                                             happen after
5858                                                             preceding
5859                                                             local/generic load
5860                                                             atomic/store
5861                                                             atomic/atomicrmw
5862                                                             with memory
5863                                                             ordering of seq_cst
5864                                                             and with equal or
5865                                                             wider sync scope.
5866                                                             (Note that seq_cst
5867                                                             fences have their
5868                                                             own s_waitcnt
5869                                                             lgkmcnt(0) and so do
5870                                                             not need to be
5871                                                             considered.)
5872                                                           - Ensures any
5873                                                             preceding
5874                                                             sequentially
5875                                                             consistent local
5876                                                             memory instructions
5877                                                             have completed
5878                                                             before executing
5879                                                             this sequentially
5880                                                             consistent
5881                                                             instruction. This
5882                                                             prevents reordering
5883                                                             a seq_cst store
5884                                                             followed by a
5885                                                             seq_cst load. (Note
5886                                                             that seq_cst is
5887                                                             stronger than
5888                                                             acquire/release as
5889                                                             the reordering of
5890                                                             load acquire
5891                                                             followed by a store
5892                                                             release is
5893                                                             prevented by the
5894                                                             s_waitcnt of
5895                                                             the release, but
5896                                                             there is nothing
5897                                                             preventing a store
5898                                                             release followed by
5899                                                             load acquire from
5900                                                             completing out of
5901                                                             order. The s_waitcnt
5902                                                             could be placed after
5903                                                             the seq_cst store or
5904                                                             before the seq_cst load. We
5905                                                             choose the load to
5906                                                             make the s_waitcnt be
5907                                                             as late as possible
5908                                                             so that the store
5909                                                             may have already
5910                                                             completed.)
5911
5912                                                         2. *Following
5913                                                            instructions same as
5914                                                            corresponding load
5915                                                            atomic acquire,
5916                                                            except must generate
5917                                                            all instructions even
5918                                                            for OpenCL.*
5919     load atomic  seq_cst      - workgroup    - local    *Same as corresponding
5920                                                         load atomic acquire,
5921                                                         except must generate
5922                                                         all instructions even
5923                                                         for OpenCL.*
5924
5925     load atomic  seq_cst      - agent        - global   1. s_waitcnt lgkmcnt(0) &
5926                               - system       - generic     vmcnt(0)
5927
5928                                                           - Could be split into
5929                                                             separate s_waitcnt
5930                                                             vmcnt(0)
5931                                                             and s_waitcnt
5932                                                             lgkmcnt(0) to allow
5933                                                             them to be
5934                                                             independently moved
5935                                                             according to the
5936                                                             following rules.
5937                                                           - s_waitcnt lgkmcnt(0)
5938                                                             must happen after
5939                                                             preceding
5940                                                             global/generic load
5941                                                             atomic/store
5942                                                             atomic/atomicrmw
5943                                                             with memory
5944                                                             ordering of seq_cst
5945                                                             and with equal or
5946                                                             wider sync scope.
5947                                                             (Note that seq_cst
5948                                                             fences have their
5949                                                             own s_waitcnt
5950                                                             lgkmcnt(0) and so do
5951                                                             not need to be
5952                                                             considered.)
5953                                                           - s_waitcnt vmcnt(0)
5954                                                             must happen after
5955                                                             preceding
5956                                                             global/generic load
5957                                                             atomic/store
5958                                                             atomic/atomicrmw
5959                                                             with memory
5960                                                             ordering of seq_cst
5961                                                             and with equal or
5962                                                             wider sync scope.
5963                                                             (Note that seq_cst
5964                                                             fences have their
5965                                                             own s_waitcnt
5966                                                             vmcnt(0) and so do
5967                                                             not need to be
5968                                                             considered.)
5969                                                           - Ensures any
5970                                                             preceding
5971                                                             sequentially
5972                                                             consistent global
5973                                                             memory instructions
5974                                                             have completed
5975                                                             before executing
5976                                                             this sequentially
5977                                                             consistent
5978                                                             instruction. This
5979                                                             prevents reordering
5980                                                             a seq_cst store
5981                                                             followed by a
5982                                                             seq_cst load. (Note
5983                                                             that seq_cst is
5984                                                             stronger than
5985                                                             acquire/release as
5986                                                             the reordering of
5987                                                             load acquire
5988                                                             followed by a store
5989                                                             release is
5990                                                             prevented by the
5991                                                             s_waitcnt of
5992                                                             the release, but
5993                                                             there is nothing
5994                                                             preventing a store
5995                                                             release followed by
5996                                                             load acquire from
5997                                                             completing out of
5998                                                             order. The s_waitcnt
5999                                                             could be placed after
6000                                                             the seq_cst store or
6001                                                             before the seq_cst load. We
6002                                                             choose the load to
6003                                                             make the s_waitcnt be
6004                                                             as late as possible
6005                                                             so that the store
6006                                                             may have already
6007                                                             completed.)
6008
6009                                                         2. *Following
6010                                                            instructions same as
6011                                                            corresponding load
6012                                                            atomic acquire,
6013                                                            except must generate
6014                                                            all instructions even
6015                                                            for OpenCL.*
6016     store atomic seq_cst      - singlethread - global   *Same as corresponding
6017                               - wavefront    - local    store atomic release,
6018                               - workgroup    - generic  except must generate
6019                               - agent                   all instructions even
6020                               - system                  for OpenCL.*
6021     atomicrmw    seq_cst      - singlethread - global   *Same as corresponding
6022                               - wavefront    - local    atomicrmw acq_rel,
6023                               - workgroup    - generic  except must generate
6024                               - agent                   all instructions even
6025                               - system                  for OpenCL.*
6026     fence        seq_cst      - singlethread *none*     *Same as corresponding
6027                               - wavefront               fence acq_rel,
6028                               - workgroup               except must generate
6029                               - agent                   all instructions even
6030                               - system                  for OpenCL.*
6031     ============ ============ ============== ========== ================================
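
As a reading aid for the ``load atomic`` ``seq_cst`` rows of the table above,
the following Python sketch transcribes which leading ``s_waitcnt`` each listed
sync scope and address space combination starts with. The names
``SEQ_CST_LOAD_LEADING_WAIT`` and ``seq_cst_load_leading_wait`` are hypothetical
and are not part of LLVM. The sketch models only the first step of each row, not
the ``load atomic acquire`` sequence that follows it.

```python
# Leading wait for a seq_cst "load atomic", keyed by (sync scope, address space).
# Transcribed from the table rows above; all names here are illustrative only.
# None means the row lists no extra leading wait beyond the acquire sequence.
WORKGROUP_WAIT = "s_waitcnt lgkmcnt(0)"
AGENT_WAIT = "s_waitcnt lgkmcnt(0) & vmcnt(0)"

SEQ_CST_LOAD_LEADING_WAIT = {}

# singlethread/wavefront: same as the corresponding load atomic acquire.
for scope in ("singlethread", "wavefront"):
    for space in ("global", "local", "generic"):
        SEQ_CST_LOAD_LEADING_WAIT[(scope, space)] = None

# workgroup: global/generic wait on lgkmcnt; local is the same as acquire.
for space in ("global", "generic"):
    SEQ_CST_LOAD_LEADING_WAIT[("workgroup", space)] = WORKGROUP_WAIT
SEQ_CST_LOAD_LEADING_WAIT[("workgroup", "local")] = None

# agent/system: global/generic wait on both counters.
for scope in ("agent", "system"):
    for space in ("global", "generic"):
        SEQ_CST_LOAD_LEADING_WAIT[(scope, space)] = AGENT_WAIT


def seq_cst_load_leading_wait(scope, space):
    """Return the leading wait for a table row, or None if the row lists none.

    Raises KeyError for combinations that have no row in the table.
    """
    return SEQ_CST_LOAD_LEADING_WAIT[(scope, space)]
```

For example, ``seq_cst_load_leading_wait("workgroup", "local")`` returns
``None`` because that row adds nothing to the corresponding acquire sequence,
while a combination with no row in the table, such as ``("agent", "local")``,
raises ``KeyError``.
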
6032
6033.. _amdgpu-amdhsa-memory-model-gfx90a:
6034
6035Memory Model GFX90A
6036+++++++++++++++++++
6037
6038For GFX90A:
6039
6040* Each agent has multiple shader arrays (SA).
6041* Each SA has multiple compute units (CU).
6042* Each CU has multiple SIMDs that execute wavefronts.
6043* The wavefronts for a single work-group are executed in the same CU but may be
6044  executed by different SIMDs. The exception is tgsplit execution mode, in
6045  which the wavefronts may be executed by different SIMDs in different CUs.
6046* Each CU has a single LDS memory shared by the wavefronts of the work-groups
6047  executing on it. The exception is tgsplit execution mode, in which no LDS
6048  is allocated, as wavefronts of the same work-group can be in different CUs.
6049* All LDS operations of a CU are performed as wavefront wide operations in a
6050  global order and involve no caching. Completion is reported to a wavefront in
6051  execution order.
6052* The LDS memory has multiple request queues shared by the SIMDs of a
6053  CU. Therefore, the LDS operations performed by different wavefronts of a
6054  work-group can be reordered relative to each other, which can result in
6055  reordering the visibility of vector memory operations with respect to LDS
6056  operations of other wavefronts in the same work-group. A ``s_waitcnt
6057  lgkmcnt(0)`` is required to ensure synchronization between LDS operations and
6058  vector memory operations between wavefronts of a work-group, but not between
6059  operations performed by the same wavefront.
6060* The vector memory operations are performed as wavefront wide operations and
6061  completion is reported to a wavefront in execution order. The exception is
6062  that ``flat_load/store/atomic`` instructions can report out of vector memory
6063  order if they access LDS memory, and out of LDS operation order if they access
6064  global memory.
6065* The vector memory operations access a single vector L1 cache shared by all
6066  SIMDs of a CU. Therefore:
6067
6068  * No special action is required for coherence between the lanes of a single
6069    wavefront.
6070
6071  * No special action is required for coherence between wavefronts in the same
6072    work-group since they execute on the same CU. The exception is tgsplit
6073    execution mode, where wavefronts of the same work-group can be in
6074    different CUs and so a ``buffer_wbinvl1_vol`` is required as described in
6075    the following item.
6076
6077  * A ``buffer_wbinvl1_vol`` is required for coherence between wavefronts
6078    executing in different work-groups as they may be executing on different
6079    CUs.
6080
6081* The scalar memory operations access a scalar L1 cache shared by all wavefronts
6082  on a group of CUs. The scalar and vector L1 caches are not coherent. However,
6083  scalar operations are used in a restricted way so do not impact the memory
6084  model. See :ref:`amdgpu-amdhsa-memory-spaces`.
6085* The vector and scalar memory operations use an L2 cache shared by all CUs on
6086  the same agent.
6087
6088  * The L2 cache has independent channels to service disjoint ranges of virtual
6089    addresses.
6090  * Each CU has a separate request queue per channel. Therefore, the vector and
6091    scalar memory operations performed by wavefronts executing in different
6092    work-groups (which may be executing on different CUs), or the same
6093    work-group if executing in tgsplit mode, of an agent can be reordered
6094    relative to each other. A ``s_waitcnt vmcnt(0)`` is required to ensure
6095    synchronization between vector memory operations of different CUs. It
6096    ensures a previous vector memory operation has completed before executing a
6097    subsequent vector memory or LDS operation and so can be used to meet the
6098    requirements of acquire and release.
6099  * The L2 cache of one agent can be kept coherent with other agents by:
6100    using the MTYPE RW (read-write) or MTYPE CC (cache-coherent) with the PTE
6101    C-bit for memory local to the L2; and using the MTYPE NC (non-coherent) with
6102    the PTE C-bit set or MTYPE UC (uncached) for memory not local to the L2.
6103
6104    * Any local memory cache lines will be automatically invalidated by writes
6105      from CUs associated with other L2 caches, or writes from the CPU, due to
6106      the cache probe caused by coherent requests. Coherent requests are caused
6107      by GPU accesses to pages with the PTE C-bit set, by CPU accesses over
6108      XGMI, and by PCIe requests that are configured to be coherent requests.
6109    * XGMI accesses from the CPU to local memory may be cached on the CPU.
6110      Subsequent access from the GPU will automatically invalidate or writeback
6111      the CPU cache due to the L2 probe filter and the PTE C-bit being set.
6112    * Since all work-groups on the same agent share the same L2, no L2
6113      invalidation or writeback is required for coherence.
6114    * To ensure coherence of local and remote memory writes of work-groups in
6115      different agents a ``buffer_wbl2`` is required. It will writeback dirty L2
6116      cache lines of MTYPE RW (used for local coarse grain memory) and MTYPE NC
6117      (used for remote coarse grain memory). Note that MTYPE CC (used for local
6118      fine grain memory) causes write through to DRAM, and MTYPE UC (used for
6119      remote fine grain memory) bypasses the L2, so both will never result in
6120      dirty L2 cache lines.
6121    * To ensure coherence of local and remote memory reads of work-groups in
6122      different agents a ``buffer_invl2`` is required. It will invalidate L2
6123      cache lines with MTYPE NC (used for remote coarse grain memory). Note that
6124      MTYPE CC (used for local fine grain memory) and MTYPE RW (used for local
6125      coarse grain memory) cause local reads to be invalidated by remote writes
6126      with the PTE C-bit so these cache lines are not invalidated. Note that
6127      MTYPE UC (used for remote fine grain memory) bypasses the L2, so will
6128      never result in L2 cache lines that need to be invalidated.
6129
6130  * PCIe access from the GPU to the CPU memory is kept coherent by using the
6131    MTYPE UC (uncached) which bypasses the L2.
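
The L2 maintenance rules in the list above can be summarized per MTYPE. The
following Python sketch is only a lookup table built from that text. The
dictionary ``MTYPE_L2``, its field names, and the two helper functions are
hypothetical names chosen for illustration and do not correspond to any LLVM or
driver API.

```python
# Illustrative summary of GFX90A L2 behaviour per MTYPE, taken from the list
# above. The field names are assumptions made for this sketch.
MTYPE_L2 = {
    # Local coarse grain memory: may hold dirty lines, kept valid by probes.
    "RW": {"use": "local coarse grain", "can_be_dirty": True, "invl2": False},
    # Remote coarse grain memory: may hold dirty lines, needs invalidation.
    "NC": {"use": "remote coarse grain", "can_be_dirty": True, "invl2": True},
    # Local fine grain memory: write through to DRAM, so never dirty.
    "CC": {"use": "local fine grain", "can_be_dirty": False, "invl2": False},
    # Remote fine grain memory: bypasses the L2 entirely.
    "UC": {"use": "remote fine grain", "can_be_dirty": False, "invl2": False},
}


def written_back_by_wbl2(mtype):
    """True if buffer_wbl2 can have dirty lines of this MTYPE to write back."""
    return MTYPE_L2[mtype]["can_be_dirty"]


def invalidated_by_invl2(mtype):
    """True if buffer_invl2 invalidates L2 lines of this MTYPE."""
    return MTYPE_L2[mtype]["invl2"]
```

Under these assumptions, ``buffer_wbl2`` matters only for ``RW`` and ``NC``
lines, and ``buffer_invl2`` only for ``NC`` lines.
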
6132
6133Scalar memory operations are only used to access memory that is proven to not
6134change during the execution of the kernel dispatch. This includes constant
6135address space and global address space for program scope ``const`` variables.
6136Therefore, the kernel machine code does not have to maintain the scalar cache to
6137ensure it is coherent with the vector caches. The scalar and vector caches are
6138invalidated between kernel dispatches by CP since constant address space data
6139may change between kernel dispatch executions. See
6140:ref:`amdgpu-amdhsa-memory-spaces`.
6141
6142The one exception is if scalar writes are used to spill SGPR registers. In this
6143case the AMDGPU backend ensures the memory location used to spill is never
6144accessed by vector memory operations at the same time. If scalar writes are used
6145then a ``s_dcache_wb`` is inserted before the ``s_endpgm`` and before a function
6146return since the locations may be used for vector memory instructions by a
6147future wavefront that uses the same scratch area, or a function call that
6148creates a frame at the same address, respectively. There is no need for a
6149``s_dcache_inv`` as all scalar writes are write-before-read in the same thread.
6150
6151For kernarg backing memory:
6152
6153* CP invalidates the L1 cache at the start of each kernel dispatch.
* On dGPU over XGMI or PCIe, the kernarg backing memory is allocated in host
  memory and accessed as MTYPE UC (uncached), which avoids needing to invalidate
  the L2 cache. This also causes it to be treated as non-volatile, so it is not
  invalidated by ``*_vol``.
* On APU, the kernarg backing memory is accessed as MTYPE CC (cache coherent),
  so the L2 cache will be coherent with the CPU and other agents.
6160
6161Scratch backing memory (which is used for the private address space) is accessed
6162with MTYPE NC_NV (non-coherent non-volatile). Since the private address space is
6163only accessed by a single thread, and is always write-before-read, there is
6164never a need to invalidate these entries from the L1 cache. Hence all cache
6165invalidates are done as ``*_vol`` to only invalidate the volatile cache lines.
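
The write-before-read pattern can be sketched in LLVM IR as follows. The example
assumes an ``amdgcn`` data layout, in which ``alloca`` produces pointers in the
private address space (``addrspace(5)``). The kernel name and the stored value
are hypothetical, and an optimizing build may promote the slot to a register
instead of using scratch memory.

.. code-block:: llvm

  ; Hypothetical kernel: one private slot, written before it is read.
  define amdgpu_kernel void @use_scratch(ptr addrspace(1) %out) {
  entry:
    ; The slot is in the private address space, visible only to this thread.
    %slot = alloca i32, align 4, addrspace(5)
    ; The thread stores to the slot before it ever loads from it.
    store i32 42, ptr addrspace(5) %slot, align 4
    %value = load i32, ptr addrspace(5) %slot, align 4
    store i32 %value, ptr addrspace(1) %out, align 4
    ret void
  }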
6166
6167The code sequences used to implement the memory model for GFX90A are defined
6168in table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx90a-table`.
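
To illustrate how a row of the table is read, consider an agent-scope acquire
load from the global address space. The function name and the registers shown
in the comments are hypothetical. The instruction sequence in the comments is
the one the table lists for this case, not captured compiler output.

.. code-block:: llvm

  ; Hypothetical function: acquire a flag written by another work-group.
  define i32 @acquire_load(ptr addrspace(1) %flag) {
  entry:
    %value = load atomic i32, ptr addrspace(1) %flag syncscope("agent") acquire, align 4
    ret i32 %value
  }

  ; Row "load atomic / acquire / agent / global" lists:
  ;   1. global_load_dword v2, v[0:1], off glc   (buffer/global_load glc=1)
  ;   2. s_waitcnt vmcnt(0)                      (load completes before the invalidate)
  ;   3. buffer_wbinvl1_vol                      (following loads do not see stale data)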
6169
6170  .. table:: AMDHSA Memory Model Code Sequences GFX90A
6171     :name: amdgpu-amdhsa-memory-model-code-sequences-gfx90a-table
6172
6173     ============ ============ ============== ========== ================================
6174     LLVM Instr   LLVM Memory  LLVM Memory    AMDGPU     AMDGPU Machine Code
6175                  Ordering     Sync Scope     Address    GFX90A
6176                                              Space
6177     ============ ============ ============== ========== ================================
6178     **Non-Atomic**
6179     ------------------------------------------------------------------------------------
6180     load         *none*       *none*         - global   - !volatile & !nontemporal
6181                                              - generic
6182                                              - private    1. buffer/global/flat_load
6183                                              - constant
6184                                                         - !volatile & nontemporal
6185
6186                                                           1. buffer/global/flat_load
6187                                                              glc=1 slc=1
6188
6189                                                         - volatile
6190
6191                                                           1. buffer/global/flat_load
6192                                                              glc=1
6193                                                           2. s_waitcnt vmcnt(0)
6194
6195                                                            - Must happen before
6196                                                              any following volatile
6197                                                              global/generic
6198                                                              load/store.
6199                                                            - Ensures that
6200                                                              volatile
6201                                                              operations to
6202                                                              different
6203                                                              addresses will not
6204                                                              be reordered by
6205                                                              hardware.
6206
6207     load         *none*       *none*         - local    1. ds_load
6208     store        *none*       *none*         - global   - !volatile & !nontemporal
6209                                              - generic
6210                                              - private    1. buffer/global/flat_store
6211                                              - constant
6212                                                         - !volatile & nontemporal
6213
6214                                                           1. buffer/global/flat_store
6215                                                              glc=1 slc=1
6216
6217                                                         - volatile
6218
6219                                                           1. buffer/global/flat_store
6220                                                           2. s_waitcnt vmcnt(0)
6221
6222                                                            - Must happen before
6223                                                              any following volatile
6224                                                              global/generic
6225                                                              load/store.
6226                                                            - Ensures that
6227                                                              volatile
6228                                                              operations to
6229                                                              different
6230                                                              addresses will not
6231                                                              be reordered by
6232                                                              hardware.
6233
6234     store        *none*       *none*         - local    1. ds_store
6235     **Unordered Atomic**
6236     ------------------------------------------------------------------------------------
6237     load atomic  unordered    *any*          *any*      *Same as non-atomic*.
6238     store atomic unordered    *any*          *any*      *Same as non-atomic*.
6239     atomicrmw    unordered    *any*          *any*      *Same as monotonic atomic*.
6240     **Monotonic Atomic**
6241     ------------------------------------------------------------------------------------
6242     load atomic  monotonic    - singlethread - global   1. buffer/global/flat_load
6243                               - wavefront    - generic
6244     load atomic  monotonic    - workgroup    - global   1. buffer/global/flat_load
6245                                              - generic     glc=1
6246
6247                                                           - If not TgSplit execution
6248                                                             mode, omit glc=1.
6249
6250     load atomic  monotonic    - singlethread - local    *If TgSplit execution mode,
6251                               - wavefront               local address space cannot
6252                               - workgroup               be used.*
6253
6254                                                         1. ds_load
6255     load atomic  monotonic    - agent        - global   1. buffer/global/flat_load
6256                                              - generic     glc=1
6257     load atomic  monotonic    - system       - global   1. buffer/global/flat_load
6258                                              - generic     glc=1
6259     store atomic monotonic    - singlethread - global   1. buffer/global/flat_store
6260                               - wavefront    - generic
6261                               - workgroup
6262                               - agent
6263     store atomic monotonic    - system       - global   1. buffer/global/flat_store
6264                                              - generic
6265     store atomic monotonic    - singlethread - local    *If TgSplit execution mode,
6266                               - wavefront               local address space cannot
6267                               - workgroup               be used.*
6268
6269                                                         1. ds_store
6270     atomicrmw    monotonic    - singlethread - global   1. buffer/global/flat_atomic
6271                               - wavefront    - generic
6272                               - workgroup
6273                               - agent
6274     atomicrmw    monotonic    - system       - global   1. buffer/global/flat_atomic
6275                                              - generic
6276     atomicrmw    monotonic    - singlethread - local    *If TgSplit execution mode,
6277                               - wavefront               local address space cannot
6278                               - workgroup               be used.*
6279
6280                                                         1. ds_atomic
6281     **Acquire Atomic**
6282     ------------------------------------------------------------------------------------
6283     load atomic  acquire      - singlethread - global   1. buffer/global/ds/flat_load
6284                               - wavefront    - local
6285                                              - generic
6286     load atomic  acquire      - workgroup    - global   1. buffer/global_load glc=1
6287
6288                                                           - If not TgSplit execution
6289                                                             mode, omit glc=1.
6290
6291                                                         2. s_waitcnt vmcnt(0)
6292
6293                                                           - If not TgSplit execution
6294                                                             mode, omit.
6295                                                           - Must happen before the
6296                                                             following buffer_wbinvl1_vol.
6297
6298                                                         3. buffer_wbinvl1_vol
6299
6300                                                           - If not TgSplit execution
6301                                                             mode, omit.
6302                                                           - Must happen before
6303                                                             any following
6304                                                             global/generic
6305                                                             load/load
6306                                                             atomic/store/store
6307                                                             atomic/atomicrmw.
6308                                                           - Ensures that
6309                                                             following
6310                                                             loads will not see
6311                                                             stale data.
6312
6313     load atomic  acquire      - workgroup    - local    *If TgSplit execution mode,
6314                                                         local address space cannot
6315                                                         be used.*
6316
6317                                                         1. ds_load
6318                                                         2. s_waitcnt lgkmcnt(0)
6319
6320                                                           - If OpenCL, omit.
6321                                                           - Must happen before
6322                                                             any following
6323                                                             global/generic
6324                                                             load/load
6325                                                             atomic/store/store
6326                                                             atomic/atomicrmw.
6327                                                           - Ensures any
6328                                                             following global
6329                                                             data read is no
6330                                                             older than the local load
6331                                                             atomic value being
6332                                                             acquired.
6333
6334     load atomic  acquire      - workgroup    - generic  1. flat_load glc=1
6335
6336                                                           - If not TgSplit execution
6337                                                             mode, omit glc=1.
6338
6339                                                         2. s_waitcnt lgkm/vmcnt(0)
6340
6341                                                           - Use lgkmcnt(0) if not
6342                                                             TgSplit execution mode
6343                                                             and vmcnt(0) if TgSplit
6344                                                             execution mode.
6345                                                           - If OpenCL, omit lgkmcnt(0).
6346                                                           - Must happen before
6347                                                             the following
6348                                                             buffer_wbinvl1_vol and any
6349                                                             following global/generic
6350                                                             load/load
6351                                                             atomic/store/store
6352                                                             atomic/atomicrmw.
6353                                                           - Ensures any
6354                                                             following global
6355                                                             data read is no
6356                                                             older than a local load
6357                                                             atomic value being
6358                                                             acquired.
6359
6360                                                         3. buffer_wbinvl1_vol
6361
6362                                                           - If not TgSplit execution
6363                                                             mode, omit.
6364                                                           - Ensures that
6365                                                             following
6366                                                             loads will not see
6367                                                             stale data.
6368
6369     load atomic  acquire      - agent        - global   1. buffer/global_load
6370                                                            glc=1
6371                                                         2. s_waitcnt vmcnt(0)
6372
6373                                                           - Must happen before
6374                                                             following
6375                                                             buffer_wbinvl1_vol.
6376                                                           - Ensures the load
6377                                                             has completed
6378                                                             before invalidating
6379                                                             the cache.
6380
6381                                                         3. buffer_wbinvl1_vol
6382
6383                                                           - Must happen before
6384                                                             any following
6385                                                             global/generic
6386                                                             load/load
6387                                                             atomic/atomicrmw.
6388                                                           - Ensures that
6389                                                             following
6390                                                             loads will not see
6391                                                             stale global data.
6392
6393     load atomic  acquire      - system       - global   1. buffer/global/flat_load
6394                                                            glc=1
6395                                                         2. s_waitcnt vmcnt(0)
6396
6397                                                           - Must happen before
6398                                                             following buffer_invl2 and
6399                                                             buffer_wbinvl1_vol.
6400                                                           - Ensures the load
6401                                                             has completed
6402                                                             before invalidating
6403                                                             the cache.
6404
6405                                                         3. buffer_invl2;
6406                                                            buffer_wbinvl1_vol
6407
6408                                                           - Must happen before
6409                                                             any following
6410                                                             global/generic
6411                                                             load/load
6412                                                             atomic/atomicrmw.
6413                                                           - Ensures that
6414                                                             following
6415                                                             loads will not see
6416                                                             stale L1 global data,
6417                                                             nor see stale L2 MTYPE
6418                                                             NC global data.
6419                                                             MTYPE RW and CC memory will
6420                                                             never be stale in L2 due to
6421                                                             the memory probes.
6422
6423     load atomic  acquire      - agent        - generic  1. flat_load glc=1
6424                                                         2. s_waitcnt vmcnt(0) &
6425                                                            lgkmcnt(0)
6426
6427                                                           - If TgSplit execution mode,
6428                                                             omit lgkmcnt(0).
                                                           - If OpenCL, omit
                                                             lgkmcnt(0).
6431                                                           - Must happen before
6432                                                             following
6433                                                             buffer_wbinvl1_vol.
6434                                                           - Ensures the flat_load
6435                                                             has completed
6436                                                             before invalidating
6437                                                             the cache.
6438
6439                                                         3. buffer_wbinvl1_vol
6440
6441                                                           - Must happen before
6442                                                             any following
6443                                                             global/generic
6444                                                             load/load
6445                                                             atomic/atomicrmw.
6446                                                           - Ensures that
6447                                                             following loads
6448                                                             will not see stale
6449                                                             global data.
6450
6451     load atomic  acquire      - system       - generic  1. flat_load glc=1
6452                                                         2. s_waitcnt vmcnt(0) &
6453                                                            lgkmcnt(0)
6454
6455                                                           - If TgSplit execution mode,
6456                                                             omit lgkmcnt(0).
                                                           - If OpenCL, omit
                                                             lgkmcnt(0).
6459                                                           - Must happen before
6460                                                             following
6461                                                             buffer_invl2 and
6462                                                             buffer_wbinvl1_vol.
6463                                                           - Ensures the flat_load
6464                                                             has completed
6465                                                             before invalidating
6466                                                             the caches.
6467
6468                                                         3. buffer_invl2;
6469                                                            buffer_wbinvl1_vol
6470
6471                                                           - Must happen before
6472                                                             any following
6473                                                             global/generic
6474                                                             load/load
6475                                                             atomic/atomicrmw.
6476                                                           - Ensures that
6477                                                             following
6478                                                             loads will not see
6479                                                             stale L1 global data,
6480                                                             nor see stale L2 MTYPE
6481                                                             NC global data.
6482                                                             MTYPE RW and CC memory will
6483                                                             never be stale in L2 due to
6484                                                             the memory probes.
6485
6486     atomicrmw    acquire      - singlethread - global   1. buffer/global/flat_atomic
6487                               - wavefront    - generic
6488     atomicrmw    acquire      - singlethread - local    *If TgSplit execution mode,
6489                               - wavefront               local address space cannot
6490                                                         be used.*
6491
6492                                                         1. ds_atomic
6493     atomicrmw    acquire      - workgroup    - global   1. buffer/global_atomic
6494                                                         2. s_waitcnt vmcnt(0)
6495
6496                                                           - If not TgSplit execution
6497                                                             mode, omit.
6498                                                           - Must happen before the
6499                                                             following buffer_wbinvl1_vol.
6500                                                           - Ensures the atomicrmw
6501                                                             has completed
6502                                                             before invalidating
6503                                                             the cache.
6504
6505                                                         3. buffer_wbinvl1_vol
6506
6507                                                           - If not TgSplit execution
6508                                                             mode, omit.
6509                                                           - Must happen before
6510                                                             any following
6511                                                             global/generic
6512                                                             load/load
6513                                                             atomic/atomicrmw.
6514                                                           - Ensures that
6515                                                             following loads
6516                                                             will not see stale
6517                                                             global data.
6518
6519     atomicrmw    acquire      - workgroup    - local    *If TgSplit execution mode,
6520                                                         local address space cannot
6521                                                         be used.*
6522
6523                                                         1. ds_atomic
6524                                                         2. s_waitcnt lgkmcnt(0)
6525
6526                                                           - If OpenCL, omit.
6527                                                           - Must happen before
6528                                                             any following
6529                                                             global/generic
6530                                                             load/load
6531                                                             atomic/store/store
6532                                                             atomic/atomicrmw.
6533                                                           - Ensures any
6534                                                             following global
6535                                                             data read is no
6536                                                             older than the local
6537                                                             atomicrmw value
6538                                                             being acquired.
6539
6540     atomicrmw    acquire      - workgroup    - generic  1. flat_atomic
6541                                                         2. s_waitcnt lgkm/vmcnt(0)
6542
6543                                                           - Use lgkmcnt(0) if not
6544                                                             TgSplit execution mode
6545                                                             and vmcnt(0) if TgSplit
6546                                                             execution mode.
6547                                                           - If OpenCL, omit lgkmcnt(0).
6548                                                           - Must happen before
6549                                                             the following
6550                                                             buffer_wbinvl1_vol and
6551                                                             any following
6552                                                             global/generic
6553                                                             load/load
6554                                                             atomic/store/store
6555                                                             atomic/atomicrmw.
6556                                                           - Ensures any
6557                                                             following global
6558                                                             data read is no
6559                                                             older than a local
6560                                                             atomicrmw value
6561                                                             being acquired.
6562
6563                                                         3. buffer_wbinvl1_vol
6564
6565                                                           - If not TgSplit execution
6566                                                             mode, omit.
6567                                                           - Ensures that
6568                                                             following
6569                                                             loads will not see
6570                                                             stale data.
6571
6572     atomicrmw    acquire      - agent        - global   1. buffer/global_atomic
6573                                                         2. s_waitcnt vmcnt(0)
6574
6575                                                           - Must happen before
6576                                                             following
6577                                                             buffer_wbinvl1_vol.
6578                                                           - Ensures the
6579                                                             atomicrmw has
6580                                                             completed before
6581                                                             invalidating the
6582                                                             cache.
6583
6584                                                         3. buffer_wbinvl1_vol
6585
6586                                                           - Must happen before
6587                                                             any following
6588                                                             global/generic
6589                                                             load/load
6590                                                             atomic/atomicrmw.
6591                                                           - Ensures that
6592                                                             following loads
6593                                                             will not see stale
6594                                                             global data.
6595
6596     atomicrmw    acquire      - system       - global   1. buffer/global_atomic
6597                                                         2. s_waitcnt vmcnt(0)
6598
6599                                                           - Must happen before
6600                                                             following buffer_invl2 and
6601                                                             buffer_wbinvl1_vol.
6602                                                           - Ensures the
6603                                                             atomicrmw has
6604                                                             completed before
6605                                                             invalidating the
6606                                                             caches.
6607
6608                                                         3. buffer_invl2;
6609                                                            buffer_wbinvl1_vol
6610
6611                                                           - Must happen before
6612                                                             any following
6613                                                             global/generic
6614                                                             load/load
6615                                                             atomic/atomicrmw.
6616                                                           - Ensures that
6617                                                             following
6618                                                             loads will not see
6619                                                             stale L1 global data,
6620                                                             nor see stale L2 MTYPE
6621                                                             NC global data.
6622                                                             MTYPE RW and CC memory will
6623                                                             never be stale in L2 due to
6624                                                             the memory probes.
6625
6626     atomicrmw    acquire      - agent        - generic  1. flat_atomic
6627                                                         2. s_waitcnt vmcnt(0) &
6628                                                            lgkmcnt(0)
6629
6630                                                           - If TgSplit execution mode,
6631                                                             omit lgkmcnt(0).
6632                                                           - If OpenCL, omit
6633                                                             lgkmcnt(0).
6634                                                           - Must happen before
6635                                                             following
6636                                                             buffer_wbinvl1_vol.
6637                                                           - Ensures the
6638                                                             atomicrmw has
6639                                                             completed before
6640                                                             invalidating the
6641                                                             cache.
6642
6643                                                         3. buffer_wbinvl1_vol
6644
6645                                                           - Must happen before
6646                                                             any following
6647                                                             global/generic
6648                                                             load/load
6649                                                             atomic/atomicrmw.
6650                                                           - Ensures that
6651                                                             following loads
6652                                                             will not see stale
6653                                                             global data.
6654
6655     atomicrmw    acquire      - system       - generic  1. flat_atomic
6656                                                         2. s_waitcnt vmcnt(0) &
6657                                                            lgkmcnt(0)
6658
6659                                                           - If TgSplit execution mode,
6660                                                             omit lgkmcnt(0).
6661                                                           - If OpenCL, omit
6662                                                             lgkmcnt(0).
6663                                                           - Must happen before
6664                                                             following
6665                                                             buffer_invl2 and
6666                                                             buffer_wbinvl1_vol.
6667                                                           - Ensures the
6668                                                             atomicrmw has
6669                                                             completed before
6670                                                             invalidating the
6671                                                             caches.
6672
6673                                                         3. buffer_invl2;
6674                                                            buffer_wbinvl1_vol
6675
6676                                                           - Must happen before
6677                                                             any following
6678                                                             global/generic
6679                                                             load/load
6680                                                             atomic/atomicrmw.
6681                                                           - Ensures that
6682                                                             following
6683                                                             loads will not see
6684                                                             stale L1 global data,
6685                                                             nor see stale L2 MTYPE
6686                                                             NC global data.
6687                                                             MTYPE RW and CC memory will
6688                                                             never be stale in L2 due to
6689                                                             the memory probes.
6690
6691     fence        acquire      - singlethread *none*     *none*
6692                               - wavefront
6693     fence        acquire      - workgroup    *none*     1. s_waitcnt lgkm/vmcnt(0)
6694
6695                                                           - Use lgkmcnt(0) if not
6696                                                             TgSplit execution mode
6697                                                             and vmcnt(0) if TgSplit
6698                                                             execution mode.
6699                                                           - If OpenCL and
6700                                                             address space is
6701                                                             not generic, omit
6702                                                             lgkmcnt(0).
6703                                                           - If OpenCL and
6704                                                             address space is
6705                                                             local, omit
6706                                                             vmcnt(0).
6707                                                           - However, since LLVM
6708                                                             currently has no
6709                                                             address space on
6710                                                             the fence, the wait
6711                                                             must conservatively
6712                                                             always be generated.
6713                                                             If the fence had an
6714                                                             address space, it
6715                                                             would be set to the
6716                                                             address space of the
6717                                                             OpenCL fence flag, or
6718                                                             to generic if both
6719                                                             local and global
6720                                                             flags are
6721                                                             specified.
6722                                                           - s_waitcnt vmcnt(0)
6723                                                             must happen after
6724                                                             any preceding
6725                                                             global/generic load
6726                                                             atomic/
6727                                                             atomicrmw
6728                                                             with an equal or
6729                                                             wider sync scope
6730                                                             and memory ordering
6731                                                             stronger than
6732                                                             unordered (this is
6733                                                             termed the
6734                                                             fence-paired-atomic).
6735                                                           - s_waitcnt lgkmcnt(0)
6736                                                             must happen after
6737                                                             any preceding
6738                                                             local/generic load
6739                                                             atomic/atomicrmw
6740                                                             with an equal or
6741                                                             wider sync scope
6742                                                             and memory ordering
6743                                                             stronger than
6744                                                             unordered (this is
6745                                                             termed the
6746                                                             fence-paired-atomic).
6747                                                           - Must happen before
6748                                                             the following
6749                                                             buffer_wbinvl1_vol and
6750                                                             any following
6751                                                             global/generic
6752                                                             load/load
6753                                                             atomic/store/store
6754                                                             atomic/atomicrmw.
6755                                                           - Ensures any
6756                                                             following global
6757                                                             data read is no
6758                                                             older than the
6759                                                             value read by the
6760                                                             fence-paired-atomic.
6761
6762                                                         2. buffer_wbinvl1_vol
6763
6764                                                           - If not TgSplit execution
6765                                                             mode, omit.
6766                                                           - Ensures that
6767                                                             following
6768                                                             loads will not see
6769                                                             stale data.
6770
6771     fence        acquire      - agent        *none*     1. s_waitcnt lgkmcnt(0) &
6772                                                            vmcnt(0)
6773
6774                                                           - If TgSplit execution mode,
6775                                                             omit lgkmcnt(0).
6776                                                           - If OpenCL and
6777                                                             address space is
6778                                                             not generic, omit
6779                                                             lgkmcnt(0).
6780                                                           - However, since LLVM
6781                                                             currently has no
6782                                                             address space on
6783                                                             the fence, the wait
6784                                                             must conservatively
6785                                                             always be generated
6786                                                             (see comment for
6787                                                             previous fence).
6788                                                           - Could be split into
6789                                                             separate s_waitcnt
6790                                                             vmcnt(0) and
6791                                                             s_waitcnt
6792                                                             lgkmcnt(0) to allow
6793                                                             them to be
6794                                                             independently moved
6795                                                             according to the
6796                                                             following rules.
6797                                                           - s_waitcnt vmcnt(0)
6798                                                             must happen after
6799                                                             any preceding
6800                                                             global/generic load
6801                                                             atomic/atomicrmw
6802                                                             with an equal or
6803                                                             wider sync scope
6804                                                             and memory ordering
6805                                                             stronger than
6806                                                             unordered (this is
6807                                                             termed the
6808                                                             fence-paired-atomic).
6809                                                           - s_waitcnt lgkmcnt(0)
6810                                                             must happen after
6811                                                             any preceding
6812                                                             local/generic load
6813                                                             atomic/atomicrmw
6814                                                             with an equal or
6815                                                             wider sync scope
6816                                                             and memory ordering
6817                                                             stronger than
6818                                                             unordered (this is
6819                                                             termed the
6820                                                             fence-paired-atomic).
6821                                                           - Must happen before
6822                                                             the following
6823                                                             buffer_wbinvl1_vol.
6824                                                           - Ensures that the
6825                                                             fence-paired-atomic
6826                                                             has completed
6827                                                             before invalidating
6828                                                             the
6829                                                             cache. Therefore
6830                                                             any following
6831                                                             locations read must
6832                                                             be no older than
6833                                                             the value read by
6834                                                             the
6835                                                             fence-paired-atomic.
6836
6837                                                         2. buffer_wbinvl1_vol
6838
6839                                                           - Must happen before any
6840                                                             following global/generic
6841                                                             load/load
6842                                                             atomic/store/store
6843                                                             atomic/atomicrmw.
6844                                                           - Ensures that
6845                                                             following loads
6846                                                             will not see stale
6847                                                             global data.
6848
6849     fence        acquire      - system       *none*     1. s_waitcnt lgkmcnt(0) &
6850                                                            vmcnt(0)
6851
6852                                                           - If TgSplit execution mode,
6853                                                             omit lgkmcnt(0).
6854                                                           - If OpenCL and
6855                                                             address space is
6856                                                             not generic, omit
6857                                                             lgkmcnt(0).
6858                                                           - However, since LLVM
6859                                                             currently has no
6860                                                             address space on
6861                                                             the fence, the wait
6862                                                             must conservatively
6863                                                             always be generated
6864                                                             (see comment for
6865                                                             previous fence).
6866                                                           - Could be split into
6867                                                             separate s_waitcnt
6868                                                             vmcnt(0) and
6869                                                             s_waitcnt
6870                                                             lgkmcnt(0) to allow
6871                                                             them to be
6872                                                             independently moved
6873                                                             according to the
6874                                                             following rules.
6875                                                           - s_waitcnt vmcnt(0)
6876                                                             must happen after
6877                                                             any preceding
6878                                                             global/generic load
6879                                                             atomic/atomicrmw
6880                                                             with an equal or
6881                                                             wider sync scope
6882                                                             and memory ordering
6883                                                             stronger than
6884                                                             unordered (this is
6885                                                             termed the
6886                                                             fence-paired-atomic).
6887                                                           - s_waitcnt lgkmcnt(0)
6888                                                             must happen after
6889                                                             any preceding
6890                                                             local/generic load
6891                                                             atomic/atomicrmw
6892                                                             with an equal or
6893                                                             wider sync scope
6894                                                             and memory ordering
6895                                                             stronger than
6896                                                             unordered (this is
6897                                                             termed the
6898                                                             fence-paired-atomic).
6899                                                           - Must happen before
6900                                                             the following buffer_invl2 and
6901                                                             buffer_wbinvl1_vol.
6902                                                           - Ensures that the
6903                                                             fence-paired-atomic
6904                                                             has completed
6905                                                             before invalidating
6906                                                             the
6907                                                             caches. Therefore
6908                                                             any following
6909                                                             locations read must
6910                                                             be no older than
6911                                                             the value read by
6912                                                             the
6913                                                             fence-paired-atomic.
6914
6915                                                         2. buffer_invl2;
6916                                                            buffer_wbinvl1_vol
6917
6918                                                           - Must happen before any
6919                                                             following global/generic
6920                                                             load/load
6921                                                             atomic/store/store
6922                                                             atomic/atomicrmw.
6923                                                           - Ensures that
6924                                                             following
6925                                                             loads will not see
6926                                                             stale L1 global data,
6927                                                             nor see stale L2 MTYPE
6928                                                             NC global data.
6929                                                             MTYPE RW and CC memory will
6930                                                             never be stale in L2 due to
6931                                                             the memory probes.
6932     **Release Atomic**
6933     ------------------------------------------------------------------------------------
6934     store atomic release      - singlethread - global   1. buffer/global/flat_store
6935                               - wavefront    - generic
6936     store atomic release      - singlethread - local    *If TgSplit execution mode,
6937                               - wavefront               local address space cannot
6938                                                         be used.*
6939
6940                                                         1. ds_store
6941     store atomic release      - workgroup    - global   1. s_waitcnt lgkm/vmcnt(0)
6942                                              - generic
6943                                                           - Use lgkmcnt(0) if not
6944                                                             TgSplit execution mode
6945                                                             and vmcnt(0) if TgSplit
6946                                                             execution mode.
6947                                                           - If OpenCL, omit lgkmcnt(0).
6948                                                           - s_waitcnt vmcnt(0)
6949                                                             must happen after
6950                                                             any preceding
6951                                                             global/generic load/store/
6952                                                             load atomic/store atomic/
6953                                                             atomicrmw.
6954                                                           - s_waitcnt lgkmcnt(0)
6955                                                             must happen after
6956                                                             any preceding
6957                                                             local/generic
6958                                                             load/store/load
6959                                                             atomic/store
6960                                                             atomic/atomicrmw.
6961                                                           - Must happen before
6962                                                             the following
6963                                                             store.
6964                                                           - Ensures that all
6965                                                             memory operations
6966                                                             have
6967                                                             completed before
6968                                                             performing the
6969                                                             store that is being
6970                                                             released.
6971
6972                                                         2. buffer/global/flat_store
6973     store atomic release      - workgroup    - local    *If TgSplit execution mode,
6974                                                         local address space cannot
6975                                                         be used.*
6976
6977                                                         1. ds_store
6978     store atomic release      - agent        - global   1. s_waitcnt lgkmcnt(0) &
6979                                              - generic     vmcnt(0)
6980
6981                                                           - If TgSplit execution mode,
6982                                                             omit lgkmcnt(0).
6983                                                           - If OpenCL and
6984                                                             address space is
6985                                                             not generic, omit
6986                                                             lgkmcnt(0).
6987                                                           - Could be split into
6988                                                             separate s_waitcnt
6989                                                             vmcnt(0) and
6990                                                             s_waitcnt
6991                                                             lgkmcnt(0) to allow
6992                                                             them to be
6993                                                             independently moved
6994                                                             according to the
6995                                                             following rules.
6996                                                           - s_waitcnt vmcnt(0)
6997                                                             must happen after
6998                                                             any preceding
6999                                                             global/generic
7000                                                             load/store/load
7001                                                             atomic/store
7002                                                             atomic/atomicrmw.
7003                                                           - s_waitcnt lgkmcnt(0)
7004                                                             must happen after
7005                                                             any preceding
7006                                                             local/generic
7007                                                             load/store/load
7008                                                             atomic/store
7009                                                             atomic/atomicrmw.
7010                                                           - Must happen before
7011                                                             the following
7012                                                             store.
7013                                                           - Ensures that all
7014                                                             memory operations
7015                                                             have
7016                                                             completed before
7017                                                             performing the
7018                                                             store that is being
7019                                                             released.
7020
7021                                                         2. buffer/global/flat_store
7022     store atomic release      - system       - global   1. buffer_wbl2
7023                                              - generic
7024                                                           - Must happen before
7025                                                             following s_waitcnt.
7026                                                           - Performs L2 writeback to
7027                                                             ensure previous
7028                                                             global/generic
7029                                                             store/atomicrmw are
7030                                                             visible at system scope.
7031
7032                                                         2. s_waitcnt lgkmcnt(0) &
7033                                                            vmcnt(0)
7034
7035                                                           - If TgSplit execution mode,
7036                                                             omit lgkmcnt(0).
7037                                                           - If OpenCL and
7038                                                             address space is
7039                                                             not generic, omit
7040                                                             lgkmcnt(0).
7041                                                           - Could be split into
7042                                                             separate s_waitcnt
7043                                                             vmcnt(0) and
7044                                                             s_waitcnt
7045                                                             lgkmcnt(0) to allow
7046                                                             them to be
7047                                                             independently moved
7048                                                             according to the
7049                                                             following rules.
7050                                                           - s_waitcnt vmcnt(0)
7051                                                             must happen after any
7052                                                             preceding
7053                                                             global/generic
7054                                                             load/store/load
7055                                                             atomic/store
7056                                                             atomic/atomicrmw.
7057                                                           - s_waitcnt lgkmcnt(0)
7058                                                             must happen after any
7059                                                             preceding
7060                                                             local/generic
7061                                                             load/store/load
7062                                                             atomic/store
7063                                                             atomic/atomicrmw.
7064                                                           - Must happen before
7065                                                             the following
7066                                                             store.
7067                                                           - Ensures that all
7068                                                             memory operations
7069                                                             to memory and the L2
7070                                                             writeback have
7071                                                             completed before
7072                                                             performing the
7073                                                             store that is being
7074                                                             released.
7075
7076                                                         3. buffer/global/flat_store
7077     atomicrmw    release      - singlethread - global   1. buffer/global/flat_atomic
7078                               - wavefront    - generic
7079     atomicrmw    release      - singlethread - local    *If TgSplit execution mode,
7080                               - wavefront               local address space cannot
7081                                                         be used.*
7082
7083                                                         1. ds_atomic
7084     atomicrmw    release      - workgroup    - global   1. s_waitcnt lgkm/vmcnt(0)
7085                                              - generic
7086                                                           - Use lgkmcnt(0) if not
7087                                                             TgSplit execution mode
7088                                                             and vmcnt(0) if TgSplit
7089                                                             execution mode.
7090                                                           - If OpenCL, omit
7091                                                             lgkmcnt(0).
7092                                                           - s_waitcnt vmcnt(0)
7093                                                             must happen after
7094                                                             any preceding
7095                                                             global/generic load/store/
7096                                                             load atomic/store atomic/
7097                                                             atomicrmw.
7098                                                           - s_waitcnt lgkmcnt(0)
7099                                                             must happen after
7100                                                             any preceding
7101                                                             local/generic
7102                                                             load/store/load
7103                                                             atomic/store
7104                                                             atomic/atomicrmw.
7105                                                           - Must happen before
7106                                                             the following
7107                                                             atomicrmw.
7108                                                           - Ensures that all
7109                                                             memory operations
7110                                                             have
7111                                                             completed before
7112                                                             performing the
7113                                                             atomicrmw that is
7114                                                             being released.
7115
7116                                                         2. buffer/global/flat_atomic
7117     atomicrmw    release      - workgroup    - local    *If TgSplit execution mode,
7118                                                         local address space cannot
7119                                                         be used.*
7120
7121                                                         1. ds_atomic
7122     atomicrmw    release      - agent        - global   1. s_waitcnt lgkmcnt(0) &
7123                                              - generic     vmcnt(0)
7124
7125                                                           - If TgSplit execution mode,
7126                                                             omit lgkmcnt(0).
7127                                                           - If OpenCL, omit
7128                                                             lgkmcnt(0).
7129                                                           - Could be split into
7130                                                             separate s_waitcnt
7131                                                             vmcnt(0) and
7132                                                             s_waitcnt
7133                                                             lgkmcnt(0) to allow
7134                                                             them to be
7135                                                             independently moved
7136                                                             according to the
7137                                                             following rules.
7138                                                           - s_waitcnt vmcnt(0)
7139                                                             must happen after
7140                                                             any preceding
7141                                                             global/generic
7142                                                             load/store/load
7143                                                             atomic/store
7144                                                             atomic/atomicrmw.
7145                                                           - s_waitcnt lgkmcnt(0)
7146                                                             must happen after
7147                                                             any preceding
7148                                                             local/generic
7149                                                             load/store/load
7150                                                             atomic/store
7151                                                             atomic/atomicrmw.
7152                                                           - Must happen before
7153                                                             the following
7154                                                             atomicrmw.
7155                                                           - Ensures that all
7156                                                             memory operations
7157                                                             to global and local
7158                                                             have completed
7159                                                             before performing
7160                                                             the atomicrmw that
7161                                                             is being released.
7162
7163                                                         2. buffer/global/flat_atomic
7164     atomicrmw    release      - system       - global   1. buffer_wbl2
7165                                              - generic
7166                                                           - Must happen before
7167                                                             following s_waitcnt.
7168                                                           - Performs L2 writeback to
7169                                                             ensure previous
7170                                                             global/generic
7171                                                             store/atomicrmw are
7172                                                             visible at system scope.
7173
7174                                                         2. s_waitcnt lgkmcnt(0) &
7175                                                            vmcnt(0)
7176
7177                                                           - If TgSplit execution mode,
7178                                                             omit lgkmcnt(0).
7179                                                           - If OpenCL, omit
7180                                                             lgkmcnt(0).
7181                                                           - Could be split into
7182                                                             separate s_waitcnt
7183                                                             vmcnt(0) and
7184                                                             s_waitcnt
7185                                                             lgkmcnt(0) to allow
7186                                                             them to be
7187                                                             independently moved
7188                                                             according to the
7189                                                             following rules.
7190                                                           - s_waitcnt vmcnt(0)
7191                                                             must happen after
7192                                                             any preceding
7193                                                             global/generic
7194                                                             load/store/load
7195                                                             atomic/store
7196                                                             atomic/atomicrmw.
7197                                                           - s_waitcnt lgkmcnt(0)
7198                                                             must happen after
7199                                                             any preceding
7200                                                             local/generic
7201                                                             load/store/load
7202                                                             atomic/store
7203                                                             atomic/atomicrmw.
7204                                                           - Must happen before
7205                                                             the following
7206                                                             atomicrmw.
7207                                                           - Ensures that all
7208                                                             memory operations
7209                                                             to memory and the L2
7210                                                             writeback have
7211                                                             completed before
7212                                                             performing the
7213                                                             atomicrmw that is
7214                                                             being released.
7215
7216                                                         3. buffer/global/flat_atomic
7217     fence        release      - singlethread *none*     *none*
7218                               - wavefront
7219     fence        release      - workgroup    *none*     1. s_waitcnt lgkm/vmcnt(0)
7220
7221                                                           - Use lgkmcnt(0) if not
7222                                                             TgSplit execution mode
7223                                                             and vmcnt(0) if TgSplit
7224                                                             execution mode.
7225                                                           - If OpenCL and
7226                                                             address space is
7227                                                             not generic, omit
7228                                                             lgkmcnt(0).
7229                                                           - If OpenCL and
7230                                                             address space is
7231                                                             local, omit
7232                                                             vmcnt(0).
7233                                                           - However, since LLVM
7234                                                             currently has no
7235                                                             address space on
7236                                                             the fence, the waits
7237                                                             must conservatively
7238                                                             always be generated.
7239                                                             If fences had an
7240                                                             address space, it
7241                                                             would be set to the
7242                                                             address space of the
7243                                                             OpenCL fence flag, or
7244                                                             to generic if both
7245                                                             local and global
7246                                                             flags are
7247                                                             specified.
7248                                                           - s_waitcnt vmcnt(0)
7249                                                             must happen after
7250                                                             any preceding
7251                                                             global/generic
7252                                                             load/store/
7253                                                             load atomic/store atomic/
7254                                                             atomicrmw.
7255                                                           - s_waitcnt lgkmcnt(0)
7256                                                             must happen after
7257                                                             any preceding
7258                                                             local/generic
7259                                                             load/store/load
7260                                                             atomic/store
7261                                                             atomic/atomicrmw.
7262                                                           - Must happen before
7263                                                             any following store
7264                                                             atomic/atomicrmw
7265                                                             with an equal or
7266                                                             wider sync scope
7267                                                             and memory ordering
7268                                                             stronger than
7269                                                             unordered (this is
7270                                                             termed the
7271                                                             fence-paired-atomic).
7272                                                           - Ensures that all
7273                                                             memory operations
7274                                                             have
7275                                                             completed before
7276                                                             performing the
7277                                                             following
7278                                                             fence-paired-atomic.
7279
7280     fence        release      - agent        *none*     1. s_waitcnt lgkmcnt(0) &
7281                                                            vmcnt(0)
7282
7283                                                           - If TgSplit execution mode,
7284                                                             omit lgkmcnt(0).
7285                                                           - If OpenCL and
7286                                                             address space is
7287                                                             not generic, omit
7288                                                             lgkmcnt(0).
7289                                                           - If OpenCL and
7290                                                             address space is
7291                                                             local, omit
7292                                                             vmcnt(0).
7293                                                           - However, since LLVM
7294                                                             currently has no
7295                                                             address space on
7296                                                             the fence, the waits
7297                                                             must conservatively
7298                                                             always be generated.
7299                                                             If fences had an
7300                                                             address space, it
7301                                                             would be set to the
7302                                                             address space of the
7303                                                             OpenCL fence flag, or
7304                                                             to generic if both
7305                                                             local and global
7306                                                             flags are
7307                                                             specified.
7308                                                           - Could be split into
7309                                                             separate s_waitcnt
7310                                                             vmcnt(0) and
7311                                                             s_waitcnt
7312                                                             lgkmcnt(0) to allow
7313                                                             them to be
7314                                                             independently moved
7315                                                             according to the
7316                                                             following rules.
7317                                                           - s_waitcnt vmcnt(0)
7318                                                             must happen after
7319                                                             any preceding
7320                                                             global/generic
7321                                                             load/store/load
7322                                                             atomic/store
7323                                                             atomic/atomicrmw.
7324                                                           - s_waitcnt lgkmcnt(0)
7325                                                             must happen after
7326                                                             any preceding
7327                                                             local/generic
7328                                                             load/store/load
7329                                                             atomic/store
7330                                                             atomic/atomicrmw.
7331                                                           - Must happen before
7332                                                             any following store
7333                                                             atomic/atomicrmw
7334                                                             with an equal or
7335                                                             wider sync scope
7336                                                             and memory ordering
7337                                                             stronger than
7338                                                             unordered (this is
7339                                                             termed the
7340                                                             fence-paired-atomic).
7341                                                           - Ensures that all
7342                                                             memory operations
7343                                                             have
7344                                                             completed before
7345                                                             performing the
7346                                                             following
7347                                                             fence-paired-atomic.
7348
7349     fence        release      - system       *none*     1. buffer_wbl2
7350
7351                                                           - If OpenCL and
7352                                                             address space is
7353                                                             local, omit.
7354                                                           - Must happen before
7355                                                             following s_waitcnt.
7356                                                           - Performs L2 writeback to
7357                                                             ensure previous
7358                                                             global/generic
7359                                                             store/atomicrmw are
7360                                                             visible at system scope.
7361
7362                                                         2. s_waitcnt lgkmcnt(0) &
7363                                                            vmcnt(0)
7364
7365                                                           - If TgSplit execution mode,
7366                                                             omit lgkmcnt(0).
7367                                                           - If OpenCL and
7368                                                             address space is
7369                                                             not generic, omit
7370                                                             lgkmcnt(0).
7371                                                           - If OpenCL and
7372                                                             address space is
7373                                                             local, omit
7374                                                             vmcnt(0).
7375                                                           - However, since LLVM
7376                                                             currently has no
7377                                                             address space on
7378                                                             the fence, the waits
7379                                                             must conservatively
7380                                                             always be generated.
7381                                                             If fences had an
7382                                                             address space, it
7383                                                             would be set to the
7384                                                             address space of the
7385                                                             OpenCL fence flag, or
7386                                                             to generic if both
7387                                                             local and global
7388                                                             flags are
7389                                                             specified.
7390                                                           - Could be split into
7391                                                             separate s_waitcnt
7392                                                             vmcnt(0) and
7393                                                             s_waitcnt
7394                                                             lgkmcnt(0) to allow
7395                                                             them to be
7396                                                             independently moved
7397                                                             according to the
7398                                                             following rules.
7399                                                           - s_waitcnt vmcnt(0)
7400                                                             must happen after
7401                                                             any preceding
7402                                                             global/generic
7403                                                             load/store/load
7404                                                             atomic/store
7405                                                             atomic/atomicrmw.
7406                                                           - s_waitcnt lgkmcnt(0)
7407                                                             must happen after
7408                                                             any preceding
7409                                                             local/generic
7410                                                             load/store/load
7411                                                             atomic/store
7412                                                             atomic/atomicrmw.
7413                                                           - Must happen before
7414                                                             any following store
7415                                                             atomic/atomicrmw
7416                                                             with an equal or
7417                                                             wider sync scope
7418                                                             and memory ordering
7419                                                             stronger than
7420                                                             unordered (this is
7421                                                             termed the
7422                                                             fence-paired-atomic).
7423                                                           - Ensures that all
7424                                                             memory operations
7425                                                             have
7426                                                             completed before
7427                                                             performing the
7428                                                             following
7429                                                             fence-paired-atomic.
7430
7431     **Acquire-Release Atomic**
7432     ------------------------------------------------------------------------------------
7433     atomicrmw    acq_rel      - singlethread - global   1. buffer/global/flat_atomic
7434                               - wavefront    - generic
7435     atomicrmw    acq_rel      - singlethread - local    *If TgSplit execution mode,
7436                               - wavefront               local address space cannot
7437                                                         be used.*
7438
7439                                                         1. ds_atomic
7440     atomicrmw    acq_rel      - workgroup    - global   1. s_waitcnt lgkm/vmcnt(0)
7441
7442                                                           - Use lgkmcnt(0) if not
7443                                                             TgSplit execution mode
7444                                                             and vmcnt(0) if TgSplit
7445                                                             execution mode.
7446                                                           - If OpenCL, omit
7447                                                             lgkmcnt(0).
7454                                                           - s_waitcnt vmcnt(0)
7455                                                             must happen after
7456                                                             any preceding
7457                                                             global/generic load/store/
7458                                                             load atomic/store atomic/
7459                                                             atomicrmw.
7460                                                           - s_waitcnt lgkmcnt(0)
7461                                                             must happen after
7462                                                             any preceding
7463                                                             local/generic
7464                                                             load/store/load
7465                                                             atomic/store
7466                                                             atomic/atomicrmw.
7467                                                           - Must happen before
7468                                                             the following
7469                                                             atomicrmw.
7470                                                           - Ensures that all
7471                                                             memory operations
7472                                                             have
7473                                                             completed before
7474                                                             performing the
7475                                                             atomicrmw that is
7476                                                             being released.
7477
7478                                                         2. buffer/global_atomic
7479                                                         3. s_waitcnt vmcnt(0)
7480
7481                                                           - If not TgSplit execution
7482                                                             mode, omit.
7483                                                           - Must happen before
7484                                                             the following
7485                                                             buffer_wbinvl1_vol.
7486                                                           - Ensures any
7487                                                             following global
7488                                                             data read is no
7489                                                             older than the
7490                                                             atomicrmw value
7491                                                             being acquired.
7492
7493                                                         4. buffer_wbinvl1_vol
7494
7495                                                           - If not TgSplit execution
7496                                                             mode, omit.
7497                                                           - Ensures that
7498                                                             following
7499                                                             loads will not see
7500                                                             stale data.
7501
7502     atomicrmw    acq_rel      - workgroup    - local    *If TgSplit execution mode,
                                                         local address space cannot
                                                         be used.*

                                                         1. ds_atomic
                                                         2. s_waitcnt lgkmcnt(0)

                                                           - If OpenCL, omit.
                                                           - Must happen before
                                                             any following
                                                             global/generic
                                                             load/load
                                                             atomic/store/store
                                                             atomic/atomicrmw.
                                                           - Ensures any
                                                             following global
                                                             data read is no
                                                             older than the local
                                                             atomicrmw value being
                                                             acquired.

     atomicrmw    acq_rel      - workgroup    - generic  1. s_waitcnt lgkm/vmcnt(0)

                                                           - Use lgkmcnt(0) if not
                                                             TgSplit execution mode
                                                             and vmcnt(0) if TgSplit
                                                             execution mode.
                                                           - If OpenCL, omit
                                                             lgkmcnt(0).
                                                           - s_waitcnt vmcnt(0)
                                                             must happen after
                                                             any preceding
                                                             global/generic load/store/
                                                             load atomic/store atomic/
                                                             atomicrmw.
                                                           - s_waitcnt lgkmcnt(0)
                                                             must happen after
                                                             any preceding
                                                             local/generic
                                                             load/store/load
                                                             atomic/store
                                                             atomic/atomicrmw.
                                                           - Must happen before
                                                             the following
                                                             atomicrmw.
                                                           - Ensures that all
                                                             memory operations
                                                             have
                                                             completed before
                                                             performing the
                                                             atomicrmw that is
                                                             being released.

                                                         2. flat_atomic
                                                         3. s_waitcnt lgkmcnt(0) &
                                                            vmcnt(0)

                                                           - If not TgSplit execution
                                                             mode, omit vmcnt(0).
                                                           - If OpenCL, omit
                                                             lgkmcnt(0).
                                                           - Must happen before
                                                             the following
                                                             buffer_wbinvl1_vol and
                                                             any following
                                                             global/generic
                                                             load/load
                                                             atomic/store/store
                                                             atomic/atomicrmw.
                                                           - Ensures any
                                                             following global
                                                             data read is no
                                                             older than a local
                                                             atomicrmw value being
                                                             acquired.

                                                         4. buffer_wbinvl1_vol

                                                           - If not TgSplit execution
                                                             mode, omit.
                                                           - Ensures that
                                                             following
                                                             loads will not see
                                                             stale data.

     atomicrmw    acq_rel      - agent        - global   1. s_waitcnt lgkmcnt(0) &
                                                            vmcnt(0)

                                                           - If TgSplit execution mode,
                                                             omit lgkmcnt(0).
                                                           - If OpenCL, omit
                                                             lgkmcnt(0).
                                                           - Could be split into
                                                             separate s_waitcnt
                                                             vmcnt(0) and
                                                             s_waitcnt
                                                             lgkmcnt(0) to allow
                                                             them to be
                                                             independently moved
                                                             according to the
                                                             following rules.
                                                           - s_waitcnt vmcnt(0)
                                                             must happen after
                                                             any preceding
                                                             global/generic
                                                             load/store/load
                                                             atomic/store
                                                             atomic/atomicrmw.
                                                           - s_waitcnt lgkmcnt(0)
                                                             must happen after
                                                             any preceding
                                                             local/generic
                                                             load/store/load
                                                             atomic/store
                                                             atomic/atomicrmw.
                                                           - Must happen before
                                                             the following
                                                             atomicrmw.
                                                           - Ensures that all
                                                             memory operations
                                                             to global have
                                                             completed before
                                                             performing the
                                                             atomicrmw that is
                                                             being released.

                                                         2. buffer/global_atomic
                                                         3. s_waitcnt vmcnt(0)

                                                           - Must happen before
                                                             following
                                                             buffer_wbinvl1_vol.
                                                           - Ensures the
                                                             atomicrmw has
                                                             completed before
                                                             invalidating the
                                                             cache.

                                                         4. buffer_wbinvl1_vol

                                                           - Must happen before
                                                             any following
                                                             global/generic
                                                             load/load
                                                             atomic/atomicrmw.
                                                           - Ensures that
                                                             following loads
                                                             will not see stale
                                                             global data.

     atomicrmw    acq_rel      - system       - global   1. buffer_wbl2

                                                           - Must happen before
                                                             following s_waitcnt.
                                                           - Performs L2 writeback to
                                                             ensure previous
                                                             global/generic
                                                             store/atomicrmw are
                                                             visible at system scope.

                                                         2. s_waitcnt lgkmcnt(0) &
                                                            vmcnt(0)

                                                           - If TgSplit execution mode,
                                                             omit lgkmcnt(0).
                                                           - If OpenCL, omit
                                                             lgkmcnt(0).
                                                           - Could be split into
                                                             separate s_waitcnt
                                                             vmcnt(0) and
                                                             s_waitcnt
                                                             lgkmcnt(0) to allow
                                                             them to be
                                                             independently moved
                                                             according to the
                                                             following rules.
                                                           - s_waitcnt vmcnt(0)
                                                             must happen after
                                                             any preceding
                                                             global/generic
                                                             load/store/load
                                                             atomic/store
                                                             atomic/atomicrmw.
                                                           - s_waitcnt lgkmcnt(0)
                                                             must happen after
                                                             any preceding
                                                             local/generic
                                                             load/store/load
                                                             atomic/store
                                                             atomic/atomicrmw.
                                                           - Must happen before
                                                             the following
                                                             atomicrmw.
                                                           - Ensures that all
                                                             memory operations
                                                             to global and L2 writeback
                                                             have completed before
                                                             performing the
                                                             atomicrmw that is
                                                             being released.

                                                         3. buffer/global_atomic
                                                         4. s_waitcnt vmcnt(0)

                                                           - Must happen before
                                                             following buffer_invl2 and
                                                             buffer_wbinvl1_vol.
                                                           - Ensures the
                                                             atomicrmw has
                                                             completed before
                                                             invalidating the
                                                             caches.

                                                         5. buffer_invl2;
                                                            buffer_wbinvl1_vol

                                                           - Must happen before
                                                             any following
                                                             global/generic
                                                             load/load
                                                             atomic/atomicrmw.
                                                           - Ensures that
                                                             following
                                                             loads will not see
                                                             stale L1 global data,
                                                             nor see stale L2 MTYPE
                                                             NC global data.
                                                             MTYPE RW and CC memory will
                                                             never be stale in L2 due to
                                                             the memory probes.

     atomicrmw    acq_rel      - agent        - generic  1. s_waitcnt lgkmcnt(0) &
                                                            vmcnt(0)

                                                           - If TgSplit execution mode,
                                                             omit lgkmcnt(0).
                                                           - If OpenCL, omit
                                                             lgkmcnt(0).
                                                           - Could be split into
                                                             separate s_waitcnt
                                                             vmcnt(0) and
                                                             s_waitcnt
                                                             lgkmcnt(0) to allow
                                                             them to be
                                                             independently moved
                                                             according to the
                                                             following rules.
                                                           - s_waitcnt vmcnt(0)
                                                             must happen after
                                                             any preceding
                                                             global/generic
                                                             load/store/load
                                                             atomic/store
                                                             atomic/atomicrmw.
                                                           - s_waitcnt lgkmcnt(0)
                                                             must happen after
                                                             any preceding
                                                             local/generic
                                                             load/store/load
                                                             atomic/store
                                                             atomic/atomicrmw.
                                                           - Must happen before
                                                             the following
                                                             atomicrmw.
                                                           - Ensures that all
                                                             memory operations
                                                             to global have
                                                             completed before
                                                             performing the
                                                             atomicrmw that is
                                                             being released.

                                                         2. flat_atomic
                                                         3. s_waitcnt vmcnt(0) &
                                                            lgkmcnt(0)

                                                           - If TgSplit execution mode,
                                                             omit lgkmcnt(0).
                                                           - If OpenCL, omit
                                                             lgkmcnt(0).
                                                           - Must happen before
                                                             following
                                                             buffer_wbinvl1_vol.
                                                           - Ensures the
                                                             atomicrmw has
                                                             completed before
                                                             invalidating the
                                                             cache.

                                                         4. buffer_wbinvl1_vol

                                                           - Must happen before
                                                             any following
                                                             global/generic
                                                             load/load
                                                             atomic/atomicrmw.
                                                           - Ensures that
                                                             following loads
                                                             will not see stale
                                                             global data.

     atomicrmw    acq_rel      - system       - generic  1. buffer_wbl2

                                                           - Must happen before
                                                             following s_waitcnt.
                                                           - Performs L2 writeback to
                                                             ensure previous
                                                             global/generic
                                                             store/atomicrmw are
                                                             visible at system scope.

                                                         2. s_waitcnt lgkmcnt(0) &
                                                            vmcnt(0)

                                                           - If TgSplit execution mode,
                                                             omit lgkmcnt(0).
                                                           - If OpenCL, omit
                                                             lgkmcnt(0).
                                                           - Could be split into
                                                             separate s_waitcnt
                                                             vmcnt(0) and
                                                             s_waitcnt
                                                             lgkmcnt(0) to allow
                                                             them to be
                                                             independently moved
                                                             according to the
                                                             following rules.
                                                           - s_waitcnt vmcnt(0)
                                                             must happen after
                                                             any preceding
                                                             global/generic
                                                             load/store/load
                                                             atomic/store
                                                             atomic/atomicrmw.
                                                           - s_waitcnt lgkmcnt(0)
                                                             must happen after
                                                             any preceding
                                                             local/generic
                                                             load/store/load
                                                             atomic/store
                                                             atomic/atomicrmw.
                                                           - Must happen before
                                                             the following
                                                             atomicrmw.
                                                           - Ensures that all
                                                             memory operations
                                                             to global and L2 writeback
                                                             have completed before
                                                             performing the
                                                             atomicrmw that is
                                                             being released.

                                                         3. flat_atomic
                                                         4. s_waitcnt vmcnt(0) &
                                                            lgkmcnt(0)

                                                           - If TgSplit execution mode,
                                                             omit lgkmcnt(0).
                                                           - If OpenCL, omit
                                                             lgkmcnt(0).
                                                           - Must happen before
                                                             following buffer_invl2 and
                                                             buffer_wbinvl1_vol.
                                                           - Ensures the
                                                             atomicrmw has
                                                             completed before
                                                             invalidating the
                                                             caches.

                                                         5. buffer_invl2;
                                                            buffer_wbinvl1_vol

                                                           - Must happen before
                                                             any following
                                                             global/generic
                                                             load/load
                                                             atomic/atomicrmw.
7879                                                           - Ensures that
7880                                                             following
7881                                                             loads will not see
7882                                                             stale L1 global data,
7883                                                             nor see stale L2 MTYPE
7884                                                             NC global data.
7885                                                             MTYPE RW and CC memory will
7886                                                             never be stale in L2 due to
7887                                                             the memory probes.
7888
7889     fence        acq_rel      - singlethread *none*     *none*
7890                               - wavefront
7891     fence        acq_rel      - workgroup    *none*     1. s_waitcnt lgkm/vmcnt(0)
7892
7893                                                           - Use lgkmcnt(0) if not
7894                                                             TgSplit execution mode
7895                                                             and vmcnt(0) if TgSplit
7896                                                             execution mode.
7897                                                           - If OpenCL and
7898                                                             address space is
7899                                                             not generic, omit
7900                                                             lgkmcnt(0).
7901                                                           - If OpenCL and
7902                                                             address space is
7903                                                             local, omit
7904                                                             vmcnt(0).
7905                                                           - However,
7906                                                             since LLVM
7907                                                             currently has no
7908                                                             address space on
7909                                                             the fence, these
7910                                                             waits must
7911                                                             conservatively always
7912                                                             be generated (see
7913                                                             comment for previous fence).
7914                                                           - s_waitcnt vmcnt(0)
7915                                                             must happen after
7916                                                             any preceding
7917                                                             global/generic
7918                                                             load/store/
7919                                                             load atomic/store atomic/
7920                                                             atomicrmw.
7921                                                           - s_waitcnt lgkmcnt(0)
7922                                                             must happen after
7923                                                             any preceding
7924                                                             local/generic
7925                                                             load/load
7926                                                             atomic/store/store
7927                                                             atomic/atomicrmw.
7928                                                           - Must happen before
7929                                                             any following
7930                                                             global/generic
7931                                                             load/load
7932                                                             atomic/store/store
7933                                                             atomic/atomicrmw.
7934                                                           - Ensures that all
7935                                                             memory operations
7936                                                             have
7937                                                             completed before
7938                                                             performing any
7939                                                             following global
7940                                                             memory operations.
7941                                                           - Ensures that the
7942                                                             preceding
7943                                                             local/generic load
7944                                                             atomic/atomicrmw
7945                                                             with an equal or
7946                                                             wider sync scope
7947                                                             and memory ordering
7948                                                             stronger than
7949                                                             unordered (this is
7950                                                             termed the
7951                                                             acquire-fence-paired-atomic)
7952                                                             has completed
7953                                                             before following
7954                                                             global memory
7955                                                             operations. This
7956                                                             satisfies the
7957                                                             requirements of
7958                                                             acquire.
7959                                                           - Ensures that all
7960                                                             previous memory
7961                                                             operations have
7962                                                             completed before a
7963                                                             following
7964                                                             local/generic store
7965                                                             atomic/atomicrmw
7966                                                             with an equal or
7967                                                             wider sync scope
7968                                                             and memory ordering
7969                                                             stronger than
7970                                                             unordered (this is
7971                                                             termed the
7972                                                             release-fence-paired-atomic).
7973                                                             This satisfies the
7974                                                             requirements of
7975                                                             release.
7976                                                           - Must happen before
7977                                                             the following
7978                                                             buffer_wbinvl1_vol.
7979                                                           - Ensures that the
7980                                                             acquire-fence-paired-atomic
7981                                                             has completed
7982                                                             before invalidating
7983                                                             the
7984                                                             cache. Therefore
7985                                                             any following
7986                                                             locations read must
7987                                                             be no older than
7988                                                             the value read by
7989                                                             the
7990                                                             acquire-fence-paired-atomic.
7991
7992                                                         2. buffer_wbinvl1_vol
7993
7994                                                           - If not TgSplit execution
7995                                                             mode, omit.
7996                                                           - Ensures that
7997                                                             following
7998                                                             loads will not see
7999                                                             stale data.
8000
8001     fence        acq_rel      - agent        *none*     1. s_waitcnt lgkmcnt(0) &
8002                                                            vmcnt(0)
8003
8004                                                           - If TgSplit execution mode,
8005                                                             omit lgkmcnt(0).
8006                                                           - If OpenCL and
8007                                                             address space is
8008                                                             not generic, omit
8009                                                             lgkmcnt(0).
8010                                                           - However, since LLVM
8011                                                             currently has no
8012                                                             address space on
8013                                                             the fence, this wait
8014                                                             must conservatively
8015                                                             always be generated
8016                                                             (see comment for
8017                                                             previous fence).
8018                                                           - Could be split into
8019                                                             separate s_waitcnt
8020                                                             vmcnt(0) and
8021                                                             s_waitcnt
8022                                                             lgkmcnt(0) to allow
8023                                                             them to be
8024                                                             independently moved
8025                                                             according to the
8026                                                             following rules.
8027                                                           - s_waitcnt vmcnt(0)
8028                                                             must happen after
8029                                                             any preceding
8030                                                             global/generic
8031                                                             load/store/load
8032                                                             atomic/store
8033                                                             atomic/atomicrmw.
8034                                                           - s_waitcnt lgkmcnt(0)
8035                                                             must happen after
8036                                                             any preceding
8037                                                             local/generic
8038                                                             load/store/load
8039                                                             atomic/store
8040                                                             atomic/atomicrmw.
8041                                                           - Must happen before
8042                                                             the following
8043                                                             buffer_wbinvl1_vol.
8044                                                           - Ensures that the
8045                                                             preceding
8046                                                             global/local/generic
8047                                                             load
8048                                                             atomic/atomicrmw
8049                                                             with an equal or
8050                                                             wider sync scope
8051                                                             and memory ordering
8052                                                             stronger than
8053                                                             unordered (this is
8054                                                             termed the
8055                                                             acquire-fence-paired-atomic)
8056                                                             has completed
8057                                                             before invalidating
8058                                                             the cache. This
8059                                                             satisfies the
8060                                                             requirements of
8061                                                             acquire.
8062                                                           - Ensures that all
8063                                                             previous memory
8064                                                             operations have
8065                                                             completed before a
8066                                                             following
8067                                                             global/local/generic
8068                                                             store
8069                                                             atomic/atomicrmw
8070                                                             with an equal or
8071                                                             wider sync scope
8072                                                             and memory ordering
8073                                                             stronger than
8074                                                             unordered (this is
8075                                                             termed the
8076                                                             release-fence-paired-atomic).
8077                                                             This satisfies the
8078                                                             requirements of
8079                                                             release.
8080
8081                                                         2. buffer_wbinvl1_vol
8082
8083                                                           - Must happen before
8084                                                             any following
8085                                                             global/generic
8086                                                             load/load
8087                                                             atomic/store/store
8088                                                             atomic/atomicrmw.
8089                                                           - Ensures that
8090                                                             following loads
8091                                                             will not see stale
8092                                                             global data. This
8093                                                             satisfies the
8094                                                             requirements of
8095                                                             acquire.
8096
8097     fence        acq_rel      - system       *none*     1. buffer_wbl2
8098
8099                                                           - If OpenCL and
8100                                                             address space is
8101                                                             local, omit.
8102                                                           - Must happen before
8103                                                             following s_waitcnt.
8104                                                           - Performs L2 writeback to
8105                                                             ensure previous
8106                                                             global/generic
8107                                                             store/atomicrmw are
8108                                                             visible at system scope.
8109
8110                                                         2. s_waitcnt lgkmcnt(0) &
8111                                                            vmcnt(0)
8112
8113                                                           - If TgSplit execution mode,
8114                                                             omit lgkmcnt(0).
8115                                                           - If OpenCL and
8116                                                             address space is
8117                                                             not generic, omit
8118                                                             lgkmcnt(0).
8119                                                           - However, since LLVM
8120                                                             currently has no
8121                                                             address space on
8122                                                             the fence, this wait
8123                                                             must conservatively
8124                                                             always be generated
8125                                                             (see comment for
8126                                                             previous fence).
8127                                                           - Could be split into
8128                                                             separate s_waitcnt
8129                                                             vmcnt(0) and
8130                                                             s_waitcnt
8131                                                             lgkmcnt(0) to allow
8132                                                             them to be
8133                                                             independently moved
8134                                                             according to the
8135                                                             following rules.
8136                                                           - s_waitcnt vmcnt(0)
8137                                                             must happen after
8138                                                             any preceding
8139                                                             global/generic
8140                                                             load/store/load
8141                                                             atomic/store
8142                                                             atomic/atomicrmw.
8143                                                           - s_waitcnt lgkmcnt(0)
8144                                                             must happen after
8145                                                             any preceding
8146                                                             local/generic
8147                                                             load/store/load
8148                                                             atomic/store
8149                                                             atomic/atomicrmw.
8150                                                           - Must happen before
8151                                                             the following buffer_invl2 and
8152                                                             buffer_wbinvl1_vol.
8153                                                           - Ensures that the
8154                                                             preceding
8155                                                             global/local/generic
8156                                                             load
8157                                                             atomic/atomicrmw
8158                                                             with an equal or
8159                                                             wider sync scope
8160                                                             and memory ordering
8161                                                             stronger than
8162                                                             unordered (this is
8163                                                             termed the
8164                                                             acquire-fence-paired-atomic)
8165                                                             has completed
8166                                                             before invalidating
8167                                                             the cache. This
8168                                                             satisfies the
8169                                                             requirements of
8170                                                             acquire.
8171                                                           - Ensures that all
8172                                                             previous memory
8173                                                             operations have
8174                                                             completed before a
8175                                                             following
8176                                                             global/local/generic
8177                                                             store
8178                                                             atomic/atomicrmw
8179                                                             with an equal or
8180                                                             wider sync scope
8181                                                             and memory ordering
8182                                                             stronger than
8183                                                             unordered (this is
8184                                                             termed the
8185                                                             release-fence-paired-atomic).
8186                                                             This satisfies the
8187                                                             requirements of
8188                                                             release.
8189
8190                                                         3. buffer_invl2;
8191                                                            buffer_wbinvl1_vol
8192
8193                                                           - Must happen before
8194                                                             any following
8195                                                             global/generic
8196                                                             load/load
8197                                                             atomic/store/store
8198                                                             atomic/atomicrmw.
8199                                                           - Ensures that
8200                                                             following
8201                                                             loads will not see
8202                                                             stale L1 global data,
8203                                                             nor see stale L2 MTYPE
8204                                                             NC global data.
8205                                                             MTYPE RW and CC memory will
8206                                                             never be stale in L2 due to
8207                                                             the memory probes.
8208
8209     **Sequentially Consistent Atomic**
8210     ------------------------------------------------------------------------------------
8211     load atomic  seq_cst      - singlethread - global   *Same as corresponding
8212                               - wavefront    - local    load atomic acquire,
8213                                              - generic  except must generate
8214                                                         all instructions even
8215                                                         for OpenCL.*
8216     load atomic  seq_cst      - workgroup    - global   1. s_waitcnt lgkm/vmcnt(0)
8217                                              - generic
8218                                                           - Use lgkmcnt(0) if not
8219                                                             TgSplit execution mode
8220                                                             and vmcnt(0) if TgSplit
8221                                                             execution mode.
8222                                                           - s_waitcnt lgkmcnt(0) must
8223                                                             happen after
8224                                                             preceding
8225                                                             local/generic load
8226                                                             atomic/store
8227                                                             atomic/atomicrmw
8228                                                             with memory
8229                                                             ordering of seq_cst
8230                                                             and with equal or
8231                                                             wider sync scope.
8232                                                             (Note that seq_cst
8233                                                             fences have their
8234                                                             own s_waitcnt
8235                                                             lgkmcnt(0) and so do
8236                                                             not need to be
8237                                                             considered.)
8238                                                           - s_waitcnt vmcnt(0)
8239                                                             must happen after
8240                                                             preceding
8241                                                             global/generic load
8242                                                             atomic/store
8243                                                             atomic/atomicrmw
8244                                                             with memory
8245                                                             ordering of seq_cst
8246                                                             and with equal or
8247                                                             wider sync scope.
8248                                                             (Note that seq_cst
8249                                                             fences have their
8250                                                             own s_waitcnt
8251                                                             vmcnt(0) and so do
8252                                                             not need to be
8253                                                             considered.)
8254                                                           - Ensures any
8255                                                             preceding
8256                                                             sequentially
8257                                                             consistent global/local
8258                                                             memory instructions
8259                                                             have completed
8260                                                             before executing
8261                                                             this sequentially
8262                                                             consistent
8263                                                             instruction. This
8264                                                             prevents reordering
8265                                                             a seq_cst store
8266                                                             followed by a
8267                                                             seq_cst load. (Note
8268                                                             that seq_cst is
8269                                                             stronger than
8270                                                             acquire/release as
8271                                                             the reordering of
8272                                                             load acquire
8273                                                             followed by a store
8274                                                             release is
8275                                                             prevented by the
8276                                                             s_waitcnt of
8277                                                             the release, but
8278                                                             there is nothing
8279                                                             preventing a store
8280                                                             release followed by
8281                                                             load acquire from
8282                                                             completing out of
8283                                                             order. The s_waitcnt
8284                                                             could be placed after
8285                                                             the seq_cst store or before
8286                                                             the seq_cst load. We
8287                                                             choose the load to
8288                                                             make the s_waitcnt be
8289                                                             as late as possible
8290                                                             so that the store
8291                                                             may have already
8292                                                             completed.)
8293
8294                                                         2. *Following
8295                                                            instructions same as
8296                                                            corresponding load
8297                                                            atomic acquire,
8298                                                            except must generate
8299                                                            all instructions even
8300                                                            for OpenCL.*
8301     load atomic  seq_cst      - workgroup    - local    *If TgSplit execution mode,
8302                                                         local address space cannot
8303                                                         be used.*
8304
8305                                                         *Same as corresponding
8306                                                         load atomic acquire,
8307                                                         except must generate
8308                                                         all instructions even
8309                                                         for OpenCL.*
8310
8311     load atomic  seq_cst      - agent        - global   1. s_waitcnt lgkmcnt(0) &
8312                               - system       - generic     vmcnt(0)
8313
8314                                                           - If TgSplit execution mode,
8315                                                             omit lgkmcnt(0).
8316                                                           - Could be split into
8317                                                             separate s_waitcnt
8318                                                             vmcnt(0)
8319                                                             and s_waitcnt
8320                                                             lgkmcnt(0) to allow
8321                                                             them to be
8322                                                             independently moved
8323                                                             according to the
8324                                                             following rules.
8325                                                           - s_waitcnt lgkmcnt(0)
8326                                                             must happen after
8327                                                             preceding
8328                                                             global/generic load
8329                                                             atomic/store
8330                                                             atomic/atomicrmw
8331                                                             with memory
8332                                                             ordering of seq_cst
8333                                                             and with equal or
8334                                                             wider sync scope.
8335                                                             (Note that seq_cst
8336                                                             fences have their
8337                                                             own s_waitcnt
8338                                                             lgkmcnt(0) and so do
8339                                                             not need to be
8340                                                             considered.)
8341                                                           - s_waitcnt vmcnt(0)
8342                                                             must happen after
8343                                                             preceding
8344                                                             global/generic load
8345                                                             atomic/store
8346                                                             atomic/atomicrmw
8347                                                             with memory
8348                                                             ordering of seq_cst
8349                                                             and with equal or
8350                                                             wider sync scope.
8351                                                             (Note that seq_cst
8352                                                             fences have their
8353                                                             own s_waitcnt
8354                                                             vmcnt(0) and so do
8355                                                             not need to be
8356                                                             considered.)
8357                                                           - Ensures any
8358                                                             preceding
8359                                                             sequentially
8360                                                             consistent global
8361                                                             memory instructions
8362                                                             have completed
8363                                                             before executing
8364                                                             this sequentially
8365                                                             consistent
8366                                                             instruction. This
8367                                                             prevents reordering
8368                                                             a seq_cst store
8369                                                             followed by a
8370                                                             seq_cst load. (Note
8371                                                             that seq_cst is
8372                                                             stronger than
8373                                                             acquire/release as
8374                                                             the reordering of
8375                                                             load acquire
8376                                                             followed by a store
8377                                                             release is
8378                                                             prevented by the
8379                                                             s_waitcnt of
8380                                                             the release, but
8381                                                             there is nothing
8382                                                             preventing a store
8383                                                             release followed by
8384                                                             load acquire from
8385                                                             completing out of
8386                                                             order. The s_waitcnt
8387                                                             could be placed after
8388                                                             the seq_cst store or
8389                                                             before the seq_cst load.
8390                                                             We choose the load to
8391                                                             make the s_waitcnt be
8392                                                             as late as possible
8393                                                             so that the store
8394                                                             may have already
8395                                                             completed.)
8396
8397                                                         2. *Following
8398                                                            instructions same as
8399                                                            corresponding load
8400                                                            atomic acquire,
8401                                                            except must generate
8402                                                            all instructions even
8403                                                            for OpenCL.*
8404     store atomic seq_cst      - singlethread - global   *Same as corresponding
8405                               - wavefront    - local    store atomic release,
8406                               - workgroup    - generic  except must generate
8407                               - agent                   all instructions even
8408                               - system                  for OpenCL.*
8409     atomicrmw    seq_cst      - singlethread - global   *Same as corresponding
8410                               - wavefront    - local    atomicrmw acq_rel,
8411                               - workgroup    - generic  except must generate
8412                               - agent                   all instructions even
8413                               - system                  for OpenCL.*
8414     fence        seq_cst      - singlethread *none*     *Same as corresponding
8415                               - wavefront               fence acq_rel,
8416                               - workgroup               except must generate
8417                               - agent                   all instructions even
8418                               - system                  for OpenCL.*
8419     ============ ============ ============== ========== ================================
8420
8421.. _amdgpu-amdhsa-memory-model-gfx10:
8422
8423Memory Model GFX10
8424++++++++++++++++++
8425
8426For GFX10:
8427
8428* Each agent has multiple shader arrays (SA).
8429* Each SA has multiple work-group processors (WGP).
8430* Each WGP has multiple compute units (CU).
8431* Each CU has multiple SIMDs that execute wavefronts.
8432* The wavefronts for a single work-group are executed in the same
8433  WGP. In CU wavefront execution mode the wavefronts may be executed by
8434  different SIMDs in the same CU. In WGP wavefront execution mode the
8435  wavefronts may be executed by different SIMDs in different CUs in the same
8436  WGP.
8437* Each WGP has a single LDS memory shared by the wavefronts of the work-groups
8438  executing on it.
8439* All LDS operations of a WGP are performed as wavefront wide operations in a
8440  global order and involve no caching. Completion is reported to a wavefront in
8441  execution order.
8442* The LDS memory has multiple request queues shared by the SIMDs of a
8443  WGP. Therefore, the LDS operations performed by different wavefronts of a
8444  work-group can be reordered relative to each other, which can result in
8445  reordering the visibility of vector memory operations with respect to LDS
8446  operations of other wavefronts in the same work-group. A ``s_waitcnt
8447  lgkmcnt(0)`` is required to ensure synchronization between LDS operations and
8448  vector memory operations between wavefronts of a work-group, but not between
8449  operations performed by the same wavefront.
8450* The vector memory operations are performed as wavefront wide operations.
8451  Completion of load/store/sample operations is reported to a wavefront in
8452  execution order of other load/store/sample operations performed by that
8453  wavefront.
8454* The vector memory operations access a vector L0 cache. There is a single L0
8455  cache per CU. Each SIMD of a CU accesses the same L0 cache. Therefore, no
8456  special action is required for coherence between the lanes of a single
8457  wavefront. However, a ``buffer_gl0_inv`` is required for coherence between
8458  wavefronts executing in the same work-group as they may be executing on SIMDs
8459  of different CUs that access different L0s. A ``buffer_gl0_inv`` is also
8460  required for coherence between wavefronts executing in different work-groups
8461  as they may be executing on different WGPs.
8462* The scalar memory operations access a scalar L0 cache shared by all wavefronts
8463  on a WGP. The scalar and vector L0 caches are not coherent. However, scalar
8464  operations are used in a restricted way so do not impact the memory model. See
8465  :ref:`amdgpu-amdhsa-memory-spaces`.
8466* The vector and scalar memory L0 caches use an L1 cache shared by all WGPs on
8467  the same SA. Therefore, no special action is required for coherence between
8468  the wavefronts of a single work-group. However, a ``buffer_gl1_inv`` is
8469  required for coherence between wavefronts executing in different work-groups
8470  as they may be executing on different SAs that access different L1s.
8471* The L1 caches have independent quadrants to service disjoint ranges of virtual
8472  addresses.
8473* Each L0 cache has a separate request queue per L1 quadrant. Therefore, the
8474  vector and scalar memory operations performed by different wavefronts, whether
8475  executing in the same or different work-groups (which may be executing on
8476  different CUs accessing different L0s), can be reordered relative to each
8477  other. A ``s_waitcnt vmcnt(0) & vscnt(0)`` is required to ensure
8478  synchronization between vector memory operations of different wavefronts. It
8479  ensures a previous vector memory operation has completed before executing a
8480  subsequent vector memory or LDS operation and so can be used to meet the
8481  requirements of acquire, release and sequential consistency.
8482* The L1 caches use an L2 cache shared by all SAs on the same agent.
8483* The L2 cache has independent channels to service disjoint ranges of virtual
8484  addresses.
8485* Each L1 quadrant of a single SA accesses a different L2 channel. Each L1
8486  quadrant has a separate request queue per L2 channel. Therefore, the vector
8487  and scalar memory operations performed by wavefronts executing in different
8488  work-groups (which may be executing on different SAs) of an agent can be
8489  reordered relative to each other. A ``s_waitcnt vmcnt(0) & vscnt(0)`` is
8490  required to ensure synchronization between vector memory operations of
8491  different SAs. It ensures a previous vector memory operation has completed
8492  before executing a subsequent vector memory operation and so can be used to
8493  meet the requirements of acquire, release and sequential consistency.
8494* The L2 cache can be kept coherent with other agents on some targets, or ranges
8495  of virtual addresses can be set up to bypass it to ensure system coherence.
8496
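The coherence rules in the list above can be summarized per sync scope. The
following Python sketch is an illustration only and is not part of the backend:
the function name ``cache_invalidates_for_scope`` and the scope strings are
assumptions made for this example, and it assumes WGP wavefront execution mode,
where the wavefronts of a work-group may execute on different CUs.

```python
# Illustrative model only; the names here are hypothetical, not LLVM APIs.
# Assumes WGP wavefront execution mode.

# Sync scopes ordered from narrowest to widest.
SCOPES = ("singlethread", "wavefront", "workgroup", "agent", "system")


def cache_invalidates_for_scope(scope):
    """Return the cache invalidates needed for coherence at `scope`."""
    if scope not in SCOPES:
        raise ValueError(f"unknown sync scope: {scope}")
    if scope in ("singlethread", "wavefront"):
        # All lanes of a wavefront use the same per-CU L0 cache.
        return []
    if scope == "workgroup":
        # Wavefronts of a work-group may run on different CUs (different L0s)
        # but share the L1 cache of their shader array.
        return ["buffer_gl0_inv"]
    # Different work-groups may run on different WGPs and different SAs,
    # so both the L0 and the L1 caches may hold stale data.
    return ["buffer_gl0_inv", "buffer_gl1_inv"]


for s in SCOPES:
    print(s, cache_invalidates_for_scope(s))
```

The sketch treats agent and system scope alike because it only models the L0
and L1 caches; how the L2 cache is kept coherent with other agents depends on
the target.
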
8497Scalar memory operations are only used to access memory that is proven to not
8498change during the execution of the kernel dispatch. This includes constant
8499address space and global address space for program scope ``const`` variables.
8500Therefore, the kernel machine code does not have to maintain the scalar cache to
8501ensure it is coherent with the vector caches. The scalar and vector caches are
8502invalidated between kernel dispatches by CP since constant address space data
8503may change between kernel dispatch executions. See
8504:ref:`amdgpu-amdhsa-memory-spaces`.
8505
8506The one exception is if scalar writes are used to spill SGPR registers. In this
8507case the AMDGPU backend ensures the memory location used to spill is never
8508accessed by vector memory operations at the same time. If scalar writes are used
8509then a ``s_dcache_wb`` is inserted before the ``s_endpgm`` and before a function
8510return since the locations may be used for vector memory instructions by a
8511future wavefront that uses the same scratch area, or a function call that
8512creates a frame at the same address, respectively. There is no need for a
8513``s_dcache_inv`` as all scalar writes are write-before-read in the same thread.
8514
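The write-back placement described above can be sketched as a small pass over a
flat list of mnemonics. This is a hypothetical illustration in Python, not the
backend's implementation: ``insert_scalar_writeback``, the list-of-strings
instruction representation, and the use of ``s_setpc_b64`` to stand in for a
function return are all assumptions of the example.

```python
# Hypothetical sketch; not the AMDGPU backend's actual pass.

def insert_scalar_writeback(instrs, uses_scalar_writes,
                            exits=("s_endpgm", "s_setpc_b64")):
    """Insert s_dcache_wb before each program end or function return.

    `exits` lists the mnemonics treated as exits. Treating s_setpc_b64 as
    the function return is an assumption made for this example.
    """
    if not uses_scalar_writes:
        # Without scalar writes there is nothing to write back.
        return list(instrs)
    out = []
    for ins in instrs:
        # Compare only the mnemonic, ignoring any operands.
        if ins.split()[0] in exits:
            out.append("s_dcache_wb")
        out.append(ins)
    return out


body = ["s_nop 0", "s_endpgm"]
print(insert_scalar_writeback(body, uses_scalar_writes=True))
```

No ``s_dcache_inv`` is modeled, matching the text: scalar writes are always
write-before-read in the same thread.
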
8515For kernarg backing memory:
8516
8517* CP invalidates the L0 and L1 caches at the start of each kernel dispatch.
8518* On dGPU the kernarg backing memory is accessed as MTYPE UC (uncached) to avoid
8519  needing to invalidate the L2 cache.
8520* On APU the kernarg backing memory is accessed as MTYPE CC (cache coherent) and
8521  so the L2 cache will be coherent with the CPU and other agents.
8522
8523Scratch backing memory (which is used for the private address space) is accessed
8524with MTYPE NC (non-coherent). Since the private address space is only accessed
8525by a single thread, and is always write-before-read, there is never a need to
8526invalidate these entries from the L0 or L1 caches.
8527
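The MTYPE choices for kernarg and scratch backing memory amount to a small
lookup. The Python sketch below restates them; the function name ``mtype_for``
and its arguments are hypothetical and exist only for this illustration.

```python
# Illustrative lookup; the function and argument names are hypothetical.

def mtype_for(backing, is_apu=False):
    """Return the MTYPE used for a kind of backing memory on GFX10."""
    if backing == "kernarg":
        # dGPU: uncached (UC), so the L2 cache need not be invalidated.
        # APU: cache coherent (CC) with the CPU and other agents.
        return "CC" if is_apu else "UC"
    if backing == "scratch":
        # Private memory is accessed by a single thread and is always
        # write-before-read, so non-coherent (NC) is sufficient.
        return "NC"
    raise ValueError(f"no MTYPE rule modeled for {backing!r}")


print(mtype_for("kernarg"), mtype_for("kernarg", is_apu=True), mtype_for("scratch"))
```
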
8528Wavefronts are executed in native mode with in-order reporting of loads and
8529sample instructions. In this mode, vmcnt reports completion of load, atomic with
8530return, and sample instructions in order, and vscnt reports completion of store
8531and atomic without return in order. See the ``MEM_ORDERED`` field in
8532:ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
8533
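The split between the two counters can be modeled as below. This is a minimal
sketch, not backend code: the operation-kind strings and the helper names
``counter_for`` and ``counters_to_wait`` are assumptions made for this example.

```python
# Hypothetical model of which GFX10 counter tracks which vector memory op.

VMCNT_OPS = {"load", "atomic_with_return", "sample"}
VSCNT_OPS = {"store", "atomic_without_return"}


def counter_for(op):
    """Return the counter that reports completion of `op`."""
    if op in VMCNT_OPS:
        return "vmcnt"
    if op in VSCNT_OPS:
        return "vscnt"
    raise ValueError(f"unknown vector memory operation kind: {op}")


def counters_to_wait(ops):
    """Return the sorted counters that must reach zero for `ops` to complete."""
    return sorted({counter_for(op) for op in ops})


print(counters_to_wait(["load", "store"]))
```

A wavefront that has issued only stores therefore needs to wait on vscnt alone,
while a mix of loads and stores requires waiting on both counters.
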
8534Wavefronts can be executed in WGP or CU wavefront execution mode:
8535
8536* In WGP wavefront execution mode the wavefronts of a work-group are executed
8537  on the SIMDs of both CUs of the WGP. Therefore, explicit management of the per
8538  CU L0 caches is required for work-group synchronization. Also, accesses to L1
8539  at work-group scope need to be explicitly ordered as the accesses from
8540  different CUs are not ordered.
8541* In CU wavefront execution mode the wavefronts of a work-group are executed on
8542  the SIMDs of a single CU of the WGP. Therefore, all global memory accesses by
8543  the work-group use the same L0, which in turn ensures L1 accesses are ordered,
8544  and so explicit management of the caches is not required for work-group
8545  synchronization.
8546
8547See the ``WGP_MODE`` field in
8548:ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table` and
8549:ref:`amdgpu-target-features`.
8550
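As an illustration of how the execution mode changes the generated code, the
Python sketch below builds the sequence for a ``load atomic acquire`` at
work-group scope on the global address space, following the corresponding row
of the GFX10 code sequences table. The function name and the list-of-strings
representation are hypothetical; the strings use this document's table
notation and are not assembler syntax.

```python
# Hypothetical sketch of one row of the GFX10 code sequence table.

def load_acquire_workgroup_global(cu_mode):
    """Return the code sequence for a work-group scope acquire load (global)."""
    if cu_mode:
        # All wavefronts of the work-group share one L0 cache, so the
        # glc=1 bit, the wait, and the L0 invalidate are all omitted.
        return ["buffer/global_load"]
    return [
        "buffer/global_load glc=1",
        "s_waitcnt vmcnt(0)",   # the load must complete before the invalidate
        "buffer_gl0_inv",       # following loads will not see stale data
    ]


print(load_acquire_workgroup_global(cu_mode=False))
print(load_acquire_workgroup_global(cu_mode=True))
```
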
8551The code sequences used to implement the memory model for GFX10 are defined in
8552table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx10-table`.
8553
8554  .. table:: AMDHSA Memory Model Code Sequences GFX10
8555     :name: amdgpu-amdhsa-memory-model-code-sequences-gfx10-table
8556
8557     ============ ============ ============== ========== ================================
8558     LLVM Instr   LLVM Memory  LLVM Memory    AMDGPU     AMDGPU Machine Code
8559                  Ordering     Sync Scope     Address    GFX10
8560                                              Space
8561     ============ ============ ============== ========== ================================
8562     **Non-Atomic**
8563     ------------------------------------------------------------------------------------
8564     load         *none*       *none*         - global   - !volatile & !nontemporal
8565                                              - generic
8566                                              - private    1. buffer/global/flat_load
8567                                              - constant
8568                                                         - !volatile & nontemporal
8569
8570                                                           1. buffer/global/flat_load
8571                                                              slc=1
8572
8573                                                         - volatile
8574
8575                                                           1. buffer/global/flat_load
8576                                                              glc=1 dlc=1
8577                                                           2. s_waitcnt vmcnt(0)
8578
8579                                                            - Must happen before
8580                                                              any following volatile
8581                                                              global/generic
8582                                                              load/store.
8583                                                            - Ensures that
8584                                                              volatile
8585                                                              operations to
8586                                                              different
8587                                                              addresses will not
8588                                                              be reordered by
8589                                                              hardware.
8590
8591     load         *none*       *none*         - local    1. ds_load
8592     store        *none*       *none*         - global   - !volatile & !nontemporal
8593                                              - generic
8594                                              - private    1. buffer/global/flat_store
8595                                              - constant
8596                                                         - !volatile & nontemporal
8597
8598                                                            1. buffer/global/flat_store
8599                                                               slc=1
8600
8601                                                         - volatile
8602
8603                                                            1. buffer/global/flat_store
8604                                                            2. s_waitcnt vscnt(0)
8605
8606                                                            - Must happen before
8607                                                              any following volatile
8608                                                              global/generic
8609                                                              load/store.
8610                                                            - Ensures that
8611                                                              volatile
8612                                                              operations to
8613                                                              different
8614                                                              addresses will not
8615                                                              be reordered by
8616                                                              hardware.
8617
8618     store        *none*       *none*         - local    1. ds_store
8619     **Unordered Atomic**
8620     ------------------------------------------------------------------------------------
8621     load atomic  unordered    *any*          *any*      *Same as non-atomic*.
8622     store atomic unordered    *any*          *any*      *Same as non-atomic*.
8623     atomicrmw    unordered    *any*          *any*      *Same as monotonic atomic*.
8624     **Monotonic Atomic**
8625     ------------------------------------------------------------------------------------
8626     load atomic  monotonic    - singlethread - global   1. buffer/global/flat_load
8627                               - wavefront    - generic
8628     load atomic  monotonic    - workgroup    - global   1. buffer/global/flat_load
8629                                              - generic     glc=1
8630
8631                                                           - If CU wavefront execution
8632                                                             mode, omit glc=1.
8633
8634     load atomic  monotonic    - singlethread - local    1. ds_load
8635                               - wavefront
8636                               - workgroup
8637     load atomic  monotonic    - agent        - global   1. buffer/global/flat_load
8638                               - system       - generic     glc=1 dlc=1
8639     store atomic monotonic    - singlethread - global   1. buffer/global/flat_store
8640                               - wavefront    - generic
8641                               - workgroup
8642                               - agent
8643                               - system
8644     store atomic monotonic    - singlethread - local    1. ds_store
8645                               - wavefront
8646                               - workgroup
8647     atomicrmw    monotonic    - singlethread - global   1. buffer/global/flat_atomic
8648                               - wavefront    - generic
8649                               - workgroup
8650                               - agent
8651                               - system
8652     atomicrmw    monotonic    - singlethread - local    1. ds_atomic
8653                               - wavefront
8654                               - workgroup
8655     **Acquire Atomic**
8656     ------------------------------------------------------------------------------------
8657     load atomic  acquire      - singlethread - global   1. buffer/global/ds/flat_load
8658                               - wavefront    - local
8659                                              - generic
8660     load atomic  acquire      - workgroup    - global   1. buffer/global_load glc=1
8661
8662                                                           - If CU wavefront execution
8663                                                             mode, omit glc=1.
8664
8665                                                         2. s_waitcnt vmcnt(0)
8666
8667                                                           - If CU wavefront execution
8668                                                             mode, omit.
8669                                                           - Must happen before
8670                                                             the following buffer_gl0_inv
8671                                                             and before any following
8672                                                             global/generic
8673                                                             load/load
8674                                                             atomic/store/store
8675                                                             atomic/atomicrmw.
8676
8677                                                         3. buffer_gl0_inv
8678
8679                                                           - If CU wavefront execution
8680                                                             mode, omit.
8681                                                           - Ensures that
8682                                                             following
8683                                                             loads will not see
8684                                                             stale data.
8685
8686     load atomic  acquire      - workgroup    - local    1. ds_load
8687                                                         2. s_waitcnt lgkmcnt(0)
8688
8689                                                           - If OpenCL, omit.
8690                                                           - Must happen before
8691                                                             the following buffer_gl0_inv
8692                                                             and before any following
8693                                                             global/generic load/load
8694                                                             atomic/store/store
8695                                                             atomic/atomicrmw.
8696                                                           - Ensures any
8697                                                             following global
8698                                                             data read is no
8699                                                             older than the local load
8700                                                             atomic value being
8701                                                             acquired.
8702
8703                                                         3. buffer_gl0_inv
8704
8705                                                           - If CU wavefront execution
8706                                                             mode, omit.
8707                                                           - If OpenCL, omit.
8708                                                           - Ensures that
8709                                                             following
8710                                                             loads will not see
8711                                                             stale data.
8712
8713     load atomic  acquire      - workgroup    - generic  1. flat_load glc=1
8714
8715                                                           - If CU wavefront execution
8716                                                             mode, omit glc=1.
8717
8718                                                         2. s_waitcnt lgkmcnt(0) &
8719                                                            vmcnt(0)
8720
8721                                                           - If CU wavefront execution
8722                                                             mode, omit vmcnt(0).
8723                                                           - If OpenCL, omit
8724                                                             lgkmcnt(0).
8725                                                           - Must happen before
8726                                                             the following
8727                                                             buffer_gl0_inv and any
8728                                                             following global/generic
8729                                                             load/load
8730                                                             atomic/store/store
8731                                                             atomic/atomicrmw.
8732                                                           - Ensures any
8733                                                             following global
8734                                                             data read is no
8735                                                             older than a local load
8736                                                             atomic value being
8737                                                             acquired.
8738
8739                                                         3. buffer_gl0_inv
8740
8741                                                           - If CU wavefront execution
8742                                                             mode, omit.
8743                                                           - Ensures that
8744                                                             following
8745                                                             loads will not see
8746                                                             stale data.
8747
8748     load atomic  acquire      - agent        - global   1. buffer/global_load
8749                               - system                     glc=1 dlc=1
8750                                                         2. s_waitcnt vmcnt(0)
8751
8752                                                           - Must happen before
8753                                                             following
8754                                                             buffer_gl*_inv.
8755                                                           - Ensures the load
8756                                                             has completed
8757                                                             before invalidating
8758                                                             the caches.
8759
8760                                                         3. buffer_gl0_inv;
8761                                                            buffer_gl1_inv
8762
8763                                                           - Must happen before
8764                                                             any following
8765                                                             global/generic
8766                                                             load/load
8767                                                             atomic/atomicrmw.
8768                                                           - Ensures that
8769                                                             following
8770                                                             loads will not see
8771                                                             stale global data.
8772
8773     load atomic  acquire      - agent        - generic  1. flat_load glc=1 dlc=1
8774                               - system                  2. s_waitcnt vmcnt(0) &
8775                                                            lgkmcnt(0)
8776
8777                                                           - If OpenCL, omit
8778                                                             lgkmcnt(0).
8779                                                           - Must happen before
8780                                                             following
8781                                                             buffer_gl*_inv.
8782                                                           - Ensures the flat_load
8783                                                             has completed
8784                                                             before invalidating
8785                                                             the caches.
8786
8787                                                         3. buffer_gl0_inv;
8788                                                            buffer_gl1_inv
8789
8790                                                           - Must happen before
8791                                                             any following
8792                                                             global/generic
8793                                                             load/load
8794                                                             atomic/atomicrmw.
8795                                                           - Ensures that
8796                                                             following loads
8797                                                             will not see stale
8798                                                             global data.
8799
8800     atomicrmw    acquire      - singlethread - global   1. buffer/global/ds/flat_atomic
8801                               - wavefront    - local
8802                                              - generic
8803     atomicrmw    acquire      - workgroup    - global   1. buffer/global_atomic
8804                                                         2. s_waitcnt vm/vscnt(0)
8805
8806                                                           - If CU wavefront execution
8807                                                             mode, omit.
8808                                                           - Use vmcnt(0) if atomic with
8809                                                             return and vscnt(0) if
8810                                                             atomic with no-return.
8811                                                           - Must happen before
8812                                                             the following buffer_gl0_inv
8813                                                             and before any following
8814                                                             global/generic
8815                                                             load/load
8816                                                             atomic/store/store
8817                                                             atomic/atomicrmw.
8818
8819                                                         3. buffer_gl0_inv
8820
8821                                                           - If CU wavefront execution
8822                                                             mode, omit.
8823                                                           - Ensures that
8824                                                             following
8825                                                             loads will not see
8826                                                             stale data.
8827
8828     atomicrmw    acquire      - workgroup    - local    1. ds_atomic
8829                                                         2. s_waitcnt lgkmcnt(0)
8830
8831                                                           - If OpenCL, omit.
8832                                                           - Must happen before
8833                                                             the following
8834                                                             buffer_gl0_inv.
8835                                                           - Ensures any
8836                                                             following global
8837                                                             data read is no
8838                                                             older than the local
8839                                                             atomicrmw value
8840                                                             being acquired.
8841
8842                                                         3. buffer_gl0_inv
8843
8844                                                           - If OpenCL, omit.
8845                                                           - Ensures that
8846                                                             following
8847                                                             loads will not see
8848                                                             stale data.
8849
8850     atomicrmw    acquire      - workgroup    - generic  1. flat_atomic
8851                                                         2. s_waitcnt lgkmcnt(0) &
8852                                                            vm/vscnt(0)
8853
8854                                                           - If CU wavefront execution
8855                                                             mode, omit vm/vscnt(0).
8856                                                           - If OpenCL, omit lgkmcnt(0).
8857                                                           - Use vmcnt(0) if atomic with
8858                                                             return and vscnt(0) if
8859                                                             atomic with no-return.
8860                                                           - Must happen before
8861                                                             the following
8862                                                             buffer_gl0_inv.
8863                                                           - Ensures any
8864                                                             following global
8865                                                             data read is no
8866                                                             older than a local
8867                                                             atomicrmw value
8868                                                             being acquired.
8869
8870                                                         3. buffer_gl0_inv
8871
8872                                                           - If CU wavefront execution
8873                                                             mode, omit.
8874                                                           - Ensures that
8875                                                             following
8876                                                             loads will not see
8877                                                             stale data.
8878
8879     atomicrmw    acquire      - agent        - global   1. buffer/global_atomic
8880                               - system                  2. s_waitcnt vm/vscnt(0)
8881
8882                                                           - Use vmcnt(0) if atomic with
8883                                                             return and vscnt(0) if
8884                                                             atomic with no-return.
8885                                                           - Must happen before
8886                                                             following
8887                                                             buffer_gl*_inv.
8888                                                           - Ensures the
8889                                                             atomicrmw has
8890                                                             completed before
8891                                                             invalidating the
8892                                                             caches.
8893
8894                                                         3. buffer_gl0_inv;
8895                                                            buffer_gl1_inv
8896
8897                                                           - Must happen before
8898                                                             any following
8899                                                             global/generic
8900                                                             load/load
8901                                                             atomic/atomicrmw.
8902                                                           - Ensures that
8903                                                             following loads
8904                                                             will not see stale
8905                                                             global data.
8906
8907     atomicrmw    acquire      - agent        - generic  1. flat_atomic
8908                               - system                  2. s_waitcnt vm/vscnt(0) &
8909                                                            lgkmcnt(0)
8910
8911                                                           - If OpenCL, omit
8912                                                             lgkmcnt(0).
8913                                                           - Use vmcnt(0) if atomic with
8914                                                             return and vscnt(0) if
8915                                                             atomic with no-return.
8916                                                           - Must happen before
8917                                                             following
8918                                                             buffer_gl*_inv.
8919                                                           - Ensures the
8920                                                             atomicrmw has
8921                                                             completed before
8922                                                             invalidating the
8923                                                             caches.
8924
8925                                                         3. buffer_gl0_inv;
8926                                                            buffer_gl1_inv
8927
8928                                                           - Must happen before
8929                                                             any following
8930                                                             global/generic
8931                                                             load/load
8932                                                             atomic/atomicrmw.
8933                                                           - Ensures that
8934                                                             following loads
8935                                                             will not see stale
8936                                                             global data.
8937
8938     fence        acquire      - singlethread *none*     *none*
8939                               - wavefront
8940     fence        acquire      - workgroup    *none*     1. s_waitcnt lgkmcnt(0) &
8941                                                            vmcnt(0) & vscnt(0)
8942
8943                                                           - If CU wavefront execution
8944                                                             mode, omit vmcnt(0) and
8945                                                             vscnt(0).
8946                                                           - If OpenCL and
8947                                                             address space is
8948                                                             not generic, omit
8949                                                             lgkmcnt(0).
8950                                                           - If OpenCL and
8951                                                             address space is
8952                                                             local, omit
8953                                                             vmcnt(0) and vscnt(0).
8954                                                           - However, since LLVM
8955                                                             currently has no
8956                                                             address space on
8957                                                             the fence, the waits
8958                                                             must conservatively
8959                                                             always be generated.
8960                                                             If the fence had an
8961                                                             address space, it
8962                                                             would be set to the
8963                                                             address space of the
8964                                                             OpenCL fence flag,
8965                                                             or to generic if
8966                                                             both local and
8967                                                             global flags are
8968                                                             specified.
8969                                                           - Could be split into
8970                                                             separate s_waitcnt
8971                                                             vmcnt(0), s_waitcnt
8972                                                             vscnt(0) and s_waitcnt
8973                                                             lgkmcnt(0) to allow
8974                                                             them to be
8975                                                             independently moved
8976                                                             according to the
8977                                                             following rules.
8978                                                           - s_waitcnt vmcnt(0)
8979                                                             must happen after
8980                                                             any preceding
8981                                                             global/generic load
8982                                                             atomic/
8983                                                             atomicrmw-with-return-value
8984                                                             with an equal or
8985                                                             wider sync scope
8986                                                             and memory ordering
8987                                                             stronger than
8988                                                             unordered (this is
8989                                                             termed the
8990                                                             fence-paired-atomic).
8991                                                           - s_waitcnt vscnt(0)
8992                                                             must happen after
8993                                                             any preceding
8994                                                             global/generic
8995                                                             atomicrmw-no-return-value
8996                                                             with an equal or
8997                                                             wider sync scope
8998                                                             and memory ordering
8999                                                             stronger than
9000                                                             unordered (this is
9001                                                             termed the
9002                                                             fence-paired-atomic).
9003                                                           - s_waitcnt lgkmcnt(0)
9004                                                             must happen after
9005                                                             any preceding
9006                                                             local/generic load
9007                                                             atomic/atomicrmw
9008                                                             with an equal or
9009                                                             wider sync scope
9010                                                             and memory ordering
9011                                                             stronger than
9012                                                             unordered (this is
9013                                                             termed the
9014                                                             fence-paired-atomic).
9015                                                           - Must happen before
9016                                                             the following
9017                                                             buffer_gl0_inv.
9018                                                           - Ensures that the
9019                                                             fence-paired-atomic
9020                                                             has completed
9021                                                             before invalidating
9022                                                             the
9023                                                             cache. Therefore
9024                                                             any following
9025                                                             locations read must
9026                                                             be no older than
9027                                                             the value read by
9028                                                             the
9029                                                             fence-paired-atomic.
9030
9031                                                         2. buffer_gl0_inv
9032
9033                                                           - If CU wavefront execution
9034                                                             mode, omit.
9035                                                           - Ensures that
9036                                                             following
9037                                                             loads will not see
9038                                                             stale data.
9039
9040     fence        acquire      - agent        *none*     1. s_waitcnt lgkmcnt(0) &
9041                               - system                     vmcnt(0) & vscnt(0)
9042
9043                                                           - If OpenCL and
9044                                                             address space is
9045                                                             not generic, omit
9046                                                             lgkmcnt(0).
9047                                                           - If OpenCL and
9048                                                             address space is
9049                                                             local, omit
9050                                                             vmcnt(0) and vscnt(0).
9051                                                           - However, since LLVM
9052                                                             currently has no
9053                                                             address space on
9054                                                             the fence, the waits
9055                                                             must conservatively
9056                                                             always be generated
9057                                                             (see comment for
9058                                                             previous fence).
9059                                                           - Could be split into
9060                                                             separate s_waitcnt
9061                                                             vmcnt(0), s_waitcnt
9062                                                             vscnt(0) and s_waitcnt
9063                                                             lgkmcnt(0) to allow
9064                                                             them to be
9065                                                             independently moved
9066                                                             according to the
9067                                                             following rules.
9068                                                           - s_waitcnt vmcnt(0)
9069                                                             must happen after
9070                                                             any preceding
9071                                                             global/generic load
9072                                                             atomic/
9073                                                             atomicrmw-with-return-value
9074                                                             with an equal or
9075                                                             wider sync scope
9076                                                             and memory ordering
9077                                                             stronger than
9078                                                             unordered (this is
9079                                                             termed the
9080                                                             fence-paired-atomic).
9081                                                           - s_waitcnt vscnt(0)
9082                                                             must happen after
9083                                                             any preceding
9084                                                             global/generic
9085                                                             atomicrmw-no-return-value
9086                                                             with an equal or
9087                                                             wider sync scope
9088                                                             and memory ordering
9089                                                             stronger than
9090                                                             unordered (this is
9091                                                             termed the
9092                                                             fence-paired-atomic).
9093                                                           - s_waitcnt lgkmcnt(0)
9094                                                             must happen after
9095                                                             any preceding
9096                                                             local/generic load
9097                                                             atomic/atomicrmw
9098                                                             with an equal or
9099                                                             wider sync scope
9100                                                             and memory ordering
9101                                                             stronger than
9102                                                             unordered (this is
9103                                                             termed the
9104                                                             fence-paired-atomic).
9105                                                           - Must happen before
9106                                                             the following
9107                                                             buffer_gl*_inv.
9108                                                           - Ensures that the
9109                                                             fence-paired-atomic
9110                                                             has completed
9111                                                             before invalidating
9112                                                             the
9113                                                             caches. Therefore
9114                                                             any following
9115                                                             locations read must
9116                                                             be no older than
9117                                                             the value read by
9118                                                             the
9119                                                             fence-paired-atomic.
9120
9121                                                         2. buffer_gl0_inv;
9122                                                            buffer_gl1_inv
9123
9124                                                           - Must happen before any
9125                                                             following global/generic
9126                                                             load/load
9127                                                             atomic/store/store
9128                                                             atomic/atomicrmw.
9129                                                           - Ensures that
9130                                                             following loads
9131                                                             will not see stale
9132                                                             global data.
9133
9134     **Release Atomic**
9135     ------------------------------------------------------------------------------------
9136     store atomic release      - singlethread - global   1. buffer/global/ds/flat_store
9137                               - wavefront    - local
9138                                              - generic
9139     store atomic release      - workgroup    - global   1. s_waitcnt lgkmcnt(0) &
9140                                              - generic     vmcnt(0) & vscnt(0)
9141
9142                                                           - If CU wavefront execution
9143                                                             mode, omit vmcnt(0) and
9144                                                             vscnt(0).
9145                                                           - If OpenCL, omit
9146                                                             lgkmcnt(0).
9147                                                           - Could be split into
9148                                                             separate s_waitcnt
9149                                                             vmcnt(0), s_waitcnt
9150                                                             vscnt(0) and s_waitcnt
9151                                                             lgkmcnt(0) to allow
9152                                                             them to be
9153                                                             independently moved
9154                                                             according to the
9155                                                             following rules.
9156                                                           - s_waitcnt vmcnt(0)
9157                                                             must happen after
9158                                                             any preceding
9159                                                             global/generic load/load
9160                                                             atomic/
9161                                                             atomicrmw-with-return-value.
9162                                                           - s_waitcnt vscnt(0)
9163                                                             must happen after
9164                                                             any preceding
9165                                                             global/generic
9166                                                             store/store
9167                                                             atomic/
9168                                                             atomicrmw-no-return-value.
9169                                                           - s_waitcnt lgkmcnt(0)
9170                                                             must happen after
9171                                                             any preceding
9172                                                             local/generic
9173                                                             load/store/load
9174                                                             atomic/store
9175                                                             atomic/atomicrmw.
9176                                                           - Must happen before
9177                                                             the following
9178                                                             store.
9179                                                           - Ensures that all
9180                                                             memory operations
9181                                                             have
9182                                                             completed before
9183                                                             performing the
9184                                                             store that is being
9185                                                             released.
9186
9187                                                         2. buffer/global/flat_store
9188     store atomic release      - workgroup    - local    1. s_waitcnt vmcnt(0) & vscnt(0)
9189
9190                                                           - If CU wavefront execution
9191                                                             mode, omit.
9192                                                           - If OpenCL, omit.
9193                                                           - Could be split into
9194                                                             separate s_waitcnt
9195                                                             vmcnt(0) and s_waitcnt
9196                                                             vscnt(0) to allow
9197                                                             them to be
9198                                                             independently moved
9199                                                             according to the
9200                                                             following rules.
9201                                                           - s_waitcnt vmcnt(0)
9202                                                             must happen after
9203                                                             any preceding
9204                                                             global/generic load/load
9205                                                             atomic/
9206                                                             atomicrmw-with-return-value.
9207                                                           - s_waitcnt vscnt(0)
9208                                                             must happen after
9209                                                             any preceding
9210                                                             global/generic
9211                                                             store/store atomic/
9212                                                             atomicrmw-no-return-value.
9213                                                           - Must happen before
9214                                                             the following
9215                                                             store.
9216                                                           - Ensures that all
9217                                                             global memory
9218                                                             operations have
9219                                                             completed before
9220                                                             performing the
9221                                                             store that is being
9222                                                             released.
9223
9224                                                         2. ds_store
9225     store atomic release      - agent        - global   1. s_waitcnt lgkmcnt(0) &
9226                               - system       - generic     vmcnt(0) & vscnt(0)
9227
9228                                                           - If OpenCL and
9229                                                             address space is
9230                                                             not generic, omit
9231                                                             lgkmcnt(0).
9232                                                           - Could be split into
9233                                                             separate s_waitcnt
9234                                                             vmcnt(0), s_waitcnt vscnt(0)
9235                                                             and s_waitcnt
9236                                                             lgkmcnt(0) to allow
9237                                                             them to be
9238                                                             independently moved
9239                                                             according to the
9240                                                             following rules.
9241                                                           - s_waitcnt vmcnt(0)
9242                                                             must happen after
9243                                                             any preceding
9244                                                             global/generic
9245                                                             load/load
9246                                                             atomic/
9247                                                             atomicrmw-with-return-value.
9248                                                           - s_waitcnt vscnt(0)
9249                                                             must happen after
9250                                                             any preceding
9251                                                             global/generic
9252                                                             store/store atomic/
9253                                                             atomicrmw-no-return-value.
9254                                                           - s_waitcnt lgkmcnt(0)
9255                                                             must happen after
9256                                                             any preceding
9257                                                             local/generic
9258                                                             load/store/load
9259                                                             atomic/store
9260                                                             atomic/atomicrmw.
9261                                                           - Must happen before
9262                                                             the following
9263                                                             store.
9264                                                           - Ensures that all
9265                                                             memory operations
9266                                                             have
9267                                                             completed before
9268                                                             performing the
9269                                                             store that is being
9270                                                             released.
9271
9272                                                         2. buffer/global/flat_store
9273     atomicrmw    release      - singlethread - global   1. buffer/global/ds/flat_atomic
9274                               - wavefront    - local
9275                                              - generic
9276     atomicrmw    release      - workgroup    - global   1. s_waitcnt lgkmcnt(0) &
9277                                              - generic     vmcnt(0) & vscnt(0)
9278
9279                                                           - If CU wavefront execution
9280                                                             mode, omit vmcnt(0) and
9281                                                             vscnt(0).
9282                                                           - If OpenCL, omit lgkmcnt(0).
9283                                                           - Could be split into
9284                                                             separate s_waitcnt
9285                                                             vmcnt(0), s_waitcnt
9286                                                             vscnt(0) and s_waitcnt
9287                                                             lgkmcnt(0) to allow
9288                                                             them to be
9289                                                             independently moved
9290                                                             according to the
9291                                                             following rules.
9292                                                           - s_waitcnt vmcnt(0)
9293                                                             must happen after
9294                                                             any preceding
9295                                                             global/generic load/load
9296                                                             atomic/
9297                                                             atomicrmw-with-return-value.
9298                                                           - s_waitcnt vscnt(0)
9299                                                             must happen after
9300                                                             any preceding
9301                                                             global/generic
9302                                                             store/store
9303                                                             atomic/
9304                                                             atomicrmw-no-return-value.
9305                                                           - s_waitcnt lgkmcnt(0)
9306                                                             must happen after
9307                                                             any preceding
9308                                                             local/generic
9309                                                             load/store/load
9310                                                             atomic/store
9311                                                             atomic/atomicrmw.
9312                                                           - Must happen before
9313                                                             the following
9314                                                             atomicrmw.
9315                                                           - Ensures that all
9316                                                             memory operations
9317                                                             have
9318                                                             completed before
9319                                                             performing the
9320                                                             atomicrmw that is
9321                                                             being released.
9322
9323                                                         2. buffer/global/flat_atomic
9324     atomicrmw    release      - workgroup    - local    1. s_waitcnt vmcnt(0) & vscnt(0)
9325
9326                                                           - If CU wavefront execution
9327                                                             mode, omit.
9328                                                           - If OpenCL, omit.
9329                                                           - Could be split into
9330                                                             separate s_waitcnt
9331                                                             vmcnt(0) and s_waitcnt
9332                                                             vscnt(0) to allow
9333                                                             them to be
9334                                                             independently moved
9335                                                             according to the
9336                                                             following rules.
9337                                                           - s_waitcnt vmcnt(0)
9338                                                             must happen after
9339                                                             any preceding
9340                                                             global/generic load/load
9341                                                             atomic/
9342                                                             atomicrmw-with-return-value.
9343                                                           - s_waitcnt vscnt(0)
9344                                                             must happen after
9345                                                             any preceding
9346                                                             global/generic
9347                                                             store/store atomic/
9348                                                             atomicrmw-no-return-value.
9349                                                           - Must happen before
9350                                                             the following
9351                                                             atomicrmw.
9352                                                           - Ensures that all
9353                                                             global memory
9354                                                             operations have
9355                                                             completed before
9356                                                             performing the
9357                                                             atomicrmw that is
9358                                                             being released.
9359
9360                                                         2. ds_atomic
9361     atomicrmw    release      - agent        - global   1. s_waitcnt lgkmcnt(0) &
9362                               - system       - generic     vmcnt(0) & vscnt(0)
9363
9364                                                           - If OpenCL, omit
9365                                                             lgkmcnt(0).
9366                                                           - Could be split into
9367                                                             separate s_waitcnt
9368                                                             vmcnt(0), s_waitcnt
9369                                                             vscnt(0) and s_waitcnt
9370                                                             lgkmcnt(0) to allow
9371                                                             them to be
9372                                                             independently moved
9373                                                             according to the
9374                                                             following rules.
9375                                                           - s_waitcnt vmcnt(0)
9376                                                             must happen after
9377                                                             any preceding
9378                                                             global/generic
9379                                                             load/load atomic/
9380                                                             atomicrmw-with-return-value.
9381                                                           - s_waitcnt vscnt(0)
9382                                                             must happen after
9383                                                             any preceding
9384                                                             global/generic
9385                                                             store/store atomic/
9386                                                             atomicrmw-no-return-value.
9387                                                           - s_waitcnt lgkmcnt(0)
9388                                                             must happen after
9389                                                             any preceding
9390                                                             local/generic
9391                                                             load/store/load
9392                                                             atomic/store
9393                                                             atomic/atomicrmw.
9394                                                           - Must happen before
9395                                                             the following
9396                                                             atomicrmw.
9397                                                           - Ensures that all
9398                                                             memory operations
9399                                                             to global and local
9400                                                             have completed
9401                                                             before performing
9402                                                             the atomicrmw that
9403                                                             is being released.
9404
9405                                                         2. buffer/global/flat_atomic
9406     fence        release      - singlethread *none*     *none*
9407                               - wavefront
9408     fence        release      - workgroup    *none*     1. s_waitcnt lgkmcnt(0) &
9409                                                            vmcnt(0) & vscnt(0)
9410
9411                                                           - If CU wavefront execution
9412                                                             mode, omit vmcnt(0) and
9413                                                             vscnt(0).
9414                                                           - If OpenCL and
9415                                                             address space is
9416                                                             not generic, omit
9417                                                             lgkmcnt(0).
9418                                                           - If OpenCL and
9419                                                             address space is
9420                                                             local, omit
9421                                                             vmcnt(0) and vscnt(0).
9422                                                           - However, since LLVM
9423                                                             currently has no
9424                                                             address space on
9425                                                             the fence, the full
9426                                                             s_waitcnt must
9427                                                             always be generated
9428                                                             conservatively. If
9429                                                             the fence had an
9430                                                             address space, it
9431                                                             would be set to the
9432                                                             address space of the
9433                                                             OpenCL fence flag, or
9434                                                             to generic if both
9435                                                             local and global
9436                                                             flags are specified.
9437                                                           - Could be split into
9438                                                             separate s_waitcnt
9439                                                             vmcnt(0), s_waitcnt
9440                                                             vscnt(0) and s_waitcnt
9441                                                             lgkmcnt(0) to allow
9442                                                             them to be
9443                                                             independently moved
9444                                                             according to the
9445                                                             following rules.
9446                                                           - s_waitcnt vmcnt(0)
9447                                                             must happen after
9448                                                             any preceding
9449                                                             global/generic
9450                                                             load/load
9451                                                             atomic/
9452                                                             atomicrmw-with-return-value.
9453                                                           - s_waitcnt vscnt(0)
9454                                                             must happen after
9455                                                             any preceding
9456                                                             global/generic
9457                                                             store/store atomic/
9458                                                             atomicrmw-no-return-value.
9459                                                           - s_waitcnt lgkmcnt(0)
9460                                                             must happen after
9461                                                             any preceding
9462                                                             local/generic
9463                                                             load/store/load
9464                                                             atomic/store atomic/
9465                                                             atomicrmw.
9466                                                           - Must happen before
9467                                                             any following store
9468                                                             atomic/atomicrmw
9469                                                             with an equal or
9470                                                             wider sync scope
9471                                                             and memory ordering
9472                                                             stronger than
9473                                                             unordered (this is
9474                                                             termed the
9475                                                             fence-paired-atomic).
9476                                                           - Ensures that all
9477                                                             memory operations
9478                                                             have
9479                                                             completed before
9480                                                             performing the
9481                                                             following
9482                                                             fence-paired-atomic.
9483
9484     fence        release      - agent        *none*     1. s_waitcnt lgkmcnt(0) &
9485                               - system                     vmcnt(0) & vscnt(0)
9486
9487                                                           - If OpenCL and
9488                                                             address space is
9489                                                             not generic, omit
9490                                                             lgkmcnt(0).
9491                                                           - If OpenCL and
9492                                                             address space is
9493                                                             local, omit
9494                                                             vmcnt(0) and vscnt(0).
9495                                                           - However, since LLVM
9496                                                             currently has no
9497                                                             address space on
9498                                                             the fence, the full
9499                                                             s_waitcnt must
9500                                                             always be generated
9501                                                             conservatively. If
9502                                                             the fence had an
9503                                                             address space, it
9504                                                             would be set to the
9505                                                             address space of the
9506                                                             OpenCL fence flag, or
9507                                                             to generic if both
9508                                                             local and global
9509                                                             flags are specified.
9510                                                           - Could be split into
9511                                                             separate s_waitcnt
9512                                                             vmcnt(0), s_waitcnt
9513                                                             vscnt(0) and s_waitcnt
9514                                                             lgkmcnt(0) to allow
9515                                                             them to be
9516                                                             independently moved
9517                                                             according to the
9518                                                             following rules.
9519                                                           - s_waitcnt vmcnt(0)
9520                                                             must happen after
9521                                                             any preceding
9522                                                             global/generic
9523                                                             load/load atomic/
9524                                                             atomicrmw-with-return-value.
9525                                                           - s_waitcnt vscnt(0)
9526                                                             must happen after
9527                                                             any preceding
9528                                                             global/generic
9529                                                             store/store atomic/
9530                                                             atomicrmw-no-return-value.
9531                                                           - s_waitcnt lgkmcnt(0)
9532                                                             must happen after
9533                                                             any preceding
9534                                                             local/generic
9535                                                             load/store/load
9536                                                             atomic/store
9537                                                             atomic/atomicrmw.
9538                                                           - Must happen before
9539                                                             any following store
9540                                                             atomic/atomicrmw
9541                                                             with an equal or
9542                                                             wider sync scope
9543                                                             and memory ordering
9544                                                             stronger than
9545                                                             unordered (this is
9546                                                             termed the
9547                                                             fence-paired-atomic).
9548                                                           - Ensures that all
9549                                                             memory operations
9550                                                             have
9551                                                             completed before
9552                                                             performing the
9553                                                             following
9554                                                             fence-paired-atomic.
9555
9556     **Acquire-Release Atomic**
9557     ------------------------------------------------------------------------------------
9558     atomicrmw    acq_rel      - singlethread - global   1. buffer/global/ds/flat_atomic
9559                               - wavefront    - local
9560                                              - generic
9561     atomicrmw    acq_rel      - workgroup    - global   1. s_waitcnt lgkmcnt(0) &
9562                                                            vmcnt(0) & vscnt(0)
9563
9564                                                           - If CU wavefront execution
9565                                                             mode, omit vmcnt(0) and
9566                                                             vscnt(0).
9567                                                           - If OpenCL, omit
9568                                                             lgkmcnt(0).
9569                                                           - Must happen after
9570                                                             any preceding
9571                                                             local/generic
9572                                                             load/store/load
9573                                                             atomic/store
9574                                                             atomic/atomicrmw.
9575                                                           - Could be split into
9576                                                             separate s_waitcnt
9577                                                             vmcnt(0), s_waitcnt
9578                                                             vscnt(0), and s_waitcnt
9579                                                             lgkmcnt(0) to allow
9580                                                             them to be
9581                                                             independently moved
9582                                                             according to the
9583                                                             following rules.
9584                                                           - s_waitcnt vmcnt(0)
9585                                                             must happen after
9586                                                             any preceding
9587                                                             global/generic load/load
9588                                                             atomic/
9589                                                             atomicrmw-with-return-value.
9590                                                           - s_waitcnt vscnt(0)
9591                                                             must happen after
9592                                                             any preceding
9593                                                             global/generic
9594                                                             store/store
9595                                                             atomic/
9596                                                             atomicrmw-no-return-value.
9597                                                           - s_waitcnt lgkmcnt(0)
9598                                                             must happen after
9599                                                             any preceding
9600                                                             local/generic
9601                                                             load/store/load
9602                                                             atomic/store
9603                                                             atomic/atomicrmw.
9604                                                           - Must happen before
9605                                                             the following
9606                                                             atomicrmw.
9607                                                           - Ensures that all
9608                                                             memory operations
9609                                                             have
9610                                                             completed before
9611                                                             performing the
9612                                                             atomicrmw that is
9613                                                             being released.
9614
9615                                                         2. buffer/global_atomic
9616                                                         3. s_waitcnt vm/vscnt(0)
9617
9618                                                           - If CU wavefront execution
9619                                                             mode, omit.
9620                                                           - Use vmcnt(0) if atomic with
9621                                                             return and vscnt(0) if
9622                                                             atomic with no-return.
9623                                                           - Must happen before
9624                                                             the following
9625                                                             buffer_gl0_inv.
9626                                                           - Ensures any
9627                                                             following global
9628                                                             data read is no
9629                                                             older than the
9630                                                             atomicrmw value
9631                                                             being acquired.
9632
9633                                                         4. buffer_gl0_inv
9634
9635                                                           - If CU wavefront execution
9636                                                             mode, omit.
9637                                                           - Ensures that
9638                                                             following
9639                                                             loads will not see
9640                                                             stale data.
9641
9642     atomicrmw    acq_rel      - workgroup    - local    1. s_waitcnt vmcnt(0) & vscnt(0)
9643
9644                                                           - If CU wavefront execution
9645                                                             mode, omit.
9646                                                           - If OpenCL, omit.
9647                                                           - Could be split into
9648                                                             separate s_waitcnt
9649                                                             vmcnt(0) and s_waitcnt
9650                                                             vscnt(0) to allow
9651                                                             them to be
9652                                                             independently moved
9653                                                             according to the
9654                                                             following rules.
9655                                                           - s_waitcnt vmcnt(0)
9656                                                             must happen after
9657                                                             any preceding
9658                                                             global/generic load/load
9659                                                             atomic/
9660                                                             atomicrmw-with-return-value.
9661                                                           - s_waitcnt vscnt(0)
9662                                                             must happen after
9663                                                             any preceding
9664                                                             global/generic
9665                                                             store/store atomic/
9666                                                             atomicrmw-no-return-value.
9667                                                           - Must happen before
9668                                                             the following
9669                                                             atomicrmw.
9670                                                           - Ensures that all
9671                                                             global memory
9672                                                             operations have
9673                                                             completed before
9674                                                             performing the
9675                                                             atomicrmw that is
9676                                                             being released.
9677
9678                                                         2. ds_atomic
9679                                                         3. s_waitcnt lgkmcnt(0)
9680
9681                                                           - If OpenCL, omit.
9682                                                           - Must happen before
9683                                                             the following
9684                                                             buffer_gl0_inv.
9685                                                           - Ensures any
9686                                                             following global
9687                                                             data read is no
9688                                                             older than the local
9689                                                             atomicrmw value being
9690                                                             acquired.
9691
9692                                                         4. buffer_gl0_inv
9693
9694                                                           - If CU wavefront execution
9695                                                             mode, omit.
9696                                                           - If OpenCL, omit.
9697                                                           - Ensures that
9698                                                             following
9699                                                             loads will not see
9700                                                             stale data.
9701
9702     atomicrmw    acq_rel      - workgroup    - generic  1. s_waitcnt lgkmcnt(0) &
9703                                                            vmcnt(0) & vscnt(0)
9704
9705                                                           - If CU wavefront execution
9706                                                             mode, omit vmcnt(0) and
9707                                                             vscnt(0).
9708                                                           - If OpenCL, omit lgkmcnt(0).
9709                                                           - Could be split into
9710                                                             separate s_waitcnt
9711                                                             vmcnt(0), s_waitcnt
9712                                                             vscnt(0) and s_waitcnt
9713                                                             lgkmcnt(0) to allow
9714                                                             them to be
9715                                                             independently moved
9716                                                             according to the
9717                                                             following rules.
9718                                                           - s_waitcnt vmcnt(0)
9719                                                             must happen after
9720                                                             any preceding
9721                                                             global/generic load/load
9722                                                             atomic/
9723                                                             atomicrmw-with-return-value.
9724                                                           - s_waitcnt vscnt(0)
9725                                                             must happen after
9726                                                             any preceding
9727                                                             global/generic
9728                                                             store/store
9729                                                             atomic/
9730                                                             atomicrmw-no-return-value.
9731                                                           - s_waitcnt lgkmcnt(0)
9732                                                             must happen after
9733                                                             any preceding
9734                                                             local/generic
9735                                                             load/store/load
9736                                                             atomic/store
9737                                                             atomic/atomicrmw.
9738                                                           - Must happen before
9739                                                             the following
9740                                                             atomicrmw.
9741                                                           - Ensures that all
9742                                                             memory operations
9743                                                             have
9744                                                             completed before
9745                                                             performing the
9746                                                             atomicrmw that is
9747                                                             being released.
9748
9749                                                         2. flat_atomic
9750                                                         3. s_waitcnt lgkmcnt(0) &
9751                                                            vmcnt(0) & vscnt(0)
9752
9753                                                           - If CU wavefront execution
9754                                                             mode, omit vmcnt(0) and
9755                                                             vscnt(0).
9756                                                           - If OpenCL, omit lgkmcnt(0).
9757                                                           - Must happen before
9758                                                             the following
9759                                                             buffer_gl0_inv.
9760                                                           - Ensures any
9761                                                             following global
9762                                                             data read is no
9763                                                             older than the
9764                                                             atomicrmw value being
9765                                                             acquired.
9766
9767                                                         4. buffer_gl0_inv
9768
9769                                                           - If CU wavefront execution
9770                                                             mode, omit.
9771                                                           - Ensures that
9772                                                             following
9773                                                             loads will not see
9774                                                             stale data.
9775
9776     atomicrmw    acq_rel      - agent        - global   1. s_waitcnt lgkmcnt(0) &
9777                               - system                     vmcnt(0) & vscnt(0)
9778
9779                                                           - If OpenCL, omit
9780                                                             lgkmcnt(0).
9781                                                           - Could be split into
9782                                                             separate s_waitcnt
9783                                                             vmcnt(0), s_waitcnt
9784                                                             vscnt(0) and s_waitcnt
9785                                                             lgkmcnt(0) to allow
9786                                                             them to be
9787                                                             independently moved
9788                                                             according to the
9789                                                             following rules.
9790                                                           - s_waitcnt vmcnt(0)
9791                                                             must happen after
9792                                                             any preceding
9793                                                             global/generic
9794                                                             load/load atomic/
9795                                                             atomicrmw-with-return-value.
9796                                                           - s_waitcnt vscnt(0)
9797                                                             must happen after
9798                                                             any preceding
9799                                                             global/generic
9800                                                             store/store atomic/
9801                                                             atomicrmw-no-return-value.
9802                                                           - s_waitcnt lgkmcnt(0)
9803                                                             must happen after
9804                                                             any preceding
9805                                                             local/generic
9806                                                             load/store/load
9807                                                             atomic/store
9808                                                             atomic/atomicrmw.
9809                                                           - Must happen before
9810                                                             the following
9811                                                             atomicrmw.
9812                                                           - Ensures that all
9813                                                             memory operations
9814                                                             to global have
9815                                                             completed before
9816                                                             performing the
9817                                                             atomicrmw that is
9818                                                             being released.
9819
9820                                                         2. buffer/global_atomic
9821                                                         3. s_waitcnt vm/vscnt(0)
9822
9823                                                           - Use vmcnt(0) if atomic with
9824                                                             return and vscnt(0) if
9825                                                             atomic with no-return.
9826                                                           - Must happen before
9827                                                             following
9828                                                             buffer_gl*_inv.
9829                                                           - Ensures the
9830                                                             atomicrmw has
9831                                                             completed before
9832                                                             invalidating the
9833                                                             caches.
9834
9835                                                         4. buffer_gl0_inv;
9836                                                            buffer_gl1_inv
9837
9838                                                           - Must happen before
9839                                                             any following
9840                                                             global/generic
9841                                                             load/load
9842                                                             atomic/atomicrmw.
9843                                                           - Ensures that
9844                                                             following loads
9845                                                             will not see stale
9846                                                             global data.
9847
9848     atomicrmw    acq_rel      - agent        - generic  1. s_waitcnt lgkmcnt(0) &
9849                               - system                     vmcnt(0) & vscnt(0)
9850
9851                                                           - If OpenCL, omit
9852                                                             lgkmcnt(0).
9853                                                           - Could be split into
9854                                                             separate s_waitcnt
9855                                                             vmcnt(0), s_waitcnt
9856                                                             vscnt(0), and s_waitcnt
9857                                                             lgkmcnt(0) to allow
9858                                                             them to be
9859                                                             independently moved
9860                                                             according to the
9861                                                             following rules.
9862                                                           - s_waitcnt vmcnt(0)
9863                                                             must happen after
9864                                                             any preceding
9865                                                             global/generic
9866                                                             load/load atomic/
9867                                                             atomicrmw-with-return-value.
9868                                                           - s_waitcnt vscnt(0)
9869                                                             must happen after
9870                                                             any preceding
9871                                                             global/generic
9872                                                             store/store atomic/
9873                                                             atomicrmw-no-return-value.
9874                                                           - s_waitcnt lgkmcnt(0)
9875                                                             must happen after
9876                                                             any preceding
9877                                                             local/generic
9878                                                             load/store/load
9879                                                             atomic/store
9880                                                             atomic/atomicrmw.
9881                                                           - Must happen before
9882                                                             the following
9883                                                             atomicrmw.
9884                                                           - Ensures that all
9885                                                             memory operations
9886                                                             have
9887                                                             completed before
9888                                                             performing the
9889                                                             atomicrmw that is
9890                                                             being released.
9891
9892                                                         2. flat_atomic
9893                                                         3. s_waitcnt vm/vscnt(0) &
9894                                                            lgkmcnt(0)
9895
9896                                                           - If OpenCL, omit
9897                                                             lgkmcnt(0).
9898                                                           - Use vmcnt(0) if atomic with
9899                                                             return and vscnt(0) if
9900                                                             atomic with no-return.
9901                                                           - Must happen before
9902                                                             following
9903                                                             buffer_gl*_inv.
9904                                                           - Ensures the
9905                                                             atomicrmw has
9906                                                             completed before
9907                                                             invalidating the
9908                                                             caches.
9909
9910                                                         4. buffer_gl0_inv;
9911                                                            buffer_gl1_inv
9912
9913                                                           - Must happen before
9914                                                             any following
9915                                                             global/generic
9916                                                             load/load
9917                                                             atomic/atomicrmw.
9918                                                           - Ensures that
9919                                                             following loads
9920                                                             will not see stale
9921                                                             global data.
9922
9923     fence        acq_rel      - singlethread *none*     *none*
9924                               - wavefront
9925     fence        acq_rel      - workgroup    *none*     1. s_waitcnt lgkmcnt(0) &
9926                                                            vmcnt(0) & vscnt(0)
9927
9928                                                           - If CU wavefront execution
9929                                                             mode, omit vmcnt(0) and
9930                                                             vscnt(0).
9931                                                           - If OpenCL and
9932                                                             address space is
9933                                                             not generic, omit
9934                                                             lgkmcnt(0).
9935                                                           - If OpenCL and
9936                                                             address space is
9937                                                             local, omit
9938                                                             vmcnt(0) and vscnt(0).
9939                                                           - However, since
9940                                                             LLVM currently has
9941                                                             no address space on
9942                                                             the fence, the
9943                                                             waits must
9944                                                             conservatively
9945                                                             always be generated
9946                                                             (see comment for
9947                                                             previous fence).
9948                                                           - Could be split into
9949                                                             separate s_waitcnt
9950                                                             vmcnt(0), s_waitcnt
9951                                                             vscnt(0) and s_waitcnt
9952                                                             lgkmcnt(0) to allow
9953                                                             them to be
9954                                                             independently moved
9955                                                             according to the
9956                                                             following rules.
9957                                                           - s_waitcnt vmcnt(0)
9958                                                             must happen after
9959                                                             any preceding
9960                                                             global/generic
9961                                                             load/load
9962                                                             atomic/
9963                                                             atomicrmw-with-return-value.
9964                                                           - s_waitcnt vscnt(0)
9965                                                             must happen after
9966                                                             any preceding
9967                                                             global/generic
9968                                                             store/store atomic/
9969                                                             atomicrmw-no-return-value.
9970                                                           - s_waitcnt lgkmcnt(0)
9971                                                             must happen after
9972                                                             any preceding
9973                                                             local/generic
9974                                                             load/store/load
9975                                                             atomic/store atomic/
9976                                                             atomicrmw.
9977                                                           - Must happen before
9978                                                             any following
9979                                                             global/generic
9980                                                             load/load
9981                                                             atomic/store/store
9982                                                             atomic/atomicrmw.
9983                                                           - Ensures that all
9984                                                             memory operations
9985                                                             have
9986                                                             completed before
9987                                                             performing any
9988                                                             following global
9989                                                             memory operations.
9990                                                           - Ensures that the
9991                                                             preceding
9992                                                             local/generic load
9993                                                             atomic/atomicrmw
9994                                                             with an equal or
9995                                                             wider sync scope
9996                                                             and memory ordering
9997                                                             stronger than
9998                                                             unordered (this is
9999                                                             termed the
10000                                                             acquire-fence-paired-atomic)
10001                                                             has completed
10002                                                             before following
10003                                                             global memory
10004                                                             operations. This
10005                                                             satisfies the
10006                                                             requirements of
10007                                                             acquire.
10008                                                           - Ensures that all
10009                                                             previous memory
10010                                                             operations have
10011                                                             completed before a
10012                                                             following
10013                                                             local/generic store
10014                                                             atomic/atomicrmw
10015                                                             with an equal or
10016                                                             wider sync scope
10017                                                             and memory ordering
10018                                                             stronger than
10019                                                             unordered (this is
10020                                                             termed the
10021                                                             release-fence-paired-atomic).
10022                                                             This satisfies the
10023                                                             requirements of
10024                                                             release.
10025                                                           - Must happen before
10026                                                             the following
10027                                                             buffer_gl0_inv.
10028                                                           - Ensures that the
10029                                                             acquire-fence-paired-atomic
10030                                                             has completed
10031                                                             before invalidating
10032                                                             the
10033                                                             cache. Therefore
10034                                                             any following
10035                                                             locations read must
10036                                                             be no older than
10037                                                             the value read by
10038                                                             the
10039                                                             acquire-fence-paired-atomic.
10040
10041                                                         3. buffer_gl0_inv
10042
10043                                                           - If CU wavefront execution
10044                                                             mode, omit.
10045                                                           - Ensures that
10046                                                             following
10047                                                             loads will not see
10048                                                             stale data.
10049
10050     fence        acq_rel      - agent        *none*     1. s_waitcnt lgkmcnt(0) &
10051                               - system                     vmcnt(0) & vscnt(0)
10052
10053                                                           - If OpenCL and
10054                                                             address space is
10055                                                             not generic, omit
10056                                                             lgkmcnt(0).
10057                                                           - If OpenCL and
10058                                                             address space is
10059                                                             local, omit
10060                                                             vmcnt(0) and vscnt(0).
10061                                                           - However, since LLVM
10062                                                             currently has no
10063                                                             address space on the
10064                                                             fence, the full
10065                                                             s_waitcnt must always
10066                                                             be conservatively
10067                                                             generated (see comment
10068                                                             for previous fence).
10069                                                           - Could be split into
10070                                                             separate s_waitcnt
10071                                                             vmcnt(0), s_waitcnt
10072                                                             vscnt(0) and s_waitcnt
10073                                                             lgkmcnt(0) to allow
10074                                                             them to be
10075                                                             independently moved
10076                                                             according to the
10077                                                             following rules.
10078                                                           - s_waitcnt vmcnt(0)
10079                                                             must happen after
10080                                                             any preceding
10081                                                             global/generic
10082                                                             load/load
10083                                                             atomic/
10084                                                             atomicrmw-with-return-value.
10085                                                           - s_waitcnt vscnt(0)
10086                                                             must happen after
10087                                                             any preceding
10088                                                             global/generic
10089                                                             store/store atomic/
10090                                                             atomicrmw-no-return-value.
10091                                                           - s_waitcnt lgkmcnt(0)
10092                                                             must happen after
10093                                                             any preceding
10094                                                             local/generic
10095                                                             load/store/load
10096                                                             atomic/store
10097                                                             atomic/atomicrmw.
10098                                                           - Must happen before
10099                                                             the following
10100                                                             buffer_gl*_inv.
10101                                                           - Ensures that the
10102                                                             preceding
10103                                                             global/local/generic
10104                                                             load
10105                                                             atomic/atomicrmw
10106                                                             with an equal or
10107                                                             wider sync scope
10108                                                             and memory ordering
10109                                                             stronger than
10110                                                             unordered (this is
10111                                                             termed the
10112                                                             acquire-fence-paired-atomic)
10113                                                             has completed
10114                                                             before invalidating
10115                                                             the caches. This
10116                                                             satisfies the
10117                                                             requirements of
10118                                                             acquire.
10119                                                           - Ensures that all
10120                                                             previous memory
10121                                                             operations have
10122                                                             completed before a
10123                                                             following
10124                                                             global/local/generic
10125                                                             store
10126                                                             atomic/atomicrmw
10127                                                             with an equal or
10128                                                             wider sync scope
10129                                                             and memory ordering
10130                                                             stronger than
10131                                                             unordered (this is
10132                                                             termed the
10133                                                             release-fence-paired-atomic).
10134                                                             This satisfies the
10135                                                             requirements of
10136                                                             release.
10137
10138                                                         2. buffer_gl0_inv;
10139                                                            buffer_gl1_inv
10140
10141                                                           - Must happen before
10142                                                             any following
10143                                                             global/generic
10144                                                             load/load
10145                                                             atomic/store/store
10146                                                             atomic/atomicrmw.
10147                                                           - Ensures that
10148                                                             following loads
10149                                                             will not see stale
10150                                                             global data. This
10151                                                             satisfies the
10152                                                             requirements of
10153                                                             acquire.
10154
10155     **Sequentially Consistent Atomic**
10156     ------------------------------------------------------------------------------------
10157     load atomic  seq_cst      - singlethread - global   *Same as corresponding
10158                               - wavefront    - local    load atomic acquire,
10159                                              - generic  except must generate
10160                                                         all instructions even
10161                                                         for OpenCL.*
10162     load atomic  seq_cst      - workgroup    - global   1. s_waitcnt lgkmcnt(0) &
10163                                              - generic     vmcnt(0) & vscnt(0)
10164
10165                                                           - If CU wavefront execution
10166                                                             mode, omit vmcnt(0) and
10167                                                             vscnt(0).
10168                                                           - Could be split into
10169                                                             separate s_waitcnt
10170                                                             vmcnt(0), s_waitcnt
10171                                                             vscnt(0), and s_waitcnt
10172                                                             lgkmcnt(0) to allow
10173                                                             them to be
10174                                                             independently moved
10175                                                             according to the
10176                                                             following rules.
10177                                                           - s_waitcnt lgkmcnt(0) must
10178                                                             happen after
10179                                                             preceding
10180                                                             local/generic load
10181                                                             atomic/store
10182                                                             atomic/atomicrmw
10183                                                             with memory
10184                                                             ordering of seq_cst
10185                                                             and with equal or
10186                                                             wider sync scope.
10187                                                             (Note that seq_cst
10188                                                             fences have their
10189                                                             own s_waitcnt
10190                                                             lgkmcnt(0) and so do
10191                                                             not need to be
10192                                                             considered.)
10193                                                           - s_waitcnt vmcnt(0)
10194                                                             must happen after
10195                                                             preceding
10196                                                             global/generic load
10197                                                             atomic/
10198                                                             atomicrmw-with-return-value
10199                                                             with memory
10200                                                             ordering of seq_cst
10201                                                             and with equal or
10202                                                             wider sync scope.
10203                                                             (Note that seq_cst
10204                                                             fences have their
10205                                                             own s_waitcnt
10206                                                             vmcnt(0) and so do
10207                                                             not need to be
10208                                                             considered.)
10209                                                           - s_waitcnt vscnt(0)
10210                                                             must happen after
10211                                                             preceding
10212                                                             global/generic store
10213                                                             atomic/
10214                                                             atomicrmw-no-return-value
10215                                                             with memory
10216                                                             ordering of seq_cst
10217                                                             and with equal or
10218                                                             wider sync scope.
10219                                                             (Note that seq_cst
10220                                                             fences have their
10221                                                             own s_waitcnt
10222                                                             vscnt(0) and so do
10223                                                             not need to be
10224                                                             considered.)
10225                                                           - Ensures any
10226                                                             preceding
10227                                                             sequentially
10228                                                             consistent global/local
10229                                                             memory instructions
10230                                                             have completed
10231                                                             before executing
10232                                                             this sequentially
10233                                                             consistent
10234                                                             instruction. This
10235                                                             prevents reordering
10236                                                             a seq_cst store
10237                                                             followed by a
10238                                                             seq_cst load. (Note
10239                                                             that seq_cst is
10240                                                             stronger than
10241                                                             acquire/release as
10242                                                             the reordering of
10243                                                             load acquire
10244                                                             followed by a store
10245                                                             release is
10246                                                             prevented by the
10247                                                             s_waitcnt of
10248                                                             the release, but
10249                                                             there is nothing
10250                                                             preventing a store
10251                                                             release followed by
10252                                                             load acquire from
10253                                                             completing out of
10254                                                             order. The s_waitcnt
10255                                                             could be placed after
10256                                                             seq_cst store or before
10257                                                             the seq_cst load. We
10258                                                             choose the load to
10259                                                             make the s_waitcnt be
10260                                                             as late as possible
10261                                                             so that the store
10262                                                             may have already
10263                                                             completed.)
10264
10265                                                         2. *Following
10266                                                            instructions same as
10267                                                            corresponding load
10268                                                            atomic acquire,
10269                                                            except must generate
10270                                                            all instructions even
10271                                                            for OpenCL.*
10272     load atomic  seq_cst      - workgroup    - local
10273
10274                                                         1. s_waitcnt vmcnt(0) & vscnt(0)
10275
10276                                                           - If CU wavefront execution
10277                                                             mode, omit.
10278                                                           - Could be split into
10279                                                             separate s_waitcnt
10280                                                             vmcnt(0) and s_waitcnt
10281                                                             vscnt(0) to allow
10282                                                             them to be
10283                                                             independently moved
10284                                                             according to the
10285                                                             following rules.
10286                                                           - s_waitcnt vmcnt(0)
10287                                                             must happen after
10288                                                             preceding
10289                                                             global/generic load
10290                                                             atomic/
10291                                                             atomicrmw-with-return-value
10292                                                             with memory
10293                                                             ordering of seq_cst
10294                                                             and with equal or
10295                                                             wider sync scope.
10296                                                             (Note that seq_cst
10297                                                             fences have their
10298                                                             own s_waitcnt
10299                                                             vmcnt(0) and so do
10300                                                             not need to be
10301                                                             considered.)
10302                                                           - s_waitcnt vscnt(0)
10303                                                             must happen after
10304                                                             preceding
10305                                                             global/generic store
10306                                                             atomic/
10307                                                             atomicrmw-no-return-value
10308                                                             with memory
10309                                                             ordering of seq_cst
10310                                                             and with equal or
10311                                                             wider sync scope.
10312                                                             (Note that seq_cst
10313                                                             fences have their
10314                                                             own s_waitcnt
10315                                                             vscnt(0) and so do
10316                                                             not need to be
10317                                                             considered.)
10318                                                           - Ensures any
10319                                                             preceding
10320                                                             sequentially
10321                                                             consistent global
10322                                                             memory instructions
10323                                                             have completed
10324                                                             before executing
10325                                                             this sequentially
10326                                                             consistent
10327                                                             instruction. This
10328                                                             prevents reordering
10329                                                             a seq_cst store
10330                                                             followed by a
10331                                                             seq_cst load. (Note
10332                                                             that seq_cst is
10333                                                             stronger than
10334                                                             acquire/release as
10335                                                             the reordering of
10336                                                             load acquire
10337                                                             followed by a store
10338                                                             release is
10339                                                             prevented by the
10340                                                             s_waitcnt of
10341                                                             the release, but
10342                                                             there is nothing
10343                                                             preventing a store
10344                                                             release followed by
10345                                                             load acquire from
10346                                                             completing out of
10347                                                             order. The s_waitcnt
10348                                                             could be placed after
10349                                                             seq_cst store or before
10350                                                             the seq_cst load. We
10351                                                             choose the load to
10352                                                             make the s_waitcnt be
10353                                                             as late as possible
10354                                                             so that the store
10355                                                             may have already
10356                                                             completed.)
10357
10358                                                         2. *Following
10359                                                            instructions same as
10360                                                            corresponding load
10361                                                            atomic acquire,
10362                                                            except must generate
10363                                                            all instructions even
10364                                                            for OpenCL.*
10365
10366     load atomic  seq_cst      - agent        - global   1. s_waitcnt lgkmcnt(0) &
10367                               - system       - generic     vmcnt(0) & vscnt(0)
10368
10369                                                           - Could be split into
10370                                                             separate s_waitcnt
10371                                                             vmcnt(0), s_waitcnt
10372                                                             vscnt(0) and s_waitcnt
10373                                                             lgkmcnt(0) to allow
10374                                                             them to be
10375                                                             independently moved
10376                                                             according to the
10377                                                             following rules.
10378                                                           - s_waitcnt lgkmcnt(0)
10379                                                             must happen after
10380                                                             preceding
10381                                                             local load
10382                                                             atomic/store
10383                                                             atomic/atomicrmw
10384                                                             with memory
10385                                                             ordering of seq_cst
10386                                                             and with equal or
10387                                                             wider sync scope.
10388                                                             (Note that seq_cst
10389                                                             fences have their
10390                                                             own s_waitcnt
10391                                                             lgkmcnt(0) and so do
10392                                                             not need to be
10393                                                             considered.)
10394                                                           - s_waitcnt vmcnt(0)
10395                                                             must happen after
10396                                                             preceding
10397                                                             global/generic load
10398                                                             atomic/
10399                                                             atomicrmw-with-return-value
10400                                                             with memory
10401                                                             ordering of seq_cst
10402                                                             and with equal or
10403                                                             wider sync scope.
10404                                                             (Note that seq_cst
10405                                                             fences have their
10406                                                             own s_waitcnt
10407                                                             vmcnt(0) and so do
10408                                                             not need to be
10409                                                             considered.)
10410                                                           - s_waitcnt vscnt(0)
                                                             must happen after
10412                                                             preceding
10413                                                             global/generic store
10414                                                             atomic/
10415                                                             atomicrmw-no-return-value
10416                                                             with memory
10417                                                             ordering of seq_cst
10418                                                             and with equal or
10419                                                             wider sync scope.
10420                                                             (Note that seq_cst
10421                                                             fences have their
10422                                                             own s_waitcnt
10423                                                             vscnt(0) and so do
10424                                                             not need to be
10425                                                             considered.)
10426                                                           - Ensures any
10427                                                             preceding
                                                             sequentially
10429                                                             consistent global
10430                                                             memory instructions
10431                                                             have completed
10432                                                             before executing
10433                                                             this sequentially
10434                                                             consistent
10435                                                             instruction. This
10436                                                             prevents reordering
10437                                                             a seq_cst store
10438                                                             followed by a
10439                                                             seq_cst load. (Note
10440                                                             that seq_cst is
10441                                                             stronger than
10442                                                             acquire/release as
10443                                                             the reordering of
10444                                                             load acquire
10445                                                             followed by a store
10446                                                             release is
10447                                                             prevented by the
10448                                                             s_waitcnt of
10449                                                             the release, but
10450                                                             there is nothing
10451                                                             preventing a store
10452                                                             release followed by
10453                                                             load acquire from
10454                                                             completing out of
10455                                                             order. The s_waitcnt
10456                                                             could be placed after
                                                             seq_cst store or before
                                                             the seq_cst load. We
10459                                                             choose the load to
10460                                                             make the s_waitcnt be
10461                                                             as late as possible
10462                                                             so that the store
10463                                                             may have already
10464                                                             completed.)
10465
10466                                                         2. *Following
10467                                                            instructions same as
10468                                                            corresponding load
10469                                                            atomic acquire,
                                                            except must generate
10471                                                            all instructions even
10472                                                            for OpenCL.*
10473     store atomic seq_cst      - singlethread - global   *Same as corresponding
10474                               - wavefront    - local    store atomic release,
                               - workgroup    - generic  except must generate
10476                               - agent                   all instructions even
10477                               - system                  for OpenCL.*
10478     atomicrmw    seq_cst      - singlethread - global   *Same as corresponding
10479                               - wavefront    - local    atomicrmw acq_rel,
                               - workgroup    - generic  except must generate
10481                               - agent                   all instructions even
10482                               - system                  for OpenCL.*
10483     fence        seq_cst      - singlethread *none*     *Same as corresponding
10484                               - wavefront               fence acq_rel,
                               - workgroup               except must generate
10486                               - agent                   all instructions even
10487                               - system                  for OpenCL.*
10488     ============ ============ ============== ========== ================================
10489
10490Trap Handler ABI
10491~~~~~~~~~~~~~~~~
10492
10493For code objects generated by the AMDGPU backend for HSA [HSA]_ compatible
10494runtimes (see :ref:`amdgpu-os`), the runtime installs a trap handler that
10495supports the ``s_trap`` instruction. For usage see:
10496
10497- :ref:`amdgpu-trap-handler-for-amdhsa-os-v2-table`
10498- :ref:`amdgpu-trap-handler-for-amdhsa-os-v3-table`
10499- :ref:`amdgpu-trap-handler-for-amdhsa-os-v4-table`
10500
10501  .. table:: AMDGPU Trap Handler for AMDHSA OS Code Object V2
10502     :name: amdgpu-trap-handler-for-amdhsa-os-v2-table
10503
10504     =================== =============== =============== =======================================
10505     Usage               Code Sequence   Trap Handler    Description
10506                                         Inputs
10507     =================== =============== =============== =======================================
10508     reserved            ``s_trap 0x00``                 Reserved by hardware.
10509     ``debugtrap(arg)``  ``s_trap 0x01`` ``SGPR0-1``:    Reserved for Finalizer HSA ``debugtrap``
10510                                           ``queue_ptr`` intrinsic (not implemented).
10511                                         ``VGPR0``:
10512                                           ``arg``
10513     ``llvm.trap``       ``s_trap 0x02`` ``SGPR0-1``:    Causes wave to be halted with the PC at
10514                                           ``queue_ptr`` the trap instruction. The associated
10515                                                         queue is signalled to put it into the
10516                                                         error state.  When the queue is put in
10517                                                         the error state, the waves executing
10518                                                         dispatches on the queue will be
10519                                                         terminated.
     ``llvm.debugtrap``  ``s_trap 0x03`` *none*          - If the debugger is not enabled, behaves
                                                           as a no-operation. The trap handler
10522                                                           is entered and immediately returns to
10523                                                           continue execution of the wavefront.
10524                                                         - If the debugger is enabled, causes
10525                                                           the debug trap to be reported by the
10526                                                           debugger and the wavefront is put in
10527                                                           the halt state with the PC at the
10528                                                           instruction.  The debugger must
10529                                                           increment the PC and resume the wave.
10530     reserved            ``s_trap 0x04``                 Reserved.
10531     reserved            ``s_trap 0x05``                 Reserved.
10532     reserved            ``s_trap 0x06``                 Reserved.
10533     reserved            ``s_trap 0x07``                 Reserved.
10534     reserved            ``s_trap 0x08``                 Reserved.
10535     reserved            ``s_trap 0xfe``                 Reserved.
10536     reserved            ``s_trap 0xff``                 Reserved.
10537     =================== =============== =============== =======================================
10538
10539..
10540
10541  .. table:: AMDGPU Trap Handler for AMDHSA OS Code Object V3
10542     :name: amdgpu-trap-handler-for-amdhsa-os-v3-table
10543
10544     =================== =============== =============== =======================================
10545     Usage               Code Sequence   Trap Handler    Description
10546                                         Inputs
10547     =================== =============== =============== =======================================
10548     reserved            ``s_trap 0x00``                 Reserved by hardware.
10549     debugger breakpoint ``s_trap 0x01`` *none*          Reserved for debugger to use for
10550                                                         breakpoints. Causes wave to be halted
10551                                                         with the PC at the trap instruction.
                                                         The debugger is responsible for
                                                         resuming the wave, including the
                                                         instruction that the breakpoint overwrote.
10555     ``llvm.trap``       ``s_trap 0x02`` ``SGPR0-1``:    Causes wave to be halted with the PC at
10556                                           ``queue_ptr`` the trap instruction. The associated
10557                                                         queue is signalled to put it into the
10558                                                         error state.  When the queue is put in
10559                                                         the error state, the waves executing
10560                                                         dispatches on the queue will be
10561                                                         terminated.
     ``llvm.debugtrap``  ``s_trap 0x03`` *none*          - If the debugger is not enabled, behaves
                                                           as a no-operation. The trap handler
10564                                                           is entered and immediately returns to
10565                                                           continue execution of the wavefront.
10566                                                         - If the debugger is enabled, causes
10567                                                           the debug trap to be reported by the
10568                                                           debugger and the wavefront is put in
10569                                                           the halt state with the PC at the
10570                                                           instruction.  The debugger must
10571                                                           increment the PC and resume the wave.
10572     reserved            ``s_trap 0x04``                 Reserved.
10573     reserved            ``s_trap 0x05``                 Reserved.
10574     reserved            ``s_trap 0x06``                 Reserved.
10575     reserved            ``s_trap 0x07``                 Reserved.
10576     reserved            ``s_trap 0x08``                 Reserved.
10577     reserved            ``s_trap 0xfe``                 Reserved.
10578     reserved            ``s_trap 0xff``                 Reserved.
10579     =================== =============== =============== =======================================
10580
10581..
10582
10583  .. table:: AMDGPU Trap Handler for AMDHSA OS Code Object V4
10584     :name: amdgpu-trap-handler-for-amdhsa-os-v4-table
10585
10586     =================== =============== ================ ================= =======================================
10587     Usage               Code Sequence   GFX6-GFX8 Inputs GFX9-GFX10 Inputs Description
10588     =================== =============== ================ ================= =======================================
10589     reserved            ``s_trap 0x00``                                    Reserved by hardware.
10590     debugger breakpoint ``s_trap 0x01`` *none*           *none*            Reserved for debugger to use for
10591                                                                            breakpoints. Causes wave to be halted
10592                                                                            with the PC at the trap instruction.
                                                                            The debugger is responsible for
                                                                            resuming the wave, including the
                                                                            instruction that the breakpoint overwrote.
10596     ``llvm.trap``       ``s_trap 0x02`` ``SGPR0-1``:     *none*            Causes wave to be halted with the PC at
10597                                           ``queue_ptr``                    the trap instruction. The associated
10598                                                                            queue is signalled to put it into the
10599                                                                            error state.  When the queue is put in
10600                                                                            the error state, the waves executing
10601                                                                            dispatches on the queue will be
10602                                                                            terminated.
     ``llvm.debugtrap``  ``s_trap 0x03`` *none*           *none*            - If the debugger is not enabled, behaves
                                                                              as a no-operation. The trap handler
10605                                                                              is entered and immediately returns to
10606                                                                              continue execution of the wavefront.
10607                                                                            - If the debugger is enabled, causes
10608                                                                              the debug trap to be reported by the
10609                                                                              debugger and the wavefront is put in
10610                                                                              the halt state with the PC at the
10611                                                                              instruction.  The debugger must
10612                                                                              increment the PC and resume the wave.
10613     reserved            ``s_trap 0x04``                                    Reserved.
10614     reserved            ``s_trap 0x05``                                    Reserved.
10615     reserved            ``s_trap 0x06``                                    Reserved.
10616     reserved            ``s_trap 0x07``                                    Reserved.
10617     reserved            ``s_trap 0x08``                                    Reserved.
10618     reserved            ``s_trap 0xfe``                                    Reserved.
10619     reserved            ``s_trap 0xff``                                    Reserved.
10620     =================== =============== ================ ================= =======================================
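
As a hedged illustration of the ``llvm.trap`` rows in the three tables above,
the following Python sketch returns the trap handler inputs that must be set up
before executing ``s_trap 0x02``. The function name ``llvm_trap_inputs`` and its
parameters are hypothetical and exist only for this example. They are not part
of the backend or of any runtime API.

.. code-block:: python

  def llvm_trap_inputs(code_object_version, gfx_major):
      """Return the inputs required by ``s_trap 0x02`` (``llvm.trap``).

      Hypothetical helper: maps register names to the value each must hold.
      """
      if code_object_version not in (2, 3, 4):
          raise ValueError("only code object V2, V3 and V4 are tabulated")
      if not 6 <= gfx_major <= 10:
          raise ValueError("only GFX6 through GFX10 are tabulated")
      # Code object V4 on GFX9-GFX10 lists *none* as the input.
      if code_object_version == 4 and gfx_major >= 9:
          return {}
      # Every other tabulated combination passes queue_ptr in SGPR0-1.
      return {"SGPR0-1": "queue_ptr"}

For example, ``llvm_trap_inputs(4, 10)`` yields an empty mapping, while
``llvm_trap_inputs(3, 9)`` still requires ``queue_ptr`` in ``SGPR0-1``.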
10621
10622.. _amdgpu-amdhsa-function-call-convention:
10623
10624Call Convention
10625~~~~~~~~~~~~~~~
10626
10627.. note::
10628
  This section is currently incomplete and has inaccuracies. It is a work in
  progress and will be updated as more information becomes available.
10631
10632See :ref:`amdgpu-dwarf-address-space-identifier` for information on swizzled
10633addresses. Unswizzled addresses are normal linear addresses.
10634
10635.. _amdgpu-amdhsa-function-call-convention-kernel-functions:
10636
10637Kernel Functions
10638++++++++++++++++
10639
10640This section describes the call convention ABI for the outer kernel function.
10641
10642See :ref:`amdgpu-amdhsa-initial-kernel-execution-state` for the kernel call
10643convention.
10644
The following is not part of the AMDGPU kernel calling convention but describes
how the AMDGPU backend implements function calls:
10647
106481.  Clang decides the kernarg layout to match the *HSA Programmer's Language
10649    Reference* [HSA]_.
10650
10651    - All structs are passed directly.
10652    - Lambda values are passed *TBA*.
10653
10654    .. TODO::
10655
      - Does this really follow HSA rules? Or are structs >16 bytes passed
        as by-value structs?
10658      - What is ABI for lambda values?
10659
2.  The kernel performs certain setup in its prolog, as described in
10661    :ref:`amdgpu-amdhsa-kernel-prolog`.
10662
10663.. _amdgpu-amdhsa-function-call-convention-non-kernel-functions:
10664
10665Non-Kernel Functions
10666++++++++++++++++++++
10667
10668This section describes the call convention ABI for functions other than the
10669outer kernel function.
10670
10671If a kernel has function calls then scratch is always allocated and used for
10672the call stack which grows from low address to high address using the swizzled
10673scratch address space.
10674
10675On entry to a function:
10676
106771.  SGPR0-3 contain a V# with the following properties (see
10678    :ref:`amdgpu-amdhsa-kernel-prolog-private-segment-buffer`):
10679
10680    * Base address pointing to the beginning of the wavefront scratch backing
10681      memory.
10682    * Swizzled with dword element size and stride of wavefront size elements.
10683
2.  The FLAT_SCRATCH register pair is set up. See
    :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`.
3.  GFX6-GFX8: The M0 register is set to the size of LDS in bytes. See
    :ref:`amdgpu-amdhsa-kernel-prolog-m0`.
106884.  The EXEC register is set to the lanes active on entry to the function.
106895.  MODE register: *TBD*
106906.  VGPR0-31 and SGPR4-29 are used to pass function input arguments as described
10691    below.
7.  SGPR30-31 contain the return address (RA): the code address that the
    function must return to when it completes. The value is undefined if the
    function is *no return*.
106958.  SGPR32 is used for the stack pointer (SP). It is an unswizzled scratch
10696    offset relative to the beginning of the wavefront scratch backing memory.
10697
10698    The unswizzled SP can be used with buffer instructions as an unswizzled SGPR
10699    offset with the scratch V# in SGPR0-3 to access the stack in a swizzled
10700    manner.
10701
10702    The unswizzled SP value can be converted into the swizzled SP value by:
10703
10704      | swizzled SP = unswizzled SP / wavefront size
10705
10706    This may be used to obtain the private address space address of stack
10707    objects and to convert this address to a flat address by adding the flat
10708    scratch aperture base address.
10709
    The swizzled SP value is always 4 byte aligned for the ``r600``
    architecture and 16 byte aligned for the ``amdgcn`` architecture.
10712
10713    .. note::
10714
10715      The ``amdgcn`` value is selected to avoid dynamic stack alignment for the
10716      OpenCL language which has the largest base type defined as 16 bytes.
10717
10718    On entry, the swizzled SP value is the address of the first function
10719    argument passed on the stack. Other stack passed arguments are positive
10720    offsets from the entry swizzled SP value.
10721
10722    The function may use positive offsets beyond the last stack passed argument
10723    for stack allocated local variables and register spill slots. If necessary,
10724    the function may align these to greater alignment than 16 bytes. After these
10725    the function may dynamically allocate space for such things as runtime sized
10726    ``alloca`` local allocations.
10727
10728    If the function calls another function, it will place any stack allocated
10729    arguments after the last local allocation and adjust SGPR32 to the address
10730    after the last local allocation.
10731
107329.  All other registers are unspecified.
1073310. Any necessary ``s_waitcnt`` has been performed to ensure memory is available
10734    to the function.
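
The SP conversion described in item 8 above can be sketched in Python as
follows. This models the arithmetic only. The function names and the example
values, including the aperture base, are assumptions made for illustration and
are not part of the ABI.

.. code-block:: python

  def swizzled_sp(unswizzled_sp, wavefront_size):
      """Convert an unswizzled scratch offset into a swizzled offset."""
      if wavefront_size <= 0:
          raise ValueError("wavefront size must be positive")
      # swizzled SP = unswizzled SP / wavefront size
      return unswizzled_sp // wavefront_size


  def flat_address(swizzled, scratch_aperture_base):
      """Form a flat address by adding the flat scratch aperture base."""
      return scratch_aperture_base + swizzled


  # Hypothetical values chosen only to make the arithmetic visible.
  sp = swizzled_sp(0x1000, 64)        # 0x1000 / 64 == 0x40
  addr = flat_address(sp, 0x10000)    # 0x10000 + 0x40 == 0x10040

With a wavefront size of 64, an unswizzled SP of ``0x1000`` corresponds to a
swizzled SP of ``0x40``, which satisfies the 16 byte alignment stated for the
``amdgcn`` architecture.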
10735
10736On exit from a function:
10737
107381.  VGPR0-31 and SGPR4-29 are used to pass function result arguments as
10739    described below. Any registers used are considered clobbered registers.
107402.  The following registers are preserved and have the same value as on entry:
10741
10742    * FLAT_SCRATCH
10743    * EXEC
10744    * GFX6-GFX8: M0
10745    * All SGPR registers except the clobbered registers of SGPR4-31.
10746    * VGPR40-47
10747    * VGPR56-63
10748    * VGPR72-79
10749    * VGPR88-95
10750    * VGPR104-111
10751    * VGPR120-127
10752    * VGPR136-143
10753    * VGPR152-159
10754    * VGPR168-175
10755    * VGPR184-191
10756    * VGPR200-207
10757    * VGPR216-223
10758    * VGPR232-239
10759    * VGPR248-255
10760
10761        .. note::
10762
          Except for the argument registers, the clobbered and preserved VGPRs
          are intermixed at regular intervals in order to keep a similar ratio
          independent of the number of allocated VGPRs.
10766
10767    * Lanes of all VGPRs that are inactive at the call site.
10768
      For the AMDGPU backend, an inter-procedural register allocation (IPRA)
      optimization may mark some of the clobbered SGPR and VGPR registers as
      preserved if it can be determined that the called function does not change
      their value.
10773
3.  The PC is set to the RA provided on entry.
4.  MODE register: *TBD*.
5.  All other registers are clobbered.
6.  Any necessary ``s_waitcnt`` has been performed to ensure memory accessed by
    the function is available to the caller.
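
The preserved VGPR ranges listed above follow a regular pattern: starting at
VGPR40, each block of 16 registers consists of 8 preserved registers followed
by 8 clobbered registers. The following Python sketch reproduces that list.
The helper names are hypothetical, and the sketch models neither the IPRA
optimization nor the preservation of inactive lanes.

.. code-block:: python

  def is_preserved_vgpr(n):
      """Return True if VGPRn falls in one of the preserved ranges."""
      if not 0 <= n <= 255:
          raise ValueError("VGPR index out of range")
      # Blocks of 16 starting at VGPR40: the first 8 are preserved.
      return n >= 40 and (n - 40) % 16 < 8


  def preserved_vgpr_ranges():
      """Return the preserved ranges as inclusive (first, last) pairs."""
      return [(first, first + 7) for first in range(40, 256, 16)]

The first pair returned by ``preserved_vgpr_ranges()`` is ``(40, 47)`` and the
last is ``(248, 255)``, matching the first and last entries of the list.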
10779
10780.. TODO::
10781
10782  - On gfx908 are all ACC registers clobbered?
10783
10784  - How are function results returned? The address of structured types is passed
10785    by reference, but what about other types?
10786
10787The function input arguments are made up of the formal arguments explicitly
10788declared by the source language function plus the implicit input arguments used
10789by the implementation.
10790
10791The source language input arguments are:
10792
107931. Any source language implicit ``this`` or ``self`` argument comes first as a
10794   pointer type.
107952. Followed by the function formal arguments in left to right source order.
10796
10797The source language result arguments are:
10798
107991. The function result argument.
10800
The source language input or result struct type arguments that are less than or
equal to 16 bytes are decomposed recursively into their base type fields, and
each field is passed as if it were a separate argument. For input arguments, if the
10804called function requires the struct to be in memory, for example because its
10805address is taken, then the function body is responsible for allocating a stack
10806location and copying the field arguments into it. Clang terms this *direct
10807struct*.
10808
The source language input struct type arguments that are greater than 16 bytes
are passed by reference. The caller is responsible for allocating a stack
location to make a copy of the struct value and for passing the address as the
input argument. The called function is responsible for performing the
dereference when accessing the input argument. Clang terms this *by-value
struct*.
10814
A source language result struct type argument that is greater than 16 bytes is
returned by reference. The caller is responsible for allocating a stack location
to hold the result value and passes the address as the last input argument
(before the implicit input arguments). In this case there are no result
arguments. The called function is responsible for performing the dereference
when storing the result value. Clang terms this *structured return (sret)*.
10821
10822*TODO: correct the ``sret`` definition.*
10823
10824.. TODO::
10825
  Is this definition correct? Or is ``sret`` only used if passing in registers,
  with the struct otherwise passed non-decomposed as a stack argument? Or
  something else? Is the memory location in the caller stack frame, or is it a
  stack memory argument, so that no address is passed because the caller can
  directly write to the argument stack location? But then the stack location is
  still live after return. If it is an argument stack location, is it the first
  stack argument or the last one?
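
The size-based rules for struct arguments described above can be summarized by
the following Python sketch. The function name and the returned labels are
chosen for this example only. The sketch simply restates the 16 byte threshold
from the text, including the ``sret`` definition that is flagged above as
needing correction.

.. code-block:: python

  def classify_struct_argument(size_in_bytes, is_result=False):
      """Return the Clang term for how a struct argument is passed."""
      if size_in_bytes <= 0:
          raise ValueError("struct size must be positive")
      if size_in_bytes <= 16:
          # Decomposed recursively into base type fields.
          return "direct struct"
      if is_result:
          # Caller allocates the result and passes its address.
          return "structured return (sret)"
      # Caller makes a stack copy and passes its address.
      return "by-value struct"

Under these rules a 16 byte input struct is decomposed into its fields, while a
20 byte input struct is copied to the stack by the caller and passed by
address.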
10832
10833Lambda argument types are treated as struct types with an implementation defined
10834set of fields.
10835
10836.. TODO::
10837
10838  Need to specify the ABI for lambda types for AMDGPU.
10839
For the AMDGPU backend, all source language arguments (including the decomposed
struct type arguments) are passed in VGPRs unless marked ``inreg``, in which case
they are passed in SGPRs.
10843
10844The AMDGPU backend walks the function call graph from the leaves to determine
10845which implicit input arguments are used, propagating to each caller of the
10846function. The used implicit arguments are appended to the function arguments
10847after the source language arguments in the following order:
10848
10849.. TODO::
10850
  Are recursion and external functions supported?
10852
108531.  Work-Item ID (1 VGPR)
10854
    The X, Y and Z work-item IDs are packed into a single VGPR with the following
10856    layout. Only fields actually used by the function are set. The other bits
10857    are undefined.
10858
10859    The values come from the initial kernel execution state. See
10860    :ref:`amdgpu-amdhsa-initial-kernel-execution-state`.
10861
10862    .. table:: Work-item implicit argument layout
10863      :name: amdgpu-amdhsa-workitem-implicit-argument-layout-table
10864
10865      ======= ======= ==============
10866      Bits    Size    Field Name
10867      ======= ======= ==============
10868      9:0     10 bits X Work-Item ID
10869      19:10   10 bits Y Work-Item ID
10870      29:20   10 bits Z Work-Item ID
10871      31:30   2 bits  Unused
10872      ======= ======= ==============
10873
108742.  Dispatch Ptr (2 SGPRs)
10875
10876    The value comes from the initial kernel execution state. See
10877    :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
10878
108793.  Queue Ptr (2 SGPRs)
10880
10881    The value comes from the initial kernel execution state. See
10882    :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
10883
108844.  Kernarg Segment Ptr (2 SGPRs)
10885
10886    The value comes from the initial kernel execution state. See
10887    :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
10888
5.  Dispatch ID (2 SGPRs)
10890
10891    The value comes from the initial kernel execution state. See
10892    :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
10893
108946.  Work-Group ID X (1 SGPR)
10895
10896    The value comes from the initial kernel execution state. See
10897    :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
10898
108997.  Work-Group ID Y (1 SGPR)
10900
10901    The value comes from the initial kernel execution state. See
10902    :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
10903
109048.  Work-Group ID Z (1 SGPR)
10905
10906    The value comes from the initial kernel execution state. See
10907    :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
10908
109099.  Implicit Argument Ptr (2 SGPRs)
10910
10911    The value is computed by adding an offset to Kernarg Segment Ptr to get the
10912    global address space pointer to the first kernarg implicit argument.
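
The work-item ID layout in item 1 above can be illustrated with the following
Python sketch, which packs and unpacks the three 10 bit fields of the 32 bit
VGPR value. The function names are hypothetical. Because only the fields
actually used by the function are set, a real consumer should read only the
fields it knows to be defined.

.. code-block:: python

  ID_BITS = 10
  ID_MASK = (1 << ID_BITS) - 1  # 0x3ff


  def pack_work_item_id(x, y, z):
      """Pack X, Y and Z work-item IDs into bits 9:0, 19:10 and 29:20."""
      for value in (x, y, z):
          if not 0 <= value <= ID_MASK:
              raise ValueError("work-item ID does not fit in 10 bits")
      return x | (y << 10) | (z << 20)


  def unpack_work_item_id(vgpr):
      """Extract (x, y, z), ignoring the unused bits 31:30."""
      return (vgpr & ID_MASK,
              (vgpr >> 10) & ID_MASK,
              (vgpr >> 20) & ID_MASK)

For example, ``pack_work_item_id(1, 2, 3)`` produces ``0x300801``, and
unpacking that value returns ``(1, 2, 3)``.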
10913
10914The input and result arguments are assigned in order in the following manner:
10915
10916.. note::
10917
10918  There are likely some errors and omissions in the following description that
10919  need correction.
10920
10921  .. TODO::
10922
10923    Check the Clang source code to decipher how function arguments and return
10924    results are handled. Also see the AMDGPU specific values used.
10925
10926* VGPR arguments are assigned to consecutive VGPRs starting at VGPR0 up to
10927  VGPR31.
10928
10929  If there are more arguments than will fit in these registers, the remaining
10930  arguments are allocated on the stack in order on naturally aligned
10931  addresses.
10932
10933  .. TODO::
10934
10935    How are overly aligned structures allocated on the stack?
10936
* SGPR arguments are assigned to consecutive SGPRs starting at SGPR4 up to
  SGPR29.
10939
10940  If there are more arguments than will fit in these registers, the remaining
10941  arguments are allocated on the stack in order on naturally aligned
10942  addresses.
10943
10944Note that decomposed struct type arguments may have some fields passed in
10945registers and some in memory.
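
As a simplified sketch of the VGPR assignment rule above, the following Python
function assigns locations to a sequence of dword-sized VGPR arguments. It is
an illustration only: the function name is hypothetical, every argument is
assumed to be exactly 4 bytes (so natural alignment is 4 bytes), and stack
locations are expressed as byte offsets from the entry swizzled SP.

.. code-block:: python

  NUM_ARG_VGPRS = 32  # VGPR0 through VGPR31
  DWORD_BYTES = 4


  def assign_dword_vgpr_arguments(count):
      """Return a location for each of ``count`` dword VGPR arguments."""
      if count < 0:
          raise ValueError("argument count must be non-negative")
      locations = []
      for index in range(count):
          if index < NUM_ARG_VGPRS:
              # Consecutive registers starting at VGPR0.
              locations.append(("VGPR", index))
          else:
              # Remaining arguments go on the stack in order.
              offset = (index - NUM_ARG_VGPRS) * DWORD_BYTES
              locations.append(("stack", offset))
      return locations

With 34 dword arguments, the first 32 land in VGPR0 through VGPR31 and the last
two are placed at stack offsets 0 and 4.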
10946
10947.. TODO::
10948
10949  So, a struct which can pass some fields as decomposed register arguments, will
10950  pass the rest as decomposed stack elements? But an argument that will not start
10951  in registers will not be decomposed and will be passed as a non-decomposed
10952  stack value?
10953
The following is not part of the AMDGPU function calling convention but
describes how the AMDGPU backend implements function calls:
10956
1.  SGPR33 is used as a frame pointer (FP) if necessary. Like the SP it is an
    unswizzled scratch address. It is only needed if runtime sized ``alloca``
    allocations are used, or for the reasons defined in ``SIFrameLowering``.
2.  Runtime stack alignment is supported. SGPR34 is used as a base pointer (BP)
    to access the incoming stack arguments in the function. The BP is needed
    only when the function requires runtime stack alignment.
10963
3.  Allocating SGPR arguments on the stack is not supported.
10965
109664.  No CFI is currently generated. See
10967    :ref:`amdgpu-dwarf-call-frame-information`.
10968
10969    .. note::
10970
10971      CFI will be generated that defines the CFA as the unswizzled address
10972      relative to the wave scratch base in the unswizzled private address space
10973      of the lowest address stack allocated local variable.
10974
10975      ``DW_AT_frame_base`` will be defined as the swizzled address in the
10976      swizzled private address space by dividing the CFA by the wavefront size
10977      (since CFA is always at least dword aligned which matches the scratch
10978      swizzle element size).
10979
10980      If no dynamic stack alignment was performed, the stack allocated arguments
10981      are accessed as negative offsets relative to ``DW_AT_frame_base``, and the
10982      local variables and register spill slots are accessed as positive offsets
10983      relative to ``DW_AT_frame_base``.
10984
109855.  Function argument passing is implemented by copying the input physical
10986    registers to virtual registers on entry. The register allocator can spill if
10987    necessary. These are copied back to physical registers at call sites. The
10988    net effect is that each function call can have these values in entirely
    distinct locations. Interprocedural register allocation (IPRA) can help
    avoid shuffling argument registers.
109906.  Call sites are implemented by setting up the arguments at positive offsets
10991    from SP. Then SP is incremented to account for the known frame size before
10992    the call and decremented after the call.
10993
10994    .. note::
10995
10996      The CFI will reflect the changed calculation needed to compute the CFA
10997      from SP.
10998
109997.  4 byte spill slots are used in the stack frame. One slot is allocated for an
11000    emergency spill slot. Buffer instructions are used for stack accesses and
11001    not the ``flat_scratch`` instruction.
11002
11003    .. TODO::
11004
11005      Explain when the emergency spill slot is used.
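
The relationship between the CFA and ``DW_AT_frame_base`` described in the
note under item 4 amounts to a single division. The Python sketch below only
illustrates that arithmetic; the function name ``frame_base_from_cfa`` and the
example values (a wavefront size of 64 and a CFA of 4096) are assumptions
chosen for illustration and do not come from the backend.

.. code-block:: python

  def frame_base_from_cfa(cfa_unswizzled, wavefront_size):
      """Convert an unswizzled CFA into the swizzled frame base.

      Sketch only: follows the note above, which defines the frame base as
      the CFA divided by the wavefront size.
      """
      if wavefront_size <= 0:
          raise ValueError("wavefront size must be positive")
      return cfa_unswizzled // wavefront_size


  # Example values are illustrative, not taken from a real code object.
  print(frame_base_from_cfa(4096, 64))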
11006
11007.. TODO::
11008
11009  Possible broken issues:
11010
11011  - Stack arguments must be aligned to required alignment.
11012  - Stack is aligned to max(16, max formal argument alignment)
11013  - Direct argument < 64 bits should check register budget.
11014  - Register budget calculation should respect ``inreg`` for SGPR.
11015  - SGPR overflow is not handled.
11016  - struct with 1 member unpeeling is not checking size of member.
11017  - ``sret`` is after ``this`` pointer.
11018  - Caller is not implementing stack realignment: need an extra pointer.
11019  - Should say AMDGPU passes FP rather than SP.
11020  - Should CFI define CFA as address of locals or arguments. Difference is
11021    apparent when have implemented dynamic alignment.
11022  - If ``SCRATCH`` instruction could allow negative offsets, then can make FP be
11023    highest address of stack frame and use negative offset for locals. Would
11024    allow SP to be the same as FP and could support signal-handler-like as now
11025    have a real SP for the top of the stack.
11026  - How is ``sret`` passed on the stack? In argument stack area? Can it overlay
11027    arguments?
11028
11029AMDPAL
11030------
11031
11032This section provides code conventions used when the target triple OS is
11033``amdpal`` (see :ref:`amdgpu-target-triples`).
11034
11035.. _amdgpu-amdpal-code-object-metadata-section:
11036
11037Code Object Metadata
11038~~~~~~~~~~~~~~~~~~~~
11039
11040.. note::
11041
11042  The metadata is currently in development and is subject to major
11043  changes. Only the current version is supported. *When this document
11044  was generated the version was 2.6.*
11045
11046Code object metadata is specified by the ``NT_AMDGPU_METADATA`` note
11047record (see :ref:`amdgpu-note-records-v3-v4`).
11048
11049The metadata is represented as Message Pack formatted binary data (see
11050[MsgPack]_). The top level is a Message Pack map that includes the keys
11051defined in table :ref:`amdgpu-amdpal-code-object-metadata-map-table`
11052and referenced tables.
11053
Additional information can be added to the maps. To avoid conflicts, any
added key name should be prefixed by "*vendor-name*.", where *vendor-name*
can be the name of the vendor and the specific vendor tool that generates
the information. The prefix is abbreviated to simply "." when it appears
within a map that has been added by the same *vendor-name*.
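
The abbreviation rule can be sketched in a few lines of Python. This is an
illustration only: ``qualify_key`` is a hypothetical helper, ``acme`` is a
made-up vendor name, and the sketch assumes that ``amdpal`` plays the role of
*vendor-name* for the keys defined in this section.

.. code-block:: python

  def qualify_key(key, enclosing_vendor):
      """Expand an abbreviated key such as ".name" to its prefixed form.

      Hypothetical helper: a key that starts with "." is taken to belong to
      the vendor that added the enclosing map; any other key is assumed to
      carry its own vendor prefix already.
      """
      if key.startswith("."):
          return enclosing_vendor + key
      return key


  print(qualify_key(".name", "amdpal"))
  print(qualify_key("acme.note", "amdpal"))

The first call expands the abbreviated key to ``amdpal.name``, while the second
key already carries its own prefix and is returned unchanged.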
11059
11060  .. table:: AMDPAL Code Object Metadata Map
11061     :name: amdgpu-amdpal-code-object-metadata-map-table
11062
11063     =================== ============== ========= ======================================================================
11064     String Key          Value Type     Required? Description
11065     =================== ============== ========= ======================================================================
11066     "amdpal.version"    sequence of    Required  PAL code object metadata (major, minor) version. The current values
11067                         2 integers               are defined by *Util::Abi::PipelineMetadata(Major|Minor)Version*.
11068     "amdpal.pipelines"  sequence of    Required  Per-pipeline metadata. See
11069                         map                      :ref:`amdgpu-amdpal-code-object-pipeline-metadata-map-table` for the
11070                                                  definition of the keys included in that map.
11071     =================== ============== ========= ======================================================================
11072
11073..
11074
11075  .. table:: AMDPAL Code Object Pipeline Metadata Map
11076     :name: amdgpu-amdpal-code-object-pipeline-metadata-map-table
11077
11078     ====================================== ============== ========= ===================================================
11079     String Key                             Value Type     Required? Description
11080     ====================================== ============== ========= ===================================================
11081     ".name"                                string                   Source name of the pipeline.
11082     ".type"                                string                   Pipeline type, e.g. VsPs. Values include:
11083
11084                                                                       - "VsPs"
11085                                                                       - "Gs"
11086                                                                       - "Cs"
11087                                                                       - "Ngg"
11088                                                                       - "Tess"
11089                                                                       - "GsTess"
11090                                                                       - "NggTess"
11091
11092     ".internal_pipeline_hash"              sequence of    Required  Internal compiler hash for this pipeline. Lower
11093                                            2 integers               64 bits is the "stable" portion of the hash, used
11094                                                                     for e.g. shader replacement lookup. Upper 64 bits
11095                                                                     is the "unique" portion of the hash, used for
11096                                                                     e.g. pipeline cache lookup. The value is
                                                                     implementation defined, and cannot be relied on
11098                                                                     between different builds of the compiler.
11099     ".shaders"                             map                      Per-API shader metadata. See
11100                                                                     :ref:`amdgpu-amdpal-code-object-shader-map-table`
11101                                                                     for the definition of the keys included in that
11102                                                                     map.
11103     ".hardware_stages"                     map                      Per-hardware stage metadata. See
11104                                                                     :ref:`amdgpu-amdpal-code-object-hardware-stage-map-table`
11105                                                                     for the definition of the keys included in that
11106                                                                     map.
11107     ".shader_functions"                    map                      Per-shader function metadata. See
11108                                                                     :ref:`amdgpu-amdpal-code-object-shader-function-map-table`
11109                                                                     for the definition of the keys included in that
11110                                                                     map.
11111     ".registers"                           map            Required  Hardware register configuration. See
11112                                                                     :ref:`amdgpu-amdpal-code-object-register-map-table`
11113                                                                     for the definition of the keys included in that
11114                                                                     map.
11115     ".user_data_limit"                     integer                  Number of user data entries accessed by this
11116                                                                     pipeline.
11117     ".spill_threshold"                     integer                  The user data spill threshold.  0xFFFF for
11118                                                                     NoUserDataSpilling.
11119     ".uses_viewport_array_index"           boolean                  Indicates whether or not the pipeline uses the
11120                                                                     viewport array index feature. Pipelines which use
11121                                                                     this feature can render into all 16 viewports,
11122                                                                     whereas pipelines which do not use it are
11123                                                                     restricted to viewport #0.
11124     ".es_gs_lds_size"                      integer                  Size in bytes of LDS space used internally for
11125                                                                     handling data-passing between the ES and GS
11126                                                                     shader stages. This can be zero if the data is
11127                                                                     passed using off-chip buffers. This value should
11128                                                                     be used to program all user-SGPRs which have been
11129                                                                     marked with "UserDataMapping::EsGsLdsSize"
11130                                                                     (typically only the GS and VS HW stages will ever
11131                                                                     have a user-SGPR so marked).
11132     ".nggSubgroupSize"                     integer                  Explicit maximum subgroup size for NGG shaders
11133                                                                     (maximum number of threads in a subgroup).
11134     ".num_interpolants"                    integer                  Graphics only. Number of PS interpolants.
11135     ".mesh_scratch_memory_size"            integer                  Max mesh shader scratch memory used.
11136     ".api"                                 string                   Name of the client graphics API.
11137     ".api_create_info"                     binary                   Graphics API shader create info binary blob. Can
11138                                                                     be defined by the driver using the compiler if
11139                                                                     they want to be able to correlate API-specific
11140                                                                     information used during creation at a later time.
11141     ====================================== ============== ========= ===================================================
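
A consumer that has already decoded the Message Pack data into native maps
could check for the keys marked Required in the two tables above along the
following lines. This is a minimal sketch: ``missing_required_keys`` is a
hypothetical helper, the decoding step is omitted, and the example values are
placeholders rather than output from a real compiler.

.. code-block:: python

  TOP_LEVEL_REQUIRED = ("amdpal.version", "amdpal.pipelines")
  PIPELINE_REQUIRED = (".internal_pipeline_hash", ".registers")


  def missing_required_keys(metadata):
      """List the required keys that are absent from a decoded metadata map."""
      missing = [key for key in TOP_LEVEL_REQUIRED if key not in metadata]
      for index, pipeline in enumerate(metadata.get("amdpal.pipelines", [])):
          for key in PIPELINE_REQUIRED:
              if key not in pipeline:
                  missing.append(f"amdpal.pipelines[{index}]{key}")
      return missing


  # Illustrative values only; a real code object carries generated data.
  example = {
      "amdpal.version": [2, 6],
      "amdpal.pipelines": [
          {
              ".name": "example_pipeline",
              ".type": "Cs",
              ".internal_pipeline_hash": [0x1234, 0x5678],
              ".registers": {},
          }
      ],
  }
  print(missing_required_keys(example))

Optional keys are ignored here; a stricter check could also verify value types
against the Value Type column.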
11142
11143..
11144
11145  .. table:: AMDPAL Code Object Shader Map
11146     :name: amdgpu-amdpal-code-object-shader-map-table
11147
11148
11149     +-------------+--------------+-------------------------------------------------------------------+
11150     |String Key   |Value Type    |Description                                                        |
11151     +=============+==============+===================================================================+
11152     |- ".compute" |map           |See :ref:`amdgpu-amdpal-code-object-api-shader-metadata-map-table` |
11153     |- ".vertex"  |              |for the definition of the keys included in that map.               |
11154     |- ".hull"    |              |                                                                   |
11155     |- ".domain"  |              |                                                                   |
11156     |- ".geometry"|              |                                                                   |
11157     |- ".pixel"   |              |                                                                   |
11158     +-------------+--------------+-------------------------------------------------------------------+
11159
11160..
11161
11162  .. table:: AMDPAL Code Object API Shader Metadata Map
11163     :name: amdgpu-amdpal-code-object-api-shader-metadata-map-table
11164
11165     ==================== ============== ========= =====================================================================
11166     String Key           Value Type     Required? Description
11167     ==================== ============== ========= =====================================================================
11168     ".api_shader_hash"   sequence of    Required  Input shader hash, typically passed in from the client. The value
                          2 integers               is implementation defined, and cannot be relied on between
11170                                                   different builds of the compiler.
11171     ".hardware_mapping"  sequence of    Required  Flags indicating the HW stages this API shader maps to. Values
11172                          string                   include:
11173
11174                                                     - ".ls"
11175                                                     - ".hs"
11176                                                     - ".es"
11177                                                     - ".gs"
11178                                                     - ".vs"
11179                                                     - ".ps"
11180                                                     - ".cs"
11181
11182     ==================== ============== ========= =====================================================================
11183
11184..
11185
11186  .. table:: AMDPAL Code Object Hardware Stage Map
11187     :name: amdgpu-amdpal-code-object-hardware-stage-map-table
11188
11189     +-------------+--------------+-----------------------------------------------------------------------+
11190     |String Key   |Value Type    |Description                                                            |
11191     +=============+==============+=======================================================================+
11192     |- ".ls"      |map           |See :ref:`amdgpu-amdpal-code-object-hardware-stage-metadata-map-table` |
11193     |- ".hs"      |              |for the definition of the keys included in that map.                   |
11194     |- ".es"      |              |                                                                       |
11195     |- ".gs"      |              |                                                                       |
11196     |- ".vs"      |              |                                                                       |
11197     |- ".ps"      |              |                                                                       |
11198     |- ".cs"      |              |                                                                       |
11199     +-------------+--------------+-----------------------------------------------------------------------+
11200
11201..
11202
11203  .. table:: AMDPAL Code Object Hardware Stage Metadata Map
11204     :name: amdgpu-amdpal-code-object-hardware-stage-metadata-map-table
11205
11206     ========================== ============== ========= ===============================================================
11207     String Key                 Value Type     Required? Description
11208     ========================== ============== ========= ===============================================================
11209     ".entry_point"             string                   The ELF symbol pointing to this pipeline's stage entry point.
11210     ".scratch_memory_size"     integer                  Scratch memory size in bytes.
11211     ".lds_size"                integer                  Local Data Share size in bytes.
11212     ".perf_data_buffer_size"   integer                  Performance data buffer size in bytes.
11213     ".vgpr_count"              integer                  Number of VGPRs used.
11214     ".sgpr_count"              integer                  Number of SGPRs used.
11215     ".vgpr_limit"              integer                  If non-zero, indicates the shader was compiled with a
11216                                                         directive to instruct the compiler to limit the VGPR usage to
11217                                                         be less than or equal to the specified value (only set if
11218                                                         different from HW default).
11219     ".sgpr_limit"              integer                  SGPR count upper limit (only set if different from HW
11220                                                         default).
11221     ".threadgroup_dimensions"  sequence of              Thread-group X/Y/Z dimensions (Compute only).
11222                                3 integers
11223     ".wavefront_size"          integer                  Wavefront size (only set if different from HW default).
11224     ".uses_uavs"               boolean                  The shader reads or writes UAVs.
11225     ".uses_rovs"               boolean                  The shader reads or writes ROVs.
11226     ".writes_uavs"             boolean                  The shader writes to one or more UAVs.
11227     ".writes_depth"            boolean                  The shader writes out a depth value.
11228     ".uses_append_consume"     boolean                  The shader uses append and/or consume operations, either
11229                                                         memory or GDS.
11230     ".uses_prim_id"            boolean                  The shader uses PrimID.
11231     ========================== ============== ========= ===============================================================
11232
11233..
11234
11235  .. table:: AMDPAL Code Object Shader Function Map
11236     :name: amdgpu-amdpal-code-object-shader-function-map-table
11237
11238     =============== ============== ====================================================================
11239     String Key      Value Type     Description
11240     =============== ============== ====================================================================
11241     *symbol name*   map            *symbol name* is the ELF symbol name of the shader function code
11242                                    entry address. The value is the function's metadata. See
11243                                    :ref:`amdgpu-amdpal-code-object-shader-function-metadata-map-table`.
11244     =============== ============== ====================================================================
11245
11246..
11247
11248  .. table:: AMDPAL Code Object Shader Function Metadata Map
11249     :name: amdgpu-amdpal-code-object-shader-function-metadata-map-table
11250
11251     ============================= ============== =================================================================
11252     String Key                    Value Type     Description
11253     ============================= ============== =================================================================
11254     ".api_shader_hash"            sequence of    Input shader hash, typically passed in from the client. The value
                                   2 integers     is implementation defined, and cannot be relied on between
11256                                                  different builds of the compiler.
11257     ".scratch_memory_size"        integer        Size in bytes of scratch memory used by the shader.
11258     ".lds_size"                   integer        Size in bytes of LDS memory.
11259     ".vgpr_count"                 integer        Number of VGPRs used by the shader.
11260     ".sgpr_count"                 integer        Number of SGPRs used by the shader.
11261     ".stack_frame_size_in_bytes"  integer        Amount of stack size used by the shader.
11262     ".shader_subtype"             string         Shader subtype/kind. Values include:
11263
11264                                                    - "Unknown"
11265
11266     ============================= ============== =================================================================
11267
11268..
11269
11270  .. table:: AMDPAL Code Object Register Map
11271     :name: amdgpu-amdpal-code-object-register-map-table
11272
11273     ========================== ============== ====================================================================
11274     32-bit Integer Key         Value Type     Description
11275     ========================== ============== ====================================================================
11276     ``reg offset``             32-bit integer ``reg offset`` is the dword offset into the GFXIP register space of
11277                                               a GRBM register (i.e., driver accessible GPU register number, not
11278                                               shader GPR register number). The driver is required to program each
11279                                               specified register to the corresponding specified value when
11280                                               executing this pipeline. Typically, the ``reg offsets`` are the
11281                                               ``uint16_t`` offsets to each register as defined by the hardware
11282                                               chip headers. The register is set to the provided value. However, a
11283                                               ``reg offset`` that specifies a user data register (e.g.,
11284                                               COMPUTE_USER_DATA_0) needs special treatment. See
11285                                               :ref:`amdgpu-amdpal-code-object-user-data-section` section for more
11286                                               information.
11287     ========================== ============== ====================================================================
11288
11289.. _amdgpu-amdpal-code-object-user-data-section:
11290
11291User Data
11292+++++++++
11293
11294Each hardware stage has a set of 32-bit physical SPI *user data registers*
11295(either 16 or 32 based on graphics IP and the stage) which can be
11296written from a command buffer and then loaded into SGPRs when waves are
11297launched via a subsequent dispatch or draw operation. This is the way
11298most arguments are passed from the application/runtime to a hardware
11299shader.
11300
11301PAL abstracts this functionality by exposing a set of 128 *user data
11302entries* per pipeline a client can use to pass arguments from a command
11303buffer to one or more shaders in that pipeline. The ELF code object must
11304specify a mapping from virtualized *user data entries* to physical *user
11305data registers*, and PAL is responsible for implementing that mapping,
11306including spilling overflow *user data entries* to memory if needed.
11307
11308Since the *user data registers* are GRBM-accessible SPI registers, this
11309mapping is actually embedded in the ``.registers`` metadata entry. For
11310most registers, the value in that map is a literal 32-bit value that
11311should be written to the register by the driver. However, when the
register is a *user data register* (any USER_DATA register, e.g.,
11313SPI_SHADER_USER_DATA_PS_5), the value is instead an encoding that tells
11314the driver to write either a *user data entry* value or one of several
11315driver-internal values to the register. This encoding is described in
11316the following table:
11317
11318.. note::
11319
  Currently, *user data registers* 0 and 1 (e.g., SPI_SHADER_USER_DATA_PS_0 and
  SPI_SHADER_USER_DATA_PS_1) are reserved. *User data register* 0 must
11322  always be programmed to the address of the GlobalTable, and *user data
11323  register* 1 must always be programmed to the address of the PerShaderTable.
11324
11325..
11326
11327  .. table:: AMDPAL User Data Mapping
11328     :name: amdgpu-amdpal-code-object-metadata-user-data-mapping-table
11329
11330     ==========  =================  ===============================================================================
11331     Value       Name               Description
11332     ==========  =================  ===============================================================================
11333     0..127      *User Data Entry*  32-bit value of user_data_entry[N] as specified via *CmdSetUserData()*
11334     0x10000000  GlobalTable        32-bit pointer to GPU memory containing the global internal table (should
11335                                    always point to *user data register* 0).
11336     0x10000001  PerShaderTable     32-bit pointer to GPU memory containing the per-shader internal table. See
11337                                    :ref:`amdgpu-amdpal-code-object-metadata-user-data-per-shader-table-section`
11338                                    for more detail (should always point to *user data register* 1).
11339     0x10000002  SpillTable         32-bit pointer to GPU memory containing the user data spill table. See
11340                                    :ref:`amdgpu-amdpal-code-object-metadata-user-data-spill-table-section` for
11341                                    more detail.
11342     0x10000003  BaseVertex         Vertex offset (32-bit unsigned integer). Not needed if the pipeline doesn't
11343                                    reference the draw index in the vertex shader. Only supported by the first
11344                                    stage in a graphics pipeline.
11345     0x10000004  BaseInstance       Instance offset (32-bit unsigned integer). Only supported by the first stage in
11346                                    a graphics pipeline.
11347     0x10000005  DrawIndex          Draw index (32-bit unsigned integer). Only supported by the first stage in a
11348                                    graphics pipeline.
11349     0x10000006  Workgroup          Thread group count (32-bit unsigned integer). Low half of a 64-bit address of
11350                                    a buffer containing the grid dimensions for a Compute dispatch operation. The
11351                                    high half of the address is stored in the next sequential user-SGPR. Only
11352                                    supported by compute pipelines.
11353     0x1000000A  EsGsLdsSize        Indicates that PAL will program this user-SGPR to contain the amount of LDS
11354                                    space used for the ES/GS pseudo-ring-buffer for passing data between shader
11355                                    stages.
     0x1000000B  ViewId             View ID (32-bit unsigned integer). Identifies a view of graphics
                                    pipeline instancing.
11358     0x1000000C  StreamOutTable     32-bit pointer to GPU memory containing the stream out target SRD table.  This
11359                                    can only appear for one shader stage per pipeline.
11360     0x1000000D  PerShaderPerfData  32-bit pointer to GPU memory containing the per-shader performance data buffer.
11361     0x1000000F  VertexBufferTable  32-bit pointer to GPU memory containing the vertex buffer SRD table.  This can
11362                                    only appear for one shader stage per pipeline.
11363     0x10000010  UavExportTable     32-bit pointer to GPU memory containing the UAV export SRD table.  This can
11364                                    only appear for one shader stage per pipeline (PS). These replace color targets
11365                                    and are completely separate from any UAVs used by the shader. This is optional,
11366                                    and only used by the PS when UAV exports are used to replace color-target
11367                                    exports to optimize specific shaders.
11368     0x10000011  NggCullingData     64-bit pointer to GPU memory containing the hardware register data needed by
11369                                    some NGG pipelines to perform culling.  This value contains the address of the
11370                                    first of two consecutive registers which provide the full GPU address.
11371     0x10000015  FetchShaderPtr     64-bit pointer to GPU memory containing the fetch shader subroutine.
11372     ==========  =================  ===============================================================================
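
The encoding in the table above can be decoded with a simple lookup. The Python
sketch below mirrors the table's values; the function name
``describe_user_data_mapping`` and the "unknown" fallback are assumptions made
for illustration and are not part of PAL.

.. code-block:: python

  USER_DATA_ENTRY_COUNT = 128  # values 0..127 select a user data entry

  # Driver-internal values, copied from the AMDPAL User Data Mapping table.
  SPECIAL_USER_DATA = {
      0x10000000: "GlobalTable",
      0x10000001: "PerShaderTable",
      0x10000002: "SpillTable",
      0x10000003: "BaseVertex",
      0x10000004: "BaseInstance",
      0x10000005: "DrawIndex",
      0x10000006: "Workgroup",
      0x1000000A: "EsGsLdsSize",
      0x1000000B: "ViewId",
      0x1000000C: "StreamOutTable",
      0x1000000D: "PerShaderPerfData",
      0x1000000F: "VertexBufferTable",
      0x10000010: "UavExportTable",
      0x10000011: "NggCullingData",
      0x10000015: "FetchShaderPtr",
  }


  def describe_user_data_mapping(value):
      """Describe what a user data register's ``.registers`` value selects."""
      if 0 <= value < USER_DATA_ENTRY_COUNT:
          return f"user_data_entry[{value}]"
      return SPECIAL_USER_DATA.get(value, f"unknown (0x{value:08X})")


  print(describe_user_data_mapping(5))
  print(describe_user_data_mapping(0x10000002))

Values that appear in neither range of the table are reported as unknown
rather than guessed at.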
11373
11374.. _amdgpu-amdpal-code-object-metadata-user-data-per-shader-table-section:
11375
11376Per-Shader Table
11377################
11378
The per-shader table value is the low 32 bits of the GPU address of an
optional buffer in the ``.data`` section of the ELF. The high 32 bits of the
address match the high 32 bits of the shader's program counter.
11382
11383The buffer can be anything the shader compiler needs it for, and
11384allows each shader to have its own region of the ``.data`` section.
Typically, this could be a table of buffer SRDs and the data pointed to
by the buffer SRDs, but it could be a flat-address region of memory as
11387well. Its layout and usage are defined by the shader compiler.
11388
11389Each shader's table in the ``.data`` section is referenced by the symbol
``_amdgpu_``\ *xs*\ ``_shdr_intrl_data``, where *xs* corresponds to the
11391hardware shader stage the data is for. E.g.,
11392``_amdgpu_cs_shdr_intrl_data`` for the compute shader hardware stage.
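
The symbol naming and the address reconstruction described above can be
sketched as follows. Both helper names are hypothetical, and the program
counter and low-half values in the example are made up for illustration.

.. code-block:: python

  def per_shader_table_symbol(stage):
      """Build the symbol name for a hardware stage such as "cs" or "ps"."""
      return f"_amdgpu_{stage}_shdr_intrl_data"


  def per_shader_table_address(low32, program_counter):
      """Combine the low 32 bits with the high 32 bits of the program counter."""
      high32 = program_counter & 0xFFFFFFFF00000000
      return high32 | (low32 & 0xFFFFFFFF)


  print(per_shader_table_symbol("cs"))
  # Illustrative values only.
  print(hex(per_shader_table_address(0x00401000, 0x00007F1200002000)))

The second call keeps the upper half of the program counter and replaces the
lower half with the value held in the user data register.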
11393
11394.. _amdgpu-amdpal-code-object-metadata-user-data-spill-table-section:
11395
11396Spill Table
11397###########
11398
A hardware shader may need access to more *user data entries* than there
are slots available in user data registers for one or more hardware
shader stages. In that case, the PAL runtime expects the necessary
*user data entries* to be spilled to GPU memory, with one user data
register pointing to the spilled user data memory. The value of the
*user data entry* must then represent the location where a shader
expects to read the low 32 bits of the table's GPU virtual address.
The *spill table* itself is a set of 32-bit values managed by the PAL
runtime in GPU-accessible memory that can be made indirectly accessible
to a hardware shader.
11409
11410Unspecified OS
11411--------------
11412
11413This section provides code conventions used when the target triple OS is
11414empty (see :ref:`amdgpu-target-triples`).
11415
11416Trap Handler ABI
11417~~~~~~~~~~~~~~~~
11418
For code objects generated by the AMDGPU backend for a non-amdhsa OS, the
runtime does not install a trap handler. The ``llvm.trap`` and
``llvm.debugtrap`` intrinsics are handled as follows:
11422
11423  .. table:: AMDGPU Trap Handler for Non-AMDHSA OS
11424     :name: amdgpu-trap-handler-for-non-amdhsa-os-table
11425
11426     =============== =============== ===========================================
11427     Usage           Code Sequence   Description
11428     =============== =============== ===========================================
11429     llvm.trap       s_endpgm        Causes wavefront to be terminated.
11430     llvm.debugtrap  *none*          Compiler warning given that there is no
11431                                     trap handler installed.
11432     =============== =============== ===========================================
11433
11434Source Languages
11435================
11436
11437.. _amdgpu-opencl:
11438
11439OpenCL
11440------
11441
11442When the language is OpenCL the following differences occur:
11443
114441. The OpenCL memory model is used (see :ref:`amdgpu-amdhsa-memory-model`).
114452. The AMDGPU backend appends additional arguments to the kernel's explicit
11446   arguments for the AMDHSA OS (see
11447   :ref:`opencl-kernel-implicit-arguments-appended-for-amdhsa-os-table`).
114483. Additional metadata is generated
11449   (see :ref:`amdgpu-amdhsa-code-object-metadata`).
11450
11451  .. table:: OpenCL kernel implicit arguments appended for AMDHSA OS
11452     :name: opencl-kernel-implicit-arguments-appended-for-amdhsa-os-table
11453
11454     ======== ==== ========= ===========================================
11455     Position Byte Byte      Description
11456              Size Alignment
11457     ======== ==== ========= ===========================================
11458     1        8    8         OpenCL Global Offset X
11459     2        8    8         OpenCL Global Offset Y
11460     3        8    8         OpenCL Global Offset Z
11461     4        8    8         OpenCL address of printf buffer
11462     5        8    8         OpenCL address of virtual queue used by
11463                             enqueue_kernel.
11464     6        8    8         OpenCL address of AqlWrap struct used by
11465                             enqueue_kernel.
     7        8    8         Pointer argument used for Multi-grid
11467                             synchronization.
11468     ======== ==== ========= ===========================================
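
The table can be read as a layout recipe. The following Python sketch computes
byte offsets for the appended arguments, assuming they are placed in table
order directly after the explicit arguments, each at the next offset that
satisfies its alignment. That placement rule and the argument labels are
assumptions made for illustration; only the sizes and alignments come from the
table.

.. code-block:: python

  # (hypothetical label, byte size, byte alignment), in table order.
  IMPLICIT_ARGS = [
      ("global_offset_x", 8, 8),
      ("global_offset_y", 8, 8),
      ("global_offset_z", 8, 8),
      ("printf_buffer", 8, 8),
      ("virtual_queue", 8, 8),
      ("aql_wrap", 8, 8),
      ("multigrid_sync_arg", 8, 8),
  ]


  def align_up(value, alignment):
      """Round value up to the next multiple of alignment."""
      return (value + alignment - 1) // alignment * alignment


  def implicit_arg_offsets(explicit_args_size):
      """Map each implicit argument label to its assumed byte offset."""
      offsets = {}
      offset = explicit_args_size
      for name, size, alignment in IMPLICIT_ARGS:
          offset = align_up(offset, alignment)
          offsets[name] = offset
          offset += size
      return offsets


  offsets = implicit_arg_offsets(12)
  print(offsets["global_offset_x"], offsets["multigrid_sync_arg"])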
11469
11470.. _amdgpu-hcc:
11471
11472HCC
11473---
11474
11475When the language is HCC the following differences occur:
11476
114771. The HSA memory model is used (see :ref:`amdgpu-amdhsa-memory-model`).
11478
11479.. _amdgpu-assembler:
11480
11481Assembler
11482---------
11483
The AMDGPU backend has an LLVM-MC based assembler, which is currently in
development. It supports AMDGCN GFX6-GFX10.
11486
11487This section describes general syntax for instructions and operands.
11488
11489Instructions
11490~~~~~~~~~~~~
11491
11492An instruction has the following :doc:`syntax<AMDGPUInstructionSyntax>`:
11493
11494  | ``<``\ *opcode*\ ``> <``\ *operand0*\ ``>, <``\ *operand1*\ ``>,...
11495    <``\ *modifier0*\ ``> <``\ *modifier1*\ ``>...``
11496
11497:doc:`Operands<AMDGPUOperandSyntax>` are comma-separated while
11498:doc:`modifiers<AMDGPUModifierSyntax>` are space-separated.
11499
11500The order of operands and modifiers is fixed.
11501Most modifiers are optional and may be omitted.
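
The separator rules can be sketched in a few lines of Python. The helper below
is hypothetical and only assembles text; it does not check that the opcode,
operands, or modifiers are valid for any target.

.. code-block:: python

  def format_instruction(opcode, operands=(), modifiers=()):
      """Join operands with commas and modifiers with spaces."""
      parts = [opcode]
      if operands:
          parts.append(", ".join(operands))
      parts.extend(modifiers)
      return " ".join(parts)


  print(format_instruction("ds_add_u32", ["v2", "v4"], ["offset:16"]))
  print(format_instruction("s_endpgm"))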
11502
Links to detailed descriptions of the instruction syntax can be found in the
following table. Note that features under development are not included
in these descriptions.
11506
11507    =================================== =======================================
11508    Core ISA                            ISA Extensions
11509    =================================== =======================================
11510    :doc:`GFX7<AMDGPU/AMDGPUAsmGFX7>`   \-
11511    :doc:`GFX8<AMDGPU/AMDGPUAsmGFX8>`   \-
11512    :doc:`GFX9<AMDGPU/AMDGPUAsmGFX9>`   :doc:`gfx900<AMDGPU/AMDGPUAsmGFX900>`
11513
11514                                        :doc:`gfx902<AMDGPU/AMDGPUAsmGFX900>`
11515
11516                                        :doc:`gfx904<AMDGPU/AMDGPUAsmGFX904>`
11517
11518                                        :doc:`gfx906<AMDGPU/AMDGPUAsmGFX906>`
11519
11520                                        :doc:`gfx908<AMDGPU/AMDGPUAsmGFX908>`
11521
11522                                        :doc:`gfx909<AMDGPU/AMDGPUAsmGFX900>`
11523
11524                                        :doc:`gfx90a<AMDGPU/AMDGPUAsmGFX90a>`
11525
11526    :doc:`GFX10<AMDGPU/AMDGPUAsmGFX10>` :doc:`gfx1011<AMDGPU/AMDGPUAsmGFX1011>`
11527
11528                                        :doc:`gfx1012<AMDGPU/AMDGPUAsmGFX1011>`
11529    =================================== =======================================
11530
For more information about instructions, their semantics and supported
combinations of operands, refer to one of the instruction set architecture
manuals [AMD-GCN-GFX6]_, [AMD-GCN-GFX7]_, [AMD-GCN-GFX8]_,
[AMD-GCN-GFX900-GFX904-VEGA]_, [AMD-GCN-GFX906-VEGA7NM]_,
[AMD-GCN-GFX908-CDNA1]_, [AMD-GCN-GFX10-RDNA1]_ and [AMD-GCN-GFX10-RDNA2]_.
11536
11537Operands
11538~~~~~~~~
11539
A detailed description of operands may be found
:doc:`here<AMDGPUOperandSyntax>`.
11541
11542Modifiers
11543~~~~~~~~~
11544
A detailed description of modifiers may be found
:doc:`here<AMDGPUModifierSyntax>`.
11547
11548Instruction Examples
11549~~~~~~~~~~~~~~~~~~~~
11550
11551DS
11552++
11553
11554.. code-block:: nasm
11555
11556  ds_add_u32 v2, v4 offset:16
11557  ds_write_src2_b64 v2 offset0:4 offset1:8
11558  ds_cmpst_f32 v2, v4, v6
11559  ds_min_rtn_f64 v[8:9], v2, v[4:5]
11560
For a full list of supported instructions, refer to "LDS/GDS instructions" in
the ISA Manual.
11563
11564FLAT
11565++++
11566
11567.. code-block:: nasm
11568
11569  flat_load_dword v1, v[3:4]
11570  flat_store_dwordx3 v[3:4], v[5:7]
11571  flat_atomic_swap v1, v[3:4], v5 glc
11572  flat_atomic_cmpswap v1, v[3:4], v[5:6] glc slc
11573  flat_atomic_fmax_x2 v[1:2], v[3:4], v[5:6] glc
11574
For a full list of supported instructions, refer to "FLAT instructions" in the
ISA Manual.
11577
11578MUBUF
11579+++++
11580
11581.. code-block:: nasm
11582
11583  buffer_load_dword v1, off, s[4:7], s1
11584  buffer_store_dwordx4 v[1:4], v2, ttmp[4:7], s1 offen offset:4 glc tfe
11585  buffer_store_format_xy v[1:2], off, s[4:7], s1
11586  buffer_wbinvl1
11587  buffer_atomic_inc v1, v2, s[8:11], s4 idxen offset:4 slc
11588
For a full list of supported instructions, refer to "MUBUF Instructions" in the
ISA Manual.
11591
11592SMRD/SMEM
11593+++++++++
11594
11595.. code-block:: nasm
11596
11597  s_load_dword s1, s[2:3], 0xfc
11598  s_load_dwordx8 s[8:15], s[2:3], s4
11599  s_load_dwordx16 s[88:103], s[2:3], s4
11600  s_dcache_inv_vol
11601  s_memtime s[4:5]
11602
For a full list of supported instructions, refer to "Scalar Memory Operations"
in the ISA Manual.
11605
11606SOP1
11607++++
11608
11609.. code-block:: nasm
11610
11611  s_mov_b32 s1, s2
11612  s_mov_b64 s[0:1], 0x80000000
11613  s_cmov_b32 s1, 200
11614  s_wqm_b64 s[2:3], s[4:5]
11615  s_bcnt0_i32_b64 s1, s[2:3]
11616  s_swappc_b64 s[2:3], s[4:5]
11617  s_cbranch_join s[4:5]
11618
For a full list of supported instructions, refer to "SOP1 Instructions" in the
ISA Manual.
11621
11622SOP2
11623++++
11624
11625.. code-block:: nasm
11626
11627  s_add_u32 s1, s2, s3
11628  s_and_b64 s[2:3], s[4:5], s[6:7]
11629  s_cselect_b32 s1, s2, s3
11630  s_andn2_b32 s2, s4, s6
11631  s_lshr_b64 s[2:3], s[4:5], s6
11632  s_ashr_i32 s2, s4, s6
11633  s_bfm_b64 s[2:3], s4, s6
11634  s_bfe_i64 s[2:3], s[4:5], s6
11635  s_cbranch_g_fork s[4:5], s[6:7]
11636
For a full list of supported instructions, refer to "SOP2 Instructions" in the
ISA Manual.
11639
11640SOPC
11641++++
11642
11643.. code-block:: nasm
11644
11645  s_cmp_eq_i32 s1, s2
11646  s_bitcmp1_b32 s1, s2
11647  s_bitcmp0_b64 s[2:3], s4
11648  s_setvskip s3, s5
11649
For a full list of supported instructions, refer to "SOPC Instructions" in the
ISA Manual.
11652
11653SOPP
11654++++
11655
11656.. code-block:: nasm
11657
11658  s_barrier
11659  s_nop 2
11660  s_endpgm
11661  s_waitcnt 0 ; Wait for all counters to be 0
11662  s_waitcnt vmcnt(0) & expcnt(0) & lgkmcnt(0) ; Equivalent to above
11663  s_waitcnt vmcnt(1) ; Wait for vmcnt counter to be 1.
11664  s_sethalt 9
11665  s_sleep 10
11666  s_sendmsg 0x1
11667  s_sendmsg sendmsg(MSG_INTERRUPT)
11668  s_trap 1
11669
For a full list of supported instructions, refer to "SOPP Instructions" in the
ISA Manual.
11672
Unless otherwise mentioned, little verification is performed on the operands
of SOPP Instructions, so it is up to the programmer to be familiar with the
range of acceptable values.
11676
11677VALU
11678++++
11679
For vector ALU instruction opcodes (VOP1, VOP2, VOP3, VOPC, VOP_DPP, VOP_SDWA),
the assembler automatically uses the optimal encoding based on the operands. To
force a specific encoding, add a suffix to the opcode of the instruction:
11683
* ``_e32`` for 32-bit VOP1/VOP2/VOPC
* ``_e64`` for 64-bit VOP3
* ``_dpp`` for VOP_DPP
* ``_sdwa`` for VOP_SDWA
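
The suffix list above amounts to a simple lookup. The following Python sketch
is a hypothetical helper, not assembler code; it only reports which encoding
an opcode suffix requests.

.. code-block:: python

  # Suffix to forced encoding, as listed above.
  ENCODING_SUFFIXES = {
      "_e32": "32-bit VOP1/VOP2/VOPC",
      "_e64": "64-bit VOP3",
      "_dpp": "VOP_DPP",
      "_sdwa": "VOP_SDWA",
  }


  def forced_encoding(opcode):
      """Return the encoding forced by the opcode suffix, or None."""
      for suffix, encoding in ENCODING_SUFFIXES.items():
          if opcode.endswith(suffix):
              return encoding
      # No suffix: the assembler is free to pick the encoding.
      return None


  print(forced_encoding("v_mul_i32_i24_e64"))
  print(forced_encoding("v_mov_b32"))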
11688
11689VOP1/VOP2/VOP3/VOPC examples:
11690
11691.. code-block:: nasm
11692
11693  v_mov_b32 v1, v2
11694  v_mov_b32_e32 v1, v2
11695  v_nop
11696  v_cvt_f64_i32_e32 v[1:2], v2
11697  v_floor_f32_e32 v1, v2
11698  v_bfrev_b32_e32 v1, v2
11699  v_add_f32_e32 v1, v2, v3
11700  v_mul_i32_i24_e64 v1, v2, 3
11701  v_mul_i32_i24_e32 v1, -3, v3
11702  v_mul_i32_i24_e32 v1, -100, v3
11703  v_addc_u32 v1, s[0:1], v2, v3, s[2:3]
11704  v_max_f16_e32 v1, v2, v3
11705
11706VOP_DPP examples:
11707
11708.. code-block:: nasm
11709
11710  v_mov_b32 v0, v0 quad_perm:[0,2,1,1]
11711  v_sin_f32 v0, v0 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
11712  v_mov_b32 v0, v0 wave_shl:1
11713  v_mov_b32 v0, v0 row_mirror
11714  v_mov_b32 v0, v0 row_bcast:31
11715  v_mov_b32 v0, v0 quad_perm:[1,3,0,1] row_mask:0xa bank_mask:0x1 bound_ctrl:0
11716  v_add_f32 v0, v0, |v0| row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
11717  v_max_f16 v1, v2, v3 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
11718
11719VOP_SDWA examples:
11720
11721.. code-block:: nasm
11722
11723  v_mov_b32 v1, v2 dst_sel:BYTE_0 dst_unused:UNUSED_PRESERVE src0_sel:DWORD
11724  v_min_u32 v200, v200, v1 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_1 src1_sel:DWORD
11725  v_sin_f32 v0, v0 dst_unused:UNUSED_PAD src0_sel:WORD_1
11726  v_fract_f32 v0, |v0| dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1
11727  v_cmpx_le_u32 vcc, v1, v2 src0_sel:BYTE_2 src1_sel:WORD_0
11728
For a full list of supported instructions, refer to "Vector ALU instructions"
in the ISA Manual.
11730
11731.. _amdgpu-amdhsa-assembler-predefined-symbols-v2:
11732
11733Code Object V2 Predefined Symbols
11734~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
11735
11736.. warning::
11737  Code object V2 is not the default code object version emitted by
11738  this version of LLVM.
11739
11740The AMDGPU assembler defines and updates some symbols automatically. These
11741symbols do not affect code generation.
11742
11743.option.machine_version_major
11744+++++++++++++++++++++++++++++
11745
11746Set to the GFX major generation number of the target being assembled for. For
11747example, when assembling for a "GFX9" target this will be set to the integer
11748value "9". The possible GFX major generation numbers are presented in
11749:ref:`amdgpu-processors`.
11750
11751.option.machine_version_minor
11752+++++++++++++++++++++++++++++
11753
11754Set to the GFX minor generation number of the target being assembled for. For
11755example, when assembling for a "GFX810" target this will be set to the integer
11756value "1". The possible GFX minor generation numbers are presented in
11757:ref:`amdgpu-processors`.
11758
11759.option.machine_version_stepping
11760++++++++++++++++++++++++++++++++
11761
11762Set to the GFX stepping generation number of the target being assembled for.
11763For example, when assembling for a "GFX704" target this will be set to the
11764integer value "4". The possible GFX stepping generation numbers are presented
11765in :ref:`amdgpu-processors`.
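
The three examples above suggest how a processor name maps onto the symbols.
The following Python sketch assumes the name has the form
``gfx<major><minor><stepping>``, where the last two characters are single
hexadecimal digits. The helper name is hypothetical, and the authoritative
list of processors remains :ref:`amdgpu-processors`.

.. code-block:: python

  def machine_version(processor):
      """Split a name such as "gfx810" into (major, minor, stepping)."""
      name = processor.lower()
      if not name.startswith("gfx") or len(name) < 6:
          raise ValueError("unexpected processor name: " + processor)
      digits = name[3:]
      major = int(digits[:-2])        # Decimal, may be more than one digit.
      minor = int(digits[-2], 16)     # Assumed to be one hexadecimal digit.
      stepping = int(digits[-1], 16)  # Assumed to be one hexadecimal digit.
      return major, minor, stepping


  print(machine_version("GFX810"))
  print(machine_version("GFX704"))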
11766
11767.kernel.vgpr_count
11768++++++++++++++++++
11769
11770Set to zero each time a
11771:ref:`amdgpu-amdhsa-assembler-directive-amdgpu_hsa_kernel` directive is
11772encountered. At each instruction, if the current value of this symbol is less
11773than or equal to the maximum VGPR number explicitly referenced within that
11774instruction then the symbol value is updated to equal that VGPR number plus
11775one.
11776
11777.kernel.sgpr_count
11778++++++++++++++++++
11779
11780Set to zero each time a
11781:ref:`amdgpu-amdhsa-assembler-directive-amdgpu_hsa_kernel` directive is
encountered. At each instruction, if the current value of this symbol is less
than or equal to the maximum SGPR number explicitly referenced within that
instruction then the symbol value is updated to equal that SGPR number plus
11785one.
11786
11787.. _amdgpu-amdhsa-assembler-directives-v2:
11788
11789Code Object V2 Directives
11790~~~~~~~~~~~~~~~~~~~~~~~~~
11791
11792.. warning::
11793  Code object V2 is not the default code object version emitted by
11794  this version of LLVM.
11795
The AMDGPU ABI defines auxiliary data in the output code object. In assembly
source, this data can be specified with assembler directives.
11798
11799.hsa_code_object_version major, minor
11800+++++++++++++++++++++++++++++++++++++
11801
11802*major* and *minor* are integers that specify the version of the HSA code
11803object that will be generated by the assembler.
11804
11805.hsa_code_object_isa [major, minor, stepping, vendor, arch]
11806+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
11807
11808
11809*major*, *minor*, and *stepping* are all integers that describe the instruction
11810set architecture (ISA) version of the assembly program.
11811
11812*vendor* and *arch* are quoted strings. *vendor* should always be equal to
11813"AMD" and *arch* should always be equal to "AMDGPU".
11814
11815By default, the assembler will derive the ISA version, *vendor*, and *arch*
11816from the value of the -mcpu option that is passed to the assembler.
11817
11818.. _amdgpu-amdhsa-assembler-directive-amdgpu_hsa_kernel:
11819
11820.amdgpu_hsa_kernel (name)
11821+++++++++++++++++++++++++
11822
This directive specifies that the symbol with the given name is a kernel entry
point (label) and that the object should contain a corresponding symbol of type
STT_AMDGPU_HSA_KERNEL.
11826
11827.amd_kernel_code_t
11828++++++++++++++++++
11829
11830This directive marks the beginning of a list of key / value pairs that are used
11831to specify the amd_kernel_code_t object that will be emitted by the assembler.
11832The list must be terminated by the *.end_amd_kernel_code_t* directive. For any
11833amd_kernel_code_t values that are unspecified a default value will be used. The
11834default value for all keys is 0, with the following exceptions:
11835
11836- *amd_code_version_major* defaults to 1.
11837- *amd_kernel_code_version_minor* defaults to 2.
11838- *amd_machine_kind* defaults to 1.
- *amd_machine_version_major*, *amd_machine_version_minor*, and
  *amd_machine_version_stepping* are derived from the value of the ``-mcpu``
  option that is passed to the assembler.
11842- *kernel_code_entry_byte_offset* defaults to 256.
- *wavefront_size* defaults to 6 for all targets before GFX10. For GFX10
  onwards it defaults to 6 if target feature ``wavefrontsize64`` is enabled,
  otherwise 5. Note that wavefront size is specified as a power of two, so a
  value of **n** means a size of 2^ **n**.
11847- *call_convention* defaults to -1.
11848- *kernarg_segment_alignment*, *group_segment_alignment*, and
11849  *private_segment_alignment* default to 4. Note that alignments are specified
11850  as a power of 2, so a value of **n** means an alignment of 2^ **n**.
11851- *enable_tg_split* defaults to 1 if target feature ``tgsplit`` is enabled for
11852  GFX90A onwards.
11853- *enable_wgp_mode* defaults to 1 if target feature ``cumode`` is disabled for
11854  GFX10 onwards.
11855- *enable_mem_ordered* defaults to 1 for GFX10 onwards.
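
Several of the defaults above are stored as exponents rather than as sizes.
The following Python sketch decodes them; the function name is hypothetical,
and the sample values are the defaults listed above.

.. code-block:: python

  def decode_power_of_two(n):
      """Return the size or alignment encoded by the exponent n."""
      return 1 << n


  print(decode_power_of_two(6))  # wavefront_size = 6
  print(decode_power_of_two(5))  # wavefront_size = 5
  print(decode_power_of_two(4))  # kernarg_segment_alignment = 4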
11856
11857The *.amd_kernel_code_t* directive must be placed immediately after the
11858function label and before any instructions.
11859
For a full list of amd_kernel_code_t keys, refer to the AMDGPU ABI document,
the comments in ``lib/Target/AMDGPU/AmdKernelCodeT.h``, and
``test/CodeGen/AMDGPU/hsa.s``.
11862
11863.. _amdgpu-amdhsa-assembler-example-v2:
11864
11865Code Object V2 Example Source Code
11866~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
11867
11868.. warning::
11869  Code Object V2 is not the default code object version emitted by
11870  this version of LLVM.
11871
11872Here is an example of a minimal assembly source file, defining one HSA kernel:
11873
11874.. code::
11875   :number-lines:
11876
11877   .hsa_code_object_version 1,0
11878   .hsa_code_object_isa
11879
11880   .hsatext
11881   .globl  hello_world
11882   .p2align 8
11883   .amdgpu_hsa_kernel hello_world
11884
11885   hello_world:
11886
      .amd_kernel_code_t
         enable_sgpr_kernarg_segment_ptr = 1
         is_ptr64 = 1
         compute_pgm_rsrc1_vgprs = 0
         compute_pgm_rsrc1_sgprs = 0
         compute_pgm_rsrc2_user_sgpr = 2
         compute_pgm_rsrc1_wgp_mode = 0
         compute_pgm_rsrc1_mem_ordered = 0
         compute_pgm_rsrc1_fwd_progress = 1
      .end_amd_kernel_code_t

      s_load_dwordx2 s[0:1], s[0:1], 0x0
      v_mov_b32 v0, 3.14159
      s_waitcnt lgkmcnt(0)
      v_mov_b32 v1, s0
      v_mov_b32 v2, s1
      flat_store_dword v[1:2], v0
      s_endpgm
   .Lfunc_end0:
      .size   hello_world, .Lfunc_end0-hello_world
11907
11908.. _amdgpu-amdhsa-assembler-predefined-symbols-v3-v4:
11909
11910Code Object V3 to V4 Predefined Symbols
11911~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
11912
11913The AMDGPU assembler defines and updates some symbols automatically. These
11914symbols do not affect code generation.
11915
11916.amdgcn.gfx_generation_number
11917+++++++++++++++++++++++++++++
11918
11919Set to the GFX major generation number of the target being assembled for. For
11920example, when assembling for a "GFX9" target this will be set to the integer
11921value "9". The possible GFX major generation numbers are presented in
11922:ref:`amdgpu-processors`.
11923
11924.amdgcn.gfx_generation_minor
11925++++++++++++++++++++++++++++
11926
11927Set to the GFX minor generation number of the target being assembled for. For
11928example, when assembling for a "GFX810" target this will be set to the integer
11929value "1". The possible GFX minor generation numbers are presented in
11930:ref:`amdgpu-processors`.
11931
11932.amdgcn.gfx_generation_stepping
11933+++++++++++++++++++++++++++++++
11934
11935Set to the GFX stepping generation number of the target being assembled for.
11936For example, when assembling for a "GFX704" target this will be set to the
11937integer value "4". The possible GFX stepping generation numbers are presented
11938in :ref:`amdgpu-processors`.
11939
11940.. _amdgpu-amdhsa-assembler-symbol-next_free_vgpr:
11941
11942.amdgcn.next_free_vgpr
11943++++++++++++++++++++++
11944
11945Set to zero before assembly begins. At each instruction, if the current value
11946of this symbol is less than or equal to the maximum VGPR number explicitly
11947referenced within that instruction then the symbol value is updated to equal
11948that VGPR number plus one.
11949
May be used to set the ``.amdhsa_next_free_vgpr`` directive in
:ref:`amdhsa-kernel-directives-table`.
11952
11953May be set at any time, e.g. manually set to zero at the start of each kernel.
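
The update rule can be modeled outside the assembler. The following Python
sketch is an approximation that finds VGPR references with a regular
expression rather than parsing operands the way the assembler does; all names
in it are hypothetical.

.. code-block:: python

  import re

  # Matches "v7" or "v[8:9]"; opcodes such as "v_mov_b32" do not match.
  VGPR_REF = re.compile(r"\bv(?:(\d+)|\[(\d+):(\d+)\])")


  def max_vgpr(instruction):
      """Return the highest VGPR number referenced, or None."""
      highest = None
      for single, _low, high in VGPR_REF.findall(instruction):
          number = int(single) if single else int(high)
          if highest is None or number > highest:
              highest = number
      return highest


  def update_next_free_vgpr(next_free, instruction):
      """Apply the update rule for one instruction."""
      highest = max_vgpr(instruction)
      if highest is not None and next_free <= highest:
          return highest + 1
      return next_free


  next_free = 0  # Set to zero before assembly begins.
  for text in ["v_mov_b32 v1, s0",
               "ds_min_rtn_f64 v[8:9], v2, v[4:5]",
               "v_mov_b32 v0, 3.14159"]:
      next_free = update_next_free_vgpr(next_free, text)
  print(next_free)

  next_free = 0  # Manual reset, for example at the start of the next kernel.

The same rule applies to ``.amdgcn.next_free_sgpr`` with SGPR references in
place of VGPR references.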
11954
11955.. _amdgpu-amdhsa-assembler-symbol-next_free_sgpr:
11956
11957.amdgcn.next_free_sgpr
11958++++++++++++++++++++++
11959
Set to zero before assembly begins. At each instruction, if the current value
of this symbol is less than or equal to the maximum SGPR number explicitly
referenced within that instruction then the symbol value is updated to equal
that SGPR number plus one.
11964
May be used to set the ``.amdhsa_next_free_sgpr`` directive in
:ref:`amdhsa-kernel-directives-table`.
11967
11968May be set at any time, e.g. manually set to zero at the start of each kernel.
11969
11970.. _amdgpu-amdhsa-assembler-directives-v3-v4:
11971
11972Code Object V3 to V4 Directives
11973~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
11974
11975Directives which begin with ``.amdgcn`` are valid for all ``amdgcn``
11976architecture processors, and are not OS-specific. Directives which begin with
11977``.amdhsa`` are specific to ``amdgcn`` architecture processors when the
11978``amdhsa`` OS is specified. See :ref:`amdgpu-target-triples` and
11979:ref:`amdgpu-processors`.
11980
11981.. _amdgpu-assembler-directive-amdgcn-target:
11982
11983.amdgcn_target <target-triple> "-" <target-id>
11984++++++++++++++++++++++++++++++++++++++++++++++
11985
11986Optional directive which declares the ``<target-triple>-<target-id>`` supported
11987by the containing assembler source file. Used by the assembler to validate
11988command-line options such as ``-triple``, ``-mcpu``, and
11989``--offload-arch=<target-id>``. A non-canonical target ID is allowed. See
11990:ref:`amdgpu-target-triples` and :ref:`amdgpu-target-id`.
11991
11992.. note::
11993
11994  The target ID syntax used for code object V2 to V3 for this directive differs
11995  from that used elsewhere. See :ref:`amdgpu-target-id-v2-v3`.
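
As a minimal sketch of how the directive's operand is composed, the following
Python helper joins the triple components and the target ID. The helper name
is hypothetical and the processor is only a sample value. Note that an empty
environment component leaves two consecutive hyphens before the target ID.

.. code-block:: python

  def amdgcn_target_operand(arch, vendor, os, environment, target_id):
      """Compose "<target-triple>-<target-id>" from its parts."""
      triple = "-".join([arch, vendor, os, environment])
      return triple + "-" + target_id


  print(amdgcn_target_operand("amdgcn", "amd", "amdhsa", "", "gfx900"))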
11996
11997.amdhsa_kernel <name>
11998+++++++++++++++++++++
11999
12000Creates a correctly aligned AMDHSA kernel descriptor and a symbol,
12001``<name>.kd``, in the current location of the current section. Only valid when
12002the OS is ``amdhsa``. ``<name>`` must be a symbol that labels the first
12003instruction to execute, and does not need to be previously defined.
12004
12005Marks the beginning of a list of directives used to generate the bytes of a
12006kernel descriptor, as described in :ref:`amdgpu-amdhsa-kernel-descriptor`.
12007Directives which may appear in this list are described in
12008:ref:`amdhsa-kernel-directives-table`. Directives may appear in any order, must
12009be valid for the target being assembled for, and cannot be repeated. Directives
12010support the range of values specified by the field they reference in
12011:ref:`amdgpu-amdhsa-kernel-descriptor`. If a directive is not specified, it is
12012assumed to have its default value, unless it is marked as "Required", in which
12013case it is an error to omit the directive. This list of directives is
12014terminated by an ``.end_amdhsa_kernel`` directive.
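
Two of the rules above, no repeated directives and no omitted "Required"
directives, can be expressed as a small check. The following Python sketch is
hypothetical and far simpler than the assembler: it assumes each entry is a
non-empty line of the form ``<directive> <value>``, treats only the two
directives marked "Required" for GFX6-GFX10 as required, and does not validate
value ranges or target support.

.. code-block:: python

  REQUIRED = {".amdhsa_next_free_vgpr", ".amdhsa_next_free_sgpr"}


  def check_kernel_directives(lines):
      """Return a list of error strings for the given directive lines."""
      seen = set()
      errors = []
      for line in lines:
          name = line.split()[0]
          if name in seen:
              errors.append(name + " is repeated")
          seen.add(name)
      for name in sorted(REQUIRED - seen):
          errors.append(name + " is required but missing")
      return errors


  print(check_kernel_directives([
      ".amdhsa_user_sgpr_kernarg_segment_ptr 1",
      ".amdhsa_next_free_vgpr 3",
      ".amdhsa_next_free_vgpr 4",
  ]))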
12015
12016  .. table:: AMDHSA Kernel Assembler Directives
12017     :name: amdhsa-kernel-directives-table
12018
12019     ======================================================== =================== ============ ===================
12020     Directive                                                Default             Supported On Description
12021     ======================================================== =================== ============ ===================
12022     ``.amdhsa_group_segment_fixed_size``                     0                   GFX6-GFX10   Controls GROUP_SEGMENT_FIXED_SIZE in
12023                                                                                               :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
12024     ``.amdhsa_private_segment_fixed_size``                   0                   GFX6-GFX10   Controls PRIVATE_SEGMENT_FIXED_SIZE in
12025                                                                                               :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
12026     ``.amdhsa_kernarg_size``                                 0                   GFX6-GFX10   Controls KERNARG_SIZE in
12027                                                                                               :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
12028     ``.amdhsa_user_sgpr_private_segment_buffer``             0                   GFX6-GFX10   Controls ENABLE_SGPR_PRIVATE_SEGMENT_BUFFER in
12029                                                                                               :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
12030     ``.amdhsa_user_sgpr_dispatch_ptr``                       0                   GFX6-GFX10   Controls ENABLE_SGPR_DISPATCH_PTR in
12031                                                                                               :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
12032     ``.amdhsa_user_sgpr_queue_ptr``                          0                   GFX6-GFX10   Controls ENABLE_SGPR_QUEUE_PTR in
12033                                                                                               :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
12034     ``.amdhsa_user_sgpr_kernarg_segment_ptr``                0                   GFX6-GFX10   Controls ENABLE_SGPR_KERNARG_SEGMENT_PTR in
12035                                                                                               :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
12036     ``.amdhsa_user_sgpr_dispatch_id``                        0                   GFX6-GFX10   Controls ENABLE_SGPR_DISPATCH_ID in
12037                                                                                               :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
12038     ``.amdhsa_user_sgpr_flat_scratch_init``                  0                   GFX6-GFX10   Controls ENABLE_SGPR_FLAT_SCRATCH_INIT in
12039                                                                                               :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
12040     ``.amdhsa_user_sgpr_private_segment_size``               0                   GFX6-GFX10   Controls ENABLE_SGPR_PRIVATE_SEGMENT_SIZE in
12041                                                                                               :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
12042     ``.amdhsa_wavefront_size32``                             Target              GFX10        Controls ENABLE_WAVEFRONT_SIZE32 in
12043                                                              Feature                          :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
12044                                                              Specific
12045                                                              (wavefrontsize64)
12046     ``.amdhsa_system_sgpr_private_segment_wavefront_offset`` 0                   GFX6-GFX10   Controls ENABLE_PRIVATE_SEGMENT in
12047                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
12048     ``.amdhsa_system_sgpr_workgroup_id_x``                   1                   GFX6-GFX10   Controls ENABLE_SGPR_WORKGROUP_ID_X in
12049                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
12050     ``.amdhsa_system_sgpr_workgroup_id_y``                   0                   GFX6-GFX10   Controls ENABLE_SGPR_WORKGROUP_ID_Y in
12051                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
12052     ``.amdhsa_system_sgpr_workgroup_id_z``                   0                   GFX6-GFX10   Controls ENABLE_SGPR_WORKGROUP_ID_Z in
12053                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
12054     ``.amdhsa_system_sgpr_workgroup_info``                   0                   GFX6-GFX10   Controls ENABLE_SGPR_WORKGROUP_INFO in
12055                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
12056     ``.amdhsa_system_vgpr_workitem_id``                      0                   GFX6-GFX10   Controls ENABLE_VGPR_WORKITEM_ID in
12057                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
12058                                                                                               Possible values are defined in
12059                                                                                               :ref:`amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table`.
12060     ``.amdhsa_next_free_vgpr``                               Required            GFX6-GFX10   Maximum VGPR number explicitly referenced, plus one.
12061                                                                                               Used to calculate GRANULATED_WORKITEM_VGPR_COUNT in
12062                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
12063     ``.amdhsa_next_free_sgpr``                               Required            GFX6-GFX10   Maximum SGPR number explicitly referenced, plus one.
12064                                                                                               Used to calculate GRANULATED_WAVEFRONT_SGPR_COUNT in
12065                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
     ``.amdhsa_accum_offset``                                 Required            GFX90A       Offset of the first AccVGPR in the unified register file.
12067                                                                                               Used to calculate ACCUM_OFFSET in
12068                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc3-gfx90a-table`.
12069     ``.amdhsa_reserve_vcc``                                  1                   GFX6-GFX10   Whether the kernel may use the special VCC SGPR.
12070                                                                                               Used to calculate GRANULATED_WAVEFRONT_SGPR_COUNT in
12071                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
12072     ``.amdhsa_reserve_flat_scratch``                         1                   GFX7-GFX10   Whether the kernel may use flat instructions to access
12073                                                                                               scratch memory. Used to calculate
12074                                                                                               GRANULATED_WAVEFRONT_SGPR_COUNT in
12075                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
12076     ``.amdhsa_reserve_xnack_mask``                           Target              GFX8-GFX10   Whether the kernel may trigger XNACK replay.
12077                                                              Feature                          Used to calculate GRANULATED_WAVEFRONT_SGPR_COUNT in
12078                                                              Specific                         :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
12079                                                              (xnack)
12080     ``.amdhsa_float_round_mode_32``                          0                   GFX6-GFX10   Controls FLOAT_ROUND_MODE_32 in
12081                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
12082                                                                                               Possible values are defined in
12083                                                                                               :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
12084     ``.amdhsa_float_round_mode_16_64``                       0                   GFX6-GFX10   Controls FLOAT_ROUND_MODE_16_64 in
12085                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
12086                                                                                               Possible values are defined in
12087                                                                                               :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
12088     ``.amdhsa_float_denorm_mode_32``                         0                   GFX6-GFX10   Controls FLOAT_DENORM_MODE_32 in
12089                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
12090                                                                                               Possible values are defined in
12091                                                                                               :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
12092     ``.amdhsa_float_denorm_mode_16_64``                      3                   GFX6-GFX10   Controls FLOAT_DENORM_MODE_16_64 in
12093                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
12094                                                                                               Possible values are defined in
12095                                                                                               :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
12096     ``.amdhsa_dx10_clamp``                                   1                   GFX6-GFX10   Controls ENABLE_DX10_CLAMP in
12097                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
12098     ``.amdhsa_ieee_mode``                                    1                   GFX6-GFX10   Controls ENABLE_IEEE_MODE in
12099                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
12100     ``.amdhsa_fp16_overflow``                                0                   GFX9-GFX10   Controls FP16_OVFL in
12101                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
12102     ``.amdhsa_tg_split``                                     Target              GFX90A       Controls TG_SPLIT in
12103                                                              Feature                          :ref:`amdgpu-amdhsa-compute_pgm_rsrc3-gfx90a-table`.
12104                                                              Specific
12105                                                              (tgsplit)
12106     ``.amdhsa_workgroup_processor_mode``                     Target              GFX10        Controls ENABLE_WGP_MODE in
12107                                                              Feature                          :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
12108                                                              Specific
12109                                                              (cumode)
12110     ``.amdhsa_memory_ordered``                               1                   GFX10        Controls MEM_ORDERED in
12111                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
12112     ``.amdhsa_forward_progress``                             0                   GFX10        Controls FWD_PROGRESS in
12113                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
12114     ``.amdhsa_exception_fp_ieee_invalid_op``                 0                   GFX6-GFX10   Controls ENABLE_EXCEPTION_IEEE_754_FP_INVALID_OPERATION in
12115                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
12116     ``.amdhsa_exception_fp_denorm_src``                      0                   GFX6-GFX10   Controls ENABLE_EXCEPTION_FP_DENORMAL_SOURCE in
12117                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
12118     ``.amdhsa_exception_fp_ieee_div_zero``                   0                   GFX6-GFX10   Controls ENABLE_EXCEPTION_IEEE_754_FP_DIVISION_BY_ZERO in
12119                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
12120     ``.amdhsa_exception_fp_ieee_overflow``                   0                   GFX6-GFX10   Controls ENABLE_EXCEPTION_IEEE_754_FP_OVERFLOW in
12121                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
12122     ``.amdhsa_exception_fp_ieee_underflow``                  0                   GFX6-GFX10   Controls ENABLE_EXCEPTION_IEEE_754_FP_UNDERFLOW in
12123                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
12124     ``.amdhsa_exception_fp_ieee_inexact``                    0                   GFX6-GFX10   Controls ENABLE_EXCEPTION_IEEE_754_FP_INEXACT in
12125                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
12126     ``.amdhsa_exception_int_div_zero``                       0                   GFX6-GFX10   Controls ENABLE_EXCEPTION_INT_DIVIDE_BY_ZERO in
12127                                                                                               :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
12128     ======================================================== =================== ============ ===================
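
As a rough illustration of how a register count can feed a granulated field
such as GRANULATED_WAVEFRONT_SGPR_COUNT, the sketch below assumes an encoding
of the form ``max(0, ceil(n / granule) - 1)``. The helper names, the granule
size, and the number of extra SGPRs added for VCC, flat scratch, and the XNACK
mask are placeholders chosen for illustration. The authoritative, target
specific definition is in
:ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.

.. code:: python

   import math


   def sgprs_used(next_free_sgpr, reserved_extra):
       """Total SGPRs: explicitly referenced ones plus reserved special ones.

       ASSUMPTION: reserved_extra stands in for whatever the reserve_*
       directives add on a given target; it is not modeled here.
       """
       return next_free_sgpr + reserved_extra


   def granulated_count(registers_used, granule):
       """Encode a register count as a zero-based number of granules.

       ASSUMPTION: the field holds max(0, ceil(n / granule) - 1); some
       targets apply additional scaling that this sketch ignores.
       """
       return max(0, math.ceil(registers_used / granule) - 1)


   # Placeholder inputs, not taken from any particular target.
   total = sgprs_used(next_free_sgpr=10, reserved_extra=2)
   encoded = granulated_count(total, granule=8)
   print(total, encoded)

Passing the granule as a parameter keeps the sketch independent of any one
target generation.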
12129
12130.amdgpu_metadata
12131++++++++++++++++
12132
12133Optional directive which declares the contents of the ``NT_AMDGPU_METADATA``
12134note record (see :ref:`amdgpu-elf-note-records-table-v3-v4`).
12135
12136The contents must be in the [YAML]_ markup format, with the same structure and
12137semantics described in :ref:`amdgpu-amdhsa-code-object-metadata-v3` or
12138:ref:`amdgpu-amdhsa-code-object-metadata-v4`.
12139
12140This directive is terminated by an ``.end_amdgpu_metadata`` directive.
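
Because the metadata is delimited by a pair of directives, a tool can pull the
YAML text out of an assembly source with a simple line scan. The following is
a minimal sketch; ``extract_amdgpu_metadata`` is a hypothetical helper and is
not part of LLVM. It assumes each directive appears alone on its line and
returns the raw text without parsing or validating the YAML.

.. code:: python

   def extract_amdgpu_metadata(asm_source):
       """Return the text between the two metadata directives, or None."""
       body = []
       inside = False
       for line in asm_source.splitlines():
           stripped = line.strip()
           if stripped == ".amdgpu_metadata":
               inside = True
               continue
           if stripped == ".end_amdgpu_metadata":
               # A terminator without an opening directive yields None.
               return "\n".join(body) if inside else None
           if inside:
               body.append(line)
       return None  # directive absent or never terminated


   example = "\n".join([
       ".text",
       ".amdgpu_metadata",
       "---",
       "amdhsa.version:",
       "  - 1",
       "  - 0",
       ".end_amdgpu_metadata",
   ])
   metadata_text = extract_amdgpu_metadata(example)
   print(metadata_text)

The returned string could then be handed to any YAML parser.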
12141
12142.. _amdgpu-amdhsa-assembler-example-v3-v4:
12143
12144Code Object V3 to V4 Example Source Code
12145~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
12146
12147Here is an example of a minimal assembly source file, defining one HSA kernel:
12148
12149.. code::
12150   :number-lines:
12151
12152   .amdgcn_target "amdgcn-amd-amdhsa--gfx900+xnack" // optional
12153
12154   .text
12155   .globl hello_world
12156   .p2align 8
12157   .type hello_world,@function
12158   hello_world:
     s_load_dwordx2 s[0:1], s[0:1], 0x0
12160     v_mov_b32 v0, 3.14159
12161     s_waitcnt lgkmcnt(0)
12162     v_mov_b32 v1, s0
12163     v_mov_b32 v2, s1
12164     flat_store_dword v[1:2], v0
12165     s_endpgm
12166   .Lfunc_end0:
12167     .size   hello_world, .Lfunc_end0-hello_world
12168
12169   .rodata
12170   .p2align 6
12171   .amdhsa_kernel hello_world
12172     .amdhsa_user_sgpr_kernarg_segment_ptr 1
12173     .amdhsa_next_free_vgpr .amdgcn.next_free_vgpr
12174     .amdhsa_next_free_sgpr .amdgcn.next_free_sgpr
12175   .end_amdhsa_kernel
12176
12177   .amdgpu_metadata
12178   ---
12179   amdhsa.version:
12180     - 1
12181     - 0
12182   amdhsa.kernels:
12183     - .name: hello_world
12184       .symbol: hello_world.kd
12185       .kernarg_segment_size: 48
12186       .group_segment_fixed_size: 0
12187       .private_segment_fixed_size: 0
12188       .kernarg_segment_align: 4
12189       .wavefront_size: 64
12190       .sgpr_count: 2
12191       .vgpr_count: 3
12192       .max_flat_workgroup_size: 256
12193       .args:
12194         - .size: 8
12195           .offset: 0
12196           .value_kind: global_buffer
12197           .address_space: global
12198           .actual_access: write_only
12199   //...
12200   .end_amdgpu_metadata
12201
12202This kernel is equivalent to the following HIP program:
12203
12204.. code::
12205   :number-lines:
12206
12207   __global__ void hello_world(float *p) {
12208       *p = 3.14159f;
12209   }
12210
If an assembly source file contains multiple kernels and/or functions, the
:ref:`amdgpu-amdhsa-assembler-symbol-next_free_vgpr` and
:ref:`amdgpu-amdhsa-assembler-symbol-next_free_sgpr` symbols may be reset using
the ``.set <symbol>, <expression>`` directive. For example, in the case of two
kernels, ``kern0`` and ``kern1``, where the function ``func1`` is only called
from ``kern1``, it is sufficient to group the function with the kernel that
calls it and reset the symbols between the two connected components:
12218
12219.. code::
12220   :number-lines:
12221
12222   .amdgcn_target "amdgcn-amd-amdhsa--gfx900+xnack" // optional
12223
   // GPR tracking symbols are implicitly set to zero
12225
12226   .text
12227   .globl kern0
12228   .p2align 8
12229   .type kern0,@function
12230   kern0:
12231     // ...
12232     s_endpgm
12233   .Lkern0_end:
12234     .size   kern0, .Lkern0_end-kern0
12235
12236   .rodata
12237   .p2align 6
12238   .amdhsa_kernel kern0
12239     // ...
12240     .amdhsa_next_free_vgpr .amdgcn.next_free_vgpr
12241     .amdhsa_next_free_sgpr .amdgcn.next_free_sgpr
12242   .end_amdhsa_kernel
12243
12244   // reset symbols to begin tracking usage in func1 and kern1
12245   .set .amdgcn.next_free_vgpr, 0
12246   .set .amdgcn.next_free_sgpr, 0
12247
12248   .text
12249   .hidden func1
12250   .global func1
12251   .p2align 2
12252   .type func1,@function
12253   func1:
12254     // ...
12255     s_setpc_b64 s[30:31]
12256   .Lfunc1_end:
12257   .size func1, .Lfunc1_end-func1
12258
12259   .globl kern1
12260   .p2align 8
12261   .type kern1,@function
12262   kern1:
12263     // ...
12264     s_getpc_b64 s[4:5]
12265     s_add_u32 s4, s4, func1@rel32@lo+4
     s_addc_u32 s5, s5, func1@rel32@hi+12
12267     s_swappc_b64 s[30:31], s[4:5]
12268     // ...
12269     s_endpgm
12270   .Lkern1_end:
12271     .size   kern1, .Lkern1_end-kern1
12272
12273   .rodata
12274   .p2align 6
12275   .amdhsa_kernel kern1
12276     // ...
12277     .amdhsa_next_free_vgpr .amdgcn.next_free_vgpr
12278     .amdhsa_next_free_sgpr .amdgcn.next_free_sgpr
12279   .end_amdhsa_kernel
12280
These symbols do not identify connected components of the call graph, so they
cannot automatically track register usage for each kernel. However, in some
cases careful organization of the kernels and functions in the source file
means that little additional effort is required to calculate GPR usage
accurately.
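
The effect of resetting the tracking symbols between connected components can
be modeled in a few lines. This is only a toy sketch: ``GprTracker`` is a
hypothetical class, the register numbers are invented for illustration, and
the model assumes the symbol is raised to the highest explicitly referenced
register number plus one whenever that exceeds its current value.

.. code:: python

   class GprTracker:
       """Toy model of a next-free GPR tracking symbol."""

       def __init__(self):
           self.next_free = 0  # tracking symbols start at zero

       def reference(self, first, last=None):
           """Record an explicit reference to registers first..last."""
           highest = first if last is None else last
           self.next_free = max(self.next_free, highest + 1)

       def reset(self, value=0):
           """Model a ``.set <symbol>, <expression>`` directive."""
           self.next_free = value


   vgpr = GprTracker()

   # First component: kern0 references v0 and v[1:2] (invented usage).
   vgpr.reference(0)
   vgpr.reference(1, 2)
   kern0_vgprs = vgpr.next_free

   # Reset before the second component, which groups func1 with kern1.
   vgpr.reset()
   vgpr.reference(5)  # func1 references v5 (invented usage)
   vgpr.reference(0)  # kern1 references v0 (invented usage)
   kern1_vgprs = vgpr.next_free

   print(kern0_vgprs, kern1_vgprs)

Without the reset, the value seen by ``kern1`` would also include any higher
register numbers referenced by ``kern0``.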
12285
12286Additional Documentation
12287========================
12288
12289.. [AMD-GCN-GFX6] `AMD Southern Islands Series ISA <http://developer.amd.com/wordpress/media/2012/12/AMD_Southern_Islands_Instruction_Set_Architecture.pdf>`__
.. [AMD-GCN-GFX7] `AMD Sea Islands Series ISA <http://developer.amd.com/wordpress/media/2013/07/AMD_Sea_Islands_Instruction_Set_Architecture.pdf>`__
12291.. [AMD-GCN-GFX8] `AMD GCN3 Instruction Set Architecture <http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2013/12/AMD_GCN3_Instruction_Set_Architecture_rev1.1.pdf>`__
12292.. [AMD-GCN-GFX900-GFX904-VEGA] `AMD Vega Instruction Set Architecture <http://developer.amd.com/wordpress/media/2013/12/Vega_Shader_ISA_28July2017.pdf>`__
12293.. [AMD-GCN-GFX906-VEGA7NM] `AMD Vega 7nm Instruction Set Architecture <https://gpuopen.com/wp-content/uploads/2019/11/Vega_7nm_Shader_ISA_26November2019.pdf>`__
12294.. [AMD-GCN-GFX908-CDNA1] `AMD Instinct MI100 Instruction Set Architecture <https://developer.amd.com/wp-content/resources/CDNA1_Shader_ISA_14December2020.pdf>`__
12295.. [AMD-GCN-GFX10-RDNA1] `AMD RDNA 1.0 Instruction Set Architecture <https://gpuopen.com/wp-content/uploads/2019/08/RDNA_Shader_ISA_5August2019.pdf>`__
12296.. [AMD-GCN-GFX10-RDNA2] `AMD RDNA 2 Instruction Set Architecture <https://developer.amd.com/wp-content/resources/RDNA2_Shader_ISA_November2020.pdf>`__
12297.. [AMD-RADEON-HD-2000-3000] `AMD R6xx shader ISA <http://developer.amd.com/wordpress/media/2012/10/R600_Instruction_Set_Architecture.pdf>`__
12298.. [AMD-RADEON-HD-4000] `AMD R7xx shader ISA <http://developer.amd.com/wordpress/media/2012/10/R700-Family_Instruction_Set_Architecture.pdf>`__
12299.. [AMD-RADEON-HD-5000] `AMD Evergreen shader ISA <http://developer.amd.com/wordpress/media/2012/10/AMD_Evergreen-Family_Instruction_Set_Architecture.pdf>`__
12300.. [AMD-RADEON-HD-6000] `AMD Cayman/Trinity shader ISA <http://developer.amd.com/wordpress/media/2012/10/AMD_HD_6900_Series_Instruction_Set_Architecture.pdf>`__
12301.. [AMD-ROCm] `AMD ROCm™ Platform <https://rocmdocs.amd.com/>`__
12302.. [AMD-ROCm-github] `AMD ROCm™ github <http://github.com/RadeonOpenCompute>`__
12303.. [AMD-ROCm-Release-Notes] `AMD ROCm Release Notes <https://github.com/RadeonOpenCompute/ROCm>`__
12304.. [CLANG-ATTR] `Attributes in Clang <https://clang.llvm.org/docs/AttributeReference.html>`__
12305.. [DWARF] `DWARF Debugging Information Format <http://dwarfstd.org/>`__
12306.. [ELF] `Executable and Linkable Format (ELF) <http://www.sco.com/developers/gabi/>`__
12307.. [HRF] `Heterogeneous-race-free Memory Models <http://benedictgaster.org/wp-content/uploads/2014/01/asplos269-FINAL.pdf>`__
12308.. [HSA] `Heterogeneous System Architecture (HSA) Foundation <http://www.hsafoundation.com/>`__
12309.. [MsgPack] `Message Pack <http://www.msgpack.org/>`__
12310.. [OpenCL] `The OpenCL Specification Version 2.0 <http://www.khronos.org/registry/cl/specs/opencl-2.0.pdf>`__
12311.. [SEMVER] `Semantic Versioning <https://semver.org/>`__
12312.. [YAML] `YAML Ain't Markup Language (YAML™) Version 1.2 <http://www.yaml.org/spec/1.2/spec.html>`__
12313