=============================
User Guide for AMDGPU Backend
=============================

.. contents::
   :local:

.. toctree::
   :hidden:

   AMDGPU/AMDGPUAsmGFX7
   AMDGPU/AMDGPUAsmGFX8
   AMDGPU/AMDGPUAsmGFX9
   AMDGPU/AMDGPUAsmGFX900
   AMDGPU/AMDGPUAsmGFX904
   AMDGPU/AMDGPUAsmGFX906
   AMDGPU/AMDGPUAsmGFX908
   AMDGPU/AMDGPUAsmGFX90a
   AMDGPU/AMDGPUAsmGFX10
   AMDGPU/AMDGPUAsmGFX1011
   AMDGPUModifierSyntax
   AMDGPUOperandSyntax
   AMDGPUInstructionSyntax
   AMDGPUInstructionNotation
   AMDGPUDwarfExtensionsForHeterogeneousDebugging

Introduction
============

The AMDGPU backend provides ISA code generation for AMD GPUs, from the R600
family through the current GCN families. It lives in the
``llvm/lib/Target/AMDGPU`` directory.

LLVM
====

.. _amdgpu-target-triples:

Target Triples
--------------

Use the Clang option ``-target <Architecture>-<Vendor>-<OS>-<Environment>``
to specify the target triple:

  .. table:: AMDGPU Architectures
     :name: amdgpu-architecture-table

     ============ ==============================================================
     Architecture Description
     ============ ==============================================================
     ``r600``     AMD GPUs HD2XXX-HD6XXX for graphics and compute shaders.
     ``amdgcn``   AMD GPUs GCN GFX6 onwards for graphics and compute shaders.
     ============ ==============================================================

  .. table:: AMDGPU Vendors
     :name: amdgpu-vendor-table

     ============ ==============================================================
     Vendor       Description
     ============ ==============================================================
     ``amd``      Can be used for all AMD GPU usage.
     ``mesa3d``   Can be used if the OS is ``mesa3d``.
     ============ ==============================================================

  .. table:: AMDGPU Operating Systems
     :name: amdgpu-os

     ============== ============================================================
     OS             Description
     ============== ============================================================
     *<empty>*      Defaults to the *unknown* OS.
     ``amdhsa``     Compute kernels executed on HSA [HSA]_ compatible runtimes
                    such as:

                    - AMD's ROCm™ runtime [AMD-ROCm]_ using the *rocm-amdhsa*
                      loader on Linux. See *AMD ROCm Platform Release Notes*
                      [AMD-ROCm-Release-Notes]_ for supported hardware and
                      software.
                    - AMD's PAL runtime using the *pal-amdhsa* loader on
                      Windows.

     ``amdpal``     Graphics shaders and compute kernels executed on AMD's PAL
                    runtime using the *pal-amdpal* loader on Windows and Linux
                    Pro.
     ``mesa3d``     Graphics shaders and compute kernels executed on AMD's Mesa
                    3D runtime using the *mesa-mesa3d* loader on Linux.
     ============== ============================================================

  .. table:: AMDGPU Environments
     :name: amdgpu-environment-table

     ============ ==============================================================
     Environment  Description
     ============ ==============================================================
     *<empty>*    Default.
     ============ ==============================================================

.. _amdgpu-processors:

Processors
----------

Use the Clang options ``-mcpu=<target-id>`` or ``--offload-arch=<target-id>`` to
specify the AMDGPU processor together with optional target features. See
:ref:`amdgpu-target-id` and :ref:`amdgpu-target-features` for AMD GPU target
specific information.

Every processor supports every OS ABI (see :ref:`amdgpu-os`) with the following
exceptions:

* ``amdhsa`` is not supported in the ``r600`` architecture (see
  :ref:`amdgpu-architecture-table`).
  .. table:: AMDGPU Processors
     :name: amdgpu-processor-table

     =========== =============== ============ ===== ================= =============== =============== ======================
     Processor   Alternative     Target       dGPU/ Target            Target          OS Support      Example
                 Processor       Triple       APU   Features          Properties      *(see*          Products
                                 Architecture       Supported                         `amdgpu-os`_
                                                                                      *and
                                                                                      corresponding
                                                                                      runtime release
                                                                                      notes for
                                                                                      current
                                                                                      information and
                                                                                      level of
                                                                                      support)*
     =========== =============== ============ ===== ================= =============== =============== ======================
     **Radeon HD 2000/3000 Series (R600)** [AMD-RADEON-HD-2000-3000]_
     -----------------------------------------------------------------------------------------------------------------------
     ``r600``                    ``r600``     dGPU                    - Does not
                                                                       support
                                                                       generic
                                                                       address
                                                                       space
     ``r630``                    ``r600``     dGPU                    - Does not
                                                                       support
                                                                       generic
                                                                       address
                                                                       space
     ``rs880``                   ``r600``     dGPU                    - Does not
                                                                       support
                                                                       generic
                                                                       address
                                                                       space
     ``rv670``                   ``r600``     dGPU                    - Does not
                                                                       support
                                                                       generic
                                                                       address
                                                                       space
     **Radeon HD 4000 Series (R700)** [AMD-RADEON-HD-4000]_
     -----------------------------------------------------------------------------------------------------------------------
     ``rv710``                   ``r600``     dGPU                    - Does not
                                                                       support
                                                                       generic
                                                                       address
                                                                       space
     ``rv730``                   ``r600``     dGPU                    - Does not
                                                                       support
                                                                       generic
                                                                       address
                                                                       space
     ``rv770``                   ``r600``     dGPU                    - Does not
                                                                       support
                                                                       generic
                                                                       address
                                                                       space
     **Radeon HD 5000 Series (Evergreen)** [AMD-RADEON-HD-5000]_
     -----------------------------------------------------------------------------------------------------------------------
     ``cedar``                   ``r600``     dGPU                    - Does not
                                                                       support
                                                                       generic
                                                                       address
                                                                       space
     ``cypress``                 ``r600``     dGPU                    - Does not
                                                                       support
                                                                       generic
                                                                       address
                                                                       space
     ``juniper``                 ``r600``     dGPU                    - Does not
                                                                       support
                                                                       generic
                                                                       address
                                                                       space
     ``redwood``                 ``r600``     dGPU                    - Does not
                                                                       support
                                                                       generic
                                                                       address
                                                                       space
     ``sumo``                    ``r600``     dGPU                    - Does not
                                                                       support
                                                                       generic
                                                                       address
                                                                       space
     **Radeon HD 6000 Series (Northern Islands)** [AMD-RADEON-HD-6000]_
     -----------------------------------------------------------------------------------------------------------------------
     ``barts``                   ``r600``     dGPU                    - Does not
                                                                       support
                                                                       generic
                                                                       address
                                                                       space
     ``caicos``                  ``r600``     dGPU                    - Does not
                                                                       support
                                                                       generic
                                                                       address
                                                                       space
     ``cayman``                  ``r600``     dGPU                    - Does not
                                                                       support
                                                                       generic
                                                                       address
                                                                       space
     ``turks``                   ``r600``     dGPU                    - Does not
                                                                       support
                                                                       generic
                                                                       address
                                                                       space
     **GCN GFX6 (Southern Islands (SI))** [AMD-GCN-GFX6]_
     -----------------------------------------------------------------------------------------------------------------------
     ``gfx600``  - ``tahiti``    ``amdgcn``   dGPU                    - Does not      - *pal-amdpal*
                                                                       support
                                                                       generic
                                                                       address
                                                                       space
     ``gfx601``  - ``pitcairn``  ``amdgcn``   dGPU                    - Does not      - *pal-amdpal*
                 - ``verde``                                           support
                                                                       generic
                                                                       address
                                                                       space
     ``gfx602``  - ``hainan``    ``amdgcn``   dGPU                    - Does not      - *pal-amdpal*
                 - ``oland``                                           support
                                                                       generic
                                                                       address
                                                                       space
     **GCN GFX7 (Sea Islands (CI))** [AMD-GCN-GFX7]_
     -----------------------------------------------------------------------------------------------------------------------
     ``gfx700``  - ``kaveri``    ``amdgcn``   APU                     - Offset        - *rocm-amdhsa* - A6-7000
                                                                       flat           - *pal-amdhsa*  - A6 Pro-7050B
                                                                       scratch        - *pal-amdpal*  - A8-7100
                                                                                                      - A8 Pro-7150B
                                                                                                      - A10-7300
                                                                                                      - A10 Pro-7350B
                                                                                                      - FX-7500
                                                                                                      - A8-7200P
                                                                                                      - A10-7400P
                                                                                                      - FX-7600P
     ``gfx701``  - ``hawaii``    ``amdgcn``   dGPU                    - Offset        - *rocm-amdhsa* - FirePro W8100
                                                                       flat           - *pal-amdhsa*  - FirePro W9100
                                                                       scratch        - *pal-amdpal*  - FirePro S9150
                                                                                                      - FirePro S9170
     ``gfx702``                  ``amdgcn``   dGPU                    - Offset        - *rocm-amdhsa* - Radeon R9 290
                                                                       flat           - *pal-amdhsa*  - Radeon R9 290x
                                                                       scratch        - *pal-amdpal*  - Radeon R390
                                                                                                      - Radeon R390x
     ``gfx703``  - ``kabini``    ``amdgcn``   APU                     - Offset        - *pal-amdhsa*  - E1-2100
                 - ``mullins``                                         flat           - *pal-amdpal*  - E1-2200
                                                                       scratch                        - E1-2500
                                                                                                      - E2-3000
                                                                                                      - E2-3800
                                                                                                      - A4-5000
                                                                                                      - A4-5100
                                                                                                      - A6-5200
                                                                                                      - A4 Pro-3340B
     ``gfx704``  - ``bonaire``   ``amdgcn``   dGPU                    - Offset        - *pal-amdhsa*  - Radeon HD 7790
                                                                       flat           - *pal-amdpal*  - Radeon HD 8770
                                                                       scratch                        - R7 260
                                                                                                      - R7 260X
     ``gfx705``                  ``amdgcn``   APU                     - Offset        - *pal-amdhsa*  *TBA*
                                                                       flat           - *pal-amdpal*
                                                                       scratch                        .. TODO::

                                                                                                         Add product
                                                                                                         names.

     **GCN GFX8 (Volcanic Islands (VI))** [AMD-GCN-GFX8]_
     -----------------------------------------------------------------------------------------------------------------------
     ``gfx801``  - ``carrizo``   ``amdgcn``   APU   - xnack           - Offset        - *rocm-amdhsa* - A6-8500P
                                                                        flat          - *pal-amdhsa*  - Pro A6-8500B
                                                                        scratch       - *pal-amdpal*  - A8-8600P
                                                                                                      - Pro A8-8600B
                                                                                                      - FX-8800P
                                                                                                      - Pro A12-8800B
                                                                                                      - A10-8700P
                                                                                                      - Pro A10-8700B
                                                                                                      - A10-8780P
                                                                                                      - A10-9600P
                                                                                                      - A10-9630P
                                                                                                      - A12-9700P
                                                                                                      - A12-9730P
                                                                                                      - FX-9800P
                                                                                                      - FX-9830P
                                                                                                      - E2-9010
                                                                                                      - A6-9210
                                                                                                      - A9-9410
     ``gfx802``  - ``iceland``   ``amdgcn``   dGPU                    - Offset        - *rocm-amdhsa* - Radeon R9 285
                 - ``tonga``                                           flat           - *pal-amdhsa*  - Radeon R9 380
                                                                       scratch        - *pal-amdpal*  - Radeon R9 385
     ``gfx803``  - ``fiji``      ``amdgcn``   dGPU                                    - *rocm-amdhsa* - Radeon R9 Nano
                                                                                      - *pal-amdhsa*  - Radeon R9 Fury
                                                                                      - *pal-amdpal*  - Radeon R9 FuryX
                                                                                                      - Radeon Pro Duo
                                                                                                      - FirePro S9300x2
                                                                                                      - Radeon Instinct MI8
     \           - ``polaris10`` ``amdgcn``   dGPU                    - Offset        - *rocm-amdhsa* - Radeon RX 470
                                                                       flat           - *pal-amdhsa*  - Radeon RX 480
                                                                       scratch        - *pal-amdpal*  - Radeon Instinct MI6
     \           - ``polaris11`` ``amdgcn``   dGPU                    - Offset        - *rocm-amdhsa* - Radeon RX 460
                                                                       flat           - *pal-amdhsa*
                                                                       scratch        - *pal-amdpal*
     ``gfx805``  - ``tongapro``  ``amdgcn``   dGPU                    - Offset        - *rocm-amdhsa* - FirePro S7150
                                                                       flat           - *pal-amdhsa*  - FirePro S7100
                                                                       scratch        - *pal-amdpal*  - FirePro W7100
                                                                                                      - Mobile FirePro
                                                                                                        M7170
     ``gfx810``  - ``stoney``    ``amdgcn``   APU   - xnack           - Offset        - *rocm-amdhsa* *TBA*
                                                                        flat          - *pal-amdhsa*
                                                                        scratch       - *pal-amdpal*  .. TODO::

                                                                                                         Add product
                                                                                                         names.

     **GCN GFX9 (Vega)** [AMD-GCN-GFX9]_ [AMD-GCN-GFX908-CDNA1]_
     -----------------------------------------------------------------------------------------------------------------------
     ``gfx900``                  ``amdgcn``   dGPU  - xnack           - Absolute      - *rocm-amdhsa* - Radeon Vega
                                                                        flat          - *pal-amdhsa*    Frontier Edition
                                                                        scratch       - *pal-amdpal*  - Radeon RX Vega 56
                                                                                                      - Radeon RX Vega 64
                                                                                                      - Radeon RX Vega 64
                                                                                                        Liquid
                                                                                                      - Radeon Instinct MI25
     ``gfx902``                  ``amdgcn``   APU   - xnack           - Absolute      - *rocm-amdhsa* - Ryzen 3 2200G
                                                                        flat          - *pal-amdhsa*  - Ryzen 5 2400G
                                                                        scratch       - *pal-amdpal*
     ``gfx904``                  ``amdgcn``   dGPU  - xnack                           - *rocm-amdhsa* *TBA*
                                                                                      - *pal-amdhsa*
                                                                                      - *pal-amdpal*  .. TODO::

                                                                                                         Add product
                                                                                                         names.

     ``gfx906``                  ``amdgcn``   dGPU  - sramecc         - Absolute      - *rocm-amdhsa* - Radeon Instinct MI50
                                                    - xnack             flat          - *pal-amdhsa*  - Radeon Instinct MI60
                                                                        scratch       - *pal-amdpal*  - Radeon VII
                                                                                                      - Radeon Pro VII
     ``gfx908``                  ``amdgcn``   dGPU  - sramecc                         - *rocm-amdhsa* - AMD Instinct MI100 Accelerator
                                                    - xnack           - Absolute
                                                                        flat
                                                                        scratch
     ``gfx909``                  ``amdgcn``   APU   - xnack           - Absolute      - *pal-amdpal*  *TBA*
                                                                        flat
                                                                        scratch                       .. TODO::

                                                                                                         Add product
                                                                                                         names.

     ``gfx90a``                  ``amdgcn``   dGPU  - sramecc         - Absolute      - *rocm-amdhsa* *TBA*
                                                    - tgsplit           flat
                                                    - xnack             scratch
                                                                      - Packed                        .. TODO::
                                                                        work-item
                                                                        IDs                              Add product
                                                                                                         names.

     ``gfx90c``                  ``amdgcn``   APU   - xnack           - Absolute      - *pal-amdpal*  - Ryzen 7 4700G
                                                                        flat                          - Ryzen 7 4700GE
                                                                        scratch                       - Ryzen 5 4600G
                                                                                                      - Ryzen 5 4600GE
                                                                                                      - Ryzen 3 4300G
                                                                                                      - Ryzen 3 4300GE
                                                                                                      - Ryzen Pro 4000G
                                                                                                      - Ryzen 7 Pro 4700G
                                                                                                      - Ryzen 7 Pro 4750GE
                                                                                                      - Ryzen 5 Pro 4650G
                                                                                                      - Ryzen 5 Pro 4650GE
                                                                                                      - Ryzen 3 Pro 4350G
                                                                                                      - Ryzen 3 Pro 4350GE

     **GCN GFX10 (RDNA 1)** [AMD-GCN-GFX10-RDNA1]_
     -----------------------------------------------------------------------------------------------------------------------
     ``gfx1010``                 ``amdgcn``   dGPU  - cumode          - Absolute      - *rocm-amdhsa* - Radeon RX 5700
                                                    - wavefrontsize64   flat          - *pal-amdhsa*  - Radeon RX 5700 XT
                                                    - xnack             scratch       - *pal-amdpal*  - Radeon Pro 5600 XT
                                                                                                      - Radeon Pro 5600M
     ``gfx1011``                 ``amdgcn``   dGPU  - cumode                          - *rocm-amdhsa* - Radeon Pro V520
                                                    - wavefrontsize64 - Absolute      - *pal-amdhsa*
                                                    - xnack             flat          - *pal-amdpal*
                                                                        scratch
     ``gfx1012``                 ``amdgcn``   dGPU  - cumode          - Absolute      - *rocm-amdhsa* - Radeon RX 5500
                                                    - wavefrontsize64   flat          - *pal-amdhsa*  - Radeon RX 5500 XT
                                                    - xnack             scratch       - *pal-amdpal*
     **GCN GFX10 (RDNA 2)** [AMD-GCN-GFX10-RDNA2]_
     -----------------------------------------------------------------------------------------------------------------------
     ``gfx1030``                 ``amdgcn``   dGPU  - cumode          - Absolute      - *rocm-amdhsa* - Radeon RX 6800
                                                    - wavefrontsize64   flat          - *pal-amdhsa*  - Radeon RX 6800 XT
                                                                        scratch       - *pal-amdpal*  - Radeon RX 6900 XT
     ``gfx1031``                 ``amdgcn``   dGPU  - cumode          - Absolute      - *rocm-amdhsa* - Radeon RX 6700 XT
                                                    - wavefrontsize64   flat          - *pal-amdhsa*
                                                                        scratch       - *pal-amdpal*
     ``gfx1032``                 ``amdgcn``   dGPU  - cumode          - Absolute      - *rocm-amdhsa* *TBA*
                                                    - wavefrontsize64   flat          - *pal-amdhsa*
                                                                        scratch       - *pal-amdpal*  .. TODO::

                                                                                                         Add product
                                                                                                         names.

     ``gfx1033``                 ``amdgcn``   APU   - cumode          - Absolute      - *pal-amdpal*  *TBA*
                                                    - wavefrontsize64   flat
                                                                        scratch                       .. TODO::

                                                                                                         Add product
                                                                                                         names.

     ``gfx1034``                 ``amdgcn``   dGPU  - cumode          - Absolute      - *pal-amdpal*  *TBA*
                                                    - wavefrontsize64   flat
                                                                        scratch                       .. TODO::

                                                                                                         Add product
                                                                                                         names.

     =========== =============== ============ ===== ================= =============== =============== ======================

.. _amdgpu-target-features:

Target Features
---------------

Target features control how code is generated to support certain
processor specific features. Not all target features are supported by
all processors. The runtime must ensure that the features supported by
the device used to execute the code match the features enabled when
generating the code. A mismatch of features may result in incorrect
execution or reduced performance.

The target features supported by each processor are listed in
:ref:`amdgpu-processor-table`.

Target features are controlled by exactly one of the following Clang
options:

``-mcpu=<target-id>`` or ``--offload-arch=<target-id>``

  The ``-mcpu`` and ``--offload-arch`` options can specify the target features
  as optional components of the target ID. If omitted, the target feature has
  the ``any`` value. See :ref:`amdgpu-target-id`.

``-m[no-]<target-feature>``

  Target features not specified by the target ID are specified using a
  separate option. These target features can have an ``on`` or ``off``
  value. ``on`` is specified by omitting the ``no-`` prefix, and
  ``off`` is specified by including the ``no-`` prefix. The default
  if not specified is ``off``.

For example:

``-mcpu=gfx908:xnack+``
  Enable the ``xnack`` feature.
``-mcpu=gfx908:xnack-``
  Disable the ``xnack`` feature.
``-mcumode``
  Enable the ``cumode`` feature.
``-mno-cumode``
  Disable the ``cumode`` feature.
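The target-ID feature syntax accepted by ``-mcpu`` and ``--offload-arch`` can be sketched as a small parser. This is illustrative only; the function name and the on/off/any representation are hypothetical, not part of Clang:

```python
def parse_target_id(target_id):
    """Split "<processor>[:<feature>(+|-)...]" into the processor name and a
    feature -> setting map. A trailing "+" means on and "-" means off;
    features not mentioned keep the default "any" setting."""
    processor, *parts = target_id.split(":")
    features = {}
    for part in parts:
        name, sign = part[:-1], part[-1]
        if sign not in "+-":
            raise ValueError("feature must end in '+' or '-': " + part)
        features[name] = "on" if sign == "+" else "off"
    return processor, features
```

For example, ``parse_target_id("gfx908:xnack+")`` yields the processor ``gfx908`` with ``xnack`` set to on, while a bare ``gfx908`` leaves every feature at ``any``.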
  .. table:: AMDGPU Target Features
     :name: amdgpu-target-features-table

     =============== ============================ ==================================================
     Target Feature  Clang Option to Control      Description
     Name
     =============== ============================ ==================================================
     cumode          - ``-m[no-]cumode``          Control the wavefront execution mode used
                                                  when generating code for kernels. When disabled
                                                  native WGP wavefront execution mode is used;
                                                  when enabled CU wavefront execution mode is used
                                                  (see :ref:`amdgpu-amdhsa-memory-model`).

     sramecc         - ``-mcpu``                  If specified, generate code that can only be
                     - ``--offload-arch``         loaded and executed in a process that has a
                                                  matching setting for SRAMECC.

                                                  If not specified for code object V2 to V3, generate
                                                  code that can be loaded and executed in a process
                                                  with SRAMECC enabled.

                                                  If not specified for code object V4, generate
                                                  code that can be loaded and executed in a process
                                                  with either setting of SRAMECC.

     tgsplit         ``-m[no-]tgsplit``           Enable/disable generating code that assumes
                                                  work-groups are launched in threadgroup split mode.
                                                  When enabled the waves of a work-group may be
                                                  launched in different CUs.

     wavefrontsize64 - ``-m[no-]wavefrontsize64`` Control the wavefront size used when
                                                  generating code for kernels. When disabled
                                                  native wavefront size 32 is used; when enabled
                                                  wavefront size 64 is used.

     xnack           - ``-mcpu``                  If specified, generate code that can only be
                     - ``--offload-arch``         loaded and executed in a process that has a
                                                  matching setting for XNACK replay.

                                                  If not specified for code object V2 to V3, generate
                                                  code that can be loaded and executed in a process
                                                  with XNACK replay enabled.

                                                  If not specified for code object V4, generate
                                                  code that can be loaded and executed in a process
                                                  with either setting of XNACK replay.

                                                  XNACK replay can be used for demand paging and
                                                  page migration. If enabled in the device, then if
                                                  a page fault occurs the code may execute
                                                  incorrectly unless generated with XNACK replay
                                                  enabled, or generated for code object V4 without
                                                  specifying XNACK replay. Executing code that was
                                                  generated with XNACK replay enabled, or generated
                                                  for code object V4 without specifying XNACK replay,
                                                  on a device that does not have XNACK replay
                                                  enabled will execute correctly but may be less
                                                  performant than code generated for XNACK replay
                                                  disabled.
     =============== ============================ ==================================================

.. _amdgpu-target-id:

Target ID
---------

AMDGPU supports target IDs. See `Clang Offload Bundler
<https://clang.llvm.org/docs/ClangOffloadBundler.html>`_ for a general
description. The AMDGPU target specific information is:

**processor**
  Is an AMDGPU processor or alternative processor name specified in
  :ref:`amdgpu-processor-table`. The non-canonical form target ID allows both
  the primary processor and alternative processor names. The canonical form
  target ID only allows the primary processor name.

**target-feature**
  Is a target feature name specified in :ref:`amdgpu-target-features-table`
  that is supported by the processor. The target features supported by each
  processor are specified in :ref:`amdgpu-processor-table`. Those that can be
  specified in a target ID are marked as being controlled by ``-mcpu`` and
  ``--offload-arch``. Each target feature must appear at most once in a target
  ID. The non-canonical form target ID allows the target features to be
  specified in any order. The canonical form target ID requires the target
  features to be specified in alphabetic order.
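The canonicalization rules above (primary processor name, target features in alphabetic order) can be illustrated with a short sketch. The alias table below holds only two sample entries and is not the authoritative mapping; see :ref:`amdgpu-processor-table` for the full list:

```python
# Sample alternative-name entries only; the processor table is authoritative.
PRIMARY_NAME = {"tahiti": "gfx600", "carrizo": "gfx801"}

def canonicalize_target_id(target_id):
    """Rewrite a non-canonical target ID into canonical form: the primary
    processor name followed by its target features in alphabetic order."""
    processor, *features = target_id.split(":")
    return ":".join([PRIMARY_NAME.get(processor, processor)] + sorted(features))
```

For example, ``gfx90a:xnack+:sramecc-`` canonicalizes to ``gfx90a:sramecc-:xnack+``.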
.. _amdgpu-target-id-v2-v3:

Code Object V2 to V3 Target ID
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The target ID syntax for code object V2 to V3 is the same as defined in `Clang
Offload Bundler <https://clang.llvm.org/docs/ClangOffloadBundler.html>`_ except
when used in the :ref:`amdgpu-assembler-directive-amdgcn-target` assembler
directive and the bundle entry ID. In those cases it has the following BNF
syntax:

.. code::

  <target-id> ::== <processor> ( "+" <target-feature> )*

Where a target feature is omitted if *Off*, and present if *On* or *Any*.

.. note::

  Code object V2 to V3 cannot represent *Any* and treats it the same as
  *On*.

.. _amdgpu-embedding-bundled-objects:

Embedding Bundled Code Objects
------------------------------

AMDGPU supports the HIP and OpenMP languages that perform code object embedding
as described in `Clang Offload Bundler
<https://clang.llvm.org/docs/ClangOffloadBundler.html>`_.

.. note::

  The target ID syntax used for code object V2 to V3 for a bundle entry ID
  differs from that used elsewhere. See :ref:`amdgpu-target-id-v2-v3`.

.. _amdgpu-address-spaces:

Address Spaces
--------------

The AMDGPU architecture supports a number of memory address spaces. The address
space names use the OpenCL standard names, with some additions.

The AMDGPU address spaces correspond to target architecture specific LLVM
address space numbers used in LLVM IR.

The AMDGPU address spaces are described in
:ref:`amdgpu-address-spaces-table`. Only 64-bit process address spaces are
supported for the ``amdgcn`` target.

  .. table:: AMDGPU Address Spaces
     :name: amdgpu-address-spaces-table

     ================================= =============== =========== ================ ======= ============================
     ..                                                                             64-Bit Process Address Space
     --------------------------------- --------------- ----------- ---------------- ------------------------------------
     Address Space Name                LLVM IR Address HSA Segment Hardware         Address NULL Value
                                       Space Number    Name        Name             Size
     ================================= =============== =========== ================ ======= ============================
     Generic                           0               flat        flat             64      0x0000000000000000
     Global                            1               global      global           64      0x0000000000000000
     Region                            2               N/A         GDS              32      *not implemented for AMDHSA*
     Local                             3               group       LDS              32      0xFFFFFFFF
     Constant                          4               constant    *same as global* 64      0x0000000000000000
     Private                           5               private     scratch          32      0xFFFFFFFF
     Constant 32-bit                   6               *TODO*                               0x00000000
     Buffer Fat Pointer (experimental) 7               *TODO*
     ================================= =============== =========== ================ ======= ============================

**Generic**
  The generic address space is supported unless the *Target Properties* column
  of :ref:`amdgpu-processor-table` specifies *Does not support generic address
  space*.

  The generic address space uses the hardware flat address support for two
  fixed ranges of virtual addresses (the private and local apertures), which
  are outside the range of addressable global memory, to map from a flat
  address to a private or local address. This uses FLAT instructions that can
  take a flat address and access global, private (scratch), or group (LDS)
  memory depending on whether the address is within one of the aperture
  ranges.

  Flat access to scratch requires hardware aperture setup and setup in the
  kernel prologue (see :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`). Flat
  access to LDS requires hardware aperture setup and M0 (GFX7-GFX8) register
  setup (see :ref:`amdgpu-amdhsa-kernel-prolog-m0`).
  To convert between a private or group address space address (termed a
  segment address) and a flat address, the base address of the corresponding
  aperture can be used. For GFX7-GFX8 these are available in the
  :ref:`amdgpu-amdhsa-hsa-aql-queue`, the address of which can be obtained
  with the Queue Ptr SGPR (see
  :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). For GFX9-GFX10 the
  aperture base addresses are directly available as inline constant registers
  ``SRC_SHARED_BASE/LIMIT`` and ``SRC_PRIVATE_BASE/LIMIT``. In 64-bit address
  mode the aperture sizes are 2^32 bytes and the base is aligned to 2^32,
  which makes it easier to convert between flat and segment addresses.

  A global address space address has the same value when used as a flat
  address, so no conversion is needed.

**Global and Constant**
  The global and constant address spaces both use global virtual addresses,
  which are the same virtual address space used by the CPU. However, some
  virtual addresses may be accessible only to the CPU, some only to the GPU,
  and some to both.

  Using the constant address space indicates that the data will not change
  during the execution of the kernel. This allows scalar read instructions to
  be used. Because the constant address space can only be modified on the host
  side, a generic pointer loaded from the constant address space can safely be
  assumed to be a global pointer, since only the device global memory is
  visible and managed on the host side. The vector and scalar L1 caches are
  invalidated of volatile data before each kernel dispatch execution to allow
  constant memory to change values between kernel dispatches.

**Region**
  The region address space uses the hardware Global Data Store (GDS). All
  wavefronts executing on the same device will access the same memory for any
  given region address. However, the same region address accessed by
  wavefronts executing on different devices will access different memory. It
  is higher performance than global memory. It is allocated by the runtime.
  The data store (DS) instructions can be used to access it.

**Local**
  The local address space uses the hardware Local Data Store (LDS), which is
  automatically allocated when the hardware creates the wavefronts of a
  work-group and freed when all the wavefronts of a work-group have
  terminated. All wavefronts belonging to the same work-group will access the
  same memory for any given local address. However, the same local address
  accessed by wavefronts belonging to different work-groups will access
  different memory. It is higher performance than global memory. The data
  store (DS) instructions can be used to access it.

**Private**
  The private address space uses the hardware scratch memory support, which
  automatically allocates memory when it creates a wavefront and frees it when
  a wavefront terminates. The memory accessed by a lane of a wavefront for any
  given private address will be different to the memory accessed by another
  lane of the same or different wavefront for the same private address.

  If a kernel dispatch uses scratch, then the hardware allocates memory from a
  pool of backing memory allocated by the runtime for each wavefront. The
  lanes of the wavefront access this using dword (4 byte) interleaving. The
  mapping used from private address to backing memory address is:

  ``wavefront-scratch-base +
  ((private-address / 4) * wavefront-size * 4) +
  (wavefront-lane-id * 4) + (private-address % 4)``

  If each lane of a wavefront accesses the same private address, the
  interleaving results in adjacent dwords being accessed and hence requires
  fewer cache lines to be fetched.
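The private-to-backing-memory mapping above can be written as a small sketch (the function and parameter names are illustrative):

```python
def private_to_backing(wavefront_scratch_base, private_address, lane_id,
                       wavefront_size=64):
    """Backing-memory address for one lane's private address, using the
    dword (4 byte) interleaving formula above."""
    return (wavefront_scratch_base
            + (private_address // 4) * wavefront_size * 4
            + lane_id * 4
            + (private_address % 4))
```

Note how lanes accessing the same private address land in adjacent dwords: with a wavefront size of 64, lanes 0 and 1 at private address 0 map to backing offsets 0 and 4, while private addresses 4 and 5 for lane 0 map to offsets 256 and 257.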
  There are different ways that the wavefront scratch base address is
  determined by a wavefront (see
  :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).

  Scratch memory can be accessed in an interleaved manner using buffer
  instructions with the scratch buffer descriptor and per wavefront scratch
  offset, by the scratch instructions, or by flat instructions. Multi-dword
  access is not supported except by flat and scratch instructions in
  GFX9-GFX10.

**Constant 32-bit**
  *TODO*

**Buffer Fat Pointer**
  The buffer fat pointer is an experimental address space that is currently
  unsupported in the backend. It exposes a non-integral pointer that is in
  the future intended to support the modelling of 128-bit buffer descriptors
  plus a 32-bit offset into the buffer (in total encapsulating a 160-bit
  *pointer*), allowing normal LLVM load/store/atomic operations to be used to
  model the buffer descriptors used heavily in graphics workloads targeting
  the backend.

.. _amdgpu-memory-scopes:

Memory Scopes
-------------

This section provides the LLVM memory synchronization scopes supported by the
AMDGPU backend memory model when the target triple OS is ``amdhsa`` (see
:ref:`amdgpu-amdhsa-memory-model` and :ref:`amdgpu-target-triples`).

The memory model supported is based on the HSA memory model [HSA]_, which is
based in turn on HRF-indirect with scope inclusion [HRF]_. The happens-before
relation is transitive over the synchronizes-with relation independent of
scope, and synchronizes-with allows the memory scope instances to be inclusive
(see table :ref:`amdgpu-amdhsa-llvm-sync-scopes-table`).

This is different from the OpenCL [OpenCL]_ memory model, which does not have
scope inclusion and requires the memory scopes to match exactly. However, this
is conservatively correct for OpenCL.
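For example, an atomic update that only needs to synchronize with other threads on the same agent can name the ``agent`` sync scope on the operation. This is an illustrative LLVM IR fragment (typed-pointer syntax; the value names are hypothetical):

.. code:: llvm

  ; Increment a global counter, synchronizing only within the agent.
  %old = atomicrmw add i32 addrspace(1)* %counter, i32 1 syncscope("agent") seq_cst

  ; Release prior writes to threads in the same work-group only.
  fence syncscope("workgroup") release

Because the scopes are inclusive, an operation at a wider scope (such as the default ``system``) still synchronizes with these narrower-scope operations.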
  .. table:: AMDHSA LLVM Sync Scopes
     :name: amdgpu-amdhsa-llvm-sync-scopes-table

     ======================= ===================================================
     LLVM Sync Scope         Description
     ======================= ===================================================
     *none*                  The default: ``system``.

                             Synchronizes with, and participates in modification
                             and seq_cst total orderings with, other operations
                             (except image operations) for all address spaces
                             (except private, or generic that accesses private)
                             provided the other operation's sync scope is:

                             - ``system``.
                             - ``agent`` and executed by a thread on the same
                               agent.
                             - ``workgroup`` and executed by a thread in the
                               same work-group.
                             - ``wavefront`` and executed by a thread in the
                               same wavefront.

     ``agent``               Synchronizes with, and participates in modification
                             and seq_cst total orderings with, other operations
                             (except image operations) for all address spaces
                             (except private, or generic that accesses private)
                             provided the other operation's sync scope is:

                             - ``system`` or ``agent`` and executed by a thread
                               on the same agent.
                             - ``workgroup`` and executed by a thread in the
                               same work-group.
                             - ``wavefront`` and executed by a thread in the
                               same wavefront.

     ``workgroup``           Synchronizes with, and participates in modification
                             and seq_cst total orderings with, other operations
                             (except image operations) for all address spaces
                             (except private, or generic that accesses private)
                             provided the other operation's sync scope is:

                             - ``system``, ``agent`` or ``workgroup`` and
                               executed by a thread in the same work-group.
                             - ``wavefront`` and executed by a thread in the
                               same wavefront.

     ``wavefront``           Synchronizes with, and participates in modification
                             and seq_cst total orderings with, other operations
                             (except image operations) for all address spaces
                             (except private, or generic that accesses private)
                             provided the other operation's sync scope is:

                             - ``system``, ``agent``, ``workgroup`` or
                               ``wavefront`` and executed by a thread in the
                               same wavefront.

     ``singlethread``        Only synchronizes with and participates in
                             modification and seq_cst total orderings with,
                             other operations (except image operations) running
                             in the same thread for all address spaces (for
                             example, in signal handlers).

     ``one-as``              Same as ``system`` but only synchronizes with other
                             operations within the same address space.

     ``agent-one-as``        Same as ``agent`` but only synchronizes with other
                             operations within the same address space.

     ``workgroup-one-as``    Same as ``workgroup`` but only synchronizes with
                             other operations within the same address space.

     ``wavefront-one-as``    Same as ``wavefront`` but only synchronizes with
                             other operations within the same address space.

     ``singlethread-one-as`` Same as ``singlethread`` but only synchronizes with
                             other operations within the same address space.
     ======================= ===================================================

LLVM IR Intrinsics
------------------

The AMDGPU backend implements the following LLVM IR intrinsics.

*This section is WIP.*

.. TODO::

   List AMDGPU intrinsics.

LLVM IR Attributes
------------------

The AMDGPU backend supports the following LLVM IR attributes.
table:: AMDGPU LLVM IR Attributes
     :name: amdgpu-llvm-ir-attributes-table

     ======================================= ==========================================================
     LLVM Attribute                          Description
     ======================================= ==========================================================
     "amdgpu-flat-work-group-size"="min,max" Specify the minimum and maximum flat work group sizes that
                                             will be specified when the kernel is dispatched. Generated
                                             by the ``amdgpu_flat_work_group_size`` CLANG attribute
                                             [CLANG-ATTR]_.
     "amdgpu-implicitarg-num-bytes"="n"      Number of kernel argument bytes to add to the kernel
                                             argument block size for the implicit arguments. This
                                             varies by OS and language (for OpenCL see
                                             :ref:`opencl-kernel-implicit-arguments-appended-for-amdhsa-os-table`).
     "amdgpu-num-sgpr"="n"                   Specifies the number of SGPRs to use. Generated by
                                             the ``amdgpu_num_sgpr`` CLANG attribute [CLANG-ATTR]_.
     "amdgpu-num-vgpr"="n"                   Specifies the number of VGPRs to use. Generated by the
                                             ``amdgpu_num_vgpr`` CLANG attribute [CLANG-ATTR]_.
     "amdgpu-waves-per-eu"="m,n"             Specify the minimum and maximum number of waves per
                                             execution unit. Generated by the ``amdgpu_waves_per_eu``
                                             CLANG attribute [CLANG-ATTR]_.
     "amdgpu-ieee" true/false.               Specify whether the function expects the IEEE field of the
                                             mode register to be set on entry. Overrides the default
                                             for the calling convention.
     "amdgpu-dx10-clamp" true/false.         Specify whether the function expects the DX10_CLAMP field
                                             of the mode register to be set on entry. Overrides the
                                             default for the calling convention.
     ======================================= ==========================================================

..
_amdgpu-elf-code-object:

ELF Code Object
===============

The AMDGPU backend generates a standard ELF [ELF]_ relocatable code object that
can be linked by ``lld`` to produce a standard ELF shared code object which can
be loaded and executed on an AMDGPU target.

.. _amdgpu-elf-header:

Header
------

The AMDGPU backend uses the following ELF header:

  .. table:: AMDGPU ELF Header
     :name: amdgpu-elf-header-table

     ========================== ===============================
     Field                      Value
     ========================== ===============================
     ``e_ident[EI_CLASS]``      ``ELFCLASS64``
     ``e_ident[EI_DATA]``       ``ELFDATA2LSB``
     ``e_ident[EI_OSABI]``      - ``ELFOSABI_NONE``
                                - ``ELFOSABI_AMDGPU_HSA``
                                - ``ELFOSABI_AMDGPU_PAL``
                                - ``ELFOSABI_AMDGPU_MESA3D``
     ``e_ident[EI_ABIVERSION]`` - ``ELFABIVERSION_AMDGPU_HSA_V2``
                                - ``ELFABIVERSION_AMDGPU_HSA_V3``
                                - ``ELFABIVERSION_AMDGPU_HSA_V4``
                                - ``ELFABIVERSION_AMDGPU_PAL``
                                - ``ELFABIVERSION_AMDGPU_MESA3D``
     ``e_type``                 - ``ET_REL``
                                - ``ET_DYN``
     ``e_machine``              ``EM_AMDGPU``
     ``e_entry``                0
     ``e_flags``                See :ref:`amdgpu-elf-header-e_flags-v2-table`,
                                :ref:`amdgpu-elf-header-e_flags-table-v3`,
                                and :ref:`amdgpu-elf-header-e_flags-table-v4`
     ========================== ===============================

..

  .. table:: AMDGPU ELF Header Enumeration Values
     :name: amdgpu-elf-header-enumeration-values-table

     =============================== =====
     Name                            Value
     =============================== =====
     ``EM_AMDGPU``                   224
     ``ELFOSABI_NONE``               0
     ``ELFOSABI_AMDGPU_HSA``         64
     ``ELFOSABI_AMDGPU_PAL``         65
     ``ELFOSABI_AMDGPU_MESA3D``      66
     ``ELFABIVERSION_AMDGPU_HSA_V2`` 0
     ``ELFABIVERSION_AMDGPU_HSA_V3`` 1
     ``ELFABIVERSION_AMDGPU_HSA_V4`` 2
     ``ELFABIVERSION_AMDGPU_PAL``    0
     ``ELFABIVERSION_AMDGPU_MESA3D`` 0
     =============================== =====

``e_ident[EI_CLASS]``
  The ELF class is:

  * ``ELFCLASS32`` for ``r600`` architecture.

  * ``ELFCLASS64`` for ``amdgcn`` architecture which only supports 64-bit
    process address space applications.

``e_ident[EI_DATA]``
  All AMDGPU targets use ``ELFDATA2LSB`` for little-endian byte ordering.

``e_ident[EI_OSABI]``
  One of the following AMDGPU target architecture specific OS ABIs
  (see :ref:`amdgpu-os`):

  * ``ELFOSABI_NONE`` for *unknown* OS.

  * ``ELFOSABI_AMDGPU_HSA`` for ``amdhsa`` OS.

  * ``ELFOSABI_AMDGPU_PAL`` for ``amdpal`` OS.

  * ``ELFOSABI_AMDGPU_MESA3D`` for ``mesa3d`` OS.

``e_ident[EI_ABIVERSION]``
  The ABI version of the AMDGPU target architecture specific OS ABI to which
  the code object conforms:

  * ``ELFABIVERSION_AMDGPU_HSA_V2`` is used to specify the version of AMD HSA
    runtime ABI for code object V2. Specify using the Clang option
    ``-mcode-object-version=2``.

  * ``ELFABIVERSION_AMDGPU_HSA_V3`` is used to specify the version of AMD HSA
    runtime ABI for code object V3. Specify using the Clang option
    ``-mcode-object-version=3``. This is the default code object
    version if not specified.

  * ``ELFABIVERSION_AMDGPU_HSA_V4`` is used to specify the version of AMD HSA
    runtime ABI for code object V4. Specify using the Clang option
    ``-mcode-object-version=4``.
  * ``ELFABIVERSION_AMDGPU_PAL`` is used to specify the version of AMD PAL
    runtime ABI.

  * ``ELFABIVERSION_AMDGPU_MESA3D`` is used to specify the version of AMD MESA
    3D runtime ABI.

``e_type``
  Can be one of the following values:

  ``ET_REL``
    The type produced by the AMDGPU backend compiler as it is a relocatable
    code object.

  ``ET_DYN``
    The type produced by the linker as it is a shared code object.

  The AMD HSA runtime loader requires an ``ET_DYN`` code object.

``e_machine``
  The value ``EM_AMDGPU`` is used for the machine for all processors supported
  by the ``r600`` and ``amdgcn`` architectures (see
  :ref:`amdgpu-processor-table`). The specific processor is specified in the
  ``NT_AMD_HSA_ISA_VERSION`` note record for code object V2 (see
  :ref:`amdgpu-note-records-v2`) and in the ``EF_AMDGPU_MACH`` bit field of the
  ``e_flags`` for code object V3 to V4 (see
  :ref:`amdgpu-elf-header-e_flags-table-v3` and
  :ref:`amdgpu-elf-header-e_flags-table-v4`).

``e_entry``
  The entry point is 0 as the entry points for individual kernels must be
  selected in order to invoke them through AQL packets.

``e_flags``
  The AMDGPU backend uses the following ELF header flags:

  .. table:: AMDGPU ELF Header ``e_flags`` for Code Object V2
     :name: amdgpu-elf-header-e_flags-v2-table

     ===================================== ===== =============================
     Name                                  Value Description
     ===================================== ===== =============================
     ``EF_AMDGPU_FEATURE_XNACK_V2``        0x01  Indicates if the ``xnack``
                                                 target feature is enabled for
                                                 all code contained in the
                                                 code object. If the processor
                                                 does not support the
                                                 ``xnack`` target feature then
                                                 it must be 0. See
                                                 :ref:`amdgpu-target-features`.
     ``EF_AMDGPU_FEATURE_TRAP_HANDLER_V2`` 0x02  Indicates if the trap handler
                                                 is enabled for all code
                                                 contained in the code object.
                                                 If the processor does not
                                                 support a trap handler then
                                                 it must be 0. See
                                                 :ref:`amdgpu-target-features`.
     ===================================== ===== =============================

  .. table:: AMDGPU ELF Header ``e_flags`` for Code Object V3
     :name: amdgpu-elf-header-e_flags-table-v3

     ================================= ===== =============================
     Name                              Value Description
     ================================= ===== =============================
     ``EF_AMDGPU_MACH``                0x0ff AMDGPU processor selection
                                             mask for
                                             ``EF_AMDGPU_MACH_xxx`` values
                                             defined in
                                             :ref:`amdgpu-ef-amdgpu-mach-table`.
     ``EF_AMDGPU_FEATURE_XNACK_V3``    0x100 Indicates if the ``xnack``
                                             target feature is enabled for
                                             all code contained in the
                                             code object. If the processor
                                             does not support the
                                             ``xnack`` target feature then
                                             it must be 0. See
                                             :ref:`amdgpu-target-features`.
     ``EF_AMDGPU_FEATURE_SRAMECC_V3``  0x200 Indicates if the ``sramecc``
                                             target feature is enabled for
                                             all code contained in the
                                             code object. If the processor
                                             does not support the
                                             ``sramecc`` target feature
                                             then it must be 0. See
                                             :ref:`amdgpu-target-features`.
     ================================= ===== =============================

  .. table:: AMDGPU ELF Header ``e_flags`` for Code Object V4
     :name: amdgpu-elf-header-e_flags-table-v4

     ============================================ ===== ===================================
     Name                                         Value Description
     ============================================ ===== ===================================
     ``EF_AMDGPU_FEATURE_XNACK_V4``               0x300 XNACK selection mask for
                                                        ``EF_AMDGPU_FEATURE_XNACK_*_V4``
                                                        values.
     ``EF_AMDGPU_FEATURE_XNACK_UNSUPPORTED_V4``   0x000 XNACK unsupported.
     ``EF_AMDGPU_FEATURE_XNACK_ANY_V4``           0x100 XNACK can have any value.
     ``EF_AMDGPU_FEATURE_XNACK_OFF_V4``           0x200 XNACK disabled.
     ``EF_AMDGPU_FEATURE_XNACK_ON_V4``            0x300 XNACK enabled.
     ``EF_AMDGPU_FEATURE_SRAMECC_V4``             0xc00 SRAMECC selection mask for
                                                        ``EF_AMDGPU_FEATURE_SRAMECC_*_V4``
                                                        values.
     ``EF_AMDGPU_FEATURE_SRAMECC_UNSUPPORTED_V4`` 0x000 SRAMECC unsupported.
     ``EF_AMDGPU_FEATURE_SRAMECC_ANY_V4``         0x400 SRAMECC can have any value.
     ``EF_AMDGPU_FEATURE_SRAMECC_OFF_V4``         0x800 SRAMECC disabled.
     ``EF_AMDGPU_FEATURE_SRAMECC_ON_V4``          0xc00 SRAMECC enabled.
     ============================================ ===== ===================================

  .. table:: AMDGPU ``EF_AMDGPU_MACH`` Values
     :name: amdgpu-ef-amdgpu-mach-table

     ==================================== ========== =============================
     Name                                 Value      Description (see
                                                     :ref:`amdgpu-processor-table`)
     ==================================== ========== =============================
     ``EF_AMDGPU_MACH_NONE``              0x000      *not specified*
     ``EF_AMDGPU_MACH_R600_R600``         0x001      ``r600``
     ``EF_AMDGPU_MACH_R600_R630``         0x002      ``r630``
     ``EF_AMDGPU_MACH_R600_RS880``        0x003      ``rs880``
     ``EF_AMDGPU_MACH_R600_RV670``        0x004      ``rv670``
     ``EF_AMDGPU_MACH_R600_RV710``        0x005      ``rv710``
     ``EF_AMDGPU_MACH_R600_RV730``        0x006      ``rv730``
     ``EF_AMDGPU_MACH_R600_RV770``        0x007      ``rv770``
     ``EF_AMDGPU_MACH_R600_CEDAR``        0x008      ``cedar``
     ``EF_AMDGPU_MACH_R600_CYPRESS``      0x009      ``cypress``
     ``EF_AMDGPU_MACH_R600_JUNIPER``      0x00a      ``juniper``
     ``EF_AMDGPU_MACH_R600_REDWOOD``      0x00b      ``redwood``
     ``EF_AMDGPU_MACH_R600_SUMO``         0x00c      ``sumo``
     ``EF_AMDGPU_MACH_R600_BARTS``        0x00d      ``barts``
     ``EF_AMDGPU_MACH_R600_CAICOS``       0x00e      ``caicos``
     ``EF_AMDGPU_MACH_R600_CAYMAN``       0x00f      ``cayman``
     ``EF_AMDGPU_MACH_R600_TURKS``        0x010      ``turks``
     *reserved*                           0x011 -    Reserved for ``r600``
                                          0x01f      architecture processors.
     ``EF_AMDGPU_MACH_AMDGCN_GFX600``     0x020      ``gfx600``
     ``EF_AMDGPU_MACH_AMDGCN_GFX601``     0x021      ``gfx601``
     ``EF_AMDGPU_MACH_AMDGCN_GFX700``     0x022      ``gfx700``
     ``EF_AMDGPU_MACH_AMDGCN_GFX701``     0x023      ``gfx701``
     ``EF_AMDGPU_MACH_AMDGCN_GFX702``     0x024      ``gfx702``
     ``EF_AMDGPU_MACH_AMDGCN_GFX703``     0x025      ``gfx703``
     ``EF_AMDGPU_MACH_AMDGCN_GFX704``     0x026      ``gfx704``
     *reserved*                           0x027      Reserved.
     ``EF_AMDGPU_MACH_AMDGCN_GFX801``     0x028      ``gfx801``
     ``EF_AMDGPU_MACH_AMDGCN_GFX802``     0x029      ``gfx802``
     ``EF_AMDGPU_MACH_AMDGCN_GFX803``     0x02a      ``gfx803``
     ``EF_AMDGPU_MACH_AMDGCN_GFX810``     0x02b      ``gfx810``
     ``EF_AMDGPU_MACH_AMDGCN_GFX900``     0x02c      ``gfx900``
     ``EF_AMDGPU_MACH_AMDGCN_GFX902``     0x02d      ``gfx902``
     ``EF_AMDGPU_MACH_AMDGCN_GFX904``     0x02e      ``gfx904``
     ``EF_AMDGPU_MACH_AMDGCN_GFX906``     0x02f      ``gfx906``
     ``EF_AMDGPU_MACH_AMDGCN_GFX908``     0x030      ``gfx908``
     ``EF_AMDGPU_MACH_AMDGCN_GFX909``     0x031      ``gfx909``
     ``EF_AMDGPU_MACH_AMDGCN_GFX90C``     0x032      ``gfx90c``
     ``EF_AMDGPU_MACH_AMDGCN_GFX1010``    0x033      ``gfx1010``
     ``EF_AMDGPU_MACH_AMDGCN_GFX1011``    0x034      ``gfx1011``
     ``EF_AMDGPU_MACH_AMDGCN_GFX1012``    0x035      ``gfx1012``
     ``EF_AMDGPU_MACH_AMDGCN_GFX1030``    0x036      ``gfx1030``
     ``EF_AMDGPU_MACH_AMDGCN_GFX1031``    0x037      ``gfx1031``
     ``EF_AMDGPU_MACH_AMDGCN_GFX1032``    0x038      ``gfx1032``
     ``EF_AMDGPU_MACH_AMDGCN_GFX1033``    0x039      ``gfx1033``
     ``EF_AMDGPU_MACH_AMDGCN_GFX602``     0x03a      ``gfx602``
     ``EF_AMDGPU_MACH_AMDGCN_GFX705``     0x03b      ``gfx705``
     ``EF_AMDGPU_MACH_AMDGCN_GFX805``     0x03c      ``gfx805``
     *reserved*                           0x03d      Reserved.
     ``EF_AMDGPU_MACH_AMDGCN_GFX1034``    0x03e      ``gfx1034``
     ``EF_AMDGPU_MACH_AMDGCN_GFX90A``     0x03f      ``gfx90a``
     *reserved*                           0x040      Reserved.
     *reserved*                           0x041      Reserved.
     ==================================== ========== =============================

Sections
--------

An AMDGPU target ELF code object has the standard ELF sections which include:

  .. table:: AMDGPU ELF Sections
     :name: amdgpu-elf-sections-table

     ================== ================ =================================
     Name               Type             Attributes
     ================== ================ =================================
     ``.bss``           ``SHT_NOBITS``   ``SHF_ALLOC`` + ``SHF_WRITE``
     ``.data``          ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
     ``.debug_``\ *\**  ``SHT_PROGBITS`` *none*
     ``.dynamic``       ``SHT_DYNAMIC``  ``SHF_ALLOC``
     ``.dynstr``        ``SHT_PROGBITS`` ``SHF_ALLOC``
     ``.dynsym``        ``SHT_PROGBITS`` ``SHF_ALLOC``
     ``.got``           ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
     ``.hash``          ``SHT_HASH``     ``SHF_ALLOC``
     ``.note``          ``SHT_NOTE``     *none*
     ``.rela``\ *name*  ``SHT_RELA``     *none*
     ``.rela.dyn``      ``SHT_RELA``     *none*
     ``.rodata``        ``SHT_PROGBITS`` ``SHF_ALLOC``
     ``.shstrtab``      ``SHT_STRTAB``   *none*
     ``.strtab``        ``SHT_STRTAB``   *none*
     ``.symtab``        ``SHT_SYMTAB``   *none*
     ``.text``          ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_EXECINSTR``
     ================== ================ =================================

These sections have their standard meanings (see [ELF]_) and are only generated
if needed.

``.debug``\ *\**
  The standard DWARF sections. See :ref:`amdgpu-dwarf-debug-information` for
  information on the DWARF produced by the AMDGPU backend.

``.dynamic``, ``.dynstr``, ``.dynsym``, ``.hash``
  The standard sections used by a dynamic loader.

``.note``
  See :ref:`amdgpu-note-records` for the note records supported by the AMDGPU
  backend.

``.rela``\ *name*, ``.rela.dyn``
  For relocatable code objects, *name* is the name of the section to which the
  relocation records apply. For example, ``.rela.text`` is the section name for
  relocation records associated with the ``.text`` section.

  For linked shared code objects, ``.rela.dyn`` contains all the relocation
  records from each of the relocatable code object's ``.rela``\ *name* sections.

  See :ref:`amdgpu-relocation-records` for the relocation records supported by
  the AMDGPU backend.

``.text``
  The executable machine code for the kernels and functions they call.
  Generated as position independent code. See :ref:`amdgpu-code-conventions`
  for information on conventions used in the ISA generation.

.. _amdgpu-note-records:

Note Records
------------

The AMDGPU backend code object contains ELF note records in the ``.note``
section. The set of generated notes and their semantics depend on the code
object version; see :ref:`amdgpu-note-records-v2` and
:ref:`amdgpu-note-records-v3-v4`.

As required by ``ELFCLASS32`` and ``ELFCLASS64``, minimal zero-byte padding
must be generated after the ``name`` field to ensure the ``desc`` field is 4
byte aligned. In addition, minimal zero-byte padding must be generated to
ensure the ``desc`` field size is a multiple of 4 bytes. The ``sh_addralign``
field of the ``.note`` section must be at least 4 to indicate at least 8 byte
alignment.

.. _amdgpu-note-records-v2:

Code Object V2 Note Records
~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. warning::

   Code object V2 is not the default code object version emitted by
   this version of LLVM.

The AMDGPU backend code object uses the following ELF note record in the
``.note`` section when compiling for code object V2.

The note record vendor field is "AMD".

Additional note records may be present, but any which are not documented here
are deprecated and should not be used.

  ..
table:: AMDGPU Code Object V2 ELF Note Records
     :name: amdgpu-elf-note-records-v2-table

     ===== ===================================== ======================================
     Name  Type                                  Description
     ===== ===================================== ======================================
     "AMD" ``NT_AMD_HSA_CODE_OBJECT_VERSION``    Code object version.
     "AMD" ``NT_AMD_HSA_HSAIL``                  HSAIL properties generated by the
                                                 HSAIL Finalizer and not the LLVM
                                                 compiler.
     "AMD" ``NT_AMD_HSA_ISA_VERSION``            Target ISA version.
     "AMD" ``NT_AMD_HSA_METADATA``               Metadata null terminated string in
                                                 YAML [YAML]_ textual format.
     "AMD" ``NT_AMD_HSA_ISA_NAME``               Target ISA name.
     ===== ===================================== ======================================

..

  .. table:: AMDGPU Code Object V2 ELF Note Record Enumeration Values
     :name: amdgpu-elf-note-record-enumeration-values-v2-table

     ===================================== =====
     Name                                  Value
     ===================================== =====
     ``NT_AMD_HSA_CODE_OBJECT_VERSION``    1
     ``NT_AMD_HSA_HSAIL``                  2
     ``NT_AMD_HSA_ISA_VERSION``            3
     *reserved*                            4-9
     ``NT_AMD_HSA_METADATA``               10
     ``NT_AMD_HSA_ISA_NAME``               11
     ===================================== =====

``NT_AMD_HSA_CODE_OBJECT_VERSION``
  Specifies the code object version number. The description field has the
  following layout:

  .. code:: c

     struct amdgpu_hsa_note_code_object_version_s {
       uint32_t major_version;
       uint32_t minor_version;
     };

  The ``major_version`` has a value less than or equal to 2.

``NT_AMD_HSA_HSAIL``
  Specifies the HSAIL properties used by the HSAIL Finalizer. The description
  field has the following layout:

  .. code:: c

     struct amdgpu_hsa_note_hsail_s {
       uint32_t hsail_major_version;
       uint32_t hsail_minor_version;
       uint8_t profile;
       uint8_t machine_model;
       uint8_t default_float_round;
     };

``NT_AMD_HSA_ISA_VERSION``
  Specifies the target ISA version. The description field has the following
  layout:

  .. code:: c

     struct amdgpu_hsa_note_isa_s {
       uint16_t vendor_name_size;
       uint16_t architecture_name_size;
       uint32_t major;
       uint32_t minor;
       uint32_t stepping;
       char vendor_and_architecture_name[1];
     };

  ``vendor_name_size`` and ``architecture_name_size`` are the length of the
  vendor and architecture names respectively, including the NUL character.

  ``vendor_and_architecture_name`` contains the NUL terminated string for the
  vendor, immediately followed by the NUL terminated string for the
  architecture.

  This note record is used by the HSA runtime loader.

  Code object V2 only supports a limited number of processors and has fixed
  settings for target features. See
  :ref:`amdgpu-elf-note-record-supported_processors-v2-table` for a list of
  processors and the corresponding target ID. In the table the note record ISA
  name is a concatenation of the vendor name, architecture name, major, minor,
  and stepping separated by a ":".

  The target ID column shows the processor name and fixed target features used
  by the LLVM compiler. The LLVM compiler does not generate a
  ``NT_AMD_HSA_HSAIL`` note record.

  A code object generated by the Finalizer also uses code object V2 and always
  generates a ``NT_AMD_HSA_HSAIL`` note record. The processor name and
  ``sramecc`` target feature is as shown in
  :ref:`amdgpu-elf-note-record-supported_processors-v2-table` but the ``xnack``
  target feature is specified by the ``EF_AMDGPU_FEATURE_XNACK_V2`` ``e_flags``
  bit.
``NT_AMD_HSA_ISA_NAME``
  Specifies the target ISA name as a non-NUL terminated string.

  This note record is not used by the HSA runtime loader.

  See the ``NT_AMD_HSA_ISA_VERSION`` note record description of the code object
  V2's limited support of processors and fixed settings for target features.

  See :ref:`amdgpu-elf-note-record-supported_processors-v2-table` for a mapping
  from the string to the corresponding target ID. If the ``xnack`` target
  feature is supported and enabled, the string produced by the LLVM compiler
  may have ``+xnack`` appended. The Finalizer did not do the appending and
  instead used the ``EF_AMDGPU_FEATURE_XNACK_V2`` ``e_flags`` bit.

``NT_AMD_HSA_METADATA``
  Specifies extensible metadata associated with the code objects executed on
  HSA [HSA]_ compatible runtimes (see :ref:`amdgpu-os`). It is required when
  the target triple OS is ``amdhsa`` (see :ref:`amdgpu-target-triples`). See
  :ref:`amdgpu-amdhsa-code-object-metadata-v2` for the syntax of the code
  object metadata string.

  ..
table:: AMDGPU Code Object V2 Supported Processors and Fixed Target Feature Settings
     :name: amdgpu-elf-note-record-supported_processors-v2-table

     ===================== ==========================
     Note Record ISA Name  Target ID
     ===================== ==========================
     ``AMD:AMDGPU:6:0:0``  ``gfx600``
     ``AMD:AMDGPU:6:0:1``  ``gfx601``
     ``AMD:AMDGPU:6:0:2``  ``gfx602``
     ``AMD:AMDGPU:7:0:0``  ``gfx700``
     ``AMD:AMDGPU:7:0:1``  ``gfx701``
     ``AMD:AMDGPU:7:0:2``  ``gfx702``
     ``AMD:AMDGPU:7:0:3``  ``gfx703``
     ``AMD:AMDGPU:7:0:4``  ``gfx704``
     ``AMD:AMDGPU:7:0:5``  ``gfx705``
     ``AMD:AMDGPU:8:0:0``  ``gfx802``
     ``AMD:AMDGPU:8:0:1``  ``gfx801:xnack+``
     ``AMD:AMDGPU:8:0:2``  ``gfx802``
     ``AMD:AMDGPU:8:0:3``  ``gfx803``
     ``AMD:AMDGPU:8:0:4``  ``gfx803``
     ``AMD:AMDGPU:8:0:5``  ``gfx805``
     ``AMD:AMDGPU:8:1:0``  ``gfx810:xnack+``
     ``AMD:AMDGPU:9:0:0``  ``gfx900:xnack-``
     ``AMD:AMDGPU:9:0:1``  ``gfx900:xnack+``
     ``AMD:AMDGPU:9:0:2``  ``gfx902:xnack-``
     ``AMD:AMDGPU:9:0:3``  ``gfx902:xnack+``
     ``AMD:AMDGPU:9:0:4``  ``gfx904:xnack-``
     ``AMD:AMDGPU:9:0:5``  ``gfx904:xnack+``
     ``AMD:AMDGPU:9:0:6``  ``gfx906:sramecc-:xnack-``
     ``AMD:AMDGPU:9:0:7``  ``gfx906:sramecc-:xnack+``
     ``AMD:AMDGPU:9:0:12`` ``gfx90c:xnack-``
     ===================== ==========================

.. _amdgpu-note-records-v3-v4:

Code Object V3 to V4 Note Records
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The AMDGPU backend code object uses the following ELF note record in the
``.note`` section when compiling for code object V3 to V4.

The note record vendor field is "AMDGPU".

Additional note records may be present, but any which are not documented here
are deprecated and should not be used.

  .. table:: AMDGPU Code Object V3 to V4 ELF Note Records
     :name: amdgpu-elf-note-records-table-v3-v4

     ======== ============================== ======================================
     Name     Type                           Description
     ======== ============================== ======================================
     "AMDGPU" ``NT_AMDGPU_METADATA``         Metadata in Message Pack [MsgPack]_
                                             binary format.
     ======== ============================== ======================================

..

  .. table:: AMDGPU Code Object V3 to V4 ELF Note Record Enumeration Values
     :name: amdgpu-elf-note-record-enumeration-values-table-v3-v4

     ============================== =====
     Name                           Value
     ============================== =====
     *reserved*                     0-31
     ``NT_AMDGPU_METADATA``         32
     ============================== =====

``NT_AMDGPU_METADATA``
  Specifies extensible metadata associated with an AMDGPU code object. It is
  encoded as a map in the Message Pack [MsgPack]_ binary data format. See
  :ref:`amdgpu-amdhsa-code-object-metadata-v3` and
  :ref:`amdgpu-amdhsa-code-object-metadata-v4` for the map keys defined for the
  ``amdhsa`` OS.

.. _amdgpu-symbols:

Symbols
-------

Symbols include the following:

  .. table:: AMDGPU ELF Symbols
     :name: amdgpu-elf-symbols-table

     ===================== ================== ================ ======================
     Name                  Type               Section          Description
     ===================== ================== ================ ======================
     *link-name*           ``STT_OBJECT``     - ``.data``      Global variable
                                              - ``.rodata``
                                              - ``.bss``
     *link-name*\ ``.kd``  ``STT_OBJECT``     - ``.rodata``    Kernel descriptor
     *link-name*           ``STT_FUNC``       - ``.text``      Kernel entry point
     *link-name*           ``STT_OBJECT``     - SHN_AMDGPU_LDS Global variable in LDS
     ===================== ================== ================ ======================

Global variable
  Global variables both used and defined by the compilation unit.

  If the symbol is defined in the compilation unit then it is allocated in the
  appropriate section according to whether it has initialized data or is
  readonly.

  If the symbol is external then its section is ``STN_UNDEF`` and the loader
  will resolve relocations using the definition provided by another code object
  or explicitly defined by the runtime.

  If the symbol resides in local/group memory (LDS) then its section is the
  special processor specific section name ``SHN_AMDGPU_LDS``, and the
  ``st_value`` field describes alignment requirements as it does for common
  symbols.

  .. TODO::

     Add description of linked shared object symbols. Seems undefined symbols
     are marked as STT_NOTYPE.

Kernel descriptor
  Every HSA kernel has an associated kernel descriptor. It is the address of
  the kernel descriptor that is used in the AQL dispatch packet used to invoke
  the kernel, not the kernel entry point. The layout of the HSA kernel
  descriptor is defined in :ref:`amdgpu-amdhsa-kernel-descriptor`.

Kernel entry point
  Every HSA kernel also has a symbol for its machine code entry point.

.. _amdgpu-relocation-records:

Relocation Records
------------------

The AMDGPU backend generates ``Elf64_Rela`` relocation records. Supported
relocatable fields are:

``word32``
  This specifies a 32-bit field occupying 4 bytes with arbitrary byte
  alignment. These values use the same byte order as other word values in the
  AMDGPU architecture.

``word64``
  This specifies a 64-bit field occupying 8 bytes with arbitrary byte
  alignment. These values use the same byte order as other word values in the
  AMDGPU architecture.

The following notations are used for specifying relocation calculations:

**A**
  Represents the addend used to compute the value of the relocatable field.

**G**
  Represents the offset into the global offset table at which the relocation
  entry's symbol will reside during execution.

**GOT**
  Represents the address of the global offset table.

**P**
  Represents the place (section offset for ``et_rel`` or address for
  ``et_dyn``) of the storage unit being relocated (computed using
  ``r_offset``).

**S**
  Represents the value of the symbol whose index resides in the relocation
  entry. Relocations not using this must specify a symbol index of
  ``STN_UNDEF``.

**B**
  Represents the base address of a loaded executable or shared object which is
  the difference between the ELF address and the actual load address.
  Relocations using this are only valid in executable or shared objects.

The following relocation types are supported:

  .. table:: AMDGPU ELF Relocation Records
     :name: amdgpu-elf-relocation-records-table

     ========================== ======= ===== ========== ==============================
     Relocation Type            Kind    Value Field      Calculation
     ========================== ======= ===== ========== ==============================
     ``R_AMDGPU_NONE``                  0     *none*     *none*
     ``R_AMDGPU_ABS32_LO``      Static, 1     ``word32`` (S + A) & 0xFFFFFFFF
                                Dynamic
     ``R_AMDGPU_ABS32_HI``      Static, 2     ``word32`` (S + A) >> 32
                                Dynamic
     ``R_AMDGPU_ABS64``         Static, 3     ``word64`` S + A
                                Dynamic
     ``R_AMDGPU_REL32``         Static  4     ``word32`` S + A - P
     ``R_AMDGPU_REL64``         Static  5     ``word64`` S + A - P
     ``R_AMDGPU_ABS32``         Static, 6     ``word32`` S + A
                                Dynamic
     ``R_AMDGPU_GOTPCREL``      Static  7     ``word32`` G + GOT + A - P
     ``R_AMDGPU_GOTPCREL32_LO`` Static  8     ``word32`` (G + GOT + A - P) & 0xFFFFFFFF
     ``R_AMDGPU_GOTPCREL32_HI`` Static  9     ``word32`` (G + GOT + A - P) >> 32
     ``R_AMDGPU_REL32_LO``      Static  10    ``word32`` (S + A - P) & 0xFFFFFFFF
     ``R_AMDGPU_REL32_HI``      Static  11    ``word32`` (S + A - P) >> 32
     *reserved*                         12
     ``R_AMDGPU_RELATIVE64``    Dynamic 13    ``word64`` B + A
     ========================== ======= ===== ========== ==============================

``R_AMDGPU_ABS32_LO`` and ``R_AMDGPU_ABS32_HI`` are only supported by
the ``mesa3d`` OS, which does not support ``R_AMDGPU_ABS64``.

There is no current OS loader support for 32-bit programs and so
``R_AMDGPU_ABS32`` is not used.

.. _amdgpu-loaded-code-object-path-uniform-resource-identifier:

Loaded Code Object Path Uniform Resource Identifier (URI)
---------------------------------------------------------

The AMD GPU code object loader represents the path of the ELF shared object
from which the code object was loaded as a textual Uniform Resource Identifier
(URI). Note that the code object is the in memory loaded relocated form of the
ELF shared object. Multiple code objects may be loaded at different memory
addresses in the same process from the same ELF shared object.

The loaded code object path URI syntax is defined by the following BNF syntax:

.. code::

  code_object_uri ::== file_uri | memory_uri
  file_uri        ::== "file://" file_path [ range_specifier ]
  memory_uri      ::== "memory://" process_id range_specifier
  range_specifier ::== [ "#" | "?" ] "offset=" number "&" "size=" number
  file_path       ::== URI_ENCODED_OS_FILE_PATH
  process_id      ::== DECIMAL_NUMBER
  number          ::== HEX_NUMBER | DECIMAL_NUMBER | OCTAL_NUMBER

**number**
  Is a C integral literal where hexadecimal values are prefixed by "0x" or
  "0X", and octal values by "0".

**file_path**
  Is the file's path specified as a URI encoded UTF-8 string. In URI encoding,
  every character that is not in the regular expression ``[a-zA-Z0-9/_.~-]`` is
  encoded as two uppercase hexadecimal digits preceded by "%". Directories in
  the path are separated by "/".

**offset**
  Is a 0-based byte offset to the start of the code object. For a file URI, it
  is from the start of the file specified by the ``file_path``, and if omitted
  defaults to 0. For a memory URI, it is the memory address and is required.

**size**
  Is the number of bytes in the code object. For a file URI, if omitted it
  defaults to the size of the file. It is required for a memory URI.

**process_id**
  Is the identity of the process owning the memory. For Linux it is the C
  unsigned integral decimal literal for the process ID (PID).

For example:

.. code::

  file:///dir1/dir2/file1
  file:///dir3/dir4/file2#offset=0x2000&size=3000
  memory://1234#offset=0x20000&size=3000

..
_amdgpu-dwarf-debug-information: 1625 1626DWARF Debug Information 1627======================= 1628 1629.. warning:: 1630 1631 This section describes **provisional support** for AMDGPU DWARF [DWARF]_ that 1632 is not currently fully implemented and is subject to change. 1633 1634AMDGPU generates DWARF [DWARF]_ debugging information ELF sections (see 1635:ref:`amdgpu-elf-code-object`) which contain information that maps the code 1636object executable code and data to the source language constructs. It can be 1637used by tools such as debuggers and profilers. It uses features defined in 1638:doc:`AMDGPUDwarfExtensionsForHeterogeneousDebugging` that are made available in 1639DWARF Version 4 and DWARF Version 5 as an LLVM vendor extension. 1640 1641This section defines the AMDGPU target architecture specific DWARF mappings. 1642 1643.. _amdgpu-dwarf-register-identifier: 1644 1645Register Identifier 1646------------------- 1647 1648This section defines the AMDGPU target architecture register numbers used in 1649DWARF operation expressions (see DWARF Version 5 section 2.5 and 1650:ref:`amdgpu-dwarf-operation-expressions`) and Call Frame Information 1651instructions (see DWARF Version 5 section 6.4 and 1652:ref:`amdgpu-dwarf-call-frame-information`). 1653 1654A single code object can contain code for kernels that have different wavefront 1655sizes. The vector registers and some scalar registers are based on the wavefront 1656size. AMDGPU defines distinct DWARF registers for each wavefront size. This 1657simplifies the consumer of the DWARF so that each register has a fixed size, 1658rather than being dynamic according to the wavefront size mode. Similarly, 1659distinct DWARF registers are defined for those registers that vary in size 1660according to the process address size. This allows a consumer to treat a 1661specific AMDGPU processor as a single architecture regardless of how it is 1662configured at run time. 
The compiler explicitly specifies the DWARF registers
that match the mode in which the code it generates will execute.

DWARF registers are encoded as numbers, which are mapped to architecture
registers. The mapping for AMDGPU is defined in
:ref:`amdgpu-dwarf-register-mapping-table`. All AMDGPU targets use the same
mapping.

.. table:: AMDGPU DWARF Register Mapping
   :name: amdgpu-dwarf-register-mapping-table

   ============== ================= ======== ==================================
   DWARF Register AMDGPU Register   Bit Size Description
   ============== ================= ======== ==================================
   0              PC_32             32       Program Counter (PC) when
                                             executing in a 32-bit process
                                             address space. Used in the CFI to
                                             describe the PC of the calling
                                             frame.
   1              EXEC_MASK_32      32       Execution Mask Register when
                                             executing in wavefront 32 mode.
   2-15           *Reserved*                 *Reserved for highly accessed
                                             registers using DWARF shortcut.*
   16             PC_64             64       Program Counter (PC) when
                                             executing in a 64-bit process
                                             address space. Used in the CFI to
                                             describe the PC of the calling
                                             frame.
   17             EXEC_MASK_64      64       Execution Mask Register when
                                             executing in wavefront 64 mode.
   18-31          *Reserved*                 *Reserved for highly accessed
                                             registers using DWARF shortcut.*
   32-95          SGPR0-SGPR63      32       Scalar General Purpose
                                             Registers.
   96-127         *Reserved*                 *Reserved for frequently accessed
                                             registers using DWARF 1-byte ULEB.*
   128            STATUS            32       Status Register.
   129-511        *Reserved*                 *Reserved for future Scalar
                                             Architectural Registers.*
   512            VCC_32            32       Vector Condition Code Register
                                             when executing in wavefront 32
                                             mode.
   513-767        *Reserved*                 *Reserved for future Vector
                                             Architectural Registers when
                                             executing in wavefront 32 mode.*
   768            VCC_64            64       Vector Condition Code Register
                                             when executing in wavefront 64
                                             mode.
   769-1023       *Reserved*                 *Reserved for future Vector
                                             Architectural Registers when
                                             executing in wavefront 64 mode.*
   1024-1087      *Reserved*                 *Reserved for padding.*
   1088-1129      SGPR64-SGPR105    32       Scalar General Purpose Registers.
   1130-1535      *Reserved*                 *Reserved for future Scalar
                                             General Purpose Registers.*
   1536-1791      VGPR0-VGPR255     32*32    Vector General Purpose Registers
                                             when executing in wavefront 32
                                             mode.
   1792-2047      *Reserved*                 *Reserved for future Vector
                                             General Purpose Registers when
                                             executing in wavefront 32 mode.*
   2048-2303      AGPR0-AGPR255     32*32    Vector Accumulation Registers
                                             when executing in wavefront 32
                                             mode.
   2304-2559      *Reserved*                 *Reserved for future Vector
                                             Accumulation Registers when
                                             executing in wavefront 32 mode.*
   2560-2815      VGPR0-VGPR255     64*32    Vector General Purpose Registers
                                             when executing in wavefront 64
                                             mode.
   2816-3071      *Reserved*                 *Reserved for future Vector
                                             General Purpose Registers when
                                             executing in wavefront 64 mode.*
   3072-3327      AGPR0-AGPR255     64*32    Vector Accumulation Registers
                                             when executing in wavefront 64
                                             mode.
   3328-3583      *Reserved*                 *Reserved for future Vector
                                             Accumulation Registers when
                                             executing in wavefront 64 mode.*
   ============== ================= ======== ==================================

The vector registers are represented as the full size for the wavefront. They
are organized as consecutive dwords (32-bits), one per lane, with the dword at
the least significant bit position corresponding to lane 0 and so forth. DWARF
location expressions involving the ``DW_OP_LLVM_offset`` and
``DW_OP_LLVM_push_lane`` operations are used to select the part of the vector
register corresponding to the lane that is executing the current thread of
execution in languages that are implemented using a SIMD or SIMT execution
model.
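As an illustration of the register mapping table above, the following sketch
computes DWARF register numbers for scalar and vector registers. The helper
name and interface are invented for this example; it is not part of LLVM and
covers only the SGPR/VGPR/AGPR ranges from the table.

```python
def amdgpu_dwarf_regnum(kind: str, index: int, wave64: bool = True) -> int:
    """Map an architectural register to its DWARF register number,
    following the AMDGPU DWARF register mapping table (sketch only)."""
    if kind == "sgpr":
        if 0 <= index <= 63:
            return 32 + index            # SGPR0-SGPR63 -> 32-95
        if 64 <= index <= 105:
            return 1088 + (index - 64)   # SGPR64-SGPR105 -> 1088-1129
        raise ValueError("SGPR index out of range")
    if not 0 <= index <= 255:
        raise ValueError("vector register index out of range")
    if kind == "vgpr":
        # Distinct ranges per wavefront size: 1536-1791 (wave32),
        # 2560-2815 (wave64).
        return (2560 if wave64 else 1536) + index
    if kind == "agpr":
        # 2048-2303 (wave32), 3072-3327 (wave64).
        return (3072 if wave64 else 2048) + index
    raise ValueError(f"unknown register kind: {kind}")
```

Because each wavefront mode has its own register numbers, the consumer never
needs to know the run-time wavefront size to decode a DWARF register.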
If the wavefront size is 32 lanes then the wavefront 32 mode register
definitions are used. If the wavefront size is 64 lanes then the wavefront 64
mode register definitions are used. Some AMDGPU targets support executing in
both wavefront 32 and wavefront 64 mode. The register definitions corresponding
to the wavefront mode of the generated code will be used.

If code is generated to execute in a 32-bit process address space, then the
32-bit process address space register definitions are used. If code is generated
to execute in a 64-bit process address space, then the 64-bit process address
space register definitions are used. The ``amdgcn`` target only supports the
64-bit process address space.

.. _amdgpu-dwarf-address-class-identifier:

Address Class Identifier
------------------------

The DWARF address class represents the source language memory space. See DWARF
Version 5 section 2.12 which is updated by the *DWARF Extensions For
Heterogeneous Debugging* section :ref:`amdgpu-dwarf-segment_addresses`.

The DWARF address class mapping used for AMDGPU is defined in
:ref:`amdgpu-dwarf-address-class-mapping-table`.

.. table:: AMDGPU DWARF Address Class Mapping
   :name: amdgpu-dwarf-address-class-mapping-table

   ========================= ====== =================
   DWARF                            AMDGPU
   -------------------------------- -----------------
   Address Class Name        Value  Address Space
   ========================= ====== =================
   ``DW_ADDR_none``          0x0000 Generic (Flat)
   ``DW_ADDR_LLVM_global``   0x0001 Global
   ``DW_ADDR_LLVM_constant`` 0x0002 Global
   ``DW_ADDR_LLVM_group``    0x0003 Local (group/LDS)
   ``DW_ADDR_LLVM_private``  0x0004 Private (Scratch)
   ``DW_ADDR_AMDGPU_region`` 0x8000 Region (GDS)
   ========================= ====== =================

The DWARF address class values defined in the *DWARF Extensions For
Heterogeneous Debugging* section :ref:`amdgpu-dwarf-segment_addresses` are used.

In addition, ``DW_ADDR_AMDGPU_region`` is encoded as a vendor extension. This is
available for use for the AMD extension for access to the hardware GDS memory
which is scratchpad memory allocated per device.

For AMDGPU if no ``DW_AT_address_class`` attribute is present, then the default
address class of ``DW_ADDR_none`` is used.

See :ref:`amdgpu-dwarf-address-space-identifier` for information on the AMDGPU
mapping of DWARF address classes to DWARF address spaces, including address size
and NULL value.

.. _amdgpu-dwarf-address-space-identifier:

Address Space Identifier
------------------------

DWARF address spaces correspond to target architecture specific linear
addressable memory areas. See DWARF Version 5 section 2.12 and *DWARF Extensions
For Heterogeneous Debugging* section :ref:`amdgpu-dwarf-segment_addresses`.

The DWARF address space mapping used for AMDGPU is defined in
:ref:`amdgpu-dwarf-address-space-mapping-table`.

.. table:: AMDGPU DWARF Address Space Mapping
   :name: amdgpu-dwarf-address-space-mapping-table

   ======================================= ===== ======= ======== ================= =======================
   DWARF                                         AMDGPU                             Notes
   --------------------------------------- ----- ---------------- ----------------- -----------------------
   Address Space Name                      Value Address Bit Size Address Space
   --------------------------------------- ----- ------- -------- ----------------- -----------------------
   ..                                            64-bit  32-bit
                                                 process process
                                                 address address
                                                 space   space
   ======================================= ===== ======= ======== ================= =======================
   ``DW_ASPACE_none``                      0x00  64      32       Global            *default address space*
   ``DW_ASPACE_AMDGPU_generic``            0x01  64      32       Generic (Flat)
   ``DW_ASPACE_AMDGPU_region``             0x02  32      32       Region (GDS)
   ``DW_ASPACE_AMDGPU_local``              0x03  32      32       Local (group/LDS)
   *Reserved*                              0x04
   ``DW_ASPACE_AMDGPU_private_lane``       0x05  32      32       Private (Scratch) *focused lane*
   ``DW_ASPACE_AMDGPU_private_wave``       0x06  32      32       Private (Scratch) *unswizzled wavefront*
   ======================================= ===== ======= ======== ================= =======================

See :ref:`amdgpu-address-spaces` for information on the AMDGPU address spaces
including address size and NULL value.

The ``DW_ASPACE_none`` address space is the default target architecture address
space used in DWARF operations that do not specify an address space. It
therefore has to map to the global address space so that the ``DW_OP_addr*`` and
related operations can refer to addresses in the program code.

The ``DW_ASPACE_AMDGPU_generic`` address space allows location expressions to
specify the flat address space. If the address corresponds to an address in the
local address space, then it corresponds to the wavefront that is executing the
focused thread of execution.
If the address corresponds to an address in the
private address space, then it corresponds to the lane that is executing the
focused thread of execution for languages that are implemented using a SIMD or
SIMT execution model.

.. note::

   CUDA-like languages such as HIP that do not have address spaces in the
   language type system, but do allow variables to be allocated in different
   address spaces, need to explicitly specify the ``DW_ASPACE_AMDGPU_generic``
   address space in the DWARF expression operations as the default address space
   is the global address space.

The ``DW_ASPACE_AMDGPU_local`` address space allows location expressions to
specify the local address space corresponding to the wavefront that is executing
the focused thread of execution.

The ``DW_ASPACE_AMDGPU_private_lane`` address space allows location expressions
to specify the private address space corresponding to the lane that is executing
the focused thread of execution for languages that are implemented using a SIMD
or SIMT execution model.

The ``DW_ASPACE_AMDGPU_private_wave`` address space allows location expressions
to specify the unswizzled private address space corresponding to the wavefront
that is executing the focused thread of execution. The wavefront view of private
memory is the per wavefront unswizzled backing memory layout defined in
:ref:`amdgpu-address-spaces`, such that address 0 corresponds to the first
location for the backing memory of the wavefront (namely the address is not
offset by ``wavefront-scratch-base``).
The following formula can be used to
convert from a ``DW_ASPACE_AMDGPU_private_lane`` address to a
``DW_ASPACE_AMDGPU_private_wave`` address:

::

  private-address-wavefront =
    ((private-address-lane / 4) * wavefront-size * 4) +
    (wavefront-lane-id * 4) + (private-address-lane % 4)

If the ``DW_ASPACE_AMDGPU_private_lane`` address is dword aligned, and the start
of the dwords for each lane starting with lane 0 is required, then this
simplifies to:

::

  private-address-wavefront =
    private-address-lane * wavefront-size

A compiler can use the ``DW_ASPACE_AMDGPU_private_wave`` address space to read a
complete spilled vector register back into a complete vector register in the
CFI. The frame pointer can be a private lane address which is dword aligned,
which can be shifted to multiply by the wavefront size, and then used to form a
private wavefront address that gives a location for a contiguous set of dwords,
one per lane, where the vector register dwords are spilled. The compiler knows
the wavefront size since it generates the code. Note that the type of the
address may have to be converted as the size of a
``DW_ASPACE_AMDGPU_private_lane`` address may be smaller than the size of a
``DW_ASPACE_AMDGPU_private_wave`` address.

.. _amdgpu-dwarf-lane-identifier:

Lane Identifier
---------------

DWARF lane identifiers specify a target architecture lane position for hardware
that executes in a SIMD or SIMT manner, and onto which a source language maps
its threads of execution. The DWARF lane identifier is pushed by the
``DW_OP_LLVM_push_lane`` DWARF expression operation. See DWARF Version 5
section 2.5 which is updated by *DWARF Extensions For Heterogeneous Debugging*
section :ref:`amdgpu-dwarf-operation-expressions`.
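The private address conversion formulas above depend on the lane identifier.
They can be sketched as follows (illustrative Python; the function names are
invented for this example and are not part of any AMD tool):

```python
def private_lane_to_wave(addr_lane: int, lane_id: int, wavefront_size: int) -> int:
    """Convert a DW_ASPACE_AMDGPU_private_lane address to the equivalent
    DW_ASPACE_AMDGPU_private_wave address, per the general formula above."""
    return ((addr_lane // 4) * wavefront_size * 4 +
            lane_id * 4 +
            addr_lane % 4)

def private_lane_to_wave_dword(addr_lane: int, wavefront_size: int) -> int:
    """Simplified form for a dword-aligned lane address when the start of
    the lane 0 dwords is wanted: it reduces to addr_lane * wavefront_size."""
    assert addr_lane % 4 == 0, "simplification requires dword alignment"
    return addr_lane * wavefront_size
```

For a dword-aligned address and lane 0 the two functions agree, which is
exactly the simplification stated above.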
For AMDGPU, the lane identifier corresponds to the hardware lane ID of a
wavefront. It is numbered from 0 to the wavefront size minus 1.

Operation Expressions
---------------------

DWARF expressions are used to compute program values and the locations of
program objects. See DWARF Version 5 section 2.5 and
:ref:`amdgpu-dwarf-operation-expressions`.

DWARF location descriptions describe how to access storage which includes memory
and registers. When accessing storage on AMDGPU, bytes are ordered with least
significant bytes first, and bits are ordered within bytes with least
significant bits first.

For AMDGPU CFI expressions, ``DW_OP_LLVM_select_bit_piece`` is used to describe
unwinding vector registers that are spilled under the execution mask to memory:
the zero-single location description is the vector register, and the one-single
location description is the spilled memory location description. The
``DW_OP_LLVM_form_aspace_address`` operation is used to specify the address
space of the memory location description.

In AMDGPU expressions, ``DW_OP_LLVM_select_bit_piece`` is used by the
``DW_AT_LLVM_lane_pc`` attribute expression where divergent control flow is
controlled by the execution mask. An undefined location description together
with ``DW_OP_LLVM_extend`` is used to indicate the lane was not active on entry
to the subprogram. See :ref:`amdgpu-dwarf-dw-at-llvm-lane-pc` for an example.

Debugger Information Entry Attributes
-------------------------------------

This section describes how certain debugger information entry attributes are
used by AMDGPU. See the sections in DWARF Version 5 section 2 which are updated
by *DWARF Extensions For Heterogeneous Debugging* section
:ref:`amdgpu-dwarf-debugging-information-entry-attributes`.

.. _amdgpu-dwarf-dw-at-llvm-lane-pc:

``DW_AT_LLVM_lane_pc``
~~~~~~~~~~~~~~~~~~~~~~

For AMDGPU, the ``DW_AT_LLVM_lane_pc`` attribute is used to specify the program
location of the separate lanes of a SIMT thread.

If the lane is an active lane then this will be the same as the current program
location.

If the lane is inactive, but was active on entry to the subprogram, then this is
the program location in the subprogram at which execution of the lane is
conceptually positioned.

If the lane was not active on entry to the subprogram, then this will be the
undefined location. A client debugger can check if the lane is part of a valid
work-group by checking that the lane is in the range of the associated
work-group within the grid, accounting for partial work-groups. If it is not,
then the debugger can omit any information for the lane. Otherwise, the debugger
may repeatedly unwind the stack and inspect the ``DW_AT_LLVM_lane_pc`` of the
calling subprogram until it finds a non-undefined location. Conceptually the
lane only has the call frames for which it has a non-undefined
``DW_AT_LLVM_lane_pc``.

The following example illustrates how the AMDGPU backend can generate a DWARF
location list expression for the nested ``IF/THEN/ELSE`` structures of the
following subprogram pseudo code for a target with 64 lanes per wavefront.

.. code::
  :number-lines:

  SUBPROGRAM X
  BEGIN
    a;
    IF (c1) THEN
      b;
      IF (c2) THEN
        c;
      ELSE
        d;
      ENDIF
      e;
    ELSE
      f;
    ENDIF
    g;
  END

The AMDGPU backend may generate the following pseudo LLVM MIR to manipulate the
execution mask (``EXEC``) to linearize the control flow. The condition is
evaluated to make a mask of the lanes for which the condition evaluates to true.
First the ``THEN`` region is executed by setting the ``EXEC`` mask to the
logical ``AND`` of the current ``EXEC`` mask with the condition mask. Then the
``ELSE`` region is executed by negating the ``EXEC`` mask and taking the logical
``AND`` with the ``EXEC`` mask saved at the start of the region. After the
``IF/THEN/ELSE`` region the ``EXEC`` mask is restored to the value it had at the
beginning of the region. This is shown below. Other approaches are possible, but
the basic concept is the same.

.. code::
  :number-lines:

  $lex_start:
    a;
    %1 = EXEC
    %2 = c1
  $lex_1_start:
    EXEC = %1 & %2
  $lex_1_then:
    b;
    %3 = EXEC
    %4 = c2
  $lex_1_1_start:
    EXEC = %3 & %4
  $lex_1_1_then:
    c;
    EXEC = ~EXEC & %3
  $lex_1_1_else:
    d;
    EXEC = %3
  $lex_1_1_end:
    e;
    EXEC = ~EXEC & %1
  $lex_1_else:
    f;
    EXEC = %1
  $lex_1_end:
    g;
  $lex_end:

To create the DWARF location list expression that defines the location
description of a vector of lane program locations, the LLVM MIR ``DBG_VALUE``
pseudo instruction can be used to annotate the linearized control flow. This can
be done by defining an artificial variable for the lane PC. The DWARF location
list expression created for it is used as the value of the
``DW_AT_LLVM_lane_pc`` attribute on the subprogram's debugger information entry.

A DWARF procedure is defined for each well nested structured control flow region
which provides the conceptual lane program location for a lane if it is not
active (namely it is divergent). The DWARF operation expression for each region
conceptually inherits the value of the immediately enclosing region and modifies
it according to the semantics of the region.

For an ``IF/THEN/ELSE`` region the divergent program location is at the start of
the region for the ``THEN`` region since it is executed first.
For the ``ELSE``
region the divergent program location is at the end of the ``IF/THEN/ELSE``
region since the ``THEN`` region has completed.

The lane PC artificial variable is assigned at each region transition. It uses
the immediately enclosing region's DWARF procedure to compute the program
location for each lane assuming they are divergent, and then modifies the result
by inserting the current program location for each lane that the ``EXEC`` mask
indicates is active.

By having separate DWARF procedures for each region, they can be reused to
define the value for any nested region. This reduces the total size of the DWARF
operation expressions.

The following provides an example using pseudo LLVM MIR.

.. code::
  :number-lines:

  $lex_start:
    DEFINE_DWARF %__uint_64 = DW_TAG_base_type[
      DW_AT_name = "__uint64";
      DW_AT_byte_size = 8;
      DW_AT_encoding = DW_ATE_unsigned;
    ];
    DEFINE_DWARF %__active_lane_pc = DW_TAG_dwarf_procedure[
      DW_AT_name = "__active_lane_pc";
      DW_AT_location = [
        DW_OP_regx PC;
        DW_OP_LLVM_extend 64, 64;
        DW_OP_regval_type EXEC, %__uint_64;
        DW_OP_LLVM_select_bit_piece 64, 64;
      ];
    ];
    DEFINE_DWARF %__divergent_lane_pc = DW_TAG_dwarf_procedure[
      DW_AT_name = "__divergent_lane_pc";
      DW_AT_location = [
        DW_OP_LLVM_undefined;
        DW_OP_LLVM_extend 64, 64;
      ];
    ];
    DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
      DW_OP_call_ref %__divergent_lane_pc;
      DW_OP_call_ref %__active_lane_pc;
    ];
    a;
    %1 = EXEC;
    DBG_VALUE %1, $noreg, %__lex_1_save_exec;
    %2 = c1;
  $lex_1_start:
    EXEC = %1 & %2;
  $lex_1_then:
    DEFINE_DWARF %__divergent_lane_pc_1_then = DW_TAG_dwarf_procedure[
      DW_AT_name = "__divergent_lane_pc_1_then";
      DW_AT_location = DIExpression[
        DW_OP_call_ref %__divergent_lane_pc;
        DW_OP_addrx &lex_1_start;
        DW_OP_stack_value;
        DW_OP_LLVM_extend 64, 64;
        DW_OP_call_ref %__lex_1_save_exec;
        DW_OP_deref_type 64, %__uint_64;
        DW_OP_LLVM_select_bit_piece 64, 64;
      ];
    ];
    DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
      DW_OP_call_ref %__divergent_lane_pc_1_then;
      DW_OP_call_ref %__active_lane_pc;
    ];
    b;
    %3 = EXEC;
    DBG_VALUE %3, $noreg, %__lex_1_1_save_exec;
    %4 = c2;
  $lex_1_1_start:
    EXEC = %3 & %4;
  $lex_1_1_then:
    DEFINE_DWARF %__divergent_lane_pc_1_1_then = DW_TAG_dwarf_procedure[
      DW_AT_name = "__divergent_lane_pc_1_1_then";
      DW_AT_location = DIExpression[
        DW_OP_call_ref %__divergent_lane_pc_1_then;
        DW_OP_addrx &lex_1_1_start;
        DW_OP_stack_value;
        DW_OP_LLVM_extend 64, 64;
        DW_OP_call_ref %__lex_1_1_save_exec;
        DW_OP_deref_type 64, %__uint_64;
        DW_OP_LLVM_select_bit_piece 64, 64;
      ];
    ];
    DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
      DW_OP_call_ref %__divergent_lane_pc_1_1_then;
      DW_OP_call_ref %__active_lane_pc;
    ];
    c;
    EXEC = ~EXEC & %3;
  $lex_1_1_else:
    DEFINE_DWARF %__divergent_lane_pc_1_1_else = DW_TAG_dwarf_procedure[
      DW_AT_name = "__divergent_lane_pc_1_1_else";
      DW_AT_location = DIExpression[
        DW_OP_call_ref %__divergent_lane_pc_1_then;
        DW_OP_addrx &lex_1_1_end;
        DW_OP_stack_value;
        DW_OP_LLVM_extend 64, 64;
        DW_OP_call_ref %__lex_1_1_save_exec;
        DW_OP_deref_type 64, %__uint_64;
        DW_OP_LLVM_select_bit_piece 64, 64;
      ];
    ];
    DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
      DW_OP_call_ref %__divergent_lane_pc_1_1_else;
      DW_OP_call_ref %__active_lane_pc;
    ];
    d;
    EXEC = %3;
  $lex_1_1_end:
    DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
      DW_OP_call_ref %__divergent_lane_pc_1_then;
      DW_OP_call_ref %__active_lane_pc;
    ];
    e;
    EXEC = ~EXEC & %1;
  $lex_1_else:
    DEFINE_DWARF %__divergent_lane_pc_1_else = DW_TAG_dwarf_procedure[
      DW_AT_name = "__divergent_lane_pc_1_else";
      DW_AT_location = DIExpression[
        DW_OP_call_ref %__divergent_lane_pc;
        DW_OP_addrx &lex_1_end;
        DW_OP_stack_value;
        DW_OP_LLVM_extend 64, 64;
        DW_OP_call_ref %__lex_1_save_exec;
        DW_OP_deref_type 64, %__uint_64;
        DW_OP_LLVM_select_bit_piece 64, 64;
      ];
    ];
    DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
      DW_OP_call_ref %__divergent_lane_pc_1_else;
      DW_OP_call_ref %__active_lane_pc;
    ];
    f;
    EXEC = %1;
  $lex_1_end:
    DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
      DW_OP_call_ref %__divergent_lane_pc;
      DW_OP_call_ref %__active_lane_pc;
    ];
    g;
  $lex_end:

The DWARF procedure ``%__active_lane_pc`` is used to update the lane PC
elements that are active with the current program location.

Artificial variables ``%__lex_1_save_exec`` and ``%__lex_1_1_save_exec`` are
created for the execution masks saved on entry to a region. Using the
``DBG_VALUE`` pseudo instruction, location list entries will be created that
describe where the artificial variables are allocated at any given program
location. The compiler may allocate them to registers or spill them to memory.

The DWARF procedures for each region use the values of the saved execution mask
artificial variables to only update the lanes that are active on entry to the
region. All other lanes retain the value of the enclosing region where they were
last active. If they were not active on entry to the subprogram, then they will
have the undefined location description.

Other structured control flow regions can be handled similarly. For example,
loops would set the divergent program location for the region at the end of the
loop. Any lanes active will be in the loop, and any lanes not active must have
exited the loop.
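The way ``DW_OP_LLVM_select_bit_piece`` overlays the current PC onto the
divergent lane PCs for active lanes can be modeled with a small sketch. This is
an illustrative model only, not an actual DWARF expression evaluator, and the
function name is invented for this example:

```python
def select_bit_piece(mask: int, zero_src: list, one_src: list) -> list:
    """Per-lane select: lane i takes one_src[i] when bit i of mask is set,
    otherwise zero_src[i]. This models how the lane PC expressions above
    combine divergent lane PCs (zero) with the current PC (one) under the
    EXEC mask."""
    return [one_src[i] if (mask >> i) & 1 else zero_src[i]
            for i in range(len(zero_src))]

# Example: 4-lane wavefront, lanes 0 and 2 active (EXEC = 0b0101).
# Active lanes get the current PC; inactive lanes keep their divergent PC.
lane_pc = select_bit_piece(0b0101,
                           ["&lex_1_start"] * 4,   # divergent lane PCs
                           ["current_pc"] * 4)     # extended current PC
```

The model makes explicit why inactive lanes retain the value from the enclosing
region: the select only changes elements whose mask bit is set.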
An ``IF/THEN/ELSEIF/ELSEIF/...`` region can be treated as a nest of
``IF/THEN/ELSE`` regions.

The DWARF procedures can use the active lane artificial variable described in
:ref:`amdgpu-dwarf-amdgpu-dw-at-llvm-active-lane` rather than the actual
``EXEC`` mask in order to support whole or quad wavefront mode.

.. _amdgpu-dwarf-amdgpu-dw-at-llvm-active-lane:

``DW_AT_LLVM_active_lane``
~~~~~~~~~~~~~~~~~~~~~~~~~~

The ``DW_AT_LLVM_active_lane`` attribute on a subprogram debugger information
entry is used to specify the lanes that are conceptually active for a SIMT
thread.

The execution mask may be modified to implement whole or quad wavefront mode
operations. For example, all lanes may need to temporarily be made active to
execute a whole wavefront operation. Such regions would save the ``EXEC`` mask,
update it to enable the necessary lanes, perform the operations, and then
restore the ``EXEC`` mask from the saved value. While executing the whole
wavefront region, the conceptual execution mask is the saved value, not the
``EXEC`` value.

This is handled by defining an artificial variable for the active lane mask. The
active lane mask artificial variable would be the actual ``EXEC`` mask for
normal regions, and the saved execution mask for regions where the mask is
temporarily updated. The location list expression created for this artificial
variable is used to define the value of the ``DW_AT_LLVM_active_lane``
attribute.

``DW_AT_LLVM_augmentation``
~~~~~~~~~~~~~~~~~~~~~~~~~~~

For AMDGPU, the ``DW_AT_LLVM_augmentation`` attribute of a compilation unit
debugger information entry has the following value for the augmentation string:

::

  [amdgpu:v0.0]

The "vX.Y" specifies the major X and minor Y version number of the AMDGPU
extensions used in the DWARF of the compilation unit.
The version number
conforms to [SEMVER]_.

Call Frame Information
----------------------

DWARF Call Frame Information (CFI) describes how a consumer can virtually
*unwind* call frames in a running process or core dump. See DWARF Version 5
section 6.4 and :ref:`amdgpu-dwarf-call-frame-information`.

For AMDGPU, the Common Information Entry (CIE) fields have the following values:

1. ``augmentation`` string contains the following null-terminated UTF-8 string:

   ::

     [amd:v0.0]

   The ``vX.Y`` specifies the major X and minor Y version number of the AMDGPU
   extensions used in this CIE or the FDEs that use it. The version number
   conforms to [SEMVER]_.

2. ``address_size`` for the ``Global`` address space is defined in
   :ref:`amdgpu-dwarf-address-space-identifier`.

3. ``segment_selector_size`` is 0 as AMDGPU does not use a segment selector.

4. ``code_alignment_factor`` is 4 bytes.

   .. TODO::

      Add to :ref:`amdgpu-processor-table` table.

5. ``data_alignment_factor`` is 4 bytes.

   .. TODO::

      Add to :ref:`amdgpu-processor-table` table.

6. ``return_address_register`` is ``PC_32`` for 32-bit processes and ``PC_64``
   for 64-bit processes defined in :ref:`amdgpu-dwarf-register-identifier`.

7. ``initial_instructions`` Since a subprogram X with fewer registers can be
   called from subprogram Y that has more allocated, X will not change any of
   the extra registers as it cannot access them. Therefore, the default rule
   for all columns is ``same value``.

For AMDGPU the register number follows the numbering defined in
:ref:`amdgpu-dwarf-register-identifier`.

For AMDGPU the instructions are variable size. A consumer can subtract 1 from
the return address to get the address of a byte within the call site
instructions. See DWARF Version 5 section 6.4.4.
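A consumer might validate the vendor augmentation strings described above with
something like the following sketch. The helper is hypothetical and not part of
any DWARF library; it accepts both the ``[amd:vX.Y]`` form used in the CIE and
the ``[amdgpu:vX.Y]`` form used elsewhere:

```python
import re

def parse_amdgpu_augmentation(s: str) -> tuple[int, int]:
    """Parse an AMDGPU vendor augmentation string such as '[amd:v0.0]'
    (CIE) or '[amdgpu:v0.0]' (compilation unit, name index) into a
    (major, minor) version pair. The version conforms to SEMVER."""
    m = re.fullmatch(r"\[(?:amd|amdgpu):v(\d+)\.(\d+)\]", s)
    if m is None:
        raise ValueError(f"unrecognized augmentation string: {s!r}")
    return int(m.group(1)), int(m.group(2))
```

A consumer would reject an FDE whose CIE carries an unknown vendor prefix or a
major version newer than it understands.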
Accelerated Access
------------------

See DWARF Version 5 section 6.1.

Lookup By Name Section Header
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

See DWARF Version 5 section 6.1.1.4.1 and :ref:`amdgpu-dwarf-lookup-by-name`.

For AMDGPU the lookup by name section header table:

``augmentation_string_size`` (uword)
  Set to the length of the ``augmentation_string`` value, which is always a
  multiple of 4.

``augmentation_string`` (sequence of UTF-8 characters)
  Contains the following UTF-8 string null padded to a multiple of 4 bytes:

  ::

    [amdgpu:v0.0]

  The "vX.Y" specifies the major X and minor Y version number of the AMDGPU
  extensions used in the DWARF of this index. The version number conforms to
  [SEMVER]_.

  .. note::

     This is different from the DWARF Version 5 definition that requires the
     first 4 characters to be the vendor ID. But this is consistent with the
     other augmentation strings and does allow multiple vendor contributions.
     However, backwards compatibility may be more desirable.

Lookup By Address Section Header
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

See DWARF Version 5 section 6.1.2.

For AMDGPU the lookup by address section header table:

``address_size`` (ubyte)
  Matches the address size for the ``Global`` address space defined in
  :ref:`amdgpu-dwarf-address-space-identifier`.

``segment_selector_size`` (ubyte)
  AMDGPU does not use a segment selector so this is 0. The entries in the
  ``.debug_aranges`` do not have a segment selector.

Line Number Information
-----------------------

See DWARF Version 5 section 6.2 and :ref:`amdgpu-dwarf-line-number-information`.

AMDGPU does not use the ``isa`` state machine register and always sets it to 0.
The instruction set must be obtained from the ELF file header ``e_flags`` field
in the ``EF_AMDGPU_MACH`` bit position (see :ref:`ELF Header
<amdgpu-elf-header>`). See DWARF Version 5 section 6.2.2.

.. TODO::

   Should the ``isa`` state machine register be used to indicate if the code is
   in wavefront32 or wavefront64 mode? Or used to specify the architecture ISA?

For AMDGPU the line number program header fields have the following values (see
DWARF Version 5 section 6.2.4):

``address_size`` (ubyte)
   Matches the address size for the ``Global`` address space defined in
   :ref:`amdgpu-dwarf-address-space-identifier`.

``segment_selector_size`` (ubyte)
   AMDGPU does not use a segment selector so this is 0.

``minimum_instruction_length`` (ubyte)
   For GFX9-GFX10 this is 4.

``maximum_operations_per_instruction`` (ubyte)
   For GFX9-GFX10 this is 1.

Source text for online-compiled programs (for example, those compiled by the
OpenCL language runtime) may be embedded into the DWARF Version 5 line table.
See DWARF Version 5 section 6.2.4.1 which is updated by *DWARF Extensions For
Heterogeneous Debugging* section :ref:`DW_LNCT_LLVM_source
<amdgpu-dwarf-line-number-information-dw-lnct-llvm-source>`.

The Clang option used to control source embedding in AMDGPU is defined in
:ref:`amdgpu-clang-debug-options-table`.

  .. table:: AMDGPU Clang Debug Options
     :name: amdgpu-clang-debug-options-table

     ==================== ==================================================
     Debug Flag           Description
     ==================== ==================================================
     -g[no-]embed-source  Enable/disable embedding source text in DWARF
                          debug sections. Useful for environments where
                          source cannot be written to disk, such as
                          when performing online compilation.
     ==================== ==================================================

For example:

``-gembed-source``
   Enable the embedded source.

``-gno-embed-source``
   Disable the embedded source.

32-Bit and 64-Bit DWARF Formats
-------------------------------

See DWARF Version 5 section 7.4 and
:ref:`amdgpu-dwarf-32-bit-and-64-bit-dwarf-formats`.

For AMDGPU:

* For the ``amdgcn`` target architecture only the 64-bit process address space
  is supported.

* The producer can generate either 32-bit or 64-bit DWARF format. LLVM
  generates the 32-bit DWARF format.

Unit Headers
------------

For AMDGPU the following values apply for each of the unit headers described in
DWARF Version 5 sections 7.5.1.1, 7.5.1.2, and 7.5.1.3:

``address_size`` (ubyte)
   Matches the address size for the ``Global`` address space defined in
   :ref:`amdgpu-dwarf-address-space-identifier`.

.. _amdgpu-code-conventions:

Code Conventions
================

This section provides code conventions used for each supported target triple OS
(see :ref:`amdgpu-target-triples`).

AMDHSA
------

This section provides code conventions used when the target triple OS is
``amdhsa`` (see :ref:`amdgpu-target-triples`).

.. _amdgpu-amdhsa-code-object-metadata:

Code Object Metadata
~~~~~~~~~~~~~~~~~~~~

The code object metadata specifies extensible metadata associated with the code
objects executed on HSA [HSA]_ compatible runtimes (see :ref:`amdgpu-os`). The
encoding and semantics of this metadata depend on the code object version; see
:ref:`amdgpu-amdhsa-code-object-metadata-v2`,
:ref:`amdgpu-amdhsa-code-object-metadata-v3`, and
:ref:`amdgpu-amdhsa-code-object-metadata-v4`.
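Each of the metadata versions described below encodes printf call information as a sequence of strings with colon-separated fields, ``ID:N:S[0]:...:S[N-1]:FormatString``. A minimal Python sketch of a consumer of one such string; the function name is illustrative, and note that only the leading fields are split off because the format string itself may contain ``:`` characters:

```python
def parse_printf_metadata(entry: str):
    """Parse one printf metadata string "ID:N:S[0]:...:S[N-1]:FormatString".

    Returns (ID, [S[0], ..., S[N-1]], FormatString). N is the number of
    printf arguments minus 1, so exactly N size fields follow it; the
    remainder is the format string and may itself contain ':'.
    """
    ident, n, rest = entry.split(":", 2)
    sizes = []
    for _ in range(int(n)):
        size, rest = rest.split(":", 1)
        sizes.append(int(size))
    return int(ident), sizes, rest

print(parse_printf_metadata("1:2:4:8:result=%d %s"))
# prints (1, [4, 8], 'result=%d %s')
```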

Code object metadata is specified in a note record (see
:ref:`amdgpu-note-records`) and is required when the target triple OS is
``amdhsa`` (see :ref:`amdgpu-target-triples`). It must contain the minimum
information necessary to support the HSA compatible runtime kernel queries. For
example, the segment sizes needed in a dispatch packet. In addition, a
high-level language runtime may require other information to be included. For
example, the AMD OpenCL runtime records kernel argument information.

.. _amdgpu-amdhsa-code-object-metadata-v2:

Code Object V2 Metadata
+++++++++++++++++++++++

.. warning::
   Code object V2 is not the default code object version emitted by this
   version of LLVM.

Code object V2 metadata is specified by the ``NT_AMD_HSA_METADATA`` note record
(see :ref:`amdgpu-note-records-v2`).

The metadata is specified as a YAML formatted string (see [YAML]_ and
:doc:`YamlIO`).

.. TODO::

   Is the string null terminated? It probably should not be if YAML allows it
   to contain null characters; otherwise it should be.

The metadata is represented as a single YAML document comprised of the mapping
defined in table :ref:`amdgpu-amdhsa-code-object-metadata-map-v2-table` and
referenced tables.

For boolean values, the string values of ``false`` and ``true`` are used for
false and true respectively.

Additional information can be added to the mappings. To avoid conflicts, any
non-AMD key names should be prefixed by "*vendor-name*.".

  .. table:: AMDHSA Code Object V2 Metadata Map
     :name: amdgpu-amdhsa-code-object-metadata-map-v2-table

     ========== ============== ========= =======================================
     String Key Value Type     Required? Description
     ========== ============== ========= =======================================
     "Version"  sequence of    Required  - The first integer is the major
                2 integers                 version. Currently 1.
                                         - The second integer is the minor
                                           version. Currently 0.
     "Printf"   sequence of              Each string is encoded information
                strings                  about a printf function call. The
                                         encoded information is organized as
                                         fields separated by colon (':'):

                                         ``ID:N:S[0]:S[1]:...:S[N-1]:FormatString``

                                         where:

                                         ``ID``
                                           A 32-bit integer as a unique id for
                                           each printf function call

                                         ``N``
                                           A 32-bit integer equal to the number
                                           of arguments of printf function call
                                           minus 1

                                         ``S[i]`` (where i = 0, 1, ... , N-1)
                                           32-bit integers for the size in bytes
                                           of the i-th FormatString argument of
                                           the printf function call

                                         FormatString
                                           The format string passed to the
                                           printf function call.
     "Kernels"  sequence of    Required  Sequence of the mappings for each
                mapping                  kernel in the code object. See
                                         :ref:`amdgpu-amdhsa-code-object-kernel-metadata-map-v2-table`
                                         for the definition of the mapping.
     ========== ============== ========= =======================================

..

  .. table:: AMDHSA Code Object V2 Kernel Metadata Map
     :name: amdgpu-amdhsa-code-object-kernel-metadata-map-v2-table

     ================= ============== ========= ================================
     String Key        Value Type     Required? Description
     ================= ============== ========= ================================
     "Name"            string         Required  Source name of the kernel.
     "SymbolName"      string         Required  Name of the kernel
                                                descriptor ELF symbol.
     "Language"        string                   Source language of the kernel.
                                                Values include:

                                                - "OpenCL C"
                                                - "OpenCL C++"
                                                - "HCC"
                                                - "OpenMP"

     "LanguageVersion" sequence of              - The first integer is the major
                       2 integers                 version.
                                                - The second integer is the
                                                  minor version.
     "Attrs"           mapping                  Mapping of kernel attributes.
                                                See
                                                :ref:`amdgpu-amdhsa-code-object-kernel-attribute-metadata-map-v2-table`
                                                for the mapping definition.
     "Args"            sequence of              Sequence of mappings of the
                       mapping                  kernel arguments. See
                                                :ref:`amdgpu-amdhsa-code-object-kernel-argument-metadata-map-v2-table`
                                                for the definition of the
                                                mapping.
     "CodeProps"       mapping                  Mapping of properties related to
                                                the kernel code. See
                                                :ref:`amdgpu-amdhsa-code-object-kernel-code-properties-metadata-map-v2-table`
                                                for the mapping definition.
     ================= ============== ========= ================================

..

  .. table:: AMDHSA Code Object V2 Kernel Attribute Metadata Map
     :name: amdgpu-amdhsa-code-object-kernel-attribute-metadata-map-v2-table

     =================== ============== ========= ==============================
     String Key          Value Type     Required? Description
     =================== ============== ========= ==============================
     "ReqdWorkGroupSize" sequence of              If not 0, 0, 0 then all values
                         3 integers               must be >=1 and the dispatch
                                                  work-group size X, Y, Z must
                                                  correspond to the specified
                                                  values. Defaults to 0, 0, 0.

                                                  Corresponds to the OpenCL
                                                  ``reqd_work_group_size``
                                                  attribute.
     "WorkGroupSizeHint" sequence of              The dispatch work-group size
                         3 integers               X, Y, Z is likely to be the
                                                  specified values.

                                                  Corresponds to the OpenCL
                                                  ``work_group_size_hint``
                                                  attribute.
     "VecTypeHint"       string                   The name of a scalar or vector
                                                  type.

                                                  Corresponds to the OpenCL
                                                  ``vec_type_hint`` attribute.
     "RuntimeHandle"     string                   The external symbol name
                                                  associated with a kernel.
                                                  OpenCL runtime allocates a
                                                  global buffer for the symbol
                                                  and saves the kernel's address
                                                  to it, which is used for
                                                  device side enqueueing. Only
                                                  available for device side
                                                  enqueued kernels.
     =================== ============== ========= ==============================

..

  .. table:: AMDHSA Code Object V2 Kernel Argument Metadata Map
     :name: amdgpu-amdhsa-code-object-kernel-argument-metadata-map-v2-table

     ================= ============== ========= ================================
     String Key        Value Type     Required? Description
     ================= ============== ========= ================================
     "Name"            string                   Kernel argument name.
     "TypeName"        string                   Kernel argument type name.
     "Size"            integer        Required  Kernel argument size in bytes.
     "Align"           integer        Required  Kernel argument alignment in
                                                bytes. Must be a power of two.
     "ValueKind"       string         Required  Kernel argument kind that
                                                specifies how to set up the
                                                corresponding argument.
                                                Values include:

                                                "ByValue"
                                                  The argument is copied
                                                  directly into the kernarg.

                                                "GlobalBuffer"
                                                  A global address space pointer
                                                  to the buffer data is passed
                                                  in the kernarg.

                                                "DynamicSharedPointer"
                                                  A group address space pointer
                                                  to dynamically allocated LDS
                                                  is passed in the kernarg.

                                                "Sampler"
                                                  A global address space
                                                  pointer to a S# is passed in
                                                  the kernarg.

                                                "Image"
                                                  A global address space
                                                  pointer to a T# is passed in
                                                  the kernarg.

                                                "Pipe"
                                                  A global address space pointer
                                                  to an OpenCL pipe is passed in
                                                  the kernarg.

                                                "Queue"
                                                  A global address space pointer
                                                  to an OpenCL device enqueue
                                                  queue is passed in the
                                                  kernarg.

                                                "HiddenGlobalOffsetX"
                                                  The OpenCL grid dispatch
                                                  global offset for the X
                                                  dimension is passed in the
                                                  kernarg.

                                                "HiddenGlobalOffsetY"
                                                  The OpenCL grid dispatch
                                                  global offset for the Y
                                                  dimension is passed in the
                                                  kernarg.

                                                "HiddenGlobalOffsetZ"
                                                  The OpenCL grid dispatch
                                                  global offset for the Z
                                                  dimension is passed in the
                                                  kernarg.

                                                "HiddenNone"
                                                  An argument that is not used
                                                  by the kernel. Space needs to
                                                  be left for it, but it does
                                                  not need to be set up.

                                                "HiddenPrintfBuffer"
                                                  A global address space pointer
                                                  to the runtime printf buffer
                                                  is passed in the kernarg.

                                                "HiddenHostcallBuffer"
                                                  A global address space pointer
                                                  to the runtime hostcall buffer
                                                  is passed in the kernarg.

                                                "HiddenDefaultQueue"
                                                  A global address space pointer
                                                  to the OpenCL device enqueue
                                                  queue that should be used by
                                                  the kernel by default is
                                                  passed in the kernarg.

                                                "HiddenCompletionAction"
                                                  A global address space pointer
                                                  to help link enqueued kernels
                                                  into the ancestor tree for
                                                  determining when the parent
                                                  kernel has finished.

                                                "HiddenMultiGridSyncArg"
                                                  A global address space pointer
                                                  for multi-grid synchronization
                                                  is passed in the kernarg.

     "ValueType"       string                   Unused and deprecated. This
                                                should no longer be emitted,
                                                but is accepted for
                                                compatibility.
     "PointeeAlign"    integer                  Alignment in bytes of pointee
                                                type for pointer type kernel
                                                argument. Must be a power
                                                of 2. Only present if
                                                "ValueKind" is
                                                "DynamicSharedPointer".
     "AddrSpaceQual"   string                   Kernel argument address space
                                                qualifier. Only present if
                                                "ValueKind" is "GlobalBuffer" or
                                                "DynamicSharedPointer". Values
                                                are:

                                                - "Private"
                                                - "Global"
                                                - "Constant"
                                                - "Local"
                                                - "Generic"
                                                - "Region"

                                                .. TODO::

                                                   Is GlobalBuffer only Global
                                                   or Constant? Is
                                                   DynamicSharedPointer always
                                                   Local? Can HCC allow Generic?
                                                   How can Private or Region
                                                   ever happen?

     "AccQual"         string                   Kernel argument access
                                                qualifier. Only present if
                                                "ValueKind" is "Image" or
                                                "Pipe". Values are:

                                                - "ReadOnly"
                                                - "WriteOnly"
                                                - "ReadWrite"

                                                .. TODO::

                                                   Does this apply to
                                                   GlobalBuffer?

     "ActualAccQual"   string                   The actual memory accesses
                                                performed by the kernel on the
                                                kernel argument. Only present if
                                                "ValueKind" is "GlobalBuffer",
                                                "Image", or "Pipe". This may be
                                                more restrictive than indicated
                                                by "AccQual" to reflect what the
                                                kernel actually does. If not
                                                present then the runtime must
                                                assume what is implied by
                                                "AccQual" and "IsConst". Values
                                                are:

                                                - "ReadOnly"
                                                - "WriteOnly"
                                                - "ReadWrite"

     "IsConst"         boolean                  Indicates if the kernel argument
                                                is const qualified. Only present
                                                if "ValueKind" is
                                                "GlobalBuffer".
     "IsRestrict"      boolean                  Indicates if the kernel argument
                                                is restrict qualified. Only
                                                present if "ValueKind" is
                                                "GlobalBuffer".
     "IsVolatile"      boolean                  Indicates if the kernel argument
                                                is volatile qualified. Only
                                                present if "ValueKind" is
                                                "GlobalBuffer".
     "IsPipe"          boolean                  Indicates if the kernel argument
                                                is pipe qualified. Only present
                                                if "ValueKind" is "Pipe".

                                                .. TODO::

                                                   Can GlobalBuffer be pipe
                                                   qualified?

     ================= ============== ========= ================================

..

  .. table:: AMDHSA Code Object V2 Kernel Code Properties Metadata Map
     :name: amdgpu-amdhsa-code-object-kernel-code-properties-metadata-map-v2-table

     ============================ ============== ========= =====================
     String Key                   Value Type     Required? Description
     ============================ ============== ========= =====================
     "KernargSegmentSize"         integer        Required  The size in bytes of
                                                           the kernarg segment
                                                           that holds the values
                                                           of the arguments to
                                                           the kernel.
     "GroupSegmentFixedSize"      integer        Required  The amount of group
                                                           segment memory
                                                           required by a
                                                           work-group in
                                                           bytes. This does not
                                                           include any
                                                           dynamically allocated
                                                           group segment memory
                                                           that may be added
                                                           when the kernel is
                                                           dispatched.
     "PrivateSegmentFixedSize"    integer        Required  The amount of fixed
                                                           private address space
                                                           memory required for a
                                                           work-item in
                                                           bytes. If the kernel
                                                           uses a dynamic call
                                                           stack then additional
                                                           space must be added
                                                           to this value for the
                                                           call stack.
     "KernargSegmentAlign"        integer        Required  The maximum byte
                                                           alignment of
                                                           arguments in the
                                                           kernarg segment. Must
                                                           be a power of 2.
     "WavefrontSize"              integer        Required  Wavefront size. Must
                                                           be a power of 2.
     "NumSGPRs"                   integer        Required  Number of scalar
                                                           registers used by a
                                                           wavefront for
                                                           GFX6-GFX10. This
                                                           includes the special
                                                           SGPRs for VCC, Flat
                                                           Scratch (GFX7-GFX10)
                                                           and XNACK (for
                                                           GFX8-GFX10). It does
                                                           not include the 16
                                                           SGPR added if a trap
                                                           handler is
                                                           enabled. It is not
                                                           rounded up to the
                                                           allocation
                                                           granularity.
     "NumVGPRs"                   integer        Required  Number of vector
                                                           registers used by
                                                           each work-item for
                                                           GFX6-GFX10.
     "MaxFlatWorkGroupSize"       integer        Required  Maximum flat
                                                           work-group size
                                                           supported by the
                                                           kernel in work-items.
                                                           Must be >=1 and
                                                           consistent with
                                                           ReqdWorkGroupSize if
                                                           not 0, 0, 0.
     "NumSpilledSGPRs"            integer                  Number of stores from
                                                           a scalar register to
                                                           a register allocator
                                                           created spill
                                                           location.
     "NumSpilledVGPRs"            integer                  Number of stores from
                                                           a vector register to
                                                           a register allocator
                                                           created spill
                                                           location.
     ============================ ============== ========= =====================

.. _amdgpu-amdhsa-code-object-metadata-v3:

Code Object V3 Metadata
+++++++++++++++++++++++

Code object V3 to V4 metadata is specified by the ``NT_AMDGPU_METADATA`` note
record (see :ref:`amdgpu-note-records-v3-v4`).

The metadata is represented as Message Pack formatted binary data (see
[MsgPack]_). The top level is a Message Pack map that includes the
keys defined in table
:ref:`amdgpu-amdhsa-code-object-metadata-map-table-v3` and referenced
tables.

Additional information can be added to the maps. To avoid conflicts,
any key names should be prefixed by "*vendor-name*." where
``vendor-name`` can be the name of the vendor and specific vendor
tool that generates the information. The prefix is abbreviated to
simply "." when it appears within a map that has been added by the
same *vendor-name*.

  .. table:: AMDHSA Code Object V3 Metadata Map
     :name: amdgpu-amdhsa-code-object-metadata-map-table-v3

     ================= ============== ========= =======================================
     String Key        Value Type     Required? Description
     ================= ============== ========= =======================================
     "amdhsa.version"  sequence of    Required  - The first integer is the major
                       2 integers                 version. Currently 1.
                                                - The second integer is the minor
                                                  version. Currently 0.
     "amdhsa.printf"   sequence of              Each string is encoded information
                       strings                  about a printf function call. The
                                                encoded information is organized as
                                                fields separated by colon (':'):

                                                ``ID:N:S[0]:S[1]:...:S[N-1]:FormatString``

                                                where:

                                                ``ID``
                                                  A 32-bit integer as a unique id for
                                                  each printf function call

                                                ``N``
                                                  A 32-bit integer equal to the number
                                                  of arguments of printf function call
                                                  minus 1

                                                ``S[i]`` (where i = 0, 1, ... , N-1)
                                                  32-bit integers for the size in bytes
                                                  of the i-th FormatString argument of
                                                  the printf function call

                                                FormatString
                                                  The format string passed to the
                                                  printf function call.
     "amdhsa.kernels"  sequence of    Required  Sequence of the maps for each
                       map                      kernel in the code object. See
                                                :ref:`amdgpu-amdhsa-code-object-kernel-metadata-map-table-v3`
                                                for the definition of the keys included
                                                in that map.
     ================= ============== ========= =======================================

..

  .. table:: AMDHSA Code Object V3 Kernel Metadata Map
     :name: amdgpu-amdhsa-code-object-kernel-metadata-map-table-v3

     =================================== ============== ========= ================================
     String Key                          Value Type     Required? Description
     =================================== ============== ========= ================================
     ".name"                             string         Required  Source name of the kernel.
     ".symbol"                           string         Required  Name of the kernel
                                                                  descriptor ELF symbol.
     ".language"                         string                   Source language of the kernel.
                                                                  Values include:

                                                                  - "OpenCL C"
                                                                  - "OpenCL C++"
                                                                  - "HCC"
                                                                  - "HIP"
                                                                  - "OpenMP"
                                                                  - "Assembler"

     ".language_version"                 sequence of              - The first integer is the major
                                         2 integers                 version.
                                                                  - The second integer is the
                                                                    minor version.
     ".args"                             sequence of              Sequence of maps of the
                                         map                      kernel arguments. See
                                                                  :ref:`amdgpu-amdhsa-code-object-kernel-argument-metadata-map-table-v3`
                                                                  for the definition of the keys
                                                                  included in that map.
     ".reqd_workgroup_size"              sequence of              If not 0, 0, 0 then all values
                                         3 integers               must be >=1 and the dispatch
                                                                  work-group size X, Y, Z must
                                                                  correspond to the specified
                                                                  values. Defaults to 0, 0, 0.

                                                                  Corresponds to the OpenCL
                                                                  ``reqd_work_group_size``
                                                                  attribute.
     ".workgroup_size_hint"              sequence of              The dispatch work-group size
                                         3 integers               X, Y, Z is likely to be the
                                                                  specified values.

                                                                  Corresponds to the OpenCL
                                                                  ``work_group_size_hint``
                                                                  attribute.
     ".vec_type_hint"                    string                   The name of a scalar or vector
                                                                  type.

                                                                  Corresponds to the OpenCL
                                                                  ``vec_type_hint`` attribute.

     ".device_enqueue_symbol"            string                   The external symbol name
                                                                  associated with a kernel.
                                                                  OpenCL runtime allocates a
                                                                  global buffer for the symbol
                                                                  and saves the kernel's address
                                                                  to it, which is used for
                                                                  device side enqueueing. Only
                                                                  available for device side
                                                                  enqueued kernels.
     ".kernarg_segment_size"             integer        Required  The size in bytes of
                                                                  the kernarg segment
                                                                  that holds the values
                                                                  of the arguments to
                                                                  the kernel.
     ".group_segment_fixed_size"         integer        Required  The amount of group
                                                                  segment memory
                                                                  required by a
                                                                  work-group in
                                                                  bytes. This does not
                                                                  include any
                                                                  dynamically allocated
                                                                  group segment memory
                                                                  that may be added
                                                                  when the kernel is
                                                                  dispatched.
     ".private_segment_fixed_size"       integer        Required  The amount of fixed
                                                                  private address space
                                                                  memory required for a
                                                                  work-item in
                                                                  bytes. If the kernel
                                                                  uses a dynamic call
                                                                  stack then additional
                                                                  space must be added
                                                                  to this value for the
                                                                  call stack.
     ".kernarg_segment_align"            integer        Required  The maximum byte
                                                                  alignment of
                                                                  arguments in the
                                                                  kernarg segment. Must
                                                                  be a power of 2.
     ".wavefront_size"                   integer        Required  Wavefront size. Must
                                                                  be a power of 2.
     ".sgpr_count"                       integer        Required  Number of scalar
                                                                  registers required by a
                                                                  wavefront for
                                                                  GFX6-GFX9. A register
                                                                  is required if it is
                                                                  used explicitly, or
                                                                  if a higher numbered
                                                                  register is used
                                                                  explicitly. This
                                                                  includes the special
                                                                  SGPRs for VCC, Flat
                                                                  Scratch (GFX7-GFX9)
                                                                  and XNACK (for
                                                                  GFX8-GFX9). It does
                                                                  not include the 16
                                                                  SGPR added if a trap
                                                                  handler is
                                                                  enabled. It is not
                                                                  rounded up to the
                                                                  allocation
                                                                  granularity.
     ".vgpr_count"                       integer        Required  Number of vector
                                                                  registers required by
                                                                  each work-item for
                                                                  GFX6-GFX9. A register
                                                                  is required if it is
                                                                  used explicitly, or
                                                                  if a higher numbered
                                                                  register is used
                                                                  explicitly.
     ".max_flat_workgroup_size"          integer        Required  Maximum flat
                                                                  work-group size
                                                                  supported by the
                                                                  kernel in work-items.
                                                                  Must be >=1 and
                                                                  consistent with
                                                                  ReqdWorkGroupSize if
                                                                  not 0, 0, 0.
     ".sgpr_spill_count"                 integer                  Number of stores from
                                                                  a scalar register to
                                                                  a register allocator
                                                                  created spill
                                                                  location.
     ".vgpr_spill_count"                 integer                  Number of stores from
                                                                  a vector register to
                                                                  a register allocator
                                                                  created spill
                                                                  location.
     =================================== ============== ========= ================================

..

  .. table:: AMDHSA Code Object V3 Kernel Argument Metadata Map
     :name: amdgpu-amdhsa-code-object-kernel-argument-metadata-map-table-v3

     ====================== ============== ========= ================================
     String Key             Value Type     Required? Description
     ====================== ============== ========= ================================
     ".name"                string                   Kernel argument name.
     ".type_name"           string                   Kernel argument type name.
     ".size"                integer        Required  Kernel argument size in
                                                     bytes.
     ".offset"              integer        Required  Kernel argument offset in
                                                     bytes. The offset must be a
                                                     multiple of the alignment
                                                     required by the argument.
     ".value_kind"          string         Required  Kernel argument kind that
                                                     specifies how to set up the
                                                     corresponding argument.
                                                     Values include:

                                                     "by_value"
                                                       The argument is copied
                                                       directly into the
                                                       kernarg.

                                                     "global_buffer"
                                                       A global address space
                                                       pointer to the buffer
                                                       data is passed in the
                                                       kernarg.

                                                     "dynamic_shared_pointer"
                                                       A group address space
                                                       pointer to dynamically
                                                       allocated LDS is passed
                                                       in the kernarg.

                                                     "sampler"
                                                       A global address space
                                                       pointer to a S# is passed
                                                       in the kernarg.

                                                     "image"
                                                       A global address space
                                                       pointer to a T# is passed
                                                       in the kernarg.

                                                     "pipe"
                                                       A global address space
                                                       pointer to an OpenCL pipe
                                                       is passed in the kernarg.

                                                     "queue"
                                                       A global address space
                                                       pointer to an OpenCL
                                                       device enqueue queue is
                                                       passed in the kernarg.

                                                     "hidden_global_offset_x"
                                                       The OpenCL grid dispatch
                                                       global offset for the X
                                                       dimension is passed in
                                                       the kernarg.

                                                     "hidden_global_offset_y"
                                                       The OpenCL grid dispatch
                                                       global offset for the Y
                                                       dimension is passed in
                                                       the kernarg.

                                                     "hidden_global_offset_z"
                                                       The OpenCL grid dispatch
                                                       global offset for the Z
                                                       dimension is passed in
                                                       the kernarg.

                                                     "hidden_none"
                                                       An argument that is not
                                                       used by the kernel. Space
                                                       needs to be left for it,
                                                       but it does not need to
                                                       be set up.

                                                     "hidden_printf_buffer"
                                                       A global address space
                                                       pointer to the runtime
                                                       printf buffer is passed
                                                       in the kernarg.

                                                     "hidden_hostcall_buffer"
                                                       A global address space
                                                       pointer to the runtime
                                                       hostcall buffer is passed
                                                       in the kernarg.

                                                     "hidden_default_queue"
                                                       A global address space
                                                       pointer to the OpenCL
                                                       device enqueue queue that
                                                       should be used by the
                                                       kernel by default is
                                                       passed in the kernarg.

                                                     "hidden_completion_action"
                                                       A global address space
                                                       pointer to help link
                                                       enqueued kernels into the
                                                       ancestor tree for
                                                       determining when the
                                                       parent kernel has
                                                       finished.

                                                     "hidden_multigrid_sync_arg"
                                                       A global address space
                                                       pointer for multi-grid
                                                       synchronization is passed
                                                       in the kernarg.

     ".value_type"          string                   Unused and deprecated. This
                                                     should no longer be
                                                     emitted, but is accepted
                                                     for compatibility.
     ".pointee_align"       integer                  Alignment in bytes of
                                                     pointee type for pointer
                                                     type kernel argument. Must
                                                     be a power of 2. Only
                                                     present if ".value_kind" is
                                                     "dynamic_shared_pointer".
     ".address_space"       string                   Kernel argument address
                                                     space qualifier. Only
                                                     present if ".value_kind" is
                                                     "global_buffer" or
                                                     "dynamic_shared_pointer".
                                                     Values are:

                                                     - "private"
                                                     - "global"
                                                     - "constant"
                                                     - "local"
                                                     - "generic"
                                                     - "region"

                                                     .. TODO::

                                                        Is "global_buffer" only
                                                        "global" or "constant"?
                                                        Is
                                                        "dynamic_shared_pointer"
                                                        always "local"? Can HCC
                                                        allow "generic"? How can
                                                        "private" or "region"
                                                        ever happen?

     ".access"              string                   Kernel argument access
                                                     qualifier. Only present if
                                                     ".value_kind" is "image" or
                                                     "pipe". Values are:

                                                     - "read_only"
                                                     - "write_only"
                                                     - "read_write"

                                                     .. TODO::

                                                        Does this apply to
                                                        "global_buffer"?

     ".actual_access"       string                   The actual memory accesses
                                                     performed by the kernel on
                                                     the kernel argument. Only
                                                     present if ".value_kind" is
                                                     "global_buffer", "image",
                                                     or "pipe". This may be more
                                                     restrictive than indicated
                                                     by ".access" to reflect
                                                     what the kernel actually
                                                     does. If not present then
                                                     the runtime must assume
                                                     what is implied by
                                                     ".access" and ".is_const".
                                                     Values are:

                                                     - "read_only"
                                                     - "write_only"
                                                     - "read_write"

     ".is_const"            boolean                  Indicates if the kernel
                                                     argument is const
                                                     qualified. Only present if
                                                     ".value_kind" is
                                                     "global_buffer".
     ".is_restrict"         boolean                  Indicates if the kernel
                                                     argument is restrict
                                                     qualified. Only present if
                                                     ".value_kind" is
                                                     "global_buffer".
     ".is_volatile"         boolean                  Indicates if the kernel
                                                     argument is volatile
                                                     qualified. Only present if
                                                     ".value_kind" is
                                                     "global_buffer".
     ".is_pipe"             boolean                  Indicates if the kernel
                                                     argument is pipe qualified.
                                                     Only present if
                                                     ".value_kind" is "pipe".

                                                     .. TODO::

                                                        Can "global_buffer" be
                                                        pipe qualified?

     ====================== ============== ========= ================================

.. _amdgpu-amdhsa-code-object-metadata-v4:

Code Object V4 Metadata
+++++++++++++++++++++++

.. warning::
   Code object V4 is not the default code object version emitted by this
   version of LLVM.

Code object V4 metadata is the same as
:ref:`amdgpu-amdhsa-code-object-metadata-v3` with the changes and additions
defined in table :ref:`amdgpu-amdhsa-code-object-metadata-map-table-v4`.

  .. table:: AMDHSA Code Object V4 Metadata Map Changes from :ref:`amdgpu-amdhsa-code-object-metadata-v3`
     :name: amdgpu-amdhsa-code-object-metadata-map-table-v4

     ================= ============== ========= =======================================
     String Key        Value Type     Required? Description
     ================= ============== ========= =======================================
     "amdhsa.version"  sequence of    Required  - The first integer is the major
                       2 integers                 version. Currently 1.
                                                - The second integer is the minor
                                                  version. Currently 1.
     "amdhsa.target"   string         Required  The target name of the code using the syntax:

                                                .. code::

                                                  <target-triple> [ "-" <target-id> ]

                                                A canonical target ID must be
                                                used. See :ref:`amdgpu-target-triples`
                                                and :ref:`amdgpu-target-id`.
     ================= ============== ========= =======================================

..

Kernel Dispatch
~~~~~~~~~~~~~~~

The HSA architected queuing language (AQL) defines a user space memory interface
that can be used to control the dispatch of kernels, in an agent independent
way. An agent can have zero or more AQL queues created for it using an HSA
compatible runtime (see :ref:`amdgpu-os`), in which AQL packets (all of which
are 64 bytes) can be placed. See the *HSA Platform System Architecture
Specification* [HSA]_ for the AQL queue mechanics and packet layouts.

The packet processor of a kernel agent is responsible for detecting and
dispatching HSA kernels from the AQL queues associated with it. For AMD GPUs the
packet processor is implemented by the hardware command processor (CP),
asynchronous dispatch controller (ADC) and shader processor input controller
(SPI).

An HSA compatible runtime can be used to allocate an AQL queue object. It uses
the kernel mode driver to initialize and register the AQL queue with CP.

To dispatch a kernel the following actions are performed. This can occur in the
CPU host program, or from an HSA kernel executing on a GPU.

1. A pointer to an AQL queue for the kernel agent on which the kernel is to be
   executed is obtained.
2. A pointer to the kernel descriptor (see
   :ref:`amdgpu-amdhsa-kernel-descriptor`) of the kernel to execute is obtained.
   It must be for a kernel that is contained in a code object that was loaded
   by an HSA compatible runtime on the kernel agent with which the AQL queue is
   associated.
3. Space is allocated for the kernel arguments using the HSA compatible runtime
   allocator for a memory region with the kernarg property for the kernel agent
   that will execute the kernel. It must be at least 16-byte aligned.
4. Kernel argument values are assigned to the kernel argument memory
   allocation. The layout is defined in the *HSA Programmer's Language
   Reference* [HSA]_. For AMDGPU the kernel execution directly accesses the
   kernel argument memory in the same way constant memory is accessed. (Note
   that the HSA specification allows an implementation to copy the kernel
   argument contents to another location that is accessed by the kernel.)
5. An AQL kernel dispatch packet is created on the AQL queue. The HSA compatible
   runtime API uses 64-bit atomic operations to reserve space in the AQL queue
   for the packet. The packet must be set up, and the final write must use an
   atomic store release to set the packet kind to ensure the packet contents are
   visible to the kernel agent. AQL defines a doorbell signal mechanism to
   notify the kernel agent that the AQL queue has been updated. These rules, and
   the layout of the AQL queue and kernel dispatch packet, are defined in the
   *HSA System Architecture Specification* [HSA]_.
6. A kernel dispatch packet includes information about the actual dispatch,
   such as grid and work-group size, together with information from the code
   object about the kernel, such as segment sizes. The HSA compatible runtime
   queries on the kernel symbol can be used to obtain the code object values
   which are recorded in the :ref:`amdgpu-amdhsa-code-object-metadata`.
7. CP executes micro-code and is responsible for detecting and setting up the
   GPU to execute the wavefronts of a kernel dispatch.
8. CP ensures that when a wavefront starts executing the kernel machine
   code, the scalar general purpose registers (SGPR) and vector general purpose
   registers (VGPR) are set up as required by the machine code. The required
   setup is defined in the :ref:`amdgpu-amdhsa-kernel-descriptor`. The initial
   register state is defined in
   :ref:`amdgpu-amdhsa-initial-kernel-execution-state`.
9. The prolog of the kernel machine code (see
   :ref:`amdgpu-amdhsa-kernel-prolog`) sets up the machine state as necessary
   before continuing executing the machine code that corresponds to the kernel.
10. When the kernel dispatch has completed execution, CP signals the completion
    signal specified in the kernel dispatch packet if not 0.

.. _amdgpu-amdhsa-memory-spaces:

Memory Spaces
~~~~~~~~~~~~~

The memory space properties are:

  .. table:: AMDHSA Memory Spaces
     :name: amdgpu-amdhsa-memory-spaces-table

     ================= =========== ======== ======= ==================
     Memory Space Name HSA Segment Hardware Address NULL Value
                       Name        Name     Size
     ================= =========== ======== ======= ==================
     Private           private     scratch  32      0x00000000
     Local             group       LDS      32      0xFFFFFFFF
     Global            global      global   64      0x0000000000000000
     Constant          constant    *same as 64      0x0000000000000000
                                   global*
     Generic           flat        flat     64      0x0000000000000000
     Region            N/A         GDS      32      *not implemented
                                                    for AMDHSA*
     ================= =========== ======== ======= ==================

The global and constant memory spaces both use global virtual addresses, which
are the same virtual address space used by the CPU. However, some virtual
addresses may only be accessible to the CPU, some only accessible by the GPU,
and some by both.

Using the constant memory space indicates that the data will not change during
the execution of the kernel.
This allows scalar read instructions to be
used. The vector and scalar L1 caches are invalidated of volatile data before
each kernel dispatch execution to allow constant memory to change values between
kernel dispatches.

The local memory space uses the hardware Local Data Store (LDS) which is
automatically allocated when the hardware creates work-groups of wavefronts, and
freed when all the wavefronts of a work-group have terminated. The data store
(DS) instructions can be used to access it.

The private memory space uses the hardware scratch memory support. If the kernel
uses scratch, then the hardware allocates memory that is accessed using
wavefront lane dword (4 byte) interleaving. The mapping used from private
address to physical address is:

  ``wavefront-scratch-base +
  (private-address * wavefront-size * 4) +
  (wavefront-lane-id * 4)``

There are different ways that the wavefront scratch base address is determined
by a wavefront (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). This
memory can be accessed in an interleaved manner using buffer instructions with
the scratch buffer descriptor and per wavefront scratch offset, by the scratch
instructions, or by flat instructions. If each lane of a wavefront accesses the
same private address, the interleaving results in adjacent dwords being accessed
and hence requires fewer cache lines to be fetched. Multi-dword access is not
supported except by flat and scratch instructions in GFX9-GFX10.

The generic address space uses the hardware flat address support available in
GFX7-GFX10. This uses two fixed ranges of virtual addresses (the private and
local apertures), that are outside the range of addressable global memory, to
map from a flat address to a private or local address.
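As a rough illustration, the dword-interleaved scratch mapping above can be
written out in code. This is a minimal sketch only: it assumes the private
address is a dword index and a wave64 wavefront, and the function name and
defaults are illustrative, not part of any runtime API.

```python
# Sketch of the private-address -> physical-address mapping described
# above. Assumes private_dword_address is a dword (4 byte) index and a
# wave64 wavefront; all names here are illustrative only.
def scratch_address(wavefront_scratch_base, private_dword_address, lane_id,
                    wavefront_size=64):
    return (wavefront_scratch_base
            + private_dword_address * wavefront_size * 4
            + lane_id * 4)

# When every lane of a wavefront accesses the same private address, the
# lanes land on adjacent dwords, i.e. one contiguous 256-byte region for
# wave64, which is why fewer cache lines need to be fetched.
addresses = [scratch_address(0x1000, 0, lane) for lane in range(64)]
```

Note how incrementing the private address by one dword strides the physical
address by ``wavefront-size * 4`` bytes, which is why multi-dword private
accesses are not contiguous under this interleaving.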

FLAT instructions can take a flat address and access global, private (scratch),
and group (LDS) memory depending on whether the address is within one of the
aperture ranges. Flat access to scratch requires hardware aperture setup and
setup in the kernel prologue (see
:ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`). Flat access to LDS requires
hardware aperture setup and M0 (GFX7-GFX8) register setup (see
:ref:`amdgpu-amdhsa-kernel-prolog-m0`).

To convert between a segment address and a flat address the base address of the
corresponding aperture can be used. For GFX7-GFX8 these are available in the
:ref:`amdgpu-amdhsa-hsa-aql-queue` the address of which can be obtained with
Queue Ptr SGPR (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). For
GFX9-GFX10 the aperture base addresses are directly available as inline constant
registers ``SRC_SHARED_BASE/LIMIT`` and ``SRC_PRIVATE_BASE/LIMIT``. In 64-bit
address mode the aperture sizes are 2^32 bytes and the base is aligned to 2^32
which makes it easier to convert from flat to segment or segment to flat.

Image and Samplers
~~~~~~~~~~~~~~~~~~

Image and sampler handles created by an HSA compatible runtime (see
:ref:`amdgpu-os`) are 64-bit addresses of a hardware 32-byte V# and 48-byte S#
object respectively. In order to support the HSA ``query_sampler`` operations
two extra dwords are used to store the HSA BRIG enumeration values for the
queries that are not trivially deducible from the S# representation.

HSA Signals
~~~~~~~~~~~

HSA signal handles created by an HSA compatible runtime (see :ref:`amdgpu-os`)
are 64-bit addresses of a structure allocated in memory accessible from both the
CPU and GPU. The structure is defined by the runtime and subject to change
between releases. For example, see [AMD-ROCm-github]_.

.. _amdgpu-amdhsa-hsa-aql-queue:

HSA AQL Queue
~~~~~~~~~~~~~

The HSA AQL queue structure is defined by an HSA compatible runtime (see
:ref:`amdgpu-os`) and subject to change between releases. For example, see
[AMD-ROCm-github]_. For some processors it contains fields needed to implement
certain language features such as the flat address aperture bases. It also
contains fields used by CP, such as those for managing the allocation of
scratch memory.

.. _amdgpu-amdhsa-kernel-descriptor:

Kernel Descriptor
~~~~~~~~~~~~~~~~~

A kernel descriptor consists of the information needed by CP to initiate the
execution of a kernel, including the entry point address of the machine code
that implements the kernel.

Code Object V3 Kernel Descriptor
++++++++++++++++++++++++++++++++

CP microcode requires the Kernel descriptor to be allocated on 64-byte
alignment.

The fields used by CP for code objects before V3 also match those specified in
:ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.

  .. table:: Code Object V3 Kernel Descriptor
     :name: amdgpu-amdhsa-kernel-descriptor-v3-table

     ======= ======= =============================== ============================
     Bits    Size    Field Name                      Description
     ======= ======= =============================== ============================
     31:0    4 bytes GROUP_SEGMENT_FIXED_SIZE        The amount of fixed local
                                                     address space memory
                                                     required for a work-group
                                                     in bytes. This does not
                                                     include any dynamically
                                                     allocated local address
                                                     space memory that may be
                                                     added when the kernel is
                                                     dispatched.
     63:32   4 bytes PRIVATE_SEGMENT_FIXED_SIZE      The amount of fixed
                                                     private address space
                                                     memory required for a
                                                     work-item in bytes.
                                                     Additional space may need to
                                                     be added to this value if
                                                     the call stack has
                                                     non-inlined function calls.
     95:64   4 bytes KERNARG_SIZE                    The size of the kernarg
                                                     memory pointed to by the
                                                     AQL dispatch packet. The
                                                     kernarg memory is used to
                                                     pass arguments to the
                                                     kernel.

                                                     * If the kernarg pointer in
                                                       the dispatch packet is NULL
                                                       then there are no kernel
                                                       arguments.
                                                     * If the kernarg pointer in
                                                       the dispatch packet is
                                                       not NULL and this value
                                                       is 0 then the kernarg
                                                       memory size is
                                                       unspecified.
                                                     * If the kernarg pointer in
                                                       the dispatch packet is
                                                       not NULL and this value
                                                       is not 0 then the value
                                                       specifies the kernarg
                                                       memory size in bytes. It
                                                       is recommended to provide
                                                       a value as it may be used
                                                       by CP to optimize making
                                                       the kernarg memory
                                                       visible to the kernel
                                                       code.

     127:96  4 bytes                                 Reserved, must be 0.
     191:128 8 bytes KERNEL_CODE_ENTRY_BYTE_OFFSET   Byte offset (possibly
                                                     negative) from base
                                                     address of kernel
                                                     descriptor to kernel's
                                                     entry point instruction
                                                     which must be 256 byte
                                                     aligned.
     351:192 20                                      Reserved, must be 0.
             bytes
     383:352 4 bytes COMPUTE_PGM_RSRC3               GFX6-GFX9
                                                       Reserved, must be 0.
                                                     GFX90A
                                                       Compute Shader (CS)
                                                       program settings used by
                                                       CP to set up
                                                       ``COMPUTE_PGM_RSRC3``
                                                       configuration
                                                       register. See
                                                       :ref:`amdgpu-amdhsa-compute_pgm_rsrc3-gfx90a-table`.
                                                     GFX10
                                                       Compute Shader (CS)
                                                       program settings used by
                                                       CP to set up
                                                       ``COMPUTE_PGM_RSRC3``
                                                       configuration
                                                       register. See
                                                       :ref:`amdgpu-amdhsa-compute_pgm_rsrc3-gfx10-table`.
     415:384 4 bytes COMPUTE_PGM_RSRC1               Compute Shader (CS)
                                                     program settings used by
                                                     CP to set up
                                                     ``COMPUTE_PGM_RSRC1``
                                                     configuration
                                                     register. See
                                                     :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
     447:416 4 bytes COMPUTE_PGM_RSRC2               Compute Shader (CS)
                                                     program settings used by
                                                     CP to set up
                                                     ``COMPUTE_PGM_RSRC2``
                                                     configuration
                                                     register.
                                                     See
                                                     :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
     454:448 7 bits  *See separate bits below.*      Enable the setup of the
                                                     SGPR user data registers
                                                     (see
                                                     :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).

                                                     The total number of SGPR
                                                     user data registers
                                                     requested must not exceed
                                                     16 and must match the value in
                                                     ``compute_pgm_rsrc2.user_sgpr.user_sgpr_count``.
                                                     Any requests beyond 16
                                                     will be ignored.
     >448    1 bit   ENABLE_SGPR_PRIVATE_SEGMENT     If the *Target Properties*
                     _BUFFER                         column of
                                                     :ref:`amdgpu-processor-table`
                                                     specifies *Architected flat
                                                     scratch* then not supported
                                                     and must be 0.
     >449    1 bit   ENABLE_SGPR_DISPATCH_PTR
     >450    1 bit   ENABLE_SGPR_QUEUE_PTR
     >451    1 bit   ENABLE_SGPR_KERNARG_SEGMENT_PTR
     >452    1 bit   ENABLE_SGPR_DISPATCH_ID
     >453    1 bit   ENABLE_SGPR_FLAT_SCRATCH_INIT   If the *Target Properties*
                                                     column of
                                                     :ref:`amdgpu-processor-table`
                                                     specifies *Architected flat
                                                     scratch* then not supported
                                                     and must be 0.
     >454    1 bit   ENABLE_SGPR_PRIVATE_SEGMENT
                     _SIZE
     457:455 3 bits                                  Reserved, must be 0.
     458     1 bit   ENABLE_WAVEFRONT_SIZE32         GFX6-GFX9
                                                       Reserved, must be 0.
                                                     GFX10
                                                       - If 0 execute in
                                                         wavefront size 64 mode.
                                                       - If 1 execute in
                                                         native wavefront size
                                                         32 mode.
     463:459 5 bits                                  Reserved, must be 0.
     464     1 bit   RESERVED_464                    Deprecated, must be 0.
     467:465 3 bits                                  Reserved, must be 0.
     468     1 bit   RESERVED_468                    Deprecated, must be 0.
     471:469 3 bits                                  Reserved, must be 0.
     511:472 5 bytes                                 Reserved, must be 0.
     512     **Total size 64 bytes.**
     ======= ====================================================================

..

  .. table:: compute_pgm_rsrc1 for GFX6-GFX10
     :name: amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table

     ======= ======= =============================== ===========================================================================
     Bits    Size    Field Name                      Description
     ======= ======= =============================== ===========================================================================
     5:0     6 bits  GRANULATED_WORKITEM_VGPR_COUNT  Number of vector register
                                                     blocks used by each work-item;
                                                     granularity is device
                                                     specific:

                                                     GFX6-GFX9
                                                       - vgprs_used 0..256
                                                       - max(0, ceil(vgprs_used / 4) - 1)
                                                     GFX90A
                                                       - vgprs_used 0..512
                                                       - vgprs_used = align(arch_vgprs, 4)
                                                         + acc_vgprs
                                                       - max(0, ceil(vgprs_used / 8) - 1)
                                                     GFX10 (wavefront size 64)
                                                       - max_vgpr 1..256
                                                       - max(0, ceil(vgprs_used / 4) - 1)
                                                     GFX10 (wavefront size 32)
                                                       - max_vgpr 1..256
                                                       - max(0, ceil(vgprs_used / 8) - 1)

                                                     Where vgprs_used is defined
                                                     as the highest VGPR number
                                                     explicitly referenced plus
                                                     one.

                                                     Used by CP to set up
                                                     ``COMPUTE_PGM_RSRC1.VGPRS``.

                                                     The :ref:`amdgpu-assembler`
                                                     calculates this
                                                     automatically for the
                                                     selected processor from
                                                     values provided to the
                                                     `.amdhsa_kernel` directive
                                                     by the
                                                     `.amdhsa_next_free_vgpr`
                                                     nested directive (see
                                                     :ref:`amdhsa-kernel-directives-table`).
     9:6     4 bits  GRANULATED_WAVEFRONT_SGPR_COUNT Number of scalar register
                                                     blocks used by a wavefront;
                                                     granularity is device
                                                     specific:

                                                     GFX6-GFX8
                                                       - sgprs_used 0..112
                                                       - max(0, ceil(sgprs_used / 8) - 1)
                                                     GFX9
                                                       - sgprs_used 0..112
                                                       - 2 * max(0, ceil(sgprs_used / 16) - 1)
                                                     GFX10
                                                       Reserved, must be 0.
                                                       (128 SGPRs always
                                                       allocated.)

                                                     Where sgprs_used is
                                                     defined as the highest
                                                     SGPR number explicitly
                                                     referenced plus one, plus
                                                     a target specific number
                                                     of additional special
                                                     SGPRs for VCC,
                                                     FLAT_SCRATCH (GFX7+) and
                                                     XNACK_MASK (GFX8+), and
                                                     any additional
                                                     target specific
                                                     limitations. It does not
                                                     include the 16 SGPRs added
                                                     if a trap handler is
                                                     enabled.

                                                     The target specific
                                                     limitations and special
                                                     SGPR layout are defined in
                                                     the hardware
                                                     documentation, which can
                                                     be found in the
                                                     :ref:`amdgpu-processors`
                                                     table.

                                                     Used by CP to set up
                                                     ``COMPUTE_PGM_RSRC1.SGPRS``.

                                                     The :ref:`amdgpu-assembler`
                                                     calculates this
                                                     automatically for the
                                                     selected processor from
                                                     values provided to the
                                                     `.amdhsa_kernel` directive
                                                     by the
                                                     `.amdhsa_next_free_sgpr`
                                                     and `.amdhsa_reserve_*`
                                                     nested directives (see
                                                     :ref:`amdhsa-kernel-directives-table`).
     11:10   2 bits  PRIORITY                        Must be 0.

                                                     Start executing wavefront
                                                     at the specified priority.

                                                     CP is responsible for
                                                     filling in
                                                     ``COMPUTE_PGM_RSRC1.PRIORITY``.
     13:12   2 bits  FLOAT_ROUND_MODE_32             Wavefront starts execution
                                                     with specified rounding
                                                     mode for single (32
                                                     bit) floating point
                                                     precision floating point
                                                     operations.

                                                     Floating point rounding
                                                     mode values are defined in
                                                     :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.

                                                     Used by CP to set up
                                                     ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
     15:14   2 bits  FLOAT_ROUND_MODE_16_64          Wavefront starts execution
                                                     with specified rounding
                                                     mode for half/double (16
                                                     and 64-bit) floating point
                                                     precision floating point
                                                     operations.

                                                     Floating point rounding
                                                     mode values are defined in
                                                     :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.

                                                     Used by CP to set up
                                                     ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
     17:16   2 bits  FLOAT_DENORM_MODE_32            Wavefront starts execution
                                                     with specified denorm mode
                                                     for single (32
                                                     bit) floating point
                                                     precision floating point
                                                     operations.

                                                     Floating point denorm mode
                                                     values are defined in
                                                     :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.

                                                     Used by CP to set up
                                                     ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
     19:18   2 bits  FLOAT_DENORM_MODE_16_64         Wavefront starts execution
                                                     with specified denorm mode
                                                     for half/double (16
                                                     and 64-bit) floating point
                                                     precision floating point
                                                     operations.

                                                     Floating point denorm mode
                                                     values are defined in
                                                     :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.

                                                     Used by CP to set up
                                                     ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
     20      1 bit   PRIV                            Must be 0.

                                                     Start executing wavefront
                                                     in privilege trap handler
                                                     mode.

                                                     CP is responsible for
                                                     filling in
                                                     ``COMPUTE_PGM_RSRC1.PRIV``.
     21      1 bit   ENABLE_DX10_CLAMP               Wavefront starts execution
                                                     with DX10 clamp mode
                                                     enabled. Used by the vector
                                                     ALU to force DX10 style
                                                     treatment of NaN's (when
                                                     set, clamp NaN to zero,
                                                     otherwise pass NaN
                                                     through).

                                                     Used by CP to set up
                                                     ``COMPUTE_PGM_RSRC1.DX10_CLAMP``.
     22      1 bit   DEBUG_MODE                      Must be 0.

                                                     Start executing wavefront
                                                     in single step mode.

                                                     CP is responsible for
                                                     filling in
                                                     ``COMPUTE_PGM_RSRC1.DEBUG_MODE``.
     23      1 bit   ENABLE_IEEE_MODE                Wavefront starts execution
                                                     with IEEE mode
                                                     enabled. Floating point
                                                     opcodes that support
                                                     exception flag gathering
                                                     will quiet and propagate
                                                     signaling-NaN inputs per
                                                     IEEE 754-2008. Min_dx10 and
                                                     max_dx10 become IEEE
                                                     754-2008 compliant due to
                                                     signaling-NaN propagation
                                                     and quieting.

                                                     Used by CP to set up
                                                     ``COMPUTE_PGM_RSRC1.IEEE_MODE``.
     24      1 bit   BULKY                           Must be 0.

                                                     Only one work-group allowed
                                                     to execute on a compute
                                                     unit.

                                                     CP is responsible for
                                                     filling in
                                                     ``COMPUTE_PGM_RSRC1.BULKY``.
     25      1 bit   CDBG_USER                       Must be 0.

                                                     Flag that can be used to
                                                     control debugging code.

                                                     CP is responsible for
                                                     filling in
                                                     ``COMPUTE_PGM_RSRC1.CDBG_USER``.
     26      1 bit   FP16_OVFL                       GFX6-GFX8
                                                       Reserved, must be 0.
                                                     GFX9-GFX10
                                                       Wavefront starts execution
                                                       with specified fp16 overflow
                                                       mode.

                                                       - If 0, fp16 overflow generates
                                                         +/-INF values.
                                                       - If 1, fp16 overflow that is the
                                                         result of a +/-INF input value
                                                         or divide by 0 produces a +/-INF,
                                                         otherwise clamps computed
                                                         overflow to +/-MAX_FP16 as
                                                         appropriate.

                                                     Used by CP to set up
                                                     ``COMPUTE_PGM_RSRC1.FP16_OVFL``.
     28:27   2 bits                                  Reserved, must be 0.
     29      1 bit   WGP_MODE                        GFX6-GFX9
                                                       Reserved, must be 0.
                                                     GFX10
                                                       - If 0 execute work-groups in
                                                         CU wavefront execution mode.
                                                       - If 1 execute work-groups in
                                                         WGP wavefront execution mode.

                                                     See :ref:`amdgpu-amdhsa-memory-model`.

                                                     Used by CP to set up
                                                     ``COMPUTE_PGM_RSRC1.WGP_MODE``.
     30      1 bit   MEM_ORDERED                     GFX6-GFX9
                                                       Reserved, must be 0.
                                                     GFX10
                                                       Controls the behavior of the
                                                       s_waitcnt's vmcnt and vscnt
                                                       counters.

                                                       - If 0 vmcnt reports completion
                                                         of load and atomic with return
                                                         out of order with sample
                                                         instructions, and the vscnt
                                                         reports the completion of
                                                         store and atomic without
                                                         return in order.
                                                       - If 1 vmcnt reports completion
                                                         of load, atomic with return
                                                         and sample instructions in
                                                         order, and the vscnt reports
                                                         the completion of store and
                                                         atomic without return in order.

                                                     Used by CP to set up
                                                     ``COMPUTE_PGM_RSRC1.MEM_ORDERED``.
     31      1 bit   FWD_PROGRESS                    GFX6-GFX9
                                                       Reserved, must be 0.
                                                     GFX10
                                                       - If 0 execute SIMD wavefronts
                                                         using oldest first policy.
                                                       - If 1 execute SIMD wavefronts to
                                                         ensure wavefronts will make some
                                                         forward progress.

                                                     Used by CP to set up
                                                     ``COMPUTE_PGM_RSRC1.FWD_PROGRESS``.
     32      **Total size 4 bytes.**
     ======= ===================================================================================================================

..

  .. table:: compute_pgm_rsrc2 for GFX6-GFX10
     :name: amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table

     ======= ======= =============================== ===========================================================================
     Bits    Size    Field Name                      Description
     ======= ======= =============================== ===========================================================================
     0       1 bit   ENABLE_PRIVATE_SEGMENT          * Enable the setup of the
                                                       private segment.
                                                     * If the *Target Properties*
                                                       column of
                                                       :ref:`amdgpu-processor-table`
                                                       does not specify
                                                       *Architected flat
                                                       scratch* then enable the
                                                       setup of the SGPR
                                                       wavefront scratch offset
                                                       system register (see
                                                       :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
                                                     * If the *Target Properties*
                                                       column of
                                                       :ref:`amdgpu-processor-table`
                                                       specifies *Architected
                                                       flat scratch* then enable
                                                       the setup of the
                                                       FLAT_SCRATCH register
                                                       pair (see
                                                       :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).

                                                     Used by CP to set up
                                                     ``COMPUTE_PGM_RSRC2.SCRATCH_EN``.
     5:1     5 bits  USER_SGPR_COUNT                 The total number of SGPR
                                                     user data registers
                                                     requested. This number must
                                                     match the number of user
                                                     data registers enabled.

                                                     Used by CP to set up
                                                     ``COMPUTE_PGM_RSRC2.USER_SGPR``.
     6       1 bit   ENABLE_TRAP_HANDLER             Must be 0.

                                                     This bit represents
                                                     ``COMPUTE_PGM_RSRC2.TRAP_PRESENT``,
                                                     which is set by the CP if
                                                     the runtime has installed a
                                                     trap handler.
     7       1 bit   ENABLE_SGPR_WORKGROUP_ID_X      Enable the setup of the
                                                     system SGPR register for
                                                     the work-group id in the X
                                                     dimension (see
                                                     :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).

                                                     Used by CP to set up
                                                     ``COMPUTE_PGM_RSRC2.TGID_X_EN``.
     8       1 bit   ENABLE_SGPR_WORKGROUP_ID_Y      Enable the setup of the
                                                     system SGPR register for
                                                     the work-group id in the Y
                                                     dimension (see
                                                     :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).

                                                     Used by CP to set up
                                                     ``COMPUTE_PGM_RSRC2.TGID_Y_EN``.
     9       1 bit   ENABLE_SGPR_WORKGROUP_ID_Z      Enable the setup of the
                                                     system SGPR register for
                                                     the work-group id in the Z
                                                     dimension (see
                                                     :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).

                                                     Used by CP to set up
                                                     ``COMPUTE_PGM_RSRC2.TGID_Z_EN``.
     10      1 bit   ENABLE_SGPR_WORKGROUP_INFO      Enable the setup of the
                                                     system SGPR register for
                                                     work-group information (see
                                                     :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).

                                                     Used by CP to set up
                                                     ``COMPUTE_PGM_RSRC2.TGID_SIZE_EN``.
     12:11   2 bits  ENABLE_VGPR_WORKITEM_ID         Enable the setup of the
                                                     VGPR system registers used
                                                     for the work-item ID.
                                                     :ref:`amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table`
                                                     defines the values.

                                                     Used by CP to set up
                                                     ``COMPUTE_PGM_RSRC2.TIDIG_CMP_CNT``.
     13      1 bit   ENABLE_EXCEPTION_ADDRESS_WATCH  Must be 0.

                                                     Wavefront starts execution
                                                     with address watch
                                                     exceptions enabled which
                                                     are generated when L1 has
                                                     witnessed a thread access
                                                     an *address of
                                                     interest*.

                                                     CP is responsible for
                                                     filling in the address
                                                     watch bit in
                                                     ``COMPUTE_PGM_RSRC2.EXCP_EN_MSB``
                                                     according to what the
                                                     runtime requests.
     14      1 bit   ENABLE_EXCEPTION_MEMORY         Must be 0.

                                                     Wavefront starts execution
                                                     with memory violation
                                                     exceptions enabled which
                                                     are generated when a memory
                                                     violation has occurred for
                                                     this wavefront from L1 or
                                                     LDS
                                                     (write-to-read-only-memory,
                                                     mis-aligned atomic, LDS
                                                     address out of range,
                                                     illegal address, etc.).

                                                     CP sets the memory
                                                     violation bit in
                                                     ``COMPUTE_PGM_RSRC2.EXCP_EN_MSB``
                                                     according to what the
                                                     runtime requests.
     23:15   9 bits  GRANULATED_LDS_SIZE             Must be 0.

                                                     CP uses the rounded value
                                                     from the dispatch packet,
                                                     not this value, as the
                                                     dispatch may contain
                                                     dynamically allocated group
                                                     segment memory. CP writes
                                                     directly to
                                                     ``COMPUTE_PGM_RSRC2.LDS_SIZE``.

                                                     Amount of group segment
                                                     (LDS) to allocate for each
                                                     work-group. Granularity is
                                                     device specific:

                                                     GFX6
                                                       roundup(lds-size / (64 * 4))
                                                     GFX7-GFX10
                                                       roundup(lds-size / (128 * 4))

     24      1 bit   ENABLE_EXCEPTION_IEEE_754_FP    Wavefront starts execution
                     _INVALID_OPERATION              with specified exceptions
                                                     enabled.

                                                     Used by CP to set up
                                                     ``COMPUTE_PGM_RSRC2.EXCP_EN``
                                                     (set from bits 0..6).

                                                     IEEE 754 FP Invalid
                                                     Operation
     25      1 bit   ENABLE_EXCEPTION_FP_DENORMAL    FP Denormal one or more
                     _SOURCE                         input operands is a
                                                     denormal number
     26      1 bit   ENABLE_EXCEPTION_IEEE_754_FP    IEEE 754 FP Division by
                     _DIVISION_BY_ZERO               Zero
     27      1 bit   ENABLE_EXCEPTION_IEEE_754_FP    IEEE 754 FP Overflow
                     _OVERFLOW
     28      1 bit   ENABLE_EXCEPTION_IEEE_754_FP    IEEE 754 FP Underflow
                     _UNDERFLOW
     29      1 bit   ENABLE_EXCEPTION_IEEE_754_FP    IEEE 754 FP Inexact
                     _INEXACT
     30      1 bit   ENABLE_EXCEPTION_INT_DIVIDE_BY  Integer Division by Zero
                     _ZERO                           (rcp_iflag_f32 instruction
                                                     only)
     31      1 bit                                   Reserved, must be 0.
     32      **Total size 4 bytes.**
     ======= ===================================================================================================================

..

  .. table:: compute_pgm_rsrc3 for GFX90A
     :name: amdgpu-amdhsa-compute_pgm_rsrc3-gfx90a-table

     ======= ======= =============================== ===========================================================================
     Bits    Size    Field Name                      Description
     ======= ======= =============================== ===========================================================================
     5:0     6 bits  ACCUM_OFFSET                    Offset of the first AccVGPR in the unified register file. Granularity 4.
                                                     Value 0-63. 0 - accum-offset = 4, 1 - accum-offset = 8, ...,
                                                     63 - accum-offset = 256.
     15:6    10                                      Reserved, must be 0.
             bits
     16      1 bit   TG_SPLIT                        - If 0 the waves of a work-group are
                                                       launched in the same CU.
                                                     - If 1 the waves of a work-group can be
                                                       launched in different CUs. The waves
                                                       cannot use S_BARRIER or LDS.
     31:17   15                                      Reserved, must be 0.
             bits
     32      **Total size 4 bytes.**
     ======= ===================================================================================================================

..

  .. table:: compute_pgm_rsrc3 for GFX10
     :name: amdgpu-amdhsa-compute_pgm_rsrc3-gfx10-table

     ======= ======= =============================== ===========================================================================
     Bits    Size    Field Name                      Description
     ======= ======= =============================== ===========================================================================
     3:0     4 bits  SHARED_VGPR_COUNT               Number of shared VGPRs for wavefront size 64. Granularity 8. Value 0-120.
                                                     compute_pgm_rsrc1.vgprs + shared_vgpr_cnt cannot exceed 64.
     31:4    28                                      Reserved, must be 0.
             bits
     32      **Total size 4 bytes.**
     ======= ===================================================================================================================

..

  .. table:: Floating Point Rounding Mode Enumeration Values
     :name: amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table

     ====================================== ===== ==============================
     Enumeration Name                       Value Description
     ====================================== ===== ==============================
     FLOAT_ROUND_MODE_NEAR_EVEN             0     Round Ties To Even
     FLOAT_ROUND_MODE_PLUS_INFINITY         1     Round Toward +infinity
     FLOAT_ROUND_MODE_MINUS_INFINITY        2     Round Toward -infinity
     FLOAT_ROUND_MODE_ZERO                  3     Round Toward 0
     ====================================== ===== ==============================

..

  .. table:: Floating Point Denorm Mode Enumeration Values
     :name: amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table

     ====================================== ===== ==============================
     Enumeration Name                       Value Description
     ====================================== ===== ==============================
     FLOAT_DENORM_MODE_FLUSH_SRC_DST        0     Flush Source and Destination
                                                  Denorms
     FLOAT_DENORM_MODE_FLUSH_DST            1     Flush Output Denorms
     FLOAT_DENORM_MODE_FLUSH_SRC            2     Flush Source Denorms
     FLOAT_DENORM_MODE_FLUSH_NONE           3     No Flush
     ====================================== ===== ==============================

..

  .. table:: System VGPR Work-Item ID Enumeration Values
     :name: amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table

     ======================================== ===== ============================
     Enumeration Name                         Value Description
     ======================================== ===== ============================
     SYSTEM_VGPR_WORKITEM_ID_X                0     Set work-item X dimension
                                                    ID.
     SYSTEM_VGPR_WORKITEM_ID_X_Y              1     Set work-item X and Y
                                                    dimensions ID.
     SYSTEM_VGPR_WORKITEM_ID_X_Y_Z            2     Set work-item X, Y and Z
                                                    dimensions ID.
     SYSTEM_VGPR_WORKITEM_ID_UNDEFINED        3     Undefined.
     ======================================== ===== ============================

.. _amdgpu-amdhsa-initial-kernel-execution-state:

Initial Kernel Execution State
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This section defines the register state that will be set up by the packet
processor prior to the start of execution of every wavefront. This is limited by
the constraints of the hardware controllers of CP/ADC/SPI.

The order of the SGPR registers is defined, but the compiler can specify which
ones are actually set up in the kernel descriptor using the ``enable_sgpr_*``
bit fields (see :ref:`amdgpu-amdhsa-kernel-descriptor`). The register numbers
used for enabled registers are dense starting at SGPR0: the first enabled
register is SGPR0, the next enabled register is SGPR1 etc.; disabled registers
do not have an SGPR number.

The initial SGPRs comprise up to 16 User SGPRs that are set by CP and apply to
all wavefronts of the grid. It is possible to specify more than 16 User SGPRs
using the ``enable_sgpr_*`` bit fields, in which case only the first 16 are
actually initialized. These are then immediately followed by the System SGPRs
that are set up by ADC/SPI and can have different values for each wavefront of
the grid dispatch.

SGPR register initial state is defined in
:ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.

  .. table:: SGPR Register Set Up Order
     :name: amdgpu-amdhsa-sgpr-register-set-up-order-table

     ========== ========================== ====== ==============================
     SGPR Order Name                       Number Description
                (kernel descriptor enable  of
                field)                     SGPRs
     ========== ========================== ====== ==============================
     First      Private Segment Buffer     4      See
                (enable_sgpr_private              :ref:`amdgpu-amdhsa-kernel-prolog-private-segment-buffer`.
                _segment_buffer)
     then       Dispatch Ptr               2      64-bit address of AQL dispatch
                (enable_sgpr_dispatch_ptr)        packet for kernel dispatch
                                                  actually executing.
     then       Queue Ptr                  2      64-bit address of amd_queue_t
                (enable_sgpr_queue_ptr)           object for AQL queue on which
                                                  the dispatch packet was
                                                  queued.
     then       Kernarg Segment Ptr        2      64-bit address of Kernarg
                (enable_sgpr_kernarg              segment. This is directly
                _segment_ptr)                     copied from the
                                                  kernarg_address in the kernel
                                                  dispatch packet.

                                                  Having CP load it once avoids
                                                  loading it at the beginning of
                                                  every wavefront.
     then       Dispatch Id                2      64-bit Dispatch ID of the
                (enable_sgpr_dispatch_id)         dispatch packet being
                                                  executed.
     then       Flat Scratch Init          2      See
                (enable_sgpr_flat_scratch         :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`.
                _init)
     then       Private Segment Size       1      The 32-bit byte size of a
                (enable_sgpr_private              single work-item's memory
                _segment_size)                    allocation. This is the
                                                  value from the kernel
                                                  dispatch packet Private
                                                  Segment Byte Size rounded up
                                                  by CP to a multiple of
                                                  DWORD.

                                                  Having CP load it once avoids
                                                  loading it at the beginning of
                                                  every wavefront.

                                                  This is not used for
                                                  GFX7-GFX8 since it is the same
                                                  value as the second SGPR of
                                                  Flat Scratch Init. However, it
                                                  may be needed for GFX9-GFX10 which
                                                  changes the meaning of the
                                                  Flat Scratch Init value.
4319 then Work-Group Id X 1 32-bit work-group id in X 4320 (enable_sgpr_workgroup_id dimension of grid for 4321 _X) wavefront. 4322 then Work-Group Id Y 1 32-bit work-group id in Y 4323 (enable_sgpr_workgroup_id dimension of grid for 4324 _Y) wavefront. 4325 then Work-Group Id Z 1 32-bit work-group id in Z 4326 (enable_sgpr_workgroup_id dimension of grid for 4327 _Z) wavefront. 4328 then Work-Group Info 1 {first_wavefront, 14'b0000, 4329 (enable_sgpr_workgroup ordered_append_term[10:0], 4330 _info) threadgroup_size_in_wavefronts[5:0]} 4331 then Scratch Wavefront Offset 1 See 4332 (enable_sgpr_private :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`. 4333 _segment_wavefront_offset) and 4334 :ref:`amdgpu-amdhsa-kernel-prolog-private-segment-buffer`. 4335 ========== ========================== ====== ============================== 4336 4337The order of the VGPR registers is defined, but the compiler can specify which 4338ones are actually setup in the kernel descriptor using the ``enable_vgpr*`` bit 4339fields (see :ref:`amdgpu-amdhsa-kernel-descriptor`). The register numbers used 4340for enabled registers are dense starting at VGPR0: the first enabled register is 4341VGPR0, the next enabled register is VGPR1 etc.; disabled registers do not have a 4342VGPR number. 4343 4344There are different methods used for the VGPR initial state: 4345 4346* Unless the *Target Properties* column of :ref:`amdgpu-processor-table` 4347 specifies otherwise, a separate VGPR register is used per work-item ID. The 4348 VGPR register initial state for this method is defined in 4349 :ref:`amdgpu-amdhsa-vgpr-register-set-up-order-for-unpacked-work-item-id-method-table`. 4350* If *Target Properties* column of :ref:`amdgpu-processor-table` 4351 specifies *Packed work-item IDs*, the initial value of VGPR0 register is used 4352 for all work-item IDs. The register layout for this method is defined in 4353 :ref:`amdgpu-amdhsa-register-layout-for-packed-work-item-id-method-table`. 4354 4355 .. 
table:: VGPR Register Set Up Order for Unpacked Work-Item ID Method 4356 :name: amdgpu-amdhsa-vgpr-register-set-up-order-for-unpacked-work-item-id-method-table 4357 4358 ========== ========================== ====== ============================== 4359 VGPR Order Name Number Description 4360 (kernel descriptor enable of 4361 field) VGPRs 4362 ========== ========================== ====== ============================== 4363 First Work-Item Id X 1 32-bit work-item id in X 4364 (Always initialized) dimension of work-group for 4365 wavefront lane. 4366 then Work-Item Id Y 1 32-bit work-item id in Y 4367 (enable_vgpr_workitem_id dimension of work-group for 4368 > 0) wavefront lane. 4369 then Work-Item Id Z 1 32-bit work-item id in Z 4370 (enable_vgpr_workitem_id dimension of work-group for 4371 > 1) wavefront lane. 4372 ========== ========================== ====== ============================== 4373 4374.. 4375 4376 .. table:: Register Layout for Packed Work-Item ID Method 4377 :name: amdgpu-amdhsa-register-layout-for-packed-work-item-id-method-table 4378 4379 ======= ======= ================ ========================================= 4380 Bits Size Field Name Description 4381 ======= ======= ================ ========================================= 4382 0:9 10 bits Work-Item Id X Work-item id in X 4383 dimension of work-group for 4384 wavefront lane. 4385 4386 Always initialized. 4387 4388 10:19 10 bits Work-Item Id Y Work-item id in Y 4389 dimension of work-group for 4390 wavefront lane. 4391 4392 Initialized if enable_vgpr_workitem_id > 4393 0, otherwise set to 0. 4394 20:29 10 bits Work-Item Id Z Work-item id in Z 4395 dimension of work-group for 4396 wavefront lane. 4397 4398 Initialized if enable_vgpr_workitem_id > 4399 1, otherwise set to 0. 4400 30:31 2 bits Reserved, set to 0. 4401 ======= ======= ================ ========================================= 4402 4403The setting of registers is done by GPU CP/ADC/SPI hardware as follows: 4404 44051. 
SGPRs before the Work-Group Ids are set by CP using the 16 User Data 4406 registers. 44072. Work-group Id registers X, Y, Z are set by ADC which supports any 4408 combination including none. 44093. Scratch Wavefront Offset is set by SPI in a per wavefront basis which is why 4410 its value cannot be included with the flat scratch init value which is per 4411 queue (see :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`). 44124. The VGPRs are set by SPI which only supports specifying either (X), (X, Y) 4413 or (X, Y, Z). 44145. Flat Scratch register pair initialization is described in 4415 :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`. 4416 4417The global segment can be accessed either using buffer instructions (GFX6 which 4418has V# 64-bit address support), flat instructions (GFX7-GFX10), or global 4419instructions (GFX9-GFX10). 4420 4421If buffer operations are used, then the compiler can generate a V# with the 4422following properties: 4423 4424* base address of 0 4425* no swizzle 4426* ATC: 1 if IOMMU present (such as APU) 4427* ptr64: 1 4428* MTYPE set to support memory coherence that matches the runtime (such as CC for 4429 APU and NC for dGPU). 4430 4431.. _amdgpu-amdhsa-kernel-prolog: 4432 4433Kernel Prolog 4434~~~~~~~~~~~~~ 4435 4436The compiler performs initialization in the kernel prologue depending on the 4437target and information about things like stack usage in the kernel and called 4438functions. Some of this initialization requires the compiler to request certain 4439User and System SGPRs be present in the 4440:ref:`amdgpu-amdhsa-initial-kernel-execution-state` via the 4441:ref:`amdgpu-amdhsa-kernel-descriptor`. 4442 4443.. _amdgpu-amdhsa-kernel-prolog-cfi: 4444 4445CFI 4446+++ 4447 44481. The CFI return address is undefined. 4449 44502. The CFI CFA is defined using an expression which evaluates to a location 4451 description that comprises one memory location description for the 4452 ``DW_ASPACE_AMDGPU_private_lane`` address space address ``0``. 
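
As a brief aside, the *Packed work-item IDs* method described earlier delivers
all three work-item IDs in VGPR0 rather than in separate VGPRs. The following
sketch (plain Python, purely illustrative and not part of the backend or any
LLVM API; the function name is hypothetical) decodes the fields using the bit
positions from the register layout table above:

```python
def unpack_work_item_id(vgpr0: int) -> tuple[int, int, int]:
    """Decode the packed work-item ID layout of VGPR0 (illustrative only)."""
    x = vgpr0 & 0x3FF          # bits 0:9, always initialized
    y = (vgpr0 >> 10) & 0x3FF  # bits 10:19, zero unless enable_vgpr_workitem_id > 0
    z = (vgpr0 >> 20) & 0x3FF  # bits 20:29, zero unless enable_vgpr_workitem_id > 1
    return x, y, z

# Example: X=5, Y=3, Z=1 packed into one 32-bit value.
packed = 5 | (3 << 10) | (1 << 20)
assert unpack_work_item_id(packed) == (5, 3, 1)
```

The 10-bit field masks implicitly discard the reserved bits 30:31.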

.. _amdgpu-amdhsa-kernel-prolog-m0:

M0
++

GFX6-GFX8
  The M0 register must be initialized with a value at least the total LDS size
  if the kernel may access LDS via DS or flat operations. The total LDS size is
  available in the dispatch packet. For M0, it is also possible to use the
  maximum possible value of LDS for the given target (0x7FFF for GFX6 and
  0xFFFF for GFX7-GFX8).
GFX9-GFX10
  The M0 register is not used for range checking LDS accesses and so does not
  need to be initialized in the prolog.

.. _amdgpu-amdhsa-kernel-prolog-stack-pointer:

Stack Pointer
+++++++++++++

If the kernel has function calls it must set up the ABI stack pointer described
in :ref:`amdgpu-amdhsa-function-call-convention-non-kernel-functions` by setting
SGPR32 to the unswizzled scratch offset of the address past the last local
allocation.

.. _amdgpu-amdhsa-kernel-prolog-frame-pointer:

Frame Pointer
+++++++++++++

If the kernel needs a frame pointer for the reasons defined in
``SIFrameLowering`` then SGPR33 is used and is always set to ``0`` in the
kernel prolog. If a frame pointer is not required then all uses of the frame
pointer are replaced with immediate ``0`` offsets.

.. _amdgpu-amdhsa-kernel-prolog-flat-scratch:

Flat Scratch
++++++++++++

There are different methods used for initializing flat scratch:

* If the *Target Properties* column of :ref:`amdgpu-processor-table`
  specifies *Does not support generic address space*:

  Flat scratch is not supported and there is no flat scratch register pair.

* If the *Target Properties* column of :ref:`amdgpu-processor-table`
  specifies *Offset flat scratch*:

  If the kernel or any function it calls may use flat operations to access
  scratch memory, the prolog code must set up the FLAT_SCRATCH register pair
  (FLAT_SCRATCH_LO/FLAT_SCRATCH_HI). Initialization uses Flat Scratch Init and
  Scratch Wavefront Offset SGPR registers (see
  :ref:`amdgpu-amdhsa-initial-kernel-execution-state`):

  1. The low word of Flat Scratch Init is the 32-bit byte offset from
     ``SH_HIDDEN_PRIVATE_BASE_VIMID`` to the base of scratch backing memory
     being managed by SPI for the queue executing the kernel dispatch. This is
     the same value used in the Scratch Segment Buffer V# base address.

     CP obtains this from the runtime. (The Scratch Segment Buffer base address
     is ``SH_HIDDEN_PRIVATE_BASE_VIMID`` plus this offset.)

     The prolog must add the value of Scratch Wavefront Offset to get the
     wavefront's byte scratch backing memory offset from
     ``SH_HIDDEN_PRIVATE_BASE_VIMID``.

     The Scratch Wavefront Offset must also be used as an offset with Private
     segment address when using the Scratch Segment Buffer.

     Since FLAT_SCRATCH_HI is in units of 256 bytes, the offset must be right
     shifted by 8 before moving into FLAT_SCRATCH_HI.

     FLAT_SCRATCH_HI corresponds to SGPRn-4 on GFX7, and SGPRn-6 on GFX8 (where
     SGPRn is the highest numbered SGPR allocated to the wavefront).
     FLAT_SCRATCH_HI is multiplied by 256 (as it is in units of 256 bytes) and
     added to ``SH_HIDDEN_PRIVATE_BASE_VIMID`` to calculate the per wavefront
     FLAT SCRATCH BASE in flat memory instructions that access the scratch
     aperture.
  2. The second word of Flat Scratch Init is the 32-bit byte size of a single
     work-item's scratch memory usage.

     CP obtains this from the runtime, and it is always a multiple of DWORD. CP
     checks that the value in the kernel dispatch packet Private Segment Byte
     Size is not larger and requests the runtime to increase the queue's
     scratch size if necessary.

     CP directly loads from the kernel dispatch packet Private Segment Byte
     Size field and rounds up to a multiple of DWORD. Having CP load it once
     avoids loading it at the beginning of every wavefront.

     The kernel prolog code must move it to FLAT_SCRATCH_LO which is SGPRn-3 on
     GFX7 and SGPRn-5 on GFX8. FLAT_SCRATCH_LO is used as the FLAT SCRATCH SIZE
     in flat memory instructions.

* If the *Target Properties* column of :ref:`amdgpu-processor-table`
  specifies *Absolute flat scratch*:

  If the kernel or any function it calls may use flat operations to access
  scratch memory, the prolog code must set up the FLAT_SCRATCH register pair
  (FLAT_SCRATCH_LO/FLAT_SCRATCH_HI which are in SGPRn-4/SGPRn-3).
  Initialization uses Flat Scratch Init and Scratch Wavefront Offset SGPR
  registers (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`):

  The Flat Scratch Init is the 64-bit address of the base of scratch backing
  memory being managed by SPI for the queue executing the kernel dispatch.

  CP obtains this from the runtime.

  The kernel prolog must add the value of the wave's Scratch Wavefront Offset
  and move the result as a 64-bit value to the FLAT_SCRATCH SGPR register pair
  which is SGPRn-6 and SGPRn-5. It is used as the FLAT SCRATCH BASE in flat
  memory instructions.

  The Scratch Wavefront Offset must also be used as an offset with Private
  segment address when using the Scratch Segment Buffer (see
  :ref:`amdgpu-amdhsa-kernel-prolog-private-segment-buffer`).

* If the *Target Properties* column of :ref:`amdgpu-processor-table`
  specifies *Architected flat scratch*:

  If ENABLE_PRIVATE_SEGMENT is enabled in
  :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table` then the FLAT_SCRATCH
  register pair will be initialized to the 64-bit address of the base of scratch
  backing memory being managed by SPI for the queue executing the kernel
  dispatch plus the value of the wave's Scratch Wavefront Offset for use as the
  flat scratch base in flat memory instructions.

.. _amdgpu-amdhsa-kernel-prolog-private-segment-buffer:

Private Segment Buffer
++++++++++++++++++++++

If the *Target Properties* column of :ref:`amdgpu-processor-table` specifies
*Architected flat scratch* then a Private Segment Buffer is not supported.
Instead the flat SCRATCH instructions are used.

Otherwise, the Private Segment Buffer SGPR register is used to initialize 4
SGPRs that are used as a V# to access scratch. CP uses the value provided by
the runtime. It is used, together with Scratch Wavefront Offset as an offset,
to access the private memory space using a segment address. See
:ref:`amdgpu-amdhsa-initial-kernel-execution-state`.

The scratch V# is a four-aligned SGPR and always selected for the kernel as
follows:

  - If it is known during instruction selection that there is stack usage,
    SGPR0-3 is reserved for use as the scratch V#. Stack usage is assumed if
    optimizations are disabled (``-O0``), if stack objects already exist (for
    locals, etc.), or if there are any function calls.

  - Otherwise, four high numbered SGPRs beginning at a four-aligned SGPR index
    are reserved for the tentative scratch V#. These will be used if it is
    determined that spilling is needed.

    - If no use is made of the tentative scratch V#, then it is unreserved,
      and the register count is determined ignoring it.
    - If use is made of the tentative scratch V#, then its register numbers
      are shifted to the first four-aligned SGPR index after the highest one
      allocated by the register allocator, and all uses are updated. The
      register count includes them in the shifted location.
    - In either case, if the processor has the SGPR allocation bug, the
      tentative allocation is not shifted or unreserved in order to ensure
      the register count is higher to work around the bug.

  .. note::

    This approach of using a tentative scratch V# and shifting the register
    numbers if used avoids having to perform register allocation a second
    time if the tentative V# is eliminated. This is more efficient and
    avoids the problem that the second register allocation may perform
    spilling which will fail as there is no longer a scratch V#.

When the kernel prolog code is being emitted it is known whether the scratch V#
described above is actually used. If it is, the prolog code must set it up by
copying the Private Segment Buffer to the scratch V# registers and then adding
the Private Segment Wavefront Offset to the queue base address in the V#. The
result is a V# with a base address pointing to the beginning of the wavefront
scratch backing memory.

The Private Segment Buffer is always requested, but the Private Segment
Wavefront Offset is only requested if it is used (see
:ref:`amdgpu-amdhsa-initial-kernel-execution-state`).

.. _amdgpu-amdhsa-memory-model:

Memory Model
~~~~~~~~~~~~

This section describes the mapping of the LLVM memory model onto AMDGPU machine
code (see :ref:`memmodel`).

The AMDGPU backend supports the memory synchronization scopes specified in
:ref:`amdgpu-memory-scopes`.

The code sequences used to implement the memory model specify the order of
instructions that a single thread must execute. The ``s_waitcnt`` and cache
management instructions such as ``buffer_wbinvl1_vol`` are defined with respect
to other memory instructions executed by the same thread. This allows them to be
moved earlier or later, which can allow them to be combined with other instances
of the same instruction, or hoisted/sunk out of loops to improve performance.
Only the instructions related to the memory model are given; additional
``s_waitcnt`` instructions are required to ensure registers are defined before
being used. These may be able to be combined with the memory model ``s_waitcnt``
instructions as described above.

The AMDGPU backend supports the following memory models:

  HSA Memory Model [HSA]_
    The HSA memory model uses a single happens-before relation for all address
    spaces (see :ref:`amdgpu-address-spaces`).
  OpenCL Memory Model [OpenCL]_
    The OpenCL memory model has separate happens-before relations for the
    global and local address spaces. Only a fence specifying both global and
    local address space, and seq_cst instructions, join the relationships.
    Since the LLVM ``fence`` instruction does not allow an address space to be
    specified, the OpenCL fence has to conservatively assume both local and
    global address space was specified. However, optimizations can often be
    done to eliminate the additional ``s_waitcnt`` instructions when there are
    no intervening memory instructions which access the corresponding address
    space. The code sequences in the table indicate what can be omitted for the
    OpenCL memory model. The target triple environment is used to determine if
    the source language is OpenCL (see :ref:`amdgpu-opencl`).

``ds/flat_load/store/atomic`` instructions to local memory are termed LDS
operations.

``buffer/global/flat_load/store/atomic`` instructions to global memory are
termed vector memory operations.
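
To make the structure of the code-sequence tables later in this section easier
to follow, here is a small excerpt of the GFX6-GFX9 global-address-space
sequences recast as a Python mapping. It is purely a reading aid: the
instruction strings mirror the table text, and ``sequence`` is a hypothetical
helper, not an LLVM API.

```python
# Illustrative excerpt of the GFX6-GFX9 code-sequence tables, keyed by
# (operation, LLVM memory ordering, sync scope) for the global address
# space. The tables in this section are authoritative.
GFX6_GFX9_GLOBAL_SEQUENCES = {
    ("load", "monotonic", "workgroup"): ["buffer/global/flat_load"],
    ("load", "monotonic", "agent"): ["buffer/global/flat_load glc=1"],
    ("load", "acquire", "agent"): [
        "buffer/global_load glc=1",  # bypass L1 so a fresh value is read
        "s_waitcnt vmcnt(0)",        # load completes before the invalidate
        "buffer_wbinvl1_vol",        # discard stale global data in L1
    ],
    ("atomicrmw", "acquire", "agent"): [
        "buffer/global_atomic",
        "s_waitcnt vmcnt(0)",
        "buffer_wbinvl1_vol",
    ],
}

def sequence(op: str, ordering: str, scope: str) -> list[str]:
    """Look up the machine-instruction sequence for an atomic operation."""
    return GFX6_GFX9_GLOBAL_SEQUENCES[(op, ordering, scope)]
```

For example, ``sequence("load", "acquire", "agent")`` yields the
three-instruction sequence ending in ``buffer_wbinvl1_vol`` that the
corresponding acquire row of the GFX6-GFX9 table specifies.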

Private address space uses ``buffer_load/store`` using the scratch V#
(GFX6-GFX8), or ``scratch_load/store`` (GFX9-GFX10). Since only a single thread
is accessing the memory, atomic memory orderings are not meaningful, and all
accesses are treated as non-atomic.

Constant address space uses ``buffer/global_load`` instructions (or equivalent
scalar memory instructions). Since the constant address space contents do not
change during the execution of a kernel dispatch, it is not legal to perform
stores, atomic memory orderings are not meaningful, and all accesses are
treated as non-atomic.

A memory synchronization scope wider than work-group is not meaningful for the
group (LDS) address space and is treated as work-group.

The memory model does not support the region address space which is treated as
non-atomic.

Acquire memory ordering is not meaningful on store atomic instructions and is
treated as non-atomic.

Release memory ordering is not meaningful on load atomic instructions and is
treated as non-atomic.

Acquire-release memory ordering is not meaningful on load or store atomic
instructions and is treated as acquire and release respectively.

The memory order also adds the single thread optimization constraints defined
in table
:ref:`amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-table`.

  .. table:: AMDHSA Memory Model Single Thread Optimization Constraints
     :name: amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-table

     ============ ==============================================================
     LLVM Memory  Optimization Constraints
     Ordering
     ============ ==============================================================
     unordered    *none*
     monotonic    *none*
     acquire      - If a load atomic/atomicrmw then no following load/load
                    atomic/store/store atomic/atomicrmw/fence instruction can
                    be moved before the acquire.
                  - If a fence then same as load atomic, plus no preceding
                    associated fence-paired-atomic can be moved after the fence.
     release      - If a store atomic/atomicrmw then no preceding load/load
                    atomic/store/store atomic/atomicrmw/fence instruction can
                    be moved after the release.
                  - If a fence then same as store atomic, plus no following
                    associated fence-paired-atomic can be moved before the
                    fence.
     acq_rel      Same constraints as both acquire and release.
     seq_cst      - If a load atomic then same constraints as acquire, plus no
                    preceding sequentially consistent load atomic/store
                    atomic/atomicrmw/fence instruction can be moved after the
                    seq_cst.
                  - If a store atomic then the same constraints as release,
                    plus no following sequentially consistent load atomic/store
                    atomic/atomicrmw/fence instruction can be moved before the
                    seq_cst.
                  - If an atomicrmw/fence then same constraints as acq_rel.
     ============ ==============================================================

The code sequences used to implement the memory model are defined in the
following sections:

* :ref:`amdgpu-amdhsa-memory-model-gfx6-gfx9`
* :ref:`amdgpu-amdhsa-memory-model-gfx90a`
* :ref:`amdgpu-amdhsa-memory-model-gfx10`

.. _amdgpu-amdhsa-memory-model-gfx6-gfx9:

Memory Model GFX6-GFX9
++++++++++++++++++++++

For GFX6-GFX9:

* Each agent has multiple shader arrays (SA).
* Each SA has multiple compute units (CU).
* Each CU has multiple SIMDs that execute wavefronts.
* The wavefronts for a single work-group are executed in the same CU but may be
  executed by different SIMDs.
* Each CU has a single LDS memory shared by the wavefronts of the work-groups
  executing on it.
* All LDS operations of a CU are performed as wavefront wide operations in a
  global order and involve no caching. Completion is reported to a wavefront in
  execution order.
* The LDS memory has multiple request queues shared by the SIMDs of a
  CU. Therefore, the LDS operations performed by different wavefronts of a
  work-group can be reordered relative to each other, which can result in
  reordering the visibility of vector memory operations with respect to LDS
  operations of other wavefronts in the same work-group. A ``s_waitcnt
  lgkmcnt(0)`` is required to ensure synchronization between LDS operations and
  vector memory operations between wavefronts of a work-group, but not between
  operations performed by the same wavefront.
* The vector memory operations are performed as wavefront wide operations and
  completion is reported to a wavefront in execution order. The exception is
  that for GFX7-GFX9 ``flat_load/store/atomic`` instructions can report out of
  vector memory order if they access LDS memory, and out of LDS operation order
  if they access global memory.
* The vector memory operations access a single vector L1 cache shared by all
  SIMDs in a CU. Therefore, no special action is required for coherence between
  the lanes of a single wavefront, or for coherence between wavefronts in the
  same work-group. A ``buffer_wbinvl1_vol`` is required for coherence between
  wavefronts executing in different work-groups as they may be executing on
  different CUs.
* The scalar memory operations access a scalar L1 cache shared by all wavefronts
  on a group of CUs. The scalar and vector L1 caches are not coherent. However,
  scalar operations are used in a restricted way so do not impact the memory
  model. See :ref:`amdgpu-amdhsa-memory-spaces`.
* The vector and scalar memory operations use an L2 cache shared by all CUs on
  the same agent.
* The L2 cache has independent channels to service disjoint ranges of virtual
  addresses.
* Each CU has a separate request queue per channel. Therefore, the vector and
  scalar memory operations performed by wavefronts executing in different
  work-groups (which may be executing on different CUs) of an agent can be
  reordered relative to each other. A ``s_waitcnt vmcnt(0)`` is required to
  ensure synchronization between vector memory operations of different CUs. It
  ensures a previous vector memory operation has completed before executing a
  subsequent vector memory or LDS operation and so can be used to meet the
  requirements of acquire and release.
* The L2 cache can be kept coherent with other agents on some targets, or ranges
  of virtual addresses can be set up to bypass it to ensure system coherence.

Scalar memory operations are only used to access memory that is proven to not
change during the execution of the kernel dispatch. This includes constant
address space and global address space for program scope ``const`` variables.
Therefore, the kernel machine code does not have to maintain the scalar cache to
ensure it is coherent with the vector caches. The scalar and vector caches are
invalidated between kernel dispatches by CP since constant address space data
may change between kernel dispatch executions. See
:ref:`amdgpu-amdhsa-memory-spaces`.

The one exception is if scalar writes are used to spill SGPR registers. In this
case the AMDGPU backend ensures the memory location used to spill is never
accessed by vector memory operations at the same time. If scalar writes are used
then a ``s_dcache_wb`` is inserted before the ``s_endpgm`` and before a function
return since the locations may be used for vector memory instructions by a
future wavefront that uses the same scratch area, or a function call that
creates a frame at the same address, respectively. There is no need for a
``s_dcache_inv`` as all scalar writes are write-before-read in the same thread.

For kernarg backing memory:

* CP invalidates the L1 cache at the start of each kernel dispatch.
* On dGPU the kernarg backing memory is allocated in host memory accessed as
  MTYPE UC (uncached) to avoid needing to invalidate the L2 cache. This also
  causes it to be treated as non-volatile and so is not invalidated by
  ``*_vol``.
* On APU the kernarg backing memory is accessed as MTYPE CC (cache coherent)
  and so the L2 cache will be coherent with the CPU and other agents.

Scratch backing memory (which is used for the private address space) is
accessed with MTYPE NC_NV (non-coherent non-volatile). Since the private
address space is only accessed by a single thread, and is always
write-before-read, there is never a need to invalidate these entries from the
L1 cache. Hence all cache invalidates are done as ``*_vol`` to only invalidate
the volatile cache lines.

The code sequences used to implement the memory model for GFX6-GFX9 are defined
in table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx9-table`.

  .. table:: AMDHSA Memory Model Code Sequences GFX6-GFX9
     :name: amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx9-table

     ============ ============ ============== ========== ================================
     LLVM Instr   LLVM Memory  LLVM Memory    AMDGPU     AMDGPU Machine Code
                  Ordering     Sync Scope     Address    GFX6-GFX9
                                              Space
     ============ ============ ============== ========== ================================
     **Non-Atomic**
     ------------------------------------------------------------------------------------
     load         *none*       *none*         - global   - !volatile & !nontemporal
                                              - generic
                                              - private    1. buffer/global/flat_load
                                              - constant
                                                         - !volatile & nontemporal

                                                           1. buffer/global/flat_load
                                                              glc=1 slc=1

                                                         - volatile

                                                           1. buffer/global/flat_load
                                                              glc=1
                                                           2. s_waitcnt vmcnt(0)

                                                              - Must happen before
                                                                any following volatile
                                                                global/generic
                                                                load/store.
                                                              - Ensures that
                                                                volatile
                                                                operations to
                                                                different
                                                                addresses will not
                                                                be reordered by
                                                                hardware.

     load         *none*       *none*         - local    1. ds_load
     store        *none*       *none*         - global   - !volatile & !nontemporal
                                              - generic
                                              - private    1. buffer/global/flat_store
                                              - constant
                                                         - !volatile & nontemporal

                                                           1. buffer/global/flat_store
                                                              glc=1 slc=1

                                                         - volatile

                                                           1. buffer/global/flat_store
                                                           2. s_waitcnt vmcnt(0)

                                                              - Must happen before
                                                                any following volatile
                                                                global/generic
                                                                load/store.
                                                              - Ensures that
                                                                volatile
                                                                operations to
                                                                different
                                                                addresses will not
                                                                be reordered by
                                                                hardware.

     store        *none*       *none*         - local    1. ds_store
     **Unordered Atomic**
     ------------------------------------------------------------------------------------
     load atomic  unordered    *any*          *any*      *Same as non-atomic*.
     store atomic unordered    *any*          *any*      *Same as non-atomic*.
     atomicrmw    unordered    *any*          *any*      *Same as monotonic atomic*.
     **Monotonic Atomic**
     ------------------------------------------------------------------------------------
     load atomic  monotonic    - singlethread - global   1. buffer/global/ds/flat_load
                               - wavefront    - local
                               - workgroup    - generic
     load atomic  monotonic    - agent        - global   1. buffer/global/flat_load
                               - system       - generic     glc=1
     store atomic monotonic    - singlethread - global   1. buffer/global/flat_store
                               - wavefront    - generic
                               - workgroup
                               - agent
                               - system
     store atomic monotonic    - singlethread - local    1. ds_store
                               - wavefront
                               - workgroup
     atomicrmw    monotonic    - singlethread - global   1. buffer/global/flat_atomic
                               - wavefront    - generic
                               - workgroup
                               - agent
                               - system
     atomicrmw    monotonic    - singlethread - local    1. ds_atomic
                               - wavefront
                               - workgroup
     **Acquire Atomic**
     ------------------------------------------------------------------------------------
     load atomic  acquire      - singlethread - global   1. buffer/global/ds/flat_load
                               - wavefront    - local
                                              - generic
     load atomic  acquire      - workgroup    - global   1. buffer/global_load
     load atomic  acquire      - workgroup    - local    1. ds/flat_load
                                              - generic  2. s_waitcnt lgkmcnt(0)

                                                            - If OpenCL, omit.
                                                            - Must happen before
                                                              any following
                                                              global/generic
                                                              load/load
                                                              atomic/store/store
                                                              atomic/atomicrmw.
                                                            - Ensures any
                                                              following global
                                                              data read is no
                                                              older than a local load
                                                              atomic value being
                                                              acquired.

     load atomic  acquire      - agent        - global   1. buffer/global_load
                               - system                     glc=1
                                                         2. s_waitcnt vmcnt(0)

                                                            - Must happen before
                                                              following
                                                              buffer_wbinvl1_vol.
                                                            - Ensures the load
                                                              has completed
                                                              before invalidating
                                                              the cache.

                                                         3. buffer_wbinvl1_vol

                                                            - Must happen before
                                                              any following
                                                              global/generic
                                                              load/load
                                                              atomic/atomicrmw.
                                                            - Ensures that
                                                              following
                                                              loads will not see
                                                              stale global data.

     load atomic  acquire      - agent        - generic  1. flat_load glc=1
                               - system                  2. s_waitcnt vmcnt(0) &
                                                            lgkmcnt(0)

                                                            - If OpenCL omit
                                                              lgkmcnt(0).
                                                            - Must happen before
                                                              following
                                                              buffer_wbinvl1_vol.
                                                            - Ensures the flat_load
                                                              has completed
                                                              before invalidating
                                                              the cache.

                                                         3. buffer_wbinvl1_vol

                                                            - Must happen before
                                                              any following
                                                              global/generic
                                                              load/load
                                                              atomic/atomicrmw.
                                                            - Ensures that
                                                              following loads
                                                              will not see stale
                                                              global data.

     atomicrmw    acquire      - singlethread - global   1. buffer/global/ds/flat_atomic
                               - wavefront    - local
                                              - generic
     atomicrmw    acquire      - workgroup    - global   1. buffer/global_atomic
     atomicrmw    acquire      - workgroup    - local    1. ds/flat_atomic
                                              - generic  2. s_waitcnt lgkmcnt(0)

                                                            - If OpenCL, omit.
                                                            - Must happen before
                                                              any following
                                                              global/generic
                                                              load/load
                                                              atomic/store/store
                                                              atomic/atomicrmw.
                                                            - Ensures any
                                                              following global
                                                              data read is no
                                                              older than a local
                                                              atomicrmw value
                                                              being acquired.

     atomicrmw    acquire      - agent        - global   1. buffer/global_atomic
                               - system                  2. s_waitcnt vmcnt(0)

                                                            - Must happen before
                                                              following
                                                              buffer_wbinvl1_vol.
                                                            - Ensures the
                                                              atomicrmw has
                                                              completed before
                                                              invalidating the
                                                              cache.

                                                         3. buffer_wbinvl1_vol

                                                            - Must happen before
                                                              any following
                                                              global/generic
                                                              load/load
                                                              atomic/atomicrmw.
                                                            - Ensures that
                                                              following loads
                                                              will not see stale
                                                              global data.

     atomicrmw    acquire      - agent        - generic  1. flat_atomic
                               - system                  2. s_waitcnt vmcnt(0) &
                                                            lgkmcnt(0)

                                                            - If OpenCL, omit
                                                              lgkmcnt(0).
                                                            - Must happen before
                                                              following
                                                              buffer_wbinvl1_vol.
                                                            - Ensures the
                                                              atomicrmw has
                                                              completed before
                                                              invalidating the
                                                              cache.

                                                         3. buffer_wbinvl1_vol

                                                            - Must happen before
                                                              any following
                                                              global/generic
                                                              load/load
                                                              atomic/atomicrmw.
                                                            - Ensures that
                                                              following loads
                                                              will not see stale
                                                              global data.

     fence        acquire      - singlethread *none*     *none*
                               - wavefront
     fence        acquire      - workgroup    *none*     1. s_waitcnt lgkmcnt(0)

                                                            - If OpenCL and
                                                              address space is
                                                              not generic, omit.
                                                            - However, since LLVM
                                                              currently has no
                                                              address space on
                                                              the fence need to
                                                              conservatively
                                                              always generate. If
                                                              fence had an
                                                              address space then
                                                              set to address
                                                              space of OpenCL
                                                              fence flag, or to
                                                              generic if both
                                                              local and global
                                                              flags are
                                                              specified.
                                                            - Must happen after
                                                              any preceding
                                                              local/generic load
                                                              atomic/atomicrmw
                                                              with an equal or
                                                              wider sync scope
                                                              and memory ordering
                                                              stronger than
                                                              unordered (this is
                                                              termed the
                                                              fence-paired-atomic).
                                                            - Must happen before
                                                              any following
                                                              global/generic
                                                              load/load
                                                              atomic/store/store
                                                              atomic/atomicrmw.
                                                            - Ensures any
                                                              following global
                                                              data read is no
                                                              older than the
                                                              value read by the
                                                              fence-paired-atomic.

     fence        acquire      - agent        *none*     1. s_waitcnt lgkmcnt(0) &
                               - system                     vmcnt(0)

                                                            - If OpenCL and
                                                              address space is
                                                              not generic, omit
                                                              lgkmcnt(0).
                                                            - However, since LLVM
                                                              currently has no
                                                              address space on
                                                              the fence need to
                                                              conservatively
                                                              always generate
                                                              (see comment for
                                                              previous fence).
                                                            - Could be split into
                                                              separate s_waitcnt
                                                              vmcnt(0) and
                                                              s_waitcnt
                                                              lgkmcnt(0) to allow
                                                              them to be
                                                              independently moved
                                                              according to the
                                                              following rules.
5154 - s_waitcnt vmcnt(0) 5155 must happen after 5156 any preceding 5157 global/generic load 5158 atomic/atomicrmw 5159 with an equal or 5160 wider sync scope 5161 and memory ordering 5162 stronger than 5163 unordered (this is 5164 termed the 5165 fence-paired-atomic). 5166 - s_waitcnt lgkmcnt(0) 5167 must happen after 5168 any preceding 5169 local/generic load 5170 atomic/atomicrmw 5171 with an equal or 5172 wider sync scope 5173 and memory ordering 5174 stronger than 5175 unordered (this is 5176 termed the 5177 fence-paired-atomic). 5178 - Must happen before 5179 the following 5180 buffer_wbinvl1_vol. 5181 - Ensures that the 5182 fence-paired atomic 5183 has completed 5184 before invalidating 5185 the 5186 cache. Therefore 5187 any following 5188 locations read must 5189 be no older than 5190 the value read by 5191 the 5192 fence-paired-atomic. 5193 5194 2. buffer_wbinvl1_vol 5195 5196 - Must happen before any 5197 following global/generic 5198 load/load 5199 atomic/store/store 5200 atomic/atomicrmw. 5201 - Ensures that 5202 following loads 5203 will not see stale 5204 global data. 5205 5206 **Release Atomic** 5207 ------------------------------------------------------------------------------------ 5208 store atomic release - singlethread - global 1. buffer/global/ds/flat_store 5209 - wavefront - local 5210 - generic 5211 store atomic release - workgroup - global 1. s_waitcnt lgkmcnt(0) 5212 - generic 5213 - If OpenCL, omit. 5214 - Must happen after 5215 any preceding 5216 local/generic 5217 load/store/load 5218 atomic/store 5219 atomic/atomicrmw. 5220 - Must happen before 5221 the following 5222 store. 5223 - Ensures that all 5224 memory operations 5225 to local have 5226 completed before 5227 performing the 5228 store that is being 5229 released. 5230 5231 2. buffer/global/flat_store 5232 store atomic release - workgroup - local 1. ds_store 5233 store atomic release - agent - global 1. 
s_waitcnt lgkmcnt(0) & 5234 - system - generic vmcnt(0) 5235 5236 - If OpenCL and 5237 address space is 5238 not generic, omit 5239 lgkmcnt(0). 5240 - Could be split into 5241 separate s_waitcnt 5242 vmcnt(0) and 5243 s_waitcnt 5244 lgkmcnt(0) to allow 5245 them to be 5246 independently moved 5247 according to the 5248 following rules. 5249 - s_waitcnt vmcnt(0) 5250 must happen after 5251 any preceding 5252 global/generic 5253 load/store/load 5254 atomic/store 5255 atomic/atomicrmw. 5256 - s_waitcnt lgkmcnt(0) 5257 must happen after 5258 any preceding 5259 local/generic 5260 load/store/load 5261 atomic/store 5262 atomic/atomicrmw. 5263 - Must happen before 5264 the following 5265 store. 5266 - Ensures that all 5267 memory operations 5268 to memory have 5269 completed before 5270 performing the 5271 store that is being 5272 released. 5273 5274 2. buffer/global/flat_store 5275 atomicrmw release - singlethread - global 1. buffer/global/ds/flat_atomic 5276 - wavefront - local 5277 - generic 5278 atomicrmw release - workgroup - global 1. s_waitcnt lgkmcnt(0) 5279 - generic 5280 - If OpenCL, omit. 5281 - Must happen after 5282 any preceding 5283 local/generic 5284 load/store/load 5285 atomic/store 5286 atomic/atomicrmw. 5287 - Must happen before 5288 the following 5289 atomicrmw. 5290 - Ensures that all 5291 memory operations 5292 to local have 5293 completed before 5294 performing the 5295 atomicrmw that is 5296 being released. 5297 5298 2. buffer/global/flat_atomic 5299 atomicrmw release - workgroup - local 1. ds_atomic 5300 atomicrmw release - agent - global 1. s_waitcnt lgkmcnt(0) & 5301 - system - generic vmcnt(0) 5302 5303 - If OpenCL, omit 5304 lgkmcnt(0). 5305 - Could be split into 5306 separate s_waitcnt 5307 vmcnt(0) and 5308 s_waitcnt 5309 lgkmcnt(0) to allow 5310 them to be 5311 independently moved 5312 according to the 5313 following rules. 
5314 - s_waitcnt vmcnt(0) 5315 must happen after 5316 any preceding 5317 global/generic 5318 load/store/load 5319 atomic/store 5320 atomic/atomicrmw. 5321 - s_waitcnt lgkmcnt(0) 5322 must happen after 5323 any preceding 5324 local/generic 5325 load/store/load 5326 atomic/store 5327 atomic/atomicrmw. 5328 - Must happen before 5329 the following 5330 atomicrmw. 5331 - Ensures that all 5332 memory operations 5333 to global and local 5334 have completed 5335 before performing 5336 the atomicrmw that 5337 is being released. 5338 5339 2. buffer/global/flat_atomic 5340 fence release - singlethread *none* *none* 5341 - wavefront 5342 fence release - workgroup *none* 1. s_waitcnt lgkmcnt(0) 5343 5344 - If OpenCL and 5345 address space is 5346 not generic, omit. 5347 - However, since LLVM 5348 currently has no 5349 address space on 5350 the fence need to 5351 conservatively 5352 always generate. If 5353 fence had an 5354 address space then 5355 set to address 5356 space of OpenCL 5357 fence flag, or to 5358 generic if both 5359 local and global 5360 flags are 5361 specified. 5362 - Must happen after 5363 any preceding 5364 local/generic 5365 load/load 5366 atomic/store/store 5367 atomic/atomicrmw. 5368 - Must happen before 5369 any following store 5370 atomic/atomicrmw 5371 with an equal or 5372 wider sync scope 5373 and memory ordering 5374 stronger than 5375 unordered (this is 5376 termed the 5377 fence-paired-atomic). 5378 - Ensures that all 5379 memory operations 5380 to local have 5381 completed before 5382 performing the 5383 following 5384 fence-paired-atomic. 5385 5386 fence release - agent *none* 1. s_waitcnt lgkmcnt(0) & 5387 - system vmcnt(0) 5388 5389 - If OpenCL and 5390 address space is 5391 not generic, omit 5392 lgkmcnt(0). 5393 - If OpenCL and 5394 address space is 5395 local, omit 5396 vmcnt(0). 5397 - However, since LLVM 5398 currently has no 5399 address space on 5400 the fence need to 5401 conservatively 5402 always generate. 
If 5403 fence had an 5404 address space then 5405 set to address 5406 space of OpenCL 5407 fence flag, or to 5408 generic if both 5409 local and global 5410 flags are 5411 specified. 5412 - Could be split into 5413 separate s_waitcnt 5414 vmcnt(0) and 5415 s_waitcnt 5416 lgkmcnt(0) to allow 5417 them to be 5418 independently moved 5419 according to the 5420 following rules. 5421 - s_waitcnt vmcnt(0) 5422 must happen after 5423 any preceding 5424 global/generic 5425 load/store/load 5426 atomic/store 5427 atomic/atomicrmw. 5428 - s_waitcnt lgkmcnt(0) 5429 must happen after 5430 any preceding 5431 local/generic 5432 load/store/load 5433 atomic/store 5434 atomic/atomicrmw. 5435 - Must happen before 5436 any following store 5437 atomic/atomicrmw 5438 with an equal or 5439 wider sync scope 5440 and memory ordering 5441 stronger than 5442 unordered (this is 5443 termed the 5444 fence-paired-atomic). 5445 - Ensures that all 5446 memory operations 5447 have 5448 completed before 5449 performing the 5450 following 5451 fence-paired-atomic. 5452 5453 **Acquire-Release Atomic** 5454 ------------------------------------------------------------------------------------ 5455 atomicrmw acq_rel - singlethread - global 1. buffer/global/ds/flat_atomic 5456 - wavefront - local 5457 - generic 5458 atomicrmw acq_rel - workgroup - global 1. s_waitcnt lgkmcnt(0) 5459 5460 - If OpenCL, omit. 5461 - Must happen after 5462 any preceding 5463 local/generic 5464 load/store/load 5465 atomic/store 5466 atomic/atomicrmw. 5467 - Must happen before 5468 the following 5469 atomicrmw. 5470 - Ensures that all 5471 memory operations 5472 to local have 5473 completed before 5474 performing the 5475 atomicrmw that is 5476 being released. 5477 5478 2. buffer/global_atomic 5479 5480 atomicrmw acq_rel - workgroup - local 1. ds_atomic 5481 2. s_waitcnt lgkmcnt(0) 5482 5483 - If OpenCL, omit. 
5484 - Must happen before 5485 any following 5486 global/generic 5487 load/load 5488 atomic/store/store 5489 atomic/atomicrmw. 5490 - Ensures any 5491 following global 5492 data read is no 5493 older than the local load 5494 atomic value being 5495 acquired. 5496 5497 atomicrmw acq_rel - workgroup - generic 1. s_waitcnt lgkmcnt(0) 5498 5499 - If OpenCL, omit. 5500 - Must happen after 5501 any preceding 5502 local/generic 5503 load/store/load 5504 atomic/store 5505 atomic/atomicrmw. 5506 - Must happen before 5507 the following 5508 atomicrmw. 5509 - Ensures that all 5510 memory operations 5511 to local have 5512 completed before 5513 performing the 5514 atomicrmw that is 5515 being released. 5516 5517 2. flat_atomic 5518 3. s_waitcnt lgkmcnt(0) 5519 5520 - If OpenCL, omit. 5521 - Must happen before 5522 any following 5523 global/generic 5524 load/load 5525 atomic/store/store 5526 atomic/atomicrmw. 5527 - Ensures any 5528 following global 5529 data read is no 5530 older than a local load 5531 atomic value being 5532 acquired. 5533 5534 atomicrmw acq_rel - agent - global 1. s_waitcnt lgkmcnt(0) & 5535 - system vmcnt(0) 5536 5537 - If OpenCL, omit 5538 lgkmcnt(0). 5539 - Could be split into 5540 separate s_waitcnt 5541 vmcnt(0) and 5542 s_waitcnt 5543 lgkmcnt(0) to allow 5544 them to be 5545 independently moved 5546 according to the 5547 following rules. 5548 - s_waitcnt vmcnt(0) 5549 must happen after 5550 any preceding 5551 global/generic 5552 load/store/load 5553 atomic/store 5554 atomic/atomicrmw. 5555 - s_waitcnt lgkmcnt(0) 5556 must happen after 5557 any preceding 5558 local/generic 5559 load/store/load 5560 atomic/store 5561 atomic/atomicrmw. 5562 - Must happen before 5563 the following 5564 atomicrmw. 5565 - Ensures that all 5566 memory operations 5567 to global have 5568 completed before 5569 performing the 5570 atomicrmw that is 5571 being released. 5572 5573 2. buffer/global_atomic 5574 3. 
s_waitcnt vmcnt(0) 5575 5576 - Must happen before 5577 following 5578 buffer_wbinvl1_vol. 5579 - Ensures the 5580 atomicrmw has 5581 completed before 5582 invalidating the 5583 cache. 5584 5585 4. buffer_wbinvl1_vol 5586 5587 - Must happen before 5588 any following 5589 global/generic 5590 load/load 5591 atomic/atomicrmw. 5592 - Ensures that 5593 following loads 5594 will not see stale 5595 global data. 5596 5597 atomicrmw acq_rel - agent - generic 1. s_waitcnt lgkmcnt(0) & 5598 - system vmcnt(0) 5599 5600 - If OpenCL, omit 5601 lgkmcnt(0). 5602 - Could be split into 5603 separate s_waitcnt 5604 vmcnt(0) and 5605 s_waitcnt 5606 lgkmcnt(0) to allow 5607 them to be 5608 independently moved 5609 according to the 5610 following rules. 5611 - s_waitcnt vmcnt(0) 5612 must happen after 5613 any preceding 5614 global/generic 5615 load/store/load 5616 atomic/store 5617 atomic/atomicrmw. 5618 - s_waitcnt lgkmcnt(0) 5619 must happen after 5620 any preceding 5621 local/generic 5622 load/store/load 5623 atomic/store 5624 atomic/atomicrmw. 5625 - Must happen before 5626 the following 5627 atomicrmw. 5628 - Ensures that all 5629 memory operations 5630 to global have 5631 completed before 5632 performing the 5633 atomicrmw that is 5634 being released. 5635 5636 2. flat_atomic 5637 3. s_waitcnt vmcnt(0) & 5638 lgkmcnt(0) 5639 5640 - If OpenCL, omit 5641 lgkmcnt(0). 5642 - Must happen before 5643 following 5644 buffer_wbinvl1_vol. 5645 - Ensures the 5646 atomicrmw has 5647 completed before 5648 invalidating the 5649 cache. 5650 5651 4. buffer_wbinvl1_vol 5652 5653 - Must happen before 5654 any following 5655 global/generic 5656 load/load 5657 atomic/atomicrmw. 5658 - Ensures that 5659 following loads 5660 will not see stale 5661 global data. 5662 5663 fence acq_rel - singlethread *none* *none* 5664 - wavefront 5665 fence acq_rel - workgroup *none* 1. s_waitcnt lgkmcnt(0) 5666 5667 - If OpenCL and 5668 address space is 5669 not generic, omit. 
5670 - However, 5671 since LLVM 5672 currently has no 5673 address space on 5674 the fence need to 5675 conservatively 5676 always generate 5677 (see comment for 5678 previous fence). 5679 - Must happen after 5680 any preceding 5681 local/generic 5682 load/load 5683 atomic/store/store 5684 atomic/atomicrmw. 5685 - Must happen before 5686 any following 5687 global/generic 5688 load/load 5689 atomic/store/store 5690 atomic/atomicrmw. 5691 - Ensures that all 5692 memory operations 5693 to local have 5694 completed before 5695 performing any 5696 following global 5697 memory operations. 5698 - Ensures that the 5699 preceding 5700 local/generic load 5701 atomic/atomicrmw 5702 with an equal or 5703 wider sync scope 5704 and memory ordering 5705 stronger than 5706 unordered (this is 5707 termed the 5708 acquire-fence-paired-atomic) 5709 has completed 5710 before following 5711 global memory 5712 operations. This 5713 satisfies the 5714 requirements of 5715 acquire. 5716 - Ensures that all 5717 previous memory 5718 operations have 5719 completed before a 5720 following 5721 local/generic store 5722 atomic/atomicrmw 5723 with an equal or 5724 wider sync scope 5725 and memory ordering 5726 stronger than 5727 unordered (this is 5728 termed the 5729 release-fence-paired-atomic). 5730 This satisfies the 5731 requirements of 5732 release. 5733 5734 fence acq_rel - agent *none* 1. s_waitcnt lgkmcnt(0) & 5735 - system vmcnt(0) 5736 5737 - If OpenCL and 5738 address space is 5739 not generic, omit 5740 lgkmcnt(0). 5741 - However, since LLVM 5742 currently has no 5743 address space on 5744 the fence need to 5745 conservatively 5746 always generate 5747 (see comment for 5748 previous fence). 5749 - Could be split into 5750 separate s_waitcnt 5751 vmcnt(0) and 5752 s_waitcnt 5753 lgkmcnt(0) to allow 5754 them to be 5755 independently moved 5756 according to the 5757 following rules. 
5758 - s_waitcnt vmcnt(0) 5759 must happen after 5760 any preceding 5761 global/generic 5762 load/store/load 5763 atomic/store 5764 atomic/atomicrmw. 5765 - s_waitcnt lgkmcnt(0) 5766 must happen after 5767 any preceding 5768 local/generic 5769 load/store/load 5770 atomic/store 5771 atomic/atomicrmw. 5772 - Must happen before 5773 the following 5774 buffer_wbinvl1_vol. 5775 - Ensures that the 5776 preceding 5777 global/local/generic 5778 load 5779 atomic/atomicrmw 5780 with an equal or 5781 wider sync scope 5782 and memory ordering 5783 stronger than 5784 unordered (this is 5785 termed the 5786 acquire-fence-paired-atomic) 5787 has completed 5788 before invalidating 5789 the cache. This 5790 satisfies the 5791 requirements of 5792 acquire. 5793 - Ensures that all 5794 previous memory 5795 operations have 5796 completed before a 5797 following 5798 global/local/generic 5799 store 5800 atomic/atomicrmw 5801 with an equal or 5802 wider sync scope 5803 and memory ordering 5804 stronger than 5805 unordered (this is 5806 termed the 5807 release-fence-paired-atomic). 5808 This satisfies the 5809 requirements of 5810 release. 5811 5812 2. buffer_wbinvl1_vol 5813 5814 - Must happen before 5815 any following 5816 global/generic 5817 load/load 5818 atomic/store/store 5819 atomic/atomicrmw. 5820 - Ensures that 5821 following loads 5822 will not see stale 5823 global data. This 5824 satisfies the 5825 requirements of 5826 acquire. 5827 5828 **Sequential Consistent Atomic** 5829 ------------------------------------------------------------------------------------ 5830 load atomic seq_cst - singlethread - global *Same as corresponding 5831 - wavefront - local load atomic acquire, 5832 - generic except must generated 5833 all instructions even 5834 for OpenCL.* 5835 load atomic seq_cst - workgroup - global 1. 
s_waitcnt lgkmcnt(0) 5836 - generic 5837 5838 - Must 5839 happen after 5840 preceding 5841 local/generic load 5842 atomic/store 5843 atomic/atomicrmw 5844 with memory 5845 ordering of seq_cst 5846 and with equal or 5847 wider sync scope. 5848 (Note that seq_cst 5849 fences have their 5850 own s_waitcnt 5851 lgkmcnt(0) and so do 5852 not need to be 5853 considered.) 5854 - Ensures any 5855 preceding 5856 sequential 5857 consistent local 5858 memory instructions 5859 have completed 5860 before executing 5861 this sequentially 5862 consistent 5863 instruction. This 5864 prevents reordering 5865 a seq_cst store 5866 followed by a 5867 seq_cst load. (Note 5868 that seq_cst is 5869 stronger than 5870 acquire/release as 5871 the reordering of 5872 load acquire 5873 followed by a store 5874 release is 5875 prevented by the 5876 s_waitcnt of 5877 the release, but 5878 there is nothing 5879 preventing a store 5880 release followed by 5881 load acquire from 5882 completing out of 5883 order. The s_waitcnt 5884 could be placed after 5885 seq_store or before 5886 the seq_load. We 5887 choose the load to 5888 make the s_waitcnt be 5889 as late as possible 5890 so that the store 5891 may have already 5892 completed.) 5893 5894 2. *Following 5895 instructions same as 5896 corresponding load 5897 atomic acquire, 5898 except must generated 5899 all instructions even 5900 for OpenCL.* 5901 load atomic seq_cst - workgroup - local *Same as corresponding 5902 load atomic acquire, 5903 except must generated 5904 all instructions even 5905 for OpenCL.* 5906 5907 load atomic seq_cst - agent - global 1. s_waitcnt lgkmcnt(0) & 5908 - system - generic vmcnt(0) 5909 5910 - Could be split into 5911 separate s_waitcnt 5912 vmcnt(0) 5913 and s_waitcnt 5914 lgkmcnt(0) to allow 5915 them to be 5916 independently moved 5917 according to the 5918 following rules. 
5919 - s_waitcnt lgkmcnt(0) 5920 must happen after 5921 preceding 5922 global/generic load 5923 atomic/store 5924 atomic/atomicrmw 5925 with memory 5926 ordering of seq_cst 5927 and with equal or 5928 wider sync scope. 5929 (Note that seq_cst 5930 fences have their 5931 own s_waitcnt 5932 lgkmcnt(0) and so do 5933 not need to be 5934 considered.) 5935 - s_waitcnt vmcnt(0) 5936 must happen after 5937 preceding 5938 global/generic load 5939 atomic/store 5940 atomic/atomicrmw 5941 with memory 5942 ordering of seq_cst 5943 and with equal or 5944 wider sync scope. 5945 (Note that seq_cst 5946 fences have their 5947 own s_waitcnt 5948 vmcnt(0) and so do 5949 not need to be 5950 considered.) 5951 - Ensures any 5952 preceding 5953 sequential 5954 consistent global 5955 memory instructions 5956 have completed 5957 before executing 5958 this sequentially 5959 consistent 5960 instruction. This 5961 prevents reordering 5962 a seq_cst store 5963 followed by a 5964 seq_cst load. (Note 5965 that seq_cst is 5966 stronger than 5967 acquire/release as 5968 the reordering of 5969 load acquire 5970 followed by a store 5971 release is 5972 prevented by the 5973 s_waitcnt of 5974 the release, but 5975 there is nothing 5976 preventing a store 5977 release followed by 5978 load acquire from 5979 completing out of 5980 order. The s_waitcnt 5981 could be placed after 5982 seq_store or before 5983 the seq_load. We 5984 choose the load to 5985 make the s_waitcnt be 5986 as late as possible 5987 so that the store 5988 may have already 5989 completed.) 5990 5991 2. 
*Following 5992 instructions same as 5993 corresponding load 5994 atomic acquire, 5995 except must generated 5996 all instructions even 5997 for OpenCL.* 5998 store atomic seq_cst - singlethread - global *Same as corresponding 5999 - wavefront - local store atomic release, 6000 - workgroup - generic except must generated 6001 - agent all instructions even 6002 - system for OpenCL.* 6003 atomicrmw seq_cst - singlethread - global *Same as corresponding 6004 - wavefront - local atomicrmw acq_rel, 6005 - workgroup - generic except must generated 6006 - agent all instructions even 6007 - system for OpenCL.* 6008 fence seq_cst - singlethread *none* *Same as corresponding 6009 - wavefront fence acq_rel, 6010 - workgroup except must generated 6011 - agent all instructions even 6012 - system for OpenCL.* 6013 ============ ============ ============== ========== ================================ 6014 6015.. _amdgpu-amdhsa-memory-model-gfx90a: 6016 6017Memory Model GFX90A 6018+++++++++++++++++++ 6019 6020For GFX90A: 6021 6022* Each agent has multiple shader arrays (SA). 6023* Each SA has multiple compute units (CU). 6024* Each CU has multiple SIMDs that execute wavefronts. 6025* The wavefronts for a single work-group are executed in the same CU but may be 6026 executed by different SIMDs. The exception is when in tgsplit execution mode 6027 when the wavefronts may be executed by different SIMDs in different CUs. 6028* Each CU has a single LDS memory shared by the wavefronts of the work-groups 6029 executing on it. The exception is when in tgsplit execution mode when no LDS 6030 is allocated as wavefronts of the same work-group can be in different CUs. 6031* All LDS operations of a CU are performed as wavefront wide operations in a 6032 global order and involve no caching. Completion is reported to a wavefront in 6033 execution order. 6034* The LDS memory has multiple request queues shared by the SIMDs of a 6035 CU. 
Therefore, the LDS operations performed by different wavefronts of a 6036 work-group can be reordered relative to each other, which can result in 6037 reordering the visibility of vector memory operations with respect to LDS 6038 operations of other wavefronts in the same work-group. A ``s_waitcnt 6039 lgkmcnt(0)`` is required to ensure synchronization between LDS operations and 6040 vector memory operations between wavefronts of a work-group, but not between 6041 operations performed by the same wavefront. 6042* The vector memory operations are performed as wavefront wide operations and 6043 completion is reported to a wavefront in execution order. The exception is 6044 that ``flat_load/store/atomic`` instructions can report out of vector memory 6045 order if they access LDS memory, and out of LDS operation order if they access 6046 global memory. 6047* The vector memory operations access a single vector L1 cache shared by all 6048 SIMDs a CU. Therefore: 6049 6050 * No special action is required for coherence between the lanes of a single 6051 wavefront. 6052 6053 * No special action is required for coherence between wavefronts in the same 6054 work-group since they execute on the same CU. The exception is when in 6055 tgsplit execution mode as wavefronts of the same work-group can be in 6056 different CUs and so a ``buffer_wbinvl1_vol`` is required as described in 6057 the following item. 6058 6059 * A ``buffer_wbinvl1_vol`` is required for coherence between wavefronts 6060 executing in different work-groups as they may be executing on different 6061 CUs. 6062 6063* The scalar memory operations access a scalar L1 cache shared by all wavefronts 6064 on a group of CUs. The scalar and vector L1 caches are not coherent. However, 6065 scalar operations are used in a restricted way so do not impact the memory 6066 model. See :ref:`amdgpu-amdhsa-memory-spaces`. 6067* The vector and scalar memory operations use an L2 cache shared by all CUs on 6068 the same agent. 
6069 6070 * The L2 cache has independent channels to service disjoint ranges of virtual 6071 addresses. 6072 * Each CU has a separate request queue per channel. Therefore, the vector and 6073 scalar memory operations performed by wavefronts executing in different 6074 work-groups (which may be executing on different CUs), or the same 6075 work-group if executing in tgsplit mode, of an agent can be reordered 6076 relative to each other. A ``s_waitcnt vmcnt(0)`` is required to ensure 6077 synchronization between vector memory operations of different CUs. It 6078 ensures a previous vector memory operation has completed before executing a 6079 subsequent vector memory or LDS operation and so can be used to meet the 6080 requirements of acquire and release. 6081 * The L2 cache of one agent can be kept coherent with other agents by using 6082 the MTYPE CC (cache-coherent) with the PTE C-bit for memory local to the L2, 6083 and MTYPE UC (uncached) with the PTE C-bit set for memory not local to the 6084 L2. 6085 6086 * Any local memory cache lines will be automatically invalidated by writes 6087 from CUs associated with other L2 caches, or writes from the CPU, due to 6088 the cache probe caused by coherent requests. Coherent requests are caused 6089 by GPU accesses to pages with the PTE C-bit set, by CPU accesses over 6090 XGMI, and by PCIe requests that are configured to be coherent requests. 6091 * XGMI accesses from the CPU to local memory may be cached on the CPU. 6092 Subsequent access from the GPU will automatically invalidate or writeback 6093 the CPU cache due to the L2 probe filter and and the PTE C-bit being set. 6094 * Since all work-groups on the same agent share the same L2, no L2 6095 invalidation or writeback is required for coherence. 6096 * Since local memory reads and writes of work-groups in different agents 6097 access memory using MTYPE CC, no L2 invalidate or writeback is required 6098 for coherence. 
MTYPE CC causes write through to DRAM and local reads to be 6099 invalidated by remote writes with with the PTE C-bit. 6100 * Since remote memory reads and writes of work-groups in different agents 6101 access memory using MTYPE UC, no L2 invalidate or writeback is required 6102 for coherence. MTYPE UC causes direct accesses to DRAM. 6103 6104 * PCIe access from the GPU to the CPU memory is kept coherent by using the 6105 MTYPE UC (uncached) which bypasses the L2. 6106 6107Scalar memory operations are only used to access memory that is proven to not 6108change during the execution of the kernel dispatch. This includes constant 6109address space and global address space for program scope ``const`` variables. 6110Therefore, the kernel machine code does not have to maintain the scalar cache to 6111ensure it is coherent with the vector caches. The scalar and vector caches are 6112invalidated between kernel dispatches by CP since constant address space data 6113may change between kernel dispatch executions. See 6114:ref:`amdgpu-amdhsa-memory-spaces`. 6115 6116The one exception is if scalar writes are used to spill SGPR registers. In this 6117case the AMDGPU backend ensures the memory location used to spill is never 6118accessed by vector memory operations at the same time. If scalar writes are used 6119then a ``s_dcache_wb`` is inserted before the ``s_endpgm`` and before a function 6120return since the locations may be used for vector memory instructions by a 6121future wavefront that uses the same scratch area, or a function call that 6122creates a frame at the same address, respectively. There is no need for a 6123``s_dcache_inv`` as all scalar writes are write-before-read in the same thread. 6124 6125For kernarg backing memory: 6126 6127* CP invalidates the L1 cache at the start of each kernel dispatch. 
6128* On dGPU over XGMI or PCIe the kernarg backing memory is allocated in host 6129 memory accessed as MTYPE UC (uncached) to avoid needing to invalidate the L2 6130 cache. This also causes it to be treated as non-volatile and so is not 6131 invalidated by ``*_vol``. 6132* On APU the kernarg backing memory is accessed as MTYPE CC (cache coherent) and 6133 so the L2 cache will be coherent with the CPU and other agents. 6134 6135Scratch backing memory (which is used for the private address space) is accessed 6136with MTYPE NC_NV (non-coherent non-volatile). Since the private address space is 6137only accessed by a single thread, and is always write-before-read, there is 6138never a need to invalidate these entries from the L1 cache. Hence all cache 6139invalidates are done as ``*_vol`` to only invalidate the volatile cache lines. 6140 6141The code sequences used to implement the memory model for GFX90A are defined 6142in table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx90a-table`. 6143 6144 .. table:: AMDHSA Memory Model Code Sequences GFX90A 6145 :name: amdgpu-amdhsa-memory-model-code-sequences-gfx90a-table 6146 6147 ============ ============ ============== ========== ================================ 6148 LLVM Instr LLVM Memory LLVM Memory AMDGPU AMDGPU Machine Code 6149 Ordering Sync Scope Address GFX90A 6150 Space 6151 ============ ============ ============== ========== ================================ 6152 **Non-Atomic** 6153 ------------------------------------------------------------------------------------ 6154 load *none* *none* - global - !volatile & !nontemporal 6155 - generic 6156 - private 1. buffer/global/flat_load 6157 - constant 6158 - !volatile & nontemporal 6159 6160 1. buffer/global/flat_load 6161 glc=1 slc=1 6162 6163 - volatile 6164 6165 1. buffer/global/flat_load 6166 glc=1 6167 2. s_waitcnt vmcnt(0) 6168 6169 - Must happen before 6170 any following volatile 6171 global/generic 6172 load/store. 
                                                          - Ensures that
                                                            volatile
                                                            operations to
                                                            different
                                                            addresses will not
                                                            be reordered by
                                                            hardware.

     load         *none*       *none*         - local    1. ds_load
     store        *none*       *none*         - global   - !volatile & !nontemporal
                                              - generic
                                              - private  1. buffer/global/flat_store
                                              - constant
                                                         - !volatile & nontemporal

                                                         1. buffer/global/flat_store
                                                            glc=1 slc=1

                                                         - volatile

                                                         1. buffer/global/flat_store
                                                         2. s_waitcnt vmcnt(0)

                                                          - Must happen before
                                                            any following volatile
                                                            global/generic
                                                            load/store.
                                                          - Ensures that
                                                            volatile
                                                            operations to
                                                            different
                                                            addresses will not
                                                            be reordered by
                                                            hardware.

     store        *none*       *none*         - local    1. ds_store
     **Unordered Atomic**
     ------------------------------------------------------------------------------------
     load atomic  unordered    *any*          *any*      *Same as non-atomic*.
     store atomic unordered    *any*          *any*      *Same as non-atomic*.
     atomicrmw    unordered    *any*          *any*      *Same as monotonic atomic*.
     **Monotonic Atomic**
     ------------------------------------------------------------------------------------
     load atomic  monotonic    - singlethread - global   1. buffer/global/flat_load
                               - wavefront    - generic
     load atomic  monotonic    - workgroup    - global   1. buffer/global/flat_load
                                              - generic     glc=1

                                                          - If not TgSplit execution
                                                            mode, omit glc=1.

     load atomic  monotonic    - singlethread - local    *If TgSplit execution mode,
                               - wavefront               local address space cannot
                               - workgroup               be used.*

                                                         1. ds_load
     load atomic  monotonic    - agent        - global   1. buffer/global/flat_load
                                              - generic     glc=1
     load atomic  monotonic    - system       - global   1. buffer/global/flat_load
                                              - generic     glc=1
     store atomic monotonic    - singlethread - global   1. buffer/global/flat_store
                               - wavefront    - generic
                               - workgroup
                               - agent
     store atomic monotonic    - system       - global   1. buffer/global/flat_store
                                              - generic
     store atomic monotonic    - singlethread - local    *If TgSplit execution mode,
                               - wavefront               local address space cannot
                               - workgroup               be used.*

                                                         1. ds_store
     atomicrmw    monotonic    - singlethread - global   1. buffer/global/flat_atomic
                               - wavefront    - generic
                               - workgroup
                               - agent
     atomicrmw    monotonic    - system       - global   1. buffer/global/flat_atomic
                                              - generic
     atomicrmw    monotonic    - singlethread - local    *If TgSplit execution mode,
                               - wavefront               local address space cannot
                               - workgroup               be used.*

                                                         1. ds_atomic
     **Acquire Atomic**
     ------------------------------------------------------------------------------------
     load atomic  acquire      - singlethread - global   1. buffer/global/ds/flat_load
                               - wavefront    - local
                                              - generic
     load atomic  acquire      - workgroup    - global   1. buffer/global_load glc=1

                                                          - If not TgSplit execution
                                                            mode, omit glc=1.

                                                         2. s_waitcnt vmcnt(0)

                                                          - If not TgSplit execution
                                                            mode, omit.
                                                          - Must happen before the
                                                            following buffer_wbinvl1_vol.

                                                         3. buffer_wbinvl1_vol

                                                          - If not TgSplit execution
                                                            mode, omit.
                                                          - Must happen before
                                                            any following
                                                            global/generic
                                                            load/load
                                                            atomic/store/store
                                                            atomic/atomicrmw.
                                                          - Ensures that
                                                            following
                                                            loads will not see
                                                            stale data.

     load atomic  acquire      - workgroup    - local    *If TgSplit execution mode,
                                                         local address space cannot
                                                         be used.*

                                                         1. ds_load
                                                         2. s_waitcnt lgkmcnt(0)

                                                          - If OpenCL, omit.
                                                          - Must happen before
                                                            any following
                                                            global/generic
                                                            load/load
                                                            atomic/store/store
                                                            atomic/atomicrmw.
                                                          - Ensures any
                                                            following global
                                                            data read is no
                                                            older than the local load
                                                            atomic value being
                                                            acquired.

     load atomic  acquire      - workgroup    - generic  1. flat_load glc=1

                                                          - If not TgSplit execution
                                                            mode, omit glc=1.

                                                         2. s_waitcnt lgkm/vmcnt(0)

                                                          - Use lgkmcnt(0) if not
                                                            TgSplit execution mode
                                                            and vmcnt(0) if TgSplit
                                                            execution mode.
                                                          - If OpenCL, omit lgkmcnt(0).
                                                          - Must happen before
                                                            the following
                                                            buffer_wbinvl1_vol and any
                                                            following global/generic
                                                            load/load
                                                            atomic/store/store
                                                            atomic/atomicrmw.
                                                          - Ensures any
                                                            following global
                                                            data read is no
                                                            older than a local load
                                                            atomic value being
                                                            acquired.

                                                         3. buffer_wbinvl1_vol

                                                          - If not TgSplit execution
                                                            mode, omit.
                                                          - Ensures that
                                                            following
                                                            loads will not see
                                                            stale data.

     load atomic  acquire      - agent        - global   1. buffer/global_load
                                                            glc=1
                                                         2. s_waitcnt vmcnt(0)

                                                          - Must happen before
                                                            following
                                                            buffer_wbinvl1_vol.
                                                          - Ensures the load
                                                            has completed
                                                            before invalidating
                                                            the cache.

                                                         3. buffer_wbinvl1_vol

                                                          - Must happen before
                                                            any following
                                                            global/generic
                                                            load/load
                                                            atomic/atomicrmw.
                                                          - Ensures that
                                                            following
                                                            loads will not see
                                                            stale global data.

     load atomic  acquire      - system       - global   1. buffer/global/flat_load
                                                            glc=1
                                                         2. s_waitcnt vmcnt(0)

                                                          - Must happen before
                                                            following
                                                            buffer_wbinvl1_vol.
                                                          - Ensures the load
                                                            has completed
                                                            before invalidating
                                                            the cache.

                                                         3. buffer_wbinvl1_vol

                                                          - Must happen before
                                                            any following
                                                            global/generic
                                                            load/load
                                                            atomic/atomicrmw.
                                                          - Ensures that
                                                            following
                                                            loads will not see
                                                            stale L1 global data.
                                                            MTYPE RW and CC memory will
                                                            never be stale in L2 due to
                                                            the memory probes.

     load atomic  acquire      - agent        - generic  1. flat_load glc=1
                                                         2. s_waitcnt vmcnt(0) &
                                                            lgkmcnt(0)

                                                          - If TgSplit execution mode,
                                                            omit lgkmcnt(0).
                                                          - If OpenCL, omit
                                                            lgkmcnt(0).
                                                          - Must happen before
                                                            following
                                                            buffer_wbinvl1_vol.
                                                          - Ensures the flat_load
                                                            has completed
                                                            before invalidating
                                                            the cache.

                                                         3. buffer_wbinvl1_vol

                                                          - Must happen before
                                                            any following
                                                            global/generic
                                                            load/load
                                                            atomic/atomicrmw.
                                                          - Ensures that
                                                            following loads
                                                            will not see stale
                                                            global data.

     load atomic  acquire      - system       - generic  1. flat_load glc=1
                                                         2. s_waitcnt vmcnt(0) &
                                                            lgkmcnt(0)

                                                          - If TgSplit execution mode,
                                                            omit lgkmcnt(0).
                                                          - If OpenCL, omit
                                                            lgkmcnt(0).
                                                          - Must happen before
                                                            following
                                                            buffer_wbinvl1_vol.
                                                          - Ensures the flat_load
                                                            has completed
                                                            before invalidating
                                                            the caches.

                                                         3. buffer_wbinvl1_vol

                                                          - Must happen before
                                                            any following
                                                            global/generic
                                                            load/load
                                                            atomic/atomicrmw.
                                                          - Ensures that
                                                            following
                                                            L1 loads will not see
                                                            stale global data.
                                                            MTYPE RW and CC memory will
                                                            never be stale in L2 due to
                                                            the memory probes.

     atomicrmw    acquire      - singlethread - global   1. buffer/global/flat_atomic
                               - wavefront    - generic
     atomicrmw    acquire      - singlethread - local    *If TgSplit execution mode,
                               - wavefront               local address space cannot
                                                         be used.*

                                                         1. ds_atomic
     atomicrmw    acquire      - workgroup    - global   1. buffer/global_atomic
                                                         2. s_waitcnt vmcnt(0)

                                                          - If not TgSplit execution
                                                            mode, omit.
                                                          - Must happen before the
                                                            following buffer_wbinvl1_vol.
                                                          - Ensures the atomicrmw
                                                            has completed
                                                            before invalidating
                                                            the cache.

                                                         3. buffer_wbinvl1_vol

                                                          - If not TgSplit execution
                                                            mode, omit.
                                                          - Must happen before
                                                            any following
                                                            global/generic
                                                            load/load
                                                            atomic/atomicrmw.
                                                          - Ensures that
                                                            following loads
                                                            will not see stale
                                                            global data.

     atomicrmw    acquire      - workgroup    - local    *If TgSplit execution mode,
                                                         local address space cannot
                                                         be used.*

                                                         1. ds_atomic
                                                         2. s_waitcnt lgkmcnt(0)

                                                          - If OpenCL, omit.
                                                          - Must happen before
                                                            any following
                                                            global/generic
                                                            load/load
                                                            atomic/store/store
                                                            atomic/atomicrmw.
                                                          - Ensures any
                                                            following global
                                                            data read is no
                                                            older than the local
                                                            atomicrmw value
                                                            being acquired.

     atomicrmw    acquire      - workgroup    - generic  1. flat_atomic
                                                         2. s_waitcnt lgkm/vmcnt(0)

                                                          - Use lgkmcnt(0) if not
                                                            TgSplit execution mode
                                                            and vmcnt(0) if TgSplit
                                                            execution mode.
                                                          - If OpenCL, omit lgkmcnt(0).
                                                          - Must happen before
                                                            the following
                                                            buffer_wbinvl1_vol and
                                                            any following
                                                            global/generic
                                                            load/load
                                                            atomic/store/store
                                                            atomic/atomicrmw.
                                                          - Ensures any
                                                            following global
                                                            data read is no
                                                            older than a local
                                                            atomicrmw value
                                                            being acquired.

                                                         3. buffer_wbinvl1_vol

                                                          - If not TgSplit execution
                                                            mode, omit.
                                                          - Ensures that
                                                            following
                                                            loads will not see
                                                            stale data.

     atomicrmw    acquire      - agent        - global   1. buffer/global_atomic
                                                         2. s_waitcnt vmcnt(0)

                                                          - Must happen before
                                                            following
                                                            buffer_wbinvl1_vol.
                                                          - Ensures the
                                                            atomicrmw has
                                                            completed before
                                                            invalidating the
                                                            cache.

                                                         3. buffer_wbinvl1_vol

                                                          - Must happen before
                                                            any following
                                                            global/generic
                                                            load/load
                                                            atomic/atomicrmw.
                                                          - Ensures that
                                                            following loads
                                                            will not see stale
                                                            global data.

     atomicrmw    acquire      - system       - global   1. buffer/global_atomic
                                                         2. s_waitcnt vmcnt(0)

                                                          - Must happen before
                                                            following
                                                            buffer_wbinvl1_vol.
                                                          - Ensures the
                                                            atomicrmw has
                                                            completed before
                                                            invalidating the
                                                            caches.

                                                         3. buffer_wbinvl1_vol

                                                          - Must happen before
                                                            any following
                                                            global/generic
                                                            load/load
                                                            atomic/atomicrmw.
                                                          - Ensures that
                                                            following
                                                            loads will not see
                                                            stale L1 global data.
                                                            MTYPE RW and CC memory will
                                                            never be stale in L2 due to
                                                            the memory probes.

     atomicrmw    acquire      - agent        - generic  1. flat_atomic
                                                         2. s_waitcnt vmcnt(0) &
                                                            lgkmcnt(0)

                                                          - If TgSplit execution mode,
                                                            omit lgkmcnt(0).
                                                          - If OpenCL, omit
                                                            lgkmcnt(0).
                                                          - Must happen before
                                                            following
                                                            buffer_wbinvl1_vol.
                                                          - Ensures the
                                                            atomicrmw has
                                                            completed before
                                                            invalidating the
                                                            cache.

                                                         3. buffer_wbinvl1_vol

                                                          - Must happen before
                                                            any following
                                                            global/generic
                                                            load/load
                                                            atomic/atomicrmw.
                                                          - Ensures that
                                                            following loads
                                                            will not see stale
                                                            global data.

     atomicrmw    acquire      - system       - generic  1. flat_atomic
                                                         2. s_waitcnt vmcnt(0) &
                                                            lgkmcnt(0)

                                                          - If TgSplit execution mode,
                                                            omit lgkmcnt(0).
                                                          - If OpenCL, omit
                                                            lgkmcnt(0).
                                                          - Must happen before
                                                            following
                                                            buffer_wbinvl1_vol.
                                                          - Ensures the
                                                            atomicrmw has
                                                            completed before
                                                            invalidating the
                                                            caches.

                                                         3. buffer_wbinvl1_vol

                                                          - Must happen before
                                                            any following
                                                            global/generic
                                                            load/load
                                                            atomic/atomicrmw.
                                                          - Ensures that
                                                            following
                                                            loads will not see
                                                            stale L1 global data.
                                                            MTYPE RW and CC memory will
                                                            never be stale in L2 due to
                                                            the memory probes.

     fence        acquire      - singlethread *none*     *none*
                               - wavefront
     fence        acquire      - workgroup    *none*     1. s_waitcnt lgkm/vmcnt(0)

                                                          - Use lgkmcnt(0) if not
                                                            TgSplit execution mode
                                                            and vmcnt(0) if TgSplit
                                                            execution mode.
                                                          - If OpenCL and
                                                            address space is
                                                            not generic, omit
                                                            lgkmcnt(0).
                                                          - If OpenCL and
                                                            address space is
                                                            local, omit
                                                            vmcnt(0).
                                                          - However, since LLVM
                                                            currently has no
                                                            address space on
                                                            the fence need to
                                                            conservatively
                                                            always generate. If
                                                            fence had an
                                                            address space then
                                                            set to address
                                                            space of OpenCL
                                                            fence flag, or to
                                                            generic if both
                                                            local and global
                                                            flags are
                                                            specified.
                                                          - s_waitcnt vmcnt(0)
                                                            must happen after
                                                            any preceding
                                                            global/generic load
                                                            atomic/atomicrmw
                                                            with an equal or
                                                            wider sync scope
                                                            and memory ordering
                                                            stronger than
                                                            unordered (this is
                                                            termed the
                                                            fence-paired-atomic).
                                                          - s_waitcnt lgkmcnt(0)
                                                            must happen after
                                                            any preceding
                                                            local/generic load
                                                            atomic/atomicrmw
                                                            with an equal or
                                                            wider sync scope
                                                            and memory ordering
                                                            stronger than
                                                            unordered (this is
                                                            termed the
                                                            fence-paired-atomic).
                                                          - Must happen before
                                                            the following
                                                            buffer_wbinvl1_vol and
                                                            any following
                                                            global/generic
                                                            load/load
                                                            atomic/store/store
                                                            atomic/atomicrmw.
                                                          - Ensures any
                                                            following global
                                                            data read is no
                                                            older than the
                                                            value read by the
                                                            fence-paired-atomic.

                                                         2. buffer_wbinvl1_vol

                                                          - If not TgSplit execution
                                                            mode, omit.
                                                          - Ensures that
                                                            following
                                                            loads will not see
                                                            stale data.

     fence        acquire      - agent        *none*     1. s_waitcnt lgkmcnt(0) &
                                                            vmcnt(0)

                                                          - If TgSplit execution mode,
                                                            omit lgkmcnt(0).
                                                          - If OpenCL and
                                                            address space is
                                                            not generic, omit
                                                            lgkmcnt(0).
                                                          - However, since LLVM
                                                            currently has no
                                                            address space on
                                                            the fence need to
                                                            conservatively
                                                            always generate
                                                            (see comment for
                                                            previous fence).
                                                          - Could be split into
                                                            separate s_waitcnt
                                                            vmcnt(0) and
                                                            s_waitcnt
                                                            lgkmcnt(0) to allow
                                                            them to be
                                                            independently moved
                                                            according to the
                                                            following rules.
                                                          - s_waitcnt vmcnt(0)
                                                            must happen after
                                                            any preceding
                                                            global/generic load
                                                            atomic/atomicrmw
                                                            with an equal or
                                                            wider sync scope
                                                            and memory ordering
                                                            stronger than
                                                            unordered (this is
                                                            termed the
                                                            fence-paired-atomic).
                                                          - s_waitcnt lgkmcnt(0)
                                                            must happen after
                                                            any preceding
                                                            local/generic load
                                                            atomic/atomicrmw
                                                            with an equal or
                                                            wider sync scope
                                                            and memory ordering
                                                            stronger than
                                                            unordered (this is
                                                            termed the
                                                            fence-paired-atomic).
                                                          - Must happen before
                                                            the following
                                                            buffer_wbinvl1_vol.
                                                          - Ensures that the
                                                            fence-paired atomic
                                                            has completed
                                                            before invalidating
                                                            the
                                                            cache. Therefore
                                                            any following
                                                            locations read must
                                                            be no older than
                                                            the value read by
                                                            the
                                                            fence-paired-atomic.

                                                         2. buffer_wbinvl1_vol

                                                          - Must happen before any
                                                            following global/generic
                                                            load/load
                                                            atomic/store/store
                                                            atomic/atomicrmw.
                                                          - Ensures that
                                                            following loads
                                                            will not see stale
                                                            global data.

     fence        acquire      - system       *none*     1. s_waitcnt lgkmcnt(0) &
                                                            vmcnt(0)

                                                          - If TgSplit execution mode,
                                                            omit lgkmcnt(0).
                                                          - If OpenCL and
                                                            address space is
                                                            not generic, omit
                                                            lgkmcnt(0).
                                                          - However, since LLVM
                                                            currently has no
                                                            address space on
                                                            the fence need to
                                                            conservatively
                                                            always generate
                                                            (see comment for
                                                            previous fence).
                                                          - Could be split into
                                                            separate s_waitcnt
                                                            vmcnt(0) and
                                                            s_waitcnt
                                                            lgkmcnt(0) to allow
                                                            them to be
                                                            independently moved
                                                            according to the
                                                            following rules.
                                                          - s_waitcnt vmcnt(0)
                                                            must happen after
                                                            any preceding
                                                            global/generic load
                                                            atomic/atomicrmw
                                                            with an equal or
                                                            wider sync scope
                                                            and memory ordering
                                                            stronger than
                                                            unordered (this is
                                                            termed the
                                                            fence-paired-atomic).
                                                          - s_waitcnt lgkmcnt(0)
                                                            must happen after
                                                            any preceding
                                                            local/generic load
                                                            atomic/atomicrmw
                                                            with an equal or
                                                            wider sync scope
                                                            and memory ordering
                                                            stronger than
                                                            unordered (this is
                                                            termed the
                                                            fence-paired-atomic).
                                                          - Must happen before
                                                            the following
                                                            buffer_wbinvl1_vol.
                                                          - Ensures that the
                                                            fence-paired atomic
                                                            has completed
                                                            before invalidating
                                                            the
                                                            cache. Therefore
                                                            any following
                                                            locations read must
                                                            be no older than
                                                            the value read by
                                                            the
                                                            fence-paired-atomic.

                                                         2. buffer_wbinvl1_vol

                                                          - Must happen before any
                                                            following global/generic
                                                            load/load
                                                            atomic/store/store
                                                            atomic/atomicrmw.
                                                          - Ensures that
                                                            following
                                                            loads will not see
                                                            stale L1 global data.
                                                            MTYPE RW and CC memory will
                                                            never be stale in L2 due to
                                                            the memory probes.
     **Release Atomic**
     ------------------------------------------------------------------------------------
     store atomic release      - singlethread - global   1. buffer/global/flat_store
                               - wavefront    - generic
     store atomic release      - singlethread - local    *If TgSplit execution mode,
                               - wavefront               local address space cannot
                                                         be used.*

                                                         1. ds_store
     store atomic release      - workgroup    - global   1. s_waitcnt lgkm/vmcnt(0)
                                              - generic
                                                          - Use lgkmcnt(0) if not
                                                            TgSplit execution mode
                                                            and vmcnt(0) if TgSplit
                                                            execution mode.
                                                          - If OpenCL, omit lgkmcnt(0).
                                                          - s_waitcnt vmcnt(0)
                                                            must happen after
                                                            any preceding
                                                            global/generic load/store/
                                                            load atomic/store atomic/
                                                            atomicrmw.
                                                          - s_waitcnt lgkmcnt(0)
                                                            must happen after
                                                            any preceding
                                                            local/generic
                                                            load/store/load
                                                            atomic/store
                                                            atomic/atomicrmw.
                                                          - Must happen before
                                                            the following
                                                            store.
                                                          - Ensures that all
                                                            memory operations
                                                            have
                                                            completed before
                                                            performing the
                                                            store that is being
                                                            released.

                                                         2. buffer/global/flat_store
     store atomic release      - workgroup    - local    *If TgSplit execution mode,
                                                         local address space cannot
                                                         be used.*

                                                         1. ds_store
     store atomic release      - agent        - global   1. s_waitcnt lgkmcnt(0) &
                                              - generic     vmcnt(0)

                                                          - If TgSplit execution mode,
                                                            omit lgkmcnt(0).
                                                          - If OpenCL and
                                                            address space is
                                                            not generic, omit
                                                            lgkmcnt(0).
                                                          - Could be split into
                                                            separate s_waitcnt
                                                            vmcnt(0) and
                                                            s_waitcnt
                                                            lgkmcnt(0) to allow
                                                            them to be
                                                            independently moved
                                                            according to the
                                                            following rules.
                                                          - s_waitcnt vmcnt(0)
                                                            must happen after
                                                            any preceding
                                                            global/generic
                                                            load/store/load
                                                            atomic/store
                                                            atomic/atomicrmw.
                                                          - s_waitcnt lgkmcnt(0)
                                                            must happen after
                                                            any preceding
                                                            local/generic
                                                            load/store/load
                                                            atomic/store
                                                            atomic/atomicrmw.
                                                          - Must happen before
                                                            the following
                                                            store.
                                                          - Ensures that all
                                                            memory operations
                                                            to memory have
                                                            completed before
                                                            performing the
                                                            store that is being
                                                            released.

                                                         2. buffer/global/flat_store
     store atomic release      - system       - global   1. s_waitcnt lgkmcnt(0) &
                                              - generic     vmcnt(0)

                                                          - If TgSplit execution mode,
                                                            omit lgkmcnt(0).
                                                          - If OpenCL and
                                                            address space is
                                                            not generic, omit
                                                            lgkmcnt(0).
                                                          - Could be split into
                                                            separate s_waitcnt
                                                            vmcnt(0) and
                                                            s_waitcnt
                                                            lgkmcnt(0) to allow
                                                            them to be
                                                            independently moved
                                                            according to the
                                                            following rules.
                                                          - s_waitcnt vmcnt(0)
                                                            must happen after any
                                                            preceding
                                                            global/generic
                                                            load/store/load
                                                            atomic/store
                                                            atomic/atomicrmw.
                                                          - s_waitcnt lgkmcnt(0)
                                                            must happen after any
                                                            preceding
                                                            local/generic
                                                            load/store/load
                                                            atomic/store
                                                            atomic/atomicrmw.
                                                          - Must happen before
                                                            the following
                                                            store.
                                                          - Ensures that all
                                                            memory operations
                                                            to memory and the L2
                                                            writeback have
                                                            completed before
                                                            performing the
                                                            store that is being
                                                            released.

                                                         2. buffer/global/flat_store
     atomicrmw    release      - singlethread - global   1. buffer/global/flat_atomic
                               - wavefront    - generic
     atomicrmw    release      - singlethread - local    *If TgSplit execution mode,
                               - wavefront               local address space cannot
                                                         be used.*

                                                         1. ds_atomic
     atomicrmw    release      - workgroup    - global   1. s_waitcnt lgkm/vmcnt(0)
                                              - generic
                                                          - Use lgkmcnt(0) if not
                                                            TgSplit execution mode
                                                            and vmcnt(0) if TgSplit
                                                            execution mode.
                                                          - If OpenCL, omit
                                                            lgkmcnt(0).
                                                          - s_waitcnt vmcnt(0)
                                                            must happen after
                                                            any preceding
                                                            global/generic load/store/
                                                            load atomic/store atomic/
                                                            atomicrmw.
                                                          - s_waitcnt lgkmcnt(0)
                                                            must happen after
                                                            any preceding
                                                            local/generic
                                                            load/store/load
                                                            atomic/store
                                                            atomic/atomicrmw.
                                                          - Must happen before
                                                            the following
                                                            atomicrmw.
                                                          - Ensures that all
                                                            memory operations
                                                            have
                                                            completed before
                                                            performing the
                                                            atomicrmw that is
                                                            being released.

                                                         2. buffer/global/flat_atomic
     atomicrmw    release      - workgroup    - local    *If TgSplit execution mode,
                                                         local address space cannot
                                                         be used.*

                                                         1. ds_atomic
     atomicrmw    release      - agent        - global   1. s_waitcnt lgkmcnt(0) &
                                              - generic     vmcnt(0)

                                                          - If TgSplit execution mode,
                                                            omit lgkmcnt(0).
                                                          - If OpenCL, omit
                                                            lgkmcnt(0).
                                                          - Could be split into
                                                            separate s_waitcnt
                                                            vmcnt(0) and
                                                            s_waitcnt
                                                            lgkmcnt(0) to allow
                                                            them to be
                                                            independently moved
                                                            according to the
                                                            following rules.
                                                          - s_waitcnt vmcnt(0)
                                                            must happen after
                                                            any preceding
                                                            global/generic
                                                            load/store/load
                                                            atomic/store
                                                            atomic/atomicrmw.
                                                          - s_waitcnt lgkmcnt(0)
                                                            must happen after
                                                            any preceding
                                                            local/generic
                                                            load/store/load
                                                            atomic/store
                                                            atomic/atomicrmw.
                                                          - Must happen before
                                                            the following
                                                            atomicrmw.
                                                          - Ensures that all
                                                            memory operations
                                                            to global and local
                                                            have completed
                                                            before performing
                                                            the atomicrmw that
                                                            is being released.

                                                         2. buffer/global/flat_atomic
     atomicrmw    release      - system       - global   1. s_waitcnt lgkmcnt(0) &
                                              - generic     vmcnt(0)

                                                          - If TgSplit execution mode,
                                                            omit lgkmcnt(0).
                                                          - If OpenCL, omit
                                                            lgkmcnt(0).
                                                          - Could be split into
                                                            separate s_waitcnt
                                                            vmcnt(0) and
                                                            s_waitcnt
                                                            lgkmcnt(0) to allow
                                                            them to be
                                                            independently moved
                                                            according to the
                                                            following rules.
                                                          - s_waitcnt vmcnt(0)
                                                            must happen after
                                                            any preceding
                                                            global/generic
                                                            load/store/load
                                                            atomic/store
                                                            atomic/atomicrmw.
                                                          - s_waitcnt lgkmcnt(0)
                                                            must happen after
                                                            any preceding
                                                            local/generic
                                                            load/store/load
                                                            atomic/store
                                                            atomic/atomicrmw.
                                                          - Must happen before
                                                            the following
                                                            atomicrmw.
                                                          - Ensures that all
                                                            memory operations
                                                            to memory and the L2
                                                            writeback have
                                                            completed before
                                                            performing the
                                                            atomicrmw that is
                                                            being released.

                                                         2. buffer/global/flat_atomic
     fence        release      - singlethread *none*     *none*
                               - wavefront
     fence        release      - workgroup    *none*     1. s_waitcnt lgkm/vmcnt(0)

                                                          - Use lgkmcnt(0) if not
                                                            TgSplit execution mode
                                                            and vmcnt(0) if TgSplit
                                                            execution mode.
                                                          - If OpenCL and
                                                            address space is
                                                            not generic, omit
                                                            lgkmcnt(0).
                                                          - If OpenCL and
                                                            address space is
                                                            local, omit
                                                            vmcnt(0).
                                                          - However, since LLVM
                                                            currently has no
                                                            address space on
                                                            the fence need to
                                                            conservatively
                                                            always generate. If
                                                            fence had an
                                                            address space then
                                                            set to address
                                                            space of OpenCL
                                                            fence flag, or to
                                                            generic if both
                                                            local and global
                                                            flags are
                                                            specified.
                                                          - s_waitcnt vmcnt(0)
                                                            must happen after
                                                            any preceding
                                                            global/generic
                                                            load/store/
                                                            load atomic/store atomic/
                                                            atomicrmw.
                                                          - s_waitcnt lgkmcnt(0)
                                                            must happen after
                                                            any preceding
                                                            local/generic
                                                            load/load
                                                            atomic/store/store
                                                            atomic/atomicrmw.
                                                          - Must happen before
                                                            any following store
                                                            atomic/atomicrmw
                                                            with an equal or
                                                            wider sync scope
                                                            and memory ordering
                                                            stronger than
                                                            unordered (this is
                                                            termed the
                                                            fence-paired-atomic).
                                                          - Ensures that all
                                                            memory operations
                                                            have
                                                            completed before
                                                            performing the
                                                            following
                                                            fence-paired-atomic.

     fence        release      - agent        *none*     1. s_waitcnt lgkmcnt(0) &
                                                            vmcnt(0)

                                                          - If TgSplit execution mode,
                                                            omit lgkmcnt(0).
                                                          - If OpenCL and
                                                            address space is
                                                            not generic, omit
                                                            lgkmcnt(0).
                                                          - If OpenCL and
                                                            address space is
                                                            local, omit
                                                            vmcnt(0).
                                                          - However, since LLVM
                                                            currently has no
                                                            address space on
                                                            the fence need to
                                                            conservatively
                                                            always generate. If
                                                            fence had an
                                                            address space then
                                                            set to address
                                                            space of OpenCL
                                                            fence flag, or to
                                                            generic if both
                                                            local and global
                                                            flags are
                                                            specified.
                                                          - Could be split into
                                                            separate s_waitcnt
                                                            vmcnt(0) and
                                                            s_waitcnt
                                                            lgkmcnt(0) to allow
                                                            them to be
                                                            independently moved
                                                            according to the
                                                            following rules.
                                                          - s_waitcnt vmcnt(0)
                                                            must happen after
                                                            any preceding
                                                            global/generic
                                                            load/store/load
                                                            atomic/store
                                                            atomic/atomicrmw.
                                                          - s_waitcnt lgkmcnt(0)
                                                            must happen after
                                                            any preceding
                                                            local/generic
                                                            load/store/load
                                                            atomic/store
                                                            atomic/atomicrmw.
                                                          - Must happen before
                                                            any following store
                                                            atomic/atomicrmw
                                                            with an equal or
                                                            wider sync scope
                                                            and memory ordering
                                                            stronger than
                                                            unordered (this is
                                                            termed the
                                                            fence-paired-atomic).
                                                          - Ensures that all
                                                            memory operations
                                                            have
                                                            completed before
                                                            performing the
                                                            following
                                                            fence-paired-atomic.

     fence        release      - system       *none*     1. s_waitcnt lgkmcnt(0) &
                                                            vmcnt(0)

                                                          - If TgSplit execution mode,
                                                            omit lgkmcnt(0).
                                                          - If OpenCL and
                                                            address space is
                                                            not generic, omit
                                                            lgkmcnt(0).
                                                          - If OpenCL and
                                                            address space is
                                                            local, omit
                                                            vmcnt(0).
                                                          - However, since LLVM
                                                            currently has no
                                                            address space on
                                                            the fence need to
                                                            conservatively
                                                            always generate. If
                                                            fence had an
                                                            address space then
                                                            set to address
                                                            space of OpenCL
                                                            fence flag, or to
                                                            generic if both
                                                            local and global
                                                            flags are
                                                            specified.
                                                          - Could be split into
                                                            separate s_waitcnt
                                                            vmcnt(0) and
                                                            s_waitcnt
                                                            lgkmcnt(0) to allow
                                                            them to be
                                                            independently moved
                                                            according to the
                                                            following rules.
                                                          - s_waitcnt vmcnt(0)
                                                            must happen after
                                                            any preceding
                                                            global/generic
                                                            load/store/load
                                                            atomic/store
                                                            atomic/atomicrmw.
                                                          - s_waitcnt lgkmcnt(0)
                                                            must happen after
                                                            any preceding
                                                            local/generic
                                                            load/store/load
                                                            atomic/store
                                                            atomic/atomicrmw.
                                                          - Must happen before
                                                            any following store
                                                            atomic/atomicrmw
                                                            with an equal or
                                                            wider sync scope
                                                            and memory ordering
                                                            stronger than
                                                            unordered (this is
                                                            termed the
                                                            fence-paired-atomic).
                                                          - Ensures that all
                                                            memory operations
                                                            have
                                                            completed before
                                                            performing the
                                                            following
                                                            fence-paired-atomic.

     **Acquire-Release Atomic**
     ------------------------------------------------------------------------------------
     atomicrmw    acq_rel      - singlethread - global   1. buffer/global/flat_atomic
                               - wavefront    - generic
     atomicrmw    acq_rel      - singlethread - local    *If TgSplit execution mode,
                               - wavefront               local address space cannot
                                                         be used.*

                                                         1. ds_atomic
     atomicrmw    acq_rel      - workgroup    - global   1. s_waitcnt lgkm/vmcnt(0)

                                                          - Use lgkmcnt(0) if not
                                                            TgSplit execution mode
                                                            and vmcnt(0) if TgSplit
                                                            execution mode.
                                                          - If OpenCL, omit
                                                            lgkmcnt(0).
                                                          - Must happen after
                                                            any preceding
                                                            local/generic
                                                            load/store/load
                                                            atomic/store
                                                            atomic/atomicrmw.
                                                          - s_waitcnt vmcnt(0)
                                                            must happen after
                                                            any preceding
                                                            global/generic load/store/
                                                            load atomic/store atomic/
                                                            atomicrmw.
                                                          - s_waitcnt lgkmcnt(0)
                                                            must happen after
                                                            any preceding
                                                            local/generic
                                                            load/store/load
                                                            atomic/store
                                                            atomic/atomicrmw.
                                                          - Must happen before
                                                            the following
                                                            atomicrmw.
                                                          - Ensures that all
                                                            memory operations
                                                            have
                                                            completed before
                                                            performing the
                                                            atomicrmw that is
                                                            being released.

                                                         2. buffer/global_atomic
                                                         3. s_waitcnt vmcnt(0)

                                                          - If not TgSplit execution
                                                            mode, omit.
                                                          - Must happen before
                                                            the following
                                                            buffer_wbinvl1_vol.
                                                          - Ensures any
                                                            following global
                                                            data read is no
                                                            older than the
                                                            atomicrmw value
                                                            being acquired.

                                                         4. buffer_wbinvl1_vol

                                                          - If not TgSplit execution
                                                            mode, omit.
                                                          - Ensures that
                                                            following
                                                            loads will not see
                                                            stale data.

     atomicrmw    acq_rel      - workgroup    - local    *If TgSplit execution mode,
                                                         local address space cannot
                                                         be used.*

                                                         1. ds_atomic
                                                         2. s_waitcnt lgkmcnt(0)

                                                          - If OpenCL, omit.
                                                          - Must happen before
                                                            any following
                                                            global/generic
                                                            load/load
                                                            atomic/store/store
                                                            atomic/atomicrmw.
                                                          - Ensures any
                                                            following global
                                                            data read is no
                                                            older than the local load
                                                            atomic value being
                                                            acquired.

     atomicrmw    acq_rel      - workgroup    - generic  1. s_waitcnt lgkm/vmcnt(0)

                                                          - Use lgkmcnt(0) if not
                                                            TgSplit execution mode
                                                            and vmcnt(0) if TgSplit
                                                            execution mode.
                                                          - If OpenCL, omit
                                                            lgkmcnt(0).
                                                          - s_waitcnt vmcnt(0)
                                                            must happen after
                                                            any preceding
                                                            global/generic load/store/
                                                            load atomic/store atomic/
                                                            atomicrmw.
                                                          - s_waitcnt lgkmcnt(0)
                                                            must happen after
                                                            any preceding
                                                            local/generic
                                                            load/store/load
                                                            atomic/store
                                                            atomic/atomicrmw.
                                                          - Must happen before
                                                            the following
                                                            atomicrmw.
                                                          - Ensures that all
                                                            memory operations
                                                            have
                                                            completed before
                                                            performing the
                                                            atomicrmw that is
                                                            being released.

                                                         2. flat_atomic
                                                         3. s_waitcnt lgkmcnt(0) &
                                                            vmcnt(0)

                                                          - If not TgSplit execution
                                                            mode, omit vmcnt(0).
                                                          - If OpenCL, omit
                                                            lgkmcnt(0).
                                                          - Must happen before
                                                            the following
                                                            buffer_wbinvl1_vol and
                                                            any following
                                                            global/generic
                                                            load/load
                                                            atomic/store/store
                                                            atomic/atomicrmw.
                                                          - Ensures any
                                                            following global
                                                            data read is no
                                                            older than a local load
                                                            atomic value being
                                                            acquired.

                                                         4. buffer_wbinvl1_vol

                                                          - If not TgSplit execution
                                                            mode, omit.
                                                          - Ensures that
                                                            following
                                                            loads will not see
                                                            stale data.

     atomicrmw    acq_rel      - agent        - global   1. s_waitcnt lgkmcnt(0) &
                                                            vmcnt(0)

                                                          - If TgSplit execution mode,
                                                            omit lgkmcnt(0).
                                                          - If OpenCL, omit
                                                            lgkmcnt(0).
                                                          - Could be split into
                                                            separate s_waitcnt
                                                            vmcnt(0) and
                                                            s_waitcnt
                                                            lgkmcnt(0) to allow
                                                            them to be
                                                            independently moved
                                                            according to the
                                                            following rules.
                                                          - s_waitcnt vmcnt(0)
                                                            must happen after
                                                            any preceding
                                                            global/generic
                                                            load/store/load
                                                            atomic/store
                                                            atomic/atomicrmw.
                                                          - s_waitcnt lgkmcnt(0)
                                                            must happen after
                                                            any preceding
                                                            local/generic
                                                            load/store/load
                                                            atomic/store
                                                            atomic/atomicrmw.
                                                          - Must happen before
                                                            the following
                                                            atomicrmw.
                                                          - Ensures that all
                                                            memory operations
                                                            to global have
                                                            completed before
                                                            performing the
                                                            atomicrmw that is
                                                            being released.

                                                         2. buffer/global_atomic
                                                         3. s_waitcnt vmcnt(0)

                                                          - Must happen before
                                                            following
                                                            buffer_wbinvl1_vol.
                                                          - Ensures the
                                                            atomicrmw has
                                                            completed before
                                                            invalidating the
                                                            cache.

                                                         4. buffer_wbinvl1_vol

                                                          - Must happen before
                                                            any following
                                                            global/generic
                                                            load/load
                                                            atomic/atomicrmw.
                                                          - Ensures that
                                                            following loads
                                                            will not see stale
                                                            global data.

     atomicrmw    acq_rel      - system       - global   1. s_waitcnt lgkmcnt(0) &
                                                            vmcnt(0)

                                                          - If TgSplit execution mode,
                                                            omit lgkmcnt(0).
                                                          - If OpenCL, omit
                                                            lgkmcnt(0).
                                                          - Could be split into
                                                            separate s_waitcnt
                                                            vmcnt(0) and
                                                            s_waitcnt
                                                            lgkmcnt(0) to allow
                                                            them to be
                                                            independently moved
                                                            according to the
                                                            following rules.
                                                          - s_waitcnt vmcnt(0)
                                                            must happen after
                                                            any preceding
                                                            global/generic
                                                            load/store/load
                                                            atomic/store
                                                            atomic/atomicrmw.
                                                          - s_waitcnt lgkmcnt(0)
                                                            must happen after
                                                            any preceding
                                                            local/generic
                                                            load/store/load
                                                            atomic/store
                                                            atomic/atomicrmw.
                                                          - Must happen before
                                                            the following
                                                            atomicrmw.
7609 - Ensures that all 7610 memory operations 7611 to global and L2 writeback 7612 have completed before 7613 performing the 7614 atomicrmw that is 7615 being released. 7616 7617 2. buffer/global_atomic 7618 3. s_waitcnt vmcnt(0) 7619 7620 - Must happen before 7621 following 7622 buffer_wbinvl1_vol. 7623 - Ensures the 7624 atomicrmw has 7625 completed before 7626 invalidating the 7627 caches. 7628 7629 4. buffer_wbinvl1_vol 7630 7631 - Must happen before 7632 any following 7633 global/generic 7634 load/load 7635 atomic/atomicrmw. 7636 - Ensures that 7637 following 7638 loads will not see 7639 stale L1 global data. 7640 MTYPE RW and CC memory will 7641 never be stale in L2 due to 7642 the memory probes. 7643 7644 atomicrmw acq_rel - agent - generic 1. s_waitcnt lgkmcnt(0) & 7645 vmcnt(0) 7646 7647 - If TgSplit execution mode, 7648 omit lgkmcnt(0). 7649 - If OpenCL, omit 7650 lgkmcnt(0). 7651 - Could be split into 7652 separate s_waitcnt 7653 vmcnt(0) and 7654 s_waitcnt 7655 lgkmcnt(0) to allow 7656 them to be 7657 independently moved 7658 according to the 7659 following rules. 7660 - s_waitcnt vmcnt(0) 7661 must happen after 7662 any preceding 7663 global/generic 7664 load/store/load 7665 atomic/store 7666 atomic/atomicrmw. 7667 - s_waitcnt lgkmcnt(0) 7668 must happen after 7669 any preceding 7670 local/generic 7671 load/store/load 7672 atomic/store 7673 atomic/atomicrmw. 7674 - Must happen before 7675 the following 7676 atomicrmw. 7677 - Ensures that all 7678 memory operations 7679 to global have 7680 completed before 7681 performing the 7682 atomicrmw that is 7683 being released. 7684 7685 2. flat_atomic 7686 3. s_waitcnt vmcnt(0) & 7687 lgkmcnt(0) 7688 7689 - If TgSplit execution mode, 7690 omit lgkmcnt(0). 7691 - If OpenCL, omit 7692 lgkmcnt(0). 7693 - Must happen before 7694 following 7695 buffer_wbinvl1_vol. 7696 - Ensures the 7697 atomicrmw has 7698 completed before 7699 invalidating the 7700 cache. 7701 7702 4. 
buffer_wbinvl1_vol 7703 7704 - Must happen before 7705 any following 7706 global/generic 7707 load/load 7708 atomic/atomicrmw. 7709 - Ensures that 7710 following loads 7711 will not see stale 7712 global data. 7713 7714 atomicrmw acq_rel - system - generic 1. s_waitcnt lgkmcnt(0) & 7715 vmcnt(0) 7716 7717 - If TgSplit execution mode, 7718 omit lgkmcnt(0). 7719 - If OpenCL, omit 7720 lgkmcnt(0). 7721 - Could be split into 7722 separate s_waitcnt 7723 vmcnt(0) and 7724 s_waitcnt 7725 lgkmcnt(0) to allow 7726 them to be 7727 independently moved 7728 according to the 7729 following rules. 7730 - s_waitcnt vmcnt(0) 7731 must happen after 7732 any preceding 7733 global/generic 7734 load/store/load 7735 atomic/store 7736 atomic/atomicrmw. 7737 - s_waitcnt lgkmcnt(0) 7738 must happen after 7739 any preceding 7740 local/generic 7741 load/store/load 7742 atomic/store 7743 atomic/atomicrmw. 7744 - Must happen before 7745 the following 7746 atomicrmw. 7747 - Ensures that all 7748 memory operations 7749 to global and L2 writeback 7750 have completed before 7751 performing the 7752 atomicrmw that is 7753 being released. 7754 7755 2. flat_atomic 7756 3. s_waitcnt vmcnt(0) & 7757 lgkmcnt(0) 7758 7759 - If TgSplit execution mode, 7760 omit lgkmcnt(0). 7761 - If OpenCL, omit 7762 lgkmcnt(0). 7763 - Must happen before 7764 following 7765 buffer_wbinvl1_vol. 7766 - Ensures the 7767 atomicrmw has 7768 completed before 7769 invalidating the 7770 caches. 7771 7772 4. buffer_wbinvl1_vol 7773 7774 - Must happen before 7775 any following 7776 global/generic 7777 load/load 7778 atomic/atomicrmw. 7779 - Ensures that 7780 following 7781 loads will not see 7782 stale L1 global data. 7783 MTYPE RW and CC memory will 7784 never be stale in L2 due to 7785 the memory probes. 7786 7787 fence acq_rel - singlethread *none* *none* 7788 - wavefront 7789 fence acq_rel - workgroup *none* 1. 
s_waitcnt lgkm/vmcnt(0) 7790 7791 - Use lgkmcnt(0) if not 7792 TgSplit execution mode 7793 and vmcnt(0) if TgSplit 7794 execution mode. 7795 - If OpenCL and 7796 address space is 7797 not generic, omit 7798 lgkmcnt(0). 7799 - If OpenCL and 7800 address space is 7801 local, omit 7802 vmcnt(0). 7803 - However, 7804 since LLVM 7805 currently has no 7806 address space on 7807 the fence need to 7808 conservatively 7809 always generate 7810 (see comment for 7811 previous fence). 7812 - s_waitcnt vmcnt(0) 7813 must happen after 7814 any preceding 7815 global/generic 7816 load/store/ 7817 load atomic/store atomic/ 7818 atomicrmw. 7819 - s_waitcnt lgkmcnt(0) 7820 must happen after 7821 any preceding 7822 local/generic 7823 load/load 7824 atomic/store/store 7825 atomic/atomicrmw. 7826 - Must happen before 7827 any following 7828 global/generic 7829 load/load 7830 atomic/store/store 7831 atomic/atomicrmw. 7832 - Ensures that all 7833 memory operations 7834 have 7835 completed before 7836 performing any 7837 following global 7838 memory operations. 7839 - Ensures that the 7840 preceding 7841 local/generic load 7842 atomic/atomicrmw 7843 with an equal or 7844 wider sync scope 7845 and memory ordering 7846 stronger than 7847 unordered (this is 7848 termed the 7849 acquire-fence-paired-atomic) 7850 has completed 7851 before following 7852 global memory 7853 operations. This 7854 satisfies the 7855 requirements of 7856 acquire. 7857 - Ensures that all 7858 previous memory 7859 operations have 7860 completed before a 7861 following 7862 local/generic store 7863 atomic/atomicrmw 7864 with an equal or 7865 wider sync scope 7866 and memory ordering 7867 stronger than 7868 unordered (this is 7869 termed the 7870 release-fence-paired-atomic). 7871 This satisfies the 7872 requirements of 7873 release. 7874 - Must happen before 7875 the following 7876 buffer_wbinvl1_vol. 
7877 - Ensures that the 7878 acquire-fence-paired 7879 atomic has completed 7880 before invalidating 7881 the 7882 cache. Therefore 7883 any following 7884 locations read must 7885 be no older than 7886 the value read by 7887 the 7888 acquire-fence-paired-atomic. 7889 7890 3. buffer_wbinvl1_vol 7891 7892 - If not TgSplit execution 7893 mode, omit. 7894 - Ensures that 7895 following 7896 loads will not see 7897 stale data. 7898 7899 fence acq_rel - agent *none* 1. s_waitcnt lgkmcnt(0) & 7900 vmcnt(0) 7901 7902 - If TgSplit execution mode, 7903 omit lgkmcnt(0). 7904 - If OpenCL and 7905 address space is 7906 not generic, omit 7907 lgkmcnt(0). 7908 - However, since LLVM 7909 currently has no 7910 address space on 7911 the fence need to 7912 conservatively 7913 always generate 7914 (see comment for 7915 previous fence). 7916 - Could be split into 7917 separate s_waitcnt 7918 vmcnt(0) and 7919 s_waitcnt 7920 lgkmcnt(0) to allow 7921 them to be 7922 independently moved 7923 according to the 7924 following rules. 7925 - s_waitcnt vmcnt(0) 7926 must happen after 7927 any preceding 7928 global/generic 7929 load/store/load 7930 atomic/store 7931 atomic/atomicrmw. 7932 - s_waitcnt lgkmcnt(0) 7933 must happen after 7934 any preceding 7935 local/generic 7936 load/store/load 7937 atomic/store 7938 atomic/atomicrmw. 7939 - Must happen before 7940 the following 7941 buffer_wbinvl1_vol. 7942 - Ensures that the 7943 preceding 7944 global/local/generic 7945 load 7946 atomic/atomicrmw 7947 with an equal or 7948 wider sync scope 7949 and memory ordering 7950 stronger than 7951 unordered (this is 7952 termed the 7953 acquire-fence-paired-atomic) 7954 has completed 7955 before invalidating 7956 the cache. This 7957 satisfies the 7958 requirements of 7959 acquire. 
7960 - Ensures that all 7961 previous memory 7962 operations have 7963 completed before a 7964 following 7965 global/local/generic 7966 store 7967 atomic/atomicrmw 7968 with an equal or 7969 wider sync scope 7970 and memory ordering 7971 stronger than 7972 unordered (this is 7973 termed the 7974 release-fence-paired-atomic). 7975 This satisfies the 7976 requirements of 7977 release. 7978 7979 2. buffer_wbinvl1_vol 7980 7981 - Must happen before 7982 any following 7983 global/generic 7984 load/load 7985 atomic/store/store 7986 atomic/atomicrmw. 7987 - Ensures that 7988 following loads 7989 will not see stale 7990 global data. This 7991 satisfies the 7992 requirements of 7993 acquire. 7994 7995 fence acq_rel - system *none* 1. s_waitcnt lgkmcnt(0) & 7996 vmcnt(0) 7997 7998 - If TgSplit execution mode, 7999 omit lgkmcnt(0). 8000 - If OpenCL and 8001 address space is 8002 not generic, omit 8003 lgkmcnt(0). 8004 - However, since LLVM 8005 currently has no 8006 address space on 8007 the fence need to 8008 conservatively 8009 always generate 8010 (see comment for 8011 previous fence). 8012 - Could be split into 8013 separate s_waitcnt 8014 vmcnt(0) and 8015 s_waitcnt 8016 lgkmcnt(0) to allow 8017 them to be 8018 independently moved 8019 according to the 8020 following rules. 8021 - s_waitcnt vmcnt(0) 8022 must happen after 8023 any preceding 8024 global/generic 8025 load/store/load 8026 atomic/store 8027 atomic/atomicrmw. 8028 - s_waitcnt lgkmcnt(0) 8029 must happen after 8030 any preceding 8031 local/generic 8032 load/store/load 8033 atomic/store 8034 atomic/atomicrmw. 8035 - Must happen before 8036 the following 8037 buffer_wbinvl1_vol. 8038 - Ensures that the 8039 preceding 8040 global/local/generic 8041 load 8042 atomic/atomicrmw 8043 with an equal or 8044 wider sync scope 8045 and memory ordering 8046 stronger than 8047 unordered (this is 8048 termed the 8049 acquire-fence-paired-atomic) 8050 has completed 8051 before invalidating 8052 the cache. 
This 8053 satisfies the 8054 requirements of 8055 acquire. 8056 - Ensures that all 8057 previous memory 8058 operations have 8059 completed before a 8060 following 8061 global/local/generic 8062 store 8063 atomic/atomicrmw 8064 with an equal or 8065 wider sync scope 8066 and memory ordering 8067 stronger than 8068 unordered (this is 8069 termed the 8070 release-fence-paired-atomic). 8071 This satisfies the 8072 requirements of 8073 release. 8074 8075 2. buffer_wbinvl1_vol 8076 8077 - Must happen before 8078 any following 8079 global/generic 8080 load/load 8081 atomic/store/store 8082 atomic/atomicrmw. 8083 - Ensures that 8084 following 8085 loads will not see 8086 stale L1 global data. 8087 MTYPE RW and CC memory will 8088 never be stale in L2 due to 8089 the memory probes. 8090 8091 **Sequential Consistent Atomic** 8092 ------------------------------------------------------------------------------------ 8093 load atomic seq_cst - singlethread - global *Same as corresponding 8094 - wavefront - local load atomic acquire, 8095 - generic except must generated 8096 all instructions even 8097 for OpenCL.* 8098 load atomic seq_cst - workgroup - global 1. s_waitcnt lgkm/vmcnt(0) 8099 - generic 8100 - Use lgkmcnt(0) if not 8101 TgSplit execution mode 8102 and vmcnt(0) if TgSplit 8103 execution mode. 8104 - s_waitcnt lgkmcnt(0) must 8105 happen after 8106 preceding 8107 local/generic load 8108 atomic/store 8109 atomic/atomicrmw 8110 with memory 8111 ordering of seq_cst 8112 and with equal or 8113 wider sync scope. 8114 (Note that seq_cst 8115 fences have their 8116 own s_waitcnt 8117 lgkmcnt(0) and so do 8118 not need to be 8119 considered.) 8120 - s_waitcnt vmcnt(0) 8121 must happen after 8122 preceding 8123 global/generic load 8124 atomic/store 8125 atomic/atomicrmw 8126 with memory 8127 ordering of seq_cst 8128 and with equal or 8129 wider sync scope. 
8130 (Note that seq_cst 8131 fences have their 8132 own s_waitcnt 8133 vmcnt(0) and so do 8134 not need to be 8135 considered.) 8136 - Ensures any 8137 preceding 8138 sequential 8139 consistent global/local 8140 memory instructions 8141 have completed 8142 before executing 8143 this sequentially 8144 consistent 8145 instruction. This 8146 prevents reordering 8147 a seq_cst store 8148 followed by a 8149 seq_cst load. (Note 8150 that seq_cst is 8151 stronger than 8152 acquire/release as 8153 the reordering of 8154 load acquire 8155 followed by a store 8156 release is 8157 prevented by the 8158 s_waitcnt of 8159 the release, but 8160 there is nothing 8161 preventing a store 8162 release followed by 8163 load acquire from 8164 completing out of 8165 order. The s_waitcnt 8166 could be placed after 8167 seq_store or before 8168 the seq_load. We 8169 choose the load to 8170 make the s_waitcnt be 8171 as late as possible 8172 so that the store 8173 may have already 8174 completed.) 8175 8176 2. *Following 8177 instructions same as 8178 corresponding load 8179 atomic acquire, 8180 except must generated 8181 all instructions even 8182 for OpenCL.* 8183 load atomic seq_cst - workgroup - local *If TgSplit execution mode, 8184 local address space cannot 8185 be used.* 8186 8187 *Same as corresponding 8188 load atomic acquire, 8189 except must generated 8190 all instructions even 8191 for OpenCL.* 8192 8193 load atomic seq_cst - agent - global 1. s_waitcnt lgkmcnt(0) & 8194 - system - generic vmcnt(0) 8195 8196 - If TgSplit execution mode, 8197 omit lgkmcnt(0). 8198 - Could be split into 8199 separate s_waitcnt 8200 vmcnt(0) 8201 and s_waitcnt 8202 lgkmcnt(0) to allow 8203 them to be 8204 independently moved 8205 according to the 8206 following rules. 8207 - s_waitcnt lgkmcnt(0) 8208 must happen after 8209 preceding 8210 global/generic load 8211 atomic/store 8212 atomic/atomicrmw 8213 with memory 8214 ordering of seq_cst 8215 and with equal or 8216 wider sync scope. 
8217 (Note that seq_cst 8218 fences have their 8219 own s_waitcnt 8220 lgkmcnt(0) and so do 8221 not need to be 8222 considered.) 8223 - s_waitcnt vmcnt(0) 8224 must happen after 8225 preceding 8226 global/generic load 8227 atomic/store 8228 atomic/atomicrmw 8229 with memory 8230 ordering of seq_cst 8231 and with equal or 8232 wider sync scope. 8233 (Note that seq_cst 8234 fences have their 8235 own s_waitcnt 8236 vmcnt(0) and so do 8237 not need to be 8238 considered.) 8239 - Ensures any 8240 preceding 8241 sequential 8242 consistent global 8243 memory instructions 8244 have completed 8245 before executing 8246 this sequentially 8247 consistent 8248 instruction. This 8249 prevents reordering 8250 a seq_cst store 8251 followed by a 8252 seq_cst load. (Note 8253 that seq_cst is 8254 stronger than 8255 acquire/release as 8256 the reordering of 8257 load acquire 8258 followed by a store 8259 release is 8260 prevented by the 8261 s_waitcnt of 8262 the release, but 8263 there is nothing 8264 preventing a store 8265 release followed by 8266 load acquire from 8267 completing out of 8268 order. The s_waitcnt 8269 could be placed after 8270 seq_store or before 8271 the seq_load. We 8272 choose the load to 8273 make the s_waitcnt be 8274 as late as possible 8275 so that the store 8276 may have already 8277 completed.) 8278 8279 2. 
*Following 8280 instructions same as 8281 corresponding load 8282 atomic acquire, 8283 except must generated 8284 all instructions even 8285 for OpenCL.* 8286 store atomic seq_cst - singlethread - global *Same as corresponding 8287 - wavefront - local store atomic release, 8288 - workgroup - generic except must generated 8289 - agent all instructions even 8290 - system for OpenCL.* 8291 atomicrmw seq_cst - singlethread - global *Same as corresponding 8292 - wavefront - local atomicrmw acq_rel, 8293 - workgroup - generic except must generated 8294 - agent all instructions even 8295 - system for OpenCL.* 8296 fence seq_cst - singlethread *none* *Same as corresponding 8297 - wavefront fence acq_rel, 8298 - workgroup except must generated 8299 - agent all instructions even 8300 - system for OpenCL.* 8301 ============ ============ ============== ========== ================================ 8302 8303.. _amdgpu-amdhsa-memory-model-gfx10: 8304 8305Memory Model GFX10 8306++++++++++++++++++ 8307 8308For GFX10: 8309 8310* Each agent has multiple shader arrays (SA). 8311* Each SA has multiple work-group processors (WGP). 8312* Each WGP has multiple compute units (CU). 8313* Each CU has multiple SIMDs that execute wavefronts. 8314* The wavefronts for a single work-group are executed in the same 8315 WGP. In CU wavefront execution mode the wavefronts may be executed by 8316 different SIMDs in the same CU. In WGP wavefront execution mode the 8317 wavefronts may be executed by different SIMDs in different CUs in the same 8318 WGP. 8319* Each WGP has a single LDS memory shared by the wavefronts of the work-groups 8320 executing on it. 8321* All LDS operations of a WGP are performed as wavefront wide operations in a 8322 global order and involve no caching. Completion is reported to a wavefront in 8323 execution order. 8324* The LDS memory has multiple request queues shared by the SIMDs of a 8325 WGP. 
Therefore, the LDS operations performed by different wavefronts of a 8326 work-group can be reordered relative to each other, which can result in 8327 reordering the visibility of vector memory operations with respect to LDS 8328 operations of other wavefronts in the same work-group. A ``s_waitcnt 8329 lgkmcnt(0)`` is required to ensure synchronization between LDS operations and 8330 vector memory operations between wavefronts of a work-group, but not between 8331 operations performed by the same wavefront. 8332* The vector memory operations are performed as wavefront wide operations. 8333 Completion of load/store/sample operations are reported to a wavefront in 8334 execution order of other load/store/sample operations performed by that 8335 wavefront. 8336* The vector memory operations access a vector L0 cache. There is a single L0 8337 cache per CU. Each SIMD of a CU accesses the same L0 cache. Therefore, no 8338 special action is required for coherence between the lanes of a single 8339 wavefront. However, a ``buffer_gl0_inv`` is required for coherence between 8340 wavefronts executing in the same work-group as they may be executing on SIMDs 8341 of different CUs that access different L0s. A ``buffer_gl0_inv`` is also 8342 required for coherence between wavefronts executing in different work-groups 8343 as they may be executing on different WGPs. 8344* The scalar memory operations access a scalar L0 cache shared by all wavefronts 8345 on a WGP. The scalar and vector L0 caches are not coherent. However, scalar 8346 operations are used in a restricted way so do not impact the memory model. See 8347 :ref:`amdgpu-amdhsa-memory-spaces`. 8348* The vector and scalar memory L0 caches use an L1 cache shared by all WGPs on 8349 the same SA. Therefore, no special action is required for coherence between 8350 the wavefronts of a single work-group. 
However, a ``buffer_gl1_inv`` is 8351 required for coherence between wavefronts executing in different work-groups 8352 as they may be executing on different SAs that access different L1s. 8353* The L1 caches have independent quadrants to service disjoint ranges of virtual 8354 addresses. 8355* Each L0 cache has a separate request queue per L1 quadrant. Therefore, the 8356 vector and scalar memory operations performed by different wavefronts, whether 8357 executing in the same or different work-groups (which may be executing on 8358 different CUs accessing different L0s), can be reordered relative to each 8359 other. A ``s_waitcnt vmcnt(0) & vscnt(0)`` is required to ensure 8360 synchronization between vector memory operations of different wavefronts. It 8361 ensures a previous vector memory operation has completed before executing a 8362 subsequent vector memory or LDS operation and so can be used to meet the 8363 requirements of acquire, release and sequential consistency. 8364* The L1 caches use an L2 cache shared by all SAs on the same agent. 8365* The L2 cache has independent channels to service disjoint ranges of virtual 8366 addresses. 8367* Each L1 quadrant of a single SA accesses a different L2 channel. Each L1 8368 quadrant has a separate request queue per L2 channel. Therefore, the vector 8369 and scalar memory operations performed by wavefronts executing in different 8370 work-groups (which may be executing on different SAs) of an agent can be 8371 reordered relative to each other. A ``s_waitcnt vmcnt(0) & vscnt(0)`` is 8372 required to ensure synchronization between vector memory operations of 8373 different SAs. It ensures a previous vector memory operation has completed 8374 before executing a subsequent vector memory and so can be used to meet the 8375 requirements of acquire, release and sequential consistency. 
8376* The L2 cache can be kept coherent with other agents on some targets, or ranges 8377 of virtual addresses can be set up to bypass it to ensure system coherence. 8378 8379Scalar memory operations are only used to access memory that is proven to not 8380change during the execution of the kernel dispatch. This includes constant 8381address space and global address space for program scope ``const`` variables. 8382Therefore, the kernel machine code does not have to maintain the scalar cache to 8383ensure it is coherent with the vector caches. The scalar and vector caches are 8384invalidated between kernel dispatches by CP since constant address space data 8385may change between kernel dispatch executions. See 8386:ref:`amdgpu-amdhsa-memory-spaces`. 8387 8388The one exception is if scalar writes are used to spill SGPR registers. In this 8389case the AMDGPU backend ensures the memory location used to spill is never 8390accessed by vector memory operations at the same time. If scalar writes are used 8391then a ``s_dcache_wb`` is inserted before the ``s_endpgm`` and before a function 8392return since the locations may be used for vector memory instructions by a 8393future wavefront that uses the same scratch area, or a function call that 8394creates a frame at the same address, respectively. There is no need for a 8395``s_dcache_inv`` as all scalar writes are write-before-read in the same thread. 8396 8397For kernarg backing memory: 8398 8399* CP invalidates the L0 and L1 caches at the start of each kernel dispatch. 8400* On dGPU the kernarg backing memory is accessed as MTYPE UC (uncached) to avoid 8401 needing to invalidate the L2 cache. 8402* On APU the kernarg backing memory is accessed as MTYPE CC (cache coherent) and 8403 so the L2 cache will be coherent with the CPU and other agents. 8404 8405Scratch backing memory (which is used for the private address space) is accessed 8406with MTYPE NC (non-coherent). 
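This write-before-read use of the private address space can be illustrated with
a minimal LLVM IR sketch (the kernel name and constant are hypothetical, for
illustration only):

.. code-block:: llvm

  ; Each work-item allocates its own scratch slot (AMDGPU private memory is
  ; LLVM address space 5) and always writes it before reading it back, so the
  ; L0/L1 cache entries for scratch never need to be invalidated.
  define amdgpu_kernel void @scratch_example(ptr addrspace(1) %out) {
    %slot = alloca i32, align 4, addrspace(5)
    store i32 42, ptr addrspace(5) %slot   ; write first
    %v = load i32, ptr addrspace(5) %slot  ; then read by the same thread
    store i32 %v, ptr addrspace(1) %out
    ret void
  }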
Since the private address space is only accessed by a single thread, and is
always write-before-read, there is never a need to invalidate these entries from
the L0 or L1 caches.

Wavefronts are executed in native mode with in-order reporting of loads and
sample instructions. In this mode vmcnt reports completion of load, atomic with
return and sample instructions in order, and vscnt reports the completion of
store and atomic without return in order. See ``MEM_ORDERED`` field in
:ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.

Wavefronts can be executed in WGP or CU wavefront execution mode:

* In WGP wavefront execution mode the wavefronts of a work-group are executed
  on the SIMDs of both CUs of the WGP. Therefore, explicit management of the per
  CU L0 caches is required for work-group synchronization. Also accesses to L1
  at work-group scope need to be explicitly ordered as the accesses from
  different CUs are not ordered.
* In CU wavefront execution mode the wavefronts of a work-group are executed on
  the SIMDs of a single CU of the WGP. Therefore, all global memory accesses by
  the work-group access the same L0 which in turn ensures L1 accesses are
  ordered and so do not require explicit management of the caches for
  work-group synchronization.

See ``WGP_MODE`` field in
:ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table` and
:ref:`amdgpu-target-features`.

The code sequences used to implement the memory model for GFX10 are defined in
table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx10-table`.

  .. table:: AMDHSA Memory Model Code Sequences GFX10
     :name: amdgpu-amdhsa-memory-model-code-sequences-gfx10-table

     ============ ============ ============== ========== ================================
     LLVM Instr   LLVM Memory  LLVM Memory    AMDGPU     AMDGPU Machine Code
                  Ordering     Sync Scope     Address    GFX10
                                              Space
     ============ ============ ============== ========== ================================
     **Non-Atomic**
     ------------------------------------------------------------------------------------
     load         *none*       *none*         - global   - !volatile & !nontemporal
                                              - generic
                                              - private  1. buffer/global/flat_load
                                              - constant
                                                         - !volatile & nontemporal

                                                         1. buffer/global/flat_load
                                                            slc=1

                                                         - volatile

                                                         1. buffer/global/flat_load
                                                            glc=1 dlc=1
                                                         2. s_waitcnt vmcnt(0)

                                                          - Must happen before any
                                                            following volatile
                                                            global/generic
                                                            load/store.
                                                          - Ensures that volatile
                                                            operations to different
                                                            addresses will not be
                                                            reordered by hardware.

     load         *none*       *none*         - local    1. ds_load
     store        *none*       *none*         - global   - !volatile & !nontemporal
                                              - generic
                                              - private  1. buffer/global/flat_store
                                              - constant
                                                         - !volatile & nontemporal

                                                         1. buffer/global/flat_store
                                                            slc=1

                                                         - volatile

                                                         1. buffer/global/flat_store
                                                         2. s_waitcnt vscnt(0)

                                                          - Must happen before any
                                                            following volatile
                                                            global/generic
                                                            load/store.
                                                          - Ensures that volatile
                                                            operations to different
                                                            addresses will not be
                                                            reordered by hardware.

     store        *none*       *none*         - local    1. ds_store
     **Unordered Atomic**
     ------------------------------------------------------------------------------------
     load atomic  unordered    *any*          *any*      *Same as non-atomic*.
     store atomic unordered    *any*          *any*      *Same as non-atomic*.
     atomicrmw    unordered    *any*          *any*      *Same as monotonic atomic*.
     **Monotonic Atomic**
     ------------------------------------------------------------------------------------
     load atomic  monotonic    - singlethread - global   1. buffer/global/flat_load
                               - wavefront    - generic
     load atomic  monotonic    - workgroup    - global   1. buffer/global/flat_load
                                              - generic     glc=1

                                                          - If CU wavefront execution
                                                            mode, omit glc=1.

     load atomic  monotonic    - singlethread - local    1. ds_load
                               - wavefront
                               - workgroup
     load atomic  monotonic    - agent        - global   1. buffer/global/flat_load
                               - system       - generic     glc=1 dlc=1
     store atomic monotonic    - singlethread - global   1. buffer/global/flat_store
                               - wavefront    - generic
                               - workgroup
                               - agent
                               - system
     store atomic monotonic    - singlethread - local    1. ds_store
                               - wavefront
                               - workgroup
     atomicrmw    monotonic    - singlethread - global   1. buffer/global/flat_atomic
                               - wavefront    - generic
                               - workgroup
                               - agent
                               - system
     atomicrmw    monotonic    - singlethread - local    1. ds_atomic
                               - wavefront
                               - workgroup
     **Acquire Atomic**
     ------------------------------------------------------------------------------------
     load atomic  acquire      - singlethread - global   1. buffer/global/ds/flat_load
                               - wavefront    - local
                                              - generic
     load atomic  acquire      - workgroup    - global   1. buffer/global_load glc=1

                                                          - If CU wavefront execution
                                                            mode, omit glc=1.

                                                         2. s_waitcnt vmcnt(0)

                                                          - If CU wavefront execution
                                                            mode, omit.
                                                          - Must happen before the
                                                            following buffer_gl0_inv
                                                            and before any following
                                                            global/generic load/load
                                                            atomic/store/store
                                                            atomic/atomicrmw.

                                                         3. buffer_gl0_inv

                                                          - If CU wavefront execution
                                                            mode, omit.
                                                          - Ensures that following
                                                            loads will not see stale
                                                            data.

     load atomic  acquire      - workgroup    - local    1. ds_load
                                                         2. s_waitcnt lgkmcnt(0)

                                                          - If OpenCL, omit.
                                                          - Must happen before the
                                                            following buffer_gl0_inv
                                                            and before any following
                                                            global/generic load/load
                                                            atomic/store/store
                                                            atomic/atomicrmw.
                                                          - Ensures any following
                                                            global data read is no
                                                            older than the local
                                                            load atomic value being
                                                            acquired.

                                                         3. buffer_gl0_inv

                                                          - If CU wavefront execution
                                                            mode, omit.
                                                          - If OpenCL, omit.
                                                          - Ensures that following
                                                            loads will not see stale
                                                            data.

     load atomic  acquire      - workgroup    - generic  1. flat_load glc=1

                                                          - If CU wavefront execution
                                                            mode, omit glc=1.

                                                         2. s_waitcnt lgkmcnt(0) &
                                                            vmcnt(0)

                                                          - If CU wavefront execution
                                                            mode, omit vmcnt(0).
                                                          - If OpenCL, omit
                                                            lgkmcnt(0).
                                                          - Must happen before the
                                                            following buffer_gl0_inv
                                                            and any following
                                                            global/generic load/load
                                                            atomic/store/store
                                                            atomic/atomicrmw.
                                                          - Ensures any following
                                                            global data read is no
                                                            older than a local load
                                                            atomic value being
                                                            acquired.

                                                         3. buffer_gl0_inv

                                                          - If CU wavefront execution
                                                            mode, omit.
                                                          - Ensures that following
                                                            loads will not see stale
                                                            data.

     load atomic  acquire      - agent        - global   1. buffer/global_load
                               - system                     glc=1 dlc=1
                                                         2. s_waitcnt vmcnt(0)

                                                          - Must happen before
                                                            following
                                                            buffer_gl*_inv.
                                                          - Ensures the load has
                                                            completed before
                                                            invalidating the caches.

                                                         3. buffer_gl0_inv;
                                                            buffer_gl1_inv

                                                          - Must happen before any
                                                            following global/generic
                                                            load/load
                                                            atomic/atomicrmw.
                                                          - Ensures that following
                                                            loads will not see stale
                                                            global data.

     load atomic  acquire      - agent        - generic  1. flat_load glc=1 dlc=1
                               - system                  2. s_waitcnt vmcnt(0) &
                                                            lgkmcnt(0)

                                                          - If OpenCL, omit
                                                            lgkmcnt(0).
                                                          - Must happen before
                                                            following
                                                            buffer_gl*_inv.
                                                          - Ensures the flat_load
                                                            has completed before
                                                            invalidating the caches.

                                                         3. buffer_gl0_inv;
                                                            buffer_gl1_inv

                                                          - Must happen before any
                                                            following global/generic
                                                            load/load
                                                            atomic/atomicrmw.
                                                          - Ensures that following
                                                            loads will not see stale
                                                            global data.

     atomicrmw    acquire      - singlethread - global   1. buffer/global/ds/flat_atomic
                               - wavefront    - local
                                              - generic
     atomicrmw    acquire      - workgroup    - global   1. buffer/global_atomic
                                                         2. s_waitcnt vm/vscnt(0)

                                                          - If CU wavefront execution
                                                            mode, omit.
                                                          - Use vmcnt(0) if atomic
                                                            with return and vscnt(0)
                                                            if atomic with
                                                            no-return.
                                                          - Must happen before the
                                                            following buffer_gl0_inv
                                                            and before any following
                                                            global/generic load/load
                                                            atomic/store/store
                                                            atomic/atomicrmw.

                                                         3. buffer_gl0_inv

                                                          - If CU wavefront execution
                                                            mode, omit.
                                                          - Ensures that following
                                                            loads will not see stale
                                                            data.

     atomicrmw    acquire      - workgroup    - local    1. ds_atomic
                                                         2. s_waitcnt lgkmcnt(0)

                                                          - If OpenCL, omit.
                                                          - Must happen before the
                                                            following
                                                            buffer_gl0_inv.
                                                          - Ensures any following
                                                            global data read is no
                                                            older than the local
                                                            atomicrmw value being
                                                            acquired.

                                                         3. buffer_gl0_inv

                                                          - If OpenCL, omit.
                                                          - Ensures that following
                                                            loads will not see stale
                                                            data.

     atomicrmw    acquire      - workgroup    - generic  1. flat_atomic
                                                         2. s_waitcnt lgkmcnt(0) &
                                                            vm/vscnt(0)

                                                          - If CU wavefront execution
                                                            mode, omit vm/vscnt(0).
                                                          - If OpenCL, omit
                                                            lgkmcnt(0).
                                                          - Use vmcnt(0) if atomic
                                                            with return and vscnt(0)
                                                            if atomic with
                                                            no-return.
                                                          - Must happen before the
                                                            following
                                                            buffer_gl0_inv.
                                                          - Ensures any following
                                                            global data read is no
                                                            older than a local
                                                            atomicrmw value being
                                                            acquired.

                                                         3. buffer_gl0_inv

                                                          - If CU wavefront execution
                                                            mode, omit.
                                                          - Ensures that following
                                                            loads will not see stale
                                                            data.

     atomicrmw    acquire      - agent        - global   1. buffer/global_atomic
                               - system                  2. s_waitcnt vm/vscnt(0)

                                                          - Use vmcnt(0) if atomic
                                                            with return and vscnt(0)
                                                            if atomic with
                                                            no-return.
                                                          - Must happen before
                                                            following
                                                            buffer_gl*_inv.
                                                          - Ensures the atomicrmw
                                                            has completed before
                                                            invalidating the caches.

                                                         3. buffer_gl0_inv;
                                                            buffer_gl1_inv

                                                          - Must happen before any
                                                            following global/generic
                                                            load/load
                                                            atomic/atomicrmw.
                                                          - Ensures that following
                                                            loads will not see stale
                                                            global data.

     atomicrmw    acquire      - agent        - generic  1. flat_atomic
                               - system                  2. s_waitcnt vm/vscnt(0) &
                                                            lgkmcnt(0)

                                                          - If OpenCL, omit
                                                            lgkmcnt(0).
                                                          - Use vmcnt(0) if atomic
                                                            with return and vscnt(0)
                                                            if atomic with
                                                            no-return.
                                                          - Must happen before
                                                            following
                                                            buffer_gl*_inv.
                                                          - Ensures the atomicrmw
                                                            has completed before
                                                            invalidating the caches.

                                                         3. buffer_gl0_inv;
                                                            buffer_gl1_inv

                                                          - Must happen before any
                                                            following global/generic
                                                            load/load
                                                            atomic/atomicrmw.
                                                          - Ensures that following
                                                            loads will not see stale
                                                            global data.

     fence        acquire      - singlethread *none*     *none*
                               - wavefront
     fence        acquire      - workgroup    *none*     1. s_waitcnt lgkmcnt(0) &
                                                            vmcnt(0) & vscnt(0)

                                                          - If CU wavefront execution
                                                            mode, omit vmcnt(0) and
                                                            vscnt(0).
                                                          - If OpenCL and address
                                                            space is not generic,
                                                            omit lgkmcnt(0).
                                                          - If OpenCL and address
                                                            space is local, omit
                                                            vmcnt(0) and vscnt(0).
                                                          - However, since LLVM
                                                            currently has no address
                                                            space on the fence, need
                                                            to conservatively always
                                                            generate. If fence had
                                                            an address space then
                                                            set to address space of
                                                            OpenCL fence flag, or to
                                                            generic if both local
                                                            and global flags are
                                                            specified.
                                                          - Could be split into
                                                            separate s_waitcnt
                                                            vmcnt(0), s_waitcnt
                                                            vscnt(0) and s_waitcnt
                                                            lgkmcnt(0) to allow them
                                                            to be independently
                                                            moved according to the
                                                            following rules.
                                                          - s_waitcnt vmcnt(0) must
                                                            happen after any
                                                            preceding global/generic
                                                            load atomic/
                                                            atomicrmw-with-return-value
                                                            with an equal or wider
                                                            sync scope and memory
                                                            ordering stronger than
                                                            unordered (this is
                                                            termed the
                                                            fence-paired-atomic).
                                                          - s_waitcnt vscnt(0) must
                                                            happen after any
                                                            preceding global/generic
                                                            atomicrmw-no-return-value
                                                            with an equal or wider
                                                            sync scope and memory
                                                            ordering stronger than
                                                            unordered (this is
                                                            termed the
                                                            fence-paired-atomic).
                                                          - s_waitcnt lgkmcnt(0)
                                                            must happen after any
                                                            preceding local/generic
                                                            load atomic/atomicrmw
                                                            with an equal or wider
                                                            sync scope and memory
                                                            ordering stronger than
                                                            unordered (this is
                                                            termed the
                                                            fence-paired-atomic).
                                                          - Must happen before the
                                                            following
                                                            buffer_gl0_inv.
                                                          - Ensures that the
                                                            fence-paired atomic has
                                                            completed before
                                                            invalidating the cache.
                                                            Therefore any following
                                                            locations read must be
                                                            no older than the value
                                                            read by the
                                                            fence-paired-atomic.

                                                         2. buffer_gl0_inv

                                                          - If CU wavefront execution
                                                            mode, omit.
                                                          - Ensures that following
                                                            loads will not see stale
                                                            data.

     fence        acquire      - agent        *none*     1. s_waitcnt lgkmcnt(0) &
                               - system                     vmcnt(0) & vscnt(0)

                                                          - If OpenCL and address
                                                            space is not generic,
                                                            omit lgkmcnt(0).
                                                          - If OpenCL and address
                                                            space is local, omit
                                                            vmcnt(0) and vscnt(0).
                                                          - However, since LLVM
                                                            currently has no address
                                                            space on the fence, need
                                                            to conservatively always
                                                            generate (see comment
                                                            for previous fence).
                                                          - Could be split into
                                                            separate s_waitcnt
                                                            vmcnt(0), s_waitcnt
                                                            vscnt(0) and s_waitcnt
                                                            lgkmcnt(0) to allow them
                                                            to be independently
                                                            moved according to the
                                                            following rules.
                                                          - s_waitcnt vmcnt(0) must
                                                            happen after any
                                                            preceding global/generic
                                                            load atomic/
                                                            atomicrmw-with-return-value
                                                            with an equal or wider
                                                            sync scope and memory
                                                            ordering stronger than
                                                            unordered (this is
                                                            termed the
                                                            fence-paired-atomic).
                                                          - s_waitcnt vscnt(0) must
                                                            happen after any
                                                            preceding global/generic
                                                            atomicrmw-no-return-value
                                                            with an equal or wider
                                                            sync scope and memory
                                                            ordering stronger than
                                                            unordered (this is
                                                            termed the
                                                            fence-paired-atomic).
                                                          - s_waitcnt lgkmcnt(0)
                                                            must happen after any
                                                            preceding local/generic
                                                            load atomic/atomicrmw
                                                            with an equal or wider
                                                            sync scope and memory
                                                            ordering stronger than
                                                            unordered (this is
                                                            termed the
                                                            fence-paired-atomic).
                                                          - Must happen before the
                                                            following
                                                            buffer_gl*_inv.
                                                          - Ensures that the
                                                            fence-paired atomic has
                                                            completed before
                                                            invalidating the caches.
                                                            Therefore any following
                                                            locations read must be
                                                            no older than the value
                                                            read by the
                                                            fence-paired-atomic.

                                                         2. buffer_gl0_inv;
                                                            buffer_gl1_inv

                                                          - Must happen before any
                                                            following global/generic
                                                            load/load atomic/store/
                                                            store atomic/atomicrmw.
                                                          - Ensures that following
                                                            loads will not see stale
                                                            global data.

     **Release Atomic**
     ------------------------------------------------------------------------------------
     store atomic release      - singlethread - global   1.
buffer/global/ds/flat_store 9019 - wavefront - local 9020 - generic 9021 store atomic release - workgroup - global 1. s_waitcnt lgkmcnt(0) & 9022 - generic vmcnt(0) & vscnt(0) 9023 9024 - If CU wavefront execution 9025 mode, omit vmcnt(0) and 9026 vscnt(0). 9027 - If OpenCL, omit 9028 lgkmcnt(0). 9029 - Could be split into 9030 separate s_waitcnt 9031 vmcnt(0), s_waitcnt 9032 vscnt(0) and s_waitcnt 9033 lgkmcnt(0) to allow 9034 them to be 9035 independently moved 9036 according to the 9037 following rules. 9038 - s_waitcnt vmcnt(0) 9039 must happen after 9040 any preceding 9041 global/generic load/load 9042 atomic/ 9043 atomicrmw-with-return-value. 9044 - s_waitcnt vscnt(0) 9045 must happen after 9046 any preceding 9047 global/generic 9048 store/store 9049 atomic/ 9050 atomicrmw-no-return-value. 9051 - s_waitcnt lgkmcnt(0) 9052 must happen after 9053 any preceding 9054 local/generic 9055 load/store/load 9056 atomic/store 9057 atomic/atomicrmw. 9058 - Must happen before 9059 the following 9060 store. 9061 - Ensures that all 9062 memory operations 9063 have 9064 completed before 9065 performing the 9066 store that is being 9067 released. 9068 9069 2. buffer/global/flat_store 9070 store atomic release - workgroup - local 1. s_waitcnt vmcnt(0) & vscnt(0) 9071 9072 - If CU wavefront execution 9073 mode, omit. 9074 - If OpenCL, omit. 9075 - Could be split into 9076 separate s_waitcnt 9077 vmcnt(0) and s_waitcnt 9078 vscnt(0) to allow 9079 them to be 9080 independently moved 9081 according to the 9082 following rules. 9083 - s_waitcnt vmcnt(0) 9084 must happen after 9085 any preceding 9086 global/generic load/load 9087 atomic/ 9088 atomicrmw-with-return-value. 9089 - s_waitcnt vscnt(0) 9090 must happen after 9091 any preceding 9092 global/generic 9093 store/store atomic/ 9094 atomicrmw-no-return-value. 9095 - Must happen before 9096 the following 9097 store. 
9098 - Ensures that all 9099 global memory 9100 operations have 9101 completed before 9102 performing the 9103 store that is being 9104 released. 9105 9106 2. ds_store 9107 store atomic release - agent - global 1. s_waitcnt lgkmcnt(0) & 9108 - system - generic vmcnt(0) & vscnt(0) 9109 9110 - If OpenCL and 9111 address space is 9112 not generic, omit 9113 lgkmcnt(0). 9114 - Could be split into 9115 separate s_waitcnt 9116 vmcnt(0), s_waitcnt vscnt(0) 9117 and s_waitcnt 9118 lgkmcnt(0) to allow 9119 them to be 9120 independently moved 9121 according to the 9122 following rules. 9123 - s_waitcnt vmcnt(0) 9124 must happen after 9125 any preceding 9126 global/generic 9127 load/load 9128 atomic/ 9129 atomicrmw-with-return-value. 9130 - s_waitcnt vscnt(0) 9131 must happen after 9132 any preceding 9133 global/generic 9134 store/store atomic/ 9135 atomicrmw-no-return-value. 9136 - s_waitcnt lgkmcnt(0) 9137 must happen after 9138 any preceding 9139 local/generic 9140 load/store/load 9141 atomic/store 9142 atomic/atomicrmw. 9143 - Must happen before 9144 the following 9145 store. 9146 - Ensures that all 9147 memory operations 9148 have 9149 completed before 9150 performing the 9151 store that is being 9152 released. 9153 9154 2. buffer/global/flat_store 9155 atomicrmw release - singlethread - global 1. buffer/global/ds/flat_atomic 9156 - wavefront - local 9157 - generic 9158 atomicrmw release - workgroup - global 1. s_waitcnt lgkmcnt(0) & 9159 - generic vmcnt(0) & vscnt(0) 9160 9161 - If CU wavefront execution 9162 mode, omit vmcnt(0) and 9163 vscnt(0). 9164 - If OpenCL, omit lgkmcnt(0). 9165 - Could be split into 9166 separate s_waitcnt 9167 vmcnt(0), s_waitcnt 9168 vscnt(0) and s_waitcnt 9169 lgkmcnt(0) to allow 9170 them to be 9171 independently moved 9172 according to the 9173 following rules. 9174 - s_waitcnt vmcnt(0) 9175 must happen after 9176 any preceding 9177 global/generic load/load 9178 atomic/ 9179 atomicrmw-with-return-value. 
9180 - s_waitcnt vscnt(0) 9181 must happen after 9182 any preceding 9183 global/generic 9184 store/store 9185 atomic/ 9186 atomicrmw-no-return-value. 9187 - s_waitcnt lgkmcnt(0) 9188 must happen after 9189 any preceding 9190 local/generic 9191 load/store/load 9192 atomic/store 9193 atomic/atomicrmw. 9194 - Must happen before 9195 the following 9196 atomicrmw. 9197 - Ensures that all 9198 memory operations 9199 have 9200 completed before 9201 performing the 9202 atomicrmw that is 9203 being released. 9204 9205 2. buffer/global/flat_atomic 9206 atomicrmw release - workgroup - local 1. s_waitcnt vmcnt(0) & vscnt(0) 9207 9208 - If CU wavefront execution 9209 mode, omit. 9210 - If OpenCL, omit. 9211 - Could be split into 9212 separate s_waitcnt 9213 vmcnt(0) and s_waitcnt 9214 vscnt(0) to allow 9215 them to be 9216 independently moved 9217 according to the 9218 following rules. 9219 - s_waitcnt vmcnt(0) 9220 must happen after 9221 any preceding 9222 global/generic load/load 9223 atomic/ 9224 atomicrmw-with-return-value. 9225 - s_waitcnt vscnt(0) 9226 must happen after 9227 any preceding 9228 global/generic 9229 store/store atomic/ 9230 atomicrmw-no-return-value. 9231 - Must happen before 9232 the following 9233 store. 9234 - Ensures that all 9235 global memory 9236 operations have 9237 completed before 9238 performing the 9239 store that is being 9240 released. 9241 9242 2. ds_atomic 9243 atomicrmw release - agent - global 1. s_waitcnt lgkmcnt(0) & 9244 - system - generic vmcnt(0) & vscnt(0) 9245 9246 - If OpenCL, omit 9247 lgkmcnt(0). 9248 - Could be split into 9249 separate s_waitcnt 9250 vmcnt(0), s_waitcnt 9251 vscnt(0) and s_waitcnt 9252 lgkmcnt(0) to allow 9253 them to be 9254 independently moved 9255 according to the 9256 following rules. 9257 - s_waitcnt vmcnt(0) 9258 must happen after 9259 any preceding 9260 global/generic 9261 load/load atomic/ 9262 atomicrmw-with-return-value. 
9263 - s_waitcnt vscnt(0) 9264 must happen after 9265 any preceding 9266 global/generic 9267 store/store atomic/ 9268 atomicrmw-no-return-value. 9269 - s_waitcnt lgkmcnt(0) 9270 must happen after 9271 any preceding 9272 local/generic 9273 load/store/load 9274 atomic/store 9275 atomic/atomicrmw. 9276 - Must happen before 9277 the following 9278 atomicrmw. 9279 - Ensures that all 9280 memory operations 9281 to global and local 9282 have completed 9283 before performing 9284 the atomicrmw that 9285 is being released. 9286 9287 2. buffer/global/flat_atomic 9288 fence release - singlethread *none* *none* 9289 - wavefront 9290 fence release - workgroup *none* 1. s_waitcnt lgkmcnt(0) & 9291 vmcnt(0) & vscnt(0) 9292 9293 - If CU wavefront execution 9294 mode, omit vmcnt(0) and 9295 vscnt(0). 9296 - If OpenCL and 9297 address space is 9298 not generic, omit 9299 lgkmcnt(0). 9300 - If OpenCL and 9301 address space is 9302 local, omit 9303 vmcnt(0) and vscnt(0). 9304 - However, since LLVM 9305 currently has no 9306 address space on 9307 the fence need to 9308 conservatively 9309 always generate. If 9310 fence had an 9311 address space then 9312 set to address 9313 space of OpenCL 9314 fence flag, or to 9315 generic if both 9316 local and global 9317 flags are 9318 specified. 9319 - Could be split into 9320 separate s_waitcnt 9321 vmcnt(0), s_waitcnt 9322 vscnt(0) and s_waitcnt 9323 lgkmcnt(0) to allow 9324 them to be 9325 independently moved 9326 according to the 9327 following rules. 9328 - s_waitcnt vmcnt(0) 9329 must happen after 9330 any preceding 9331 global/generic 9332 load/load 9333 atomic/ 9334 atomicrmw-with-return-value. 9335 - s_waitcnt vscnt(0) 9336 must happen after 9337 any preceding 9338 global/generic 9339 store/store atomic/ 9340 atomicrmw-no-return-value. 9341 - s_waitcnt lgkmcnt(0) 9342 must happen after 9343 any preceding 9344 local/generic 9345 load/store/load 9346 atomic/store atomic/ 9347 atomicrmw. 
9348 - Must happen before 9349 any following store 9350 atomic/atomicrmw 9351 with an equal or 9352 wider sync scope 9353 and memory ordering 9354 stronger than 9355 unordered (this is 9356 termed the 9357 fence-paired-atomic). 9358 - Ensures that all 9359 memory operations 9360 have 9361 completed before 9362 performing the 9363 following 9364 fence-paired-atomic. 9365 9366 fence release - agent *none* 1. s_waitcnt lgkmcnt(0) & 9367 - system vmcnt(0) & vscnt(0) 9368 9369 - If OpenCL and 9370 address space is 9371 not generic, omit 9372 lgkmcnt(0). 9373 - If OpenCL and 9374 address space is 9375 local, omit 9376 vmcnt(0) and vscnt(0). 9377 - However, since LLVM 9378 currently has no 9379 address space on 9380 the fence need to 9381 conservatively 9382 always generate. If 9383 fence had an 9384 address space then 9385 set to address 9386 space of OpenCL 9387 fence flag, or to 9388 generic if both 9389 local and global 9390 flags are 9391 specified. 9392 - Could be split into 9393 separate s_waitcnt 9394 vmcnt(0), s_waitcnt 9395 vscnt(0) and s_waitcnt 9396 lgkmcnt(0) to allow 9397 them to be 9398 independently moved 9399 according to the 9400 following rules. 9401 - s_waitcnt vmcnt(0) 9402 must happen after 9403 any preceding 9404 global/generic 9405 load/load atomic/ 9406 atomicrmw-with-return-value. 9407 - s_waitcnt vscnt(0) 9408 must happen after 9409 any preceding 9410 global/generic 9411 store/store atomic/ 9412 atomicrmw-no-return-value. 9413 - s_waitcnt lgkmcnt(0) 9414 must happen after 9415 any preceding 9416 local/generic 9417 load/store/load 9418 atomic/store 9419 atomic/atomicrmw. 9420 - Must happen before 9421 any following store 9422 atomic/atomicrmw 9423 with an equal or 9424 wider sync scope 9425 and memory ordering 9426 stronger than 9427 unordered (this is 9428 termed the 9429 fence-paired-atomic). 9430 - Ensures that all 9431 memory operations 9432 have 9433 completed before 9434 performing the 9435 following 9436 fence-paired-atomic. 
9437 9438 **Acquire-Release Atomic** 9439 ------------------------------------------------------------------------------------ 9440 atomicrmw acq_rel - singlethread - global 1. buffer/global/ds/flat_atomic 9441 - wavefront - local 9442 - generic 9443 atomicrmw acq_rel - workgroup - global 1. s_waitcnt lgkmcnt(0) & 9444 vmcnt(0) & vscnt(0) 9445 9446 - If CU wavefront execution 9447 mode, omit vmcnt(0) and 9448 vscnt(0). 9449 - If OpenCL, omit 9450 lgkmcnt(0). 9451 - Must happen after 9452 any preceding 9453 local/generic 9454 load/store/load 9455 atomic/store 9456 atomic/atomicrmw. 9457 - Could be split into 9458 separate s_waitcnt 9459 vmcnt(0), s_waitcnt 9460 vscnt(0), and s_waitcnt 9461 lgkmcnt(0) to allow 9462 them to be 9463 independently moved 9464 according to the 9465 following rules. 9466 - s_waitcnt vmcnt(0) 9467 must happen after 9468 any preceding 9469 global/generic load/load 9470 atomic/ 9471 atomicrmw-with-return-value. 9472 - s_waitcnt vscnt(0) 9473 must happen after 9474 any preceding 9475 global/generic 9476 store/store 9477 atomic/ 9478 atomicrmw-no-return-value. 9479 - s_waitcnt lgkmcnt(0) 9480 must happen after 9481 any preceding 9482 local/generic 9483 load/store/load 9484 atomic/store 9485 atomic/atomicrmw. 9486 - Must happen before 9487 the following 9488 atomicrmw. 9489 - Ensures that all 9490 memory operations 9491 have 9492 completed before 9493 performing the 9494 atomicrmw that is 9495 being released. 9496 9497 2. buffer/global_atomic 9498 3. s_waitcnt vm/vscnt(0) 9499 9500 - If CU wavefront execution 9501 mode, omit. 9502 - Use vmcnt(0) if atomic with 9503 return and vscnt(0) if 9504 atomic with no-return. 9505 - Must happen before 9506 the following 9507 buffer_gl0_inv. 9508 - Ensures any 9509 following global 9510 data read is no 9511 older than the 9512 atomicrmw value 9513 being acquired. 9514 9515 4. buffer_gl0_inv 9516 9517 - If CU wavefront execution 9518 mode, omit. 
9519 - Ensures that 9520 following 9521 loads will not see 9522 stale data. 9523 9524 atomicrmw acq_rel - workgroup - local 1. s_waitcnt vmcnt(0) & vscnt(0) 9525 9526 - If CU wavefront execution 9527 mode, omit. 9528 - If OpenCL, omit. 9529 - Could be split into 9530 separate s_waitcnt 9531 vmcnt(0) and s_waitcnt 9532 vscnt(0) to allow 9533 them to be 9534 independently moved 9535 according to the 9536 following rules. 9537 - s_waitcnt vmcnt(0) 9538 must happen after 9539 any preceding 9540 global/generic load/load 9541 atomic/ 9542 atomicrmw-with-return-value. 9543 - s_waitcnt vscnt(0) 9544 must happen after 9545 any preceding 9546 global/generic 9547 store/store atomic/ 9548 atomicrmw-no-return-value. 9549 - Must happen before 9550 the following 9551 store. 9552 - Ensures that all 9553 global memory 9554 operations have 9555 completed before 9556 performing the 9557 store that is being 9558 released. 9559 9560 2. ds_atomic 9561 3. s_waitcnt lgkmcnt(0) 9562 9563 - If OpenCL, omit. 9564 - Must happen before 9565 the following 9566 buffer_gl0_inv. 9567 - Ensures any 9568 following global 9569 data read is no 9570 older than the local load 9571 atomic value being 9572 acquired. 9573 9574 4. buffer_gl0_inv 9575 9576 - If CU wavefront execution 9577 mode, omit. 9578 - If OpenCL omit. 9579 - Ensures that 9580 following 9581 loads will not see 9582 stale data. 9583 9584 atomicrmw acq_rel - workgroup - generic 1. s_waitcnt lgkmcnt(0) & 9585 vmcnt(0) & vscnt(0) 9586 9587 - If CU wavefront execution 9588 mode, omit vmcnt(0) and 9589 vscnt(0). 9590 - If OpenCL, omit lgkmcnt(0). 9591 - Could be split into 9592 separate s_waitcnt 9593 vmcnt(0), s_waitcnt 9594 vscnt(0) and s_waitcnt 9595 lgkmcnt(0) to allow 9596 them to be 9597 independently moved 9598 according to the 9599 following rules. 9600 - s_waitcnt vmcnt(0) 9601 must happen after 9602 any preceding 9603 global/generic load/load 9604 atomic/ 9605 atomicrmw-with-return-value. 
9606 - s_waitcnt vscnt(0) 9607 must happen after 9608 any preceding 9609 global/generic 9610 store/store 9611 atomic/ 9612 atomicrmw-no-return-value. 9613 - s_waitcnt lgkmcnt(0) 9614 must happen after 9615 any preceding 9616 local/generic 9617 load/store/load 9618 atomic/store 9619 atomic/atomicrmw. 9620 - Must happen before 9621 the following 9622 atomicrmw. 9623 - Ensures that all 9624 memory operations 9625 have 9626 completed before 9627 performing the 9628 atomicrmw that is 9629 being released. 9630 9631 2. flat_atomic 9632 3. s_waitcnt lgkmcnt(0) & 9633 vmcnt(0) & vscnt(0) 9634 9635 - If CU wavefront execution 9636 mode, omit vmcnt(0) and 9637 vscnt(0). 9638 - If OpenCL, omit lgkmcnt(0). 9639 - Must happen before 9640 the following 9641 buffer_gl0_inv. 9642 - Ensures any 9643 following global 9644 data read is no 9645 older than the load 9646 atomic value being 9647 acquired. 9648 9649 3. buffer_gl0_inv 9650 9651 - If CU wavefront execution 9652 mode, omit. 9653 - Ensures that 9654 following 9655 loads will not see 9656 stale data. 9657 9658 atomicrmw acq_rel - agent - global 1. s_waitcnt lgkmcnt(0) & 9659 - system vmcnt(0) & vscnt(0) 9660 9661 - If OpenCL, omit 9662 lgkmcnt(0). 9663 - Could be split into 9664 separate s_waitcnt 9665 vmcnt(0), s_waitcnt 9666 vscnt(0) and s_waitcnt 9667 lgkmcnt(0) to allow 9668 them to be 9669 independently moved 9670 according to the 9671 following rules. 9672 - s_waitcnt vmcnt(0) 9673 must happen after 9674 any preceding 9675 global/generic 9676 load/load atomic/ 9677 atomicrmw-with-return-value. 9678 - s_waitcnt vscnt(0) 9679 must happen after 9680 any preceding 9681 global/generic 9682 store/store atomic/ 9683 atomicrmw-no-return-value. 9684 - s_waitcnt lgkmcnt(0) 9685 must happen after 9686 any preceding 9687 local/generic 9688 load/store/load 9689 atomic/store 9690 atomic/atomicrmw. 9691 - Must happen before 9692 the following 9693 atomicrmw. 
9694 - Ensures that all 9695 memory operations 9696 to global have 9697 completed before 9698 performing the 9699 atomicrmw that is 9700 being released. 9701 9702 2. buffer/global_atomic 9703 3. s_waitcnt vm/vscnt(0) 9704 9705 - Use vmcnt(0) if atomic with 9706 return and vscnt(0) if 9707 atomic with no-return. 9708 - Must happen before 9709 following 9710 buffer_gl*_inv. 9711 - Ensures the 9712 atomicrmw has 9713 completed before 9714 invalidating the 9715 caches. 9716 9717 4. buffer_gl0_inv; 9718 buffer_gl1_inv 9719 9720 - Must happen before 9721 any following 9722 global/generic 9723 load/load 9724 atomic/atomicrmw. 9725 - Ensures that 9726 following loads 9727 will not see stale 9728 global data. 9729 9730 atomicrmw acq_rel - agent - generic 1. s_waitcnt lgkmcnt(0) & 9731 - system vmcnt(0) & vscnt(0) 9732 9733 - If OpenCL, omit 9734 lgkmcnt(0). 9735 - Could be split into 9736 separate s_waitcnt 9737 vmcnt(0), s_waitcnt 9738 vscnt(0), and s_waitcnt 9739 lgkmcnt(0) to allow 9740 them to be 9741 independently moved 9742 according to the 9743 following rules. 9744 - s_waitcnt vmcnt(0) 9745 must happen after 9746 any preceding 9747 global/generic 9748 load/load atomic 9749 atomicrmw-with-return-value. 9750 - s_waitcnt vscnt(0) 9751 must happen after 9752 any preceding 9753 global/generic 9754 store/store atomic/ 9755 atomicrmw-no-return-value. 9756 - s_waitcnt lgkmcnt(0) 9757 must happen after 9758 any preceding 9759 local/generic 9760 load/store/load 9761 atomic/store 9762 atomic/atomicrmw. 9763 - Must happen before 9764 the following 9765 atomicrmw. 9766 - Ensures that all 9767 memory operations 9768 have 9769 completed before 9770 performing the 9771 atomicrmw that is 9772 being released. 9773 9774 2. flat_atomic 9775 3. s_waitcnt vm/vscnt(0) & 9776 lgkmcnt(0) 9777 9778 - If OpenCL, omit 9779 lgkmcnt(0). 9780 - Use vmcnt(0) if atomic with 9781 return and vscnt(0) if 9782 atomic with no-return. 9783 - Must happen before 9784 following 9785 buffer_gl*_inv. 
9786 - Ensures the 9787 atomicrmw has 9788 completed before 9789 invalidating the 9790 caches. 9791 9792 4. buffer_gl0_inv; 9793 buffer_gl1_inv 9794 9795 - Must happen before 9796 any following 9797 global/generic 9798 load/load 9799 atomic/atomicrmw. 9800 - Ensures that 9801 following loads 9802 will not see stale 9803 global data. 9804 9805 fence acq_rel - singlethread *none* *none* 9806 - wavefront 9807 fence acq_rel - workgroup *none* 1. s_waitcnt lgkmcnt(0) & 9808 vmcnt(0) & vscnt(0) 9809 9810 - If CU wavefront execution 9811 mode, omit vmcnt(0) and 9812 vscnt(0). 9813 - If OpenCL and 9814 address space is 9815 not generic, omit 9816 lgkmcnt(0). 9817 - If OpenCL and 9818 address space is 9819 local, omit 9820 vmcnt(0) and vscnt(0). 9821 - However, 9822 since LLVM 9823 currently has no 9824 address space on 9825 the fence need to 9826 conservatively 9827 always generate 9828 (see comment for 9829 previous fence). 9830 - Could be split into 9831 separate s_waitcnt 9832 vmcnt(0), s_waitcnt 9833 vscnt(0) and s_waitcnt 9834 lgkmcnt(0) to allow 9835 them to be 9836 independently moved 9837 according to the 9838 following rules. 9839 - s_waitcnt vmcnt(0) 9840 must happen after 9841 any preceding 9842 global/generic 9843 load/load 9844 atomic/ 9845 atomicrmw-with-return-value. 9846 - s_waitcnt vscnt(0) 9847 must happen after 9848 any preceding 9849 global/generic 9850 store/store atomic/ 9851 atomicrmw-no-return-value. 9852 - s_waitcnt lgkmcnt(0) 9853 must happen after 9854 any preceding 9855 local/generic 9856 load/store/load 9857 atomic/store atomic/ 9858 atomicrmw. 9859 - Must happen before 9860 any following 9861 global/generic 9862 load/load 9863 atomic/store/store 9864 atomic/atomicrmw. 9865 - Ensures that all 9866 memory operations 9867 have 9868 completed before 9869 performing any 9870 following global 9871 memory operations. 
9872 - Ensures that the 9873 preceding 9874 local/generic load 9875 atomic/atomicrmw 9876 with an equal or 9877 wider sync scope 9878 and memory ordering 9879 stronger than 9880 unordered (this is 9881 termed the 9882 acquire-fence-paired-atomic) 9883 has completed 9884 before following 9885 global memory 9886 operations. This 9887 satisfies the 9888 requirements of 9889 acquire. 9890 - Ensures that all 9891 previous memory 9892 operations have 9893 completed before a 9894 following 9895 local/generic store 9896 atomic/atomicrmw 9897 with an equal or 9898 wider sync scope 9899 and memory ordering 9900 stronger than 9901 unordered (this is 9902 termed the 9903 release-fence-paired-atomic). 9904 This satisfies the 9905 requirements of 9906 release. 9907 - Must happen before 9908 the following 9909 buffer_gl0_inv. 9910 - Ensures that the 9911 acquire-fence-paired 9912 atomic has completed 9913 before invalidating 9914 the 9915 cache. Therefore 9916 any following 9917 locations read must 9918 be no older than 9919 the value read by 9920 the 9921 acquire-fence-paired-atomic. 9922 9923 3. buffer_gl0_inv 9924 9925 - If CU wavefront execution 9926 mode, omit. 9927 - Ensures that 9928 following 9929 loads will not see 9930 stale data. 9931 9932 fence acq_rel - agent *none* 1. s_waitcnt lgkmcnt(0) & 9933 - system vmcnt(0) & vscnt(0) 9934 9935 - If OpenCL and 9936 address space is 9937 not generic, omit 9938 lgkmcnt(0). 9939 - If OpenCL and 9940 address space is 9941 local, omit 9942 vmcnt(0) and vscnt(0). 9943 - However, since LLVM 9944 currently has no 9945 address space on 9946 the fence need to 9947 conservatively 9948 always generate 9949 (see comment for 9950 previous fence). 9951 - Could be split into 9952 separate s_waitcnt 9953 vmcnt(0), s_waitcnt 9954 vscnt(0) and s_waitcnt 9955 lgkmcnt(0) to allow 9956 them to be 9957 independently moved 9958 according to the 9959 following rules. 
9960 - s_waitcnt vmcnt(0) 9961 must happen after 9962 any preceding 9963 global/generic 9964 load/load 9965 atomic/ 9966 atomicrmw-with-return-value. 9967 - s_waitcnt vscnt(0) 9968 must happen after 9969 any preceding 9970 global/generic 9971 store/store atomic/ 9972 atomicrmw-no-return-value. 9973 - s_waitcnt lgkmcnt(0) 9974 must happen after 9975 any preceding 9976 local/generic 9977 load/store/load 9978 atomic/store 9979 atomic/atomicrmw. 9980 - Must happen before 9981 the following 9982 buffer_gl*_inv. 9983 - Ensures that the 9984 preceding 9985 global/local/generic 9986 load 9987 atomic/atomicrmw 9988 with an equal or 9989 wider sync scope 9990 and memory ordering 9991 stronger than 9992 unordered (this is 9993 termed the 9994 acquire-fence-paired-atomic) 9995 has completed 9996 before invalidating 9997 the caches. This 9998 satisfies the 9999 requirements of 10000 acquire. 10001 - Ensures that all 10002 previous memory 10003 operations have 10004 completed before a 10005 following 10006 global/local/generic 10007 store 10008 atomic/atomicrmw 10009 with an equal or 10010 wider sync scope 10011 and memory ordering 10012 stronger than 10013 unordered (this is 10014 termed the 10015 release-fence-paired-atomic). 10016 This satisfies the 10017 requirements of 10018 release. 10019 10020 2. buffer_gl0_inv; 10021 buffer_gl1_inv 10022 10023 - Must happen before 10024 any following 10025 global/generic 10026 load/load 10027 atomic/store/store 10028 atomic/atomicrmw. 10029 - Ensures that 10030 following loads 10031 will not see stale 10032 global data. This 10033 satisfies the 10034 requirements of 10035 acquire. 
10036 10037 **Sequential Consistent Atomic** 10038 ------------------------------------------------------------------------------------ 10039 load atomic seq_cst - singlethread - global *Same as corresponding 10040 - wavefront - local load atomic acquire, 10041 - generic except must generated 10042 all instructions even 10043 for OpenCL.* 10044 load atomic seq_cst - workgroup - global 1. s_waitcnt lgkmcnt(0) & 10045 - generic vmcnt(0) & vscnt(0) 10046 10047 - If CU wavefront execution 10048 mode, omit vmcnt(0) and 10049 vscnt(0). 10050 - Could be split into 10051 separate s_waitcnt 10052 vmcnt(0), s_waitcnt 10053 vscnt(0), and s_waitcnt 10054 lgkmcnt(0) to allow 10055 them to be 10056 independently moved 10057 according to the 10058 following rules. 10059 - s_waitcnt lgkmcnt(0) must 10060 happen after 10061 preceding 10062 local/generic load 10063 atomic/store 10064 atomic/atomicrmw 10065 with memory 10066 ordering of seq_cst 10067 and with equal or 10068 wider sync scope. 10069 (Note that seq_cst 10070 fences have their 10071 own s_waitcnt 10072 lgkmcnt(0) and so do 10073 not need to be 10074 considered.) 10075 - s_waitcnt vmcnt(0) 10076 must happen after 10077 preceding 10078 global/generic load 10079 atomic/ 10080 atomicrmw-with-return-value 10081 with memory 10082 ordering of seq_cst 10083 and with equal or 10084 wider sync scope. 10085 (Note that seq_cst 10086 fences have their 10087 own s_waitcnt 10088 vmcnt(0) and so do 10089 not need to be 10090 considered.) 10091 - s_waitcnt vscnt(0) 10092 Must happen after 10093 preceding 10094 global/generic store 10095 atomic/ 10096 atomicrmw-no-return-value 10097 with memory 10098 ordering of seq_cst 10099 and with equal or 10100 wider sync scope. 10101 (Note that seq_cst 10102 fences have their 10103 own s_waitcnt 10104 vscnt(0) and so do 10105 not need to be 10106 considered.) 
10107 - Ensures any 10108 preceding 10109 sequential 10110 consistent global/local 10111 memory instructions 10112 have completed 10113 before executing 10114 this sequentially 10115 consistent 10116 instruction. This 10117 prevents reordering 10118 a seq_cst store 10119 followed by a 10120 seq_cst load. (Note 10121 that seq_cst is 10122 stronger than 10123 acquire/release as 10124 the reordering of 10125 load acquire 10126 followed by a store 10127 release is 10128 prevented by the 10129 s_waitcnt of 10130 the release, but 10131 there is nothing 10132 preventing a store 10133 release followed by 10134 load acquire from 10135 completing out of 10136 order. The s_waitcnt 10137 could be placed after 10138 seq_store or before 10139 the seq_load. We 10140 choose the load to 10141 make the s_waitcnt be 10142 as late as possible 10143 so that the store 10144 may have already 10145 completed.) 10146 10147 2. *Following 10148 instructions same as 10149 corresponding load 10150 atomic acquire, 10151 except must generated 10152 all instructions even 10153 for OpenCL.* 10154 load atomic seq_cst - workgroup - local 10155 10156 1. s_waitcnt vmcnt(0) & vscnt(0) 10157 10158 - If CU wavefront execution 10159 mode, omit. 10160 - Could be split into 10161 separate s_waitcnt 10162 vmcnt(0) and s_waitcnt 10163 vscnt(0) to allow 10164 them to be 10165 independently moved 10166 according to the 10167 following rules. 10168 - s_waitcnt vmcnt(0) 10169 Must happen after 10170 preceding 10171 global/generic load 10172 atomic/ 10173 atomicrmw-with-return-value 10174 with memory 10175 ordering of seq_cst 10176 and with equal or 10177 wider sync scope. 10178 (Note that seq_cst 10179 fences have their 10180 own s_waitcnt 10181 vmcnt(0) and so do 10182 not need to be 10183 considered.) 
10184 - s_waitcnt vscnt(0) 10185 Must happen after 10186 preceding 10187 global/generic store 10188 atomic/ 10189 atomicrmw-no-return-value 10190 with memory 10191 ordering of seq_cst 10192 and with equal or 10193 wider sync scope. 10194 (Note that seq_cst 10195 fences have their 10196 own s_waitcnt 10197 vscnt(0) and so do 10198 not need to be 10199 considered.) 10200 - Ensures any 10201 preceding 10202 sequential 10203 consistent global 10204 memory instructions 10205 have completed 10206 before executing 10207 this sequentially 10208 consistent 10209 instruction. This 10210 prevents reordering 10211 a seq_cst store 10212 followed by a 10213 seq_cst load. (Note 10214 that seq_cst is 10215 stronger than 10216 acquire/release as 10217 the reordering of 10218 load acquire 10219 followed by a store 10220 release is 10221 prevented by the 10222 s_waitcnt of 10223 the release, but 10224 there is nothing 10225 preventing a store 10226 release followed by 10227 load acquire from 10228 completing out of 10229 order. The s_waitcnt 10230 could be placed after 10231 seq_store or before 10232 the seq_load. We 10233 choose the load to 10234 make the s_waitcnt be 10235 as late as possible 10236 so that the store 10237 may have already 10238 completed.) 10239 10240 2. *Following 10241 instructions same as 10242 corresponding load 10243 atomic acquire, 10244 except must generated 10245 all instructions even 10246 for OpenCL.* 10247 10248 load atomic seq_cst - agent - global 1. s_waitcnt lgkmcnt(0) & 10249 - system - generic vmcnt(0) & vscnt(0) 10250 10251 - Could be split into 10252 separate s_waitcnt 10253 vmcnt(0), s_waitcnt 10254 vscnt(0) and s_waitcnt 10255 lgkmcnt(0) to allow 10256 them to be 10257 independently moved 10258 according to the 10259 following rules. 
                                                           - s_waitcnt lgkmcnt(0)
                                                             must happen after
                                                             preceding local load
                                                             atomic/store
                                                             atomic/atomicrmw with
                                                             memory ordering of
                                                             seq_cst and with equal
                                                             or wider sync scope.
                                                             (Note that seq_cst
                                                             fences have their own
                                                             s_waitcnt lgkmcnt(0)
                                                             and so do not need to
                                                             be considered.)
                                                           - s_waitcnt vmcnt(0) must
                                                             happen after preceding
                                                             global/generic load
                                                             atomic/
                                                             atomicrmw-with-return-value
                                                             with memory ordering of
                                                             seq_cst and with equal
                                                             or wider sync scope.
                                                             (Note that seq_cst
                                                             fences have their own
                                                             s_waitcnt vmcnt(0) and
                                                             so do not need to be
                                                             considered.)
                                                           - s_waitcnt vscnt(0) must
                                                             happen after preceding
                                                             global/generic store
                                                             atomic/
                                                             atomicrmw-no-return-value
                                                             with memory ordering of
                                                             seq_cst and with equal
                                                             or wider sync scope.
                                                             (Note that seq_cst
                                                             fences have their own
                                                             s_waitcnt vscnt(0) and
                                                             so do not need to be
                                                             considered.)
                                                           - Ensures any preceding
                                                             sequentially consistent
                                                             global memory
                                                             instructions have
                                                             completed before
                                                             executing this
                                                             sequentially consistent
                                                             instruction. This
                                                             prevents reordering a
                                                             seq_cst store followed
                                                             by a seq_cst load. (Note
                                                             that seq_cst is stronger
                                                             than acquire/release as
                                                             the reordering of a load
                                                             acquire followed by a
                                                             store release is
                                                             prevented by the
                                                             s_waitcnt of the
                                                             release, but there is
                                                             nothing preventing a
                                                             store release followed
                                                             by a load acquire from
                                                             completing out of order.
                                                             The s_waitcnt could be
                                                             placed after the seq_cst
                                                             store or before the
                                                             seq_cst load. We choose
                                                             the load to make the
                                                             s_waitcnt be as late as
                                                             possible so that the
                                                             store may have already
                                                             completed.)

                                                        2. *Following instructions
                                                           are the same as the
                                                           corresponding load atomic
                                                           acquire, except all
                                                           instructions must be
                                                           generated even for
                                                           OpenCL.*
     store atomic seq_cst      - singlethread - global   *Same as corresponding
                               - wavefront    - local    store atomic release,
                               - workgroup    - generic  except all instructions
                               - agent                   must be generated even
                               - system                  for OpenCL.*
     atomicrmw    seq_cst      - singlethread - global   *Same as corresponding
                               - wavefront    - local    atomicrmw acq_rel,
                               - workgroup    - generic  except all instructions
                               - agent                   must be generated even
                               - system                  for OpenCL.*
     fence        seq_cst      - singlethread *none*     *Same as corresponding
                               - wavefront               fence acq_rel,
                               - workgroup               except all instructions
                               - agent                   must be generated even
                               - system                  for OpenCL.*
     ============ ============ ============== ========== ================================

Trap Handler ABI
~~~~~~~~~~~~~~~~

For code objects generated by the AMDGPU backend for HSA [HSA]_ compatible
runtimes (see :ref:`amdgpu-os`), the runtime installs a trap handler that
supports the ``s_trap`` instruction. For usage see:

- :ref:`amdgpu-trap-handler-for-amdhsa-os-v2-table`
- :ref:`amdgpu-trap-handler-for-amdhsa-os-v3-table`
- :ref:`amdgpu-trap-handler-for-amdhsa-os-v4-table`

  .. table:: AMDGPU Trap Handler for AMDHSA OS Code Object V2
     :name: amdgpu-trap-handler-for-amdhsa-os-v2-table

     =================== =============== =============== =======================================
     Usage               Code Sequence   Trap Handler    Description
                                         Inputs
     =================== =============== =============== =======================================
     reserved            ``s_trap 0x00``                 Reserved by hardware.
     ``debugtrap(arg)``  ``s_trap 0x01`` ``SGPR0-1``:    Reserved for Finalizer HSA
                                         ``queue_ptr``   ``debugtrap`` intrinsic (not
                                                         implemented).
                                         ``VGPR0``:
                                         ``arg``
     ``llvm.trap``       ``s_trap 0x02`` ``SGPR0-1``:    Causes the wave to be halted with the
                                         ``queue_ptr``   PC at the trap instruction. The
                                                         associated queue is signalled to put
                                                         it into the error state. When the
                                                         queue is put into the error state, the
                                                         waves executing dispatches on the
                                                         queue will be terminated.
     ``llvm.debugtrap``  ``s_trap 0x03`` *none*          - If the debugger is not enabled then
                                                           this behaves as a no-operation. The
                                                           trap handler is entered and
                                                           immediately returns to continue
                                                           execution of the wavefront.
                                                         - If the debugger is enabled, causes
                                                           the debug trap to be reported by the
                                                           debugger and the wavefront is put in
                                                           the halt state with the PC at the
                                                           instruction. The debugger must
                                                           increment the PC and resume the
                                                           wave.
     reserved            ``s_trap 0x04``                 Reserved.
     reserved            ``s_trap 0x05``                 Reserved.
     reserved            ``s_trap 0x06``                 Reserved.
     reserved            ``s_trap 0x07``                 Reserved.
     reserved            ``s_trap 0x08``                 Reserved.
     reserved            ``s_trap 0xfe``                 Reserved.
     reserved            ``s_trap 0xff``                 Reserved.
     =================== =============== =============== =======================================

..

  .. table:: AMDGPU Trap Handler for AMDHSA OS Code Object V3
     :name: amdgpu-trap-handler-for-amdhsa-os-v3-table

     =================== =============== =============== =======================================
     Usage               Code Sequence   Trap Handler    Description
                                         Inputs
     =================== =============== =============== =======================================
     reserved            ``s_trap 0x00``                 Reserved by hardware.
     debugger breakpoint ``s_trap 0x01`` *none*          Reserved for the debugger to use for
                                                         breakpoints. Causes the wave to be
                                                         halted with the PC at the trap
                                                         instruction. The debugger is
                                                         responsible for resuming the wave,
                                                         including the instruction that the
                                                         breakpoint overwrote.
     ``llvm.trap``       ``s_trap 0x02`` ``SGPR0-1``:    Causes the wave to be halted with the
                                         ``queue_ptr``   PC at the trap instruction. The
                                                         associated queue is signalled to put
                                                         it into the error state. When the
                                                         queue is put into the error state, the
                                                         waves executing dispatches on the
                                                         queue will be terminated.
     ``llvm.debugtrap``  ``s_trap 0x03`` *none*          - If the debugger is not enabled then
                                                           this behaves as a no-operation. The
                                                           trap handler is entered and
                                                           immediately returns to continue
                                                           execution of the wavefront.
                                                         - If the debugger is enabled, causes
                                                           the debug trap to be reported by the
                                                           debugger and the wavefront is put in
                                                           the halt state with the PC at the
                                                           instruction. The debugger must
                                                           increment the PC and resume the
                                                           wave.
     reserved            ``s_trap 0x04``                 Reserved.
     reserved            ``s_trap 0x05``                 Reserved.
     reserved            ``s_trap 0x06``                 Reserved.
     reserved            ``s_trap 0x07``                 Reserved.
     reserved            ``s_trap 0x08``                 Reserved.
     reserved            ``s_trap 0xfe``                 Reserved.
     reserved            ``s_trap 0xff``                 Reserved.
     =================== =============== =============== =======================================

..

  .. table:: AMDGPU Trap Handler for AMDHSA OS Code Object V4
     :name: amdgpu-trap-handler-for-amdhsa-os-v4-table

     =================== =============== ================ ================= =======================================
     Usage               Code Sequence   GFX6-GFX8 Inputs GFX9-GFX10 Inputs Description
     =================== =============== ================ ================= =======================================
     reserved            ``s_trap 0x00``                                    Reserved by hardware.
     debugger breakpoint ``s_trap 0x01`` *none*           *none*            Reserved for the debugger to use for
                                                                            breakpoints. Causes the wave to be
                                                                            halted with the PC at the trap
                                                                            instruction. The debugger is
                                                                            responsible for resuming the wave,
                                                                            including the instruction that the
                                                                            breakpoint overwrote.
     ``llvm.trap``       ``s_trap 0x02`` ``SGPR0-1``:     *none*            Causes the wave to be halted with the
                                         ``queue_ptr``                      PC at the trap instruction. The
                                                                            associated queue is signalled to put
                                                                            it into the error state. When the
                                                                            queue is put into the error state, the
                                                                            waves executing dispatches on the
                                                                            queue will be terminated.
     ``llvm.debugtrap``  ``s_trap 0x03`` *none*           *none*            - If the debugger is not enabled then
                                                                              this behaves as a no-operation. The
                                                                              trap handler is entered and
                                                                              immediately returns to continue
                                                                              execution of the wavefront.
                                                                            - If the debugger is enabled, causes
                                                                              the debug trap to be reported by the
                                                                              debugger and the wavefront is put in
                                                                              the halt state with the PC at the
                                                                              instruction. The debugger must
                                                                              increment the PC and resume the
                                                                              wave.
     reserved            ``s_trap 0x04``                                    Reserved.
     reserved            ``s_trap 0x05``                                    Reserved.
     reserved            ``s_trap 0x06``                                    Reserved.
     reserved            ``s_trap 0x07``                                    Reserved.
     reserved            ``s_trap 0x08``                                    Reserved.
     reserved            ``s_trap 0xfe``                                    Reserved.
     reserved            ``s_trap 0xff``                                    Reserved.
     =================== =============== ================ ================= =======================================

.. _amdgpu-amdhsa-function-call-convention:

Call Convention
~~~~~~~~~~~~~~~

.. note::

   This section is currently incomplete and has inaccuracies. It is a WIP that
   will be updated as information is determined.

See :ref:`amdgpu-dwarf-address-space-identifier` for information on swizzled
addresses. Unswizzled addresses are normal linear addresses.

.. _amdgpu-amdhsa-function-call-convention-kernel-functions:

Kernel Functions
++++++++++++++++

This section describes the call convention ABI for the outer kernel function.

See :ref:`amdgpu-amdhsa-initial-kernel-execution-state` for the kernel call
convention.

The following is not part of the AMDGPU kernel calling convention but describes
how the AMDGPU implements function calls:

1. Clang decides the kernarg layout to match the *HSA Programmer's Language
   Reference* [HSA]_.

   - All structs are passed directly.
   - Lambda values are passed *TBA*.

   .. TODO::

      - Does this really follow HSA rules? Or are structs >16 bytes passed as
        by-value struct?
      - What is the ABI for lambda values?

2. The kernel performs certain setup in its prolog, as described in
   :ref:`amdgpu-amdhsa-kernel-prolog`.

.. _amdgpu-amdhsa-function-call-convention-non-kernel-functions:

Non-Kernel Functions
++++++++++++++++++++

This section describes the call convention ABI for functions other than the
outer kernel function.

If a kernel has function calls then scratch is always allocated and used for
the call stack, which grows from low address to high address using the swizzled
scratch address space.

On entry to a function:

1. SGPR0-3 contain a V# with the following properties (see
   :ref:`amdgpu-amdhsa-kernel-prolog-private-segment-buffer`):

   * Base address pointing to the beginning of the wavefront scratch backing
     memory.
   * Swizzled with dword element size and stride of wavefront size elements.

2. The FLAT_SCRATCH register pair is set up. See
   :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`.
3. GFX6-GFX8: The M0 register is set to the size of LDS in bytes. See
   :ref:`amdgpu-amdhsa-kernel-prolog-m0`.
4. The EXEC register is set to the lanes active on entry to the function.
5. MODE register: *TBD*
6. VGPR0-31 and SGPR4-29 are used to pass function input arguments as described
   below.
7. SGPR30-31 hold the return address (RA), the code address that the function
   must return to when it completes. The value is undefined if the function is
   *no return*.
8. SGPR32 is used for the stack pointer (SP). It is an unswizzled scratch
   offset relative to the beginning of the wavefront scratch backing memory.

   The unswizzled SP can be used with buffer instructions as an unswizzled SGPR
   offset with the scratch V# in SGPR0-3 to access the stack in a swizzled
   manner.

   The unswizzled SP value can be converted into the swizzled SP value by:

     | swizzled SP = unswizzled SP / wavefront size

   This may be used to obtain the private address space address of stack
   objects and to convert this address to a flat address by adding the flat
   scratch aperture base address.

   The swizzled SP value is always 4 bytes aligned for the ``r600``
   architecture and 16 bytes aligned for the ``amdgcn`` architecture.

   .. note::

      The ``amdgcn`` value is selected to avoid dynamic stack alignment for the
      OpenCL language, which has its largest base type defined as 16 bytes.

   On entry, the swizzled SP value is the address of the first function
   argument passed on the stack. Other stack-passed arguments are positive
   offsets from the entry swizzled SP value.

   The function may use positive offsets beyond the last stack-passed argument
   for stack allocated local variables and register spill slots. If necessary,
   the function may align these to an alignment greater than 16 bytes. After
   these, the function may dynamically allocate space for such things as
   runtime sized ``alloca`` local allocations.

   If the function calls another function, it will place any stack allocated
   arguments after the last local allocation and adjust SGPR32 to the address
   after the last local allocation.

9. All other registers are unspecified.
10. Any necessary ``s_waitcnt`` has been performed to ensure memory is
    available to the function.

On exit from a function:

1. VGPR0-31 and SGPR4-29 are used to pass function result arguments as
   described below. Any registers used are considered clobbered registers.
2. The following registers are preserved and have the same value as on entry:

   * FLAT_SCRATCH
   * EXEC
   * GFX6-GFX8: M0
   * All SGPR registers except the clobbered registers of SGPR4-31.
   * VGPR40-47
   * VGPR56-63
   * VGPR72-79
   * VGPR88-95
   * VGPR104-111
   * VGPR120-127
   * VGPR136-143
   * VGPR152-159
   * VGPR168-175
   * VGPR184-191
   * VGPR200-207
   * VGPR216-223
   * VGPR232-239
   * VGPR248-255

     .. note::

        Except for the argument registers, the clobbered VGPRs and the
        preserved registers are intermixed at regular intervals in order to
        keep a similar ratio independent of the number of allocated VGPRs.

   * Lanes of all VGPRs that are inactive at the call site.

   For the AMDGPU backend, an inter-procedural register allocation (IPRA)
   optimization may mark some of the clobbered SGPR and VGPR registers as
   preserved if it can be determined that the called function does not change
   their value.

3. The PC is set to the RA provided on entry.
4. MODE register: *TBD*.
5. All other registers are clobbered.
6. Any necessary ``s_waitcnt`` has been performed to ensure memory accessed by
   the function is available to the caller.

.. TODO::

   - On gfx908 are all ACC registers clobbered?

   - How are function results returned? The address of structured types is
     passed by reference, but what about other types?
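
The unswizzled-to-swizzled SP conversion described above can be illustrated
numerically. The following is a minimal sketch; the wave64 wavefront size and
the byte offsets are illustrative assumptions, not values mandated by the ABI:

```python
# Sketch of: swizzled SP = unswizzled SP / wavefront size.
# WAVEFRONT_SIZE is an illustrative assumption (wave64).
WAVEFRONT_SIZE = 64

def swizzled_sp(unswizzled_sp: int) -> int:
    # The unswizzled SP is a byte offset into the wavefront scratch
    # backing memory; dividing by the wavefront size yields the
    # per-lane (swizzled) private address space offset.
    return unswizzled_sp // WAVEFRONT_SIZE

# 4096 bytes into the wavefront scratch backing memory corresponds
# to a 64-byte per-lane offset.
assert swizzled_sp(4096) == 64

# Keeping the unswizzled SP a multiple of 16 * WAVEFRONT_SIZE preserves
# the 16-byte swizzled SP alignment used by the amdgcn architecture.
assert swizzled_sp(16 * WAVEFRONT_SIZE) % 16 == 0
```

Adding the flat scratch aperture base address to the resulting private address
space address would, as described above, give a flat address.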

The function input arguments are made up of the formal arguments explicitly
declared by the source language function plus the implicit input arguments used
by the implementation.

The source language input arguments are:

1. Any source language implicit ``this`` or ``self`` argument comes first as a
   pointer type.
2. Followed by the function formal arguments in left to right source order.

The source language result arguments are:

1. The function result argument.

Source language input or result struct type arguments that are less than or
equal to 16 bytes are decomposed recursively into their base type fields, and
each field is passed as if it were a separate argument. For input arguments, if
the called function requires the struct to be in memory, for example because
its address is taken, then the function body is responsible for allocating a
stack location and copying the field arguments into it. Clang terms this
*direct struct*.

Source language input struct type arguments that are greater than 16 bytes are
passed by reference. The caller is responsible for allocating a stack location
to make a copy of the struct value and passes the address as the input
argument. The called function is responsible for performing the dereference
when accessing the input argument. Clang terms this *by-value struct*.

A source language result struct type argument that is greater than 16 bytes is
returned by reference. The caller is responsible for allocating a stack
location to hold the result value and passes the address as the last input
argument (before the implicit input arguments). In this case there are no
result arguments. The called function is responsible for performing the
dereference when storing the result value. Clang terms this *structured return
(sret)*.
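
The three struct-passing cases above can be summarized by argument size and
direction. The following sketch is only illustrative; the helper name and its
string results are hypothetical, not part of any AMDGPU API:

```python
def classify_struct_arg(size_in_bytes: int, is_result: bool) -> str:
    """Classify a struct argument per the rules described above.

    - <= 16 bytes: decomposed into base type fields and passed directly
      ("direct struct").
    - > 16 bytes as an input: the caller copies it to the stack and passes
      the address ("by-value struct").
    - > 16 bytes as a result: the caller allocates a stack location and
      passes its address as the last input argument before the implicit
      arguments ("sret").
    """
    if size_in_bytes <= 16:
        return "direct"
    return "sret" if is_result else "byval"

assert classify_struct_arg(8, is_result=False) == "direct"
assert classify_struct_arg(32, is_result=False) == "byval"
assert classify_struct_arg(32, is_result=True) == "sret"
```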

*TODO: correct the ``sret`` definition.*

.. TODO::

   Is this definition correct? Or is ``sret`` only used if passing in
   registers, and the non-decomposed struct passed as a stack argument? Or
   something else? Is the memory location in the caller stack frame, or a stack
   memory argument and so no address is passed as the caller can directly write
   to the argument stack location? But then the stack location is still live
   after return. If an argument stack location is used, is it the first stack
   argument or the last one?

Lambda argument types are treated as struct types with an implementation
defined set of fields.

.. TODO::

   Need to specify the ABI for lambda types for AMDGPU.

For the AMDGPU backend, all source language arguments (including the decomposed
struct type arguments) are passed in VGPRs unless marked ``inreg``, in which
case they are passed in SGPRs.

The AMDGPU backend walks the function call graph from the leaves to determine
which implicit input arguments are used, propagating to each caller of the
function. The used implicit arguments are appended to the function arguments
after the source language arguments in the following order:

.. TODO::

   Are recursion or external functions supported?

1. Work-Item ID (1 VGPR)

   The X, Y and Z work-item IDs are packed into a single VGPR with the
   following layout. Only the fields actually used by the function are set.
   The other bits are undefined.

   The values come from the initial kernel execution state. See
   :ref:`amdgpu-amdhsa-initial-kernel-execution-state`.

   .. table:: Work-item implicit argument layout
      :name: amdgpu-amdhsa-workitem-implicit-argument-layout-table

      ======= ======= ==============
      Bits    Size    Field Name
      ======= ======= ==============
      9:0     10 bits X Work-Item ID
      19:10   10 bits Y Work-Item ID
      29:20   10 bits Z Work-Item ID
      31:30   2 bits  Unused
      ======= ======= ==============

2. Dispatch Ptr (2 SGPRs)

   The value comes from the initial kernel execution state. See
   :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.

3. Queue Ptr (2 SGPRs)

   The value comes from the initial kernel execution state. See
   :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.

4. Kernarg Segment Ptr (2 SGPRs)

   The value comes from the initial kernel execution state. See
   :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.

5. Dispatch Id (2 SGPRs)

   The value comes from the initial kernel execution state. See
   :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.

6. Work-Group ID X (1 SGPR)

   The value comes from the initial kernel execution state. See
   :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.

7. Work-Group ID Y (1 SGPR)

   The value comes from the initial kernel execution state. See
   :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.

8. Work-Group ID Z (1 SGPR)

   The value comes from the initial kernel execution state. See
   :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.

9. Implicit Argument Ptr (2 SGPRs)

   The value is computed by adding an offset to the Kernarg Segment Ptr to get
   the global address space pointer to the first kernarg implicit argument.

The input and result arguments are assigned in order in the following manner:

.. note::

   There are likely some errors and omissions in the following description that
   need correction.

   .. TODO::

      Check the Clang source code to decipher how function arguments and return
      results are handled. Also see the AMDGPU specific values used.

* VGPR arguments are assigned to consecutive VGPRs starting at VGPR0 up to
  VGPR31.

  If there are more arguments than will fit in these registers, the remaining
  arguments are allocated on the stack in order on naturally aligned
  addresses.

  .. TODO::

     How are overly aligned structures allocated on the stack?

* SGPR arguments are assigned to consecutive SGPRs starting at SGPR0 up to
  SGPR29.

  If there are more arguments than will fit in these registers, the remaining
  arguments are allocated on the stack in order on naturally aligned
  addresses.

Note that decomposed struct type arguments may have some fields passed in
registers and some in memory.

.. TODO::

   So, a struct which can pass some fields as decomposed register arguments
   will pass the rest as decomposed stack elements? But an argument that will
   not start in registers will not be decomposed and will be passed as a
   non-decomposed stack value?

The following is not part of the AMDGPU function calling convention but
describes how the AMDGPU implements function calls:

1. SGPR33 is used as a frame pointer (FP) if necessary. Like the SP it is an
   unswizzled scratch address. It is only needed if runtime sized ``alloca``
   are used, or for the reasons defined in ``SIFrameLowering``.
2. Runtime stack alignment is supported. SGPR34 is used as a base pointer (BP)
   to access the incoming stack arguments in the function. The BP is needed
   only when the function requires the runtime stack alignment.
3. Allocating SGPR arguments on the stack is not supported.
4. No CFI is currently generated. See
   :ref:`amdgpu-dwarf-call-frame-information`.

   .. note::

      CFI will be generated that defines the CFA as the unswizzled address
      relative to the wave scratch base in the unswizzled private address space
      of the lowest address stack allocated local variable.

      ``DW_AT_frame_base`` will be defined as the swizzled address in the
      swizzled private address space by dividing the CFA by the wavefront size
      (since the CFA is always at least dword aligned, which matches the
      scratch swizzle element size).

      If no dynamic stack alignment was performed, the stack allocated
      arguments are accessed as negative offsets relative to
      ``DW_AT_frame_base``, and the local variables and register spill slots
      are accessed as positive offsets relative to ``DW_AT_frame_base``.

5. Function argument passing is implemented by copying the input physical
   registers to virtual registers on entry. The register allocator can spill if
   necessary. These are copied back to physical registers at call sites. The
   net effect is that each function call can have these values in entirely
   distinct locations. The IPRA can help avoid shuffling argument registers.
6. Call sites are implemented by setting up the arguments at positive offsets
   from SP. Then SP is incremented to account for the known frame size before
   the call and decremented after the call.

   .. note::

      The CFI will reflect the changed calculation needed to compute the CFA
      from SP.

7. 4 byte spill slots are used in the stack frame. One slot is allocated for an
   emergency spill slot. Buffer instructions are used for stack accesses and
   not the ``flat_scratch`` instruction.

   .. TODO::

      Explain when the emergency spill slot is used.

.. TODO::

   Possible broken issues:

   - Stack arguments must be aligned to the required alignment.
   - Stack is aligned to max(16, max formal argument alignment).
   - Direct argument < 64 bits should check register budget.
   - Register budget calculation should respect ``inreg`` for SGPR.
   - SGPR overflow is not handled.
   - struct with 1 member unpeeling is not checking the size of the member.
   - ``sret`` is after the ``this`` pointer.
   - Caller is not implementing stack realignment: need an extra pointer.
   - Should say AMDGPU passes FP rather than SP.
   - Should CFI define CFA as the address of locals or arguments? The
     difference is apparent once dynamic alignment has been implemented.
   - If the ``SCRATCH`` instruction could allow negative offsets, then FP could
     be made the highest address of the stack frame and negative offsets used
     for locals. This would allow SP to be the same as FP and could support
     signal-handler-like usage as there would then be a real SP for the top of
     the stack.
   - How is ``sret`` passed on the stack? In the argument stack area? Can it
     overlay arguments?

AMDPAL
------

This section provides code conventions used when the target triple OS is
``amdpal`` (see :ref:`amdgpu-target-triples`).

.. _amdgpu-amdpal-code-object-metadata-section:

Code Object Metadata
~~~~~~~~~~~~~~~~~~~~

.. note::

   The metadata is currently in development and is subject to major
   changes. Only the current version is supported. *When this document
   was generated the version was 2.6.*

Code object metadata is specified by the ``NT_AMDGPU_METADATA`` note
record (see :ref:`amdgpu-note-records-v3-v4`).

The metadata is represented as Message Pack formatted binary data (see
[MsgPack]_).
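
As a concrete illustration of the Message Pack encoding, the bytes for a
hypothetical one-entry top-level map ``{"amdpal.version": [2, 6]}`` can be
assembled by hand (the ``(2, 6)`` version pair is illustrative; a real reader
would use a full MessagePack library):

```python
# Hand-assemble the MessagePack bytes for {"amdpal.version": [2, 6]}.
key = b"amdpal.version"
payload = (
    bytes([0x81])              # fixmap holding 1 key/value pair
    + bytes([0xA0 | len(key)]) # fixstr of length 14 (byte 0xae)
    + key
    + bytes([0x92])            # fixarray of 2 elements
    + bytes([0x02, 0x06])      # positive fixints 2 and 6
)

assert payload == b"\x81\xaeamdpal.version\x92\x02\x06"
```
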
The top level is a Message Pack map that includes the keys defined in table
:ref:`amdgpu-amdpal-code-object-metadata-map-table` and referenced tables.

Additional information can be added to the maps. To avoid conflicts, any
key names should be prefixed by "*vendor-name*." where ``vendor-name``
can be the name of the vendor and specific vendor tool that generates the
information. The prefix is abbreviated to simply "." when it appears
within a map that has been added by the same *vendor-name*.

  .. table:: AMDPAL Code Object Metadata Map
     :name: amdgpu-amdpal-code-object-metadata-map-table

     =================== ============== ========= ======================================================================
     String Key          Value Type     Required? Description
     =================== ============== ========= ======================================================================
     "amdpal.version"    sequence of    Required  PAL code object metadata (major, minor) version. The current values
                         2 integers               are defined by *Util::Abi::PipelineMetadata(Major|Minor)Version*.
     "amdpal.pipelines"  sequence of    Required  Per-pipeline metadata. See
                         map                      :ref:`amdgpu-amdpal-code-object-pipeline-metadata-map-table` for the
                                                  definition of the keys included in that map.
     =================== ============== ========= ======================================================================

..

  .. table:: AMDPAL Code Object Pipeline Metadata Map
     :name: amdgpu-amdpal-code-object-pipeline-metadata-map-table

     ====================================== ============== ========= ===================================================
     String Key                             Value Type     Required? Description
     ====================================== ============== ========= ===================================================
     ".name"                                string                   Source name of the pipeline.
10964 ".type" string Pipeline type, e.g. VsPs. Values include: 10965 10966 - "VsPs" 10967 - "Gs" 10968 - "Cs" 10969 - "Ngg" 10970 - "Tess" 10971 - "GsTess" 10972 - "NggTess" 10973 10974 ".internal_pipeline_hash" sequence of Required Internal compiler hash for this pipeline. Lower 10975 2 integers 64 bits is the "stable" portion of the hash, used 10976 for e.g. shader replacement lookup. Upper 64 bits 10977 is the "unique" portion of the hash, used for 10978 e.g. pipeline cache lookup. The value is 10979 implementation defined, and can not be relied on 10980 between different builds of the compiler. 10981 ".shaders" map Per-API shader metadata. See 10982 :ref:`amdgpu-amdpal-code-object-shader-map-table` 10983 for the definition of the keys included in that 10984 map. 10985 ".hardware_stages" map Per-hardware stage metadata. See 10986 :ref:`amdgpu-amdpal-code-object-hardware-stage-map-table` 10987 for the definition of the keys included in that 10988 map. 10989 ".shader_functions" map Per-shader function metadata. See 10990 :ref:`amdgpu-amdpal-code-object-shader-function-map-table` 10991 for the definition of the keys included in that 10992 map. 10993 ".registers" map Required Hardware register configuration. See 10994 :ref:`amdgpu-amdpal-code-object-register-map-table` 10995 for the definition of the keys included in that 10996 map. 10997 ".user_data_limit" integer Number of user data entries accessed by this 10998 pipeline. 10999 ".spill_threshold" integer The user data spill threshold. 0xFFFF for 11000 NoUserDataSpilling. 11001 ".uses_viewport_array_index" boolean Indicates whether or not the pipeline uses the 11002 viewport array index feature. Pipelines which use 11003 this feature can render into all 16 viewports, 11004 whereas pipelines which do not use it are 11005 restricted to viewport #0. 11006 ".es_gs_lds_size" integer Size in bytes of LDS space used internally for 11007 handling data-passing between the ES and GS 11008 shader stages. 
This can be zero if the data is 11009 passed using off-chip buffers. This value should 11010 be used to program all user-SGPRs which have been 11011 marked with "UserDataMapping::EsGsLdsSize" 11012 (typically only the GS and VS HW stages will ever 11013 have a user-SGPR so marked). 11014 ".nggSubgroupSize" integer Explicit maximum subgroup size for NGG shaders 11015 (maximum number of threads in a subgroup). 11016 ".num_interpolants" integer Graphics only. Number of PS interpolants. 11017 ".mesh_scratch_memory_size" integer Max mesh shader scratch memory used. 11018 ".api" string Name of the client graphics API. 11019 ".api_create_info" binary Graphics API shader create info binary blob. Can 11020 be defined by the driver using the compiler if 11021 they want to be able to correlate API-specific 11022 information used during creation at a later time. 11023 ====================================== ============== ========= =================================================== 11024 11025.. 11026 11027 .. table:: AMDPAL Code Object Shader Map 11028 :name: amdgpu-amdpal-code-object-shader-map-table 11029 11030 11031 +-------------+--------------+-------------------------------------------------------------------+ 11032 |String Key |Value Type |Description | 11033 +=============+==============+===================================================================+ 11034 |- ".compute" |map |See :ref:`amdgpu-amdpal-code-object-api-shader-metadata-map-table` | 11035 |- ".vertex" | |for the definition of the keys included in that map. | 11036 |- ".hull" | | | 11037 |- ".domain" | | | 11038 |- ".geometry"| | | 11039 |- ".pixel" | | | 11040 +-------------+--------------+-------------------------------------------------------------------+ 11041 11042.. 11043 11044 .. 
table:: AMDPAL Code Object API Shader Metadata Map 11045 :name: amdgpu-amdpal-code-object-api-shader-metadata-map-table 11046 11047 ==================== ============== ========= ===================================================================== 11048 String Key Value Type Required? Description 11049 ==================== ============== ========= ===================================================================== 11050 ".api_shader_hash" sequence of Required Input shader hash, typically passed in from the client. The value 11051 2 integers is implementation defined, and can not be relied on between 11052 different builds of the compiler. 11053 ".hardware_mapping" sequence of Required Flags indicating the HW stages this API shader maps to. Values 11054 string include: 11055 11056 - ".ls" 11057 - ".hs" 11058 - ".es" 11059 - ".gs" 11060 - ".vs" 11061 - ".ps" 11062 - ".cs" 11063 11064 ==================== ============== ========= ===================================================================== 11065 11066.. 11067 11068 .. table:: AMDPAL Code Object Hardware Stage Map 11069 :name: amdgpu-amdpal-code-object-hardware-stage-map-table 11070 11071 +-------------+--------------+-----------------------------------------------------------------------+ 11072 |String Key |Value Type |Description | 11073 +=============+==============+=======================================================================+ 11074 |- ".ls" |map |See :ref:`amdgpu-amdpal-code-object-hardware-stage-metadata-map-table` | 11075 |- ".hs" | |for the definition of the keys included in that map. | 11076 |- ".es" | | | 11077 |- ".gs" | | | 11078 |- ".vs" | | | 11079 |- ".ps" | | | 11080 |- ".cs" | | | 11081 +-------------+--------------+-----------------------------------------------------------------------+ 11082 11083.. 11084 11085 .. 
table:: AMDPAL Code Object Hardware Stage Metadata Map 11086 :name: amdgpu-amdpal-code-object-hardware-stage-metadata-map-table 11087 11088 ========================== ============== ========= =============================================================== 11089 String Key Value Type Required? Description 11090 ========================== ============== ========= =============================================================== 11091 ".entry_point" string The ELF symbol pointing to this pipeline's stage entry point. 11092 ".scratch_memory_size" integer Scratch memory size in bytes. 11093 ".lds_size" integer Local Data Share size in bytes. 11094 ".perf_data_buffer_size" integer Performance data buffer size in bytes. 11095 ".vgpr_count" integer Number of VGPRs used. 11096 ".sgpr_count" integer Number of SGPRs used. 11097 ".vgpr_limit" integer If non-zero, indicates the shader was compiled with a 11098 directive to instruct the compiler to limit the VGPR usage to 11099 be less than or equal to the specified value (only set if 11100 different from HW default). 11101 ".sgpr_limit" integer SGPR count upper limit (only set if different from HW 11102 default). 11103 ".threadgroup_dimensions" sequence of Thread-group X/Y/Z dimensions (Compute only). 11104 3 integers 11105 ".wavefront_size" integer Wavefront size (only set if different from HW default). 11106 ".uses_uavs" boolean The shader reads or writes UAVs. 11107 ".uses_rovs" boolean The shader reads or writes ROVs. 11108 ".writes_uavs" boolean The shader writes to one or more UAVs. 11109 ".writes_depth" boolean The shader writes out a depth value. 11110 ".uses_append_consume" boolean The shader uses append and/or consume operations, either 11111 memory or GDS. 11112 ".uses_prim_id" boolean The shader uses PrimID. 11113 ========================== ============== ========= =============================================================== 11114 11115.. 11116 11117 .. 
table:: AMDPAL Code Object Shader Function Map 11118 :name: amdgpu-amdpal-code-object-shader-function-map-table 11119 11120 =============== ============== ==================================================================== 11121 String Key Value Type Description 11122 =============== ============== ==================================================================== 11123 *symbol name* map *symbol name* is the ELF symbol name of the shader function code 11124 entry address. The value is the function's metadata. See 11125 :ref:`amdgpu-amdpal-code-object-shader-function-metadata-map-table`. 11126 =============== ============== ==================================================================== 11127 11128.. 11129 11130 .. table:: AMDPAL Code Object Shader Function Metadata Map 11131 :name: amdgpu-amdpal-code-object-shader-function-metadata-map-table 11132 11133 ============================= ============== ================================================================= 11134 String Key Value Type Description 11135 ============================= ============== ================================================================= 11136 ".api_shader_hash" sequence of Input shader hash, typically passed in from the client. The value 11137 2 integers is implementation defined, and can not be relied on between 11138 different builds of the compiler. 11139 ".scratch_memory_size" sequence of Size in bytes of scratch memory used by the shader. 11140 2 integers 11141 ".lds_size" sequence of Size in bytes of LDS memory. 11142 2 integers 11143 ".vgpr_count" integer Number of VGPRs used by the shader. 11144 ".sgpr_count" integer Number of SGPRs used by the shader. 11145 ".stack_frame_size_in_bytes" integer Amount of stack size used by the shader. 11146 ".shader_subtype" string Shader subtype/kind. Values include: 11147 11148 - "Unknown" 11149 11150 ============================= ============== ================================================================= 11151 11152.. 
  .. table:: AMDPAL Code Object Register Map
     :name: amdgpu-amdpal-code-object-register-map-table

     ========================== ============== ====================================================================
     32-bit Integer Key         Value Type     Description
     ========================== ============== ====================================================================
     ``reg offset``             32-bit integer ``reg offset`` is the dword offset into the GFXIP register space of
                                               a GRBM register (i.e., driver accessible GPU register number, not
                                               shader GPR register number). The driver is required to program each
                                               specified register to the corresponding specified value when
                                               executing this pipeline. Typically, the ``reg offsets`` are the
                                               ``uint16_t`` offsets to each register as defined by the hardware
                                               chip headers. The register is set to the provided value. However, a
                                               ``reg offset`` that specifies a user data register (e.g.,
                                               COMPUTE_USER_DATA_0) needs special treatment. See the
                                               :ref:`amdgpu-amdpal-code-object-user-data-section` section for more
                                               information.
     ========================== ============== ====================================================================

.. _amdgpu-amdpal-code-object-user-data-section:

User Data
+++++++++

Each hardware stage has a set of 32-bit physical SPI *user data registers*
(either 16 or 32 based on graphics IP and the stage) which can be
written from a command buffer and then loaded into SGPRs when waves are
launched via a subsequent dispatch or draw operation. This is the way
most arguments are passed from the application/runtime to a hardware
shader.

PAL abstracts this functionality by exposing a set of 128 *user data
entries* per pipeline which a client can use to pass arguments from a
command buffer to one or more shaders in that pipeline. The ELF code object
must specify a mapping from virtualized *user data entries* to physical
*user data registers*, and PAL is responsible for implementing that mapping,
including spilling overflow *user data entries* to memory if needed.

Since the *user data registers* are GRBM-accessible SPI registers, this
mapping is actually embedded in the ``.registers`` metadata entry. For
most registers, the value in that map is a literal 32-bit value that
should be written to the register by the driver. However, when the
register is a *user data register* (any USER_DATA register, e.g.,
SPI_SHADER_USER_DATA_PS_5), the value is instead an encoding that tells
the driver to write either a *user data entry* value or one of several
driver-internal values to the register. This encoding is described in
the following table:

.. note::

   Currently, *user data registers* 0 and 1 (e.g., SPI_SHADER_USER_DATA_PS_0
   and SPI_SHADER_USER_DATA_PS_1) are reserved. *User data register* 0 must
   always be programmed to the address of the GlobalTable, and *user data
   register* 1 must always be programmed to the address of the PerShaderTable.

..

  .. table:: AMDPAL User Data Mapping
     :name: amdgpu-amdpal-code-object-metadata-user-data-mapping-table

     ========== ================= ===============================================================================
     Value      Name              Description
     ========== ================= ===============================================================================
     0..127     *User Data Entry* 32-bit value of user_data_entry[N] as specified via *CmdSetUserData()*
     0x10000000 GlobalTable       32-bit pointer to GPU memory containing the global internal table (should
                                  always point to *user data register* 0).
     0x10000001 PerShaderTable    32-bit pointer to GPU memory containing the per-shader internal table. See
                                  :ref:`amdgpu-amdpal-code-object-metadata-user-data-per-shader-table-section`
                                  for more detail (should always point to *user data register* 1).
     0x10000002 SpillTable        32-bit pointer to GPU memory containing the user data spill table. See
                                  :ref:`amdgpu-amdpal-code-object-metadata-user-data-spill-table-section` for
                                  more detail.
     0x10000003 BaseVertex        Vertex offset (32-bit unsigned integer). Not needed if the pipeline doesn't
                                  reference the draw index in the vertex shader. Only supported by the first
                                  stage in a graphics pipeline.
     0x10000004 BaseInstance      Instance offset (32-bit unsigned integer). Only supported by the first stage
                                  in a graphics pipeline.
     0x10000005 DrawIndex         Draw index (32-bit unsigned integer). Only supported by the first stage in a
                                  graphics pipeline.
     0x10000006 Workgroup         Thread group count (32-bit unsigned integer). Low half of a 64-bit address of
                                  a buffer containing the grid dimensions for a Compute dispatch operation. The
                                  high half of the address is stored in the next sequential user-SGPR. Only
                                  supported by compute pipelines.
     0x1000000A EsGsLdsSize       Indicates that PAL will program this user-SGPR to contain the amount of LDS
                                  space used for the ES/GS pseudo-ring-buffer for passing data between shader
                                  stages.
     0x1000000B ViewId            View id (32-bit unsigned integer) identifies a view of graphic
                                  pipeline instancing.
     0x1000000C StreamOutTable    32-bit pointer to GPU memory containing the stream out target SRD table. This
                                  can only appear for one shader stage per pipeline.
     0x1000000D PerShaderPerfData 32-bit pointer to GPU memory containing the per-shader performance data
                                  buffer.
     0x1000000F VertexBufferTable 32-bit pointer to GPU memory containing the vertex buffer SRD table. This can
                                  only appear for one shader stage per pipeline.
     0x10000010 UavExportTable    32-bit pointer to GPU memory containing the UAV export SRD table. This can
                                  only appear for one shader stage per pipeline (PS). These replace color
                                  targets and are completely separate from any UAVs used by the shader. This is
                                  optional, and only used by the PS when UAV exports are used to replace
                                  color-target exports to optimize specific shaders.
     0x10000011 NggCullingData    64-bit pointer to GPU memory containing the hardware register data needed by
                                  some NGG pipelines to perform culling. This value contains the address of the
                                  first of two consecutive registers which provide the full GPU address.
     0x10000015 FetchShaderPtr    64-bit pointer to GPU memory containing the fetch shader subroutine.
     ========== ================= ===============================================================================

.. _amdgpu-amdpal-code-object-metadata-user-data-per-shader-table-section:

Per-Shader Table
################

Low 32 bits of the GPU address for an optional buffer in the ``.data``
section of the ELF. The high 32 bits of the address match the high 32 bits
of the shader's program counter.

The buffer can be anything the shader compiler needs it for, and
allows each shader to have its own region of the ``.data`` section.
Typically, this could be a table of buffer SRDs and the data pointed to
by the buffer SRDs, but it could be a flat-address region of memory as
well. Its layout and usage are defined by the shader compiler.

Each shader's table in the ``.data`` section is referenced by the symbol
``_amdgpu_``\ *xs*\ ``_shdr_intrl_data`` where *xs* corresponds with the
hardware shader stage the data is for. E.g.,
``_amdgpu_cs_shdr_intrl_data`` for the compute shader hardware stage.

.. _amdgpu-amdpal-code-object-metadata-user-data-spill-table-section:

Spill Table
###########

It is possible for a hardware shader to need access to more *user data
entries* than there are slots available in user data registers for one
or more hardware shader stages. In that case, the PAL runtime expects
the necessary *user data entries* to be spilled to GPU memory, and uses
one user data register to point to the spilled user data memory. The
value of the *user data entry* must then represent the location where
a shader expects to read the low 32 bits of the table's GPU virtual
address. The *spill table* itself represents a set of 32-bit values
managed by the PAL runtime in GPU-accessible memory that can be made
indirectly accessible to a hardware shader.

Unspecified OS
--------------

This section provides code conventions used when the target triple OS is
empty (see :ref:`amdgpu-target-triples`).

Trap Handler ABI
~~~~~~~~~~~~~~~~

For code objects generated by the AMDGPU backend for a non-amdhsa OS, the
runtime does not install a trap handler. The ``llvm.trap`` and
``llvm.debugtrap`` instructions are handled as follows:

  .. table:: AMDGPU Trap Handler for Non-AMDHSA OS
     :name: amdgpu-trap-handler-for-non-amdhsa-os-table

     =============== =============== ===========================================
     Usage           Code Sequence   Description
     =============== =============== ===========================================
     llvm.trap       s_endpgm        Causes wavefront to be terminated.
     llvm.debugtrap  *none*          Compiler warning given that there is no
                                     trap handler installed.
     =============== =============== ===========================================

Source Languages
================

.. _amdgpu-opencl:

OpenCL
------

When the language is OpenCL the following differences occur:

1. The OpenCL memory model is used (see :ref:`amdgpu-amdhsa-memory-model`).
2. The AMDGPU backend appends additional arguments to the kernel's explicit
   arguments for the AMDHSA OS (see
   :ref:`opencl-kernel-implicit-arguments-appended-for-amdhsa-os-table`).
3. Additional metadata is generated
   (see :ref:`amdgpu-amdhsa-code-object-metadata`).

  .. table:: OpenCL kernel implicit arguments appended for AMDHSA OS
     :name: opencl-kernel-implicit-arguments-appended-for-amdhsa-os-table

     ======== ==== ========= ===========================================
     Position Byte Byte      Description
              Size Alignment
     ======== ==== ========= ===========================================
     1        8    8         OpenCL Global Offset X
     2        8    8         OpenCL Global Offset Y
     3        8    8         OpenCL Global Offset Z
     4        8    8         OpenCL address of printf buffer
     5        8    8         OpenCL address of virtual queue used by
                             enqueue_kernel.
     6        8    8         OpenCL address of AqlWrap struct used by
                             enqueue_kernel.
     7        8    8         Pointer argument used for Multi-grid
                             synchronization.
     ======== ==== ========= ===========================================

.. _amdgpu-hcc:

HCC
---

When the language is HCC the following differences occur:

1. The HSA memory model is used (see :ref:`amdgpu-amdhsa-memory-model`).

.. _amdgpu-assembler:

Assembler
---------

The AMDGPU backend has an LLVM-MC based assembler which is currently in
development. It supports AMDGCN GFX6-GFX10.

This section describes the general syntax for instructions and operands.

Instructions
~~~~~~~~~~~~

An instruction has the following :doc:`syntax<AMDGPUInstructionSyntax>`:

  | ``<``\ *opcode*\ ``> <``\ *operand0*\ ``>, <``\ *operand1*\ ``>,...
    ``<``\ *modifier0*\ ``> <``\ *modifier1*\ ``>...``

:doc:`Operands<AMDGPUOperandSyntax>` are comma-separated while
:doc:`modifiers<AMDGPUModifierSyntax>` are space-separated.

The order of operands and modifiers is fixed.
Most modifiers are optional and may be omitted.

Links to detailed instruction syntax descriptions may be found in the
following table. Note that features under development are not included
in this description.

  =================================== =======================================
  Core ISA                            ISA Extensions
  =================================== =======================================
  :doc:`GFX7<AMDGPU/AMDGPUAsmGFX7>`   \-
  :doc:`GFX8<AMDGPU/AMDGPUAsmGFX8>`   \-
  :doc:`GFX9<AMDGPU/AMDGPUAsmGFX9>`   :doc:`gfx900<AMDGPU/AMDGPUAsmGFX900>`

                                      :doc:`gfx902<AMDGPU/AMDGPUAsmGFX900>`

                                      :doc:`gfx904<AMDGPU/AMDGPUAsmGFX904>`

                                      :doc:`gfx906<AMDGPU/AMDGPUAsmGFX906>`

                                      :doc:`gfx908<AMDGPU/AMDGPUAsmGFX908>`

                                      :doc:`gfx909<AMDGPU/AMDGPUAsmGFX900>`

                                      :doc:`gfx90a<AMDGPU/AMDGPUAsmGFX90a>`

  :doc:`GFX10<AMDGPU/AMDGPUAsmGFX10>` :doc:`gfx1011<AMDGPU/AMDGPUAsmGFX1011>`

                                      :doc:`gfx1012<AMDGPU/AMDGPUAsmGFX1011>`
  =================================== =======================================

For more information about instructions, their semantics and supported
combinations of operands, refer to one of the instruction set architecture
manuals [AMD-GCN-GFX6]_, [AMD-GCN-GFX7]_, [AMD-GCN-GFX8]_, [AMD-GCN-GFX9]_,
[AMD-GCN-GFX908-CDNA1]_, [AMD-GCN-GFX10-RDNA1]_ and [AMD-GCN-GFX10-RDNA2]_.

Operands
~~~~~~~~

Detailed description of operands may be found :doc:`here<AMDGPUOperandSyntax>`.

Modifiers
~~~~~~~~~

Detailed description of modifiers may be found
:doc:`here<AMDGPUModifierSyntax>`.
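
As an illustration of this layout, the following DS instruction (register
numbers chosen arbitrarily) has an opcode, two comma-separated operands, and
one space-separated modifier:

.. code-block:: nasm

  ; <opcode>   <operand0>, <operand1> <modifier0>
  ds_add_u32   v2,         v4         offset:16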
Instruction Examples
~~~~~~~~~~~~~~~~~~~~

DS
++

.. code-block:: nasm

  ds_add_u32 v2, v4 offset:16
  ds_write_src2_b64 v2 offset0:4 offset1:8
  ds_cmpst_f32 v2, v4, v6
  ds_min_rtn_f64 v[8:9], v2, v[4:5]

For a full list of supported instructions, refer to "LDS/GDS instructions" in
the ISA Manual.

FLAT
++++

.. code-block:: nasm

  flat_load_dword v1, v[3:4]
  flat_store_dwordx3 v[3:4], v[5:7]
  flat_atomic_swap v1, v[3:4], v5 glc
  flat_atomic_cmpswap v1, v[3:4], v[5:6] glc slc
  flat_atomic_fmax_x2 v[1:2], v[3:4], v[5:6] glc

For a full list of supported instructions, refer to "FLAT instructions" in the
ISA Manual.

MUBUF
+++++

.. code-block:: nasm

  buffer_load_dword v1, off, s[4:7], s1
  buffer_store_dwordx4 v[1:4], v2, ttmp[4:7], s1 offen offset:4 glc tfe
  buffer_store_format_xy v[1:2], off, s[4:7], s1
  buffer_wbinvl1
  buffer_atomic_inc v1, v2, s[8:11], s4 idxen offset:4 slc

For a full list of supported instructions, refer to "MUBUF Instructions" in the
ISA Manual.

SMRD/SMEM
+++++++++

.. code-block:: nasm

  s_load_dword s1, s[2:3], 0xfc
  s_load_dwordx8 s[8:15], s[2:3], s4
  s_load_dwordx16 s[88:103], s[2:3], s4
  s_dcache_inv_vol
  s_memtime s[4:5]

For a full list of supported instructions, refer to "Scalar Memory Operations"
in the ISA Manual.

SOP1
++++

.. code-block:: nasm

  s_mov_b32 s1, s2
  s_mov_b64 s[0:1], 0x80000000
  s_cmov_b32 s1, 200
  s_wqm_b64 s[2:3], s[4:5]
  s_bcnt0_i32_b64 s1, s[2:3]
  s_swappc_b64 s[2:3], s[4:5]
  s_cbranch_join s[4:5]

For a full list of supported instructions, refer to "SOP1 Instructions" in the
ISA Manual.

SOP2
++++

.. code-block:: nasm

  s_add_u32 s1, s2, s3
  s_and_b64 s[2:3], s[4:5], s[6:7]
  s_cselect_b32 s1, s2, s3
  s_andn2_b32 s2, s4, s6
  s_lshr_b64 s[2:3], s[4:5], s6
  s_ashr_i32 s2, s4, s6
  s_bfm_b64 s[2:3], s4, s6
  s_bfe_i64 s[2:3], s[4:5], s6
  s_cbranch_g_fork s[4:5], s[6:7]

For a full list of supported instructions, refer to "SOP2 Instructions" in the
ISA Manual.

SOPC
++++

.. code-block:: nasm

  s_cmp_eq_i32 s1, s2
  s_bitcmp1_b32 s1, s2
  s_bitcmp0_b64 s[2:3], s4
  s_setvskip s3, s5

For a full list of supported instructions, refer to "SOPC Instructions" in the
ISA Manual.

SOPP
++++

.. code-block:: nasm

  s_barrier
  s_nop 2
  s_endpgm
  s_waitcnt 0 ; Wait for all counters to be 0
  s_waitcnt vmcnt(0) & expcnt(0) & lgkmcnt(0) ; Equivalent to above
  s_waitcnt vmcnt(1) ; Wait for vmcnt counter to be 1.
  s_sethalt 9
  s_sleep 10
  s_sendmsg 0x1
  s_sendmsg sendmsg(MSG_INTERRUPT)
  s_trap 1

For a full list of supported instructions, refer to "SOPP Instructions" in the
ISA Manual.

Unless otherwise mentioned, little verification is performed on the operands
of SOPP Instructions, so it is up to the programmer to be familiar with the
range of acceptable values.

VALU
++++

For vector ALU instruction opcodes (VOP1, VOP2, VOP3, VOPC, VOP_DPP, VOP_SDWA),
the assembler will automatically use the optimal encoding based on its
operands. To force a specific encoding, one can add a suffix to the opcode of
the instruction:

* _e32 for 32-bit VOP1/VOP2/VOPC
* _e64 for 64-bit VOP3
* _dpp for VOP_DPP
* _sdwa for VOP_SDWA

VOP1/VOP2/VOP3/VOPC examples:

.. code-block:: nasm

  v_mov_b32 v1, v2
  v_mov_b32_e32 v1, v2
  v_nop
  v_cvt_f64_i32_e32 v[1:2], v2
  v_floor_f32_e32 v1, v2
  v_bfrev_b32_e32 v1, v2
  v_add_f32_e32 v1, v2, v3
  v_mul_i32_i24_e64 v1, v2, 3
  v_mul_i32_i24_e32 v1, -3, v3
  v_mul_i32_i24_e32 v1, -100, v3
  v_addc_u32 v1, s[0:1], v2, v3, s[2:3]
  v_max_f16_e32 v1, v2, v3

VOP_DPP examples:

.. code-block:: nasm

  v_mov_b32 v0, v0 quad_perm:[0,2,1,1]
  v_sin_f32 v0, v0 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
  v_mov_b32 v0, v0 wave_shl:1
  v_mov_b32 v0, v0 row_mirror
  v_mov_b32 v0, v0 row_bcast:31
  v_mov_b32 v0, v0 quad_perm:[1,3,0,1] row_mask:0xa bank_mask:0x1 bound_ctrl:0
  v_add_f32 v0, v0, |v0| row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
  v_max_f16 v1, v2, v3 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0

VOP_SDWA examples:

.. code-block:: nasm

  v_mov_b32 v1, v2 dst_sel:BYTE_0 dst_unused:UNUSED_PRESERVE src0_sel:DWORD
  v_min_u32 v200, v200, v1 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_1 src1_sel:DWORD
  v_sin_f32 v0, v0 dst_unused:UNUSED_PAD src0_sel:WORD_1
  v_fract_f32 v0, |v0| dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1
  v_cmpx_le_u32 vcc, v1, v2 src0_sel:BYTE_2 src1_sel:WORD_0

For a full list of supported instructions, refer to "Vector ALU instructions".

.. _amdgpu-amdhsa-assembler-predefined-symbols-v2:

Code Object V2 Predefined Symbols
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. warning::

  Code object V2 is not the default code object version emitted by
  this version of LLVM.

The AMDGPU assembler defines and updates some symbols automatically. These
symbols do not affect code generation.
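
For example, since these symbols have absolute values, they can be referenced
in assembler expressions; a sketch using the standard ``.if``/``.endif``
conditional-assembly directives (the guarded instruction is only a
placeholder):

.. code-block:: nasm

  ; Assemble the body only when targeting GFX9 or later.
  .if .option.machine_version_major >= 9
    s_nop 0 ; placeholder for GFX9-specific code
  .endif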
11625 11626.option.machine_version_major 11627+++++++++++++++++++++++++++++ 11628 11629Set to the GFX major generation number of the target being assembled for. For 11630example, when assembling for a "GFX9" target this will be set to the integer 11631value "9". The possible GFX major generation numbers are presented in 11632:ref:`amdgpu-processors`. 11633 11634.option.machine_version_minor 11635+++++++++++++++++++++++++++++ 11636 11637Set to the GFX minor generation number of the target being assembled for. For 11638example, when assembling for a "GFX810" target this will be set to the integer 11639value "1". The possible GFX minor generation numbers are presented in 11640:ref:`amdgpu-processors`. 11641 11642.option.machine_version_stepping 11643++++++++++++++++++++++++++++++++ 11644 11645Set to the GFX stepping generation number of the target being assembled for. 11646For example, when assembling for a "GFX704" target this will be set to the 11647integer value "4". The possible GFX stepping generation numbers are presented 11648in :ref:`amdgpu-processors`. 11649 11650.kernel.vgpr_count 11651++++++++++++++++++ 11652 11653Set to zero each time a 11654:ref:`amdgpu-amdhsa-assembler-directive-amdgpu_hsa_kernel` directive is 11655encountered. At each instruction, if the current value of this symbol is less 11656than or equal to the maximum VGPR number explicitly referenced within that 11657instruction then the symbol value is updated to equal that VGPR number plus 11658one. 11659 11660.kernel.sgpr_count 11661++++++++++++++++++ 11662 11663Set to zero each time a 11664:ref:`amdgpu-amdhsa-assembler-directive-amdgpu_hsa_kernel` directive is 11665encountered. At each instruction, if the current value of this symbol is less 11666than or equal to the maximum VGPR number explicitly referenced within that 11667instruction then the symbol value is updated to equal that SGPR number plus 11668one. 11669 11670.. 
_amdgpu-amdhsa-assembler-directives-v2: 11671 11672Code Object V2 Directives 11673~~~~~~~~~~~~~~~~~~~~~~~~~ 11674 11675.. warning:: 11676 Code object V2 is not the default code object version emitted by 11677 this version of LLVM. 11678 11679AMDGPU ABI defines auxiliary data in output code object. In assembly source, 11680one can specify them with assembler directives. 11681 11682.hsa_code_object_version major, minor 11683+++++++++++++++++++++++++++++++++++++ 11684 11685*major* and *minor* are integers that specify the version of the HSA code 11686object that will be generated by the assembler. 11687 11688.hsa_code_object_isa [major, minor, stepping, vendor, arch] 11689+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 11690 11691 11692*major*, *minor*, and *stepping* are all integers that describe the instruction 11693set architecture (ISA) version of the assembly program. 11694 11695*vendor* and *arch* are quoted strings. *vendor* should always be equal to 11696"AMD" and *arch* should always be equal to "AMDGPU". 11697 11698By default, the assembler will derive the ISA version, *vendor*, and *arch* 11699from the value of the -mcpu option that is passed to the assembler. 11700 11701.. _amdgpu-amdhsa-assembler-directive-amdgpu_hsa_kernel: 11702 11703.amdgpu_hsa_kernel (name) 11704+++++++++++++++++++++++++ 11705 11706This directives specifies that the symbol with given name is a kernel entry 11707point (label) and the object should contain corresponding symbol of type 11708STT_AMDGPU_HSA_KERNEL. 11709 11710.amd_kernel_code_t 11711++++++++++++++++++ 11712 11713This directive marks the beginning of a list of key / value pairs that are used 11714to specify the amd_kernel_code_t object that will be emitted by the assembler. 11715The list must be terminated by the *.end_amd_kernel_code_t* directive. For any 11716amd_kernel_code_t values that are unspecified a default value will be used. 
The 11717default value for all keys is 0, with the following exceptions: 11718 11719- *amd_code_version_major* defaults to 1. 11720- *amd_kernel_code_version_minor* defaults to 2. 11721- *amd_machine_kind* defaults to 1. 11722- *amd_machine_version_major*, *machine_version_minor*, and 11723 *amd_machine_version_stepping* are derived from the value of the -mcpu option 11724 that is passed to the assembler. 11725- *kernel_code_entry_byte_offset* defaults to 256. 11726- *wavefront_size* defaults 6 for all targets before GFX10. For GFX10 onwards 11727 defaults to 6 if target feature ``wavefrontsize64`` is enabled, otherwise 5. 11728 Note that wavefront size is specified as a power of two, so a value of **n** 11729 means a size of 2^ **n**. 11730- *call_convention* defaults to -1. 11731- *kernarg_segment_alignment*, *group_segment_alignment*, and 11732 *private_segment_alignment* default to 4. Note that alignments are specified 11733 as a power of 2, so a value of **n** means an alignment of 2^ **n**. 11734- *enable_tg_split* defaults to 1 if target feature ``tgsplit`` is enabled for 11735 GFX90A onwards. 11736- *enable_wgp_mode* defaults to 1 if target feature ``cumode`` is disabled for 11737 GFX10 onwards. 11738- *enable_mem_ordered* defaults to 1 for GFX10 onwards. 11739 11740The *.amd_kernel_code_t* directive must be placed immediately after the 11741function label and before any instructions. 11742 11743For a full list of amd_kernel_code_t keys, refer to AMDGPU ABI document, 11744comments in lib/Target/AMDGPU/AmdKernelCodeT.h and test/CodeGen/AMDGPU/hsa.s. 11745 11746.. _amdgpu-amdhsa-assembler-example-v2: 11747 11748Code Object V2 Example Source Code 11749~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 11750 11751.. warning:: 11752 Code Object V2 is not the default code object version emitted by 11753 this version of LLVM. 11754 11755Here is an example of a minimal assembly source file, defining one HSA kernel: 11756 11757.. 
.. code::
   :number-lines:

   .hsa_code_object_version 1,0
   .hsa_code_object_isa

   .hsatext
   .globl  hello_world
   .p2align 8
   .amdgpu_hsa_kernel hello_world

   hello_world:

      .amd_kernel_code_t
         enable_sgpr_kernarg_segment_ptr = 1
         is_ptr64 = 1
         compute_pgm_rsrc1_vgprs = 0
         compute_pgm_rsrc1_sgprs = 0
         compute_pgm_rsrc2_user_sgpr = 2
         compute_pgm_rsrc1_wgp_mode = 0
         compute_pgm_rsrc1_mem_ordered = 0
         compute_pgm_rsrc1_fwd_progress = 1
      .end_amd_kernel_code_t

      s_load_dwordx2 s[0:1], s[0:1] 0x0
      v_mov_b32 v0, 3.14159
      s_waitcnt lgkmcnt(0)
      v_mov_b32 v1, s0
      v_mov_b32 v2, s1
      flat_store_dword v[1:2], v0
      s_endpgm
   .Lfunc_end0:
      .size   hello_world, .Lfunc_end0-hello_world

.. _amdgpu-amdhsa-assembler-predefined-symbols-v3-v4:

Code Object V3 to V4 Predefined Symbols
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The AMDGPU assembler defines and updates some symbols automatically. These
symbols do not affect code generation.

.amdgcn.gfx_generation_number
+++++++++++++++++++++++++++++

Set to the GFX major generation number of the target being assembled for. For
example, when assembling for a "GFX9" target this will be set to the integer
value "9". The possible GFX major generation numbers are presented in
:ref:`amdgpu-processors`.

.amdgcn.gfx_generation_minor
++++++++++++++++++++++++++++

Set to the GFX minor generation number of the target being assembled for. For
example, when assembling for a "GFX810" target this will be set to the integer
value "1". The possible GFX minor generation numbers are presented in
:ref:`amdgpu-processors`.

.amdgcn.gfx_generation_stepping
+++++++++++++++++++++++++++++++

Set to the GFX stepping generation number of the target being assembled for.
For example, when assembling for a "GFX704" target this will be set to the
integer value "4". The possible GFX stepping generation numbers are presented
in :ref:`amdgpu-processors`.

.. _amdgpu-amdhsa-assembler-symbol-next_free_vgpr:

.amdgcn.next_free_vgpr
++++++++++++++++++++++

Set to zero before assembly begins. At each instruction, if the current value
of this symbol is less than or equal to the maximum VGPR number explicitly
referenced within that instruction, then the symbol value is updated to equal
that VGPR number plus one.

May be used to set the `.amdhsa_next_free_vgpr` directive in
:ref:`amdhsa-kernel-directives-table`.

May be set at any time, e.g. manually set to zero at the start of each kernel.

.. _amdgpu-amdhsa-assembler-symbol-next_free_sgpr:

.amdgcn.next_free_sgpr
++++++++++++++++++++++

Set to zero before assembly begins. At each instruction, if the current value
of this symbol is less than or equal to the maximum SGPR number explicitly
referenced within that instruction, then the symbol value is updated to equal
that SGPR number plus one.

May be used to set the `.amdhsa_next_free_sgpr` directive in
:ref:`amdhsa-kernel-directives-table`.

May be set at any time, e.g. manually set to zero at the start of each kernel.

.. _amdgpu-amdhsa-assembler-directives-v3-v4:

Code Object V3 to V4 Directives
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Directives which begin with ``.amdgcn`` are valid for all ``amdgcn``
architecture processors, and are not OS-specific. Directives which begin with
``.amdhsa`` are specific to ``amdgcn`` architecture processors when the
``amdhsa`` OS is specified. See :ref:`amdgpu-target-triples` and
:ref:`amdgpu-processors`.
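The update rule used by the ``.amdgcn.next_free_vgpr`` and
``.amdgcn.next_free_sgpr`` symbols described above can be modeled with a short
sketch. This is an illustrative model only; the function name and the input
shape are invented for the example and are not part of the assembler:

```python
# Illustrative model of the assembler's register-tracking rule: the symbol
# starts at zero, and after each instruction it becomes (max register
# number + 1) whenever an explicitly referenced register number is greater
# than or equal to the current symbol value.
def track_next_free(instructions_regs):
    """instructions_regs: one list of explicitly referenced register
    numbers per instruction, in assembly order."""
    next_free = 0  # set to zero before assembly begins
    for regs in instructions_regs:
        for r in regs:
            if next_free <= r:
                next_free = r + 1
    return next_free

# e.g. v_mov_b32 v2, ... references v2; flat_store_dword v[1:2], v0
# references v0, v1 and v2 -> the highest VGPR number seen is 2,
# so the symbol ends at 3.
print(track_next_free([[2], [1, 2, 0]]))  # -> 3
```

A kernel can then pass the resulting value to ``.amdhsa_next_free_vgpr``, or
reset the symbol to zero with ``.set`` before the next kernel, as shown in the
later example source code.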
.. _amdgpu-assembler-directive-amdgcn-target:

.amdgcn_target <target-triple> "-" <target-id>
++++++++++++++++++++++++++++++++++++++++++++++

Optional directive which declares the ``<target-triple>-<target-id>`` supported
by the containing assembler source file. Used by the assembler to validate
command-line options such as ``-triple``, ``-mcpu``, and
``--offload-arch=<target-id>``. A non-canonical target ID is allowed. See
:ref:`amdgpu-target-triples` and :ref:`amdgpu-target-id`.

.. note::

  The target ID syntax used for code object V2 to V3 for this directive differs
  from that used elsewhere. See :ref:`amdgpu-target-id-v2-v3`.

.amdhsa_kernel <name>
+++++++++++++++++++++

Creates a correctly aligned AMDHSA kernel descriptor and a symbol,
``<name>.kd``, in the current location of the current section. Only valid when
the OS is ``amdhsa``. ``<name>`` must be a symbol that labels the first
instruction to execute, and does not need to be previously defined.

Marks the beginning of a list of directives used to generate the bytes of a
kernel descriptor, as described in :ref:`amdgpu-amdhsa-kernel-descriptor`.
Directives which may appear in this list are described in
:ref:`amdhsa-kernel-directives-table`. Directives may appear in any order, must
be valid for the target being assembled for, and cannot be repeated. Directives
support the range of values specified by the field they reference in
:ref:`amdgpu-amdhsa-kernel-descriptor`. If a directive is not specified, it is
assumed to have its default value, unless it is marked as "Required", in which
case it is an error to omit the directive. This list of directives is
terminated by an ``.end_amdhsa_kernel`` directive.

  .. table:: AMDHSA Kernel Assembler Directives
     :name: amdhsa-kernel-directives-table

     ======================================================== =================== ============ ===================
     Directive                                                Default             Supported On Description
     ======================================================== =================== ============ ===================
     ``.amdhsa_group_segment_fixed_size``                     0                   GFX6-GFX10   Controls GROUP_SEGMENT_FIXED_SIZE in :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
     ``.amdhsa_private_segment_fixed_size``                   0                   GFX6-GFX10   Controls PRIVATE_SEGMENT_FIXED_SIZE in :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
     ``.amdhsa_kernarg_size``                                 0                   GFX6-GFX10   Controls KERNARG_SIZE in :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
     ``.amdhsa_user_sgpr_private_segment_buffer``             0                   GFX6-GFX10   Controls ENABLE_SGPR_PRIVATE_SEGMENT_BUFFER in :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
     ``.amdhsa_user_sgpr_dispatch_ptr``                       0                   GFX6-GFX10   Controls ENABLE_SGPR_DISPATCH_PTR in :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
     ``.amdhsa_user_sgpr_queue_ptr``                          0                   GFX6-GFX10   Controls ENABLE_SGPR_QUEUE_PTR in :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
     ``.amdhsa_user_sgpr_kernarg_segment_ptr``                0                   GFX6-GFX10   Controls ENABLE_SGPR_KERNARG_SEGMENT_PTR in :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
     ``.amdhsa_user_sgpr_dispatch_id``                        0                   GFX6-GFX10   Controls ENABLE_SGPR_DISPATCH_ID in :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
     ``.amdhsa_user_sgpr_flat_scratch_init``                  0                   GFX6-GFX10   Controls ENABLE_SGPR_FLAT_SCRATCH_INIT in :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
     ``.amdhsa_user_sgpr_private_segment_size``               0                   GFX6-GFX10   Controls ENABLE_SGPR_PRIVATE_SEGMENT_SIZE in :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
     ``.amdhsa_wavefront_size32``                             Target              GFX10        Controls ENABLE_WAVEFRONT_SIZE32 in :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
                                                              Feature
                                                              Specific
                                                              (wavefrontsize64)
     ``.amdhsa_system_sgpr_private_segment_wavefront_offset`` 0                   GFX6-GFX10   Controls ENABLE_PRIVATE_SEGMENT in :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
     ``.amdhsa_system_sgpr_workgroup_id_x``                   1                   GFX6-GFX10   Controls ENABLE_SGPR_WORKGROUP_ID_X in :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
     ``.amdhsa_system_sgpr_workgroup_id_y``                   0                   GFX6-GFX10   Controls ENABLE_SGPR_WORKGROUP_ID_Y in :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
     ``.amdhsa_system_sgpr_workgroup_id_z``                   0                   GFX6-GFX10   Controls ENABLE_SGPR_WORKGROUP_ID_Z in :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
     ``.amdhsa_system_sgpr_workgroup_info``                   0                   GFX6-GFX10   Controls ENABLE_SGPR_WORKGROUP_INFO in :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
     ``.amdhsa_system_vgpr_workitem_id``                      0                   GFX6-GFX10   Controls ENABLE_VGPR_WORKITEM_ID in :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`. Possible values are defined in :ref:`amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table`.
     ``.amdhsa_next_free_vgpr``                               Required            GFX6-GFX10   Maximum VGPR number explicitly referenced, plus one. Used to calculate GRANULATED_WORKITEM_VGPR_COUNT in :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
     ``.amdhsa_next_free_sgpr``                               Required            GFX6-GFX10   Maximum SGPR number explicitly referenced, plus one. Used to calculate GRANULATED_WAVEFRONT_SGPR_COUNT in :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
     ``.amdhsa_accum_offset``                                 Required            GFX90A       Offset of the first AccVGPR in the unified register file. Used to calculate ACCUM_OFFSET in :ref:`amdgpu-amdhsa-compute_pgm_rsrc3-gfx90a-table`.
     ``.amdhsa_reserve_vcc``                                  1                   GFX6-GFX10   Whether the kernel may use the special VCC SGPR. Used to calculate GRANULATED_WAVEFRONT_SGPR_COUNT in :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
     ``.amdhsa_reserve_flat_scratch``                         1                   GFX7-GFX10   Whether the kernel may use flat instructions to access scratch memory. Used to calculate GRANULATED_WAVEFRONT_SGPR_COUNT in :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
     ``.amdhsa_reserve_xnack_mask``                           Target              GFX8-GFX10   Whether the kernel may trigger XNACK replay. Used to calculate GRANULATED_WAVEFRONT_SGPR_COUNT in :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
                                                              Feature
                                                              Specific
                                                              (xnack)
     ``.amdhsa_float_round_mode_32``                          0                   GFX6-GFX10   Controls FLOAT_ROUND_MODE_32 in :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`. Possible values are defined in :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
     ``.amdhsa_float_round_mode_16_64``                       0                   GFX6-GFX10   Controls FLOAT_ROUND_MODE_16_64 in :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`. Possible values are defined in :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
     ``.amdhsa_float_denorm_mode_32``                         0                   GFX6-GFX10   Controls FLOAT_DENORM_MODE_32 in :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`. Possible values are defined in :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
     ``.amdhsa_float_denorm_mode_16_64``                      3                   GFX6-GFX10   Controls FLOAT_DENORM_MODE_16_64 in :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`. Possible values are defined in :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
     ``.amdhsa_dx10_clamp``                                   1                   GFX6-GFX10   Controls ENABLE_DX10_CLAMP in :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
     ``.amdhsa_ieee_mode``                                    1                   GFX6-GFX10   Controls ENABLE_IEEE_MODE in :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
     ``.amdhsa_fp16_overflow``                                0                   GFX9-GFX10   Controls FP16_OVFL in :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
     ``.amdhsa_tg_split``                                     Target              GFX90A       Controls TG_SPLIT in :ref:`amdgpu-amdhsa-compute_pgm_rsrc3-gfx90a-table`.
                                                              Feature
                                                              Specific
                                                              (tgsplit)
     ``.amdhsa_workgroup_processor_mode``                     Target              GFX10        Controls ENABLE_WGP_MODE in :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
                                                              Feature
                                                              Specific
                                                              (cumode)
     ``.amdhsa_memory_ordered``                               1                   GFX10        Controls MEM_ORDERED in :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
     ``.amdhsa_forward_progress``                             0                   GFX10        Controls FWD_PROGRESS in :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
     ``.amdhsa_exception_fp_ieee_invalid_op``                 0                   GFX6-GFX10   Controls ENABLE_EXCEPTION_IEEE_754_FP_INVALID_OPERATION in :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
     ``.amdhsa_exception_fp_denorm_src``                      0                   GFX6-GFX10   Controls ENABLE_EXCEPTION_FP_DENORMAL_SOURCE in :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
     ``.amdhsa_exception_fp_ieee_div_zero``                   0                   GFX6-GFX10   Controls ENABLE_EXCEPTION_IEEE_754_FP_DIVISION_BY_ZERO in :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
     ``.amdhsa_exception_fp_ieee_overflow``                   0                   GFX6-GFX10   Controls ENABLE_EXCEPTION_IEEE_754_FP_OVERFLOW in :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
     ``.amdhsa_exception_fp_ieee_underflow``                  0                   GFX6-GFX10   Controls ENABLE_EXCEPTION_IEEE_754_FP_UNDERFLOW in :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
     ``.amdhsa_exception_fp_ieee_inexact``                    0                   GFX6-GFX10   Controls ENABLE_EXCEPTION_IEEE_754_FP_INEXACT in :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
     ``.amdhsa_exception_int_div_zero``                       0                   GFX6-GFX10   Controls ENABLE_EXCEPTION_INT_DIVIDE_BY_ZERO in :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
     ======================================================== =================== ============ ===================

.amdgpu_metadata
++++++++++++++++

Optional directive which declares the contents of the ``NT_AMDGPU_METADATA``
note record (see :ref:`amdgpu-elf-note-records-table-v3-v4`).

The contents must be in the [YAML]_ markup format, with the same structure and
semantics described in :ref:`amdgpu-amdhsa-code-object-metadata-v3` or
:ref:`amdgpu-amdhsa-code-object-metadata-v4`.

This directive is terminated by an ``.end_amdgpu_metadata`` directive.

.. _amdgpu-amdhsa-assembler-example-v3-v4:

Code Object V3 to V4 Example Source Code
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Here is an example of a minimal assembly source file, defining one HSA kernel:
.. code::
   :number-lines:

   .amdgcn_target "amdgcn-amd-amdhsa--gfx900+xnack" // optional

   .text
   .globl hello_world
   .p2align 8
   .type hello_world,@function
   hello_world:
     s_load_dwordx2 s[0:1], s[0:1] 0x0
     v_mov_b32 v0, 3.14159
     s_waitcnt lgkmcnt(0)
     v_mov_b32 v1, s0
     v_mov_b32 v2, s1
     flat_store_dword v[1:2], v0
     s_endpgm
   .Lfunc_end0:
     .size hello_world, .Lfunc_end0-hello_world

   .rodata
   .p2align 6
   .amdhsa_kernel hello_world
     .amdhsa_user_sgpr_kernarg_segment_ptr 1
     .amdhsa_next_free_vgpr .amdgcn.next_free_vgpr
     .amdhsa_next_free_sgpr .amdgcn.next_free_sgpr
   .end_amdhsa_kernel

   .amdgpu_metadata
   ---
   amdhsa.version:
     - 1
     - 0
   amdhsa.kernels:
     - .name: hello_world
       .symbol: hello_world.kd
       .kernarg_segment_size: 48
       .group_segment_fixed_size: 0
       .private_segment_fixed_size: 0
       .kernarg_segment_align: 4
       .wavefront_size: 64
       .sgpr_count: 2
       .vgpr_count: 3
       .max_flat_workgroup_size: 256
   ...
   .end_amdgpu_metadata

If an assembly source file contains multiple kernels and/or functions, the
:ref:`amdgpu-amdhsa-assembler-symbol-next_free_vgpr` and
:ref:`amdgpu-amdhsa-assembler-symbol-next_free_sgpr` symbols may be reset using
the ``.set <symbol>, <expression>`` directive. For example, in the case of two
kernels, where ``func1`` is only called from ``kern1``, it is sufficient to
group the function with the kernel that calls it and reset the symbols between
the two connected components:

.. code::
   :number-lines:

   .amdgcn_target "amdgcn-amd-amdhsa--gfx900+xnack" // optional

   // gpr tracking symbols are implicitly set to zero

   .text
   .globl kern0
   .p2align 8
   .type kern0,@function
   kern0:
     // ...
     s_endpgm
   .Lkern0_end:
     .size kern0, .Lkern0_end-kern0

   .rodata
   .p2align 6
   .amdhsa_kernel kern0
     // ...
     .amdhsa_next_free_vgpr .amdgcn.next_free_vgpr
     .amdhsa_next_free_sgpr .amdgcn.next_free_sgpr
   .end_amdhsa_kernel

   // reset symbols to begin tracking usage in func1 and kern1
   .set .amdgcn.next_free_vgpr, 0
   .set .amdgcn.next_free_sgpr, 0

   .text
   .hidden func1
   .global func1
   .p2align 2
   .type func1,@function
   func1:
     // ...
     s_setpc_b64 s[30:31]
   .Lfunc1_end:
     .size func1, .Lfunc1_end-func1

   .globl kern1
   .p2align 8
   .type kern1,@function
   kern1:
     // ...
     s_getpc_b64 s[4:5]
     s_add_u32 s4, s4, func1@rel32@lo+4
     s_addc_u32 s5, s5, func1@rel32@hi+4
     s_swappc_b64 s[30:31], s[4:5]
     // ...
     s_endpgm
   .Lkern1_end:
     .size kern1, .Lkern1_end-kern1

   .rodata
   .p2align 6
   .amdhsa_kernel kern1
     // ...
     .amdhsa_next_free_vgpr .amdgcn.next_free_vgpr
     .amdhsa_next_free_sgpr .amdgcn.next_free_sgpr
   .end_amdhsa_kernel

These symbols cannot identify connected components in order to automatically
track the usage for each kernel. However, in some cases careful organization of
the kernels and functions in the source file means there is minimal additional
effort required to accurately calculate GPR usage.

Additional Documentation
========================

.. [AMD-GCN-GFX6] `AMD Southern Islands Series ISA <http://developer.amd.com/wordpress/media/2012/12/AMD_Southern_Islands_Instruction_Set_Architecture.pdf>`__
.. [AMD-GCN-GFX7] `AMD Sea Islands Series ISA <http://developer.amd.com/wordpress/media/2013/07/AMD_Sea_Islands_Instruction_Set_Architecture.pdf>`__
.. [AMD-GCN-GFX8] `AMD GCN3 Instruction Set Architecture <http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2013/12/AMD_GCN3_Instruction_Set_Architecture_rev1.1.pdf>`__
.. [AMD-GCN-GFX9] `AMD Vega Instruction Set Architecture <http://developer.amd.com/wordpress/media/2013/12/Vega_Shader_ISA_28July2017.pdf>`__
.. [AMD-GCN-GFX908-CDNA1] `AMD Instinct MI100 Instruction Set Architecture <https://developer.amd.com/wp-content/resources/CDNA1_Shader_ISA_14December2020.pdf>`__
.. [AMD-GCN-GFX10-RDNA1] `AMD RDNA 1.0 Instruction Set Architecture <https://gpuopen.com/wp-content/uploads/2019/08/RDNA_Shader_ISA_5August2019.pdf>`__
.. [AMD-GCN-GFX10-RDNA2] `AMD RDNA 2 Instruction Set Architecture <https://developer.amd.com/wp-content/resources/RDNA2_Shader_ISA_November2020.pdf>`__
.. [AMD-RADEON-HD-2000-3000] `AMD R6xx shader ISA <http://developer.amd.com/wordpress/media/2012/10/R600_Instruction_Set_Architecture.pdf>`__
.. [AMD-RADEON-HD-4000] `AMD R7xx shader ISA <http://developer.amd.com/wordpress/media/2012/10/R700-Family_Instruction_Set_Architecture.pdf>`__
.. [AMD-RADEON-HD-5000] `AMD Evergreen shader ISA <http://developer.amd.com/wordpress/media/2012/10/AMD_Evergreen-Family_Instruction_Set_Architecture.pdf>`__
.. [AMD-RADEON-HD-6000] `AMD Cayman/Trinity shader ISA <http://developer.amd.com/wordpress/media/2012/10/AMD_HD_6900_Series_Instruction_Set_Architecture.pdf>`__
.. [AMD-ROCm] `AMD ROCm™ Platform <https://rocmdocs.amd.com/>`__
.. [AMD-ROCm-github] `AMD ROCm™ github <http://github.com/RadeonOpenCompute>`__
.. [AMD-ROCm-Release-Notes] `AMD ROCm Release Notes <https://github.com/RadeonOpenCompute/ROCm>`__
.. [CLANG-ATTR] `Attributes in Clang <https://clang.llvm.org/docs/AttributeReference.html>`__
.. [DWARF] `DWARF Debugging Information Format <http://dwarfstd.org/>`__
.. [ELF] `Executable and Linkable Format (ELF) <http://www.sco.com/developers/gabi/>`__
.. [HRF] `Heterogeneous-race-free Memory Models <http://benedictgaster.org/wp-content/uploads/2014/01/asplos269-FINAL.pdf>`__
.. [HSA] `Heterogeneous System Architecture (HSA) Foundation <http://www.hsafoundation.com/>`__
.. [MsgPack] `Message Pack <http://www.msgpack.org/>`__
.. [OpenCL] `The OpenCL Specification Version 2.0 <http://www.khronos.org/registry/cl/specs/opencl-2.0.pdf>`__
.. [SEMVER] `Semantic Versioning <https://semver.org/>`__
.. [YAML] `YAML Ain't Markup Language (YAML™) Version 1.2 <http://www.yaml.org/spec/1.2/spec.html>`__