=============================
User Guide for AMDGPU Backend
=============================

.. contents::
   :local:

.. toctree::
   :hidden:

   AMDGPU/AMDGPUAsmGFX7
   AMDGPU/AMDGPUAsmGFX8
   AMDGPU/AMDGPUAsmGFX9
   AMDGPU/AMDGPUAsmGFX900
   AMDGPU/AMDGPUAsmGFX904
   AMDGPU/AMDGPUAsmGFX906
   AMDGPU/AMDGPUAsmGFX908
   AMDGPU/AMDGPUAsmGFX90a
   AMDGPU/AMDGPUAsmGFX10
   AMDGPU/AMDGPUAsmGFX1011
   AMDGPUModifierSyntax
   AMDGPUOperandSyntax
   AMDGPUInstructionSyntax
   AMDGPUInstructionNotation
   AMDGPUDwarfExtensionsForHeterogeneousDebugging

Introduction
============

The AMDGPU backend provides ISA code generation for AMD GPUs, starting with the
R600 family up until the current GCN families. It lives in the
``llvm/lib/Target/AMDGPU`` directory.

LLVM
====

.. _amdgpu-target-triples:

Target Triples
--------------

Use the Clang option ``-target <Architecture>-<Vendor>-<OS>-<Environment>``
to specify the target triple:

  .. table:: AMDGPU Architectures
     :name: amdgpu-architecture-table

     ============ ==============================================================
     Architecture Description
     ============ ==============================================================
     ``r600``     AMD GPUs HD2XXX-HD6XXX for graphics and compute shaders.
     ``amdgcn``   AMD GPUs GCN GFX6 onwards for graphics and compute shaders.
     ============ ==============================================================

  .. table:: AMDGPU Vendors
     :name: amdgpu-vendor-table

     ============ ==============================================================
     Vendor       Description
     ============ ==============================================================
     ``amd``      Can be used for all AMD GPU usage.
     ``mesa3d``   Can be used if the OS is ``mesa3d``.
     ============ ==============================================================

  .. table:: AMDGPU Operating Systems
     :name: amdgpu-os

     ============== ============================================================
     OS             Description
     ============== ============================================================
     *<empty>*      Defaults to the *unknown* OS.
     ``amdhsa``     Compute kernels executed on HSA [HSA]_ compatible runtimes
                    such as:

                    - AMD's ROCm™ runtime [AMD-ROCm]_ using the *rocm-amdhsa*
                      loader on Linux. See *AMD ROCm Platform Release Notes*
                      [AMD-ROCm-Release-Notes]_ for supported hardware and
                      software.
                    - AMD's PAL runtime using the *pal-amdhsa* loader on
                      Windows.

     ``amdpal``     Graphic shaders and compute kernels executed on AMD's PAL
                    runtime using the *pal-amdpal* loader on Windows and Linux
                    Pro.
     ``mesa3d``     Graphic shaders and compute kernels executed on AMD's Mesa
                    3D runtime using the *mesa-mesa3d* loader on Linux.
     ============== ============================================================

  .. table:: AMDGPU Environments
     :name: amdgpu-environment-table

     ============ ==============================================================
     Environment  Description
     ============ ==============================================================
     *<empty>*    Default.
     ============ ==============================================================

.. _amdgpu-processors:

Processors
----------

Use the Clang options ``-mcpu=<target-id>`` or ``--offload-arch=<target-id>`` to
specify the AMDGPU processor together with optional target features. See
:ref:`amdgpu-target-id` and :ref:`amdgpu-target-features` for AMD GPU target
specific information.

Every processor supports every OS ABI (see :ref:`amdgpu-os`) with the following
exceptions:

* ``amdhsa`` is not supported in the ``r600`` architecture (see
  :ref:`amdgpu-architecture-table`).

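As an illustration of how such a target ID decomposes, the following sketch
(a hypothetical helper, not part of Clang or LLVM) splits an ID such as
``gfx90a:sramecc+:xnack-`` into its processor name and feature settings:

```python
# Hypothetical helper (not part of Clang/LLVM): split a target ID of the
# form <processor>[:<feature><sign>...] into its components.
def parse_target_id(target_id):
    processor, *features = target_id.split(":")
    settings = {}
    for feature in features:
        name, sign = feature[:-1], feature[-1]
        if sign not in "+-":
            raise ValueError("feature must end in '+' (On) or '-' (Off)")
        settings[name] = sign == "+"
    return processor, settings

print(parse_target_id("gfx90a:sramecc+:xnack-"))
# → ('gfx90a', {'sramecc': True, 'xnack': False})
```

A feature omitted from the ID corresponds to the ``any`` setting described
under :ref:`amdgpu-target-features`.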
  .. table:: AMDGPU Processors
     :name: amdgpu-processor-table

     =========== =============== ============ ===== ================= =============== =============== ======================
     Processor   Alternative     Target       dGPU/ Target            Target          OS Support      Example
                 Processor       Triple       APU   Features          Properties      *(see*          Products
                                 Architecture       Supported                         `amdgpu-os`_
                                                                                      *and
                                                                                      corresponding
                                                                                      runtime release
                                                                                      notes for
                                                                                      current
                                                                                      information and
                                                                                      level of
                                                                                      support)*
     =========== =============== ============ ===== ================= =============== =============== ======================
     **Radeon HD 2000/3000 Series (R600)** [AMD-RADEON-HD-2000-3000]_
     -----------------------------------------------------------------------------------------------------------------------
     ``r600``                    ``r600``     dGPU                    - Does not
                                                                       support
                                                                       generic
                                                                       address
                                                                       space
     ``r630``                    ``r600``     dGPU                    - Does not
                                                                       support
                                                                       generic
                                                                       address
                                                                       space
     ``rs880``                   ``r600``     dGPU                    - Does not
                                                                       support
                                                                       generic
                                                                       address
                                                                       space
     ``rv670``                   ``r600``     dGPU                    - Does not
                                                                       support
                                                                       generic
                                                                       address
                                                                       space
     **Radeon HD 4000 Series (R700)** [AMD-RADEON-HD-4000]_
     -----------------------------------------------------------------------------------------------------------------------
     ``rv710``                   ``r600``     dGPU                    - Does not
                                                                       support
                                                                       generic
                                                                       address
                                                                       space
     ``rv730``                   ``r600``     dGPU                    - Does not
                                                                       support
                                                                       generic
                                                                       address
                                                                       space
     ``rv770``                   ``r600``     dGPU                    - Does not
                                                                       support
                                                                       generic
                                                                       address
                                                                       space
     **Radeon HD 5000 Series (Evergreen)** [AMD-RADEON-HD-5000]_
     -----------------------------------------------------------------------------------------------------------------------
     ``cedar``                   ``r600``     dGPU                    - Does not
                                                                       support
                                                                       generic
                                                                       address
                                                                       space
     ``cypress``                 ``r600``     dGPU                    - Does not
                                                                       support
                                                                       generic
                                                                       address
                                                                       space
     ``juniper``                 ``r600``     dGPU                    - Does not
                                                                       support
                                                                       generic
                                                                       address
                                                                       space
     ``redwood``                 ``r600``     dGPU                    - Does not
                                                                       support
                                                                       generic
                                                                       address
                                                                       space
     ``sumo``                    ``r600``     dGPU                    - Does not
                                                                       support
                                                                       generic
                                                                       address
                                                                       space
     **Radeon HD 6000 Series (Northern Islands)** [AMD-RADEON-HD-6000]_
     -----------------------------------------------------------------------------------------------------------------------
     ``barts``                   ``r600``     dGPU                    - Does not
                                                                       support
                                                                       generic
                                                                       address
                                                                       space
     ``caicos``                  ``r600``     dGPU                    - Does not
                                                                       support
                                                                       generic
                                                                       address
                                                                       space
     ``cayman``                  ``r600``     dGPU                    - Does not
                                                                       support
                                                                       generic
                                                                       address
                                                                       space
     ``turks``                   ``r600``     dGPU                    - Does not
                                                                       support
                                                                       generic
                                                                       address
                                                                       space
     **GCN GFX6 (Southern Islands (SI))** [AMD-GCN-GFX6]_
     -----------------------------------------------------------------------------------------------------------------------
     ``gfx600``  - ``tahiti``    ``amdgcn``   dGPU                    - Does not      - *pal-amdpal*
                                                                       support
                                                                       generic
                                                                       address
                                                                       space
     ``gfx601``  - ``pitcairn``  ``amdgcn``   dGPU                    - Does not      - *pal-amdpal*
                 - ``verde``                                            support
                                                                       generic
                                                                       address
                                                                       space
     ``gfx602``  - ``hainan``    ``amdgcn``   dGPU                    - Does not      - *pal-amdpal*
                 - ``oland``                                            support
                                                                       generic
                                                                       address
                                                                       space
     **GCN GFX7 (Sea Islands (CI))** [AMD-GCN-GFX7]_
     -----------------------------------------------------------------------------------------------------------------------
     ``gfx700``  - ``kaveri``    ``amdgcn``   APU                     - Offset        - *rocm-amdhsa* - A6-7000
                                                                       flat          - *pal-amdhsa*  - A6 Pro-7050B
                                                                       scratch       - *pal-amdpal*  - A8-7100
                                                                                                     - A8 Pro-7150B
                                                                                                     - A10-7300
                                                                                                     - A10 Pro-7350B
                                                                                                     - FX-7500
                                                                                                     - A8-7200P
                                                                                                     - A10-7400P
                                                                                                     - FX-7600P
     ``gfx701``  - ``hawaii``    ``amdgcn``   dGPU                    - Offset        - *rocm-amdhsa* - FirePro W8100
                                                                       flat          - *pal-amdhsa*  - FirePro W9100
                                                                       scratch       - *pal-amdpal*  - FirePro S9150
                                                                                                     - FirePro S9170
     ``gfx702``                  ``amdgcn``   dGPU                    - Offset        - *rocm-amdhsa* - Radeon R9 290
                                                                       flat          - *pal-amdhsa*  - Radeon R9 290x
                                                                       scratch       - *pal-amdpal*  - Radeon R390
                                                                                                     - Radeon R390x
     ``gfx703``  - ``kabini``    ``amdgcn``   APU                     - Offset        - *pal-amdhsa*  - E1-2100
                 - ``mullins``                                          flat          - *pal-amdpal*  - E1-2200
                                                                       scratch                       - E1-2500
                                                                                                     - E2-3000
                                                                                                     - E2-3800
                                                                                                     - A4-5000
                                                                                                     - A4-5100
                                                                                                     - A6-5200
                                                                                                     - A4 Pro-3340B
     ``gfx704``  - ``bonaire``   ``amdgcn``   dGPU                    - Offset        - *pal-amdhsa*  - Radeon HD 7790
                                                                       flat          - *pal-amdpal*  - Radeon HD 8770
                                                                       scratch                       - R7 260
                                                                                                     - R7 260X
     ``gfx705``                  ``amdgcn``   APU                     - Offset        - *pal-amdhsa*  *TBA*
                                                                       flat          - *pal-amdpal*
                                                                       scratch                       .. TODO::

                                                                                                        Add product
                                                                                                        names.

     **GCN GFX8 (Volcanic Islands (VI))** [AMD-GCN-GFX8]_
     -----------------------------------------------------------------------------------------------------------------------
     ``gfx801``  - ``carrizo``   ``amdgcn``   APU   - xnack           - Offset        - *rocm-amdhsa* - A6-8500P
                                                                       flat          - *pal-amdhsa*  - Pro A6-8500B
                                                                       scratch       - *pal-amdpal*  - A8-8600P
                                                                                                     - Pro A8-8600B
                                                                                                     - FX-8800P
                                                                                                     - Pro A12-8800B
                                                                                                     - A10-8700P
                                                                                                     - Pro A10-8700B
                                                                                                     - A10-8780P
                                                                                                     - A10-9600P
                                                                                                     - A10-9630P
                                                                                                     - A12-9700P
                                                                                                     - A12-9730P
                                                                                                     - FX-9800P
                                                                                                     - FX-9830P
                                                                                                     - E2-9010
                                                                                                     - A6-9210
                                                                                                     - A9-9410
     ``gfx802``  - ``iceland``   ``amdgcn``   dGPU                    - Offset        - *rocm-amdhsa* - Radeon R9 285
                 - ``tonga``                                            flat          - *pal-amdhsa*  - Radeon R9 380
                                                                       scratch       - *pal-amdpal*  - Radeon R9 385
     ``gfx803``  - ``fiji``      ``amdgcn``   dGPU                                    - *rocm-amdhsa* - Radeon R9 Nano
                                                                                      - *pal-amdhsa*  - Radeon R9 Fury
                                                                                      - *pal-amdpal*  - Radeon R9 FuryX
                                                                                                     - Radeon Pro Duo
                                                                                                     - FirePro S9300x2
                                                                                                     - Radeon Instinct MI8
     \           - ``polaris10`` ``amdgcn``   dGPU                    - Offset        - *rocm-amdhsa* - Radeon RX 470
                                                                       flat          - *pal-amdhsa*  - Radeon RX 480
                                                                       scratch       - *pal-amdpal*  - Radeon Instinct MI6
     \           - ``polaris11`` ``amdgcn``   dGPU                    - Offset        - *rocm-amdhsa* - Radeon RX 460
                                                                       flat          - *pal-amdhsa*
                                                                       scratch       - *pal-amdpal*
     ``gfx805``  - ``tongapro``  ``amdgcn``   dGPU                    - Offset        - *rocm-amdhsa* - FirePro S7150
                                                                       flat          - *pal-amdhsa*  - FirePro S7100
                                                                       scratch       - *pal-amdpal*  - FirePro W7100
                                                                                                     - Mobile FirePro
                                                                                                       M7170
     ``gfx810``  - ``stoney``    ``amdgcn``   APU   - xnack           - Offset        - *rocm-amdhsa* *TBA*
                                                                       flat          - *pal-amdhsa*
                                                                       scratch       - *pal-amdpal*  .. TODO::

                                                                                                        Add product
                                                                                                        names.

     **GCN GFX9 (Vega)** [AMD-GCN-GFX900-GFX904-VEGA]_ [AMD-GCN-GFX906-VEGA7NM]_ [AMD-GCN-GFX908-CDNA1]_
     -----------------------------------------------------------------------------------------------------------------------
     ``gfx900``                  ``amdgcn``   dGPU  - xnack           - Absolute      - *rocm-amdhsa* - Radeon Vega
                                                                       flat          - *pal-amdhsa*    Frontier Edition
                                                                       scratch       - *pal-amdpal*  - Radeon RX Vega 56
                                                                                                     - Radeon RX Vega 64
                                                                                                     - Radeon RX Vega 64
                                                                                                       Liquid
                                                                                                     - Radeon Instinct MI25
     ``gfx902``                  ``amdgcn``   APU   - xnack           - Absolute      - *rocm-amdhsa* - Ryzen 3 2200G
                                                                       flat          - *pal-amdhsa*  - Ryzen 5 2400G
                                                                       scratch       - *pal-amdpal*
     ``gfx904``                  ``amdgcn``   dGPU  - xnack                           - *rocm-amdhsa* *TBA*
                                                                                      - *pal-amdhsa*
                                                                                      - *pal-amdpal*  .. TODO::

                                                                                                        Add product
                                                                                                        names.

     ``gfx906``                  ``amdgcn``   dGPU  - sramecc         - Absolute      - *rocm-amdhsa* - Radeon Instinct MI50
                                                    - xnack             flat          - *pal-amdhsa*  - Radeon Instinct MI60
                                                                       scratch       - *pal-amdpal*  - Radeon VII
                                                                                                     - Radeon Pro VII
     ``gfx908``                  ``amdgcn``   dGPU  - sramecc                         - *rocm-amdhsa* - AMD Instinct MI100 Accelerator
                                                    - xnack           - Absolute
                                                                       flat
                                                                       scratch
     ``gfx909``                  ``amdgcn``   APU   - xnack           - Absolute      - *pal-amdpal*  *TBA*
                                                                       flat
                                                                       scratch                       .. TODO::

                                                                                                        Add product
                                                                                                        names.

     ``gfx90a``                  ``amdgcn``   dGPU  - sramecc         - Absolute      - *rocm-amdhsa* *TBA*
                                                    - tgsplit           flat
                                                    - xnack             scratch                      .. TODO::
                                                                      - Packed
                                                                       work-item                        Add product
                                                                       IDs                              names.

     ``gfx90c``                  ``amdgcn``   APU   - xnack           - Absolute      - *pal-amdpal*  - Ryzen 7 4700G
                                                                       flat                          - Ryzen 7 4700GE
                                                                       scratch                       - Ryzen 5 4600G
                                                                                                     - Ryzen 5 4600GE
                                                                                                     - Ryzen 3 4300G
                                                                                                     - Ryzen 3 4300GE
                                                                                                     - Ryzen Pro 4000G
                                                                                                     - Ryzen 7 Pro 4700G
                                                                                                     - Ryzen 7 Pro 4750GE
                                                                                                     - Ryzen 5 Pro 4650G
                                                                                                     - Ryzen 5 Pro 4650GE
                                                                                                     - Ryzen 3 Pro 4350G
                                                                                                     - Ryzen 3 Pro 4350GE

     **GCN GFX10 (RDNA 1)** [AMD-GCN-GFX10-RDNA1]_
     -----------------------------------------------------------------------------------------------------------------------
     ``gfx1010``                 ``amdgcn``   dGPU  - cumode          - Absolute      - *rocm-amdhsa* - Radeon RX 5700
                                                    - wavefrontsize64   flat          - *pal-amdhsa*  - Radeon RX 5700 XT
                                                    - xnack             scratch       - *pal-amdpal*  - Radeon Pro 5600 XT
                                                                                                     - Radeon Pro 5600M
     ``gfx1011``                 ``amdgcn``   dGPU  - cumode                          - *rocm-amdhsa* - Radeon Pro V520
                                                    - wavefrontsize64 - Absolute      - *pal-amdhsa*
                                                    - xnack             flat          - *pal-amdpal*
                                                                       scratch
     ``gfx1012``                 ``amdgcn``   dGPU  - cumode          - Absolute      - *rocm-amdhsa* - Radeon RX 5500
                                                    - wavefrontsize64   flat          - *pal-amdhsa*  - Radeon RX 5500 XT
                                                    - xnack             scratch       - *pal-amdpal*
     ``gfx1013``                 ``amdgcn``   APU   - cumode          - Absolute      - *rocm-amdhsa* *TBA*
                                                    - wavefrontsize64   flat          - *pal-amdhsa*
                                                    - xnack             scratch       - *pal-amdpal*  .. TODO::

                                                                                                        Add product
                                                                                                        names.

     **GCN GFX10 (RDNA 2)** [AMD-GCN-GFX10-RDNA2]_
     -----------------------------------------------------------------------------------------------------------------------
     ``gfx1030``                 ``amdgcn``   dGPU  - cumode          - Absolute      - *rocm-amdhsa* - Radeon RX 6800
                                                    - wavefrontsize64   flat          - *pal-amdhsa*  - Radeon RX 6800 XT
                                                                       scratch       - *pal-amdpal*  - Radeon RX 6900 XT
     ``gfx1031``                 ``amdgcn``   dGPU  - cumode          - Absolute      - *rocm-amdhsa* - Radeon RX 6700 XT
                                                    - wavefrontsize64   flat          - *pal-amdhsa*
                                                                       scratch       - *pal-amdpal*
     ``gfx1032``                 ``amdgcn``   dGPU  - cumode          - Absolute      - *rocm-amdhsa* *TBA*
                                                    - wavefrontsize64   flat          - *pal-amdhsa*
                                                                       scratch       - *pal-amdpal*  .. TODO::

                                                                                                        Add product
                                                                                                        names.

     ``gfx1033``                 ``amdgcn``   APU   - cumode          - Absolute      - *pal-amdpal*  *TBA*
                                                    - wavefrontsize64   flat
                                                                       scratch                       .. TODO::

                                                                                                        Add product
                                                                                                        names.

     ``gfx1034``                 ``amdgcn``   dGPU  - cumode          - Absolute      - *pal-amdpal*  *TBA*
                                                    - wavefrontsize64   flat
                                                                       scratch                       .. TODO::

                                                                                                        Add product
                                                                                                        names.

     ``gfx1035``                 ``amdgcn``   APU   - cumode          - Absolute      - *pal-amdpal*  *TBA*
                                                    - wavefrontsize64   flat
                                                                       scratch                       .. TODO::

                                                                                                        Add product
                                                                                                        names.

     =========== =============== ============ ===== ================= =============== =============== ======================

.. _amdgpu-target-features:

Target Features
---------------

Target features control how code is generated to support certain
processor specific features. Not all target features are supported by
all processors. The runtime must ensure that the features supported by
the device used to execute the code match the features enabled when
generating the code. A mismatch of features may result in incorrect
execution, or a reduction in performance.

The target features supported by each processor are listed in
:ref:`amdgpu-processor-table`.

Target features are controlled by exactly one of the following Clang
options:

``-mcpu=<target-id>`` or ``--offload-arch=<target-id>``

  The ``-mcpu`` and ``--offload-arch`` options can specify the target feature
  as optional components of the target ID. If omitted, the target feature has
  the ``any`` value. See :ref:`amdgpu-target-id`.

``-m[no-]<target-feature>``

  Target features not specified by the target ID are specified using a
  separate option. These target features can have an ``on`` or ``off``
  value. ``on`` is specified by omitting the ``no-`` prefix, and
  ``off`` is specified by including the ``no-`` prefix. The default
  if not specified is ``off``.

For example:

``-mcpu=gfx908:xnack+``
  Enable the ``xnack`` feature.
``-mcpu=gfx908:xnack-``
  Disable the ``xnack`` feature.
``-mcumode``
  Enable the ``cumode`` feature.
``-mno-cumode``
  Disable the ``cumode`` feature.

  .. table:: AMDGPU Target Features
     :name: amdgpu-target-features-table

     =============== ============================ ==================================================
     Target Feature  Clang Option to Control      Description
     Name
     =============== ============================ ==================================================
     cumode          - ``-m[no-]cumode``          Control the wavefront execution mode used
                                                  when generating code for kernels. When disabled
                                                  native WGP wavefront execution mode is used,
                                                  when enabled CU wavefront execution mode is used
                                                  (see :ref:`amdgpu-amdhsa-memory-model`).

     sramecc         - ``-mcpu``                  If specified, generate code that can only be
                     - ``--offload-arch``         loaded and executed in a process that has a
                                                  matching setting for SRAMECC.

                                                  If not specified for code object V2 to V3,
                                                  generate code that can be loaded and executed in
                                                  a process with SRAMECC enabled.

                                                  If not specified for code object V4, generate
                                                  code that can be loaded and executed in a process
                                                  with either setting of SRAMECC.

     tgsplit         ``-m[no-]tgsplit``           Enable/disable generating code that assumes
                                                  work-groups are launched in threadgroup split
                                                  mode. When enabled the waves of a work-group may
                                                  be launched in different CUs.

     wavefrontsize64 - ``-m[no-]wavefrontsize64`` Control the wavefront size used when
                                                  generating code for kernels. When disabled
                                                  native wavefront size 32 is used, when enabled
                                                  wavefront size 64 is used.

     xnack           - ``-mcpu``                  If specified, generate code that can only be
                     - ``--offload-arch``         loaded and executed in a process that has a
                                                  matching setting for XNACK replay.

                                                  If not specified for code object V2 to V3,
                                                  generate code that can be loaded and executed in
                                                  a process with XNACK replay enabled.

                                                  If not specified for code object V4, generate
                                                  code that can be loaded and executed in a process
                                                  with either setting of XNACK replay.

                                                  XNACK replay can be used for demand paging and
                                                  page migration. If enabled in the device, then if
                                                  a page fault occurs the code may execute
                                                  incorrectly unless generated with XNACK replay
                                                  enabled, or generated for code object V4 without
                                                  specifying XNACK replay. Executing code that was
                                                  generated with XNACK replay enabled, or generated
                                                  for code object V4 without specifying XNACK
                                                  replay, on a device that does not have XNACK
                                                  replay enabled will execute correctly but may be
                                                  less performant than code generated for XNACK
                                                  replay disabled.
     =============== ============================ ==================================================

.. _amdgpu-target-id:

Target ID
---------

AMDGPU supports target IDs. See `Clang Offload Bundler
<https://clang.llvm.org/docs/ClangOffloadBundler.html>`_ for a general
description. The AMDGPU target specific information is:

**processor**
  Is an AMDGPU processor or alternative processor name specified in
  :ref:`amdgpu-processor-table`. The non-canonical form target ID allows both
  the primary processor and alternative processor names. The canonical form
  target ID only allows the primary processor name.

**target-feature**
  Is a target feature name specified in :ref:`amdgpu-target-features-table` that
  is supported by the processor. The target features supported by each processor
  are specified in :ref:`amdgpu-processor-table`. Those that can be specified in
  a target ID are marked as being controlled by ``-mcpu`` and
  ``--offload-arch``. Each target feature must appear at most once in a target
  ID. The non-canonical form target ID allows the target features to be
  specified in any order. The canonical form target ID requires the target
  features to be specified in alphabetic order.

.. _amdgpu-target-id-v2-v3:

Code Object V2 to V3 Target ID
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The target ID syntax for code object V2 to V3 is the same as defined in `Clang
Offload Bundler <https://clang.llvm.org/docs/ClangOffloadBundler.html>`_ except
when used in the :ref:`amdgpu-assembler-directive-amdgcn-target` assembler
directive and the bundle entry ID. In those cases it has the following BNF
syntax:

.. code::

  <target-id> ::== <processor> ( "+" <target-feature> )*

where a target feature is omitted if *Off* and present if *On* or *Any*.

.. note::

  Code object V2 to V3 cannot represent *Any* and treats it the same as
  *On*.

.. _amdgpu-embedding-bundled-objects:

Embedding Bundled Code Objects
------------------------------

AMDGPU supports the HIP and OpenMP languages that perform code object embedding
as described in `Clang Offload Bundler
<https://clang.llvm.org/docs/ClangOffloadBundler.html>`_.

.. note::

  The target ID syntax used for code object V2 to V3 for a bundle entry ID
  differs from that used elsewhere. See :ref:`amdgpu-target-id-v2-v3`.

.. _amdgpu-address-spaces:

Address Spaces
--------------

The AMDGPU architecture supports a number of memory address spaces. The address
space names use the OpenCL standard names, with some additions.

The AMDGPU address spaces correspond to target architecture specific LLVM
address space numbers used in LLVM IR.

The AMDGPU address spaces are described in
:ref:`amdgpu-address-spaces-table`. Only 64-bit process address spaces are
supported for the ``amdgcn`` target.

  .. table:: AMDGPU Address Spaces
     :name: amdgpu-address-spaces-table

     ================================= =============== =========== ================ ======= ============================
     ..                                                            64-Bit Process Address Space
     --------------------------------- --------------- ----------- -----------------------------------------------------
     Address Space Name                LLVM IR Address HSA Segment Hardware         Address NULL Value
                                       Space Number    Name        Name             Size
     ================================= =============== =========== ================ ======= ============================
     Generic                           0               flat        flat             64      0x0000000000000000
     Global                            1               global      global           64      0x0000000000000000
     Region                            2               N/A         GDS              32      *not implemented for AMDHSA*
     Local                             3               group       LDS              32      0xFFFFFFFF
     Constant                          4               constant    *same as global* 64      0x0000000000000000
     Private                           5               private     scratch          32      0xFFFFFFFF
     Constant 32-bit                   6               *TODO*                               0x00000000
     Buffer Fat Pointer (experimental) 7               *TODO*
     ================================= =============== =========== ================ ======= ============================

**Generic**
  The generic address space is supported unless the *Target Properties* column
  of :ref:`amdgpu-processor-table` specifies *Does not support generic address
  space*.

  The generic address space uses the hardware flat address support for two
  fixed ranges of virtual addresses (the private and local apertures), that are
  outside the range of addressable global memory, to map from a flat address to
  a private or local address. This uses FLAT instructions that can take a flat
  address and access global, private (scratch), and group (LDS) memory
  depending on whether the address is within one of the aperture ranges.

  Flat access to scratch requires hardware aperture setup and setup in the
  kernel prologue (see :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`). Flat
  access to LDS requires hardware aperture setup and M0 (GFX7-GFX8) register
  setup (see :ref:`amdgpu-amdhsa-kernel-prolog-m0`).

  To convert between a private or group address space address (termed a segment
  address) and a flat address, the base address of the corresponding aperture
  can be used. For GFX7-GFX8 these are available in the
  :ref:`amdgpu-amdhsa-hsa-aql-queue`, the address of which can be obtained with
  the Queue Ptr SGPR (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
  For GFX9-GFX10 the aperture base addresses are directly available as inline
  constant registers ``SRC_SHARED_BASE/LIMIT`` and ``SRC_PRIVATE_BASE/LIMIT``.
  In 64-bit address mode the aperture sizes are 2^32 bytes and the base is
  aligned to 2^32, which makes it easier to convert from flat to segment or
  segment to flat.

  A global address space address has the same value when used as a flat
  address, so no conversion is needed.

**Global and Constant**
  The global and constant address spaces both use global virtual addresses,
  which are the same virtual address space used by the CPU. However, some
  virtual addresses may only be accessible to the CPU, some only accessible
  by the GPU, and some by both.

  Using the constant address space indicates that the data will not change
  during the execution of the kernel. This allows scalar read instructions to
  be used. As the constant address space can only be modified on the host
  side, a generic pointer loaded from the constant address space is safe to be
  assumed to be a global pointer since only the device global memory is visible
  and managed on the host side. The vector and scalar L1 caches are invalidated
  of volatile data before each kernel dispatch execution to allow constant
  memory to change values between kernel dispatches.

**Region**
  The region address space uses the hardware Global Data Store (GDS). All
  wavefronts executing on the same device will access the same memory for any
  given region address. However, the same region address accessed by wavefronts
  executing on different devices will access different memory. It is higher
  performance than global memory. It is allocated by the runtime. The data
  store (DS) instructions can be used to access it.

**Local**
  The local address space uses the hardware Local Data Store (LDS) which is
  automatically allocated when the hardware creates the wavefronts of a
  work-group, and freed when all the wavefronts of a work-group have
  terminated. All wavefronts belonging to the same work-group will access the
  same memory for any given local address. However, the same local address
  accessed by wavefronts belonging to different work-groups will access
  different memory. It is higher performance than global memory. The data store
  (DS) instructions can be used to access it.

**Private**
  The private address space uses the hardware scratch memory support which
  automatically allocates memory when it creates a wavefront and frees it when
  a wavefront terminates. The memory accessed by a lane of a wavefront for any
  given private address will be different from the memory accessed by another
  lane of the same or a different wavefront for the same private address.

  If a kernel dispatch uses scratch, then the hardware allocates memory from a
  pool of backing memory allocated by the runtime for each wavefront. The lanes
  of the wavefront access this using dword (4 byte) interleaving. The mapping
  used from private address to backing memory address is:

  ``wavefront-scratch-base +
  ((private-address / 4) * wavefront-size * 4) +
  (wavefront-lane-id * 4) + (private-address % 4)``

  If each lane of a wavefront accesses the same private address, the
  interleaving results in adjacent dwords being accessed and hence requires
  fewer cache lines to be fetched.

  There are different ways that the wavefront scratch base address is
  determined by a wavefront (see
  :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).

  Scratch memory can be accessed in an interleaved manner using buffer
  instructions with the scratch buffer descriptor and per wavefront scratch
  offset, by the scratch instructions, or by flat instructions. Multi-dword
  access is not supported except by flat and scratch instructions in
  GFX9-GFX10.

**Constant 32-bit**
  *TODO*

**Buffer Fat Pointer**
  The buffer fat pointer is an experimental address space that is currently
  unsupported in the backend. It exposes a non-integral pointer that is in
  the future intended to support the modelling of 128-bit buffer descriptors
  plus a 32-bit offset into the buffer (in total encapsulating a 160-bit
  *pointer*), allowing normal LLVM load/store/atomic operations to be used to
  model the buffer descriptors used heavily in graphics workloads targeting
  the backend.

.. _amdgpu-memory-scopes:

Memory Scopes
-------------

This section provides LLVM memory synchronization scopes supported by the
AMDGPU backend memory model when the target triple OS is ``amdhsa`` (see
:ref:`amdgpu-amdhsa-memory-model` and :ref:`amdgpu-target-triples`).

The memory model supported is based on the HSA memory model [HSA]_ which is
based in turn on HRF-indirect with scope inclusion [HRF]_.

The happens-before 748relation is transitive over the synchronizes-with relation independent of scope 749and synchronizes-with allows the memory scope instances to be inclusive (see 750table :ref:`amdgpu-amdhsa-llvm-sync-scopes-table`). 751 752This is different to the OpenCL [OpenCL]_ memory model which does not have scope 753inclusion and requires the memory scopes to exactly match. However, this 754is conservatively correct for OpenCL. 755 756 .. table:: AMDHSA LLVM Sync Scopes 757 :name: amdgpu-amdhsa-llvm-sync-scopes-table 758 759 ======================= =================================================== 760 LLVM Sync Scope Description 761 ======================= =================================================== 762 *none* The default: ``system``. 763 764 Synchronizes with, and participates in modification 765 and seq_cst total orderings with, other operations 766 (except image operations) for all address spaces 767 (except private, or generic that accesses private) 768 provided the other operation's sync scope is: 769 770 - ``system``. 771 - ``agent`` and executed by a thread on the same 772 agent. 773 - ``workgroup`` and executed by a thread in the 774 same work-group. 775 - ``wavefront`` and executed by a thread in the 776 same wavefront. 777 778 ``agent`` Synchronizes with, and participates in modification 779 and seq_cst total orderings with, other operations 780 (except image operations) for all address spaces 781 (except private, or generic that accesses private) 782 provided the other operation's sync scope is: 783 784 - ``system`` or ``agent`` and executed by a thread 785 on the same agent. 786 - ``workgroup`` and executed by a thread in the 787 same work-group. 788 - ``wavefront`` and executed by a thread in the 789 same wavefront. 
790 791 ``workgroup`` Synchronizes with, and participates in modification 792 and seq_cst total orderings with, other operations 793 (except image operations) for all address spaces 794 (except private, or generic that accesses private) 795 provided the other operation's sync scope is: 796 797 - ``system``, ``agent`` or ``workgroup`` and 798 executed by a thread in the same work-group. 799 - ``wavefront`` and executed by a thread in the 800 same wavefront. 801 802 ``wavefront`` Synchronizes with, and participates in modification 803 and seq_cst total orderings with, other operations 804 (except image operations) for all address spaces 805 (except private, or generic that accesses private) 806 provided the other operation's sync scope is: 807 808 - ``system``, ``agent``, ``workgroup`` or 809 ``wavefront`` and executed by a thread in the 810 same wavefront. 811 812 ``singlethread`` Only synchronizes with and participates in 813 modification and seq_cst total orderings with, 814 other operations (except image operations) running 815 in the same thread for all address spaces (for 816 example, in signal handlers). 817 818 ``one-as`` Same as ``system`` but only synchronizes with other 819 operations within the same address space. 820 821 ``agent-one-as`` Same as ``agent`` but only synchronizes with other 822 operations within the same address space. 823 824 ``workgroup-one-as`` Same as ``workgroup`` but only synchronizes with 825 other operations within the same address space. 826 827 ``wavefront-one-as`` Same as ``wavefront`` but only synchronizes with 828 other operations within the same address space. 829 830 ``singlethread-one-as`` Same as ``singlethread`` but only synchronizes with 831 other operations within the same address space. 832 ======================= =================================================== 833 834LLVM IR Intrinsics 835------------------ 836 837The AMDGPU backend implements the following LLVM IR intrinsics. 
*This section is WIP.*

.. TODO::

   List AMDGPU intrinsics.

LLVM IR Attributes
------------------

The AMDGPU backend supports the following LLVM IR attributes.

  .. table:: AMDGPU LLVM IR Attributes
     :name: amdgpu-llvm-ir-attributes-table

     ======================================= ==========================================================
     LLVM Attribute                          Description
     ======================================= ==========================================================
     "amdgpu-flat-work-group-size"="min,max" Specify the minimum and maximum flat work group sizes that
                                             will be specified when the kernel is dispatched. Generated
                                             by the ``amdgpu_flat_work_group_size`` CLANG attribute [CLANG-ATTR]_.
     "amdgpu-implicitarg-num-bytes"="n"      Number of kernel argument bytes to add to the kernel
                                             argument block size for the implicit arguments. This
                                             varies by OS and language (for OpenCL see
                                             :ref:`opencl-kernel-implicit-arguments-appended-for-amdhsa-os-table`).
     "amdgpu-num-sgpr"="n"                   Specifies the number of SGPRs to use. Generated by
                                             the ``amdgpu_num_sgpr`` CLANG attribute [CLANG-ATTR]_.
     "amdgpu-num-vgpr"="n"                   Specifies the number of VGPRs to use. Generated by the
                                             ``amdgpu_num_vgpr`` CLANG attribute [CLANG-ATTR]_.
     "amdgpu-waves-per-eu"="m,n"             Specify the minimum and maximum number of waves per
                                             execution unit. Generated by the ``amdgpu_waves_per_eu``
                                             CLANG attribute [CLANG-ATTR]_.
     "amdgpu-ieee" true/false.               Specify whether the function expects the IEEE field of the
                                             mode register to be set on entry. Overrides the default for
                                             the calling convention.
     "amdgpu-dx10-clamp" true/false.         Specify whether the function expects the DX10_CLAMP field
                                             of the mode register to be set on entry. Overrides the
                                             default for the calling convention.
     ======================================= ==========================================================
.. _amdgpu-elf-code-object:

ELF Code Object
===============

The AMDGPU backend generates a standard ELF [ELF]_ relocatable code object that
can be linked by ``lld`` to produce a standard ELF shared code object which can
be loaded and executed on an AMDGPU target.

.. _amdgpu-elf-header:

Header
------

The AMDGPU backend uses the following ELF header:

  .. table:: AMDGPU ELF Header
     :name: amdgpu-elf-header-table

     ========================== ===============================
     Field                      Value
     ========================== ===============================
     ``e_ident[EI_CLASS]``      ``ELFCLASS64``
     ``e_ident[EI_DATA]``       ``ELFDATA2LSB``
     ``e_ident[EI_OSABI]``      - ``ELFOSABI_NONE``
                                - ``ELFOSABI_AMDGPU_HSA``
                                - ``ELFOSABI_AMDGPU_PAL``
                                - ``ELFOSABI_AMDGPU_MESA3D``
     ``e_ident[EI_ABIVERSION]`` - ``ELFABIVERSION_AMDGPU_HSA_V2``
                                - ``ELFABIVERSION_AMDGPU_HSA_V3``
                                - ``ELFABIVERSION_AMDGPU_HSA_V4``
                                - ``ELFABIVERSION_AMDGPU_PAL``
                                - ``ELFABIVERSION_AMDGPU_MESA3D``
     ``e_type``                 - ``ET_REL``
                                - ``ET_DYN``
     ``e_machine``              ``EM_AMDGPU``
     ``e_entry``                0
     ``e_flags``                See :ref:`amdgpu-elf-header-e_flags-v2-table`,
                                :ref:`amdgpu-elf-header-e_flags-table-v3`,
                                and :ref:`amdgpu-elf-header-e_flags-table-v4`
     ========================== ===============================

..

  .. table:: AMDGPU ELF Header Enumeration Values
     :name: amdgpu-elf-header-enumeration-values-table

     =============================== =====
     Name                            Value
     =============================== =====
     ``EM_AMDGPU``                   224
     ``ELFOSABI_NONE``               0
     ``ELFOSABI_AMDGPU_HSA``         64
     ``ELFOSABI_AMDGPU_PAL``         65
     ``ELFOSABI_AMDGPU_MESA3D``      66
     ``ELFABIVERSION_AMDGPU_HSA_V2`` 0
     ``ELFABIVERSION_AMDGPU_HSA_V3`` 1
     ``ELFABIVERSION_AMDGPU_HSA_V4`` 2
     ``ELFABIVERSION_AMDGPU_PAL``    0
     ``ELFABIVERSION_AMDGPU_MESA3D`` 0
     =============================== =====

``e_ident[EI_CLASS]``
  The ELF class is:

  * ``ELFCLASS32`` for ``r600`` architecture.

  * ``ELFCLASS64`` for ``amdgcn`` architecture which only supports 64-bit
    process address space applications.

``e_ident[EI_DATA]``
  All AMDGPU targets use ``ELFDATA2LSB`` for little-endian byte ordering.

``e_ident[EI_OSABI]``
  One of the following AMDGPU target architecture specific OS ABIs
  (see :ref:`amdgpu-os`):

  * ``ELFOSABI_NONE`` for *unknown* OS.

  * ``ELFOSABI_AMDGPU_HSA`` for ``amdhsa`` OS.

  * ``ELFOSABI_AMDGPU_PAL`` for ``amdpal`` OS.

  * ``ELFOSABI_AMDGPU_MESA3D`` for ``mesa3d`` OS.

``e_ident[EI_ABIVERSION]``
  The ABI version of the AMDGPU target architecture specific OS ABI to which
  the code object conforms:

  * ``ELFABIVERSION_AMDGPU_HSA_V2`` is used to specify the version of AMD HSA
    runtime ABI for code object V2. Specify using the Clang option
    ``-mcode-object-version=2``.

  * ``ELFABIVERSION_AMDGPU_HSA_V3`` is used to specify the version of AMD HSA
    runtime ABI for code object V3. Specify using the Clang option
    ``-mcode-object-version=3``. This is the default code object version if not
    specified.

  * ``ELFABIVERSION_AMDGPU_HSA_V4`` is used to specify the version of AMD HSA
    runtime ABI for code object V4. Specify using the Clang option
    ``-mcode-object-version=4``.

  * ``ELFABIVERSION_AMDGPU_PAL`` is used to specify the version of AMD PAL
    runtime ABI.

  * ``ELFABIVERSION_AMDGPU_MESA3D`` is used to specify the version of AMD MESA
    3D runtime ABI.

``e_type``
  Can be one of the following values:

  ``ET_REL``
    The type produced by the AMDGPU backend compiler as it is a relocatable
    code object.

  ``ET_DYN``
    The type produced by the linker as it is a shared code object.

  The AMD HSA runtime loader requires a ``ET_DYN`` code object.

``e_machine``
  The value ``EM_AMDGPU`` is used for the machine for all processors supported
  by the ``r600`` and ``amdgcn`` architectures (see
  :ref:`amdgpu-processor-table`). The specific processor is specified in the
  ``NT_AMD_HSA_ISA_VERSION`` note record for code object V2 (see
  :ref:`amdgpu-note-records-v2`) and in the ``EF_AMDGPU_MACH`` bit field of the
  ``e_flags`` for code object V3 to V4 (see
  :ref:`amdgpu-elf-header-e_flags-table-v3` and
  :ref:`amdgpu-elf-header-e_flags-table-v4`).

``e_entry``
  The entry point is 0 as the entry points for individual kernels must be
  selected in order to invoke them through AQL packets.

``e_flags``
  The AMDGPU backend uses the following ELF header flags:

  .. table:: AMDGPU ELF Header ``e_flags`` for Code Object V2
     :name: amdgpu-elf-header-e_flags-v2-table

     ===================================== ===== =============================
     Name                                  Value Description
     ===================================== ===== =============================
     ``EF_AMDGPU_FEATURE_XNACK_V2``        0x01  Indicates if the ``xnack``
                                                 target feature is enabled for
                                                 all code contained in the
                                                 code object. If the processor
                                                 does not support the
                                                 ``xnack`` target feature then
                                                 must be 0. See
                                                 :ref:`amdgpu-target-features`.
     ``EF_AMDGPU_FEATURE_TRAP_HANDLER_V2`` 0x02  Indicates if the trap handler
                                                 is enabled for all code
                                                 contained in the code object.
                                                 If the processor does not
                                                 support a trap handler then
                                                 must be 0. See
                                                 :ref:`amdgpu-target-features`.
     ===================================== ===== =============================

  .. table:: AMDGPU ELF Header ``e_flags`` for Code Object V3
     :name: amdgpu-elf-header-e_flags-table-v3

     ================================= ===== =============================
     Name                              Value Description
     ================================= ===== =============================
     ``EF_AMDGPU_MACH``                0x0ff AMDGPU processor selection
                                             mask for ``EF_AMDGPU_MACH_xxx``
                                             values defined in
                                             :ref:`amdgpu-ef-amdgpu-mach-table`.
     ``EF_AMDGPU_FEATURE_XNACK_V3``    0x100 Indicates if the ``xnack``
                                             target feature is enabled for
                                             all code contained in the code
                                             object. If the processor does
                                             not support the ``xnack``
                                             target feature then must be 0.
                                             See
                                             :ref:`amdgpu-target-features`.
     ``EF_AMDGPU_FEATURE_SRAMECC_V3``  0x200 Indicates if the ``sramecc``
                                             target feature is enabled for
                                             all code contained in the code
                                             object. If the processor does
                                             not support the ``sramecc``
                                             target feature then must be 0.
                                             See
                                             :ref:`amdgpu-target-features`.
     ================================= ===== =============================

  .. table:: AMDGPU ELF Header ``e_flags`` for Code Object V4
     :name: amdgpu-elf-header-e_flags-table-v4

     ============================================ ===== ===================================
     Name                                         Value Description
     ============================================ ===== ===================================
     ``EF_AMDGPU_MACH``                           0x0ff AMDGPU processor selection
                                                        mask for ``EF_AMDGPU_MACH_xxx``
                                                        values defined in
                                                        :ref:`amdgpu-ef-amdgpu-mach-table`.
     ``EF_AMDGPU_FEATURE_XNACK_V4``               0x300 XNACK selection mask for
                                                        ``EF_AMDGPU_FEATURE_XNACK_*_V4``
                                                        values.
     ``EF_AMDGPU_FEATURE_XNACK_UNSUPPORTED_V4``   0x000 XNACK unsupported.
     ``EF_AMDGPU_FEATURE_XNACK_ANY_V4``           0x100 XNACK can have any value.
     ``EF_AMDGPU_FEATURE_XNACK_OFF_V4``           0x200 XNACK disabled.
     ``EF_AMDGPU_FEATURE_XNACK_ON_V4``            0x300 XNACK enabled.
     ``EF_AMDGPU_FEATURE_SRAMECC_V4``             0xc00 SRAMECC selection mask for
                                                        ``EF_AMDGPU_FEATURE_SRAMECC_*_V4``
                                                        values.
     ``EF_AMDGPU_FEATURE_SRAMECC_UNSUPPORTED_V4`` 0x000 SRAMECC unsupported.
     ``EF_AMDGPU_FEATURE_SRAMECC_ANY_V4``         0x400 SRAMECC can have any value.
     ``EF_AMDGPU_FEATURE_SRAMECC_OFF_V4``         0x800 SRAMECC disabled.
     ``EF_AMDGPU_FEATURE_SRAMECC_ON_V4``          0xc00 SRAMECC enabled.
     ============================================ ===== ===================================

  .. table:: AMDGPU ``EF_AMDGPU_MACH`` Values
     :name: amdgpu-ef-amdgpu-mach-table

     ==================================== ========== =============================
     Name                                 Value      Description (see
                                                     :ref:`amdgpu-processor-table`)
     ==================================== ========== =============================
     ``EF_AMDGPU_MACH_NONE``              0x000      *not specified*
     ``EF_AMDGPU_MACH_R600_R600``         0x001      ``r600``
     ``EF_AMDGPU_MACH_R600_R630``         0x002      ``r630``
     ``EF_AMDGPU_MACH_R600_RS880``        0x003      ``rs880``
     ``EF_AMDGPU_MACH_R600_RV670``        0x004      ``rv670``
     ``EF_AMDGPU_MACH_R600_RV710``        0x005      ``rv710``
     ``EF_AMDGPU_MACH_R600_RV730``        0x006      ``rv730``
     ``EF_AMDGPU_MACH_R600_RV770``        0x007      ``rv770``
     ``EF_AMDGPU_MACH_R600_CEDAR``        0x008      ``cedar``
     ``EF_AMDGPU_MACH_R600_CYPRESS``      0x009      ``cypress``
     ``EF_AMDGPU_MACH_R600_JUNIPER``      0x00a      ``juniper``
     ``EF_AMDGPU_MACH_R600_REDWOOD``      0x00b      ``redwood``
     ``EF_AMDGPU_MACH_R600_SUMO``         0x00c      ``sumo``
     ``EF_AMDGPU_MACH_R600_BARTS``        0x00d      ``barts``
     ``EF_AMDGPU_MACH_R600_CAICOS``       0x00e      ``caicos``
     ``EF_AMDGPU_MACH_R600_CAYMAN``       0x00f      ``cayman``
     ``EF_AMDGPU_MACH_R600_TURKS``        0x010      ``turks``
     *reserved*                           0x011 -    Reserved for ``r600``
                                          0x01f      architecture processors.
     ``EF_AMDGPU_MACH_AMDGCN_GFX600``     0x020      ``gfx600``
     ``EF_AMDGPU_MACH_AMDGCN_GFX601``     0x021      ``gfx601``
     ``EF_AMDGPU_MACH_AMDGCN_GFX700``     0x022      ``gfx700``
     ``EF_AMDGPU_MACH_AMDGCN_GFX701``     0x023      ``gfx701``
     ``EF_AMDGPU_MACH_AMDGCN_GFX702``     0x024      ``gfx702``
     ``EF_AMDGPU_MACH_AMDGCN_GFX703``     0x025      ``gfx703``
     ``EF_AMDGPU_MACH_AMDGCN_GFX704``     0x026      ``gfx704``
     *reserved*                           0x027      Reserved.
     ``EF_AMDGPU_MACH_AMDGCN_GFX801``     0x028      ``gfx801``
     ``EF_AMDGPU_MACH_AMDGCN_GFX802``     0x029      ``gfx802``
     ``EF_AMDGPU_MACH_AMDGCN_GFX803``     0x02a      ``gfx803``
     ``EF_AMDGPU_MACH_AMDGCN_GFX810``     0x02b      ``gfx810``
     ``EF_AMDGPU_MACH_AMDGCN_GFX900``     0x02c      ``gfx900``
     ``EF_AMDGPU_MACH_AMDGCN_GFX902``     0x02d      ``gfx902``
     ``EF_AMDGPU_MACH_AMDGCN_GFX904``     0x02e      ``gfx904``
     ``EF_AMDGPU_MACH_AMDGCN_GFX906``     0x02f      ``gfx906``
     ``EF_AMDGPU_MACH_AMDGCN_GFX908``     0x030      ``gfx908``
     ``EF_AMDGPU_MACH_AMDGCN_GFX909``     0x031      ``gfx909``
     ``EF_AMDGPU_MACH_AMDGCN_GFX90C``     0x032      ``gfx90c``
     ``EF_AMDGPU_MACH_AMDGCN_GFX1010``    0x033      ``gfx1010``
     ``EF_AMDGPU_MACH_AMDGCN_GFX1011``    0x034      ``gfx1011``
     ``EF_AMDGPU_MACH_AMDGCN_GFX1012``    0x035      ``gfx1012``
     ``EF_AMDGPU_MACH_AMDGCN_GFX1030``    0x036      ``gfx1030``
     ``EF_AMDGPU_MACH_AMDGCN_GFX1031``    0x037      ``gfx1031``
     ``EF_AMDGPU_MACH_AMDGCN_GFX1032``    0x038      ``gfx1032``
     ``EF_AMDGPU_MACH_AMDGCN_GFX1033``    0x039      ``gfx1033``
     ``EF_AMDGPU_MACH_AMDGCN_GFX602``     0x03a      ``gfx602``
     ``EF_AMDGPU_MACH_AMDGCN_GFX705``     0x03b      ``gfx705``
     ``EF_AMDGPU_MACH_AMDGCN_GFX805``     0x03c      ``gfx805``
     ``EF_AMDGPU_MACH_AMDGCN_GFX1035``    0x03d      ``gfx1035``
     ``EF_AMDGPU_MACH_AMDGCN_GFX1034``    0x03e      ``gfx1034``
     ``EF_AMDGPU_MACH_AMDGCN_GFX90A``     0x03f      ``gfx90a``
     *reserved*                           0x040      Reserved.
     *reserved*                           0x041      Reserved.
     ``EF_AMDGPU_MACH_AMDGCN_GFX1013``    0x042      ``gfx1013``
     *reserved*                           0x043      Reserved.
     *reserved*                           0x044      Reserved.
     *reserved*                           0x045      Reserved.
     ==================================== ========== =============================

Sections
--------

An AMDGPU target ELF code object has the standard ELF sections which include:

  .. table:: AMDGPU ELF Sections
     :name: amdgpu-elf-sections-table

     ================== ================ =================================
     Name               Type             Attributes
     ================== ================ =================================
     ``.bss``           ``SHT_NOBITS``   ``SHF_ALLOC`` + ``SHF_WRITE``
     ``.data``          ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
     ``.debug_``\ *\**  ``SHT_PROGBITS`` *none*
     ``.dynamic``       ``SHT_DYNAMIC``  ``SHF_ALLOC``
     ``.dynstr``        ``SHT_PROGBITS`` ``SHF_ALLOC``
     ``.dynsym``        ``SHT_PROGBITS`` ``SHF_ALLOC``
     ``.got``           ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
     ``.hash``          ``SHT_HASH``     ``SHF_ALLOC``
     ``.note``          ``SHT_NOTE``     *none*
     ``.rela``\ *name*  ``SHT_RELA``     *none*
     ``.rela.dyn``      ``SHT_RELA``     *none*
     ``.rodata``        ``SHT_PROGBITS`` ``SHF_ALLOC``
     ``.shstrtab``      ``SHT_STRTAB``   *none*
     ``.strtab``        ``SHT_STRTAB``   *none*
     ``.symtab``        ``SHT_SYMTAB``   *none*
     ``.text``          ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_EXECINSTR``
     ================== ================ =================================

These sections have their standard meanings (see [ELF]_) and are only generated
if needed.

``.debug``\ *\**
  The standard DWARF sections. See :ref:`amdgpu-dwarf-debug-information` for
  information on the DWARF produced by the AMDGPU backend.

``.dynamic``, ``.dynstr``, ``.dynsym``, ``.hash``
  The standard sections used by a dynamic loader.

``.note``
  See :ref:`amdgpu-note-records` for the note records supported by the AMDGPU
  backend.
``.rela``\ *name*, ``.rela.dyn``
  For relocatable code objects, *name* is the name of the section to which the
  relocation records apply. For example, ``.rela.text`` is the section name for
  relocation records associated with the ``.text`` section.

  For linked shared code objects, ``.rela.dyn`` contains all the relocation
  records from each of the relocatable code object's ``.rela``\ *name*
  sections.

  See :ref:`amdgpu-relocation-records` for the relocation records supported by
  the AMDGPU backend.

``.text``
  The executable machine code for the kernels and functions they call.
  Generated as position independent code. See :ref:`amdgpu-code-conventions`
  for information on conventions used in the ISA generation.

.. _amdgpu-note-records:

Note Records
------------

The AMDGPU backend code object contains ELF note records in the ``.note``
section. The set of generated notes and their semantics depend on the code
object version; see :ref:`amdgpu-note-records-v2` and
:ref:`amdgpu-note-records-v3-v4`.

As required by ``ELFCLASS32`` and ``ELFCLASS64``, minimal zero-byte padding
must be generated after the ``name`` field to ensure the ``desc`` field is 4
byte aligned. In addition, minimal zero-byte padding must be generated to
ensure the ``desc`` field size is a multiple of 4 bytes. The ``sh_addralign``
field of the ``.note`` section must be at least 4 to indicate at least 8 byte
alignment.

.. _amdgpu-note-records-v2:

Code Object V2 Note Records
~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. warning::

  Code object V2 is not the default code object version emitted by
  this version of LLVM.

The AMDGPU backend code object uses the following ELF note record in the
``.note`` section when compiling for code object V2.

The note record vendor field is "AMD".
Additional note records may be present, but any which are not documented here
are deprecated and should not be used.

  .. table:: AMDGPU Code Object V2 ELF Note Records
     :name: amdgpu-elf-note-records-v2-table

     ===== ===================================== ======================================
     Name  Type                                  Description
     ===== ===================================== ======================================
     "AMD" ``NT_AMD_HSA_CODE_OBJECT_VERSION``    Code object version.
     "AMD" ``NT_AMD_HSA_HSAIL``                  HSAIL properties generated by the
                                                 HSAIL Finalizer and not the LLVM
                                                 compiler.
     "AMD" ``NT_AMD_HSA_ISA_VERSION``            Target ISA version.
     "AMD" ``NT_AMD_HSA_METADATA``               Metadata null terminated string in
                                                 YAML [YAML]_ textual format.
     "AMD" ``NT_AMD_HSA_ISA_NAME``               Target ISA name.
     ===== ===================================== ======================================

..

  .. table:: AMDGPU Code Object V2 ELF Note Record Enumeration Values
     :name: amdgpu-elf-note-record-enumeration-values-v2-table

     ===================================== =====
     Name                                  Value
     ===================================== =====
     ``NT_AMD_HSA_CODE_OBJECT_VERSION``    1
     ``NT_AMD_HSA_HSAIL``                  2
     ``NT_AMD_HSA_ISA_VERSION``            3
     *reserved*                            4-9
     ``NT_AMD_HSA_METADATA``               10
     ``NT_AMD_HSA_ISA_NAME``               11
     ===================================== =====

``NT_AMD_HSA_CODE_OBJECT_VERSION``
  Specifies the code object version number. The description field has the
  following layout:

  .. code:: c

    struct amdgpu_hsa_note_code_object_version_s {
      uint32_t major_version;
      uint32_t minor_version;
    };

  The ``major_version`` has a value less than or equal to 2.

``NT_AMD_HSA_HSAIL``
  Specifies the HSAIL properties used by the HSAIL Finalizer. The description
  field has the following layout:
  .. code:: c

    struct amdgpu_hsa_note_hsail_s {
      uint32_t hsail_major_version;
      uint32_t hsail_minor_version;
      uint8_t profile;
      uint8_t machine_model;
      uint8_t default_float_round;
    };

``NT_AMD_HSA_ISA_VERSION``
  Specifies the target ISA version. The description field has the following
  layout:

  .. code:: c

    struct amdgpu_hsa_note_isa_s {
      uint16_t vendor_name_size;
      uint16_t architecture_name_size;
      uint32_t major;
      uint32_t minor;
      uint32_t stepping;
      char vendor_and_architecture_name[1];
    };

  ``vendor_name_size`` and ``architecture_name_size`` are the lengths of the
  vendor and architecture names respectively, including the NUL character.

  ``vendor_and_architecture_name`` contains the NUL terminated string for the
  vendor, immediately followed by the NUL terminated string for the
  architecture.

  This note record is used by the HSA runtime loader.

  Code object V2 only supports a limited number of processors and has fixed
  settings for target features. See
  :ref:`amdgpu-elf-note-record-supported_processors-v2-table` for a list of
  processors and the corresponding target ID. In the table the note record ISA
  name is a concatenation of the vendor name, architecture name, major, minor,
  and stepping separated by a ":".

  The target ID column shows the processor name and fixed target features used
  by the LLVM compiler. The LLVM compiler does not generate a
  ``NT_AMD_HSA_HSAIL`` note record.

  A code object generated by the Finalizer also uses code object V2 and always
  generates a ``NT_AMD_HSA_HSAIL`` note record. The processor name and
  ``sramecc`` target feature is as shown in
  :ref:`amdgpu-elf-note-record-supported_processors-v2-table` but the ``xnack``
  target feature is specified by the ``EF_AMDGPU_FEATURE_XNACK_V2``
  ``e_flags`` bit.
``NT_AMD_HSA_ISA_NAME``
  Specifies the target ISA name as a non-NUL terminated string.

  This note record is not used by the HSA runtime loader.

  See the ``NT_AMD_HSA_ISA_VERSION`` note record description of the code
  object V2's limited support of processors and fixed settings for target
  features.

  See :ref:`amdgpu-elf-note-record-supported_processors-v2-table` for a mapping
  from the string to the corresponding target ID. If the ``xnack`` target
  feature is supported and enabled, the string produced by the LLVM compiler
  may have ``+xnack`` appended. The Finalizer did not do the appending and
  instead used the ``EF_AMDGPU_FEATURE_XNACK_V2`` ``e_flags`` bit.

``NT_AMD_HSA_METADATA``
  Specifies extensible metadata associated with the code objects executed on
  HSA [HSA]_ compatible runtimes (see :ref:`amdgpu-os`). It is required when
  the target triple OS is ``amdhsa`` (see :ref:`amdgpu-target-triples`). See
  :ref:`amdgpu-amdhsa-code-object-metadata-v2` for the syntax of the code
  object metadata string.
  .. table:: AMDGPU Code Object V2 Supported Processors and Fixed Target Feature Settings
     :name: amdgpu-elf-note-record-supported_processors-v2-table

     ===================== ==========================
     Note Record ISA Name  Target ID
     ===================== ==========================
     ``AMD:AMDGPU:6:0:0``  ``gfx600``
     ``AMD:AMDGPU:6:0:1``  ``gfx601``
     ``AMD:AMDGPU:6:0:2``  ``gfx602``
     ``AMD:AMDGPU:7:0:0``  ``gfx700``
     ``AMD:AMDGPU:7:0:1``  ``gfx701``
     ``AMD:AMDGPU:7:0:2``  ``gfx702``
     ``AMD:AMDGPU:7:0:3``  ``gfx703``
     ``AMD:AMDGPU:7:0:4``  ``gfx704``
     ``AMD:AMDGPU:7:0:5``  ``gfx705``
     ``AMD:AMDGPU:8:0:0``  ``gfx802``
     ``AMD:AMDGPU:8:0:1``  ``gfx801:xnack+``
     ``AMD:AMDGPU:8:0:2``  ``gfx802``
     ``AMD:AMDGPU:8:0:3``  ``gfx803``
     ``AMD:AMDGPU:8:0:4``  ``gfx803``
     ``AMD:AMDGPU:8:0:5``  ``gfx805``
     ``AMD:AMDGPU:8:1:0``  ``gfx810:xnack+``
     ``AMD:AMDGPU:9:0:0``  ``gfx900:xnack-``
     ``AMD:AMDGPU:9:0:1``  ``gfx900:xnack+``
     ``AMD:AMDGPU:9:0:2``  ``gfx902:xnack-``
     ``AMD:AMDGPU:9:0:3``  ``gfx902:xnack+``
     ``AMD:AMDGPU:9:0:4``  ``gfx904:xnack-``
     ``AMD:AMDGPU:9:0:5``  ``gfx904:xnack+``
     ``AMD:AMDGPU:9:0:6``  ``gfx906:sramecc-:xnack-``
     ``AMD:AMDGPU:9:0:7``  ``gfx906:sramecc-:xnack+``
     ``AMD:AMDGPU:9:0:12`` ``gfx90c:xnack-``
     ===================== ==========================

.. _amdgpu-note-records-v3-v4:

Code Object V3 to V4 Note Records
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The AMDGPU backend code object uses the following ELF note record in the
``.note`` section when compiling for code object V3 to V4.

The note record vendor field is "AMDGPU".

Additional note records may be present, but any which are not documented here
are deprecated and should not be used.
  .. table:: AMDGPU Code Object V3 to V4 ELF Note Records
     :name: amdgpu-elf-note-records-table-v3-v4

     ======== ============================== ======================================
     Name     Type                           Description
     ======== ============================== ======================================
     "AMDGPU" ``NT_AMDGPU_METADATA``         Metadata in Message Pack [MsgPack]_
                                             binary format.
     ======== ============================== ======================================

..

  .. table:: AMDGPU Code Object V3 to V4 ELF Note Record Enumeration Values
     :name: amdgpu-elf-note-record-enumeration-values-table-v3-v4

     ============================== =====
     Name                           Value
     ============================== =====
     *reserved*                     0-31
     ``NT_AMDGPU_METADATA``         32
     ============================== =====

``NT_AMDGPU_METADATA``
  Specifies extensible metadata associated with an AMDGPU code object. It is
  encoded as a map in the Message Pack [MsgPack]_ binary data format. See
  :ref:`amdgpu-amdhsa-code-object-metadata-v3` and
  :ref:`amdgpu-amdhsa-code-object-metadata-v4` for the map keys defined for the
  ``amdhsa`` OS.

.. _amdgpu-symbols:

Symbols
-------

Symbols include the following:
  .. table:: AMDGPU ELF Symbols
     :name: amdgpu-elf-symbols-table

     ===================== ================== ================ ======================
     Name                  Type               Section          Description
     ===================== ================== ================ ======================
     *link-name*           ``STT_OBJECT``     - ``.data``      Global variable
                                              - ``.rodata``
                                              - ``.bss``
     *link-name*\ ``.kd``  ``STT_OBJECT``     - ``.rodata``    Kernel descriptor
     *link-name*           ``STT_FUNC``       - ``.text``      Kernel entry point
     *link-name*           ``STT_OBJECT``     - SHN_AMDGPU_LDS Global variable in LDS
     ===================== ================== ================ ======================

Global variable
  Global variables both used and defined by the compilation unit.

  If the symbol is defined in the compilation unit then it is allocated in the
  appropriate section according to whether it has initialized data or is
  read-only.

  If the symbol is external then its section is ``STN_UNDEF`` and the loader
  will resolve relocations using the definition provided by another code object
  or explicitly defined by the runtime.

  If the symbol resides in local/group memory (LDS) then its section is the
  special processor specific section name ``SHN_AMDGPU_LDS``, and the
  ``st_value`` field describes alignment requirements as it does for common
  symbols.

  .. TODO::

     Add description of linked shared object symbols. Seems undefined symbols
     are marked as STT_NOTYPE.

Kernel descriptor
  Every HSA kernel has an associated kernel descriptor. It is the address of
  the kernel descriptor that is used in the AQL dispatch packet used to invoke
  the kernel, not the kernel entry point. The layout of the HSA kernel
  descriptor is defined in :ref:`amdgpu-amdhsa-kernel-descriptor`.

Kernel entry point
  Every HSA kernel also has a symbol for its machine code entry point.
.. _amdgpu-relocation-records:

Relocation Records
------------------

The AMDGPU backend generates ``Elf64_Rela`` relocation records. The supported
relocatable fields are:

``word32``
  This specifies a 32-bit field occupying 4 bytes with arbitrary byte
  alignment. These values use the same byte order as other word values in the
  AMDGPU architecture.

``word64``
  This specifies a 64-bit field occupying 8 bytes with arbitrary byte
  alignment. These values use the same byte order as other word values in the
  AMDGPU architecture.

The following notations are used for specifying relocation calculations:

**A**
  Represents the addend used to compute the value of the relocatable field.

**G**
  Represents the offset into the global offset table at which the relocation
  entry's symbol will reside during execution.

**GOT**
  Represents the address of the global offset table.

**P**
  Represents the place (section offset for ``et_rel`` or address for
  ``et_dyn``) of the storage unit being relocated (computed using
  ``r_offset``).

**S**
  Represents the value of the symbol whose index resides in the relocation
  entry. Relocations not using this must specify a symbol index of
  ``STN_UNDEF``.

**B**
  Represents the base address of a loaded executable or shared object which is
  the difference between the ELF address and the actual load address.
  Relocations using this are only valid in executable or shared objects.

The following relocation types are supported:
  .. table:: AMDGPU ELF Relocation Records
     :name: amdgpu-elf-relocation-records-table

     ========================== ======= ===== ========== ==============================
     Relocation Type            Kind    Value Field      Calculation
     ========================== ======= ===== ========== ==============================
     ``R_AMDGPU_NONE``                  0     *none*     *none*
     ``R_AMDGPU_ABS32_LO``      Static, 1     ``word32`` (S + A) & 0xFFFFFFFF
                                Dynamic
     ``R_AMDGPU_ABS32_HI``      Static, 2     ``word32`` (S + A) >> 32
                                Dynamic
     ``R_AMDGPU_ABS64``         Static, 3     ``word64`` S + A
                                Dynamic
     ``R_AMDGPU_REL32``         Static  4     ``word32`` S + A - P
     ``R_AMDGPU_REL64``         Static  5     ``word64`` S + A - P
     ``R_AMDGPU_ABS32``         Static, 6     ``word32`` S + A
                                Dynamic
     ``R_AMDGPU_GOTPCREL``      Static  7     ``word32`` G + GOT + A - P
     ``R_AMDGPU_GOTPCREL32_LO`` Static  8     ``word32`` (G + GOT + A - P) & 0xFFFFFFFF
     ``R_AMDGPU_GOTPCREL32_HI`` Static  9     ``word32`` (G + GOT + A - P) >> 32
     ``R_AMDGPU_REL32_LO``      Static  10    ``word32`` (S + A - P) & 0xFFFFFFFF
     ``R_AMDGPU_REL32_HI``      Static  11    ``word32`` (S + A - P) >> 32
     *reserved*                         12
     ``R_AMDGPU_RELATIVE64``    Dynamic 13    ``word64`` B + A
     ``R_AMDGPU_REL16``         Static  14    ``word16`` ((S + A - P) - 4) / 4
     ========================== ======= ===== ========== ==============================

``R_AMDGPU_ABS32_LO`` and ``R_AMDGPU_ABS32_HI`` are only supported by
the ``mesa3d`` OS, which does not support ``R_AMDGPU_ABS64``.

There is no current OS loader support for 32-bit programs and so
``R_AMDGPU_ABS32`` is not used.

.. _amdgpu-loaded-code-object-path-uniform-resource-identifier:

Loaded Code Object Path Uniform Resource Identifier (URI)
---------------------------------------------------------

The AMD GPU code object loader represents the path of the ELF shared object
from which the code object was loaded as a textual Uniform Resource Identifier
(URI).
Note that the code object is the in memory loaded relocated form of the ELF
shared object. Multiple code objects may be loaded at different memory
addresses in the same process from the same ELF shared object.

The loaded code object path URI syntax is defined by the following BNF syntax:

.. code::

  code_object_uri ::== file_uri | memory_uri
  file_uri        ::== "file://" file_path [ range_specifier ]
  memory_uri      ::== "memory://" process_id range_specifier
  range_specifier ::== [ "#" | "?" ] "offset=" number "&" "size=" number
  file_path       ::== URI_ENCODED_OS_FILE_PATH
  process_id      ::== DECIMAL_NUMBER
  number          ::== HEX_NUMBER | DECIMAL_NUMBER | OCTAL_NUMBER

**number**
  Is a C integral literal where hexadecimal values are prefixed by "0x" or
  "0X", and octal values by "0".

**file_path**
  Is the file's path specified as a URI encoded UTF-8 string. In URI encoding,
  every character that is not in the regular expression ``[a-zA-Z0-9/_.~-]`` is
  encoded as two uppercase hexadecimal digits preceded by "%". Directories in
  the path are separated by "/".

**offset**
  Is a 0-based byte offset to the start of the code object. For a file URI, it
  is from the start of the file specified by the ``file_path``, and if omitted
  defaults to 0. For a memory URI, it is the memory address and is required.

**size**
  Is the number of bytes in the code object. For a file URI, if omitted it
  defaults to the size of the file. It is required for a memory URI.

**process_id**
  Is the identity of the process owning the memory. For Linux it is the C
  unsigned integral decimal literal for the process ID (PID).

For example:

.. code::

  file:///dir1/dir2/file1
  file:///dir3/dir4/file2#offset=0x2000&size=3000
  memory://1234#offset=0x20000&size=3000
_amdgpu-dwarf-debug-information: 1643 1644DWARF Debug Information 1645======================= 1646 1647.. warning:: 1648 1649 This section describes **provisional support** for AMDGPU DWARF [DWARF]_ that 1650 is not currently fully implemented and is subject to change. 1651 1652AMDGPU generates DWARF [DWARF]_ debugging information ELF sections (see 1653:ref:`amdgpu-elf-code-object`) which contain information that maps the code 1654object executable code and data to the source language constructs. It can be 1655used by tools such as debuggers and profilers. It uses features defined in 1656:doc:`AMDGPUDwarfExtensionsForHeterogeneousDebugging` that are made available in 1657DWARF Version 4 and DWARF Version 5 as an LLVM vendor extension. 1658 1659This section defines the AMDGPU target architecture specific DWARF mappings. 1660 1661.. _amdgpu-dwarf-register-identifier: 1662 1663Register Identifier 1664------------------- 1665 1666This section defines the AMDGPU target architecture register numbers used in 1667DWARF operation expressions (see DWARF Version 5 section 2.5 and 1668:ref:`amdgpu-dwarf-operation-expressions`) and Call Frame Information 1669instructions (see DWARF Version 5 section 6.4 and 1670:ref:`amdgpu-dwarf-call-frame-information`). 1671 1672A single code object can contain code for kernels that have different wavefront 1673sizes. The vector registers and some scalar registers are based on the wavefront 1674size. AMDGPU defines distinct DWARF registers for each wavefront size. This 1675simplifies the consumer of the DWARF so that each register has a fixed size, 1676rather than being dynamic according to the wavefront size mode. Similarly, 1677distinct DWARF registers are defined for those registers that vary in size 1678according to the process address size. This allows a consumer to treat a 1679specific AMDGPU processor as a single architecture regardless of how it is 1680configured at run time. 
The compiler explicitly specifies the DWARF registers
that match the mode in which the code it is generating will be executed.

DWARF registers are encoded as numbers, which are mapped to architecture
registers. The mapping for AMDGPU is defined in
:ref:`amdgpu-dwarf-register-mapping-table`. All AMDGPU targets use the same
mapping.

.. table:: AMDGPU DWARF Register Mapping
   :name: amdgpu-dwarf-register-mapping-table

   ============== ================= ======== ===================================
   DWARF Register AMDGPU Register   Bit Size Description
   ============== ================= ======== ===================================
   0              PC_32             32       Program Counter (PC) when
                                             executing in a 32-bit process
                                             address space. Used in the CFI to
                                             describe the PC of the calling
                                             frame.
   1              EXEC_MASK_32      32       Execution Mask Register when
                                             executing in wavefront 32 mode.
   2-15           *Reserved*                 *Reserved for highly accessed
                                             registers using DWARF shortcut.*
   16             PC_64             64       Program Counter (PC) when
                                             executing in a 64-bit process
                                             address space. Used in the CFI to
                                             describe the PC of the calling
                                             frame.
   17             EXEC_MASK_64      64       Execution Mask Register when
                                             executing in wavefront 64 mode.
   18-31          *Reserved*                 *Reserved for highly accessed
                                             registers using DWARF shortcut.*
   32-95          SGPR0-SGPR63      32       Scalar General Purpose
                                             Registers.
   96-127         *Reserved*                 *Reserved for frequently accessed
                                             registers using DWARF 1-byte ULEB.*
   128            STATUS            32       Status Register.
   129-511        *Reserved*                 *Reserved for future Scalar
                                             Architectural Registers.*
   512            VCC_32            32       Vector Condition Code Register
                                             when executing in wavefront 32
                                             mode.
   513-767        *Reserved*                 *Reserved for future Vector
                                             Architectural Registers when
                                             executing in wavefront 32 mode.*
   768            VCC_64            64       Vector Condition Code Register
                                             when executing in wavefront 64
                                             mode.
1728 769-1023 *Reserved* *Reserved for future Vector 1729 Architectural Registers when 1730 executing in wavefront 64 mode.* 1731 1024-1087 *Reserved* *Reserved for padding.* 1732 1088-1129 SGPR64-SGPR105 32 Scalar General Purpose Registers. 1733 1130-1535 *Reserved* *Reserved for future Scalar 1734 General Purpose Registers.* 1735 1536-1791 VGPR0-VGPR255 32*32 Vector General Purpose Registers 1736 when executing in wavefront 32 1737 mode. 1738 1792-2047 *Reserved* *Reserved for future Vector 1739 General Purpose Registers when 1740 executing in wavefront 32 mode.* 1741 2048-2303 AGPR0-AGPR255 32*32 Vector Accumulation Registers 1742 when executing in wavefront 32 1743 mode. 1744 2304-2559 *Reserved* *Reserved for future Vector 1745 Accumulation Registers when 1746 executing in wavefront 32 mode.* 1747 2560-2815 VGPR0-VGPR255 64*32 Vector General Purpose Registers 1748 when executing in wavefront 64 1749 mode. 1750 2816-3071 *Reserved* *Reserved for future Vector 1751 General Purpose Registers when 1752 executing in wavefront 64 mode.* 1753 3072-3327 AGPR0-AGPR255 64*32 Vector Accumulation Registers 1754 when executing in wavefront 64 1755 mode. 1756 3328-3583 *Reserved* *Reserved for future Vector 1757 Accumulation Registers when 1758 executing in wavefront 64 mode.* 1759 ============== ================= ======== ================================== 1760 1761The vector registers are represented as the full size for the wavefront. They 1762are organized as consecutive dwords (32-bits), one per lane, with the dword at 1763the least significant bit position corresponding to lane 0 and so forth. DWARF 1764location expressions involving the ``DW_OP_LLVM_offset`` and 1765``DW_OP_LLVM_push_lane`` operations are used to select the part of the vector 1766register corresponding to the lane that is executing the current thread of 1767execution in languages that are implemented using a SIMD or SIMT execution 1768model. 
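The per-lane layout can be illustrated with a small Python sketch. This is
purely illustrative (``lane_dword_offset`` is a name invented for this
example, not part of any DWARF consumer): a location expression built from
``DW_OP_LLVM_push_lane`` and ``DW_OP_LLVM_offset`` effectively selects the
dword at this offset within the wavefront-sized register view.

```python
def lane_dword_offset(lane, wavefront_size):
    """Byte offset of a lane's dword within the wavefront-sized DWARF
    view of a 32-bit vector register (lane 0 at the least significant
    position, one dword per lane)."""
    assert 0 <= lane < wavefront_size
    return lane * 4  # one dword (4 bytes) per lane

# A wave64 VGPR is viewed as 64 consecutive dwords (256 bytes total);
# lane 13's dword starts 52 bytes in.
assert lane_dword_offset(0, 64) == 0
assert lane_dword_offset(13, 64) == 52
```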
1769 1770If the wavefront size is 32 lanes then the wavefront 32 mode register 1771definitions are used. If the wavefront size is 64 lanes then the wavefront 64 1772mode register definitions are used. Some AMDGPU targets support executing in 1773both wavefront 32 and wavefront 64 mode. The register definitions corresponding 1774to the wavefront mode of the generated code will be used. 1775 1776If code is generated to execute in a 32-bit process address space, then the 177732-bit process address space register definitions are used. If code is generated 1778to execute in a 64-bit process address space, then the 64-bit process address 1779space register definitions are used. The ``amdgcn`` target only supports the 178064-bit process address space. 1781 1782.. _amdgpu-dwarf-address-class-identifier: 1783 1784Address Class Identifier 1785------------------------ 1786 1787The DWARF address class represents the source language memory space. See DWARF 1788Version 5 section 2.12 which is updated by the *DWARF Extensions For 1789Heterogeneous Debugging* section :ref:`amdgpu-dwarf-segment_addresses`. 1790 1791The DWARF address class mapping used for AMDGPU is defined in 1792:ref:`amdgpu-dwarf-address-class-mapping-table`. 1793 1794.. 
table:: AMDGPU DWARF Address Class Mapping 1795 :name: amdgpu-dwarf-address-class-mapping-table 1796 1797 ========================= ====== ================= 1798 DWARF AMDGPU 1799 -------------------------------- ----------------- 1800 Address Class Name Value Address Space 1801 ========================= ====== ================= 1802 ``DW_ADDR_none`` 0x0000 Generic (Flat) 1803 ``DW_ADDR_LLVM_global`` 0x0001 Global 1804 ``DW_ADDR_LLVM_constant`` 0x0002 Global 1805 ``DW_ADDR_LLVM_group`` 0x0003 Local (group/LDS) 1806 ``DW_ADDR_LLVM_private`` 0x0004 Private (Scratch) 1807 ``DW_ADDR_AMDGPU_region`` 0x8000 Region (GDS) 1808 ========================= ====== ================= 1809 1810The DWARF address class values defined in the *DWARF Extensions For 1811Heterogeneous Debugging* section :ref:`amdgpu-dwarf-segment_addresses` are used. 1812 1813In addition, ``DW_ADDR_AMDGPU_region`` is encoded as a vendor extension. This is 1814available for use for the AMD extension for access to the hardware GDS memory 1815which is scratchpad memory allocated per device. 1816 1817For AMDGPU if no ``DW_AT_address_class`` attribute is present, then the default 1818address class of ``DW_ADDR_none`` is used. 1819 1820See :ref:`amdgpu-dwarf-address-space-identifier` for information on the AMDGPU 1821mapping of DWARF address classes to DWARF address spaces, including address size 1822and NULL value. 1823 1824.. _amdgpu-dwarf-address-space-identifier: 1825 1826Address Space Identifier 1827------------------------ 1828 1829DWARF address spaces correspond to target architecture specific linear 1830addressable memory areas. See DWARF Version 5 section 2.12 and *DWARF Extensions 1831For Heterogeneous Debugging* section :ref:`amdgpu-dwarf-segment_addresses`. 1832 1833The DWARF address space mapping used for AMDGPU is defined in 1834:ref:`amdgpu-dwarf-address-space-mapping-table`. 1835 1836.. 
table:: AMDGPU DWARF Address Space Mapping 1837 :name: amdgpu-dwarf-address-space-mapping-table 1838 1839 ======================================= ===== ======= ======== ================= ======================= 1840 DWARF AMDGPU Notes 1841 --------------------------------------- ----- ---------------- ----------------- ----------------------- 1842 Address Space Name Value Address Bit Size Address Space 1843 --------------------------------------- ----- ------- -------- ----------------- ----------------------- 1844 .. 64-bit 32-bit 1845 process process 1846 address address 1847 space space 1848 ======================================= ===== ======= ======== ================= ======================= 1849 ``DW_ASPACE_none`` 0x00 64 32 Global *default address space* 1850 ``DW_ASPACE_AMDGPU_generic`` 0x01 64 32 Generic (Flat) 1851 ``DW_ASPACE_AMDGPU_region`` 0x02 32 32 Region (GDS) 1852 ``DW_ASPACE_AMDGPU_local`` 0x03 32 32 Local (group/LDS) 1853 *Reserved* 0x04 1854 ``DW_ASPACE_AMDGPU_private_lane`` 0x05 32 32 Private (Scratch) *focused lane* 1855 ``DW_ASPACE_AMDGPU_private_wave`` 0x06 32 32 Private (Scratch) *unswizzled wavefront* 1856 ======================================= ===== ======= ======== ================= ======================= 1857 1858See :ref:`amdgpu-address-spaces` for information on the AMDGPU address spaces 1859including address size and NULL value. 1860 1861The ``DW_ASPACE_none`` address space is the default target architecture address 1862space used in DWARF operations that do not specify an address space. It 1863therefore has to map to the global address space so that the ``DW_OP_addr*`` and 1864related operations can refer to addresses in the program code. 1865 1866The ``DW_ASPACE_AMDGPU_generic`` address space allows location expressions to 1867specify the flat address space. If the address corresponds to an address in the 1868local address space, then it corresponds to the wavefront that is executing the 1869focused thread of execution. 
If the address corresponds to an address in the 1870private address space, then it corresponds to the lane that is executing the 1871focused thread of execution for languages that are implemented using a SIMD or 1872SIMT execution model. 1873 1874.. note:: 1875 1876 CUDA-like languages such as HIP that do not have address spaces in the 1877 language type system, but do allow variables to be allocated in different 1878 address spaces, need to explicitly specify the ``DW_ASPACE_AMDGPU_generic`` 1879 address space in the DWARF expression operations as the default address space 1880 is the global address space. 1881 1882The ``DW_ASPACE_AMDGPU_local`` address space allows location expressions to 1883specify the local address space corresponding to the wavefront that is executing 1884the focused thread of execution. 1885 1886The ``DW_ASPACE_AMDGPU_private_lane`` address space allows location expressions 1887to specify the private address space corresponding to the lane that is executing 1888the focused thread of execution for languages that are implemented using a SIMD 1889or SIMT execution model. 1890 1891The ``DW_ASPACE_AMDGPU_private_wave`` address space allows location expressions 1892to specify the unswizzled private address space corresponding to the wavefront 1893that is executing the focused thread of execution. The wavefront view of private 1894memory is the per wavefront unswizzled backing memory layout defined in 1895:ref:`amdgpu-address-spaces`, such that address 0 corresponds to the first 1896location for the backing memory of the wavefront (namely the address is not 1897offset by ``wavefront-scratch-base``). 
The following formula can be used to
convert from a ``DW_ASPACE_AMDGPU_private_lane`` address to a
``DW_ASPACE_AMDGPU_private_wave`` address:

::

  private-address-wavefront =
    ((private-address-lane / 4) * wavefront-size * 4) +
    (wavefront-lane-id * 4) + (private-address-lane % 4)

If the ``DW_ASPACE_AMDGPU_private_lane`` address is dword aligned, and the start
of the dwords for each lane starting with lane 0 is required, then this
simplifies to:

::

  private-address-wavefront =
    private-address-lane * wavefront-size

A compiler can use the ``DW_ASPACE_AMDGPU_private_wave`` address space to read a
complete spilled vector register back into a complete vector register in the
CFI. The frame pointer can be a private lane address which is dword aligned,
which can be shifted to multiply by the wavefront size, and then used to form a
private wavefront address that gives a location for a contiguous set of dwords,
one per lane, where the vector register dwords are spilled. The compiler knows
the wavefront size since it generates the code. Note that the type of the
address may have to be converted as the size of a
``DW_ASPACE_AMDGPU_private_lane`` address may be smaller than the size of a
``DW_ASPACE_AMDGPU_private_wave`` address.

.. _amdgpu-dwarf-lane-identifier:

Lane Identifier
---------------

DWARF lane identifiers specify a target architecture lane position for hardware
that executes in a SIMD or SIMT manner, and onto which a source language maps
its threads of execution. The DWARF lane identifier is pushed by
the ``DW_OP_LLVM_push_lane`` DWARF expression operation. See DWARF Version 5
section 2.5 which is updated by *DWARF Extensions For Heterogeneous Debugging*
section :ref:`amdgpu-dwarf-operation-expressions`.
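The private address conversion formulas given just above can be sanity checked
with a short Python model. This is illustrative only (the function name is
invented for this sketch): it encodes the fact that swizzled private memory
interleaves dwords, one per lane.

```python
def private_lane_to_wave(lane_addr, lane_id, wavefront_size):
    """Convert a DW_ASPACE_AMDGPU_private_lane address to the matching
    DW_ASPACE_AMDGPU_private_wave address: private memory is swizzled
    as interleaved dwords, one per lane."""
    return ((lane_addr // 4) * wavefront_size * 4
            + lane_id * 4
            + lane_addr % 4)

# For a dword-aligned lane address and lane 0 this reduces to
# lane_addr * wavefront_size, matching the simplified formula.
assert private_lane_to_wave(16, 0, 64) == 16 * 64
# The same dword for lane 5 sits 20 bytes further into the wave view.
assert private_lane_to_wave(16, 5, 64) == 16 * 64 + 20
```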
1938 1939For AMDGPU, the lane identifier corresponds to the hardware lane ID of a 1940wavefront. It is numbered from 0 to the wavefront size minus 1. 1941 1942Operation Expressions 1943--------------------- 1944 1945DWARF expressions are used to compute program values and the locations of 1946program objects. See DWARF Version 5 section 2.5 and 1947:ref:`amdgpu-dwarf-operation-expressions`. 1948 1949DWARF location descriptions describe how to access storage which includes memory 1950and registers. When accessing storage on AMDGPU, bytes are ordered with least 1951significant bytes first, and bits are ordered within bytes with least 1952significant bits first. 1953 1954For AMDGPU CFI expressions, ``DW_OP_LLVM_select_bit_piece`` is used to describe 1955unwinding vector registers that are spilled under the execution mask to memory: 1956the zero-single location description is the vector register, and the one-single 1957location description is the spilled memory location description. The 1958``DW_OP_LLVM_form_aspace_address`` is used to specify the address space of the 1959memory location description. 1960 1961In AMDGPU expressions, ``DW_OP_LLVM_select_bit_piece`` is used by the 1962``DW_AT_LLVM_lane_pc`` attribute expression where divergent control flow is 1963controlled by the execution mask. An undefined location description together 1964with ``DW_OP_LLVM_extend`` is used to indicate the lane was not active on entry 1965to the subprogram. See :ref:`amdgpu-dwarf-dw-at-llvm-lane-pc` for an example. 1966 1967Debugger Information Entry Attributes 1968------------------------------------- 1969 1970This section describes how certain debugger information entry attributes are 1971used by AMDGPU. See the sections in DWARF Version 5 section 2 which are updated 1972by *DWARF Extensions For Heterogeneous Debugging* section 1973:ref:`amdgpu-dwarf-debugging-information-entry-attributes`. 1974 1975.. 
_amdgpu-dwarf-dw-at-llvm-lane-pc:

``DW_AT_LLVM_lane_pc``
~~~~~~~~~~~~~~~~~~~~~~

For AMDGPU, the ``DW_AT_LLVM_lane_pc`` attribute is used to specify the program
location of the separate lanes of a SIMT thread.

If the lane is an active lane then this will be the same as the current program
location.

If the lane is inactive, but was active on entry to the subprogram, then this is
the program location in the subprogram at which execution of the lane is
conceptually positioned.

If the lane was not active on entry to the subprogram, then this will be the
undefined location. A client debugger can check if the lane is part of a valid
work-group by checking that the lane is in the range of the associated
work-group within the grid, accounting for partial work-groups. If it is not,
then the debugger can omit any information for the lane. Otherwise, the debugger
may repeatedly unwind the stack and inspect the ``DW_AT_LLVM_lane_pc`` of the
calling subprogram until it finds a non-undefined location. Conceptually the
lane only has the call frames for which it has a non-undefined
``DW_AT_LLVM_lane_pc``.

The following example illustrates how the AMDGPU backend can generate a DWARF
location list expression for the nested ``IF/THEN/ELSE`` structures of the
following subprogram pseudo code for a target with 64 lanes per wavefront.

.. code::
  :number-lines:

  SUBPROGRAM X
  BEGIN
    a;
    IF (c1) THEN
      b;
      IF (c2) THEN
        c;
      ELSE
        d;
      ENDIF
      e;
    ELSE
      f;
    ENDIF
    g;
  END

The AMDGPU backend may generate the following pseudo LLVM MIR to manipulate the
execution mask (``EXEC``) to linearize the control flow. The condition is
evaluated to make a mask of the lanes for which the condition evaluates to true.
2027First the ``THEN`` region is executed by setting the ``EXEC`` mask to the 2028logical ``AND`` of the current ``EXEC`` mask with the condition mask. Then the 2029``ELSE`` region is executed by negating the ``EXEC`` mask and logical ``AND`` of 2030the saved ``EXEC`` mask at the start of the region. After the ``IF/THEN/ELSE`` 2031region the ``EXEC`` mask is restored to the value it had at the beginning of the 2032region. This is shown below. Other approaches are possible, but the basic 2033concept is the same. 2034 2035.. code:: 2036 :number-lines: 2037 2038 $lex_start: 2039 a; 2040 %1 = EXEC 2041 %2 = c1 2042 $lex_1_start: 2043 EXEC = %1 & %2 2044 $if_1_then: 2045 b; 2046 %3 = EXEC 2047 %4 = c2 2048 $lex_1_1_start: 2049 EXEC = %3 & %4 2050 $lex_1_1_then: 2051 c; 2052 EXEC = ~EXEC & %3 2053 $lex_1_1_else: 2054 d; 2055 EXEC = %3 2056 $lex_1_1_end: 2057 e; 2058 EXEC = ~EXEC & %1 2059 $lex_1_else: 2060 f; 2061 EXEC = %1 2062 $lex_1_end: 2063 g; 2064 $lex_end: 2065 2066To create the DWARF location list expression that defines the location 2067description of a vector of lane program locations, the LLVM MIR ``DBG_VALUE`` 2068pseudo instruction can be used to annotate the linearized control flow. This can 2069be done by defining an artificial variable for the lane PC. The DWARF location 2070list expression created for it is used as the value of the 2071``DW_AT_LLVM_lane_pc`` attribute on the subprogram's debugger information entry. 2072 2073A DWARF procedure is defined for each well nested structured control flow region 2074which provides the conceptual lane program location for a lane if it is not 2075active (namely it is divergent). The DWARF operation expression for each region 2076conceptually inherits the value of the immediately enclosing region and modifies 2077it according to the semantics of the region. 2078 2079For an ``IF/THEN/ELSE`` region the divergent program location is at the start of 2080the region for the ``THEN`` region since it is executed first. 
For the ``ELSE`` 2081region the divergent program location is at the end of the ``IF/THEN/ELSE`` 2082region since the ``THEN`` region has completed. 2083 2084The lane PC artificial variable is assigned at each region transition. It uses 2085the immediately enclosing region's DWARF procedure to compute the program 2086location for each lane assuming they are divergent, and then modifies the result 2087by inserting the current program location for each lane that the ``EXEC`` mask 2088indicates is active. 2089 2090By having separate DWARF procedures for each region, they can be reused to 2091define the value for any nested region. This reduces the total size of the DWARF 2092operation expressions. 2093 2094The following provides an example using pseudo LLVM MIR. 2095 2096.. code:: 2097 :number-lines: 2098 2099 $lex_start: 2100 DEFINE_DWARF %__uint_64 = DW_TAG_base_type[ 2101 DW_AT_name = "__uint64"; 2102 DW_AT_byte_size = 8; 2103 DW_AT_encoding = DW_ATE_unsigned; 2104 ]; 2105 DEFINE_DWARF %__active_lane_pc = DW_TAG_dwarf_procedure[ 2106 DW_AT_name = "__active_lane_pc"; 2107 DW_AT_location = [ 2108 DW_OP_regx PC; 2109 DW_OP_LLVM_extend 64, 64; 2110 DW_OP_regval_type EXEC, %uint_64; 2111 DW_OP_LLVM_select_bit_piece 64, 64; 2112 ]; 2113 ]; 2114 DEFINE_DWARF %__divergent_lane_pc = DW_TAG_dwarf_procedure[ 2115 DW_AT_name = "__divergent_lane_pc"; 2116 DW_AT_location = [ 2117 DW_OP_LLVM_undefined; 2118 DW_OP_LLVM_extend 64, 64; 2119 ]; 2120 ]; 2121 DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[ 2122 DW_OP_call_ref %__divergent_lane_pc; 2123 DW_OP_call_ref %__active_lane_pc; 2124 ]; 2125 a; 2126 %1 = EXEC; 2127 DBG_VALUE %1, $noreg, %__lex_1_save_exec; 2128 %2 = c1; 2129 $lex_1_start: 2130 EXEC = %1 & %2; 2131 $lex_1_then: 2132 DEFINE_DWARF %__divergent_lane_pc_1_then = DW_TAG_dwarf_procedure[ 2133 DW_AT_name = "__divergent_lane_pc_1_then"; 2134 DW_AT_location = DIExpression[ 2135 DW_OP_call_ref %__divergent_lane_pc; 2136 DW_OP_addrx &lex_1_start; 2137 
DW_OP_stack_value; 2138 DW_OP_LLVM_extend 64, 64; 2139 DW_OP_call_ref %__lex_1_save_exec; 2140 DW_OP_deref_type 64, %__uint_64; 2141 DW_OP_LLVM_select_bit_piece 64, 64; 2142 ]; 2143 ]; 2144 DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[ 2145 DW_OP_call_ref %__divergent_lane_pc_1_then; 2146 DW_OP_call_ref %__active_lane_pc; 2147 ]; 2148 b; 2149 %3 = EXEC; 2150 DBG_VALUE %3, %__lex_1_1_save_exec; 2151 %4 = c2; 2152 $lex_1_1_start: 2153 EXEC = %3 & %4; 2154 $lex_1_1_then: 2155 DEFINE_DWARF %__divergent_lane_pc_1_1_then = DW_TAG_dwarf_procedure[ 2156 DW_AT_name = "__divergent_lane_pc_1_1_then"; 2157 DW_AT_location = DIExpression[ 2158 DW_OP_call_ref %__divergent_lane_pc_1_then; 2159 DW_OP_addrx &lex_1_1_start; 2160 DW_OP_stack_value; 2161 DW_OP_LLVM_extend 64, 64; 2162 DW_OP_call_ref %__lex_1_1_save_exec; 2163 DW_OP_deref_type 64, %__uint_64; 2164 DW_OP_LLVM_select_bit_piece 64, 64; 2165 ]; 2166 ]; 2167 DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[ 2168 DW_OP_call_ref %__divergent_lane_pc_1_1_then; 2169 DW_OP_call_ref %__active_lane_pc; 2170 ]; 2171 c; 2172 EXEC = ~EXEC & %3; 2173 $lex_1_1_else: 2174 DEFINE_DWARF %__divergent_lane_pc_1_1_else = DW_TAG_dwarf_procedure[ 2175 DW_AT_name = "__divergent_lane_pc_1_1_else"; 2176 DW_AT_location = DIExpression[ 2177 DW_OP_call_ref %__divergent_lane_pc_1_then; 2178 DW_OP_addrx &lex_1_1_end; 2179 DW_OP_stack_value; 2180 DW_OP_LLVM_extend 64, 64; 2181 DW_OP_call_ref %__lex_1_1_save_exec; 2182 DW_OP_deref_type 64, %__uint_64; 2183 DW_OP_LLVM_select_bit_piece 64, 64; 2184 ]; 2185 ]; 2186 DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[ 2187 DW_OP_call_ref %__divergent_lane_pc_1_1_else; 2188 DW_OP_call_ref %__active_lane_pc; 2189 ]; 2190 d; 2191 EXEC = %3; 2192 $lex_1_1_end: 2193 DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[ 2194 DW_OP_call_ref %__divergent_lane_pc; 2195 DW_OP_call_ref %__active_lane_pc; 2196 ]; 2197 e; 2198 EXEC = ~EXEC & %1; 2199 $lex_1_else: 2200 DEFINE_DWARF 
%__divergent_lane_pc_1_else = DW_TAG_dwarf_procedure[
      DW_AT_name = "__divergent_lane_pc_1_else";
      DW_AT_location = DIExpression[
        DW_OP_call_ref %__divergent_lane_pc;
        DW_OP_addrx &lex_1_end;
        DW_OP_stack_value;
        DW_OP_LLVM_extend 64, 64;
        DW_OP_call_ref %__lex_1_save_exec;
        DW_OP_deref_type 64, %__uint_64;
        DW_OP_LLVM_select_bit_piece 64, 64;
      ];
    ];
    DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
      DW_OP_call_ref %__divergent_lane_pc_1_else;
      DW_OP_call_ref %__active_lane_pc;
    ];
    f;
    EXEC = %1;
  $lex_1_end:
    DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
      DW_OP_call_ref %__divergent_lane_pc;
      DW_OP_call_ref %__active_lane_pc;
    ];
    g;
  $lex_end:

The DWARF procedure ``%__active_lane_pc`` is used to update, with the current
program location, the lane pc elements that are active.

Artificial variables ``%__lex_1_save_exec`` and ``%__lex_1_1_save_exec`` are
created for the execution masks saved on entry to a region. Using the
``DBG_VALUE`` pseudo instruction, location list entries will be created that
describe where the artificial variables are allocated at any given program
location. The compiler may allocate them to registers or spill them to memory.

The DWARF procedures for each region use the values of the saved execution mask
artificial variables to only update the lanes that are active on entry to the
region. All other lanes retain the value of the enclosing region where they were
last active. If they were not active on entry to the subprogram, then they will
have the undefined location description.

Other structured control flow regions can be handled similarly. For example,
loops would set the divergent program location for the region at the end of the
loop. Any lanes active will be in the loop, and any lanes not active must have
exited the loop.
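The effect of the saved execution masks in the linearized ``IF/THEN/ELSE``
example above can be modeled with a few lines of Python. This is a conceptual
model only, not how the backend is implemented; ``if_then_else_masks`` is a
name invented for this sketch.

```python
WAVE64_ALL = (1 << 64) - 1  # wave64: all lanes active on entry

def if_then_else_masks(exec_mask, cond_mask):
    """Model the EXEC manipulation for one IF/THEN/ELSE region: the
    THEN region runs under EXEC & cond, the ELSE region under the
    remaining saved lanes, and EXEC is restored afterwards."""
    saved = exec_mask                    # %1 = EXEC
    then_mask = saved & cond_mask        # EXEC = %1 & %2
    else_mask = ~then_mask & saved       # EXEC = ~EXEC & %1
    return then_mask, else_mask, saved   # EXEC = %1 restores the mask

then_m, else_m, restored = if_then_else_masks(WAVE64_ALL, 0x0F0F)
# Every active lane executes exactly one of the two regions.
assert then_m | else_m == WAVE64_ALL
assert then_m & else_m == 0
assert restored == WAVE64_ALL
```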
2245 2246An ``IF/THEN/ELSEIF/ELSEIF/...`` region can be treated as a nest of 2247``IF/THEN/ELSE`` regions. 2248 2249The DWARF procedures can use the active lane artificial variable described in 2250:ref:`amdgpu-dwarf-amdgpu-dw-at-llvm-active-lane` rather than the actual 2251``EXEC`` mask in order to support whole or quad wavefront mode. 2252 2253.. _amdgpu-dwarf-amdgpu-dw-at-llvm-active-lane: 2254 2255``DW_AT_LLVM_active_lane`` 2256~~~~~~~~~~~~~~~~~~~~~~~~~~ 2257 2258The ``DW_AT_LLVM_active_lane`` attribute on a subprogram debugger information 2259entry is used to specify the lanes that are conceptually active for a SIMT 2260thread. 2261 2262The execution mask may be modified to implement whole or quad wavefront mode 2263operations. For example, all lanes may need to temporarily be made active to 2264execute a whole wavefront operation. Such regions would save the ``EXEC`` mask, 2265update it to enable the necessary lanes, perform the operations, and then 2266restore the ``EXEC`` mask from the saved value. While executing the whole 2267wavefront region, the conceptual execution mask is the saved value, not the 2268``EXEC`` value. 2269 2270This is handled by defining an artificial variable for the active lane mask. The 2271active lane mask artificial variable would be the actual ``EXEC`` mask for 2272normal regions, and the saved execution mask for regions where the mask is 2273temporarily updated. The location list expression created for this artificial 2274variable is used to define the value of the ``DW_AT_LLVM_active_lane`` 2275attribute. 2276 2277``DW_AT_LLVM_augmentation`` 2278~~~~~~~~~~~~~~~~~~~~~~~~~~~ 2279 2280For AMDGPU, the ``DW_AT_LLVM_augmentation`` attribute of a compilation unit 2281debugger information entry has the following value for the augmentation string: 2282 2283:: 2284 2285 [amdgpu:v0.0] 2286 2287The "vX.Y" specifies the major X and minor Y version number of the AMDGPU 2288extensions used in the DWARF of the compilation unit. 
The version number 2289conforms to [SEMVER]_. 2290 2291Call Frame Information 2292---------------------- 2293 2294DWARF Call Frame Information (CFI) describes how a consumer can virtually 2295*unwind* call frames in a running process or core dump. See DWARF Version 5 2296section 6.4 and :ref:`amdgpu-dwarf-call-frame-information`. 2297 2298For AMDGPU, the Common Information Entry (CIE) fields have the following values: 2299 23001. ``augmentation`` string contains the following null-terminated UTF-8 string: 2301 2302 :: 2303 2304 [amd:v0.0] 2305 2306 The ``vX.Y`` specifies the major X and minor Y version number of the AMDGPU 2307 extensions used in this CIE or to the FDEs that use it. The version number 2308 conforms to [SEMVER]_. 2309 23102. ``address_size`` for the ``Global`` address space is defined in 2311 :ref:`amdgpu-dwarf-address-space-identifier`. 2312 23133. ``segment_selector_size`` is 0 as AMDGPU does not use a segment selector. 2314 23154. ``code_alignment_factor`` is 4 bytes. 2316 2317 .. TODO:: 2318 2319 Add to :ref:`amdgpu-processor-table` table. 2320 23215. ``data_alignment_factor`` is 4 bytes. 2322 2323 .. TODO:: 2324 2325 Add to :ref:`amdgpu-processor-table` table. 2326 23276. ``return_address_register`` is ``PC_32`` for 32-bit processes and ``PC_64`` 2328 for 64-bit processes defined in :ref:`amdgpu-dwarf-register-identifier`. 2329 23307. ``initial_instructions`` Since a subprogram X with fewer registers can be 2331 called from subprogram Y that has more allocated, X will not change any of 2332 the extra registers as it cannot access them. Therefore, the default rule 2333 for all columns is ``same value``. 2334 2335For AMDGPU the register number follows the numbering defined in 2336:ref:`amdgpu-dwarf-register-identifier`. 2337 2338For AMDGPU the instructions are variable size. A consumer can subtract 1 from 2339the return address to get the address of a byte within the call site 2340instructions. See DWARF Version 5 section 6.4.4. 
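A DWARF consumer might recognize the CIE augmentation string described above
with a check along these lines. This is an illustrative sketch;
``parse_cie_augmentation`` is not a real API.

```python
import re

def parse_cie_augmentation(augmentation):
    """Return the (major, minor) AMDGPU extension version encoded in a
    '[amd:vX.Y]' CIE augmentation string, or None if the string is not
    in the AMDGPU vendor format."""
    match = re.fullmatch(r"\[amd:v(\d+)\.(\d+)\]", augmentation)
    return (int(match.group(1)), int(match.group(2))) if match else None

assert parse_cie_augmentation("[amd:v0.0]") == (0, 0)
assert parse_cie_augmentation("[gnu:v1.0]") is None
```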
Accelerated Access
------------------

See DWARF Version 5 section 6.1.

Lookup By Name Section Header
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

See DWARF Version 5 section 6.1.1.4.1 and :ref:`amdgpu-dwarf-lookup-by-name`.

For AMDGPU the lookup by name section header table:

``augmentation_string_size`` (uword)

  Set to the length of the ``augmentation_string`` value, which is always a
  multiple of 4.

``augmentation_string`` (sequence of UTF-8 characters)

  Contains the following UTF-8 string null padded to a multiple of 4 bytes:

  ::

    [amdgpu:v0.0]

  The "vX.Y" specifies the major X and minor Y version number of the AMDGPU
  extensions used in the DWARF of this index. The version number conforms to
  [SEMVER]_.

  .. note::

    This is different from the DWARF Version 5 definition, which requires the
    first 4 characters to be the vendor ID. But this is consistent with the
    other augmentation strings and does allow multiple vendor contributions.
    However, backwards compatibility may be more desirable.

Lookup By Address Section Header
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

See DWARF Version 5 section 6.1.2.

For AMDGPU the lookup by address section header table:

``address_size`` (ubyte)

  Matches the address size for the ``Global`` address space defined in
  :ref:`amdgpu-dwarf-address-space-identifier`.

``segment_selector_size`` (ubyte)

  AMDGPU does not use a segment selector so this is 0. The entries in the
  ``.debug_aranges`` do not have a segment selector.

Line Number Information
-----------------------

See DWARF Version 5 section 6.2 and :ref:`amdgpu-dwarf-line-number-information`.

AMDGPU does not use the ``isa`` state machine register and always sets it to 0.
The instruction set must be obtained from the ELF file header ``e_flags`` field
in the ``EF_AMDGPU_MACH`` bit position (see :ref:`ELF Header
<amdgpu-elf-header>`). See DWARF Version 5 section 6.2.2.

.. TODO::

   Should the ``isa`` state machine register be used to indicate if the code is
   in wavefront32 or wavefront64 mode? Or used to specify the architecture ISA?

For AMDGPU the line number program header fields have the following values (see
DWARF Version 5 section 6.2.4):

``address_size`` (ubyte)
  Matches the address size for the ``Global`` address space defined in
  :ref:`amdgpu-dwarf-address-space-identifier`.

``segment_selector_size`` (ubyte)
  AMDGPU does not use a segment selector so this is 0.

``minimum_instruction_length`` (ubyte)
  For GFX9-GFX10 this is 4.

``maximum_operations_per_instruction`` (ubyte)
  For GFX9-GFX10 this is 1.

Source text for online-compiled programs (for example, those compiled by the
OpenCL language runtime) may be embedded into the DWARF Version 5 line table.
See DWARF Version 5 section 6.2.4.1, which is updated by *DWARF Extensions For
Heterogeneous Debugging* section :ref:`DW_LNCT_LLVM_source
<amdgpu-dwarf-line-number-information-dw-lnct-llvm-source>`.

The Clang option used to control source embedding in AMDGPU is defined in
:ref:`amdgpu-clang-debug-options-table`.

  .. table:: AMDGPU Clang Debug Options
     :name: amdgpu-clang-debug-options-table

     ==================== ==================================================
     Debug Flag           Description
     ==================== ==================================================
     -g[no-]embed-source  Enable/disable embedding source text in DWARF
                          debug sections. Useful for environments where
                          source cannot be written to disk, such as
                          when performing online compilation.
     ==================== ==================================================

For example:

``-gembed-source``
  Enable the embedded source.

``-gno-embed-source``
  Disable the embedded source.

32-Bit and 64-Bit DWARF Formats
-------------------------------

See DWARF Version 5 section 7.4 and
:ref:`amdgpu-dwarf-32-bit-and-64-bit-dwarf-formats`.

For AMDGPU:

* For the ``amdgcn`` target architecture only the 64-bit process address space
  is supported.

* The producer can generate either the 32-bit or the 64-bit DWARF format. LLVM
  generates the 32-bit DWARF format.

Unit Headers
------------

For AMDGPU the following values apply for each of the unit headers described in
DWARF Version 5 sections 7.5.1.1, 7.5.1.2, and 7.5.1.3:

``address_size`` (ubyte)
  Matches the address size for the ``Global`` address space defined in
  :ref:`amdgpu-dwarf-address-space-identifier`.

.. _amdgpu-code-conventions:

Code Conventions
================

This section provides code conventions used for each supported target triple OS
(see :ref:`amdgpu-target-triples`).

AMDHSA
------

This section provides code conventions used when the target triple OS is
``amdhsa`` (see :ref:`amdgpu-target-triples`).

.. _amdgpu-amdhsa-code-object-metadata:

Code Object Metadata
~~~~~~~~~~~~~~~~~~~~

The code object metadata specifies extensible metadata associated with the code
objects executed on HSA [HSA]_ compatible runtimes (see :ref:`amdgpu-os`). The
encoding and semantics of this metadata depend on the code object version; see
:ref:`amdgpu-amdhsa-code-object-metadata-v2`,
:ref:`amdgpu-amdhsa-code-object-metadata-v3`, and
:ref:`amdgpu-amdhsa-code-object-metadata-v4`.
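Several of the metadata tables below record ``printf`` calls as colon-separated strings of the form ``ID:N:S[0]:...:S[N-1]:FormatString``. A minimal decoding sketch (the helper name is illustrative, not part of any runtime API); the key point is that only the first ``2 + N`` colons are separators, since the format string itself may contain colons:

```python
def decode_printf_metadata(entry: str):
    """Decode 'ID:N:S[0]:...:S[N-1]:FormatString' into its fields."""
    ident, count, rest = entry.split(":", 2)
    n = int(count)
    # Split off exactly N size fields; everything after the Nth colon
    # is the format string, colons included.
    pieces = rest.split(":", n)
    sizes = [int(s) for s in pieces[:n]]
    return {"id": int(ident), "arg_sizes": sizes, "format": pieces[n]}
```

For example, an entry ``"1:2:4:8:x=%d y=%s"`` describes printf call 1 with two arguments of 4 and 8 bytes and the format string ``x=%d y=%s``.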

Code object metadata is specified in a note record (see
:ref:`amdgpu-note-records`) and is required when the target triple OS is
``amdhsa`` (see :ref:`amdgpu-target-triples`). It must contain the minimum
information necessary to support the HSA compatible runtime kernel queries, for
example, the segment sizes needed in a dispatch packet. In addition, a
high-level language runtime may require other information to be included. For
example, the AMD OpenCL runtime records kernel argument information.

.. _amdgpu-amdhsa-code-object-metadata-v2:

Code Object V2 Metadata
+++++++++++++++++++++++

.. warning::
  Code object V2 is not the default code object version emitted by this version
  of LLVM.

Code object V2 metadata is specified by the ``NT_AMD_HSA_METADATA`` note record
(see :ref:`amdgpu-note-records-v2`).

The metadata is specified as a YAML formatted string (see [YAML]_ and
:doc:`YamlIO`).

.. TODO::

   Is the string null terminated? It probably should not be if YAML allows it
   to contain null characters; otherwise it should be.

The metadata is represented as a single YAML document comprised of the mapping
defined in table :ref:`amdgpu-amdhsa-code-object-metadata-map-v2-table` and
referenced tables.

For boolean values, the string values of ``false`` and ``true`` are used for
false and true respectively.

Additional information can be added to the mappings. To avoid conflicts, any
non-AMD key names should be prefixed by "*vendor-name*.".

  .. table:: AMDHSA Code Object V2 Metadata Map
     :name: amdgpu-amdhsa-code-object-metadata-map-v2-table

     ========== ============== ========= =======================================
     String Key Value Type     Required? Description
     ========== ============== ========= =======================================
     "Version"  sequence of    Required  - The first integer is the major
                2 integers                 version. Currently 1.
                                         - The second integer is the minor
                                           version. Currently 0.
     "Printf"   sequence of              Each string is encoded information
                strings                  about a printf function call. The
                                         encoded information is organized as
                                         fields separated by colon (':'):

                                         ``ID:N:S[0]:S[1]:...:S[N-1]:FormatString``

                                         where:

                                         ``ID``
                                           A 32-bit integer as a unique id for
                                           each printf function call.

                                         ``N``
                                           A 32-bit integer equal to the number
                                           of arguments of the printf function
                                           call minus 1.

                                         ``S[i]`` (where i = 0, 1, ..., N-1)
                                           32-bit integers for the size in bytes
                                           of the i-th FormatString argument of
                                           the printf function call.

                                         FormatString
                                           The format string passed to the
                                           printf function call.
     "Kernels"  sequence of    Required  Sequence of the mappings for each
                mapping                  kernel in the code object. See
                                         :ref:`amdgpu-amdhsa-code-object-kernel-metadata-map-v2-table`
                                         for the definition of the mapping.
     ========== ============== ========= =======================================

..

  .. table:: AMDHSA Code Object V2 Kernel Metadata Map
     :name: amdgpu-amdhsa-code-object-kernel-metadata-map-v2-table

     ================= ============== ========= ================================
     String Key        Value Type     Required? Description
     ================= ============== ========= ================================
     "Name"            string         Required  Source name of the kernel.
     "SymbolName"      string         Required  Name of the kernel
                                                descriptor ELF symbol.
     "Language"        string                   Source language of the kernel.
                                                Values include:

                                                - "OpenCL C"
                                                - "OpenCL C++"
                                                - "HCC"
                                                - "OpenMP"

     "LanguageVersion" sequence of              - The first integer is the major
                       2 integers                 version.
                                                - The second integer is the
                                                  minor version.
     "Attrs"           mapping                  Mapping of kernel attributes.
                                                See
                                                :ref:`amdgpu-amdhsa-code-object-kernel-attribute-metadata-map-v2-table`
                                                for the mapping definition.
     "Args"            sequence of              Sequence of mappings of the
                       mapping                  kernel arguments. See
                                                :ref:`amdgpu-amdhsa-code-object-kernel-argument-metadata-map-v2-table`
                                                for the definition of the mapping.
     "CodeProps"       mapping                  Mapping of properties related to
                                                the kernel code. See
                                                :ref:`amdgpu-amdhsa-code-object-kernel-code-properties-metadata-map-v2-table`
                                                for the mapping definition.
     ================= ============== ========= ================================

..

  .. table:: AMDHSA Code Object V2 Kernel Attribute Metadata Map
     :name: amdgpu-amdhsa-code-object-kernel-attribute-metadata-map-v2-table

     =================== ============== ========= ==============================
     String Key          Value Type     Required? Description
     =================== ============== ========= ==============================
     "ReqdWorkGroupSize" sequence of              If not 0, 0, 0 then all values
                         3 integers               must be >=1 and the dispatch
                                                  work-group size X, Y, Z must
                                                  correspond to the specified
                                                  values. Defaults to 0, 0, 0.

                                                  Corresponds to the OpenCL
                                                  ``reqd_work_group_size``
                                                  attribute.
     "WorkGroupSizeHint" sequence of              The dispatch work-group size
                         3 integers               X, Y, Z is likely to be the
                                                  specified values.

                                                  Corresponds to the OpenCL
                                                  ``work_group_size_hint``
                                                  attribute.
     "VecTypeHint"       string                   The name of a scalar or vector
                                                  type.

                                                  Corresponds to the OpenCL
                                                  ``vec_type_hint`` attribute.

     "RuntimeHandle"     string                   The external symbol name
                                                  associated with a kernel.
                                                  OpenCL runtime allocates a
                                                  global buffer for the symbol
                                                  and saves the kernel's address
                                                  to it, which is used for
                                                  device side enqueueing. Only
                                                  available for device side
                                                  enqueued kernels.
     =================== ============== ========= ==============================

..

  .. table:: AMDHSA Code Object V2 Kernel Argument Metadata Map
     :name: amdgpu-amdhsa-code-object-kernel-argument-metadata-map-v2-table

     ================= ============== ========= ================================
     String Key        Value Type     Required? Description
     ================= ============== ========= ================================
     "Name"            string                   Kernel argument name.
     "TypeName"        string                   Kernel argument type name.
     "Size"            integer        Required  Kernel argument size in bytes.
     "Align"           integer        Required  Kernel argument alignment in
                                                bytes. Must be a power of two.
     "ValueKind"       string         Required  Kernel argument kind that
                                                specifies how to set up the
                                                corresponding argument.
                                                Values include:

                                                "ByValue"
                                                  The argument is copied
                                                  directly into the kernarg.

                                                "GlobalBuffer"
                                                  A global address space pointer
                                                  to the buffer data is passed
                                                  in the kernarg.

                                                "DynamicSharedPointer"
                                                  A group address space pointer
                                                  to dynamically allocated LDS
                                                  is passed in the kernarg.

                                                "Sampler"
                                                  A global address space
                                                  pointer to a S# is passed in
                                                  the kernarg.

                                                "Image"
                                                  A global address space
                                                  pointer to a T# is passed in
                                                  the kernarg.

                                                "Pipe"
                                                  A global address space pointer
                                                  to an OpenCL pipe is passed in
                                                  the kernarg.

                                                "Queue"
                                                  A global address space pointer
                                                  to an OpenCL device enqueue
                                                  queue is passed in the
                                                  kernarg.

                                                "HiddenGlobalOffsetX"
                                                  The OpenCL grid dispatch
                                                  global offset for the X
                                                  dimension is passed in the
                                                  kernarg.

                                                "HiddenGlobalOffsetY"
                                                  The OpenCL grid dispatch
                                                  global offset for the Y
                                                  dimension is passed in the
                                                  kernarg.

                                                "HiddenGlobalOffsetZ"
                                                  The OpenCL grid dispatch
                                                  global offset for the Z
                                                  dimension is passed in the
                                                  kernarg.

                                                "HiddenNone"
                                                  An argument that is not used
                                                  by the kernel. Space needs to
                                                  be left for it, but it does
                                                  not need to be set up.

                                                "HiddenPrintfBuffer"
                                                  A global address space pointer
                                                  to the runtime printf buffer
                                                  is passed in the kernarg.

                                                "HiddenHostcallBuffer"
                                                  A global address space pointer
                                                  to the runtime hostcall buffer
                                                  is passed in the kernarg.

                                                "HiddenDefaultQueue"
                                                  A global address space pointer
                                                  to the OpenCL device enqueue
                                                  queue that should be used by
                                                  the kernel by default is
                                                  passed in the kernarg.

                                                "HiddenCompletionAction"
                                                  A global address space pointer
                                                  to help link enqueued kernels
                                                  into the ancestor tree for
                                                  determining when the parent
                                                  kernel has finished.

                                                "HiddenMultiGridSyncArg"
                                                  A global address space pointer
                                                  for multi-grid synchronization
                                                  is passed in the kernarg.

     "ValueType"       string                   Unused and deprecated. This
                                                should no longer be emitted, but
                                                is accepted for compatibility.

     "PointeeAlign"    integer                  Alignment in bytes of the
                                                pointee type for a pointer type
                                                kernel argument. Must be a power
                                                of 2. Only present if
                                                "ValueKind" is
                                                "DynamicSharedPointer".
     "AddrSpaceQual"   string                   Kernel argument address space
                                                qualifier. Only present if
                                                "ValueKind" is "GlobalBuffer" or
                                                "DynamicSharedPointer". Values
                                                are:

                                                - "Private"
                                                - "Global"
                                                - "Constant"
                                                - "Local"
                                                - "Generic"
                                                - "Region"

                                                .. TODO::

                                                   Is GlobalBuffer only Global
                                                   or Constant? Is
                                                   DynamicSharedPointer always
                                                   Local? Can HCC allow Generic?
                                                   How can Private or Region
                                                   ever happen?

     "AccQual"         string                   Kernel argument access
                                                qualifier. Only present if
                                                "ValueKind" is "Image" or
                                                "Pipe". Values are:

                                                - "ReadOnly"
                                                - "WriteOnly"
                                                - "ReadWrite"

                                                .. TODO::

                                                   Does this apply to
                                                   GlobalBuffer?

     "ActualAccQual"   string                   The actual memory accesses
                                                performed by the kernel on the
                                                kernel argument. Only present if
                                                "ValueKind" is "GlobalBuffer",
                                                "Image", or "Pipe". This may be
                                                more restrictive than indicated
                                                by "AccQual" to reflect what the
                                                kernel actually does. If not
                                                present then the runtime must
                                                assume what is implied by
                                                "AccQual" and "IsConst". Values
                                                are:

                                                - "ReadOnly"
                                                - "WriteOnly"
                                                - "ReadWrite"

     "IsConst"         boolean                  Indicates if the kernel argument
                                                is const qualified. Only present
                                                if "ValueKind" is
                                                "GlobalBuffer".

     "IsRestrict"      boolean                  Indicates if the kernel argument
                                                is restrict qualified. Only
                                                present if "ValueKind" is
                                                "GlobalBuffer".

     "IsVolatile"      boolean                  Indicates if the kernel argument
                                                is volatile qualified. Only
                                                present if "ValueKind" is
                                                "GlobalBuffer".

     "IsPipe"          boolean                  Indicates if the kernel argument
                                                is pipe qualified. Only present
                                                if "ValueKind" is "Pipe".

                                                .. TODO::

                                                   Can GlobalBuffer be pipe
                                                   qualified?

     ================= ============== ========= ================================

..

  .. table:: AMDHSA Code Object V2 Kernel Code Properties Metadata Map
     :name: amdgpu-amdhsa-code-object-kernel-code-properties-metadata-map-v2-table

     ============================ ============== ========= =====================
     String Key                   Value Type     Required? Description
     ============================ ============== ========= =====================
     "KernargSegmentSize"         integer        Required  The size in bytes of
                                                           the kernarg segment
                                                           that holds the values
                                                           of the arguments to
                                                           the kernel.
     "GroupSegmentFixedSize"      integer        Required  The amount of group
                                                           segment memory
                                                           required by a
                                                           work-group in
                                                           bytes. This does not
                                                           include any
                                                           dynamically allocated
                                                           group segment memory
                                                           that may be added
                                                           when the kernel is
                                                           dispatched.
     "PrivateSegmentFixedSize"    integer        Required  The amount of fixed
                                                           private address space
                                                           memory required for a
                                                           work-item in
                                                           bytes. If the kernel
                                                           uses a dynamic call
                                                           stack then additional
                                                           space must be added
                                                           to this value for the
                                                           call stack.
     "KernargSegmentAlign"        integer        Required  The maximum byte
                                                           alignment of
                                                           arguments in the
                                                           kernarg segment. Must
                                                           be a power of 2.
     "WavefrontSize"              integer        Required  Wavefront size. Must
                                                           be a power of 2.
     "NumSGPRs"                   integer        Required  Number of scalar
                                                           registers used by a
                                                           wavefront for
                                                           GFX6-GFX10. This
                                                           includes the special
                                                           SGPRs for VCC, Flat
                                                           Scratch (GFX7-GFX10)
                                                           and XNACK (for
                                                           GFX8-GFX10). It does
                                                           not include the 16
                                                           SGPRs added if a trap
                                                           handler is
                                                           enabled. It is not
                                                           rounded up to the
                                                           allocation
                                                           granularity.
     "NumVGPRs"                   integer        Required  Number of vector
                                                           registers used by
                                                           each work-item for
                                                           GFX6-GFX10.
     "MaxFlatWorkGroupSize"       integer        Required  Maximum flat
                                                           work-group size
                                                           supported by the
                                                           kernel in work-items.
                                                           Must be >=1 and
                                                           consistent with
                                                           ReqdWorkGroupSize if
                                                           not 0, 0, 0.
     "NumSpilledSGPRs"            integer                  Number of stores from
                                                           a scalar register to
                                                           a register allocator
                                                           created spill
                                                           location.
     "NumSpilledVGPRs"            integer                  Number of stores from
                                                           a vector register to
                                                           a register allocator
                                                           created spill
                                                           location.
     ============================ ============== ========= =====================

.. _amdgpu-amdhsa-code-object-metadata-v3:

Code Object V3 Metadata
+++++++++++++++++++++++

Code object V3 to V4 metadata is specified by the ``NT_AMDGPU_METADATA`` note
record (see :ref:`amdgpu-note-records-v3-v4`).

The metadata is represented as Message Pack formatted binary data (see
[MsgPack]_). The top level is a Message Pack map that includes the
keys defined in table
:ref:`amdgpu-amdhsa-code-object-metadata-map-table-v3` and referenced
tables.

Additional information can be added to the maps. To avoid conflicts,
any key names should be prefixed by "*vendor-name*." where
``vendor-name`` can be the name of the vendor and the specific vendor
tool that generates the information. The prefix is abbreviated to
simply "." when it appears within a map that has been added by the
same *vendor-name*.

  .. table:: AMDHSA Code Object V3 Metadata Map
     :name: amdgpu-amdhsa-code-object-metadata-map-table-v3

     ================= ============== ========= =======================================
     String Key        Value Type     Required? Description
     ================= ============== ========= =======================================
     "amdhsa.version"  sequence of    Required  - The first integer is the major
                       2 integers                 version. Currently 1.
                                                - The second integer is the minor
                                                  version. Currently 0.
     "amdhsa.printf"   sequence of              Each string is encoded information
                       strings                  about a printf function call. The
                                                encoded information is organized as
                                                fields separated by colon (':'):

                                                ``ID:N:S[0]:S[1]:...:S[N-1]:FormatString``

                                                where:

                                                ``ID``
                                                  A 32-bit integer as a unique id for
                                                  each printf function call.

                                                ``N``
                                                  A 32-bit integer equal to the number
                                                  of arguments of the printf function
                                                  call minus 1.

                                                ``S[i]`` (where i = 0, 1, ..., N-1)
                                                  32-bit integers for the size in bytes
                                                  of the i-th FormatString argument of
                                                  the printf function call.

                                                FormatString
                                                  The format string passed to the
                                                  printf function call.
     "amdhsa.kernels"  sequence of    Required  Sequence of the maps for each
                       map                      kernel in the code object. See
                                                :ref:`amdgpu-amdhsa-code-object-kernel-metadata-map-table-v3`
                                                for the definition of the keys included
                                                in that map.
     ================= ============== ========= =======================================

..

  .. table:: AMDHSA Code Object V3 Kernel Metadata Map
     :name: amdgpu-amdhsa-code-object-kernel-metadata-map-table-v3

     =================================== ============== ========= ================================
     String Key                          Value Type     Required? Description
     =================================== ============== ========= ================================
     ".name"                             string         Required  Source name of the kernel.
     ".symbol"                           string         Required  Name of the kernel
                                                                  descriptor ELF symbol.
     ".language"                         string                   Source language of the kernel.
                                                                  Values include:

                                                                  - "OpenCL C"
                                                                  - "OpenCL C++"
                                                                  - "HCC"
                                                                  - "HIP"
                                                                  - "OpenMP"
                                                                  - "Assembler"

     ".language_version"                 sequence of              - The first integer is the major
                                         2 integers                 version.
                                                                  - The second integer is the
                                                                    minor version.
     ".args"                             sequence of              Sequence of maps of the
                                         map                      kernel arguments. See
                                                                  :ref:`amdgpu-amdhsa-code-object-kernel-argument-metadata-map-table-v3`
                                                                  for the definition of the keys
                                                                  included in that map.
     ".reqd_workgroup_size"              sequence of              If not 0, 0, 0 then all values
                                         3 integers               must be >=1 and the dispatch
                                                                  work-group size X, Y, Z must
                                                                  correspond to the specified
                                                                  values. Defaults to 0, 0, 0.

                                                                  Corresponds to the OpenCL
                                                                  ``reqd_work_group_size``
                                                                  attribute.
     ".workgroup_size_hint"              sequence of              The dispatch work-group size
                                         3 integers               X, Y, Z is likely to be the
                                                                  specified values.

                                                                  Corresponds to the OpenCL
                                                                  ``work_group_size_hint``
                                                                  attribute.
     ".vec_type_hint"                    string                   The name of a scalar or vector
                                                                  type.

                                                                  Corresponds to the OpenCL
                                                                  ``vec_type_hint`` attribute.

     ".device_enqueue_symbol"            string                   The external symbol name
                                                                  associated with a kernel.
                                                                  OpenCL runtime allocates a
                                                                  global buffer for the symbol
                                                                  and saves the kernel's address
                                                                  to it, which is used for
                                                                  device side enqueueing. Only
                                                                  available for device side
                                                                  enqueued kernels.
     ".kernarg_segment_size"             integer        Required  The size in bytes of
                                                                  the kernarg segment
                                                                  that holds the values
                                                                  of the arguments to
                                                                  the kernel.
     ".group_segment_fixed_size"         integer        Required  The amount of group
                                                                  segment memory
                                                                  required by a
                                                                  work-group in
                                                                  bytes. This does not
                                                                  include any
                                                                  dynamically allocated
                                                                  group segment memory
                                                                  that may be added
                                                                  when the kernel is
                                                                  dispatched.
     ".private_segment_fixed_size"       integer        Required  The amount of fixed
                                                                  private address space
                                                                  memory required for a
                                                                  work-item in
                                                                  bytes. If the kernel
                                                                  uses a dynamic call
                                                                  stack then additional
                                                                  space must be added
                                                                  to this value for the
                                                                  call stack.
     ".kernarg_segment_align"            integer        Required  The maximum byte
                                                                  alignment of
                                                                  arguments in the
                                                                  kernarg segment. Must
                                                                  be a power of 2.
     ".wavefront_size"                   integer        Required  Wavefront size. Must
                                                                  be a power of 2.
     ".sgpr_count"                       integer        Required  Number of scalar
                                                                  registers required by a
                                                                  wavefront for
                                                                  GFX6-GFX9. A register
                                                                  is required if it is
                                                                  used explicitly, or
                                                                  if a higher numbered
                                                                  register is used
                                                                  explicitly. This
                                                                  includes the special
                                                                  SGPRs for VCC, Flat
                                                                  Scratch (GFX7-GFX9)
                                                                  and XNACK (for
                                                                  GFX8-GFX9). It does
                                                                  not include the 16
                                                                  SGPRs added if a trap
                                                                  handler is
                                                                  enabled. It is not
                                                                  rounded up to the
                                                                  allocation
                                                                  granularity.
     ".vgpr_count"                       integer        Required  Number of vector
                                                                  registers required by
                                                                  each work-item for
                                                                  GFX6-GFX9. A register
                                                                  is required if it is
                                                                  used explicitly, or
                                                                  if a higher numbered
                                                                  register is used
                                                                  explicitly.
     ".max_flat_workgroup_size"          integer        Required  Maximum flat
                                                                  work-group size
                                                                  supported by the
                                                                  kernel in work-items.
                                                                  Must be >=1 and
                                                                  consistent with
                                                                  ReqdWorkGroupSize if
                                                                  not 0, 0, 0.
     ".sgpr_spill_count"                 integer                  Number of stores from
                                                                  a scalar register to
                                                                  a register allocator
                                                                  created spill
                                                                  location.
     ".vgpr_spill_count"                 integer                  Number of stores from
                                                                  a vector register to
                                                                  a register allocator
                                                                  created spill
                                                                  location.
     =================================== ============== ========= ================================

..

  .. table:: AMDHSA Code Object V3 Kernel Argument Metadata Map
     :name: amdgpu-amdhsa-code-object-kernel-argument-metadata-map-table-v3

     ====================== ============== ========= ================================
     String Key             Value Type     Required? Description
     ====================== ============== ========= ================================
     ".name"                string                   Kernel argument name.
     ".type_name"           string                   Kernel argument type name.
     ".size"                integer        Required  Kernel argument size in bytes.
     ".offset"              integer        Required  Kernel argument offset in
                                                     bytes. The offset must be a
                                                     multiple of the alignment
                                                     required by the argument.
     ".value_kind"          string         Required  Kernel argument kind that
                                                     specifies how to set up the
                                                     corresponding argument.
                                                     Values include:

                                                     "by_value"
                                                       The argument is copied
                                                       directly into the kernarg.

                                                     "global_buffer"
                                                       A global address space
                                                       pointer to the buffer data
                                                       is passed in the kernarg.

                                                     "dynamic_shared_pointer"
                                                       A group address space
                                                       pointer to dynamically
                                                       allocated LDS is passed in
                                                       the kernarg.

                                                     "sampler"
                                                       A global address space
                                                       pointer to a S# is passed
                                                       in the kernarg.

                                                     "image"
                                                       A global address space
                                                       pointer to a T# is passed
                                                       in the kernarg.

                                                     "pipe"
                                                       A global address space
                                                       pointer to an OpenCL pipe
                                                       is passed in the kernarg.

                                                     "queue"
                                                       A global address space
                                                       pointer to an OpenCL device
                                                       enqueue queue is passed in
                                                       the kernarg.

                                                     "hidden_global_offset_x"
                                                       The OpenCL grid dispatch
                                                       global offset for the X
                                                       dimension is passed in the
                                                       kernarg.

                                                     "hidden_global_offset_y"
                                                       The OpenCL grid dispatch
                                                       global offset for the Y
                                                       dimension is passed in the
                                                       kernarg.

                                                     "hidden_global_offset_z"
                                                       The OpenCL grid dispatch
                                                       global offset for the Z
                                                       dimension is passed in the
                                                       kernarg.

                                                     "hidden_none"
                                                       An argument that is not
                                                       used by the kernel. Space
                                                       needs to be left for it,
                                                       but it does not need to be
                                                       set up.

                                                     "hidden_printf_buffer"
                                                       A global address space
                                                       pointer to the runtime
                                                       printf buffer is passed in
                                                       the kernarg.

                                                     "hidden_hostcall_buffer"
                                                       A global address space
                                                       pointer to the runtime
                                                       hostcall buffer is passed
                                                       in the kernarg.

                                                     "hidden_default_queue"
                                                       A global address space
                                                       pointer to the OpenCL
                                                       device enqueue queue that
                                                       should be used by the
                                                       kernel by default is
                                                       passed in the kernarg.

                                                     "hidden_completion_action"
                                                       A global address space
                                                       pointer to help link
                                                       enqueued kernels into the
                                                       ancestor tree for
                                                       determining when the
                                                       parent kernel has
                                                       finished.

                                                     "hidden_multigrid_sync_arg"
                                                       A global address space
                                                       pointer for multi-grid
                                                       synchronization is passed
                                                       in the kernarg.

     ".value_type"          string                   Unused and deprecated. This
                                                     should no longer be emitted,
                                                     but is accepted for
                                                     compatibility.

     ".pointee_align"       integer                  Alignment in bytes of the
                                                     pointee type for a pointer
                                                     type kernel argument. Must
                                                     be a power of 2. Only
                                                     present if ".value_kind" is
                                                     "dynamic_shared_pointer".
     ".address_space"       string                   Kernel argument address
                                                     space qualifier. Only
                                                     present if ".value_kind" is
                                                     "global_buffer" or
                                                     "dynamic_shared_pointer".
                                                     Values are:

                                                     - "private"
                                                     - "global"
                                                     - "constant"
                                                     - "local"
                                                     - "generic"
                                                     - "region"

                                                     .. TODO::

                                                        Is "global_buffer" only
                                                        "global" or "constant"?
                                                        Is
                                                        "dynamic_shared_pointer"
                                                        always "local"? Can HCC
                                                        allow "generic"? How can
                                                        "private" or "region"
                                                        ever happen?

     ".access"              string                   Kernel argument access
                                                     qualifier. Only present if
                                                     ".value_kind" is "image" or
                                                     "pipe". Values are:

                                                     - "read_only"
                                                     - "write_only"
                                                     - "read_write"

                                                     .. TODO::

                                                        Does this apply to
                                                        "global_buffer"?

     ".actual_access"       string                   The actual memory accesses
                                                     performed by the kernel on
                                                     the kernel argument. Only
                                                     present if ".value_kind" is
                                                     "global_buffer", "image", or
                                                     "pipe". This may be more
                                                     restrictive than indicated
                                                     by ".access" to reflect what
                                                     the kernel actually does. If
                                                     not present then the runtime
                                                     must assume what is implied
                                                     by ".access" and
                                                     ".is_const". Values are:

                                                     - "read_only"
                                                     - "write_only"
                                                     - "read_write"

     ".is_const"            boolean                  Indicates if the kernel
                                                     argument is const qualified.
                                                     Only present if
                                                     ".value_kind" is
                                                     "global_buffer".

     ".is_restrict"         boolean                  Indicates if the kernel
                                                     argument is restrict
                                                     qualified. Only present if
                                                     ".value_kind" is
                                                     "global_buffer".

     ".is_volatile"         boolean                  Indicates if the kernel
                                                     argument is volatile
                                                     qualified. Only present if
                                                     ".value_kind" is
                                                     "global_buffer".

     ".is_pipe"             boolean                  Indicates if the kernel
                                                     argument is pipe qualified.
                                                     Only present if
                                                     ".value_kind" is "pipe".

                                                     .. TODO::

                                                        Can "global_buffer" be
                                                        pipe qualified?

     ====================== ============== ========= ================================

.. _amdgpu-amdhsa-code-object-metadata-v4:

Code Object V4 Metadata
+++++++++++++++++++++++

.. warning::
  Code object V4 is not the default code object version emitted by this version
  of LLVM.

Code object V4 metadata is the same as
:ref:`amdgpu-amdhsa-code-object-metadata-v3` with the changes and additions
defined in table :ref:`amdgpu-amdhsa-code-object-metadata-map-table-v4`.

  .. table:: AMDHSA Code Object V4 Metadata Map Changes from :ref:`amdgpu-amdhsa-code-object-metadata-v3`
     :name: amdgpu-amdhsa-code-object-metadata-map-table-v4

     ================= ============== ========= =======================================
     String Key        Value Type     Required? Description
     ================= ============== ========= =======================================
     "amdhsa.version"  sequence of    Required  - The first integer is the major
                       2 integers                 version. Currently 1.
                                                - The second integer is the minor
                                                  version. Currently 1.
     "amdhsa.target"   string         Required  The target name of the code using the syntax:

                                                .. code::

                                                  <target-triple> [ "-" <target-id> ]

                                                A canonical target ID must be
                                                used. See :ref:`amdgpu-target-triples`
                                                and :ref:`amdgpu-target-id`.
     ================= ============== ========= =======================================

..

Kernel Dispatch
~~~~~~~~~~~~~~~

The HSA architected queuing language (AQL) defines a user space memory interface
that can be used to control the dispatch of kernels, in an agent independent
way. An agent can have zero or more AQL queues created for it using an HSA
compatible runtime (see :ref:`amdgpu-os`), in which AQL packets (all of which
are 64 bytes) can be placed. See the *HSA Platform System Architecture
Specification* [HSA]_ for the AQL queue mechanics and packet layouts.

The packet processor of a kernel agent is responsible for detecting and
dispatching HSA kernels from the AQL queues associated with it. For AMD GPUs the
packet processor is implemented by the hardware command processor (CP),
asynchronous dispatch controller (ADC) and shader processor input controller
(SPI).

An HSA compatible runtime can be used to allocate an AQL queue object. It uses
the kernel mode driver to initialize and register the AQL queue with CP.

To dispatch a kernel the following actions are performed. This can occur in the
CPU host program, or from an HSA kernel executing on a GPU.

1. A pointer to an AQL queue for the kernel agent on which the kernel is to be
   executed is obtained.
2. A pointer to the kernel descriptor (see
   :ref:`amdgpu-amdhsa-kernel-descriptor`) of the kernel to execute is obtained.
   It must be for a kernel that is contained in a code object that was loaded
   by an HSA compatible runtime on the kernel agent with which the AQL queue is
   associated.
3. Space is allocated for the kernel arguments using the HSA compatible runtime
   allocator for a memory region with the kernarg property for the kernel agent
   that will execute the kernel. It must be at least 16-byte aligned.
4. Kernel argument values are assigned to the kernel argument memory
   allocation. The layout is defined in the *HSA Programmer's Language
   Reference* [HSA]_. For AMDGPU the kernel execution directly accesses the
   kernel argument memory in the same way constant memory is accessed. (Note
   that the HSA specification allows an implementation to copy the kernel
   argument contents to another location that is accessed by the kernel.)
5. An AQL kernel dispatch packet is created on the AQL queue. The HSA compatible
   runtime API uses 64-bit atomic operations to reserve space in the AQL queue
   for the packet. The packet must be set up, and the final write must use an
   atomic store release to set the packet kind to ensure the packet contents are
   visible to the kernel agent. AQL defines a doorbell signal mechanism to
   notify the kernel agent that the AQL queue has been updated. These rules, and
   the layout of the AQL queue and kernel dispatch packet, are defined in the
   *HSA System Architecture Specification* [HSA]_.
6. A kernel dispatch packet includes information about the actual dispatch,
   such as grid and work-group size, together with information from the code
   object about the kernel, such as segment sizes. The HSA compatible runtime
   queries on the kernel symbol can be used to obtain the code object values
   which are recorded in the :ref:`amdgpu-amdhsa-code-object-metadata`.
7. CP executes micro-code and is responsible for detecting and setting up the
   GPU to execute the wavefronts of a kernel dispatch.
8. CP ensures that when a wavefront starts executing the kernel machine
   code, the scalar general purpose registers (SGPR) and vector general purpose
   registers (VGPR) are set up as required by the machine code. The required
   setup is defined in the :ref:`amdgpu-amdhsa-kernel-descriptor`. The initial
   register state is defined in
   :ref:`amdgpu-amdhsa-initial-kernel-execution-state`.
9. The prolog of the kernel machine code (see
   :ref:`amdgpu-amdhsa-kernel-prolog`) sets up the machine state as necessary
   before continuing executing the machine code that corresponds to the kernel.
10. When the kernel dispatch has completed execution, CP signals the completion
    signal specified in the kernel dispatch packet if not 0.

.. _amdgpu-amdhsa-memory-spaces:

Memory Spaces
~~~~~~~~~~~~~

The memory space properties are:

  .. table:: AMDHSA Memory Spaces
     :name: amdgpu-amdhsa-memory-spaces-table

     ================= =========== ======== ======= ==================
     Memory Space Name HSA Segment Hardware Address NULL Value
                       Name        Name     Size
     ================= =========== ======== ======= ==================
     Private           private     scratch  32      0x00000000
     Local             group       LDS      32      0xFFFFFFFF
     Global            global      global   64      0x0000000000000000
     Constant          constant    *same as 64      0x0000000000000000
                                   global*
     Generic           flat        flat     64      0x0000000000000000
     Region            N/A         GDS      32      *not implemented
                                                    for AMDHSA*
     ================= =========== ======== ======= ==================

The global and constant memory spaces both use global virtual addresses, which
are the same virtual address space used by the CPU. However, some virtual
addresses may only be accessible to the CPU, some only accessible by the GPU,
and some by both.

Using the constant memory space indicates that the data will not change during
the execution of the kernel.
This allows scalar read instructions to be
used. The vector and scalar L1 caches are invalidated of volatile data before
each kernel dispatch execution to allow constant memory to change values between
kernel dispatches.

The local memory space uses the hardware Local Data Store (LDS) which is
automatically allocated when the hardware creates work-groups of wavefronts, and
freed when all the wavefronts of a work-group have terminated. The data store
(DS) instructions can be used to access it.

The private memory space uses the hardware scratch memory support. If the kernel
uses scratch, then the hardware allocates memory that is accessed using
wavefront lane dword (4 byte) interleaving. The mapping used from private
address to physical address is:

  ``wavefront-scratch-base +
  (private-address * wavefront-size * 4) +
  (wavefront-lane-id * 4)``

There are different ways that the wavefront scratch base address is determined
by a wavefront (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). This
memory can be accessed in an interleaved manner using buffer instructions with
the scratch buffer descriptor and per wavefront scratch offset, by the scratch
instructions, or by flat instructions. If each lane of a wavefront accesses the
same private address, the interleaving results in adjacent dwords being accessed
and hence requires fewer cache lines to be fetched. Multi-dword access is not
supported except by flat and scratch instructions in GFX9-GFX10.

The generic address space uses the hardware flat address support available in
GFX7-GFX10. This uses two fixed ranges of virtual addresses (the private and
local apertures), that are outside the range of addressable global memory, to
map from a flat address to a private or local address.
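The private-address interleaving described above can be checked with a small
sketch (Python, illustrative only; ``private_address`` is taken as the
dword-granular private offset used in the formula, and a wavefront size of 64
is assumed):

.. code-block:: python

   WAVEFRONT_SIZE = 64  # lanes per wavefront (32 in GFX10 wave32 mode)

   def private_to_physical(wavefront_scratch_base, private_address, lane_id,
                           wavefront_size=WAVEFRONT_SIZE):
       """Apply the dword interleaving described above:
       wavefront-scratch-base + (private-address * wavefront-size * 4)
       + (wavefront-lane-id * 4)."""
       return (wavefront_scratch_base
               + private_address * wavefront_size * 4
               + lane_id * 4)

   # When every lane of a wavefront accesses the same private address, the
   # lanes' dwords land in adjacent physical locations, so the wavefront
   # touches one contiguous wavefront_size * 4 byte region rather than 64
   # scattered cache lines.
   addrs = [private_to_physical(0x1000, 2, lane) for lane in range(WAVEFRONT_SIZE)]
   assert addrs == list(range(addrs[0], addrs[0] + 4 * WAVEFRONT_SIZE, 4))

Note the trade-off this layout implies: a single lane's consecutive private
dwords are ``wavefront_size * 4`` bytes apart, which is why the interleaving
only helps when the lanes of a wavefront access the same private address.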
FLAT instructions can take a flat address and access global, private (scratch),
and group (LDS) memory depending on whether the address is within one of the
aperture ranges. Flat access to scratch requires hardware aperture setup and
setup in the kernel prologue (see
:ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`). Flat access to LDS requires
hardware aperture setup and M0 (GFX7-GFX8) register setup (see
:ref:`amdgpu-amdhsa-kernel-prolog-m0`).

To convert between a segment address and a flat address the base address of the
corresponding aperture can be used. For GFX7-GFX8 these are available in the
:ref:`amdgpu-amdhsa-hsa-aql-queue` the address of which can be obtained with
Queue Ptr SGPR (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). For
GFX9-GFX10 the aperture base addresses are directly available as inline constant
registers ``SRC_SHARED_BASE/LIMIT`` and ``SRC_PRIVATE_BASE/LIMIT``. In 64-bit
address mode the aperture sizes are 2^32 bytes and the base is aligned to 2^32
which makes it easier to convert from flat to segment or segment to flat.

Image and Samplers
~~~~~~~~~~~~~~~~~~

Image and sample handles created by an HSA compatible runtime (see
:ref:`amdgpu-os`) are 64-bit addresses of a hardware 32-byte V# and 48-byte S#
object respectively. In order to support the HSA ``query_sampler`` operations
two extra dwords are used to store the HSA BRIG enumeration values for the
queries that are not trivially deducible from the S# representation.

HSA Signals
~~~~~~~~~~~

HSA signal handles created by an HSA compatible runtime (see :ref:`amdgpu-os`)
are 64-bit addresses of a structure allocated in memory accessible from both the
CPU and GPU. The structure is defined by the runtime and subject to change
between releases. For example, see [AMD-ROCm-github]_.

.. _amdgpu-amdhsa-hsa-aql-queue:

HSA AQL Queue
~~~~~~~~~~~~~

The HSA AQL queue structure is defined by an HSA compatible runtime (see
:ref:`amdgpu-os`) and subject to change between releases. For example, see
[AMD-ROCm-github]_. For some processors it contains fields needed to implement
certain language features such as the flat address aperture bases. It also
contains fields used by CP such as managing the allocation of scratch memory.

.. _amdgpu-amdhsa-kernel-descriptor:

Kernel Descriptor
~~~~~~~~~~~~~~~~~

A kernel descriptor consists of the information needed by CP to initiate the
execution of a kernel, including the entry point address of the machine code
that implements the kernel.

Code Object V3 Kernel Descriptor
++++++++++++++++++++++++++++++++

CP microcode requires the Kernel descriptor to be allocated on 64-byte
alignment.

The fields used by CP for code objects before V3 also match those specified in
:ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.

  .. table:: Code Object V3 Kernel Descriptor
     :name: amdgpu-amdhsa-kernel-descriptor-v3-table

     ======= ======= =============================== ============================
     Bits    Size    Field Name                      Description
     ======= ======= =============================== ============================
     31:0    4 bytes GROUP_SEGMENT_FIXED_SIZE        The amount of fixed local
                                                     address space memory
                                                     required for a work-group
                                                     in bytes. This does not
                                                     include any dynamically
                                                     allocated local address
                                                     space memory that may be
                                                     added when the kernel is
                                                     dispatched.
     63:32   4 bytes PRIVATE_SEGMENT_FIXED_SIZE      The amount of fixed
                                                     private address space
                                                     memory required for a
                                                     work-item in bytes.
                                                     Additional space may need to
                                                     be added to this value if
                                                     the call stack has
                                                     non-inlined function calls.
     95:64   4 bytes KERNARG_SIZE                    The size of the kernarg
                                                     memory pointed to by the
                                                     AQL dispatch packet. The
                                                     kernarg memory is used to
                                                     pass arguments to the
                                                     kernel.

                                                     * If the kernarg pointer in
                                                       the dispatch packet is NULL
                                                       then there are no kernel
                                                       arguments.
                                                     * If the kernarg pointer in
                                                       the dispatch packet is
                                                       not NULL and this value
                                                       is 0 then the kernarg
                                                       memory size is
                                                       unspecified.
                                                     * If the kernarg pointer in
                                                       the dispatch packet is
                                                       not NULL and this value
                                                       is not 0 then the value
                                                       specifies the kernarg
                                                       memory size in bytes. It
                                                       is recommended to provide
                                                       a value as it may be used
                                                       by CP to optimize making
                                                       the kernarg memory
                                                       visible to the kernel
                                                       code.

     127:96  4 bytes Reserved, must be 0.
     191:128 8 bytes KERNEL_CODE_ENTRY_BYTE_OFFSET   Byte offset (possibly
                                                     negative) from base
                                                     address of kernel
                                                     descriptor to kernel's
                                                     entry point instruction
                                                     which must be 256 byte
                                                     aligned.
     351:192 20      Reserved, must be 0.
             bytes
     383:352 4 bytes COMPUTE_PGM_RSRC3               GFX6-GFX9
                                                       Reserved, must be 0.
                                                     GFX90A
                                                       Compute Shader (CS)
                                                       program settings used by
                                                       CP to set up
                                                       ``COMPUTE_PGM_RSRC3``
                                                       configuration
                                                       register. See
                                                       :ref:`amdgpu-amdhsa-compute_pgm_rsrc3-gfx90a-table`.
                                                     GFX10
                                                       Compute Shader (CS)
                                                       program settings used by
                                                       CP to set up
                                                       ``COMPUTE_PGM_RSRC3``
                                                       configuration
                                                       register. See
                                                       :ref:`amdgpu-amdhsa-compute_pgm_rsrc3-gfx10-table`.
     415:384 4 bytes COMPUTE_PGM_RSRC1               Compute Shader (CS)
                                                     program settings used by
                                                     CP to set up
                                                     ``COMPUTE_PGM_RSRC1``
                                                     configuration
                                                     register. See
                                                     :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
     447:416 4 bytes COMPUTE_PGM_RSRC2               Compute Shader (CS)
                                                     program settings used by
                                                     CP to set up
                                                     ``COMPUTE_PGM_RSRC2``
                                                     configuration
                                                     register.
                                                     See
                                                     :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
     458:448 7 bits  *See separate bits below.*      Enable the setup of the
                                                     SGPR user data registers
                                                     (see
                                                     :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).

                                                     The total number of SGPR
                                                     user data registers
                                                     requested must not exceed
                                                     16 and match the value in
                                                     ``compute_pgm_rsrc2.user_sgpr.user_sgpr_count``.
                                                     Any requests beyond 16
                                                     will be ignored.
     >448    1 bit   ENABLE_SGPR_PRIVATE_SEGMENT     If the *Target Properties*
                     _BUFFER                         column of
                                                     :ref:`amdgpu-processor-table`
                                                     specifies *Architected flat
                                                     scratch* then not supported
                                                     and must be 0.
     >449    1 bit   ENABLE_SGPR_DISPATCH_PTR
     >450    1 bit   ENABLE_SGPR_QUEUE_PTR
     >451    1 bit   ENABLE_SGPR_KERNARG_SEGMENT_PTR
     >452    1 bit   ENABLE_SGPR_DISPATCH_ID
     >453    1 bit   ENABLE_SGPR_FLAT_SCRATCH_INIT   If the *Target Properties*
                                                     column of
                                                     :ref:`amdgpu-processor-table`
                                                     specifies *Architected flat
                                                     scratch* then not supported
                                                     and must be 0.
     >454    1 bit   ENABLE_SGPR_PRIVATE_SEGMENT
                     _SIZE
     457:455 3 bits  Reserved, must be 0.
     458     1 bit   ENABLE_WAVEFRONT_SIZE32         GFX6-GFX9
                                                       Reserved, must be 0.
                                                     GFX10
                                                       - If 0 execute in
                                                         wavefront size 64 mode.
                                                       - If 1 execute in
                                                         native wavefront size
                                                         32 mode.
     463:459 5 bits  Reserved, must be 0.
     464     1 bit   RESERVED_464                    Deprecated, must be 0.
     467:465 3 bits  Reserved, must be 0.
     468     1 bit   RESERVED_468                    Deprecated, must be 0.
     471:469 3 bits  Reserved, must be 0.
     511:472 5 bytes Reserved, must be 0.
     512             **Total size 64 bytes.**
     ======= ====================================================================

..

  .. table:: compute_pgm_rsrc1 for GFX6-GFX10
     :name: amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table

     ======= ======= =============================== ===========================================================================
     Bits    Size    Field Name                      Description
     ======= ======= =============================== ===========================================================================
     5:0     6 bits  GRANULATED_WORKITEM_VGPR_COUNT  Number of vector register
                                                     blocks used by each work-item;
                                                     granularity is device
                                                     specific:

                                                     GFX6-GFX9
                                                       - vgprs_used 0..256
                                                       - max(0, ceil(vgprs_used / 4) - 1)
                                                     GFX90A
                                                       - vgprs_used 0..512
                                                       - vgprs_used = align(arch_vgprs, 4)
                                                         + acc_vgprs
                                                       - max(0, ceil(vgprs_used / 8) - 1)
                                                     GFX10 (wavefront size 64)
                                                       - max_vgpr 1..256
                                                       - max(0, ceil(vgprs_used / 4) - 1)
                                                     GFX10 (wavefront size 32)
                                                       - max_vgpr 1..256
                                                       - max(0, ceil(vgprs_used / 8) - 1)

                                                     Where vgprs_used is defined
                                                     as the highest VGPR number
                                                     explicitly referenced plus
                                                     one.

                                                     Used by CP to set up
                                                     ``COMPUTE_PGM_RSRC1.VGPRS``.

                                                     The
                                                     :ref:`amdgpu-assembler`
                                                     calculates this
                                                     automatically for the
                                                     selected processor from
                                                     values provided to the
                                                     `.amdhsa_kernel` directive
                                                     by the
                                                     `.amdhsa_next_free_vgpr`
                                                     nested directive (see
                                                     :ref:`amdhsa-kernel-directives-table`).
     9:6     4 bits  GRANULATED_WAVEFRONT_SGPR_COUNT Number of scalar register
                                                     blocks used by a wavefront;
                                                     granularity is device
                                                     specific:

                                                     GFX6-GFX8
                                                       - sgprs_used 0..112
                                                       - max(0, ceil(sgprs_used / 8) - 1)
                                                     GFX9
                                                       - sgprs_used 0..112
                                                       - 2 * max(0, ceil(sgprs_used / 16) - 1)
                                                     GFX10
                                                       Reserved, must be 0.
                                                       (128 SGPRs always
                                                       allocated.)

                                                     Where sgprs_used is
                                                     defined as the highest
                                                     SGPR number explicitly
                                                     referenced plus one, plus
                                                     a target specific number
                                                     of additional special
                                                     SGPRs for VCC,
                                                     FLAT_SCRATCH (GFX7+) and
                                                     XNACK_MASK (GFX8+), and
                                                     any additional
                                                     target specific
                                                     limitations. It does not
                                                     include the 16 SGPRs added
                                                     if a trap handler is
                                                     enabled.

                                                     The target specific
                                                     limitations and special
                                                     SGPR layout are defined in
                                                     the hardware
                                                     documentation, which can
                                                     be found in the
                                                     :ref:`amdgpu-processors`
                                                     table.

                                                     Used by CP to set up
                                                     ``COMPUTE_PGM_RSRC1.SGPRS``.

                                                     The
                                                     :ref:`amdgpu-assembler`
                                                     calculates this
                                                     automatically for the
                                                     selected processor from
                                                     values provided to the
                                                     `.amdhsa_kernel` directive
                                                     by the
                                                     `.amdhsa_next_free_sgpr`
                                                     and `.amdhsa_reserve_*`
                                                     nested directives (see
                                                     :ref:`amdhsa-kernel-directives-table`).
     11:10   2 bits  PRIORITY                        Must be 0.

                                                     Start executing wavefront
                                                     at the specified priority.

                                                     CP is responsible for
                                                     filling in
                                                     ``COMPUTE_PGM_RSRC1.PRIORITY``.
     13:12   2 bits  FLOAT_ROUND_MODE_32             Wavefront starts execution
                                                     with specified rounding
                                                     mode for single (32
                                                     bit) floating point
                                                     precision floating point
                                                     operations.

                                                     Floating point rounding
                                                     mode values are defined in
                                                     :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.

                                                     Used by CP to set up
                                                     ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
     15:14   2 bits  FLOAT_ROUND_MODE_16_64          Wavefront starts execution
                                                     with specified rounding
                                                     mode for half/double (16
                                                     and 64-bit) floating point
                                                     precision floating point
                                                     operations.

                                                     Floating point rounding
                                                     mode values are defined in
                                                     :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.

                                                     Used by CP to set up
                                                     ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
     17:16   2 bits  FLOAT_DENORM_MODE_32            Wavefront starts execution
                                                     with specified denorm mode
                                                     for single (32
                                                     bit) floating point
                                                     precision floating point
                                                     operations.

                                                     Floating point denorm mode
                                                     values are defined in
                                                     :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.

                                                     Used by CP to set up
                                                     ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
     19:18   2 bits  FLOAT_DENORM_MODE_16_64         Wavefront starts execution
                                                     with specified denorm mode
                                                     for half/double (16
                                                     and 64-bit) floating point
                                                     precision floating point
                                                     operations.

                                                     Floating point denorm mode
                                                     values are defined in
                                                     :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.

                                                     Used by CP to set up
                                                     ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
     20      1 bit   PRIV                            Must be 0.

                                                     Start executing wavefront
                                                     in privilege trap handler
                                                     mode.

                                                     CP is responsible for
                                                     filling in
                                                     ``COMPUTE_PGM_RSRC1.PRIV``.
     21      1 bit   ENABLE_DX10_CLAMP               Wavefront starts execution
                                                     with DX10 clamp mode
                                                     enabled. Used by the vector
                                                     ALU to force DX10 style
                                                     treatment of NaN's (when
                                                     set, clamp NaN to zero,
                                                     otherwise pass NaN
                                                     through).

                                                     Used by CP to set up
                                                     ``COMPUTE_PGM_RSRC1.DX10_CLAMP``.
     22      1 bit   DEBUG_MODE                      Must be 0.

                                                     Start executing wavefront
                                                     in single step mode.

                                                     CP is responsible for
                                                     filling in
                                                     ``COMPUTE_PGM_RSRC1.DEBUG_MODE``.
     23      1 bit   ENABLE_IEEE_MODE                Wavefront starts execution
                                                     with IEEE mode
                                                     enabled. Floating point
                                                     opcodes that support
                                                     exception flag gathering
                                                     will quiet and propagate
                                                     signaling-NaN inputs per
                                                     IEEE 754-2008. Min_dx10 and
                                                     max_dx10 become IEEE
                                                     754-2008 compliant due to
                                                     signaling-NaN propagation
                                                     and quieting.

                                                     Used by CP to set up
                                                     ``COMPUTE_PGM_RSRC1.IEEE_MODE``.
     24      1 bit   BULKY                           Must be 0.

                                                     Only one work-group allowed
                                                     to execute on a compute
                                                     unit.

                                                     CP is responsible for
                                                     filling in
                                                     ``COMPUTE_PGM_RSRC1.BULKY``.
     25      1 bit   CDBG_USER                       Must be 0.

                                                     Flag that can be used to
                                                     control debugging code.

                                                     CP is responsible for
                                                     filling in
                                                     ``COMPUTE_PGM_RSRC1.CDBG_USER``.
     26      1 bit   FP16_OVFL                       GFX6-GFX8
                                                       Reserved, must be 0.
                                                     GFX9-GFX10
                                                       Wavefront starts execution
                                                       with specified fp16 overflow
                                                       mode.

                                                       - If 0, fp16 overflow generates
                                                         +/-INF values.
                                                       - If 1, fp16 overflow that is the
                                                         result of a +/-INF input value
                                                         or divide by 0 produces a +/-INF,
                                                         otherwise clamps computed
                                                         overflow to +/-MAX_FP16 as
                                                         appropriate.

                                                     Used by CP to set up
                                                     ``COMPUTE_PGM_RSRC1.FP16_OVFL``.
     28:27   2 bits  Reserved, must be 0.
     29      1 bit   WGP_MODE                        GFX6-GFX9
                                                       Reserved, must be 0.
                                                     GFX10
                                                       - If 0 execute work-groups in
                                                         CU wavefront execution mode.
                                                       - If 1 execute work-groups
                                                         in WGP wavefront execution mode.

                                                     See :ref:`amdgpu-amdhsa-memory-model`.

                                                     Used by CP to set up
                                                     ``COMPUTE_PGM_RSRC1.WGP_MODE``.
     30      1 bit   MEM_ORDERED                     GFX6-GFX9
                                                       Reserved, must be 0.
                                                     GFX10
                                                       Controls the behavior of the
                                                       s_waitcnt's vmcnt and vscnt
                                                       counters.

                                                       - If 0 vmcnt reports completion
                                                         of load and atomic with return
                                                         out of order with sample
                                                         instructions, and the vscnt
                                                         reports the completion of
                                                         store and atomic without
                                                         return in order.
                                                       - If 1 vmcnt reports completion
                                                         of load, atomic with return
                                                         and sample instructions in
                                                         order, and the vscnt reports
                                                         the completion of store and
                                                         atomic without return in order.

                                                     Used by CP to set up
                                                     ``COMPUTE_PGM_RSRC1.MEM_ORDERED``.
     31      1 bit   FWD_PROGRESS                    GFX6-GFX9
                                                       Reserved, must be 0.
                                                     GFX10
                                                       - If 0 execute SIMD wavefronts
                                                         using oldest first policy.
                                                       - If 1 execute SIMD wavefronts to
                                                         ensure wavefronts will make some
                                                         forward progress.

                                                     Used by CP to set up
                                                     ``COMPUTE_PGM_RSRC1.FWD_PROGRESS``.
     32              **Total size 4 bytes.**
     ======= ===================================================================================================================

..

  .. table:: compute_pgm_rsrc2 for GFX6-GFX10
     :name: amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table

     ======= ======= =============================== ===========================================================================
     Bits    Size    Field Name                      Description
     ======= ======= =============================== ===========================================================================
     0       1 bit   ENABLE_PRIVATE_SEGMENT          * Enable the setup of the
                                                       private segment.
                                                     * If the *Target Properties*
                                                       column of
                                                       :ref:`amdgpu-processor-table`
                                                       does not specify
                                                       *Architected flat
                                                       scratch* then enable the
                                                       setup of the SGPR
                                                       wavefront scratch offset
                                                       system register (see
                                                       :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
                                                     * If the *Target Properties*
                                                       column of
                                                       :ref:`amdgpu-processor-table`
                                                       specifies *Architected
                                                       flat scratch* then enable
                                                       the setup of the
                                                       FLAT_SCRATCH register
                                                       pair (see
                                                       :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).

                                                     Used by CP to set up
                                                     ``COMPUTE_PGM_RSRC2.SCRATCH_EN``.
     5:1     5 bits  USER_SGPR_COUNT                 The total number of SGPR
                                                     user data registers
                                                     requested. This number must
                                                     match the number of user
                                                     data registers enabled.

                                                     Used by CP to set up
                                                     ``COMPUTE_PGM_RSRC2.USER_SGPR``.
     6       1 bit   ENABLE_TRAP_HANDLER             Must be 0.

                                                     This bit represents
                                                     ``COMPUTE_PGM_RSRC2.TRAP_PRESENT``,
                                                     which is set by the CP if
                                                     the runtime has installed a
                                                     trap handler.
     7       1 bit   ENABLE_SGPR_WORKGROUP_ID_X      Enable the setup of the
                                                     system SGPR register for
                                                     the work-group id in the X
                                                     dimension (see
                                                     :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).

                                                     Used by CP to set up
                                                     ``COMPUTE_PGM_RSRC2.TGID_X_EN``.
     8       1 bit   ENABLE_SGPR_WORKGROUP_ID_Y      Enable the setup of the
                                                     system SGPR register for
                                                     the work-group id in the Y
                                                     dimension (see
                                                     :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).

                                                     Used by CP to set up
                                                     ``COMPUTE_PGM_RSRC2.TGID_Y_EN``.
     9       1 bit   ENABLE_SGPR_WORKGROUP_ID_Z      Enable the setup of the
                                                     system SGPR register for
                                                     the work-group id in the Z
                                                     dimension (see
                                                     :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).

                                                     Used by CP to set up
                                                     ``COMPUTE_PGM_RSRC2.TGID_Z_EN``.
     10      1 bit   ENABLE_SGPR_WORKGROUP_INFO      Enable the setup of the
                                                     system SGPR register for
                                                     work-group information (see
                                                     :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).

                                                     Used by CP to set up
                                                     ``COMPUTE_PGM_RSRC2.TGID_SIZE_EN``.
     12:11   2 bits  ENABLE_VGPR_WORKITEM_ID         Enable the setup of the
                                                     VGPR system registers used
                                                     for the work-item ID.
                                                     :ref:`amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table`
                                                     defines the values.

                                                     Used by CP to set up
                                                     ``COMPUTE_PGM_RSRC2.TIDIG_CMP_CNT``.
     13      1 bit   ENABLE_EXCEPTION_ADDRESS_WATCH  Must be 0.

                                                     Wavefront starts execution
                                                     with address watch
                                                     exceptions enabled which
                                                     are generated when L1 has
                                                     witnessed a thread access
                                                     an *address of
                                                     interest*.

                                                     CP is responsible for
                                                     filling in the address
                                                     watch bit in
                                                     ``COMPUTE_PGM_RSRC2.EXCP_EN_MSB``
                                                     according to what the
                                                     runtime requests.
     14      1 bit   ENABLE_EXCEPTION_MEMORY         Must be 0.

                                                     Wavefront starts execution
                                                     with memory violation
                                                     exceptions enabled which
                                                     are generated when a
                                                     memory violation has
                                                     occurred for this wavefront
                                                     from L1 or LDS
                                                     (write-to-read-only-memory,
                                                     mis-aligned atomic, LDS
                                                     address out of range,
                                                     illegal address, etc.).

                                                     CP sets the memory
                                                     violation bit in
                                                     ``COMPUTE_PGM_RSRC2.EXCP_EN_MSB``
                                                     according to what the
                                                     runtime requests.
     23:15   9 bits  GRANULATED_LDS_SIZE             Must be 0.

                                                     CP uses the rounded value
                                                     from the dispatch packet,
                                                     not this value, as the
                                                     dispatch may contain
                                                     dynamically allocated group
                                                     segment memory. CP writes
                                                     directly to
                                                     ``COMPUTE_PGM_RSRC2.LDS_SIZE``.

                                                     Amount of group segment
                                                     (LDS) to allocate for each
                                                     work-group. Granularity is
                                                     device specific:

                                                     GFX6
                                                       roundup(lds-size / (64 * 4))
                                                     GFX7-GFX10
                                                       roundup(lds-size / (128 * 4))

     24      1 bit   ENABLE_EXCEPTION_IEEE_754_FP    Wavefront starts execution
                     _INVALID_OPERATION              with specified exceptions
                                                     enabled.

                                                     Used by CP to set up
                                                     ``COMPUTE_PGM_RSRC2.EXCP_EN``
                                                     (set from bits 0..6).

                                                     IEEE 754 FP Invalid
                                                     Operation
     25      1 bit   ENABLE_EXCEPTION_FP_DENORMAL    FP Denormal one or more
                     _SOURCE                         input operands is a
                                                     denormal number
     26      1 bit   ENABLE_EXCEPTION_IEEE_754_FP    IEEE 754 FP Division by
                     _DIVISION_BY_ZERO               Zero
     27      1 bit   ENABLE_EXCEPTION_IEEE_754_FP    IEEE 754 FP Overflow
                     _OVERFLOW
     28      1 bit   ENABLE_EXCEPTION_IEEE_754_FP    IEEE 754 FP Underflow
                     _UNDERFLOW
     29      1 bit   ENABLE_EXCEPTION_IEEE_754_FP    IEEE 754 FP Inexact
                     _INEXACT
     30      1 bit   ENABLE_EXCEPTION_INT_DIVIDE_BY  Integer Division by Zero
                     _ZERO                           (rcp_iflag_f32 instruction
                                                     only)
     31      1 bit   Reserved, must be 0.
     32              **Total size 4 bytes.**
     ======= ===================================================================================================================

..

  .. table:: compute_pgm_rsrc3 for GFX90A
     :name: amdgpu-amdhsa-compute_pgm_rsrc3-gfx90a-table

     ======= ======= =============================== ===========================================================================
     Bits    Size    Field Name                      Description
     ======= ======= =============================== ===========================================================================
     5:0     6 bits  ACCUM_OFFSET                    Offset of a first AccVGPR in the unified register file. Granularity 4.
                                                     Value 0-63. 0 - accum-offset = 4, 1 - accum-offset = 8, ...,
                                                     63 - accum-offset = 256.
     15:6    10      Reserved, must be 0.
             bits
     16      1 bit   TG_SPLIT                        - If 0 the waves of a work-group are
                                                       launched in the same CU.
                                                     - If 1 the waves of a work-group can be
                                                       launched in different CUs. The waves
                                                       cannot use S_BARRIER or LDS.
     31:17   15      Reserved, must be 0.
             bits
     32              **Total size 4 bytes.**
     ======= ===================================================================================================================

..

  .. table:: compute_pgm_rsrc3 for GFX10
     :name: amdgpu-amdhsa-compute_pgm_rsrc3-gfx10-table

     ======= ======= =============================== ===========================================================================
     Bits    Size    Field Name                      Description
     ======= ======= =============================== ===========================================================================
     3:0     4 bits  SHARED_VGPR_COUNT               Number of shared VGPRs for wavefront size 64. Granularity 8. Value 0-120.
                                                     compute_pgm_rsrc1.vgprs + shared_vgpr_cnt cannot exceed 64.
     31:4    28      Reserved, must be 0.
             bits
     32              **Total size 4 bytes.**
     ======= ===================================================================================================================

..

  .. table:: Floating Point Rounding Mode Enumeration Values
     :name: amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table

     ====================================== ===== ==============================
     Enumeration Name                       Value Description
     ====================================== ===== ==============================
     FLOAT_ROUND_MODE_NEAR_EVEN             0     Round Ties To Even
     FLOAT_ROUND_MODE_PLUS_INFINITY         1     Round Toward +infinity
     FLOAT_ROUND_MODE_MINUS_INFINITY        2     Round Toward -infinity
     FLOAT_ROUND_MODE_ZERO                  3     Round Toward 0
     ====================================== ===== ==============================

..

  .. table:: Floating Point Denorm Mode Enumeration Values
     :name: amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table

     ====================================== ===== ==============================
     Enumeration Name                       Value Description
     ====================================== ===== ==============================
     FLOAT_DENORM_MODE_FLUSH_SRC_DST        0     Flush Source and Destination
                                                  Denorms
     FLOAT_DENORM_MODE_FLUSH_DST            1     Flush Output Denorms
     FLOAT_DENORM_MODE_FLUSH_SRC            2     Flush Source Denorms
     FLOAT_DENORM_MODE_FLUSH_NONE           3     No Flush
     ====================================== ===== ==============================

..

  .. table:: System VGPR Work-Item ID Enumeration Values
     :name: amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table

     ======================================== ===== ============================
     Enumeration Name                         Value Description
     ======================================== ===== ============================
     SYSTEM_VGPR_WORKITEM_ID_X                0     Set work-item X dimension
                                                    ID.
     SYSTEM_VGPR_WORKITEM_ID_X_Y              1     Set work-item X and Y
                                                    dimensions ID.
     SYSTEM_VGPR_WORKITEM_ID_X_Y_Z            2     Set work-item X, Y and Z
                                                    dimensions ID.
     SYSTEM_VGPR_WORKITEM_ID_UNDEFINED        3     Undefined.
     ======================================== ===== ============================

.. _amdgpu-amdhsa-initial-kernel-execution-state:

Initial Kernel Execution State
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This section defines the register state that will be set up by the packet
processor prior to the start of execution of every wavefront. This is limited by
the constraints of the hardware controllers of CP/ADC/SPI.

The order of the SGPR registers is defined, but the compiler can specify which
ones are actually setup in the kernel descriptor using the ``enable_sgpr_*`` bit
fields (see :ref:`amdgpu-amdhsa-kernel-descriptor`). The register numbers used
for enabled registers are dense starting at SGPR0: the first enabled register is
SGPR0, the next enabled register is SGPR1 etc.; disabled registers do not have
an SGPR number.

The initial SGPRs comprise up to 16 User SGPRs that are set by CP and apply to
all wavefronts of the grid. It is possible to specify more than 16 User SGPRs
using the ``enable_sgpr_*`` bit fields, in which case only the first 16 are
actually initialized. These are then immediately followed by the System SGPRs
that are set up by ADC/SPI and can have different values for each wavefront of
the grid dispatch.

SGPR register initial state is defined in
:ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.

  .. table:: SGPR Register Set Up Order
     :name: amdgpu-amdhsa-sgpr-register-set-up-order-table

     ========== ========================== ====== ==============================
     SGPR Order Name                       Number Description
                (kernel descriptor enable  of
                field)                     SGPRs
     ========== ========================== ====== ==============================
     First      Private Segment Buffer     4      See
                (enable_sgpr_private              :ref:`amdgpu-amdhsa-kernel-prolog-private-segment-buffer`.
                _segment_buffer)
     then       Dispatch Ptr               2      64-bit address of AQL dispatch
                (enable_sgpr_dispatch_ptr)        packet for kernel dispatch
                                                  actually executing.
     then       Queue Ptr                  2      64-bit address of amd_queue_t
                (enable_sgpr_queue_ptr)           object for AQL queue on which
                                                  the dispatch packet was
                                                  queued.
     then       Kernarg Segment Ptr        2      64-bit address of Kernarg
                (enable_sgpr_kernarg              segment. This is directly
                _segment_ptr)                     copied from the
                                                  kernarg_address in the kernel
                                                  dispatch packet.

                                                  Having CP load it once avoids
                                                  loading it at the beginning of
                                                  every wavefront.
     then       Dispatch Id                2      64-bit Dispatch ID of the
                (enable_sgpr_dispatch_id)         dispatch packet being
                                                  executed.
     then       Flat Scratch Init          2      See
                (enable_sgpr_flat_scratch         :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`.
                _init)
     then       Private Segment Size       1      The 32-bit byte size of a
                (enable_sgpr_private              single work-item's memory
                _segment_size)                    allocation. This is the
                                                  value from the kernel
                                                  dispatch packet Private
                                                  Segment Byte Size rounded up
                                                  by CP to a multiple of
                                                  DWORD.

                                                  Having CP load it once avoids
                                                  loading it at the beginning of
                                                  every wavefront.

                                                  This is not used for
                                                  GFX7-GFX8 since it is the same
                                                  value as the second SGPR of
                                                  Flat Scratch Init. However, it
                                                  may be needed for GFX9-GFX10 which
                                                  changes the meaning of the
                                                  Flat Scratch Init value.
     then       Work-Group Id X            1      32-bit work-group id in X
                (enable_sgpr_workgroup_id         dimension of grid for
                _X)                               wavefront.
     then       Work-Group Id Y            1      32-bit work-group id in Y
                (enable_sgpr_workgroup_id         dimension of grid for
                _Y)                               wavefront.
     then       Work-Group Id Z            1      32-bit work-group id in Z
                (enable_sgpr_workgroup_id         dimension of grid for
                _Z)                               wavefront.
     then       Work-Group Info            1      {first_wavefront, 14'b0000,
                (enable_sgpr_workgroup            ordered_append_term[10:0],
                _info)                            threadgroup_size_in_wavefronts[5:0]}
     then       Scratch Wavefront Offset   1      See
                (enable_sgpr_private              :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`
                _segment_wavefront_offset)        and
                                                  :ref:`amdgpu-amdhsa-kernel-prolog-private-segment-buffer`.
     ========== ========================== ====== ==============================

The order of the VGPR registers is defined, but the compiler can specify which
ones are actually set up in the kernel descriptor using the ``enable_vgpr*`` bit
fields (see :ref:`amdgpu-amdhsa-kernel-descriptor`). The register numbers used
for enabled registers are dense starting at VGPR0: the first enabled register is
VGPR0, the next enabled register is VGPR1, etc.; disabled registers do not have
a VGPR number.

There are different methods used for the VGPR initial state:

* Unless the *Target Properties* column of :ref:`amdgpu-processor-table`
  specifies otherwise, a separate VGPR register is used per work-item ID. The
  VGPR register initial state for this method is defined in
  :ref:`amdgpu-amdhsa-vgpr-register-set-up-order-for-unpacked-work-item-id-method-table`.
* If the *Target Properties* column of :ref:`amdgpu-processor-table`
  specifies *Packed work-item IDs*, the initial value of the VGPR0 register is
  used for all work-item IDs. The register layout for this method is defined in
  :ref:`amdgpu-amdhsa-register-layout-for-packed-work-item-id-method-table`.

  .. table:: VGPR Register Set Up Order for Unpacked Work-Item ID Method
     :name: amdgpu-amdhsa-vgpr-register-set-up-order-for-unpacked-work-item-id-method-table

     ========== ========================== ====== ==============================
     VGPR Order Name                       Number Description
                (kernel descriptor enable  of
                field)                     VGPRs
     ========== ========================== ====== ==============================
     First      Work-Item Id X             1      32-bit work-item id in X
                (Always initialized)              dimension of work-group for
                                                  wavefront lane.
     then       Work-Item Id Y             1      32-bit work-item id in Y
                (enable_vgpr_workitem_id          dimension of work-group for
                > 0)                              wavefront lane.
     then       Work-Item Id Z             1      32-bit work-item id in Z
                (enable_vgpr_workitem_id          dimension of work-group for
                > 1)                              wavefront lane.
     ========== ========================== ====== ==============================

..

  .. table:: Register Layout for Packed Work-Item ID Method
     :name: amdgpu-amdhsa-register-layout-for-packed-work-item-id-method-table

     ======= ======= ================ =========================================
     Bits    Size    Field Name       Description
     ======= ======= ================ =========================================
     0:9     10 bits Work-Item Id X   Work-item id in X dimension of
                                      work-group for wavefront lane.

                                      Always initialized.

     10:19   10 bits Work-Item Id Y   Work-item id in Y dimension of
                                      work-group for wavefront lane.

                                      Initialized if enable_vgpr_workitem_id >
                                      0, otherwise set to 0.
     20:29   10 bits Work-Item Id Z   Work-item id in Z dimension of
                                      work-group for wavefront lane.

                                      Initialized if enable_vgpr_workitem_id >
                                      1, otherwise set to 0.
     30:31   2 bits                   Reserved, set to 0.
     ======= ======= ================ =========================================

The setting of registers is done by GPU CP/ADC/SPI hardware as follows:

1. SGPRs before the Work-Group Ids are set by CP using the 16 User Data
   registers.
2. Work-group Id registers X, Y, Z are set by ADC which supports any
   combination including none.
3. Scratch Wavefront Offset is set by SPI on a per-wavefront basis, which is
   why its value cannot be included with the flat scratch init value, which is
   per queue (see :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`).
4. The VGPRs are set by SPI which only supports specifying either (X), (X, Y)
   or (X, Y, Z).
5. Flat Scratch register pair initialization is described in
   :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`.

The global segment can be accessed either using buffer instructions (GFX6 which
has V# 64-bit address support), flat instructions (GFX7-GFX10), or global
instructions (GFX9-GFX10).

If buffer operations are used, then the compiler can generate a V# with the
following properties:

* base address of 0
* no swizzle
* ATC: 1 if IOMMU present (such as APU)
* ptr64: 1
* MTYPE set to support memory coherence that matches the runtime (such as CC for
  APU and NC for dGPU).

.. _amdgpu-amdhsa-kernel-prolog:

Kernel Prolog
~~~~~~~~~~~~~

The compiler performs initialization in the kernel prologue depending on the
target and information about things like stack usage in the kernel and called
functions. Some of this initialization requires the compiler to request certain
User and System SGPRs be present in the
:ref:`amdgpu-amdhsa-initial-kernel-execution-state` via the
:ref:`amdgpu-amdhsa-kernel-descriptor`.

.. _amdgpu-amdhsa-kernel-prolog-cfi:

CFI
+++

1. The CFI return address is undefined.

2. The CFI CFA is defined using an expression which evaluates to a location
   description that comprises one memory location description for the
   ``DW_ASPACE_AMDGPU_private_lane`` address space address ``0``.
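
For targets that use the packed work-item ID method described in
:ref:`amdgpu-amdhsa-register-layout-for-packed-work-item-id-method-table`, the
three IDs occupy 10-bit fields of VGPR0 and can be recovered by shift-and-mask.
The following is a host-side Python sketch of that bit layout, for illustration
only (the names ``pack_workitem_id``/``unpack_workitem_id`` are hypothetical,
not backend APIs):

```python
# Bit layout from the packed work-item ID table:
#   bits 0:9  = Work-Item Id X (always initialized)
#   bits 10:19 = Work-Item Id Y
#   bits 20:29 = Work-Item Id Z
#   bits 30:31 = reserved, set to 0

def pack_workitem_id(x: int, y: int, z: int) -> int:
    """Pack three 10-bit work-item IDs into a single 32-bit value."""
    assert 0 <= x < 1024 and 0 <= y < 1024 and 0 <= z < 1024
    return x | (y << 10) | (z << 20)

def unpack_workitem_id(vgpr0: int) -> tuple[int, int, int]:
    """Recover (X, Y, Z) from the packed register value, ignoring bits 30:31."""
    return (vgpr0 & 0x3FF, (vgpr0 >> 10) & 0x3FF, (vgpr0 >> 20) & 0x3FF)
```

On the GPU the same decoding is done with shift/and instructions on VGPR0; a
work-item ID that is not enabled via enable_vgpr_workitem_id simply reads as 0.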

.. _amdgpu-amdhsa-kernel-prolog-m0:

M0
++

GFX6-GFX8
  The M0 register must be initialized with a value at least as large as the
  total LDS size if the kernel may access LDS via DS or flat operations. The
  total LDS size is available in the dispatch packet. For M0, it is also
  possible to use the maximum possible LDS value for the given target (0x7FFF
  for GFX6 and 0xFFFF for GFX7-GFX8).
GFX9-GFX10
  The M0 register is not used for range checking LDS accesses and so does not
  need to be initialized in the prolog.

.. _amdgpu-amdhsa-kernel-prolog-stack-pointer:

Stack Pointer
+++++++++++++

If the kernel has function calls it must set up the ABI stack pointer described
in :ref:`amdgpu-amdhsa-function-call-convention-non-kernel-functions` by setting
SGPR32 to the unswizzled scratch offset of the address past the last local
allocation.

.. _amdgpu-amdhsa-kernel-prolog-frame-pointer:

Frame Pointer
+++++++++++++

If the kernel needs a frame pointer for the reasons defined in
``SIFrameLowering`` then SGPR33 is used and is always set to ``0`` in the
kernel prolog. If a frame pointer is not required then all uses of the frame
pointer are replaced with immediate ``0`` offsets.

.. _amdgpu-amdhsa-kernel-prolog-flat-scratch:

Flat Scratch
++++++++++++

There are different methods used for initializing flat scratch:

* If the *Target Properties* column of :ref:`amdgpu-processor-table`
  specifies *Does not support generic address space*:

  Flat scratch is not supported and there is no flat scratch register pair.

* If the *Target Properties* column of :ref:`amdgpu-processor-table`
  specifies *Offset flat scratch*:

  If the kernel or any function it calls may use flat operations to access
  scratch memory, the prolog code must set up the FLAT_SCRATCH register pair
  (FLAT_SCRATCH_LO/FLAT_SCRATCH_HI). Initialization uses the Flat Scratch Init
  and Scratch Wavefront Offset SGPR registers (see
  :ref:`amdgpu-amdhsa-initial-kernel-execution-state`):

  1. The low word of Flat Scratch Init is the 32-bit byte offset from
     ``SH_HIDDEN_PRIVATE_BASE_VIMID`` to the base of scratch backing memory
     being managed by SPI for the queue executing the kernel dispatch. This is
     the same value used in the Scratch Segment Buffer V# base address.

     CP obtains this from the runtime. (The Scratch Segment Buffer base address
     is ``SH_HIDDEN_PRIVATE_BASE_VIMID`` plus this offset.)

     The prolog must add the value of Scratch Wavefront Offset to get the
     wavefront's byte scratch backing memory offset from
     ``SH_HIDDEN_PRIVATE_BASE_VIMID``.

     The Scratch Wavefront Offset must also be used as an offset with the
     private segment address when using the Scratch Segment Buffer.

     Since FLAT_SCRATCH_LO is in units of 256 bytes, the offset must be right
     shifted by 8 before moving into FLAT_SCRATCH_HI.

     FLAT_SCRATCH_HI corresponds to SGPRn-4 on GFX7, and SGPRn-6 on GFX8 (where
     SGPRn is the highest numbered SGPR allocated to the wavefront).
     FLAT_SCRATCH_HI is multiplied by 256 (as it is in units of 256 bytes) and
     added to ``SH_HIDDEN_PRIVATE_BASE_VIMID`` to calculate the per wavefront
     FLAT SCRATCH BASE in flat memory instructions that access the scratch
     aperture.
  2. The second word of Flat Scratch Init is the 32-bit byte size of a single
     work-item's scratch memory usage.

     CP obtains this from the runtime, and it is always a multiple of DWORD. CP
     checks that the value in the kernel dispatch packet Private Segment Byte
     Size is not larger, and requests the runtime to increase the queue's
     scratch size if necessary.

     CP directly loads from the kernel dispatch packet Private Segment Byte
     Size field and rounds up to a multiple of DWORD. Having CP load it once
     avoids loading it at the beginning of every wavefront.

     The kernel prolog code must move it to FLAT_SCRATCH_LO which is SGPRn-3 on
     GFX7 and SGPRn-5 on GFX8. FLAT_SCRATCH_LO is used as the FLAT SCRATCH SIZE
     in flat memory instructions.

* If the *Target Properties* column of :ref:`amdgpu-processor-table`
  specifies *Absolute flat scratch*:

  If the kernel or any function it calls may use flat operations to access
  scratch memory, the prolog code must set up the FLAT_SCRATCH register pair
  (FLAT_SCRATCH_LO/FLAT_SCRATCH_HI which are in SGPRn-4/SGPRn-3).
  Initialization uses the Flat Scratch Init and Scratch Wavefront Offset SGPR
  registers (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`):

  The Flat Scratch Init is the 64-bit address of the base of scratch backing
  memory being managed by SPI for the queue executing the kernel dispatch.

  CP obtains this from the runtime.

  The kernel prolog must add the value of the wave's Scratch Wavefront Offset
  and move the result as a 64-bit value to the FLAT_SCRATCH SGPR register pair
  which is SGPRn-6 and SGPRn-5. It is used as the FLAT SCRATCH BASE in flat
  memory instructions.

  The Scratch Wavefront Offset must also be used as an offset with the private
  segment address when using the Scratch Segment Buffer (see
  :ref:`amdgpu-amdhsa-kernel-prolog-private-segment-buffer`).

* If the *Target Properties* column of :ref:`amdgpu-processor-table`
  specifies *Architected flat scratch*:

  If ENABLE_PRIVATE_SEGMENT is enabled in
  :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table` then the FLAT_SCRATCH
  register pair will be initialized to the 64-bit address of the base of scratch
  backing memory being managed by SPI for the queue executing the kernel
  dispatch plus the value of the wave's Scratch Wavefront Offset for use as the
  flat scratch base in flat memory instructions.

.. _amdgpu-amdhsa-kernel-prolog-private-segment-buffer:

Private Segment Buffer
++++++++++++++++++++++

If the *Target Properties* column of :ref:`amdgpu-processor-table` specifies
*Architected flat scratch* then a Private Segment Buffer is not supported.
Instead the flat SCRATCH instructions are used.

Otherwise, the Private Segment Buffer SGPR register is used to initialize 4
SGPRs that are used as a V# to access scratch. CP uses the value provided by the
runtime. It is used, together with Scratch Wavefront Offset as an offset, to
access the private memory space using a segment address. See
:ref:`amdgpu-amdhsa-initial-kernel-execution-state`.

The scratch V# is a four-aligned SGPR and always selected for the kernel as
follows:

  - If it is known during instruction selection that there is stack usage,
    SGPR0-3 is reserved for use as the scratch V#. Stack usage is assumed if
    optimizations are disabled (``-O0``), if stack objects already exist (for
    locals, etc.), or if there are any function calls.

  - Otherwise, four high numbered SGPRs beginning at a four-aligned SGPR index
    are reserved for the tentative scratch V#. These will be used if it is
    determined that spilling is needed.

    - If no use is made of the tentative scratch V#, then it is unreserved,
      and the register count is determined ignoring it.
    - If use is made of the tentative scratch V#, then its register numbers
      are shifted to the first four-aligned SGPR index after the highest one
      allocated by the register allocator, and all uses are updated. The
      register count includes them in the shifted location.
    - In either case, if the processor has the SGPR allocation bug, the
      tentative allocation is not shifted or unreserved in order to ensure
      the register count is higher to work around the bug.

    .. note::

      This approach of using a tentative scratch V# and shifting the register
      numbers if used avoids having to perform register allocation a second
      time if the tentative V# is eliminated. This is more efficient and
      avoids the problem that the second register allocation may perform
      spilling which will fail as there is no longer a scratch V#.

When the kernel prolog code is being emitted it is known whether the scratch V#
described above is actually used. If it is, the prolog code must set it up by
copying the Private Segment Buffer to the scratch V# registers and then adding
the Private Segment Wavefront Offset to the queue base address in the V#. The
result is a V# with a base address pointing to the beginning of the wavefront
scratch backing memory.

The Private Segment Buffer is always requested, but the Private Segment
Wavefront Offset is only requested if it is used (see
:ref:`amdgpu-amdhsa-initial-kernel-execution-state`).

.. _amdgpu-amdhsa-memory-model:

Memory Model
~~~~~~~~~~~~

This section describes the mapping of the LLVM memory model onto AMDGPU machine
code (see :ref:`memmodel`).

The AMDGPU backend supports the memory synchronization scopes specified in
:ref:`amdgpu-memory-scopes`.

The code sequences used to implement the memory model specify the order of
instructions that a single thread must execute. The ``s_waitcnt`` and cache
management instructions such as ``buffer_wbinvl1_vol`` are defined with respect
to other memory instructions executed by the same thread. This allows them to be
moved earlier or later, which can allow them to be combined with other instances
of the same instruction, or hoisted/sunk out of loops to improve performance.
Only the instructions related to the memory model are given; additional
``s_waitcnt`` instructions are required to ensure registers are defined before
being used. These may be able to be combined with the memory model ``s_waitcnt``
instructions as described above.

The AMDGPU backend supports the following memory models:

  HSA Memory Model [HSA]_
    The HSA memory model uses a single happens-before relation for all address
    spaces (see :ref:`amdgpu-address-spaces`).
  OpenCL Memory Model [OpenCL]_
    The OpenCL memory model has separate happens-before relations for the
    global and local address spaces. Only a fence specifying both global and
    local address space, and seq_cst instructions, join the relations. Since
    the LLVM ``fence`` instruction does not allow an address space to be
    specified, the OpenCL fence has to conservatively assume both local and
    global address space was specified. However, optimizations can often be
    done to eliminate the additional ``s_waitcnt`` instructions when there are
    no intervening memory instructions which access the corresponding address
    space. The code sequences in the table indicate what can be omitted for the
    OpenCL memory model. The target triple environment is used to determine if
    the source language is OpenCL (see :ref:`amdgpu-opencl`).

``ds/flat_load/store/atomic`` instructions to local memory are termed LDS
operations.

``buffer/global/flat_load/store/atomic`` instructions to global memory are
termed vector memory operations.
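
These two definitions can be restated as a toy classifier. The sketch below is
purely illustrative (``classify_memory_op`` is a hypothetical helper, not a
backend API); the key point it encodes is that ``flat`` instructions can fall
into either class depending on the address space actually accessed:

```python
def classify_memory_op(mnemonic: str, address_space: str) -> str:
    """Classify an instruction per the definitions above."""
    kind = mnemonic.split("_", 1)[0]  # e.g. "ds", "flat", "buffer", "global"
    if address_space == "local" and kind in ("ds", "flat"):
        # ds/flat accesses to local memory are LDS operations.
        return "LDS operation"
    if address_space == "global" and kind in ("buffer", "global", "flat"):
        # buffer/global/flat accesses to global memory are vector memory
        # operations.
        return "vector memory operation"
    raise ValueError(f"unsupported combination: {mnemonic}, {address_space}")
```

The distinction matters below because completion of the two classes is tracked
by different counters (``lgkmcnt`` for LDS operations, ``vmcnt`` for vector
memory operations).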

Private address space uses ``buffer_load/store`` using the scratch V#
(GFX6-GFX8), or ``scratch_load/store`` (GFX9-GFX10). Since only a single thread
is accessing the memory, atomic memory orderings are not meaningful, and all
accesses are treated as non-atomic.

Constant address space uses ``buffer/global_load`` instructions (or equivalent
scalar memory instructions). Since the constant address space contents do not
change during the execution of a kernel dispatch, it is not legal to perform
stores, atomic memory orderings are not meaningful, and all accesses are
treated as non-atomic.

A memory synchronization scope wider than work-group is not meaningful for the
group (LDS) address space and is treated as work-group.

The memory model does not support the region address space, which is treated as
non-atomic.

Acquire memory ordering is not meaningful on store atomic instructions and is
treated as non-atomic.

Release memory ordering is not meaningful on load atomic instructions and is
treated as non-atomic.

Acquire-release memory ordering is not meaningful on load or store atomic
instructions and is treated as acquire and release respectively.

The memory order also adds the single thread optimization constraints defined in
table
:ref:`amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-table`.

  .. table:: AMDHSA Memory Model Single Thread Optimization Constraints
     :name: amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-table

     ============ ==============================================================
     LLVM Memory  Optimization Constraints
     Ordering
     ============ ==============================================================
     unordered    *none*
     monotonic    *none*
     acquire      - If a load atomic/atomicrmw then no following load/load
                    atomic/store/store atomic/atomicrmw/fence instruction can be
                    moved before the acquire.
                  - If a fence then same as load atomic, plus no preceding
                    associated fence-paired-atomic can be moved after the fence.
     release      - If a store atomic/atomicrmw then no preceding load/load
                    atomic/store/store atomic/atomicrmw/fence instruction can be
                    moved after the release.
                  - If a fence then same as store atomic, plus no following
                    associated fence-paired-atomic can be moved before the
                    fence.
     acq_rel      Same constraints as both acquire and release.
     seq_cst      - If a load atomic then same constraints as acquire, plus no
                    preceding sequentially consistent load atomic/store
                    atomic/atomicrmw/fence instruction can be moved after the
                    seq_cst.
                  - If a store atomic then the same constraints as release, plus
                    no following sequentially consistent load atomic/store
                    atomic/atomicrmw/fence instruction can be moved before the
                    seq_cst.
                  - If an atomicrmw/fence then same constraints as acq_rel.
     ============ ==============================================================

The code sequences used to implement the memory model are defined in the
following sections:

* :ref:`amdgpu-amdhsa-memory-model-gfx6-gfx9`
* :ref:`amdgpu-amdhsa-memory-model-gfx90a`
* :ref:`amdgpu-amdhsa-memory-model-gfx10`

.. _amdgpu-amdhsa-memory-model-gfx6-gfx9:

Memory Model GFX6-GFX9
++++++++++++++++++++++

For GFX6-GFX9:

* Each agent has multiple shader arrays (SA).
* Each SA has multiple compute units (CU).
* Each CU has multiple SIMDs that execute wavefronts.
* The wavefronts for a single work-group are executed in the same CU but may be
  executed by different SIMDs.
* Each CU has a single LDS memory shared by the wavefronts of the work-groups
  executing on it.
* All LDS operations of a CU are performed as wavefront wide operations in a
  global order and involve no caching. Completion is reported to a wavefront in
  execution order.
* The LDS memory has multiple request queues shared by the SIMDs of a
  CU. Therefore, the LDS operations performed by different wavefronts of a
  work-group can be reordered relative to each other, which can result in
  reordering the visibility of vector memory operations with respect to LDS
  operations of other wavefronts in the same work-group. An ``s_waitcnt
  lgkmcnt(0)`` is required to ensure synchronization between LDS operations and
  vector memory operations between wavefronts of a work-group, but not between
  operations performed by the same wavefront.
* The vector memory operations are performed as wavefront wide operations and
  completion is reported to a wavefront in execution order. The exception is
  that for GFX7-GFX9 ``flat_load/store/atomic`` instructions can report out of
  vector memory order if they access LDS memory, and out of LDS operation order
  if they access global memory.
* The vector memory operations access a single vector L1 cache shared by all
  SIMDs of a CU. Therefore, no special action is required for coherence between
  the lanes of a single wavefront, or for coherence between wavefronts in the
  same work-group. A ``buffer_wbinvl1_vol`` is required for coherence between
  wavefronts executing in different work-groups as they may be executing on
  different CUs.
* The scalar memory operations access a scalar L1 cache shared by all wavefronts
  on a group of CUs. The scalar and vector L1 caches are not coherent. However,
  scalar operations are used in a restricted way so do not impact the memory
  model. See :ref:`amdgpu-amdhsa-memory-spaces`.
* The vector and scalar memory operations use an L2 cache shared by all CUs on
  the same agent.
* The L2 cache has independent channels to service disjoint ranges of virtual
  addresses.
* Each CU has a separate request queue per channel. Therefore, the vector and
  scalar memory operations performed by wavefronts executing in different
  work-groups (which may be executing on different CUs) of an agent can be
  reordered relative to each other. An ``s_waitcnt vmcnt(0)`` is required to
  ensure synchronization between vector memory operations of different CUs. It
  ensures a previous vector memory operation has completed before executing a
  subsequent vector memory or LDS operation and so can be used to meet the
  requirements of acquire and release.
* The L2 cache can be kept coherent with other agents on some targets, or ranges
  of virtual addresses can be set up to bypass it to ensure system coherence.

Scalar memory operations are only used to access memory that is proven to not
change during the execution of the kernel dispatch. This includes constant
address space and global address space for program scope ``const`` variables.
Therefore, the kernel machine code does not have to maintain the scalar cache to
ensure it is coherent with the vector caches. The scalar and vector caches are
invalidated between kernel dispatches by CP since constant address space data
may change between kernel dispatch executions. See
:ref:`amdgpu-amdhsa-memory-spaces`.

The one exception is if scalar writes are used to spill SGPR registers. In this
case the AMDGPU backend ensures the memory location used to spill is never
accessed by vector memory operations at the same time. If scalar writes are used
then an ``s_dcache_wb`` is inserted before the ``s_endpgm`` and before a
function return since the locations may be used for vector memory instructions
by a future wavefront that uses the same scratch area, or a function call that
creates a frame at the same address, respectively. There is no need for an
``s_dcache_inv`` as all scalar writes are write-before-read in the same thread.

For kernarg backing memory:

* CP invalidates the L1 cache at the start of each kernel dispatch.
* On dGPU the kernarg backing memory is allocated in host memory accessed as
  MTYPE UC (uncached) to avoid needing to invalidate the L2 cache. This also
  causes it to be treated as non-volatile and so is not invalidated by
  ``*_vol``.
* On APU the kernarg backing memory is accessed as MTYPE CC (cache coherent)
  and so the L2 cache will be coherent with the CPU and other agents.

Scratch backing memory (which is used for the private address space) is accessed
with MTYPE NC_NV (non-coherent non-volatile). Since the private address space is
only accessed by a single thread, and is always write-before-read, there is
never a need to invalidate these entries from the L1 cache. Hence all cache
invalidates are done as ``*_vol`` to only invalidate the volatile cache lines.

The code sequences used to implement the memory model for GFX6-GFX9 are defined
in table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx9-table`.

  .. table:: AMDHSA Memory Model Code Sequences GFX6-GFX9
     :name: amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx9-table

     ============ ============ ============== ========== ================================
     LLVM Instr   LLVM Memory  LLVM Memory    AMDGPU     AMDGPU Machine Code
                  Ordering     Sync Scope     Address    GFX6-GFX9
                                              Space
     ============ ============ ============== ========== ================================
     **Non-Atomic**
     ------------------------------------------------------------------------------------
     load         *none*       *none*         - global   - !volatile & !nontemporal
                                              - generic
                                              - private  1. buffer/global/flat_load
                                              - constant
                                                         - !volatile & nontemporal

                                                         1. buffer/global/flat_load
                                                            glc=1 slc=1

                                                         - volatile

                                                         1. buffer/global/flat_load
                                                            glc=1
                                                         2. s_waitcnt vmcnt(0)

                                                          - Must happen before
                                                            any following volatile
                                                            global/generic
                                                            load/store.
                                                          - Ensures that
                                                            volatile
                                                            operations to
                                                            different
                                                            addresses will not
                                                            be reordered by
                                                            hardware.

     load         *none*       *none*         - local    1. ds_load
     store        *none*       *none*         - global   - !volatile & !nontemporal
                                              - generic
                                              - private  1. buffer/global/flat_store
                                              - constant
                                                         - !volatile & nontemporal

                                                         1. buffer/global/flat_store
                                                            glc=1 slc=1

                                                         - volatile

                                                         1. buffer/global/flat_store
                                                         2. s_waitcnt vmcnt(0)

                                                          - Must happen before
                                                            any following volatile
                                                            global/generic
                                                            load/store.
                                                          - Ensures that
                                                            volatile
                                                            operations to
                                                            different
                                                            addresses will not
                                                            be reordered by
                                                            hardware.

     store        *none*       *none*         - local    1. ds_store
     **Unordered Atomic**
     ------------------------------------------------------------------------------------
     load atomic  unordered    *any*          *any*      *Same as non-atomic*.
     store atomic unordered    *any*          *any*      *Same as non-atomic*.
     atomicrmw    unordered    *any*          *any*      *Same as monotonic atomic*.
     **Monotonic Atomic**
     ------------------------------------------------------------------------------------
     load atomic  monotonic    - singlethread - global   1. buffer/global/ds/flat_load
                               - wavefront    - local
                               - workgroup    - generic
     load atomic  monotonic    - agent        - global   1. buffer/global/flat_load
                               - system       - generic     glc=1
     store atomic monotonic    - singlethread - global   1. buffer/global/flat_store
                               - wavefront    - generic
                               - workgroup
                               - agent
                               - system
     store atomic monotonic    - singlethread - local    1. ds_store
                               - wavefront
                               - workgroup
     atomicrmw    monotonic    - singlethread - global   1. buffer/global/flat_atomic
                               - wavefront    - generic
                               - workgroup
                               - agent
                               - system
     atomicrmw    monotonic    - singlethread - local    1. ds_atomic
                               - wavefront
                               - workgroup
     **Acquire Atomic**
     ------------------------------------------------------------------------------------
     load atomic  acquire      - singlethread - global   1. buffer/global/ds/flat_load
                               - wavefront    - local
                                              - generic
     load atomic  acquire      - workgroup    - global   1. buffer/global_load
     load atomic  acquire      - workgroup    - local    1. ds/flat_load
                                              - generic  2. s_waitcnt lgkmcnt(0)

                                                          - If OpenCL, omit.
                                                          - Must happen before
                                                            any following
                                                            global/generic
                                                            load/load
                                                            atomic/store/store
                                                            atomic/atomicrmw.
                                                          - Ensures any
                                                            following global
                                                            data read is no
                                                            older than a local load
                                                            atomic value being
                                                            acquired.

     load atomic  acquire      - agent        - global   1. buffer/global_load
                               - system                     glc=1
                                                         2. s_waitcnt vmcnt(0)

                                                          - Must happen before
                                                            following
                                                            buffer_wbinvl1_vol.
                                                          - Ensures the load
                                                            has completed
                                                            before invalidating
                                                            the cache.

                                                         3. buffer_wbinvl1_vol

                                                          - Must happen before
                                                            any following
                                                            global/generic
                                                            load/load
                                                            atomic/atomicrmw.
                                                          - Ensures that
                                                            following
                                                            loads will not see
                                                            stale global data.

     load atomic  acquire      - agent        - generic  1. flat_load glc=1
                               - system                  2. s_waitcnt vmcnt(0) &
                                                            lgkmcnt(0)

                                                          - If OpenCL, omit
                                                            lgkmcnt(0).
                                                          - Must happen before
                                                            following
                                                            buffer_wbinvl1_vol.
                                                          - Ensures the flat_load
                                                            has completed
                                                            before invalidating
                                                            the cache.

                                                         3. buffer_wbinvl1_vol

                                                          - Must happen before
                                                            any following
                                                            global/generic
                                                            load/load
                                                            atomic/atomicrmw.
                                                          - Ensures that
                                                            following loads
                                                            will not see stale
                                                            global data.

     atomicrmw    acquire      - singlethread - global   1. buffer/global/ds/flat_atomic
                               - wavefront    - local
                                              - generic
     atomicrmw    acquire      - workgroup    - global   1. buffer/global_atomic
     atomicrmw    acquire      - workgroup    - local    1. ds/flat_atomic
                                              - generic  2. s_waitcnt lgkmcnt(0)

                                                          - If OpenCL, omit.
                                                          - Must happen before
                                                            any following
                                                            global/generic
                                                            load/load
                                                            atomic/store/store
                                                            atomic/atomicrmw.
                                                          - Ensures any
                                                            following global
                                                            data read is no
                                                            older than a local
                                                            atomicrmw value
                                                            being acquired.

     atomicrmw    acquire      - agent        - global   1. buffer/global_atomic
                               - system                  2. s_waitcnt vmcnt(0)

                                                          - Must happen before
                                                            following
                                                            buffer_wbinvl1_vol.
                                                          - Ensures the
                                                            atomicrmw has
                                                            completed before
                                                            invalidating the
                                                            cache.

                                                         3. buffer_wbinvl1_vol

                                                          - Must happen before
                                                            any following
                                                            global/generic
                                                            load/load
                                                            atomic/atomicrmw.
                                                          - Ensures that
                                                            following loads
                                                            will not see stale
                                                            global data.

     atomicrmw    acquire      - agent        - generic  1. flat_atomic
                               - system                  2. s_waitcnt vmcnt(0) &
                                                            lgkmcnt(0)

                                                          - If OpenCL, omit
                                                            lgkmcnt(0).
                                                          - Must happen before
                                                            following
                                                            buffer_wbinvl1_vol.
                                                          - Ensures the
                                                            atomicrmw has
                                                            completed before
                                                            invalidating the
                                                            cache.

                                                         3. buffer_wbinvl1_vol

                                                          - Must happen before
                                                            any following
                                                            global/generic
                                                            load/load
                                                            atomic/atomicrmw.
                                                          - Ensures that
                                                            following loads
                                                            will not see stale
                                                            global data.

     fence        acquire      - singlethread *none*     *none*
                               - wavefront
     fence        acquire      - workgroup    *none*     1. s_waitcnt lgkmcnt(0)

                                                          - If OpenCL and
                                                            address space is
                                                            not generic, omit.
                                                          - However, since LLVM
                                                            currently has no
                                                            address space on
                                                            the fence need to
                                                            conservatively
                                                            always generate. If
                                                            fence had an
                                                            address space then
                                                            set to address
                                                            space of OpenCL
                                                            fence flag, or to
                                                            generic if both
                                                            local and global
                                                            flags are
                                                            specified.
                                                          - Must happen after
                                                            any preceding
                                                            local/generic load
                                                            atomic/atomicrmw
                                                            with an equal or
                                                            wider sync scope
                                                            and memory ordering
                                                            stronger than
                                                            unordered (this is
                                                            termed the
                                                            fence-paired-atomic).
                                                          - Must happen before
                                                            any following
                                                            global/generic
                                                            load/load
                                                            atomic/store/store
                                                            atomic/atomicrmw.
                                                          - Ensures any
                                                            following global
                                                            data read is no
                                                            older than the
                                                            value read by the
                                                            fence-paired-atomic.

     fence        acquire      - agent        *none*     1. s_waitcnt lgkmcnt(0) &
                               - system                     vmcnt(0)

                                                          - If OpenCL and
                                                            address space is
                                                            not generic, omit
                                                            lgkmcnt(0).
                                                          - However, since LLVM
                                                            currently has no
                                                            address space on
                                                            the fence need to
                                                            conservatively
                                                            always generate
                                                            (see comment for
                                                            previous fence).
                                                          - Could be split into
                                                            separate s_waitcnt
                                                            vmcnt(0) and
                                                            s_waitcnt
                                                            lgkmcnt(0) to allow
                                                            them to be
                                                            independently moved
                                                            according to the
                                                            following rules.
5172 - s_waitcnt vmcnt(0) 5173 must happen after 5174 any preceding 5175 global/generic load 5176 atomic/atomicrmw 5177 with an equal or 5178 wider sync scope 5179 and memory ordering 5180 stronger than 5181 unordered (this is 5182 termed the 5183 fence-paired-atomic). 5184 - s_waitcnt lgkmcnt(0) 5185 must happen after 5186 any preceding 5187 local/generic load 5188 atomic/atomicrmw 5189 with an equal or 5190 wider sync scope 5191 and memory ordering 5192 stronger than 5193 unordered (this is 5194 termed the 5195 fence-paired-atomic). 5196 - Must happen before 5197 the following 5198 buffer_wbinvl1_vol. 5199 - Ensures that the 5200 fence-paired atomic 5201 has completed 5202 before invalidating 5203 the 5204 cache. Therefore 5205 any following 5206 locations read must 5207 be no older than 5208 the value read by 5209 the 5210 fence-paired-atomic. 5211 5212 2. buffer_wbinvl1_vol 5213 5214 - Must happen before any 5215 following global/generic 5216 load/load 5217 atomic/store/store 5218 atomic/atomicrmw. 5219 - Ensures that 5220 following loads 5221 will not see stale 5222 global data. 5223 5224 **Release Atomic** 5225 ------------------------------------------------------------------------------------ 5226 store atomic release - singlethread - global 1. buffer/global/ds/flat_store 5227 - wavefront - local 5228 - generic 5229 store atomic release - workgroup - global 1. s_waitcnt lgkmcnt(0) 5230 - generic 5231 - If OpenCL, omit. 5232 - Must happen after 5233 any preceding 5234 local/generic 5235 load/store/load 5236 atomic/store 5237 atomic/atomicrmw. 5238 - Must happen before 5239 the following 5240 store. 5241 - Ensures that all 5242 memory operations 5243 to local have 5244 completed before 5245 performing the 5246 store that is being 5247 released. 5248 5249 2. buffer/global/flat_store 5250 store atomic release - workgroup - local 1. ds_store 5251 store atomic release - agent - global 1. 
s_waitcnt lgkmcnt(0) & 5252 - system - generic vmcnt(0) 5253 5254 - If OpenCL and 5255 address space is 5256 not generic, omit 5257 lgkmcnt(0). 5258 - Could be split into 5259 separate s_waitcnt 5260 vmcnt(0) and 5261 s_waitcnt 5262 lgkmcnt(0) to allow 5263 them to be 5264 independently moved 5265 according to the 5266 following rules. 5267 - s_waitcnt vmcnt(0) 5268 must happen after 5269 any preceding 5270 global/generic 5271 load/store/load 5272 atomic/store 5273 atomic/atomicrmw. 5274 - s_waitcnt lgkmcnt(0) 5275 must happen after 5276 any preceding 5277 local/generic 5278 load/store/load 5279 atomic/store 5280 atomic/atomicrmw. 5281 - Must happen before 5282 the following 5283 store. 5284 - Ensures that all 5285 memory operations 5286 to memory have 5287 completed before 5288 performing the 5289 store that is being 5290 released. 5291 5292 2. buffer/global/flat_store 5293 atomicrmw release - singlethread - global 1. buffer/global/ds/flat_atomic 5294 - wavefront - local 5295 - generic 5296 atomicrmw release - workgroup - global 1. s_waitcnt lgkmcnt(0) 5297 - generic 5298 - If OpenCL, omit. 5299 - Must happen after 5300 any preceding 5301 local/generic 5302 load/store/load 5303 atomic/store 5304 atomic/atomicrmw. 5305 - Must happen before 5306 the following 5307 atomicrmw. 5308 - Ensures that all 5309 memory operations 5310 to local have 5311 completed before 5312 performing the 5313 atomicrmw that is 5314 being released. 5315 5316 2. buffer/global/flat_atomic 5317 atomicrmw release - workgroup - local 1. ds_atomic 5318 atomicrmw release - agent - global 1. s_waitcnt lgkmcnt(0) & 5319 - system - generic vmcnt(0) 5320 5321 - If OpenCL, omit 5322 lgkmcnt(0). 5323 - Could be split into 5324 separate s_waitcnt 5325 vmcnt(0) and 5326 s_waitcnt 5327 lgkmcnt(0) to allow 5328 them to be 5329 independently moved 5330 according to the 5331 following rules. 
5332 - s_waitcnt vmcnt(0) 5333 must happen after 5334 any preceding 5335 global/generic 5336 load/store/load 5337 atomic/store 5338 atomic/atomicrmw. 5339 - s_waitcnt lgkmcnt(0) 5340 must happen after 5341 any preceding 5342 local/generic 5343 load/store/load 5344 atomic/store 5345 atomic/atomicrmw. 5346 - Must happen before 5347 the following 5348 atomicrmw. 5349 - Ensures that all 5350 memory operations 5351 to global and local 5352 have completed 5353 before performing 5354 the atomicrmw that 5355 is being released. 5356 5357 2. buffer/global/flat_atomic 5358 fence release - singlethread *none* *none* 5359 - wavefront 5360 fence release - workgroup *none* 1. s_waitcnt lgkmcnt(0) 5361 5362 - If OpenCL and 5363 address space is 5364 not generic, omit. 5365 - However, since LLVM 5366 currently has no 5367 address space on 5368 the fence need to 5369 conservatively 5370 always generate. If 5371 fence had an 5372 address space then 5373 set to address 5374 space of OpenCL 5375 fence flag, or to 5376 generic if both 5377 local and global 5378 flags are 5379 specified. 5380 - Must happen after 5381 any preceding 5382 local/generic 5383 load/load 5384 atomic/store/store 5385 atomic/atomicrmw. 5386 - Must happen before 5387 any following store 5388 atomic/atomicrmw 5389 with an equal or 5390 wider sync scope 5391 and memory ordering 5392 stronger than 5393 unordered (this is 5394 termed the 5395 fence-paired-atomic). 5396 - Ensures that all 5397 memory operations 5398 to local have 5399 completed before 5400 performing the 5401 following 5402 fence-paired-atomic. 5403 5404 fence release - agent *none* 1. s_waitcnt lgkmcnt(0) & 5405 - system vmcnt(0) 5406 5407 - If OpenCL and 5408 address space is 5409 not generic, omit 5410 lgkmcnt(0). 5411 - If OpenCL and 5412 address space is 5413 local, omit 5414 vmcnt(0). 5415 - However, since LLVM 5416 currently has no 5417 address space on 5418 the fence need to 5419 conservatively 5420 always generate. 
If 5421 fence had an 5422 address space then 5423 set to address 5424 space of OpenCL 5425 fence flag, or to 5426 generic if both 5427 local and global 5428 flags are 5429 specified. 5430 - Could be split into 5431 separate s_waitcnt 5432 vmcnt(0) and 5433 s_waitcnt 5434 lgkmcnt(0) to allow 5435 them to be 5436 independently moved 5437 according to the 5438 following rules. 5439 - s_waitcnt vmcnt(0) 5440 must happen after 5441 any preceding 5442 global/generic 5443 load/store/load 5444 atomic/store 5445 atomic/atomicrmw. 5446 - s_waitcnt lgkmcnt(0) 5447 must happen after 5448 any preceding 5449 local/generic 5450 load/store/load 5451 atomic/store 5452 atomic/atomicrmw. 5453 - Must happen before 5454 any following store 5455 atomic/atomicrmw 5456 with an equal or 5457 wider sync scope 5458 and memory ordering 5459 stronger than 5460 unordered (this is 5461 termed the 5462 fence-paired-atomic). 5463 - Ensures that all 5464 memory operations 5465 have 5466 completed before 5467 performing the 5468 following 5469 fence-paired-atomic. 5470 5471 **Acquire-Release Atomic** 5472 ------------------------------------------------------------------------------------ 5473 atomicrmw acq_rel - singlethread - global 1. buffer/global/ds/flat_atomic 5474 - wavefront - local 5475 - generic 5476 atomicrmw acq_rel - workgroup - global 1. s_waitcnt lgkmcnt(0) 5477 5478 - If OpenCL, omit. 5479 - Must happen after 5480 any preceding 5481 local/generic 5482 load/store/load 5483 atomic/store 5484 atomic/atomicrmw. 5485 - Must happen before 5486 the following 5487 atomicrmw. 5488 - Ensures that all 5489 memory operations 5490 to local have 5491 completed before 5492 performing the 5493 atomicrmw that is 5494 being released. 5495 5496 2. buffer/global_atomic 5497 5498 atomicrmw acq_rel - workgroup - local 1. ds_atomic 5499 2. s_waitcnt lgkmcnt(0) 5500 5501 - If OpenCL, omit. 
5502 - Must happen before 5503 any following 5504 global/generic 5505 load/load 5506 atomic/store/store 5507 atomic/atomicrmw. 5508 - Ensures any 5509 following global 5510 data read is no 5511 older than the local load 5512 atomic value being 5513 acquired. 5514 5515 atomicrmw acq_rel - workgroup - generic 1. s_waitcnt lgkmcnt(0) 5516 5517 - If OpenCL, omit. 5518 - Must happen after 5519 any preceding 5520 local/generic 5521 load/store/load 5522 atomic/store 5523 atomic/atomicrmw. 5524 - Must happen before 5525 the following 5526 atomicrmw. 5527 - Ensures that all 5528 memory operations 5529 to local have 5530 completed before 5531 performing the 5532 atomicrmw that is 5533 being released. 5534 5535 2. flat_atomic 5536 3. s_waitcnt lgkmcnt(0) 5537 5538 - If OpenCL, omit. 5539 - Must happen before 5540 any following 5541 global/generic 5542 load/load 5543 atomic/store/store 5544 atomic/atomicrmw. 5545 - Ensures any 5546 following global 5547 data read is no 5548 older than a local load 5549 atomic value being 5550 acquired. 5551 5552 atomicrmw acq_rel - agent - global 1. s_waitcnt lgkmcnt(0) & 5553 - system vmcnt(0) 5554 5555 - If OpenCL, omit 5556 lgkmcnt(0). 5557 - Could be split into 5558 separate s_waitcnt 5559 vmcnt(0) and 5560 s_waitcnt 5561 lgkmcnt(0) to allow 5562 them to be 5563 independently moved 5564 according to the 5565 following rules. 5566 - s_waitcnt vmcnt(0) 5567 must happen after 5568 any preceding 5569 global/generic 5570 load/store/load 5571 atomic/store 5572 atomic/atomicrmw. 5573 - s_waitcnt lgkmcnt(0) 5574 must happen after 5575 any preceding 5576 local/generic 5577 load/store/load 5578 atomic/store 5579 atomic/atomicrmw. 5580 - Must happen before 5581 the following 5582 atomicrmw. 5583 - Ensures that all 5584 memory operations 5585 to global have 5586 completed before 5587 performing the 5588 atomicrmw that is 5589 being released. 5590 5591 2. buffer/global_atomic 5592 3. 
s_waitcnt vmcnt(0) 5593 5594 - Must happen before 5595 following 5596 buffer_wbinvl1_vol. 5597 - Ensures the 5598 atomicrmw has 5599 completed before 5600 invalidating the 5601 cache. 5602 5603 4. buffer_wbinvl1_vol 5604 5605 - Must happen before 5606 any following 5607 global/generic 5608 load/load 5609 atomic/atomicrmw. 5610 - Ensures that 5611 following loads 5612 will not see stale 5613 global data. 5614 5615 atomicrmw acq_rel - agent - generic 1. s_waitcnt lgkmcnt(0) & 5616 - system vmcnt(0) 5617 5618 - If OpenCL, omit 5619 lgkmcnt(0). 5620 - Could be split into 5621 separate s_waitcnt 5622 vmcnt(0) and 5623 s_waitcnt 5624 lgkmcnt(0) to allow 5625 them to be 5626 independently moved 5627 according to the 5628 following rules. 5629 - s_waitcnt vmcnt(0) 5630 must happen after 5631 any preceding 5632 global/generic 5633 load/store/load 5634 atomic/store 5635 atomic/atomicrmw. 5636 - s_waitcnt lgkmcnt(0) 5637 must happen after 5638 any preceding 5639 local/generic 5640 load/store/load 5641 atomic/store 5642 atomic/atomicrmw. 5643 - Must happen before 5644 the following 5645 atomicrmw. 5646 - Ensures that all 5647 memory operations 5648 to global have 5649 completed before 5650 performing the 5651 atomicrmw that is 5652 being released. 5653 5654 2. flat_atomic 5655 3. s_waitcnt vmcnt(0) & 5656 lgkmcnt(0) 5657 5658 - If OpenCL, omit 5659 lgkmcnt(0). 5660 - Must happen before 5661 following 5662 buffer_wbinvl1_vol. 5663 - Ensures the 5664 atomicrmw has 5665 completed before 5666 invalidating the 5667 cache. 5668 5669 4. buffer_wbinvl1_vol 5670 5671 - Must happen before 5672 any following 5673 global/generic 5674 load/load 5675 atomic/atomicrmw. 5676 - Ensures that 5677 following loads 5678 will not see stale 5679 global data. 5680 5681 fence acq_rel - singlethread *none* *none* 5682 - wavefront 5683 fence acq_rel - workgroup *none* 1. s_waitcnt lgkmcnt(0) 5684 5685 - If OpenCL and 5686 address space is 5687 not generic, omit. 
5688 - However, 5689 since LLVM 5690 currently has no 5691 address space on 5692 the fence need to 5693 conservatively 5694 always generate 5695 (see comment for 5696 previous fence). 5697 - Must happen after 5698 any preceding 5699 local/generic 5700 load/load 5701 atomic/store/store 5702 atomic/atomicrmw. 5703 - Must happen before 5704 any following 5705 global/generic 5706 load/load 5707 atomic/store/store 5708 atomic/atomicrmw. 5709 - Ensures that all 5710 memory operations 5711 to local have 5712 completed before 5713 performing any 5714 following global 5715 memory operations. 5716 - Ensures that the 5717 preceding 5718 local/generic load 5719 atomic/atomicrmw 5720 with an equal or 5721 wider sync scope 5722 and memory ordering 5723 stronger than 5724 unordered (this is 5725 termed the 5726 acquire-fence-paired-atomic) 5727 has completed 5728 before following 5729 global memory 5730 operations. This 5731 satisfies the 5732 requirements of 5733 acquire. 5734 - Ensures that all 5735 previous memory 5736 operations have 5737 completed before a 5738 following 5739 local/generic store 5740 atomic/atomicrmw 5741 with an equal or 5742 wider sync scope 5743 and memory ordering 5744 stronger than 5745 unordered (this is 5746 termed the 5747 release-fence-paired-atomic). 5748 This satisfies the 5749 requirements of 5750 release. 5751 5752 fence acq_rel - agent *none* 1. s_waitcnt lgkmcnt(0) & 5753 - system vmcnt(0) 5754 5755 - If OpenCL and 5756 address space is 5757 not generic, omit 5758 lgkmcnt(0). 5759 - However, since LLVM 5760 currently has no 5761 address space on 5762 the fence need to 5763 conservatively 5764 always generate 5765 (see comment for 5766 previous fence). 5767 - Could be split into 5768 separate s_waitcnt 5769 vmcnt(0) and 5770 s_waitcnt 5771 lgkmcnt(0) to allow 5772 them to be 5773 independently moved 5774 according to the 5775 following rules. 
5776 - s_waitcnt vmcnt(0) 5777 must happen after 5778 any preceding 5779 global/generic 5780 load/store/load 5781 atomic/store 5782 atomic/atomicrmw. 5783 - s_waitcnt lgkmcnt(0) 5784 must happen after 5785 any preceding 5786 local/generic 5787 load/store/load 5788 atomic/store 5789 atomic/atomicrmw. 5790 - Must happen before 5791 the following 5792 buffer_wbinvl1_vol. 5793 - Ensures that the 5794 preceding 5795 global/local/generic 5796 load 5797 atomic/atomicrmw 5798 with an equal or 5799 wider sync scope 5800 and memory ordering 5801 stronger than 5802 unordered (this is 5803 termed the 5804 acquire-fence-paired-atomic) 5805 has completed 5806 before invalidating 5807 the cache. This 5808 satisfies the 5809 requirements of 5810 acquire. 5811 - Ensures that all 5812 previous memory 5813 operations have 5814 completed before a 5815 following 5816 global/local/generic 5817 store 5818 atomic/atomicrmw 5819 with an equal or 5820 wider sync scope 5821 and memory ordering 5822 stronger than 5823 unordered (this is 5824 termed the 5825 release-fence-paired-atomic). 5826 This satisfies the 5827 requirements of 5828 release. 5829 5830 2. buffer_wbinvl1_vol 5831 5832 - Must happen before 5833 any following 5834 global/generic 5835 load/load 5836 atomic/store/store 5837 atomic/atomicrmw. 5838 - Ensures that 5839 following loads 5840 will not see stale 5841 global data. This 5842 satisfies the 5843 requirements of 5844 acquire. 5845 5846 **Sequential Consistent Atomic** 5847 ------------------------------------------------------------------------------------ 5848 load atomic seq_cst - singlethread - global *Same as corresponding 5849 - wavefront - local load atomic acquire, 5850 - generic except must generated 5851 all instructions even 5852 for OpenCL.* 5853 load atomic seq_cst - workgroup - global 1. 
s_waitcnt lgkmcnt(0) 5854 - generic 5855 5856 - Must 5857 happen after 5858 preceding 5859 local/generic load 5860 atomic/store 5861 atomic/atomicrmw 5862 with memory 5863 ordering of seq_cst 5864 and with equal or 5865 wider sync scope. 5866 (Note that seq_cst 5867 fences have their 5868 own s_waitcnt 5869 lgkmcnt(0) and so do 5870 not need to be 5871 considered.) 5872 - Ensures any 5873 preceding 5874 sequential 5875 consistent local 5876 memory instructions 5877 have completed 5878 before executing 5879 this sequentially 5880 consistent 5881 instruction. This 5882 prevents reordering 5883 a seq_cst store 5884 followed by a 5885 seq_cst load. (Note 5886 that seq_cst is 5887 stronger than 5888 acquire/release as 5889 the reordering of 5890 load acquire 5891 followed by a store 5892 release is 5893 prevented by the 5894 s_waitcnt of 5895 the release, but 5896 there is nothing 5897 preventing a store 5898 release followed by 5899 load acquire from 5900 completing out of 5901 order. The s_waitcnt 5902 could be placed after 5903 seq_store or before 5904 the seq_load. We 5905 choose the load to 5906 make the s_waitcnt be 5907 as late as possible 5908 so that the store 5909 may have already 5910 completed.) 5911 5912 2. *Following 5913 instructions same as 5914 corresponding load 5915 atomic acquire, 5916 except must generated 5917 all instructions even 5918 for OpenCL.* 5919 load atomic seq_cst - workgroup - local *Same as corresponding 5920 load atomic acquire, 5921 except must generated 5922 all instructions even 5923 for OpenCL.* 5924 5925 load atomic seq_cst - agent - global 1. s_waitcnt lgkmcnt(0) & 5926 - system - generic vmcnt(0) 5927 5928 - Could be split into 5929 separate s_waitcnt 5930 vmcnt(0) 5931 and s_waitcnt 5932 lgkmcnt(0) to allow 5933 them to be 5934 independently moved 5935 according to the 5936 following rules. 
5937 - s_waitcnt lgkmcnt(0) 5938 must happen after 5939 preceding 5940 global/generic load 5941 atomic/store 5942 atomic/atomicrmw 5943 with memory 5944 ordering of seq_cst 5945 and with equal or 5946 wider sync scope. 5947 (Note that seq_cst 5948 fences have their 5949 own s_waitcnt 5950 lgkmcnt(0) and so do 5951 not need to be 5952 considered.) 5953 - s_waitcnt vmcnt(0) 5954 must happen after 5955 preceding 5956 global/generic load 5957 atomic/store 5958 atomic/atomicrmw 5959 with memory 5960 ordering of seq_cst 5961 and with equal or 5962 wider sync scope. 5963 (Note that seq_cst 5964 fences have their 5965 own s_waitcnt 5966 vmcnt(0) and so do 5967 not need to be 5968 considered.) 5969 - Ensures any 5970 preceding 5971 sequential 5972 consistent global 5973 memory instructions 5974 have completed 5975 before executing 5976 this sequentially 5977 consistent 5978 instruction. This 5979 prevents reordering 5980 a seq_cst store 5981 followed by a 5982 seq_cst load. (Note 5983 that seq_cst is 5984 stronger than 5985 acquire/release as 5986 the reordering of 5987 load acquire 5988 followed by a store 5989 release is 5990 prevented by the 5991 s_waitcnt of 5992 the release, but 5993 there is nothing 5994 preventing a store 5995 release followed by 5996 load acquire from 5997 completing out of 5998 order. The s_waitcnt 5999 could be placed after 6000 seq_store or before 6001 the seq_load. We 6002 choose the load to 6003 make the s_waitcnt be 6004 as late as possible 6005 so that the store 6006 may have already 6007 completed.) 6008 6009 2. 
                                                         *Following
                                                         instructions same as
                                                         corresponding load
                                                         atomic acquire,
                                                         except must generate
                                                         all instructions even
                                                         for OpenCL.*
     store atomic seq_cst      - singlethread - global   *Same as corresponding
                               - wavefront    - local     store atomic release,
                               - workgroup    - generic   except must generate
                               - agent                    all instructions even
                               - system                   for OpenCL.*
     atomicrmw    seq_cst      - singlethread - global   *Same as corresponding
                               - wavefront    - local     atomicrmw acq_rel,
                               - workgroup    - generic   except must generate
                               - agent                    all instructions even
                               - system                   for OpenCL.*
     fence        seq_cst      - singlethread *none*     *Same as corresponding
                               - wavefront                fence acq_rel,
                               - workgroup                except must generate
                               - agent                    all instructions even
                               - system                   for OpenCL.*
     ============ ============ ============== ========== ================================

.. _amdgpu-amdhsa-memory-model-gfx90a:

Memory Model GFX90A
+++++++++++++++++++

For GFX90A:

* Each agent has multiple shader arrays (SA).
* Each SA has multiple compute units (CU).
* Each CU has multiple SIMDs that execute wavefronts.
* The wavefronts for a single work-group are executed in the same CU but may be
  executed by different SIMDs. The exception is when in tgsplit execution mode,
  in which case the wavefronts may be executed by different SIMDs in different
  CUs.
* Each CU has a single LDS memory shared by the wavefronts of the work-groups
  executing on it. The exception is when in tgsplit execution mode, in which
  case no LDS is allocated, as wavefronts of the same work-group can be in
  different CUs.
* All LDS operations of a CU are performed as wavefront wide operations in a
  global order and involve no caching. Completion is reported to a wavefront in
  execution order.
* The LDS memory has multiple request queues shared by the SIMDs of a
  CU.
  Therefore, the LDS operations performed by different wavefronts of a
  work-group can be reordered relative to each other, which can result in
  reordering the visibility of vector memory operations with respect to LDS
  operations of other wavefronts in the same work-group. A ``s_waitcnt
  lgkmcnt(0)`` is required to ensure synchronization between LDS operations
  and vector memory operations between wavefronts of a work-group, but not
  between operations performed by the same wavefront.
* The vector memory operations are performed as wavefront wide operations and
  completion is reported to a wavefront in execution order. The exception is
  that ``flat_load/store/atomic`` instructions can report out of vector memory
  order if they access LDS memory, and out of LDS operation order if they
  access global memory.
* The vector memory operations access a single vector L1 cache shared by all
  SIMDs of a CU. Therefore:

  * No special action is required for coherence between the lanes of a single
    wavefront.

  * No special action is required for coherence between wavefronts in the same
    work-group since they execute on the same CU. The exception is when in
    tgsplit execution mode, as wavefronts of the same work-group can then be
    in different CUs and so a ``buffer_wbinvl1_vol`` is required as described
    in the following item.

  * A ``buffer_wbinvl1_vol`` is required for coherence between wavefronts
    executing in different work-groups as they may be executing on different
    CUs.

* The scalar memory operations access a scalar L1 cache shared by all
  wavefronts on a group of CUs. The scalar and vector L1 caches are not
  coherent. However, scalar operations are used in a restricted way so do not
  impact the memory model. See :ref:`amdgpu-amdhsa-memory-spaces`.
* The vector and scalar memory operations use an L2 cache shared by all CUs on
  the same agent.

  * The L2 cache has independent channels to service disjoint ranges of
    virtual addresses.
  * Each CU has a separate request queue per channel. Therefore, the vector
    and scalar memory operations performed by wavefronts executing in
    different work-groups (which may be executing on different CUs), or the
    same work-group if executing in tgsplit mode, of an agent can be reordered
    relative to each other. A ``s_waitcnt vmcnt(0)`` is required to ensure
    synchronization between vector memory operations of different CUs. It
    ensures a previous vector memory operation has completed before executing
    a subsequent vector memory or LDS operation and so can be used to meet the
    requirements of acquire and release.
  * The L2 cache of one agent can be kept coherent with other agents by:
    using the MTYPE RW (read-write) or MTYPE CC (cache-coherent) with the PTE
    C-bit for memory local to the L2; and using the MTYPE NC (non-coherent)
    with the PTE C-bit set or MTYPE UC (uncached) for memory not local to the
    L2.

    * Any local memory cache lines will be automatically invalidated by
      writes from CUs associated with other L2 caches, or writes from the
      CPU, due to the cache probe caused by coherent requests. Coherent
      requests are caused by GPU accesses to pages with the PTE C-bit set, by
      CPU accesses over XGMI, and by PCIe requests that are configured to be
      coherent requests.
    * XGMI accesses from the CPU to local memory may be cached on the CPU.
      Subsequent access from the GPU will automatically invalidate or
      writeback the CPU cache due to the L2 probe filter and the PTE C-bit
      being set.
    * Since all work-groups on the same agent share the same L2, no L2
      invalidation or writeback is required for coherence.
    * To ensure coherence of local and remote memory writes of work-groups in
      different agents a ``buffer_wbl2`` is required.
      It will writeback dirty
      L2 cache lines of MTYPE RW (used for local coarse grain memory) and
      MTYPE NC (used for remote coarse grain memory). Note that MTYPE CC
      (used for local fine grain memory) causes write through to DRAM, and
      MTYPE UC (used for remote fine grain memory) bypasses the L2, so both
      will never result in dirty L2 cache lines.
    * To ensure coherence of local and remote memory reads of work-groups in
      different agents a ``buffer_invl2`` is required. It will invalidate L2
      cache lines with MTYPE NC (used for remote coarse grain memory). Note
      that MTYPE CC (used for local fine grain memory) and MTYPE RW (used for
      local coarse grain memory) cause local reads to be invalidated by
      remote writes with the PTE C-bit so these cache lines are not
      invalidated. Note that MTYPE UC (used for remote fine grain memory)
      bypasses the L2, so will never result in L2 cache lines that need to be
      invalidated.

  * PCIe access from the GPU to the CPU memory is kept coherent by using the
    MTYPE UC (uncached) which bypasses the L2.

Scalar memory operations are only used to access memory that is proven to not
change during the execution of the kernel dispatch. This includes constant
address space and global address space for program scope ``const`` variables.
Therefore, the kernel machine code does not have to maintain the scalar cache
to ensure it is coherent with the vector caches. The scalar and vector caches
are invalidated between kernel dispatches by CP since constant address space
data may change between kernel dispatch executions. See
:ref:`amdgpu-amdhsa-memory-spaces`.

The one exception is if scalar writes are used to spill SGPR registers. In
this case the AMDGPU backend ensures the memory location used to spill is
never accessed by vector memory operations at the same time.
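The restriction described above, that scalar memory instructions are only used
for memory proven not to change during the dispatch, can be illustrated with a
small LLVM IR sketch (a hypothetical kernel; the names are illustrative, while
the AMDGPU address space numbering, 1 for global and 4 for constant, and the
``llvm.amdgcn.workitem.id.x`` intrinsic are real):

.. code-block:: llvm

  ; Hypothetical kernel: %k points to constant address space memory that is
  ; proven not to change during the dispatch, so the backend may lower its
  ; load to a scalar instruction (e.g. s_load_dword) through the scalar L1.
  ; The per-lane accesses through %gep are divergent and use vector memory
  ; instructions and the vector caches.
  define amdgpu_kernel void @scale(float addrspace(1)* %out,
                                   float addrspace(4)* %k) {
  entry:
    %id = call i32 @llvm.amdgcn.workitem.id.x()
    %c = load float, float addrspace(4)* %k
    %gep = getelementptr float, float addrspace(1)* %out, i32 %id
    %v = load float, float addrspace(1)* %gep
    %m = fmul float %v, %c
    store float %m, float addrspace(1)* %gep
    ret void
  }

  declare i32 @llvm.amdgcn.workitem.id.x()

Because the scalar and vector L1 caches are not coherent, this pattern is only
safe because the backend restricts scalar loads to memory it can prove is
unchanged for the whole dispatch, as described above.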
If scalar writes are used
then a ``s_dcache_wb`` is inserted before the ``s_endpgm`` and before a
function return since the locations may be used for vector memory instructions
by a future wavefront that uses the same scratch area, or a function call that
creates a frame at the same address, respectively. There is no need for a
``s_dcache_inv`` as all scalar writes are write-before-read in the same
thread.

For kernarg backing memory:

* CP invalidates the L1 cache at the start of each kernel dispatch.
* On dGPU over XGMI or PCIe the kernarg backing memory is allocated in host
  memory accessed as MTYPE UC (uncached) to avoid needing to invalidate the
  L2 cache. This also causes it to be treated as non-volatile and so is not
  invalidated by ``*_vol``.
* On APU the kernarg backing memory is accessed as MTYPE CC (cache coherent)
  and so the L2 cache will be coherent with the CPU and other agents.

Scratch backing memory (which is used for the private address space) is
accessed with MTYPE NC_NV (non-coherent non-volatile). Since the private
address space is only accessed by a single thread, and is always
write-before-read, there is never a need to invalidate these entries from the
L1 cache. Hence all cache invalidates are done as ``*_vol`` to only invalidate
the volatile cache lines.

The code sequences used to implement the memory model for GFX90A are defined
in table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx90a-table`.

  ..
table:: AMDHSA Memory Model Code Sequences GFX90A 6171 :name: amdgpu-amdhsa-memory-model-code-sequences-gfx90a-table 6172 6173 ============ ============ ============== ========== ================================ 6174 LLVM Instr LLVM Memory LLVM Memory AMDGPU AMDGPU Machine Code 6175 Ordering Sync Scope Address GFX90A 6176 Space 6177 ============ ============ ============== ========== ================================ 6178 **Non-Atomic** 6179 ------------------------------------------------------------------------------------ 6180 load *none* *none* - global - !volatile & !nontemporal 6181 - generic 6182 - private 1. buffer/global/flat_load 6183 - constant 6184 - !volatile & nontemporal 6185 6186 1. buffer/global/flat_load 6187 glc=1 slc=1 6188 6189 - volatile 6190 6191 1. buffer/global/flat_load 6192 glc=1 6193 2. s_waitcnt vmcnt(0) 6194 6195 - Must happen before 6196 any following volatile 6197 global/generic 6198 load/store. 6199 - Ensures that 6200 volatile 6201 operations to 6202 different 6203 addresses will not 6204 be reordered by 6205 hardware. 6206 6207 load *none* *none* - local 1. ds_load 6208 store *none* *none* - global - !volatile & !nontemporal 6209 - generic 6210 - private 1. buffer/global/flat_store 6211 - constant 6212 - !volatile & nontemporal 6213 6214 1. buffer/global/flat_store 6215 glc=1 slc=1 6216 6217 - volatile 6218 6219 1. buffer/global/flat_store 6220 2. s_waitcnt vmcnt(0) 6221 6222 - Must happen before 6223 any following volatile 6224 global/generic 6225 load/store. 6226 - Ensures that 6227 volatile 6228 operations to 6229 different 6230 addresses will not 6231 be reordered by 6232 hardware. 6233 6234 store *none* *none* - local 1. ds_store 6235 **Unordered Atomic** 6236 ------------------------------------------------------------------------------------ 6237 load atomic unordered *any* *any* *Same as non-atomic*. 6238 store atomic unordered *any* *any* *Same as non-atomic*. 
     atomicrmw    unordered    *any*          *any*      *Same as monotonic atomic*.
     **Monotonic Atomic**
     ------------------------------------------------------------------------------------
     load atomic  monotonic    - singlethread - global   1. buffer/global/flat_load
                               - wavefront    - generic
     load atomic  monotonic    - workgroup    - global   1. buffer/global/flat_load
                                              - generic     glc=1

                                                            - If not TgSplit execution
                                                              mode, omit glc=1.

     load atomic  monotonic    - singlethread - local    *If TgSplit execution mode,
                               - wavefront               local address space cannot
                               - workgroup               be used.*

                                                         1. ds_load
     load atomic  monotonic    - agent        - global   1. buffer/global/flat_load
                                              - generic     glc=1
     load atomic  monotonic    - system       - global   1. buffer/global/flat_load
                                              - generic     glc=1
     store atomic monotonic    - singlethread - global   1. buffer/global/flat_store
                               - wavefront    - generic
                               - workgroup
                               - agent
     store atomic monotonic    - system       - global   1. buffer/global/flat_store
                                              - generic
     store atomic monotonic    - singlethread - local    *If TgSplit execution mode,
                               - wavefront               local address space cannot
                               - workgroup               be used.*

                                                         1. ds_store
     atomicrmw    monotonic    - singlethread - global   1. buffer/global/flat_atomic
                               - wavefront    - generic
                               - workgroup
                               - agent
     atomicrmw    monotonic    - system       - global   1. buffer/global/flat_atomic
                                              - generic
     atomicrmw    monotonic    - singlethread - local    *If TgSplit execution mode,
                               - wavefront               local address space cannot
                               - workgroup               be used.*

                                                         1. ds_atomic
     **Acquire Atomic**
     ------------------------------------------------------------------------------------
     load atomic  acquire      - singlethread - global   1. buffer/global/ds/flat_load
                               - wavefront    - local
                                              - generic
     load atomic  acquire      - workgroup    - global   1. buffer/global_load glc=1

                                                            - If not TgSplit execution
                                                              mode, omit glc=1.

                                                         2.
s_waitcnt vmcnt(0) 6292 6293 - If not TgSplit execution 6294 mode, omit. 6295 - Must happen before the 6296 following buffer_wbinvl1_vol. 6297 6298 3. buffer_wbinvl1_vol 6299 6300 - If not TgSplit execution 6301 mode, omit. 6302 - Must happen before 6303 any following 6304 global/generic 6305 load/load 6306 atomic/store/store 6307 atomic/atomicrmw. 6308 - Ensures that 6309 following 6310 loads will not see 6311 stale data. 6312 6313 load atomic acquire - workgroup - local *If TgSplit execution mode, 6314 local address space cannot 6315 be used.* 6316 6317 1. ds_load 6318 2. s_waitcnt lgkmcnt(0) 6319 6320 - If OpenCL, omit. 6321 - Must happen before 6322 any following 6323 global/generic 6324 load/load 6325 atomic/store/store 6326 atomic/atomicrmw. 6327 - Ensures any 6328 following global 6329 data read is no 6330 older than the local load 6331 atomic value being 6332 acquired. 6333 6334 load atomic acquire - workgroup - generic 1. flat_load glc=1 6335 6336 - If not TgSplit execution 6337 mode, omit glc=1. 6338 6339 2. s_waitcnt lgkm/vmcnt(0) 6340 6341 - Use lgkmcnt(0) if not 6342 TgSplit execution mode 6343 and vmcnt(0) if TgSplit 6344 execution mode. 6345 - If OpenCL, omit lgkmcnt(0). 6346 - Must happen before 6347 the following 6348 buffer_wbinvl1_vol and any 6349 following global/generic 6350 load/load 6351 atomic/store/store 6352 atomic/atomicrmw. 6353 - Ensures any 6354 following global 6355 data read is no 6356 older than a local load 6357 atomic value being 6358 acquired. 6359 6360 3. buffer_wbinvl1_vol 6361 6362 - If not TgSplit execution 6363 mode, omit. 6364 - Ensures that 6365 following 6366 loads will not see 6367 stale data. 6368 6369 load atomic acquire - agent - global 1. buffer/global_load 6370 glc=1 6371 2. s_waitcnt vmcnt(0) 6372 6373 - Must happen before 6374 following 6375 buffer_wbinvl1_vol. 6376 - Ensures the load 6377 has completed 6378 before invalidating 6379 the cache. 6380 6381 3. 
buffer_wbinvl1_vol 6382 6383 - Must happen before 6384 any following 6385 global/generic 6386 load/load 6387 atomic/atomicrmw. 6388 - Ensures that 6389 following 6390 loads will not see 6391 stale global data. 6392 6393 load atomic acquire - system - global 1. buffer/global/flat_load 6394 glc=1 6395 2. s_waitcnt vmcnt(0) 6396 6397 - Must happen before 6398 following buffer_invl2 and 6399 buffer_wbinvl1_vol. 6400 - Ensures the load 6401 has completed 6402 before invalidating 6403 the cache. 6404 6405 3. buffer_invl2; 6406 buffer_wbinvl1_vol 6407 6408 - Must happen before 6409 any following 6410 global/generic 6411 load/load 6412 atomic/atomicrmw. 6413 - Ensures that 6414 following 6415 loads will not see 6416 stale L1 global data, 6417 nor see stale L2 MTYPE 6418 NC global data. 6419 MTYPE RW and CC memory will 6420 never be stale in L2 due to 6421 the memory probes. 6422 6423 load atomic acquire - agent - generic 1. flat_load glc=1 6424 2. s_waitcnt vmcnt(0) & 6425 lgkmcnt(0) 6426 6427 - If TgSplit execution mode, 6428 omit lgkmcnt(0). 6429 - If OpenCL omit 6430 lgkmcnt(0). 6431 - Must happen before 6432 following 6433 buffer_wbinvl1_vol. 6434 - Ensures the flat_load 6435 has completed 6436 before invalidating 6437 the cache. 6438 6439 3. buffer_wbinvl1_vol 6440 6441 - Must happen before 6442 any following 6443 global/generic 6444 load/load 6445 atomic/atomicrmw. 6446 - Ensures that 6447 following loads 6448 will not see stale 6449 global data. 6450 6451 load atomic acquire - system - generic 1. flat_load glc=1 6452 2. s_waitcnt vmcnt(0) & 6453 lgkmcnt(0) 6454 6455 - If TgSplit execution mode, 6456 omit lgkmcnt(0). 6457 - If OpenCL omit 6458 lgkmcnt(0). 6459 - Must happen before 6460 following 6461 buffer_invl2 and 6462 buffer_wbinvl1_vol. 6463 - Ensures the flat_load 6464 has completed 6465 before invalidating 6466 the caches. 6467 6468 3. 
buffer_invl2; 6469 buffer_wbinvl1_vol 6470 6471 - Must happen before 6472 any following 6473 global/generic 6474 load/load 6475 atomic/atomicrmw. 6476 - Ensures that 6477 following 6478 loads will not see 6479 stale L1 global data, 6480 nor see stale L2 MTYPE 6481 NC global data. 6482 MTYPE RW and CC memory will 6483 never be stale in L2 due to 6484 the memory probes. 6485 6486 atomicrmw acquire - singlethread - global 1. buffer/global/flat_atomic 6487 - wavefront - generic 6488 atomicrmw acquire - singlethread - local *If TgSplit execution mode, 6489 - wavefront local address space cannot 6490 be used.* 6491 6492 1. ds_atomic 6493 atomicrmw acquire - workgroup - global 1. buffer/global_atomic 6494 2. s_waitcnt vmcnt(0) 6495 6496 - If not TgSplit execution 6497 mode, omit. 6498 - Must happen before the 6499 following buffer_wbinvl1_vol. 6500 - Ensures the atomicrmw 6501 has completed 6502 before invalidating 6503 the cache. 6504 6505 3. buffer_wbinvl1_vol 6506 6507 - If not TgSplit execution 6508 mode, omit. 6509 - Must happen before 6510 any following 6511 global/generic 6512 load/load 6513 atomic/atomicrmw. 6514 - Ensures that 6515 following loads 6516 will not see stale 6517 global data. 6518 6519 atomicrmw acquire - workgroup - local *If TgSplit execution mode, 6520 local address space cannot 6521 be used.* 6522 6523 1. ds_atomic 6524 2. s_waitcnt lgkmcnt(0) 6525 6526 - If OpenCL, omit. 6527 - Must happen before 6528 any following 6529 global/generic 6530 load/load 6531 atomic/store/store 6532 atomic/atomicrmw. 6533 - Ensures any 6534 following global 6535 data read is no 6536 older than the local 6537 atomicrmw value 6538 being acquired. 6539 6540 atomicrmw acquire - workgroup - generic 1. flat_atomic 6541 2. s_waitcnt lgkm/vmcnt(0) 6542 6543 - Use lgkmcnt(0) if not 6544 TgSplit execution mode 6545 and vmcnt(0) if TgSplit 6546 execution mode. 6547 - If OpenCL, omit lgkmcnt(0). 
6548 - Must happen before 6549 the following 6550 buffer_wbinvl1_vol and 6551 any following 6552 global/generic 6553 load/load 6554 atomic/store/store 6555 atomic/atomicrmw. 6556 - Ensures any 6557 following global 6558 data read is no 6559 older than a local 6560 atomicrmw value 6561 being acquired. 6562 6563 3. buffer_wbinvl1_vol 6564 6565 - If not TgSplit execution 6566 mode, omit. 6567 - Ensures that 6568 following 6569 loads will not see 6570 stale data. 6571 6572 atomicrmw acquire - agent - global 1. buffer/global_atomic 6573 2. s_waitcnt vmcnt(0) 6574 6575 - Must happen before 6576 following 6577 buffer_wbinvl1_vol. 6578 - Ensures the 6579 atomicrmw has 6580 completed before 6581 invalidating the 6582 cache. 6583 6584 3. buffer_wbinvl1_vol 6585 6586 - Must happen before 6587 any following 6588 global/generic 6589 load/load 6590 atomic/atomicrmw. 6591 - Ensures that 6592 following loads 6593 will not see stale 6594 global data. 6595 6596 atomicrmw acquire - system - global 1. buffer/global_atomic 6597 2. s_waitcnt vmcnt(0) 6598 6599 - Must happen before 6600 following buffer_invl2 and 6601 buffer_wbinvl1_vol. 6602 - Ensures the 6603 atomicrmw has 6604 completed before 6605 invalidating the 6606 caches. 6607 6608 3. buffer_invl2; 6609 buffer_wbinvl1_vol 6610 6611 - Must happen before 6612 any following 6613 global/generic 6614 load/load 6615 atomic/atomicrmw. 6616 - Ensures that 6617 following 6618 loads will not see 6619 stale L1 global data, 6620 nor see stale L2 MTYPE 6621 NC global data. 6622 MTYPE RW and CC memory will 6623 never be stale in L2 due to 6624 the memory probes. 6625 6626 atomicrmw acquire - agent - generic 1. flat_atomic 6627 2. s_waitcnt vmcnt(0) & 6628 lgkmcnt(0) 6629 6630 - If TgSplit execution mode, 6631 omit lgkmcnt(0). 6632 - If OpenCL, omit 6633 lgkmcnt(0). 6634 - Must happen before 6635 following 6636 buffer_wbinvl1_vol. 6637 - Ensures the 6638 atomicrmw has 6639 completed before 6640 invalidating the 6641 cache. 6642 6643 3. 
buffer_wbinvl1_vol 6644 6645 - Must happen before 6646 any following 6647 global/generic 6648 load/load 6649 atomic/atomicrmw. 6650 - Ensures that 6651 following loads 6652 will not see stale 6653 global data. 6654 6655 atomicrmw acquire - system - generic 1. flat_atomic 6656 2. s_waitcnt vmcnt(0) & 6657 lgkmcnt(0) 6658 6659 - If TgSplit execution mode, 6660 omit lgkmcnt(0). 6661 - If OpenCL, omit 6662 lgkmcnt(0). 6663 - Must happen before 6664 following 6665 buffer_invl2 and 6666 buffer_wbinvl1_vol. 6667 - Ensures the 6668 atomicrmw has 6669 completed before 6670 invalidating the 6671 caches. 6672 6673 3. buffer_invl2; 6674 buffer_wbinvl1_vol 6675 6676 - Must happen before 6677 any following 6678 global/generic 6679 load/load 6680 atomic/atomicrmw. 6681 - Ensures that 6682 following 6683 loads will not see 6684 stale L1 global data, 6685 nor see stale L2 MTYPE 6686 NC global data. 6687 MTYPE RW and CC memory will 6688 never be stale in L2 due to 6689 the memory probes. 6690 6691 fence acquire - singlethread *none* *none* 6692 - wavefront 6693 fence acquire - workgroup *none* 1. s_waitcnt lgkm/vmcnt(0) 6694 6695 - Use lgkmcnt(0) if not 6696 TgSplit execution mode 6697 and vmcnt(0) if TgSplit 6698 execution mode. 6699 - If OpenCL and 6700 address space is 6701 not generic, omit 6702 lgkmcnt(0). 6703 - If OpenCL and 6704 address space is 6705 local, omit 6706 vmcnt(0). 6707 - However, since LLVM 6708 currently has no 6709 address space on 6710 the fence need to 6711 conservatively 6712 always generate. If 6713 fence had an 6714 address space then 6715 set to address 6716 space of OpenCL 6717 fence flag, or to 6718 generic if both 6719 local and global 6720 flags are 6721 specified. 6722 - s_waitcnt vmcnt(0) 6723 must happen after 6724 any preceding 6725 global/generic load 6726 atomic/ 6727 atomicrmw 6728 with an equal or 6729 wider sync scope 6730 and memory ordering 6731 stronger than 6732 unordered (this is 6733 termed the 6734 fence-paired-atomic). 
6735 - s_waitcnt lgkmcnt(0) 6736 must happen after 6737 any preceding 6738 local/generic load 6739 atomic/atomicrmw 6740 with an equal or 6741 wider sync scope 6742 and memory ordering 6743 stronger than 6744 unordered (this is 6745 termed the 6746 fence-paired-atomic). 6747 - Must happen before 6748 the following 6749 buffer_wbinvl1_vol and 6750 any following 6751 global/generic 6752 load/load 6753 atomic/store/store 6754 atomic/atomicrmw. 6755 - Ensures any 6756 following global 6757 data read is no 6758 older than the 6759 value read by the 6760 fence-paired-atomic. 6761 6762 2. buffer_wbinvl1_vol 6763 6764 - If not TgSplit execution 6765 mode, omit. 6766 - Ensures that 6767 following 6768 loads will not see 6769 stale data. 6770 6771 fence acquire - agent *none* 1. s_waitcnt lgkmcnt(0) & 6772 vmcnt(0) 6773 6774 - If TgSplit execution mode, 6775 omit lgkmcnt(0). 6776 - If OpenCL and 6777 address space is 6778 not generic, omit 6779 lgkmcnt(0). 6780 - However, since LLVM 6781 currently has no 6782 address space on 6783 the fence need to 6784 conservatively 6785 always generate 6786 (see comment for 6787 previous fence). 6788 - Could be split into 6789 separate s_waitcnt 6790 vmcnt(0) and 6791 s_waitcnt 6792 lgkmcnt(0) to allow 6793 them to be 6794 independently moved 6795 according to the 6796 following rules. 6797 - s_waitcnt vmcnt(0) 6798 must happen after 6799 any preceding 6800 global/generic load 6801 atomic/atomicrmw 6802 with an equal or 6803 wider sync scope 6804 and memory ordering 6805 stronger than 6806 unordered (this is 6807 termed the 6808 fence-paired-atomic). 6809 - s_waitcnt lgkmcnt(0) 6810 must happen after 6811 any preceding 6812 local/generic load 6813 atomic/atomicrmw 6814 with an equal or 6815 wider sync scope 6816 and memory ordering 6817 stronger than 6818 unordered (this is 6819 termed the 6820 fence-paired-atomic). 6821 - Must happen before 6822 the following 6823 buffer_wbinvl1_vol. 
6824 - Ensures that the 6825 fence-paired atomic 6826 has completed 6827 before invalidating 6828 the 6829 cache. Therefore 6830 any following 6831 locations read must 6832 be no older than 6833 the value read by 6834 the 6835 fence-paired-atomic. 6836 6837 2. buffer_wbinvl1_vol 6838 6839 - Must happen before any 6840 following global/generic 6841 load/load 6842 atomic/store/store 6843 atomic/atomicrmw. 6844 - Ensures that 6845 following loads 6846 will not see stale 6847 global data. 6848 6849 fence acquire - system *none* 1. s_waitcnt lgkmcnt(0) & 6850 vmcnt(0) 6851 6852 - If TgSplit execution mode, 6853 omit lgkmcnt(0). 6854 - If OpenCL and 6855 address space is 6856 not generic, omit 6857 lgkmcnt(0). 6858 - However, since LLVM 6859 currently has no 6860 address space on 6861 the fence need to 6862 conservatively 6863 always generate 6864 (see comment for 6865 previous fence). 6866 - Could be split into 6867 separate s_waitcnt 6868 vmcnt(0) and 6869 s_waitcnt 6870 lgkmcnt(0) to allow 6871 them to be 6872 independently moved 6873 according to the 6874 following rules. 6875 - s_waitcnt vmcnt(0) 6876 must happen after 6877 any preceding 6878 global/generic load 6879 atomic/atomicrmw 6880 with an equal or 6881 wider sync scope 6882 and memory ordering 6883 stronger than 6884 unordered (this is 6885 termed the 6886 fence-paired-atomic). 6887 - s_waitcnt lgkmcnt(0) 6888 must happen after 6889 any preceding 6890 local/generic load 6891 atomic/atomicrmw 6892 with an equal or 6893 wider sync scope 6894 and memory ordering 6895 stronger than 6896 unordered (this is 6897 termed the 6898 fence-paired-atomic). 6899 - Must happen before 6900 the following buffer_invl2 and 6901 buffer_wbinvl1_vol. 6902 - Ensures that the 6903 fence-paired atomic 6904 has completed 6905 before invalidating 6906 the 6907 cache. Therefore 6908 any following 6909 locations read must 6910 be no older than 6911 the value read by 6912 the 6913 fence-paired-atomic. 6914 6915 2. 
buffer_invl2; 6916 buffer_wbinvl1_vol 6917 6918 - Must happen before any 6919 following global/generic 6920 load/load 6921 atomic/store/store 6922 atomic/atomicrmw. 6923 - Ensures that 6924 following 6925 loads will not see 6926 stale L1 global data, 6927 nor see stale L2 MTYPE 6928 NC global data. 6929 MTYPE RW and CC memory will 6930 never be stale in L2 due to 6931 the memory probes. 6932 **Release Atomic** 6933 ------------------------------------------------------------------------------------ 6934 store atomic release - singlethread - global 1. buffer/global/flat_store 6935 - wavefront - generic 6936 store atomic release - singlethread - local *If TgSplit execution mode, 6937 - wavefront local address space cannot 6938 be used.* 6939 6940 1. ds_store 6941 store atomic release - workgroup - global 1. s_waitcnt lgkm/vmcnt(0) 6942 - generic 6943 - Use lgkmcnt(0) if not 6944 TgSplit execution mode 6945 and vmcnt(0) if TgSplit 6946 execution mode. 6947 - If OpenCL, omit lgkmcnt(0). 6948 - s_waitcnt vmcnt(0) 6949 must happen after 6950 any preceding 6951 global/generic load/store/ 6952 load atomic/store atomic/ 6953 atomicrmw. 6954 - s_waitcnt lgkmcnt(0) 6955 must happen after 6956 any preceding 6957 local/generic 6958 load/store/load 6959 atomic/store 6960 atomic/atomicrmw. 6961 - Must happen before 6962 the following 6963 store. 6964 - Ensures that all 6965 memory operations 6966 have 6967 completed before 6968 performing the 6969 store that is being 6970 released. 6971 6972 2. buffer/global/flat_store 6973 store atomic release - workgroup - local *If TgSplit execution mode, 6974 local address space cannot 6975 be used.* 6976 6977 1. ds_store 6978 store atomic release - agent - global 1. s_waitcnt lgkmcnt(0) & 6979 - generic vmcnt(0) 6980 6981 - If TgSplit execution mode, 6982 omit lgkmcnt(0). 6983 - If OpenCL and 6984 address space is 6985 not generic, omit 6986 lgkmcnt(0). 
6987 - Could be split into 6988 separate s_waitcnt 6989 vmcnt(0) and 6990 s_waitcnt 6991 lgkmcnt(0) to allow 6992 them to be 6993 independently moved 6994 according to the 6995 following rules. 6996 - s_waitcnt vmcnt(0) 6997 must happen after 6998 any preceding 6999 global/generic 7000 load/store/load 7001 atomic/store 7002 atomic/atomicrmw. 7003 - s_waitcnt lgkmcnt(0) 7004 must happen after 7005 any preceding 7006 local/generic 7007 load/store/load 7008 atomic/store 7009 atomic/atomicrmw. 7010 - Must happen before 7011 the following 7012 store. 7013 - Ensures that all 7014 memory operations 7015 to memory have 7016 completed before 7017 performing the 7018 store that is being 7019 released. 7020 7021 2. buffer/global/flat_store 7022 store atomic release - system - global 1. buffer_wbl2 7023 - generic 7024 - Must happen before 7025 following s_waitcnt. 7026 - Performs L2 writeback to 7027 ensure previous 7028 global/generic 7029 store/atomicrmw are 7030 visible at system scope. 7031 7032 2. s_waitcnt lgkmcnt(0) & 7033 vmcnt(0) 7034 7035 - If TgSplit execution mode, 7036 omit lgkmcnt(0). 7037 - If OpenCL and 7038 address space is 7039 not generic, omit 7040 lgkmcnt(0). 7041 - Could be split into 7042 separate s_waitcnt 7043 vmcnt(0) and 7044 s_waitcnt 7045 lgkmcnt(0) to allow 7046 them to be 7047 independently moved 7048 according to the 7049 following rules. 7050 - s_waitcnt vmcnt(0) 7051 must happen after any 7052 preceding 7053 global/generic 7054 load/store/load 7055 atomic/store 7056 atomic/atomicrmw. 7057 - s_waitcnt lgkmcnt(0) 7058 must happen after any 7059 preceding 7060 local/generic 7061 load/store/load 7062 atomic/store 7063 atomic/atomicrmw. 7064 - Must happen before 7065 the following 7066 store. 7067 - Ensures that all 7068 memory operations 7069 to memory and the L2 7070 writeback have 7071 completed before 7072 performing the 7073 store that is being 7074 released. 7075 7076 3. 
buffer/global/flat_store 7077 atomicrmw release - singlethread - global 1. buffer/global/flat_atomic 7078 - wavefront - generic 7079 atomicrmw release - singlethread - local *If TgSplit execution mode, 7080 - wavefront local address space cannot 7081 be used.* 7082 7083 1. ds_atomic 7084 atomicrmw release - workgroup - global 1. s_waitcnt lgkm/vmcnt(0) 7085 - generic 7086 - Use lgkmcnt(0) if not 7087 TgSplit execution mode 7088 and vmcnt(0) if TgSplit 7089 execution mode. 7090 - If OpenCL, omit 7091 lgkmcnt(0). 7092 - s_waitcnt vmcnt(0) 7093 must happen after 7094 any preceding 7095 global/generic load/store/ 7096 load atomic/store atomic/ 7097 atomicrmw. 7098 - s_waitcnt lgkmcnt(0) 7099 must happen after 7100 any preceding 7101 local/generic 7102 load/store/load 7103 atomic/store 7104 atomic/atomicrmw. 7105 - Must happen before 7106 the following 7107 atomicrmw. 7108 - Ensures that all 7109 memory operations 7110 have 7111 completed before 7112 performing the 7113 atomicrmw that is 7114 being released. 7115 7116 2. buffer/global/flat_atomic 7117 atomicrmw release - workgroup - local *If TgSplit execution mode, 7118 local address space cannot 7119 be used.* 7120 7121 1. ds_atomic 7122 atomicrmw release - agent - global 1. s_waitcnt lgkmcnt(0) & 7123 - generic vmcnt(0) 7124 7125 - If TgSplit execution mode, 7126 omit lgkmcnt(0). 7127 - If OpenCL, omit 7128 lgkmcnt(0). 7129 - Could be split into 7130 separate s_waitcnt 7131 vmcnt(0) and 7132 s_waitcnt 7133 lgkmcnt(0) to allow 7134 them to be 7135 independently moved 7136 according to the 7137 following rules. 7138 - s_waitcnt vmcnt(0) 7139 must happen after 7140 any preceding 7141 global/generic 7142 load/store/load 7143 atomic/store 7144 atomic/atomicrmw. 7145 - s_waitcnt lgkmcnt(0) 7146 must happen after 7147 any preceding 7148 local/generic 7149 load/store/load 7150 atomic/store 7151 atomic/atomicrmw. 7152 - Must happen before 7153 the following 7154 atomicrmw. 
7155 - Ensures that all 7156 memory operations 7157 to global and local 7158 have completed 7159 before performing 7160 the atomicrmw that 7161 is being released. 7162 7163 2. buffer/global/flat_atomic 7164 atomicrmw release - system - global 1. buffer_wbl2 7165 - generic 7166 - Must happen before 7167 following s_waitcnt. 7168 - Performs L2 writeback to 7169 ensure previous 7170 global/generic 7171 store/atomicrmw are 7172 visible at system scope. 7173 7174 2. s_waitcnt lgkmcnt(0) & 7175 vmcnt(0) 7176 7177 - If TgSplit execution mode, 7178 omit lgkmcnt(0). 7179 - If OpenCL, omit 7180 lgkmcnt(0). 7181 - Could be split into 7182 separate s_waitcnt 7183 vmcnt(0) and 7184 s_waitcnt 7185 lgkmcnt(0) to allow 7186 them to be 7187 independently moved 7188 according to the 7189 following rules. 7190 - s_waitcnt vmcnt(0) 7191 must happen after 7192 any preceding 7193 global/generic 7194 load/store/load 7195 atomic/store 7196 atomic/atomicrmw. 7197 - s_waitcnt lgkmcnt(0) 7198 must happen after 7199 any preceding 7200 local/generic 7201 load/store/load 7202 atomic/store 7203 atomic/atomicrmw. 7204 - Must happen before 7205 the following 7206 atomicrmw. 7207 - Ensures that all 7208 memory operations 7209 to memory and the L2 7210 writeback have 7211 completed before 7212 performing the 7213 store that is being 7214 released. 7215 7216 3. buffer/global/flat_atomic 7217 fence release - singlethread *none* *none* 7218 - wavefront 7219 fence release - workgroup *none* 1. s_waitcnt lgkm/vmcnt(0) 7220 7221 - Use lgkmcnt(0) if not 7222 TgSplit execution mode 7223 and vmcnt(0) if TgSplit 7224 execution mode. 7225 - If OpenCL and 7226 address space is 7227 not generic, omit 7228 lgkmcnt(0). 7229 - If OpenCL and 7230 address space is 7231 local, omit 7232 vmcnt(0). 7233 - However, since LLVM 7234 currently has no 7235 address space on 7236 the fence need to 7237 conservatively 7238 always generate. 
If 7239 fence had an 7240 address space then 7241 set to address 7242 space of OpenCL 7243 fence flag, or to 7244 generic if both 7245 local and global 7246 flags are 7247 specified. 7248 - s_waitcnt vmcnt(0) 7249 must happen after 7250 any preceding 7251 global/generic 7252 load/store/ 7253 load atomic/store atomic/ 7254 atomicrmw. 7255 - s_waitcnt lgkmcnt(0) 7256 must happen after 7257 any preceding 7258 local/generic 7259 load/load 7260 atomic/store/store 7261 atomic/atomicrmw. 7262 - Must happen before 7263 any following store 7264 atomic/atomicrmw 7265 with an equal or 7266 wider sync scope 7267 and memory ordering 7268 stronger than 7269 unordered (this is 7270 termed the 7271 fence-paired-atomic). 7272 - Ensures that all 7273 memory operations 7274 have 7275 completed before 7276 performing the 7277 following 7278 fence-paired-atomic. 7279 7280 fence release - agent *none* 1. s_waitcnt lgkmcnt(0) & 7281 vmcnt(0) 7282 7283 - If TgSplit execution mode, 7284 omit lgkmcnt(0). 7285 - If OpenCL and 7286 address space is 7287 not generic, omit 7288 lgkmcnt(0). 7289 - If OpenCL and 7290 address space is 7291 local, omit 7292 vmcnt(0). 7293 - However, since LLVM 7294 currently has no 7295 address space on 7296 the fence need to 7297 conservatively 7298 always generate. If 7299 fence had an 7300 address space then 7301 set to address 7302 space of OpenCL 7303 fence flag, or to 7304 generic if both 7305 local and global 7306 flags are 7307 specified. 7308 - Could be split into 7309 separate s_waitcnt 7310 vmcnt(0) and 7311 s_waitcnt 7312 lgkmcnt(0) to allow 7313 them to be 7314 independently moved 7315 according to the 7316 following rules. 7317 - s_waitcnt vmcnt(0) 7318 must happen after 7319 any preceding 7320 global/generic 7321 load/store/load 7322 atomic/store 7323 atomic/atomicrmw. 7324 - s_waitcnt lgkmcnt(0) 7325 must happen after 7326 any preceding 7327 local/generic 7328 load/store/load 7329 atomic/store 7330 atomic/atomicrmw. 
7331 - Must happen before 7332 any following store 7333 atomic/atomicrmw 7334 with an equal or 7335 wider sync scope 7336 and memory ordering 7337 stronger than 7338 unordered (this is 7339 termed the 7340 fence-paired-atomic). 7341 - Ensures that all 7342 memory operations 7343 have 7344 completed before 7345 performing the 7346 following 7347 fence-paired-atomic. 7348 7349 fence release - system *none* 1. buffer_wbl2 7350 7351 - If OpenCL and 7352 address space is 7353 local, omit. 7354 - Must happen before 7355 following s_waitcnt. 7356 - Performs L2 writeback to 7357 ensure previous 7358 global/generic 7359 store/atomicrmw are 7360 visible at system scope. 7361 7362 2. s_waitcnt lgkmcnt(0) & 7363 vmcnt(0) 7364 7365 - If TgSplit execution mode, 7366 omit lgkmcnt(0). 7367 - If OpenCL and 7368 address space is 7369 not generic, omit 7370 lgkmcnt(0). 7371 - If OpenCL and 7372 address space is 7373 local, omit 7374 vmcnt(0). 7375 - However, since LLVM 7376 currently has no 7377 address space on 7378 the fence need to 7379 conservatively 7380 always generate. If 7381 fence had an 7382 address space then 7383 set to address 7384 space of OpenCL 7385 fence flag, or to 7386 generic if both 7387 local and global 7388 flags are 7389 specified. 7390 - Could be split into 7391 separate s_waitcnt 7392 vmcnt(0) and 7393 s_waitcnt 7394 lgkmcnt(0) to allow 7395 them to be 7396 independently moved 7397 according to the 7398 following rules. 7399 - s_waitcnt vmcnt(0) 7400 must happen after 7401 any preceding 7402 global/generic 7403 load/store/load 7404 atomic/store 7405 atomic/atomicrmw. 7406 - s_waitcnt lgkmcnt(0) 7407 must happen after 7408 any preceding 7409 local/generic 7410 load/store/load 7411 atomic/store 7412 atomic/atomicrmw. 7413 - Must happen before 7414 any following store 7415 atomic/atomicrmw 7416 with an equal or 7417 wider sync scope 7418 and memory ordering 7419 stronger than 7420 unordered (this is 7421 termed the 7422 fence-paired-atomic). 
7423 - Ensures that all 7424 memory operations 7425 have 7426 completed before 7427 performing the 7428 following 7429 fence-paired-atomic. 7430 7431 **Acquire-Release Atomic** 7432 ------------------------------------------------------------------------------------ 7433 atomicrmw acq_rel - singlethread - global 1. buffer/global/flat_atomic 7434 - wavefront - generic 7435 atomicrmw acq_rel - singlethread - local *If TgSplit execution mode, 7436 - wavefront local address space cannot 7437 be used.* 7438 7439 1. ds_atomic 7440 atomicrmw acq_rel - workgroup - global 1. s_waitcnt lgkm/vmcnt(0) 7441 7442 - Use lgkmcnt(0) if not 7443 TgSplit execution mode 7444 and vmcnt(0) if TgSplit 7445 execution mode. 7446 - If OpenCL, omit 7447 lgkmcnt(0). 7448 - Must happen after 7449 any preceding 7450 local/generic 7451 load/store/load 7452 atomic/store 7453 atomic/atomicrmw. 7454 - s_waitcnt vmcnt(0) 7455 must happen after 7456 any preceding 7457 global/generic load/store/ 7458 load atomic/store atomic/ 7459 atomicrmw. 7460 - s_waitcnt lgkmcnt(0) 7461 must happen after 7462 any preceding 7463 local/generic 7464 load/store/load 7465 atomic/store 7466 atomic/atomicrmw. 7467 - Must happen before 7468 the following 7469 atomicrmw. 7470 - Ensures that all 7471 memory operations 7472 have 7473 completed before 7474 performing the 7475 atomicrmw that is 7476 being released. 7477 7478 2. buffer/global_atomic 7479 3. s_waitcnt vmcnt(0) 7480 7481 - If not TgSplit execution 7482 mode, omit. 7483 - Must happen before 7484 the following 7485 buffer_wbinvl1_vol. 7486 - Ensures any 7487 following global 7488 data read is no 7489 older than the 7490 atomicrmw value 7491 being acquired. 7492 7493 4. buffer_wbinvl1_vol 7494 7495 - If not TgSplit execution 7496 mode, omit. 7497 - Ensures that 7498 following 7499 loads will not see 7500 stale data. 7501 7502 atomicrmw acq_rel - workgroup - local *If TgSplit execution mode, 7503 local address space cannot 7504 be used.* 7505 7506 1. 
                                                            ds_atomic
                                                         2. s_waitcnt lgkmcnt(0)

                                                            - If OpenCL, omit.
                                                            - Must happen before
                                                              any following
                                                              global/generic
                                                              load/load
                                                              atomic/store/store
                                                              atomic/atomicrmw.
                                                            - Ensures any
                                                              following global
                                                              data read is no
                                                              older than the local load
                                                              atomic value being
                                                              acquired.

     atomicrmw    acq_rel      - workgroup    - generic  1. s_waitcnt lgkm/vmcnt(0)

                                                            - Use lgkmcnt(0) if not
                                                              TgSplit execution mode
                                                              and vmcnt(0) if TgSplit
                                                              execution mode.
                                                            - If OpenCL, omit
                                                              lgkmcnt(0).
                                                            - s_waitcnt vmcnt(0)
                                                              must happen after
                                                              any preceding
                                                              global/generic load/store/
                                                              load atomic/store atomic/
                                                              atomicrmw.
                                                            - s_waitcnt lgkmcnt(0)
                                                              must happen after
                                                              any preceding
                                                              local/generic
                                                              load/store/load
                                                              atomic/store
                                                              atomic/atomicrmw.
                                                            - Must happen before
                                                              the following
                                                              atomicrmw.
                                                            - Ensures that all
                                                              memory operations
                                                              have
                                                              completed before
                                                              performing the
                                                              atomicrmw that is
                                                              being released.

                                                         2. flat_atomic
                                                         3. s_waitcnt lgkmcnt(0) &
                                                            vmcnt(0)

                                                            - If not TgSplit execution
                                                              mode, omit vmcnt(0).
                                                            - If OpenCL, omit
                                                              lgkmcnt(0).
                                                            - Must happen before
                                                              the following
                                                              buffer_wbinvl1_vol and
                                                              any following
                                                              global/generic
                                                              load/load
                                                              atomic/store/store
                                                              atomic/atomicrmw.
                                                            - Ensures any
                                                              following global
                                                              data read is no
                                                              older than a local load
                                                              atomic value being
                                                              acquired.

                                                         4. buffer_wbinvl1_vol

                                                            - If not TgSplit execution
                                                              mode, omit.
                                                            - Ensures that
                                                              following
                                                              loads will not see
                                                              stale data.

     atomicrmw    acq_rel      - agent        - global   1. s_waitcnt lgkmcnt(0) &
                                                            vmcnt(0)

                                                            - If TgSplit execution mode,
                                                              omit lgkmcnt(0).
                                                            - If OpenCL, omit
                                                              lgkmcnt(0).
7594 - Could be split into 7595 separate s_waitcnt 7596 vmcnt(0) and 7597 s_waitcnt 7598 lgkmcnt(0) to allow 7599 them to be 7600 independently moved 7601 according to the 7602 following rules. 7603 - s_waitcnt vmcnt(0) 7604 must happen after 7605 any preceding 7606 global/generic 7607 load/store/load 7608 atomic/store 7609 atomic/atomicrmw. 7610 - s_waitcnt lgkmcnt(0) 7611 must happen after 7612 any preceding 7613 local/generic 7614 load/store/load 7615 atomic/store 7616 atomic/atomicrmw. 7617 - Must happen before 7618 the following 7619 atomicrmw. 7620 - Ensures that all 7621 memory operations 7622 to global have 7623 completed before 7624 performing the 7625 atomicrmw that is 7626 being released. 7627 7628 2. buffer/global_atomic 7629 3. s_waitcnt vmcnt(0) 7630 7631 - Must happen before 7632 following 7633 buffer_wbinvl1_vol. 7634 - Ensures the 7635 atomicrmw has 7636 completed before 7637 invalidating the 7638 cache. 7639 7640 4. buffer_wbinvl1_vol 7641 7642 - Must happen before 7643 any following 7644 global/generic 7645 load/load 7646 atomic/atomicrmw. 7647 - Ensures that 7648 following loads 7649 will not see stale 7650 global data. 7651 7652 atomicrmw acq_rel - system - global 1. buffer_wbl2 7653 7654 - Must happen before 7655 following s_waitcnt. 7656 - Performs L2 writeback to 7657 ensure previous 7658 global/generic 7659 store/atomicrmw are 7660 visible at system scope. 7661 7662 2. s_waitcnt lgkmcnt(0) & 7663 vmcnt(0) 7664 7665 - If TgSplit execution mode, 7666 omit lgkmcnt(0). 7667 - If OpenCL, omit 7668 lgkmcnt(0). 7669 - Could be split into 7670 separate s_waitcnt 7671 vmcnt(0) and 7672 s_waitcnt 7673 lgkmcnt(0) to allow 7674 them to be 7675 independently moved 7676 according to the 7677 following rules. 7678 - s_waitcnt vmcnt(0) 7679 must happen after 7680 any preceding 7681 global/generic 7682 load/store/load 7683 atomic/store 7684 atomic/atomicrmw. 
7685 - s_waitcnt lgkmcnt(0) 7686 must happen after 7687 any preceding 7688 local/generic 7689 load/store/load 7690 atomic/store 7691 atomic/atomicrmw. 7692 - Must happen before 7693 the following 7694 atomicrmw. 7695 - Ensures that all 7696 memory operations 7697 to global and L2 writeback 7698 have completed before 7699 performing the 7700 atomicrmw that is 7701 being released. 7702 7703 3. buffer/global_atomic 7704 4. s_waitcnt vmcnt(0) 7705 7706 - Must happen before 7707 following buffer_invl2 and 7708 buffer_wbinvl1_vol. 7709 - Ensures the 7710 atomicrmw has 7711 completed before 7712 invalidating the 7713 caches. 7714 7715 5. buffer_invl2; 7716 buffer_wbinvl1_vol 7717 7718 - Must happen before 7719 any following 7720 global/generic 7721 load/load 7722 atomic/atomicrmw. 7723 - Ensures that 7724 following 7725 loads will not see 7726 stale L1 global data, 7727 nor see stale L2 MTYPE 7728 NC global data. 7729 MTYPE RW and CC memory will 7730 never be stale in L2 due to 7731 the memory probes. 7732 7733 atomicrmw acq_rel - agent - generic 1. s_waitcnt lgkmcnt(0) & 7734 vmcnt(0) 7735 7736 - If TgSplit execution mode, 7737 omit lgkmcnt(0). 7738 - If OpenCL, omit 7739 lgkmcnt(0). 7740 - Could be split into 7741 separate s_waitcnt 7742 vmcnt(0) and 7743 s_waitcnt 7744 lgkmcnt(0) to allow 7745 them to be 7746 independently moved 7747 according to the 7748 following rules. 7749 - s_waitcnt vmcnt(0) 7750 must happen after 7751 any preceding 7752 global/generic 7753 load/store/load 7754 atomic/store 7755 atomic/atomicrmw. 7756 - s_waitcnt lgkmcnt(0) 7757 must happen after 7758 any preceding 7759 local/generic 7760 load/store/load 7761 atomic/store 7762 atomic/atomicrmw. 7763 - Must happen before 7764 the following 7765 atomicrmw. 7766 - Ensures that all 7767 memory operations 7768 to global have 7769 completed before 7770 performing the 7771 atomicrmw that is 7772 being released. 7773 7774 2. flat_atomic 7775 3. 
s_waitcnt vmcnt(0) & 7776 lgkmcnt(0) 7777 7778 - If TgSplit execution mode, 7779 omit lgkmcnt(0). 7780 - If OpenCL, omit 7781 lgkmcnt(0). 7782 - Must happen before 7783 following 7784 buffer_wbinvl1_vol. 7785 - Ensures the 7786 atomicrmw has 7787 completed before 7788 invalidating the 7789 cache. 7790 7791 4. buffer_wbinvl1_vol 7792 7793 - Must happen before 7794 any following 7795 global/generic 7796 load/load 7797 atomic/atomicrmw. 7798 - Ensures that 7799 following loads 7800 will not see stale 7801 global data. 7802 7803 atomicrmw acq_rel - system - generic 1. buffer_wbl2 7804 7805 - Must happen before 7806 following s_waitcnt. 7807 - Performs L2 writeback to 7808 ensure previous 7809 global/generic 7810 store/atomicrmw are 7811 visible at system scope. 7812 7813 2. s_waitcnt lgkmcnt(0) & 7814 vmcnt(0) 7815 7816 - If TgSplit execution mode, 7817 omit lgkmcnt(0). 7818 - If OpenCL, omit 7819 lgkmcnt(0). 7820 - Could be split into 7821 separate s_waitcnt 7822 vmcnt(0) and 7823 s_waitcnt 7824 lgkmcnt(0) to allow 7825 them to be 7826 independently moved 7827 according to the 7828 following rules. 7829 - s_waitcnt vmcnt(0) 7830 must happen after 7831 any preceding 7832 global/generic 7833 load/store/load 7834 atomic/store 7835 atomic/atomicrmw. 7836 - s_waitcnt lgkmcnt(0) 7837 must happen after 7838 any preceding 7839 local/generic 7840 load/store/load 7841 atomic/store 7842 atomic/atomicrmw. 7843 - Must happen before 7844 the following 7845 atomicrmw. 7846 - Ensures that all 7847 memory operations 7848 to global and L2 writeback 7849 have completed before 7850 performing the 7851 atomicrmw that is 7852 being released. 7853 7854 3. flat_atomic 7855 4. s_waitcnt vmcnt(0) & 7856 lgkmcnt(0) 7857 7858 - If TgSplit execution mode, 7859 omit lgkmcnt(0). 7860 - If OpenCL, omit 7861 lgkmcnt(0). 7862 - Must happen before 7863 following buffer_invl2 and 7864 buffer_wbinvl1_vol. 7865 - Ensures the 7866 atomicrmw has 7867 completed before 7868 invalidating the 7869 caches. 
7870 7871 5. buffer_invl2; 7872 buffer_wbinvl1_vol 7873 7874 - Must happen before 7875 any following 7876 global/generic 7877 load/load 7878 atomic/atomicrmw. 7879 - Ensures that 7880 following 7881 loads will not see 7882 stale L1 global data, 7883 nor see stale L2 MTYPE 7884 NC global data. 7885 MTYPE RW and CC memory will 7886 never be stale in L2 due to 7887 the memory probes. 7888 7889 fence acq_rel - singlethread *none* *none* 7890 - wavefront 7891 fence acq_rel - workgroup *none* 1. s_waitcnt lgkm/vmcnt(0) 7892 7893 - Use lgkmcnt(0) if not 7894 TgSplit execution mode 7895 and vmcnt(0) if TgSplit 7896 execution mode. 7897 - If OpenCL and 7898 address space is 7899 not generic, omit 7900 lgkmcnt(0). 7901 - If OpenCL and 7902 address space is 7903 local, omit 7904 vmcnt(0). 7905 - However, 7906 since LLVM 7907 currently has no 7908 address space on 7909 the fence need to 7910 conservatively 7911 always generate 7912 (see comment for 7913 previous fence). 7914 - s_waitcnt vmcnt(0) 7915 must happen after 7916 any preceding 7917 global/generic 7918 load/store/ 7919 load atomic/store atomic/ 7920 atomicrmw. 7921 - s_waitcnt lgkmcnt(0) 7922 must happen after 7923 any preceding 7924 local/generic 7925 load/load 7926 atomic/store/store 7927 atomic/atomicrmw. 7928 - Must happen before 7929 any following 7930 global/generic 7931 load/load 7932 atomic/store/store 7933 atomic/atomicrmw. 7934 - Ensures that all 7935 memory operations 7936 have 7937 completed before 7938 performing any 7939 following global 7940 memory operations. 7941 - Ensures that the 7942 preceding 7943 local/generic load 7944 atomic/atomicrmw 7945 with an equal or 7946 wider sync scope 7947 and memory ordering 7948 stronger than 7949 unordered (this is 7950 termed the 7951 acquire-fence-paired-atomic) 7952 has completed 7953 before following 7954 global memory 7955 operations. This 7956 satisfies the 7957 requirements of 7958 acquire. 
7959 - Ensures that all 7960 previous memory 7961 operations have 7962 completed before a 7963 following 7964 local/generic store 7965 atomic/atomicrmw 7966 with an equal or 7967 wider sync scope 7968 and memory ordering 7969 stronger than 7970 unordered (this is 7971 termed the 7972 release-fence-paired-atomic). 7973 This satisfies the 7974 requirements of 7975 release. 7976 - Must happen before 7977 the following 7978 buffer_wbinvl1_vol. 7979 - Ensures that the 7980 acquire-fence-paired 7981 atomic has completed 7982 before invalidating 7983 the 7984 cache. Therefore 7985 any following 7986 locations read must 7987 be no older than 7988 the value read by 7989 the 7990 acquire-fence-paired-atomic. 7991 7992 2. buffer_wbinvl1_vol 7993 7994 - If not TgSplit execution 7995 mode, omit. 7996 - Ensures that 7997 following 7998 loads will not see 7999 stale data. 8000 8001 fence acq_rel - agent *none* 1. s_waitcnt lgkmcnt(0) & 8002 vmcnt(0) 8003 8004 - If TgSplit execution mode, 8005 omit lgkmcnt(0). 8006 - If OpenCL and 8007 address space is 8008 not generic, omit 8009 lgkmcnt(0). 8010 - However, since LLVM 8011 currently has no 8012 address space on 8013 the fence need to 8014 conservatively 8015 always generate 8016 (see comment for 8017 previous fence). 8018 - Could be split into 8019 separate s_waitcnt 8020 vmcnt(0) and 8021 s_waitcnt 8022 lgkmcnt(0) to allow 8023 them to be 8024 independently moved 8025 according to the 8026 following rules. 8027 - s_waitcnt vmcnt(0) 8028 must happen after 8029 any preceding 8030 global/generic 8031 load/store/load 8032 atomic/store 8033 atomic/atomicrmw. 8034 - s_waitcnt lgkmcnt(0) 8035 must happen after 8036 any preceding 8037 local/generic 8038 load/store/load 8039 atomic/store 8040 atomic/atomicrmw. 8041 - Must happen before 8042 the following 8043 buffer_wbinvl1_vol. 
8044 - Ensures that the 8045 preceding 8046 global/local/generic 8047 load 8048 atomic/atomicrmw 8049 with an equal or 8050 wider sync scope 8051 and memory ordering 8052 stronger than 8053 unordered (this is 8054 termed the 8055 acquire-fence-paired-atomic) 8056 has completed 8057 before invalidating 8058 the cache. This 8059 satisfies the 8060 requirements of 8061 acquire. 8062 - Ensures that all 8063 previous memory 8064 operations have 8065 completed before a 8066 following 8067 global/local/generic 8068 store 8069 atomic/atomicrmw 8070 with an equal or 8071 wider sync scope 8072 and memory ordering 8073 stronger than 8074 unordered (this is 8075 termed the 8076 release-fence-paired-atomic). 8077 This satisfies the 8078 requirements of 8079 release. 8080 8081 2. buffer_wbinvl1_vol 8082 8083 - Must happen before 8084 any following 8085 global/generic 8086 load/load 8087 atomic/store/store 8088 atomic/atomicrmw. 8089 - Ensures that 8090 following loads 8091 will not see stale 8092 global data. This 8093 satisfies the 8094 requirements of 8095 acquire. 8096 8097 fence acq_rel - system *none* 1. buffer_wbl2 8098 8099 - If OpenCL and 8100 address space is 8101 local, omit. 8102 - Must happen before 8103 following s_waitcnt. 8104 - Performs L2 writeback to 8105 ensure previous 8106 global/generic 8107 store/atomicrmw are 8108 visible at system scope. 8109 8110 2. s_waitcnt lgkmcnt(0) & 8111 vmcnt(0) 8112 8113 - If TgSplit execution mode, 8114 omit lgkmcnt(0). 8115 - If OpenCL and 8116 address space is 8117 not generic, omit 8118 lgkmcnt(0). 8119 - However, since LLVM 8120 currently has no 8121 address space on 8122 the fence need to 8123 conservatively 8124 always generate 8125 (see comment for 8126 previous fence). 8127 - Could be split into 8128 separate s_waitcnt 8129 vmcnt(0) and 8130 s_waitcnt 8131 lgkmcnt(0) to allow 8132 them to be 8133 independently moved 8134 according to the 8135 following rules. 
                                                         - s_waitcnt vmcnt(0)
                                                           must happen after
                                                           any preceding
                                                           global/generic
                                                           load/store/load
                                                           atomic/store
                                                           atomic/atomicrmw.
                                                         - s_waitcnt lgkmcnt(0)
                                                           must happen after
                                                           any preceding
                                                           local/generic
                                                           load/store/load
                                                           atomic/store
                                                           atomic/atomicrmw.
                                                         - Must happen before
                                                           the following buffer_invl2 and
                                                           buffer_wbinvl1_vol.
                                                         - Ensures that the
                                                           preceding
                                                           global/local/generic
                                                           load
                                                           atomic/atomicrmw
                                                           with an equal or
                                                           wider sync scope
                                                           and memory ordering
                                                           stronger than
                                                           unordered (this is
                                                           termed the
                                                           acquire-fence-paired-atomic)
                                                           has completed
                                                           before invalidating
                                                           the cache. This
                                                           satisfies the
                                                           requirements of
                                                           acquire.
                                                         - Ensures that all
                                                           previous memory
                                                           operations have
                                                           completed before a
                                                           following
                                                           global/local/generic
                                                           store
                                                           atomic/atomicrmw
                                                           with an equal or
                                                           wider sync scope
                                                           and memory ordering
                                                           stronger than
                                                           unordered (this is
                                                           termed the
                                                           release-fence-paired-atomic).
                                                           This satisfies the
                                                           requirements of
                                                           release.

                                                         3. buffer_invl2;
                                                            buffer_wbinvl1_vol

                                                         - Must happen before
                                                           any following
                                                           global/generic
                                                           load/load
                                                           atomic/store/store
                                                           atomic/atomicrmw.
                                                         - Ensures that
                                                           following
                                                           loads will not see
                                                           stale L1 global data,
                                                           nor see stale L2 MTYPE
                                                           NC global data.
                                                           MTYPE RW and CC memory will
                                                           never be stale in L2 due to
                                                           the memory probes.

     **Sequential Consistent Atomic**
     ------------------------------------------------------------------------------------
     load atomic  seq_cst      - singlethread - global   *Same as corresponding
                               - wavefront    - local    load atomic acquire,
                                              - generic  except must generate
                                                         all instructions even
                                                         for OpenCL.*
     load atomic  seq_cst      - workgroup    - global   1. s_waitcnt lgkm/vmcnt(0)
                                              - generic
                                                         - Use lgkmcnt(0) if not
                                                           TgSplit execution mode
                                                           and vmcnt(0) if TgSplit
                                                           execution mode.
                                                         - s_waitcnt lgkmcnt(0) must
                                                           happen after
                                                           preceding
                                                           local/generic load
                                                           atomic/store
                                                           atomic/atomicrmw
                                                           with memory
                                                           ordering of seq_cst
                                                           and with equal or
                                                           wider sync scope.
                                                           (Note that seq_cst
                                                           fences have their
                                                           own s_waitcnt
                                                           lgkmcnt(0) and so do
                                                           not need to be
                                                           considered.)
                                                         - s_waitcnt vmcnt(0)
                                                           must happen after
                                                           preceding
                                                           global/generic load
                                                           atomic/store
                                                           atomic/atomicrmw
                                                           with memory
                                                           ordering of seq_cst
                                                           and with equal or
                                                           wider sync scope.
                                                           (Note that seq_cst
                                                           fences have their
                                                           own s_waitcnt
                                                           vmcnt(0) and so do
                                                           not need to be
                                                           considered.)
                                                         - Ensures any
                                                           preceding
                                                           sequential
                                                           consistent global/local
                                                           memory instructions
                                                           have completed
                                                           before executing
                                                           this sequentially
                                                           consistent
                                                           instruction. This
                                                           prevents reordering
                                                           a seq_cst store
                                                           followed by a
                                                           seq_cst load. (Note
                                                           that seq_cst is
                                                           stronger than
                                                           acquire/release as
                                                           the reordering of
                                                           load acquire
                                                           followed by a store
                                                           release is
                                                           prevented by the
                                                           s_waitcnt of
                                                           the release, but
                                                           there is nothing
                                                           preventing a store
                                                           release followed by
                                                           load acquire from
                                                           completing out of
                                                           order. The s_waitcnt
                                                           could be placed after
                                                           seq_store or before
                                                           the seq_load. We
                                                           choose the load to
                                                           make the s_waitcnt be
                                                           as late as possible
                                                           so that the store
                                                           may have already
                                                           completed.)

                                                         2. *Following
                                                            instructions same as
                                                            corresponding load
                                                            atomic acquire,
                                                            except must generate
                                                            all instructions even
                                                            for OpenCL.*
     load atomic  seq_cst      - workgroup    - local    *If TgSplit execution mode,
                                                         local address space cannot
                                                         be used.*

                                                         *Same as corresponding
                                                         load atomic acquire,
                                                         except must generate
                                                         all instructions even
                                                         for OpenCL.*

     load atomic  seq_cst      - agent        - global   1. s_waitcnt lgkmcnt(0) &
                               - system       - generic     vmcnt(0)

                                                         - If TgSplit execution mode,
                                                           omit lgkmcnt(0).
                                                         - Could be split into
                                                           separate s_waitcnt
                                                           vmcnt(0)
                                                           and s_waitcnt
                                                           lgkmcnt(0) to allow
                                                           them to be
                                                           independently moved
                                                           according to the
                                                           following rules.
                                                         - s_waitcnt lgkmcnt(0)
                                                           must happen after
                                                           preceding
                                                           global/generic load
                                                           atomic/store
                                                           atomic/atomicrmw
                                                           with memory
                                                           ordering of seq_cst
                                                           and with equal or
                                                           wider sync scope.
                                                           (Note that seq_cst
                                                           fences have their
                                                           own s_waitcnt
                                                           lgkmcnt(0) and so do
                                                           not need to be
                                                           considered.)
                                                         - s_waitcnt vmcnt(0)
                                                           must happen after
                                                           preceding
                                                           global/generic load
                                                           atomic/store
                                                           atomic/atomicrmw
                                                           with memory
                                                           ordering of seq_cst
                                                           and with equal or
                                                           wider sync scope.
                                                           (Note that seq_cst
                                                           fences have their
                                                           own s_waitcnt
                                                           vmcnt(0) and so do
                                                           not need to be
                                                           considered.)
                                                         - Ensures any
                                                           preceding
                                                           sequential
                                                           consistent global
                                                           memory instructions
                                                           have completed
                                                           before executing
                                                           this sequentially
                                                           consistent
                                                           instruction. This
                                                           prevents reordering
                                                           a seq_cst store
                                                           followed by a
                                                           seq_cst load.
                                                           (Note
                                                           that seq_cst is
                                                           stronger than
                                                           acquire/release as
                                                           the reordering of
                                                           load acquire
                                                           followed by a store
                                                           release is
                                                           prevented by the
                                                           s_waitcnt of
                                                           the release, but
                                                           there is nothing
                                                           preventing a store
                                                           release followed by
                                                           load acquire from
                                                           completing out of
                                                           order. The s_waitcnt
                                                           could be placed after
                                                           seq_store or before
                                                           the seq_load. We
                                                           choose the load to
                                                           make the s_waitcnt be
                                                           as late as possible
                                                           so that the store
                                                           may have already
                                                           completed.)

                                                         2. *Following
                                                            instructions same as
                                                            corresponding load
                                                            atomic acquire,
                                                            except must generate
                                                            all instructions even
                                                            for OpenCL.*
     store atomic seq_cst      - singlethread - global   *Same as corresponding
                               - wavefront    - local    store atomic release,
                               - workgroup    - generic  except must generate
                               - agent                   all instructions even
                               - system                  for OpenCL.*
     atomicrmw    seq_cst      - singlethread - global   *Same as corresponding
                               - wavefront    - local    atomicrmw acq_rel,
                               - workgroup    - generic  except must generate
                               - agent                   all instructions even
                               - system                  for OpenCL.*
     fence        seq_cst      - singlethread *none*     *Same as corresponding
                               - wavefront               fence acq_rel,
                               - workgroup               except must generate
                               - agent                   all instructions even
                               - system                  for OpenCL.*
     ============ ============ ============== ========== ================================

.. _amdgpu-amdhsa-memory-model-gfx10:

Memory Model GFX10
++++++++++++++++++

For GFX10:

* Each agent has multiple shader arrays (SA).
* Each SA has multiple work-group processors (WGP).
* Each WGP has multiple compute units (CU).
* Each CU has multiple SIMDs that execute wavefronts.
* The wavefronts for a single work-group are executed in the same
  WGP. In CU wavefront execution mode the wavefronts may be executed by
  different SIMDs in the same CU. In WGP wavefront execution mode the
  wavefronts may be executed by different SIMDs in different CUs in the same
  WGP.
* Each WGP has a single LDS memory shared by the wavefronts of the work-groups
  executing on it.
* All LDS operations of a WGP are performed as wavefront wide operations in a
  global order and involve no caching. Completion is reported to a wavefront in
  execution order.
* The LDS memory has multiple request queues shared by the SIMDs of a
  WGP. Therefore, the LDS operations performed by different wavefronts of a
  work-group can be reordered relative to each other, which can result in
  reordering the visibility of vector memory operations with respect to LDS
  operations of other wavefronts in the same work-group. A ``s_waitcnt
  lgkmcnt(0)`` is required to ensure synchronization between LDS operations and
  vector memory operations between wavefronts of a work-group, but not between
  operations performed by the same wavefront.
* The vector memory operations are performed as wavefront wide operations.
  Completion of load/store/sample operations is reported to a wavefront in
  execution order of other load/store/sample operations performed by that
  wavefront.
* The vector memory operations access a vector L0 cache. There is a single L0
  cache per CU. Each SIMD of a CU accesses the same L0 cache. Therefore, no
  special action is required for coherence between the lanes of a single
  wavefront. However, a ``buffer_gl0_inv`` is required for coherence between
  wavefronts executing in the same work-group as they may be executing on SIMDs
  of different CUs that access different L0s. A ``buffer_gl0_inv`` is also
  required for coherence between wavefronts executing in different work-groups
  as they may be executing on different WGPs.
* The scalar memory operations access a scalar L0 cache shared by all wavefronts
  on a WGP. The scalar and vector L0 caches are not coherent. However, scalar
  operations are used in a restricted way so do not impact the memory model. See
  :ref:`amdgpu-amdhsa-memory-spaces`.
* The vector and scalar memory L0 caches use an L1 cache shared by all WGPs on
  the same SA. Therefore, no special action is required for coherence between
  the wavefronts of a single work-group. However, a ``buffer_gl1_inv`` is
  required for coherence between wavefronts executing in different work-groups
  as they may be executing on different SAs that access different L1s.
* The L1 caches have independent quadrants to service disjoint ranges of virtual
  addresses.
* Each L0 cache has a separate request queue per L1 quadrant. Therefore, the
  vector and scalar memory operations performed by different wavefronts, whether
  executing in the same or different work-groups (which may be executing on
  different CUs accessing different L0s), can be reordered relative to each
  other. A ``s_waitcnt vmcnt(0) & vscnt(0)`` is required to ensure
  synchronization between vector memory operations of different wavefronts. It
  ensures a previous vector memory operation has completed before executing a
  subsequent vector memory or LDS operation and so can be used to meet the
  requirements of acquire, release and sequential consistency.
* The L1 caches use an L2 cache shared by all SAs on the same agent.
* The L2 cache has independent channels to service disjoint ranges of virtual
  addresses.
* Each L1 quadrant of a single SA accesses a different L2 channel. Each L1
  quadrant has a separate request queue per L2 channel. Therefore, the vector
  and scalar memory operations performed by wavefronts executing in different
  work-groups (which may be executing on different SAs) of an agent can be
  reordered relative to each other. A ``s_waitcnt vmcnt(0) & vscnt(0)`` is
  required to ensure synchronization between vector memory operations of
  different SAs. It ensures a previous vector memory operation has completed
  before executing a subsequent vector memory operation and so can be used to
  meet the requirements of acquire, release and sequential consistency.
* The L2 cache can be kept coherent with other agents on some targets, or ranges
  of virtual addresses can be set up to bypass it to ensure system coherence.

Scalar memory operations are only used to access memory that is proven to not
change during the execution of the kernel dispatch. This includes constant
address space and global address space for program scope ``const`` variables.
Therefore, the kernel machine code does not have to maintain the scalar cache to
ensure it is coherent with the vector caches. The scalar and vector caches are
invalidated between kernel dispatches by CP since constant address space data
may change between kernel dispatch executions. See
:ref:`amdgpu-amdhsa-memory-spaces`.

The one exception is if scalar writes are used to spill SGPR registers. In this
case the AMDGPU backend ensures the memory location used to spill is never
accessed by vector memory operations at the same time. If scalar writes are used
then a ``s_dcache_wb`` is inserted before the ``s_endpgm`` and before a function
return since the locations may be used for vector memory instructions by a
future wavefront that uses the same scratch area, or a function call that
creates a frame at the same address, respectively. There is no need for a
``s_dcache_inv`` as all scalar writes are write-before-read in the same thread.

For kernarg backing memory:

* CP invalidates the L0 and L1 caches at the start of each kernel dispatch.
* On dGPU the kernarg backing memory is accessed as MTYPE UC (uncached) to avoid
  needing to invalidate the L2 cache.
* On APU the kernarg backing memory is accessed as MTYPE CC (cache coherent) and
  so the L2 cache will be coherent with the CPU and other agents.

Scratch backing memory (which is used for the private address space) is accessed
with MTYPE NC (non-coherent). Since the private address space is only accessed
by a single thread, and is always write-before-read, there is never a need to
invalidate these entries from the L0 or L1 caches.

Wavefronts are executed in native mode with in-order reporting of loads and
sample instructions. In this mode vmcnt reports completion of load, atomic with
return and sample instructions in order, and the vscnt reports the completion of
store and atomic without return in order. See ``MEM_ORDERED`` field in
:ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.

Wavefronts can be executed in WGP or CU wavefront execution mode:

* In WGP wavefront execution mode the wavefronts of a work-group are executed
  on the SIMDs of both CUs of the WGP. Therefore, explicit management of the per
  CU L0 caches is required for work-group synchronization. Also accesses to L1
  at work-group scope need to be explicitly ordered as the accesses from
  different CUs are not ordered.
* In CU wavefront execution mode the wavefronts of a work-group are executed on
  the SIMDs of a single CU of the WGP. Therefore, all global memory accesses by
  the work-group access the same L0 which in turn ensures L1 accesses are
  ordered and so do not require explicit management of the caches for
  work-group synchronization.

See ``WGP_MODE`` field in
:ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table` and
:ref:`amdgpu-target-features`.
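
The memory orderings and sync scopes listed in the first columns of the
code-sequence tables correspond directly to LLVM IR atomic annotations. As a
purely illustrative sketch (the value and pointer names are hypothetical, not
taken from this guide), a workgroup-scope acquire load from the global address
space and an agent-scope release fence are written in LLVM IR as:

.. code-block:: llvm

  ; Workgroup-scope acquire load from the global address space.
  %v = load atomic i32, i32 addrspace(1)* %ptr syncscope("workgroup") acquire, align 4
  ; Agent-scope release fence.
  fence syncscope("agent") release

For each such combination of instruction, ordering, scope and address space,
the code-sequence tables define the machine code the backend emits: loads with
cache-bypass bits, ``s_waitcnt`` instructions, and cache invalidate/writeback
instructions as required by the target's cache hierarchy.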

The code sequences used to implement the memory model for GFX10 are defined in
table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx10-table`.

  .. table:: AMDHSA Memory Model Code Sequences GFX10
     :name: amdgpu-amdhsa-memory-model-code-sequences-gfx10-table

     ============ ============ ============== ========== ================================
     LLVM Instr   LLVM Memory  LLVM Memory    AMDGPU     AMDGPU Machine Code
                  Ordering     Sync Scope     Address    GFX10
                                              Space
     ============ ============ ============== ========== ================================
     **Non-Atomic**
     ------------------------------------------------------------------------------------
     load         *none*       *none*         - global   - !volatile & !nontemporal
                                              - generic
                                              - private  1. buffer/global/flat_load
                                              - constant
                                                         - !volatile & nontemporal

                                                         1. buffer/global/flat_load
                                                            slc=1

                                                         - volatile

                                                         1. buffer/global/flat_load
                                                            glc=1 dlc=1
                                                         2. s_waitcnt vmcnt(0)

                                                         - Must happen before
                                                           any following volatile
                                                           global/generic
                                                           load/store.
                                                         - Ensures that
                                                           volatile
                                                           operations to
                                                           different
                                                           addresses will not
                                                           be reordered by
                                                           hardware.

     load         *none*       *none*         - local    1. ds_load
     store        *none*       *none*         - global   - !volatile & !nontemporal
                                              - generic
                                              - private  1. buffer/global/flat_store
                                              - constant
                                                         - !volatile & nontemporal

                                                         1. buffer/global/flat_store
                                                            slc=1

                                                         - volatile

                                                         1. buffer/global/flat_store
                                                         2. s_waitcnt vscnt(0)

                                                         - Must happen before
                                                           any following volatile
                                                           global/generic
                                                           load/store.
                                                         - Ensures that
                                                           volatile
                                                           operations to
                                                           different
                                                           addresses will not
                                                           be reordered by
                                                           hardware.

     store        *none*       *none*         - local    1.
ds_store
     **Unordered Atomic**
     ------------------------------------------------------------------------------------
     load atomic  unordered    *any*          *any*      *Same as non-atomic*.
     store atomic unordered    *any*          *any*      *Same as non-atomic*.
     atomicrmw    unordered    *any*          *any*      *Same as monotonic atomic*.
     **Monotonic Atomic**
     ------------------------------------------------------------------------------------
     load atomic  monotonic    - singlethread - global   1. buffer/global/flat_load
                               - wavefront    - generic
     load atomic  monotonic    - workgroup    - global   1. buffer/global/flat_load
                                              - generic     glc=1

                                                           - If CU wavefront execution
                                                             mode, omit glc=1.

     load atomic  monotonic    - singlethread - local    1. ds_load
                               - wavefront
                               - workgroup
     load atomic  monotonic    - agent        - global   1. buffer/global/flat_load
                               - system       - generic     glc=1 dlc=1
     store atomic monotonic    - singlethread - global   1. buffer/global/flat_store
                               - wavefront    - generic
                               - workgroup
                               - agent
                               - system
     store atomic monotonic    - singlethread - local    1. ds_store
                               - wavefront
                               - workgroup
     atomicrmw    monotonic    - singlethread - global   1. buffer/global/flat_atomic
                               - wavefront    - generic
                               - workgroup
                               - agent
                               - system
     atomicrmw    monotonic    - singlethread - local    1. ds_atomic
                               - wavefront
                               - workgroup
     **Acquire Atomic**
     ------------------------------------------------------------------------------------
     load atomic  acquire      - singlethread - global   1. buffer/global/ds/flat_load
                               - wavefront    - local
                                              - generic
     load atomic  acquire      - workgroup    - global   1. buffer/global_load glc=1

                                                           - If CU wavefront execution
                                                             mode, omit glc=1.

                                                         2. s_waitcnt vmcnt(0)

                                                           - If CU wavefront execution
                                                             mode, omit.
                                                           - Must happen before
                                                             the following buffer_gl0_inv
                                                             and before any following
                                                             global/generic
                                                             load/load
                                                             atomic/store/store
                                                             atomic/atomicrmw.

                                                         3. buffer_gl0_inv

                                                           - If CU wavefront execution
                                                             mode, omit.
                                                           - Ensures that
                                                             following
                                                             loads will not see
                                                             stale data.

     load atomic  acquire      - workgroup    - local    1. ds_load
                                                         2. s_waitcnt lgkmcnt(0)

                                                           - If OpenCL, omit.
                                                           - Must happen before
                                                             the following buffer_gl0_inv
                                                             and before any following
                                                             global/generic load/load
                                                             atomic/store/store
                                                             atomic/atomicrmw.
                                                           - Ensures any
                                                             following global
                                                             data read is no
                                                             older than the local load
                                                             atomic value being
                                                             acquired.

                                                         3. buffer_gl0_inv

                                                           - If CU wavefront execution
                                                             mode, omit.
                                                           - If OpenCL, omit.
                                                           - Ensures that
                                                             following
                                                             loads will not see
                                                             stale data.

     load atomic  acquire      - workgroup    - generic  1. flat_load glc=1

                                                           - If CU wavefront execution
                                                             mode, omit glc=1.

                                                         2. s_waitcnt lgkmcnt(0) &
                                                            vmcnt(0)

                                                           - If CU wavefront execution
                                                             mode, omit vmcnt(0).
                                                           - If OpenCL, omit
                                                             lgkmcnt(0).
                                                           - Must happen before
                                                             the following
                                                             buffer_gl0_inv and any
                                                             following global/generic
                                                             load/load
                                                             atomic/store/store
                                                             atomic/atomicrmw.
                                                           - Ensures any
                                                             following global
                                                             data read is no
                                                             older than a local load
                                                             atomic value being
                                                             acquired.

                                                         3. buffer_gl0_inv

                                                           - If CU wavefront execution
                                                             mode, omit.
                                                           - Ensures that
                                                             following
                                                             loads will not see
                                                             stale data.

     load atomic  acquire      - agent        - global   1. buffer/global_load
                               - system                     glc=1 dlc=1
                                                         2. s_waitcnt vmcnt(0)

                                                           - Must happen before
                                                             following
                                                             buffer_gl*_inv.
                                                           - Ensures the load
                                                             has completed
                                                             before invalidating
                                                             the caches.

                                                         3. buffer_gl0_inv;
                                                            buffer_gl1_inv

                                                           - Must happen before
                                                             any following
                                                             global/generic
                                                             load/load
                                                             atomic/atomicrmw.
                                                           - Ensures that
                                                             following
                                                             loads will not see
                                                             stale global data.

     load atomic  acquire      - agent        - generic  1. flat_load glc=1 dlc=1
                               - system                  2. s_waitcnt vmcnt(0) &
                                                            lgkmcnt(0)

                                                           - If OpenCL, omit
                                                             lgkmcnt(0).
                                                           - Must happen before
                                                             following
                                                             buffer_gl*_inv.
                                                           - Ensures the flat_load
                                                             has completed
                                                             before invalidating
                                                             the caches.

                                                         3. buffer_gl0_inv;
                                                            buffer_gl1_inv

                                                           - Must happen before
                                                             any following
                                                             global/generic
                                                             load/load
                                                             atomic/atomicrmw.
                                                           - Ensures that
                                                             following loads
                                                             will not see stale
                                                             global data.

     atomicrmw    acquire      - singlethread - global   1. buffer/global/ds/flat_atomic
                               - wavefront    - local
                                              - generic
     atomicrmw    acquire      - workgroup    - global   1. buffer/global_atomic
                                                         2. s_waitcnt vm/vscnt(0)

                                                           - If CU wavefront execution
                                                             mode, omit.
                                                           - Use vmcnt(0) if atomic with
                                                             return and vscnt(0) if
                                                             atomic with no-return.
                                                           - Must happen before
                                                             the following buffer_gl0_inv
                                                             and before any following
                                                             global/generic
                                                             load/load
                                                             atomic/store/store
                                                             atomic/atomicrmw.

                                                         3. buffer_gl0_inv

                                                           - If CU wavefront execution
                                                             mode, omit.
                                                           - Ensures that
                                                             following
                                                             loads will not see
                                                             stale data.

     atomicrmw    acquire      - workgroup    - local    1. ds_atomic
                                                         2. s_waitcnt lgkmcnt(0)

                                                           - If OpenCL, omit.
                                                           - Must happen before
                                                             the following
                                                             buffer_gl0_inv.
                                                           - Ensures any
                                                             following global
                                                             data read is no
                                                             older than the local
                                                             atomicrmw value
                                                             being acquired.

                                                         3. buffer_gl0_inv

                                                           - If OpenCL, omit.
                                                           - Ensures that
                                                             following
                                                             loads will not see
                                                             stale data.

     atomicrmw    acquire      - workgroup    - generic  1. flat_atomic
                                                         2. s_waitcnt lgkmcnt(0) &
                                                            vm/vscnt(0)

                                                           - If CU wavefront execution
                                                             mode, omit vm/vscnt(0).
                                                           - If OpenCL, omit lgkmcnt(0).
                                                           - Use vmcnt(0) if atomic with
                                                             return and vscnt(0) if
                                                             atomic with no-return.
                                                           - Must happen before
                                                             the following
                                                             buffer_gl0_inv.
                                                           - Ensures any
                                                             following global
                                                             data read is no
                                                             older than a local
                                                             atomicrmw value
                                                             being acquired.

                                                         3. buffer_gl0_inv

                                                           - If CU wavefront execution
                                                             mode, omit.
                                                           - Ensures that
                                                             following
                                                             loads will not see
                                                             stale data.

     atomicrmw    acquire      - agent        - global   1. buffer/global_atomic
                               - system                  2. s_waitcnt vm/vscnt(0)

                                                           - Use vmcnt(0) if atomic with
                                                             return and vscnt(0) if
                                                             atomic with no-return.
                                                           - Must happen before
                                                             following
                                                             buffer_gl*_inv.
                                                           - Ensures the
                                                             atomicrmw has
                                                             completed before
                                                             invalidating the
                                                             caches.

                                                         3. buffer_gl0_inv;
                                                            buffer_gl1_inv

                                                           - Must happen before
                                                             any following
                                                             global/generic
                                                             load/load
                                                             atomic/atomicrmw.
                                                           - Ensures that
                                                             following loads
                                                             will not see stale
                                                             global data.

     atomicrmw    acquire      - agent        - generic  1. flat_atomic
                               - system                  2. s_waitcnt vm/vscnt(0) &
                                                            lgkmcnt(0)

                                                           - If OpenCL, omit
                                                             lgkmcnt(0).
                                                           - Use vmcnt(0) if atomic with
                                                             return and vscnt(0) if
                                                             atomic with no-return.
                                                           - Must happen before
                                                             following
                                                             buffer_gl*_inv.
                                                           - Ensures the
                                                             atomicrmw has
                                                             completed before
                                                             invalidating the
                                                             caches.

                                                         3. buffer_gl0_inv;
                                                            buffer_gl1_inv

                                                           - Must happen before
                                                             any following
                                                             global/generic
                                                             load/load
                                                             atomic/atomicrmw.
                                                           - Ensures that
                                                             following loads
                                                             will not see stale
                                                             global data.

     fence        acquire      - singlethread *none*     *none*
                               - wavefront
     fence        acquire      - workgroup    *none*     1. s_waitcnt lgkmcnt(0) &
                                                            vmcnt(0) & vscnt(0)

                                                           - If CU wavefront execution
                                                             mode, omit vmcnt(0) and
                                                             vscnt(0).
                                                           - If OpenCL and
                                                             address space is
                                                             not generic, omit
                                                             lgkmcnt(0).
                                                           - If OpenCL and
                                                             address space is
                                                             local, omit
                                                             vmcnt(0) and vscnt(0).
                                                           - However, since LLVM
                                                             currently has no
                                                             address space on
                                                             the fence, it must
                                                             conservatively
                                                             always be generated. If
                                                             the fence had an
                                                             address space then
                                                             set it to the address
                                                             space of the OpenCL
                                                             fence flag, or to
                                                             generic if both
                                                             local and global
                                                             flags are
                                                             specified.
                                                           - Could be split into
                                                             separate s_waitcnt
                                                             vmcnt(0), s_waitcnt
                                                             vscnt(0) and s_waitcnt
                                                             lgkmcnt(0) to allow
                                                             them to be
                                                             independently moved
                                                             according to the
                                                             following rules.
                                                           - s_waitcnt vmcnt(0)
                                                             must happen after
                                                             any preceding
                                                             global/generic load
                                                             atomic/
                                                             atomicrmw-with-return-value
                                                             with an equal or
                                                             wider sync scope
                                                             and memory ordering
                                                             stronger than
                                                             unordered (this is
                                                             termed the
                                                             fence-paired-atomic).
                                                           - s_waitcnt vscnt(0)
                                                             must happen after
                                                             any preceding
                                                             global/generic
                                                             atomicrmw-no-return-value
                                                             with an equal or
                                                             wider sync scope
                                                             and memory ordering
                                                             stronger than
                                                             unordered (this is
                                                             termed the
                                                             fence-paired-atomic).
                                                           - s_waitcnt lgkmcnt(0)
                                                             must happen after
                                                             any preceding
                                                             local/generic load
                                                             atomic/atomicrmw
                                                             with an equal or
                                                             wider sync scope
                                                             and memory ordering
                                                             stronger than
                                                             unordered (this is
                                                             termed the
                                                             fence-paired-atomic).
                                                           - Must happen before
                                                             the following
                                                             buffer_gl0_inv.
                                                           - Ensures that the
                                                             fence-paired atomic
                                                             has completed
                                                             before invalidating
                                                             the
                                                             cache. Therefore
                                                             any following
                                                             locations read must
                                                             be no older than
                                                             the value read by
                                                             the
                                                             fence-paired-atomic.

                                                         2. buffer_gl0_inv

                                                           - If CU wavefront execution
                                                             mode, omit.
                                                           - Ensures that
                                                             following
                                                             loads will not see
                                                             stale data.

     fence        acquire      - agent        *none*     1. s_waitcnt lgkmcnt(0) &
                               - system                     vmcnt(0) & vscnt(0)

                                                           - If OpenCL and
                                                             address space is
                                                             not generic, omit
                                                             lgkmcnt(0).
                                                           - If OpenCL and
                                                             address space is
                                                             local, omit
                                                             vmcnt(0) and vscnt(0).
                                                           - However, since LLVM
                                                             currently has no
                                                             address space on
                                                             the fence, it must
                                                             conservatively
                                                             always be generated
                                                             (see comment for
                                                             previous fence).
                                                           - Could be split into
                                                             separate s_waitcnt
                                                             vmcnt(0), s_waitcnt
                                                             vscnt(0) and s_waitcnt
                                                             lgkmcnt(0) to allow
                                                             them to be
                                                             independently moved
                                                             according to the
                                                             following rules.
                                                           - s_waitcnt vmcnt(0)
                                                             must happen after
                                                             any preceding
                                                             global/generic load
                                                             atomic/
                                                             atomicrmw-with-return-value
                                                             with an equal or
                                                             wider sync scope
                                                             and memory ordering
                                                             stronger than
                                                             unordered (this is
                                                             termed the
                                                             fence-paired-atomic).
                                                           - s_waitcnt vscnt(0)
                                                             must happen after
                                                             any preceding
                                                             global/generic
                                                             atomicrmw-no-return-value
                                                             with an equal or
                                                             wider sync scope
                                                             and memory ordering
                                                             stronger than
                                                             unordered (this is
                                                             termed the
                                                             fence-paired-atomic).
                                                           - s_waitcnt lgkmcnt(0)
                                                             must happen after
                                                             any preceding
                                                             local/generic load
                                                             atomic/atomicrmw
                                                             with an equal or
                                                             wider sync scope
                                                             and memory ordering
                                                             stronger than
                                                             unordered (this is
                                                             termed the
                                                             fence-paired-atomic).
                                                           - Must happen before
                                                             the following
                                                             buffer_gl*_inv.
                                                           - Ensures that the
                                                             fence-paired atomic
                                                             has completed
                                                             before invalidating
                                                             the
                                                             caches. Therefore
                                                             any following
                                                             locations read must
                                                             be no older than
                                                             the value read by
                                                             the
                                                             fence-paired-atomic.

                                                         2. buffer_gl0_inv;
                                                            buffer_gl1_inv

                                                           - Must happen before any
                                                             following global/generic
                                                             load/load
                                                             atomic/store/store
                                                             atomic/atomicrmw.
                                                           - Ensures that
                                                             following loads
                                                             will not see stale
                                                             global data.

     **Release Atomic**
     ------------------------------------------------------------------------------------
     store atomic release      - singlethread - global   1. buffer/global/ds/flat_store
                               - wavefront    - local
                                              - generic
     store atomic release      - workgroup    - global   1. s_waitcnt lgkmcnt(0) &
                                              - generic     vmcnt(0) & vscnt(0)

                                                           - If CU wavefront execution
                                                             mode, omit vmcnt(0) and
                                                             vscnt(0).
                                                           - If OpenCL, omit
                                                             lgkmcnt(0).
                                                           - Could be split into
                                                             separate s_waitcnt
                                                             vmcnt(0), s_waitcnt
                                                             vscnt(0) and s_waitcnt
                                                             lgkmcnt(0) to allow
                                                             them to be
                                                             independently moved
                                                             according to the
                                                             following rules.
                                                           - s_waitcnt vmcnt(0)
                                                             must happen after
                                                             any preceding
                                                             global/generic load/load
                                                             atomic/
                                                             atomicrmw-with-return-value.
                                                           - s_waitcnt vscnt(0)
                                                             must happen after
                                                             any preceding
                                                             global/generic
                                                             store/store
                                                             atomic/
                                                             atomicrmw-no-return-value.
                                                           - s_waitcnt lgkmcnt(0)
                                                             must happen after
                                                             any preceding
                                                             local/generic
                                                             load/store/load
                                                             atomic/store
                                                             atomic/atomicrmw.
                                                           - Must happen before
                                                             the following
                                                             store.
                                                           - Ensures that all
                                                             memory operations
                                                             have
                                                             completed before
                                                             performing the
                                                             store that is being
                                                             released.

                                                         2. buffer/global/flat_store
     store atomic release      - workgroup    - local    1. s_waitcnt vmcnt(0) & vscnt(0)

                                                           - If CU wavefront execution
                                                             mode, omit.
                                                           - If OpenCL, omit.
                                                           - Could be split into
                                                             separate s_waitcnt
                                                             vmcnt(0) and s_waitcnt
                                                             vscnt(0) to allow
                                                             them to be
                                                             independently moved
                                                             according to the
                                                             following rules.
                                                           - s_waitcnt vmcnt(0)
                                                             must happen after
                                                             any preceding
                                                             global/generic load/load
                                                             atomic/
                                                             atomicrmw-with-return-value.
                                                           - s_waitcnt vscnt(0)
                                                             must happen after
                                                             any preceding
                                                             global/generic
                                                             store/store atomic/
                                                             atomicrmw-no-return-value.
                                                           - Must happen before
                                                             the following
                                                             store.
                                                           - Ensures that all
                                                             global memory
                                                             operations have
                                                             completed before
                                                             performing the
                                                             store that is being
                                                             released.

                                                         2. ds_store
     store atomic release      - agent        - global   1. s_waitcnt lgkmcnt(0) &
                               - system       - generic     vmcnt(0) & vscnt(0)

                                                           - If OpenCL and
                                                             address space is
                                                             not generic, omit
                                                             lgkmcnt(0).
                                                           - Could be split into
                                                             separate s_waitcnt
                                                             vmcnt(0), s_waitcnt vscnt(0)
                                                             and s_waitcnt
                                                             lgkmcnt(0) to allow
                                                             them to be
                                                             independently moved
                                                             according to the
                                                             following rules.
                                                           - s_waitcnt vmcnt(0)
                                                             must happen after
                                                             any preceding
                                                             global/generic
                                                             load/load
                                                             atomic/
                                                             atomicrmw-with-return-value.
                                                           - s_waitcnt vscnt(0)
                                                             must happen after
                                                             any preceding
                                                             global/generic
                                                             store/store atomic/
                                                             atomicrmw-no-return-value.
                                                           - s_waitcnt lgkmcnt(0)
                                                             must happen after
                                                             any preceding
                                                             local/generic
                                                             load/store/load
                                                             atomic/store
                                                             atomic/atomicrmw.
                                                           - Must happen before
                                                             the following
                                                             store.
                                                           - Ensures that all
                                                             memory operations
                                                             have
                                                             completed before
                                                             performing the
                                                             store that is being
                                                             released.

                                                         2. buffer/global/flat_store
     atomicrmw    release      - singlethread - global   1. buffer/global/ds/flat_atomic
                               - wavefront    - local
                                              - generic
     atomicrmw    release      - workgroup    - global   1. s_waitcnt lgkmcnt(0) &
                                              - generic     vmcnt(0) & vscnt(0)

                                                           - If CU wavefront execution
                                                             mode, omit vmcnt(0) and
                                                             vscnt(0).
                                                           - If OpenCL, omit lgkmcnt(0).
                                                           - Could be split into
                                                             separate s_waitcnt
                                                             vmcnt(0), s_waitcnt
                                                             vscnt(0) and s_waitcnt
                                                             lgkmcnt(0) to allow
                                                             them to be
                                                             independently moved
                                                             according to the
                                                             following rules.
                                                           - s_waitcnt vmcnt(0)
                                                             must happen after
                                                             any preceding
                                                             global/generic load/load
                                                             atomic/
                                                             atomicrmw-with-return-value.
                                                           - s_waitcnt vscnt(0)
                                                             must happen after
                                                             any preceding
                                                             global/generic
                                                             store/store
                                                             atomic/
                                                             atomicrmw-no-return-value.
                                                           - s_waitcnt lgkmcnt(0)
                                                             must happen after
                                                             any preceding
                                                             local/generic
                                                             load/store/load
                                                             atomic/store
                                                             atomic/atomicrmw.
                                                           - Must happen before
                                                             the following
                                                             atomicrmw.
                                                           - Ensures that all
                                                             memory operations
                                                             have
                                                             completed before
                                                             performing the
                                                             atomicrmw that is
                                                             being released.

                                                         2. buffer/global/flat_atomic
     atomicrmw    release      - workgroup    - local    1. s_waitcnt vmcnt(0) & vscnt(0)

                                                           - If CU wavefront execution
                                                             mode, omit.
                                                           - If OpenCL, omit.
                                                           - Could be split into
                                                             separate s_waitcnt
                                                             vmcnt(0) and s_waitcnt
                                                             vscnt(0) to allow
                                                             them to be
                                                             independently moved
                                                             according to the
                                                             following rules.
                                                           - s_waitcnt vmcnt(0)
                                                             must happen after
                                                             any preceding
                                                             global/generic load/load
                                                             atomic/
                                                             atomicrmw-with-return-value.
                                                           - s_waitcnt vscnt(0)
                                                             must happen after
                                                             any preceding
                                                             global/generic
                                                             store/store atomic/
                                                             atomicrmw-no-return-value.
                                                           - Must happen before
                                                             the following
                                                             store.
                                                           - Ensures that all
                                                             global memory
                                                             operations have
                                                             completed before
                                                             performing the
                                                             store that is being
                                                             released.

                                                         2. ds_atomic
     atomicrmw    release      - agent        - global   1. s_waitcnt lgkmcnt(0) &
                               - system       - generic     vmcnt(0) & vscnt(0)

                                                           - If OpenCL, omit
                                                             lgkmcnt(0).
                                                           - Could be split into
                                                             separate s_waitcnt
                                                             vmcnt(0), s_waitcnt
                                                             vscnt(0) and s_waitcnt
                                                             lgkmcnt(0) to allow
                                                             them to be
                                                             independently moved
                                                             according to the
                                                             following rules.
                                                           - s_waitcnt vmcnt(0)
                                                             must happen after
                                                             any preceding
                                                             global/generic
                                                             load/load atomic/
                                                             atomicrmw-with-return-value.
                                                           - s_waitcnt vscnt(0)
                                                             must happen after
                                                             any preceding
                                                             global/generic
                                                             store/store atomic/
                                                             atomicrmw-no-return-value.
                                                           - s_waitcnt lgkmcnt(0)
                                                             must happen after
                                                             any preceding
                                                             local/generic
                                                             load/store/load
                                                             atomic/store
                                                             atomic/atomicrmw.
                                                           - Must happen before
                                                             the following
                                                             atomicrmw.
                                                           - Ensures that all
                                                             memory operations
                                                             to global and local
                                                             have completed
                                                             before performing
                                                             the atomicrmw that
                                                             is being released.

                                                         2. buffer/global/flat_atomic
     fence        release      - singlethread *none*     *none*
                               - wavefront
     fence        release      - workgroup    *none*     1. s_waitcnt lgkmcnt(0) &
                                                            vmcnt(0) & vscnt(0)

                                                           - If CU wavefront execution
                                                             mode, omit vmcnt(0) and
                                                             vscnt(0).
                                                           - If OpenCL and
                                                             address space is
                                                             not generic, omit
                                                             lgkmcnt(0).
                                                           - If OpenCL and
                                                             address space is
                                                             local, omit
                                                             vmcnt(0) and vscnt(0).
                                                           - However, since LLVM
                                                             currently has no
                                                             address space on
                                                             the fence, it must
                                                             conservatively
                                                             always be generated. If
                                                             the fence had an
                                                             address space then
                                                             set it to the address
                                                             space of the OpenCL
                                                             fence flag, or to
                                                             generic if both
                                                             local and global
                                                             flags are
                                                             specified.
                                                           - Could be split into
                                                             separate s_waitcnt
                                                             vmcnt(0), s_waitcnt
                                                             vscnt(0) and s_waitcnt
                                                             lgkmcnt(0) to allow
                                                             them to be
                                                             independently moved
                                                             according to the
                                                             following rules.
                                                           - s_waitcnt vmcnt(0)
                                                             must happen after
                                                             any preceding
                                                             global/generic
                                                             load/load
                                                             atomic/
                                                             atomicrmw-with-return-value.
                                                           - s_waitcnt vscnt(0)
                                                             must happen after
                                                             any preceding
                                                             global/generic
                                                             store/store atomic/
                                                             atomicrmw-no-return-value.
                                                           - s_waitcnt lgkmcnt(0)
                                                             must happen after
                                                             any preceding
                                                             local/generic
                                                             load/store/load
                                                             atomic/store atomic/
                                                             atomicrmw.
                                                           - Must happen before
                                                             any following store
                                                             atomic/atomicrmw
                                                             with an equal or
                                                             wider sync scope
                                                             and memory ordering
                                                             stronger than
                                                             unordered (this is
                                                             termed the
                                                             fence-paired-atomic).
                                                           - Ensures that all
                                                             memory operations
                                                             have
                                                             completed before
                                                             performing the
                                                             following
                                                             fence-paired-atomic.

     fence        release      - agent        *none*     1. s_waitcnt lgkmcnt(0) &
                               - system                     vmcnt(0) & vscnt(0)

                                                           - If OpenCL and
                                                             address space is
                                                             not generic, omit
                                                             lgkmcnt(0).
                                                           - If OpenCL and
                                                             address space is
                                                             local, omit
                                                             vmcnt(0) and vscnt(0).
                                                           - However, since LLVM
                                                             currently has no
                                                             address space on
                                                             the fence, it must
                                                             conservatively
                                                             always be generated. If
                                                             the fence had an
                                                             address space then
                                                             set it to the address
                                                             space of the OpenCL
                                                             fence flag, or to
                                                             generic if both
                                                             local and global
                                                             flags are
                                                             specified.
                                                           - Could be split into
                                                             separate s_waitcnt
                                                             vmcnt(0), s_waitcnt
                                                             vscnt(0) and s_waitcnt
                                                             lgkmcnt(0) to allow
                                                             them to be
                                                             independently moved
                                                             according to the
                                                             following rules.
                                                           - s_waitcnt vmcnt(0)
                                                             must happen after
                                                             any preceding
                                                             global/generic
                                                             load/load atomic/
                                                             atomicrmw-with-return-value.
                                                           - s_waitcnt vscnt(0)
                                                             must happen after
                                                             any preceding
                                                             global/generic
                                                             store/store atomic/
                                                             atomicrmw-no-return-value.
                                                           - s_waitcnt lgkmcnt(0)
                                                             must happen after
                                                             any preceding
                                                             local/generic
                                                             load/store/load
                                                             atomic/store
                                                             atomic/atomicrmw.
                                                           - Must happen before
                                                             any following store
                                                             atomic/atomicrmw
                                                             with an equal or
                                                             wider sync scope
                                                             and memory ordering
                                                             stronger than
                                                             unordered (this is
                                                             termed the
                                                             fence-paired-atomic).
                                                           - Ensures that all
                                                             memory operations
                                                             have
                                                             completed before
                                                             performing the
                                                             following
                                                             fence-paired-atomic.

     **Acquire-Release Atomic**
     ------------------------------------------------------------------------------------
     atomicrmw    acq_rel      - singlethread - global   1. buffer/global/ds/flat_atomic
                               - wavefront    - local
                                              - generic
     atomicrmw    acq_rel      - workgroup    - global   1. s_waitcnt lgkmcnt(0) &
                                                            vmcnt(0) & vscnt(0)

                                                           - If CU wavefront execution
                                                             mode, omit vmcnt(0) and
                                                             vscnt(0).
                                                           - If OpenCL, omit
                                                             lgkmcnt(0).
                                                           - Must happen after
                                                             any preceding
                                                             local/generic
                                                             load/store/load
                                                             atomic/store
                                                             atomic/atomicrmw.
                                                           - Could be split into
                                                             separate s_waitcnt
                                                             vmcnt(0), s_waitcnt
                                                             vscnt(0), and s_waitcnt
                                                             lgkmcnt(0) to allow
                                                             them to be
                                                             independently moved
                                                             according to the
                                                             following rules.
                                                           - s_waitcnt vmcnt(0)
                                                             must happen after
                                                             any preceding
                                                             global/generic load/load
                                                             atomic/
                                                             atomicrmw-with-return-value.
                                                           - s_waitcnt vscnt(0)
                                                             must happen after
                                                             any preceding
                                                             global/generic
                                                             store/store
                                                             atomic/
                                                             atomicrmw-no-return-value.
                                                           - s_waitcnt lgkmcnt(0)
                                                             must happen after
                                                             any preceding
                                                             local/generic
                                                             load/store/load
                                                             atomic/store
                                                             atomic/atomicrmw.
                                                           - Must happen before
                                                             the following
                                                             atomicrmw.
                                                           - Ensures that all
                                                             memory operations
                                                             have
                                                             completed before
                                                             performing the
                                                             atomicrmw that is
                                                             being released.

                                                         2. buffer/global_atomic
                                                         3. s_waitcnt vm/vscnt(0)

                                                           - If CU wavefront execution
                                                             mode, omit.
                                                           - Use vmcnt(0) if atomic with
                                                             return and vscnt(0) if
                                                             atomic with no-return.
                                                           - Must happen before
                                                             the following
                                                             buffer_gl0_inv.
                                                           - Ensures any
                                                             following global
                                                             data read is no
                                                             older than the
                                                             atomicrmw value
                                                             being acquired.

                                                         4. buffer_gl0_inv

                                                           - If CU wavefront execution
                                                             mode, omit.
                                                           - Ensures that
                                                             following
                                                             loads will not see
                                                             stale data.

     atomicrmw    acq_rel      - workgroup    - local    1. s_waitcnt vmcnt(0) & vscnt(0)

                                                           - If CU wavefront execution
                                                             mode, omit.
                                                           - If OpenCL, omit.
                                                           - Could be split into
                                                             separate s_waitcnt
                                                             vmcnt(0) and s_waitcnt
                                                             vscnt(0) to allow
                                                             them to be
                                                             independently moved
                                                             according to the
                                                             following rules.
                                                           - s_waitcnt vmcnt(0)
                                                             must happen after
                                                             any preceding
                                                             global/generic load/load
                                                             atomic/
                                                             atomicrmw-with-return-value.
                                                           - s_waitcnt vscnt(0)
                                                             must happen after
                                                             any preceding
                                                             global/generic
                                                             store/store atomic/
                                                             atomicrmw-no-return-value.
                                                           - Must happen before
                                                             the following
                                                             store.
                                                           - Ensures that all
                                                             global memory
                                                             operations have
                                                             completed before
                                                             performing the
                                                             store that is being
                                                             released.

                                                         2. ds_atomic
                                                         3. s_waitcnt lgkmcnt(0)

                                                           - If OpenCL, omit.
                                                           - Must happen before
                                                             the following
                                                             buffer_gl0_inv.
                                                           - Ensures any
                                                             following global
                                                             data read is no
                                                             older than the local load
                                                             atomic value being
                                                             acquired.

                                                         4. buffer_gl0_inv

                                                           - If CU wavefront execution
                                                             mode, omit.
                                                           - If OpenCL, omit.
                                                           - Ensures that
                                                             following
                                                             loads will not see
                                                             stale data.

     atomicrmw    acq_rel      - workgroup    - generic  1. s_waitcnt lgkmcnt(0) &
                                                            vmcnt(0) & vscnt(0)

                                                           - If CU wavefront execution
                                                             mode, omit vmcnt(0) and
                                                             vscnt(0).
                                                           - If OpenCL, omit lgkmcnt(0).
                                                           - Could be split into
                                                             separate s_waitcnt
                                                             vmcnt(0), s_waitcnt
                                                             vscnt(0) and s_waitcnt
                                                             lgkmcnt(0) to allow
                                                             them to be
                                                             independently moved
                                                             according to the
                                                             following rules.
                                                           - s_waitcnt vmcnt(0)
                                                             must happen after
                                                             any preceding
                                                             global/generic load/load
                                                             atomic/
                                                             atomicrmw-with-return-value.
                                                           - s_waitcnt vscnt(0)
                                                             must happen after
                                                             any preceding
                                                             global/generic
                                                             store/store
                                                             atomic/
                                                             atomicrmw-no-return-value.
                                                           - s_waitcnt lgkmcnt(0)
                                                             must happen after
                                                             any preceding
                                                             local/generic
                                                             load/store/load
                                                             atomic/store
                                                             atomic/atomicrmw.
                                                           - Must happen before
                                                             the following
                                                             atomicrmw.
                                                           - Ensures that all
                                                             memory operations
                                                             have
                                                             completed before
                                                             performing the
                                                             atomicrmw that is
                                                             being released.

                                                         2. flat_atomic
                                                         3. s_waitcnt lgkmcnt(0) &
                                                            vmcnt(0) & vscnt(0)

                                                           - If CU wavefront execution
                                                             mode, omit vmcnt(0) and
                                                             vscnt(0).
                                                           - If OpenCL, omit lgkmcnt(0).
                                                           - Must happen before
                                                             the following
                                                             buffer_gl0_inv.
                                                           - Ensures any
                                                             following global
                                                             data read is no
                                                             older than the load
                                                             atomic value being
                                                             acquired.

                                                         4. buffer_gl0_inv

                                                           - If CU wavefront execution
                                                             mode, omit.
                                                           - Ensures that
                                                             following
                                                             loads will not see
                                                             stale data.

     atomicrmw    acq_rel      - agent        - global   1. s_waitcnt lgkmcnt(0) &
                               - system                     vmcnt(0) & vscnt(0)

                                                           - If OpenCL, omit
                                                             lgkmcnt(0).
                                                           - Could be split into
                                                             separate s_waitcnt
                                                             vmcnt(0), s_waitcnt
                                                             vscnt(0) and s_waitcnt
                                                             lgkmcnt(0) to allow
                                                             them to be
                                                             independently moved
                                                             according to the
                                                             following rules.
                                                           - s_waitcnt vmcnt(0)
                                                             must happen after
                                                             any preceding
                                                             global/generic
                                                             load/load atomic/
                                                             atomicrmw-with-return-value.
                                                           - s_waitcnt vscnt(0)
                                                             must happen after
                                                             any preceding
                                                             global/generic
                                                             store/store atomic/
                                                             atomicrmw-no-return-value.
                                                           - s_waitcnt lgkmcnt(0)
                                                             must happen after
                                                             any preceding
                                                             local/generic
                                                             load/store/load
                                                             atomic/store
                                                             atomic/atomicrmw.
                                                           - Must happen before
                                                             the following
                                                             atomicrmw.
                                                           - Ensures that all
                                                             memory operations
                                                             to global have
                                                             completed before
                                                             performing the
                                                             atomicrmw that is
                                                             being released.

                                                         2. buffer/global_atomic
                                                         3. s_waitcnt vm/vscnt(0)

                                                           - Use vmcnt(0) if atomic with
                                                             return and vscnt(0) if
                                                             atomic with no-return.
                                                           - Must happen before
                                                             following
                                                             buffer_gl*_inv.
                                                           - Ensures the
                                                             atomicrmw has
                                                             completed before
                                                             invalidating the
                                                             caches.

                                                         4. buffer_gl0_inv;
                                                            buffer_gl1_inv

                                                           - Must happen before
                                                             any following
                                                             global/generic
                                                             load/load
                                                             atomic/atomicrmw.
                                                           - Ensures that
                                                             following loads
                                                             will not see stale
                                                             global data.

     atomicrmw    acq_rel      - agent        - generic  1. s_waitcnt lgkmcnt(0) &
                               - system                     vmcnt(0) & vscnt(0)

                                                           - If OpenCL, omit
                                                             lgkmcnt(0).
                                                           - Could be split into
                                                             separate s_waitcnt
                                                             vmcnt(0), s_waitcnt
                                                             vscnt(0), and s_waitcnt
                                                             lgkmcnt(0) to allow
                                                             them to be
                                                             independently moved
                                                             according to the
                                                             following rules.
                                                           - s_waitcnt vmcnt(0)
                                                             must happen after
                                                             any preceding
                                                             global/generic
                                                             load/load atomic/
                                                             atomicrmw-with-return-value.
                                                           - s_waitcnt vscnt(0)
                                                             must happen after
                                                             any preceding
                                                             global/generic
                                                             store/store atomic/
                                                             atomicrmw-no-return-value.
                                                           - s_waitcnt lgkmcnt(0)
                                                             must happen after
                                                             any preceding
                                                             local/generic
                                                             load/store/load
                                                             atomic/store
                                                             atomic/atomicrmw.
                                                           - Must happen before
                                                             the following
                                                             atomicrmw.
                                                           - Ensures that all
                                                             memory operations
                                                             have
                                                             completed before
                                                             performing the
                                                             atomicrmw that is
                                                             being released.

                                                         2. flat_atomic
                                                         3. s_waitcnt vm/vscnt(0) &
                                                            lgkmcnt(0)

                                                           - If OpenCL, omit
                                                             lgkmcnt(0).
                                                           - Use vmcnt(0) if atomic with
                                                             return and vscnt(0) if
                                                             atomic with no-return.
                                                           - Must happen before
                                                             following
                                                             buffer_gl*_inv.
                                                           - Ensures the
                                                             atomicrmw has
                                                             completed before
                                                             invalidating the
                                                             caches.

                                                         4. buffer_gl0_inv;
                                                            buffer_gl1_inv

                                                           - Must happen before
                                                             any following
                                                             global/generic
                                                             load/load
                                                             atomic/atomicrmw.
                                                           - Ensures that
                                                             following loads
                                                             will not see stale
                                                             global data.

     fence        acq_rel      - singlethread *none*     *none*
                               - wavefront
     fence        acq_rel      - workgroup    *none*     1. s_waitcnt lgkmcnt(0) &
                                                            vmcnt(0) & vscnt(0)

                                                           - If CU wavefront execution
                                                             mode, omit vmcnt(0) and
                                                             vscnt(0).
                                                           - If OpenCL and
                                                             address space is
                                                             not generic, omit
                                                             lgkmcnt(0).
                                                           - If OpenCL and
                                                             address space is
                                                             local, omit
                                                             vmcnt(0) and vscnt(0).
                                                           - However, since LLVM
                                                             currently has no
                                                             address space on
                                                             the fence, it must
                                                             conservatively
                                                             always be generated
                                                             (see comment for
                                                             previous fence).
                                                           - Could be split into
                                                             separate s_waitcnt
                                                             vmcnt(0), s_waitcnt
                                                             vscnt(0) and s_waitcnt
                                                             lgkmcnt(0) to allow
                                                             them to be
                                                             independently moved
                                                             according to the
                                                             following rules.
                                                           - s_waitcnt vmcnt(0)
                                                             must happen after
                                                             any preceding
                                                             global/generic
                                                             load/load
                                                             atomic/
                                                             atomicrmw-with-return-value.
                                                           - s_waitcnt vscnt(0)
                                                             must happen after
                                                             any preceding
                                                             global/generic
                                                             store/store atomic/
                                                             atomicrmw-no-return-value.
                                                           - s_waitcnt lgkmcnt(0)
                                                             must happen after
                                                             any preceding
                                                             local/generic
                                                             load/store/load
                                                             atomic/store atomic/
                                                             atomicrmw.
                                                           - Must happen before
                                                             any following
                                                             global/generic
                                                             load/load
                                                             atomic/store/store
                                                             atomic/atomicrmw.
                                                           - Ensures that all
                                                             memory operations
                                                             have
                                                             completed before
                                                             performing any
                                                             following global
                                                             memory operations.
                                                           - Ensures that the
                                                             preceding
                                                             local/generic load
                                                             atomic/atomicrmw
                                                             with an equal or
                                                             wider sync scope
                                                             and memory ordering
                                                             stronger than
                                                             unordered (this is
                                                             termed the
                                                             acquire-fence-paired-atomic)
                                                             has completed
                                                             before following
                                                             global memory
                                                             operations. This
                                                             satisfies the
                                                             requirements of
                                                             acquire.
                                                           - Ensures that all
                                                             previous memory
                                                             operations have
                                                             completed before a
                                                             following
                                                             local/generic store
                                                             atomic/atomicrmw
                                                             with an equal or
                                                             wider sync scope
                                                             and memory ordering
                                                             stronger than
                                                             unordered (this is
                                                             termed the
                                                             release-fence-paired-atomic).
                                                             This satisfies the
                                                             requirements of
                                                             release.
                                                           - Must happen before
                                                             the following
                                                             buffer_gl0_inv.
                                                           - Ensures that the
                                                             acquire-fence-paired
                                                             atomic has completed
                                                             before invalidating
                                                             the
                                                             cache. Therefore
                                                             any following
                                                             locations read must
                                                             be no older than
                                                             the value read by
                                                             the
                                                             acquire-fence-paired-atomic.

                                                         2. buffer_gl0_inv

                                                           - If CU wavefront execution
                                                             mode, omit.
                                                           - Ensures that
                                                             following
                                                             loads will not see
                                                             stale data.

     fence        acq_rel      - agent        *none*     1. s_waitcnt lgkmcnt(0) &
                               - system                     vmcnt(0) & vscnt(0)

                                                           - If OpenCL and
                                                             address space is
                                                             not generic, omit
                                                             lgkmcnt(0).
                                                           - If OpenCL and
                                                             address space is
                                                             local, omit
                                                             vmcnt(0) and vscnt(0).
                                                           - However, since LLVM
                                                             currently has no
                                                             address space on
                                                             the fence, it must
                                                             conservatively
                                                             always be generated
                                                             (see comment for
                                                             previous fence).
                                                           - Could be split into
                                                             separate s_waitcnt
                                                             vmcnt(0), s_waitcnt
                                                             vscnt(0) and s_waitcnt
                                                             lgkmcnt(0) to allow
                                                             them to be
                                                             independently moved
                                                             according to the
                                                             following rules.
                                                           - s_waitcnt vmcnt(0)
                                                             must happen after
                                                             any preceding
                                                             global/generic
                                                             load/load
                                                             atomic/
                                                             atomicrmw-with-return-value.
                                                           - s_waitcnt vscnt(0)
                                                             must happen after
                                                             any preceding
                                                             global/generic
                                                             store/store atomic/
                                                             atomicrmw-no-return-value.
                                                           - s_waitcnt lgkmcnt(0)
                                                             must happen after
                                                             any preceding
                                                             local/generic
                                                             load/store/load
                                                             atomic/store
                                                             atomic/atomicrmw.
                                                           - Must happen before
                                                             the following
                                                             buffer_gl*_inv.
                                                           - Ensures that the
                                                             preceding
                                                             global/local/generic
                                                             load
                                                             atomic/atomicrmw
                                                             with an equal or
                                                             wider sync scope
                                                             and memory ordering
                                                             stronger than
                                                             unordered (this is
                                                             termed the
                                                             acquire-fence-paired-atomic)
                                                             has completed
                                                             before invalidating
                                                             the caches. This
                                                             satisfies the
                                                             requirements of
                                                             acquire.
                                                           - Ensures that all
                                                             previous memory
                                                             operations have
                                                             completed before a
                                                             following
                                                             global/local/generic
                                                             store
                                                             atomic/atomicrmw
                                                             with an equal or
                                                             wider sync scope
                                                             and memory ordering
                                                             stronger than
                                                             unordered (this is
                                                             termed the
                                                             release-fence-paired-atomic).
                                                             This satisfies the
                                                             requirements of
                                                             release.

                                                         2. buffer_gl0_inv;
                                                            buffer_gl1_inv

                                                           - Must happen before
                                                             any following
                                                             global/generic
                                                             load/load
                                                             atomic/store/store
                                                             atomic/atomicrmw.
                                                           - Ensures that
                                                             following loads
                                                             will not see stale
                                                             global data. This
                                                             satisfies the
                                                             requirements of
                                                             acquire.

     **Sequential Consistent Atomic**
     ------------------------------------------------------------------------------------
     load atomic  seq_cst      - singlethread - global   *Same as corresponding
                               - wavefront    - local    load atomic acquire,
                                              - generic  except must generate
                                                         all instructions even
                                                         for OpenCL.*
     load atomic  seq_cst      - workgroup    - global   1. s_waitcnt lgkmcnt(0) &
                                              - generic     vmcnt(0) & vscnt(0)

                                                           - If CU wavefront execution
                                                             mode, omit vmcnt(0) and
                                                             vscnt(0).
                                                           - Could be split into
                                                             separate s_waitcnt
                                                             vmcnt(0), s_waitcnt
                                                             vscnt(0), and s_waitcnt
                                                             lgkmcnt(0) to allow
                                                             them to be
                                                             independently moved
                                                             according to the
                                                             following rules.
                                                           - s_waitcnt lgkmcnt(0) must
                                                             happen after
                                                             preceding
                                                             local/generic load
                                                             atomic/store
                                                             atomic/atomicrmw
                                                             with memory
                                                             ordering of seq_cst
                                                             and with equal or
                                                             wider sync scope.
                                                             (Note that seq_cst
                                                             fences have their
                                                             own s_waitcnt
                                                             lgkmcnt(0) and so do
                                                             not need to be
                                                             considered.)
                                                           - s_waitcnt vmcnt(0)
                                                             must happen after
                                                             preceding
                                                             global/generic load
                                                             atomic/
                                                             atomicrmw-with-return-value
                                                             with memory
                                                             ordering of seq_cst
                                                             and with equal or
                                                             wider sync scope.
                                                             (Note that seq_cst
                                                             fences have their
                                                             own s_waitcnt
                                                             vmcnt(0) and so do
                                                             not need to be
                                                             considered.)
                                                           - s_waitcnt vscnt(0)
                                                             must happen after
                                                             preceding
                                                             global/generic store
                                                             atomic/
                                                             atomicrmw-no-return-value
                                                             with memory
                                                             ordering of seq_cst
                                                             and with equal or
                                                             wider sync scope.
                                                             (Note that seq_cst
                                                             fences have their
                                                             own s_waitcnt
                                                             vscnt(0) and so do
                                                             not need to be
                                                             considered.)
                                                           - Ensures any
                                                             preceding
                                                             sequential
                                                             consistent global/local
                                                             memory instructions
                                                             have completed
                                                             before executing
                                                             this sequentially
                                                             consistent
                                                             instruction. This
                                                             prevents reordering
                                                             a seq_cst store
                                                             followed by a
                                                             seq_cst load. (Note
                                                             that seq_cst is
                                                             stronger than
                                                             acquire/release as
                                                             the reordering of
                                                             load acquire
                                                             followed by a store
                                                             release is
                                                             prevented by the
                                                             s_waitcnt of
                                                             the release, but
                                                             there is nothing
                                                             preventing a store
                                                             release followed by
                                                             load acquire from
                                                             completing out of
                                                             order. The s_waitcnt
                                                             could be placed after
                                                             seq_store or before
                                                             the seq_load. We
                                                             choose the load to
                                                             make the s_waitcnt be
                                                             as late as possible
                                                             so that the store
                                                             may have already
                                                             completed.)

                                                         2. *Following
                                                            instructions same as
                                                            corresponding load
                                                            atomic acquire,
                                                            except must generate
                                                            all instructions even
                                                            for OpenCL.*
     load atomic  seq_cst      - workgroup    - local    1. s_waitcnt vmcnt(0) & vscnt(0)

                                                           - If CU wavefront execution
                                                             mode, omit.
                                                           - Could be split into
                                                             separate s_waitcnt
                                                             vmcnt(0) and s_waitcnt
                                                             vscnt(0) to allow
                                                             them to be
                                                             independently moved
                                                             according to the
                                                             following rules.
                                                           - s_waitcnt vmcnt(0)
                                                             must happen after
                                                             preceding
                                                             global/generic load
                                                             atomic/
                                                             atomicrmw-with-return-value
                                                             with memory
                                                             ordering of seq_cst
                                                             and with equal or
                                                             wider sync scope.
                                                             (Note that seq_cst
                                                             fences have their
                                                             own s_waitcnt
                                                             vmcnt(0) and so do
                                                             not need to be
                                                             considered.)
                                                           - s_waitcnt vscnt(0)
                                                             must happen after
                                                             preceding
                                                             global/generic store
                                                             atomic/
                                                             atomicrmw-no-return-value
                                                             with memory
                                                             ordering of seq_cst
                                                             and with equal or
                                                             wider sync scope.
                                                             (Note that seq_cst
                                                             fences have their
                                                             own s_waitcnt
                                                             vscnt(0) and so do
                                                             not need to be
                                                             considered.)
                                                           - Ensures any
                                                             preceding
                                                             sequential
                                                             consistent global
                                                             memory instructions
                                                             have completed
                                                             before executing
                                                             this sequentially
                                                             consistent
                                                             instruction. This
                                                             prevents reordering
                                                             a seq_cst store
                                                             followed by a
                                                             seq_cst load. (Note
                                                             that seq_cst is
                                                             stronger than
                                                             acquire/release as
                                                             the reordering of
                                                             load acquire
                                                             followed by a store
                                                             release is
                                                             prevented by the
                                                             s_waitcnt of
                                                             the release, but
                                                             there is nothing
                                                             preventing a store
                                                             release followed by
                                                             load acquire from
                                                             completing out of
                                                             order. The s_waitcnt
                                                             could be placed after
                                                             seq_store or before
                                                             the seq_load. We
                                                             choose the load to
                                                             make the s_waitcnt be
                                                             as late as possible
                                                             so that the store
                                                             may have already
                                                             completed.)

                                                         2. *Following
                                                            instructions same as
                                                            corresponding load
                                                            atomic acquire,
                                                            except must generate
                                                            all instructions even
                                                            for OpenCL.*

     load atomic  seq_cst      - agent        - global   1.
s_waitcnt lgkmcnt(0) & 10367 - system - generic vmcnt(0) & vscnt(0) 10368 10369 - Could be split into 10370 separate s_waitcnt 10371 vmcnt(0), s_waitcnt 10372 vscnt(0) and s_waitcnt 10373 lgkmcnt(0) to allow 10374 them to be 10375 independently moved 10376 according to the 10377 following rules. 10378 - s_waitcnt lgkmcnt(0) 10379 must happen after 10380 preceding 10381 local load 10382 atomic/store 10383 atomic/atomicrmw 10384 with memory 10385 ordering of seq_cst 10386 and with equal or 10387 wider sync scope. 10388 (Note that seq_cst 10389 fences have their 10390 own s_waitcnt 10391 lgkmcnt(0) and so do 10392 not need to be 10393 considered.) 10394 - s_waitcnt vmcnt(0) 10395 must happen after 10396 preceding 10397 global/generic load 10398 atomic/ 10399 atomicrmw-with-return-value 10400 with memory 10401 ordering of seq_cst 10402 and with equal or 10403 wider sync scope. 10404 (Note that seq_cst 10405 fences have their 10406 own s_waitcnt 10407 vmcnt(0) and so do 10408 not need to be 10409 considered.) 10410 - s_waitcnt vscnt(0) 10411 Must happen after 10412 preceding 10413 global/generic store 10414 atomic/ 10415 atomicrmw-no-return-value 10416 with memory 10417 ordering of seq_cst 10418 and with equal or 10419 wider sync scope. 10420 (Note that seq_cst 10421 fences have their 10422 own s_waitcnt 10423 vscnt(0) and so do 10424 not need to be 10425 considered.) 10426 - Ensures any 10427 preceding 10428 sequential 10429 consistent global 10430 memory instructions 10431 have completed 10432 before executing 10433 this sequentially 10434 consistent 10435 instruction. This 10436 prevents reordering 10437 a seq_cst store 10438 followed by a 10439 seq_cst load. 
(Note 10440 that seq_cst is 10441 stronger than 10442 acquire/release as 10443 the reordering of 10444 load acquire 10445 followed by a store 10446 release is 10447 prevented by the 10448 s_waitcnt of 10449 the release, but 10450 there is nothing 10451 preventing a store 10452 release followed by 10453 load acquire from 10454 completing out of 10455 order. The s_waitcnt 10456 could be placed after 10457 seq_store or before 10458 the seq_load. We 10459 choose the load to 10460 make the s_waitcnt be 10461 as late as possible 10462 so that the store 10463 may have already 10464 completed.) 10465 10466 2. *Following 10467 instructions same as 10468 corresponding load 10469 atomic acquire, 10470 except must generated 10471 all instructions even 10472 for OpenCL.* 10473 store atomic seq_cst - singlethread - global *Same as corresponding 10474 - wavefront - local store atomic release, 10475 - workgroup - generic except must generated 10476 - agent all instructions even 10477 - system for OpenCL.* 10478 atomicrmw seq_cst - singlethread - global *Same as corresponding 10479 - wavefront - local atomicrmw acq_rel, 10480 - workgroup - generic except must generated 10481 - agent all instructions even 10482 - system for OpenCL.* 10483 fence seq_cst - singlethread *none* *Same as corresponding 10484 - wavefront fence acq_rel, 10485 - workgroup except must generated 10486 - agent all instructions even 10487 - system for OpenCL.* 10488 ============ ============ ============== ========== ================================ 10489 10490Trap Handler ABI 10491~~~~~~~~~~~~~~~~ 10492 10493For code objects generated by the AMDGPU backend for HSA [HSA]_ compatible 10494runtimes (see :ref:`amdgpu-os`), the runtime installs a trap handler that 10495supports the ``s_trap`` instruction. For usage see: 10496 10497- :ref:`amdgpu-trap-handler-for-amdhsa-os-v2-table` 10498- :ref:`amdgpu-trap-handler-for-amdhsa-os-v3-table` 10499- :ref:`amdgpu-trap-handler-for-amdhsa-os-v4-table` 10500 10501 .. 
  .. table:: AMDGPU Trap Handler for AMDHSA OS Code Object V2
     :name: amdgpu-trap-handler-for-amdhsa-os-v2-table

     =================== =============== =============== =======================================
     Usage               Code Sequence   Trap Handler    Description
                                         Inputs
     =================== =============== =============== =======================================
     reserved            ``s_trap 0x00``                 Reserved by hardware.
     ``debugtrap(arg)``  ``s_trap 0x01`` ``SGPR0-1``:    Reserved for Finalizer HSA ``debugtrap``
                                         ``queue_ptr``   intrinsic (not implemented).
                                         ``VGPR0``:
                                         ``arg``
     ``llvm.trap``       ``s_trap 0x02`` ``SGPR0-1``:    Causes wave to be halted with the PC at
                                         ``queue_ptr``   the trap instruction. The associated
                                                         queue is signalled to put it into the
                                                         error state. When the queue is put in
                                                         the error state, the waves executing
                                                         dispatches on the queue will be
                                                         terminated.
     ``llvm.debugtrap``  ``s_trap 0x03`` *none*          - If debugger not enabled then behaves
                                                           as a no-operation. The trap handler
                                                           is entered and immediately returns to
                                                           continue execution of the wavefront.
                                                         - If the debugger is enabled, causes
                                                           the debug trap to be reported by the
                                                           debugger and the wavefront is put in
                                                           the halt state with the PC at the
                                                           instruction. The debugger must
                                                           increment the PC and resume the wave.
     reserved            ``s_trap 0x04``                 Reserved.
     reserved            ``s_trap 0x05``                 Reserved.
     reserved            ``s_trap 0x06``                 Reserved.
     reserved            ``s_trap 0x07``                 Reserved.
     reserved            ``s_trap 0x08``                 Reserved.
     reserved            ``s_trap 0xfe``                 Reserved.
     reserved            ``s_trap 0xff``                 Reserved.
     =================== =============== =============== =======================================

..

  .. table:: AMDGPU Trap Handler for AMDHSA OS Code Object V3
     :name: amdgpu-trap-handler-for-amdhsa-os-v3-table

     =================== =============== =============== =======================================
     Usage               Code Sequence   Trap Handler    Description
                                         Inputs
     =================== =============== =============== =======================================
     reserved            ``s_trap 0x00``                 Reserved by hardware.
     debugger breakpoint ``s_trap 0x01`` *none*          Reserved for debugger to use for
                                                         breakpoints. Causes wave to be halted
                                                         with the PC at the trap instruction.
                                                         The debugger is responsible to resume
                                                         the wave, including the instruction
                                                         that the breakpoint overwrote.
     ``llvm.trap``       ``s_trap 0x02`` ``SGPR0-1``:    Causes wave to be halted with the PC at
                                         ``queue_ptr``   the trap instruction. The associated
                                                         queue is signalled to put it into the
                                                         error state. When the queue is put in
                                                         the error state, the waves executing
                                                         dispatches on the queue will be
                                                         terminated.
     ``llvm.debugtrap``  ``s_trap 0x03`` *none*          - If debugger not enabled then behaves
                                                           as a no-operation. The trap handler
                                                           is entered and immediately returns to
                                                           continue execution of the wavefront.
                                                         - If the debugger is enabled, causes
                                                           the debug trap to be reported by the
                                                           debugger and the wavefront is put in
                                                           the halt state with the PC at the
                                                           instruction. The debugger must
                                                           increment the PC and resume the wave.
     reserved            ``s_trap 0x04``                 Reserved.
     reserved            ``s_trap 0x05``                 Reserved.
     reserved            ``s_trap 0x06``                 Reserved.
     reserved            ``s_trap 0x07``                 Reserved.
     reserved            ``s_trap 0x08``                 Reserved.
     reserved            ``s_trap 0xfe``                 Reserved.
     reserved            ``s_trap 0xff``                 Reserved.
     =================== =============== =============== =======================================

..

  .. table:: AMDGPU Trap Handler for AMDHSA OS Code Object V4
     :name: amdgpu-trap-handler-for-amdhsa-os-v4-table

     =================== =============== ================ ================= =======================================
     Usage               Code Sequence   GFX6-GFX8 Inputs GFX9-GFX10 Inputs Description
     =================== =============== ================ ================= =======================================
     reserved            ``s_trap 0x00``                                    Reserved by hardware.
     debugger breakpoint ``s_trap 0x01`` *none*           *none*            Reserved for debugger to use for
                                                                            breakpoints. Causes wave to be halted
                                                                            with the PC at the trap instruction.
                                                                            The debugger is responsible to resume
                                                                            the wave, including the instruction
                                                                            that the breakpoint overwrote.
     ``llvm.trap``       ``s_trap 0x02`` ``SGPR0-1``:     *none*            Causes wave to be halted with the PC at
                                         ``queue_ptr``                      the trap instruction. The associated
                                                                            queue is signalled to put it into the
                                                                            error state. When the queue is put in
                                                                            the error state, the waves executing
                                                                            dispatches on the queue will be
                                                                            terminated.
     ``llvm.debugtrap``  ``s_trap 0x03`` *none*           *none*            - If debugger not enabled then behaves
                                                                              as a no-operation. The trap handler
                                                                              is entered and immediately returns to
                                                                              continue execution of the wavefront.
                                                                            - If the debugger is enabled, causes
                                                                              the debug trap to be reported by the
                                                                              debugger and the wavefront is put in
                                                                              the halt state with the PC at the
                                                                              instruction. The debugger must
                                                                              increment the PC and resume the wave.
     reserved            ``s_trap 0x04``                                    Reserved.
     reserved            ``s_trap 0x05``                                    Reserved.
     reserved            ``s_trap 0x06``                                    Reserved.
     reserved            ``s_trap 0x07``                                    Reserved.
     reserved            ``s_trap 0x08``                                    Reserved.
     reserved            ``s_trap 0xfe``                                    Reserved.
     reserved            ``s_trap 0xff``                                    Reserved.
     =================== =============== ================ ================= =======================================

.. _amdgpu-amdhsa-function-call-convention:

Call Convention
~~~~~~~~~~~~~~~

.. note::

  This section is currently incomplete and has inaccuracies. It is a work in
  progress that will be updated as information is determined.

See :ref:`amdgpu-dwarf-address-space-identifier` for information on swizzled
addresses. Unswizzled addresses are normal linear addresses.

.. _amdgpu-amdhsa-function-call-convention-kernel-functions:

Kernel Functions
++++++++++++++++

This section describes the call convention ABI for the outer kernel function.

See :ref:`amdgpu-amdhsa-initial-kernel-execution-state` for the kernel call
convention.

The following is not part of the AMDGPU kernel calling convention but describes
how the AMDGPU implements function calls:

1. Clang decides the kernarg layout to match the *HSA Programmer's Language
   Reference* [HSA]_.

   - All structs are passed directly.
   - Lambda values are passed *TBA*.

   .. TODO::

      - Does this really follow HSA rules? Or are structs >16 bytes passed
        by-value struct?
      - What is ABI for lambda values?

2. The kernel performs certain setup in its prolog, as described in
   :ref:`amdgpu-amdhsa-kernel-prolog`.

.. _amdgpu-amdhsa-function-call-convention-non-kernel-functions:

Non-Kernel Functions
++++++++++++++++++++

This section describes the call convention ABI for functions other than the
outer kernel function.

If a kernel has function calls then scratch is always allocated and used for
the call stack, which grows from low address to high address using the swizzled
scratch address space.

On entry to a function:
1. SGPR0-3 contain a V# with the following properties (see
   :ref:`amdgpu-amdhsa-kernel-prolog-private-segment-buffer`):

   * Base address pointing to the beginning of the wavefront scratch backing
     memory.
   * Swizzled with dword element size and stride of wavefront size elements.

2. The FLAT_SCRATCH register pair is set up. See
   :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`.
3. GFX6-GFX8: The M0 register is set to the size of LDS in bytes. See
   :ref:`amdgpu-amdhsa-kernel-prolog-m0`.
4. The EXEC register is set to the lanes active on entry to the function.
5. MODE register: *TBD*.
6. VGPR0-31 and SGPR4-29 are used to pass function input arguments as described
   below.
7. SGPR30-31 hold the return address (RA): the code address that the function
   must return to when it completes. The value is undefined if the function is
   *no return*.
8. SGPR32 is used for the stack pointer (SP). It is an unswizzled scratch
   offset relative to the beginning of the wavefront scratch backing memory.

   The unswizzled SP can be used with buffer instructions as an unswizzled SGPR
   offset with the scratch V# in SGPR0-3 to access the stack in a swizzled
   manner.

   The unswizzled SP value can be converted into the swizzled SP value by:

     | swizzled SP = unswizzled SP / wavefront size

   This may be used to obtain the private address space address of stack
   objects and to convert this address to a flat address by adding the flat
   scratch aperture base address.

   The swizzled SP value is always 4-byte aligned for the ``r600``
   architecture and 16-byte aligned for the ``amdgcn`` architecture.

   .. note::

     The ``amdgcn`` value is selected to avoid dynamic stack alignment for the
     OpenCL language which has the largest base type defined as 16 bytes.
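The SP conversion described above can be illustrated with a small sketch. This is not part of the ABI text; the wavefront size and the flat scratch aperture base used below are invented example values:

```python
# Sketch: converting an unswizzled stack-pointer offset to its swizzled
# form, and a private (scratch) address to a flat address.
# WAVEFRONT_SIZE and the aperture base are illustrative, not mandated.

WAVEFRONT_SIZE = 32  # e.g. a wave32 configuration; could also be 64


def swizzled_sp(unswizzled_sp: int) -> int:
    """swizzled SP = unswizzled SP / wavefront size (per the text above)."""
    assert unswizzled_sp % 4 == 0, "SP offsets are at least dword aligned"
    return unswizzled_sp // WAVEFRONT_SIZE


def flat_address(private_addr: int, scratch_aperture_base: int) -> int:
    """A private address becomes flat by adding the aperture base address."""
    return scratch_aperture_base + private_addr


# A 16-byte-per-lane allocation advances the unswizzled SP by
# 16 * wavefront-size bytes, i.e. the swizzled SP by 16 bytes.
assert swizzled_sp(16 * WAVEFRONT_SIZE) - swizzled_sp(0) == 16
```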
   On entry, the swizzled SP value is the address of the first function
   argument passed on the stack. Other stack passed arguments are positive
   offsets from the entry swizzled SP value.

   The function may use positive offsets beyond the last stack passed argument
   for stack allocated local variables and register spill slots. If necessary,
   the function may align these to greater alignment than 16 bytes. After these
   the function may dynamically allocate space for such things as runtime sized
   ``alloca`` local allocations.

   If the function calls another function, it will place any stack allocated
   arguments after the last local allocation and adjust SGPR32 to the address
   after the last local allocation.

9. All other registers are unspecified.
10. Any necessary ``s_waitcnt`` has been performed to ensure memory is available
    to the function.

On exit from a function:

1. VGPR0-31 and SGPR4-29 are used to pass function result arguments as
   described below. Any registers used are considered clobbered registers.
2. The following registers are preserved and have the same value as on entry:

   * FLAT_SCRATCH
   * EXEC
   * GFX6-GFX8: M0
   * All SGPR registers except the clobbered registers of SGPR4-31.
   * VGPR40-47
   * VGPR56-63
   * VGPR72-79
   * VGPR88-95
   * VGPR104-111
   * VGPR120-127
   * VGPR136-143
   * VGPR152-159
   * VGPR168-175
   * VGPR184-191
   * VGPR200-207
   * VGPR216-223
   * VGPR232-239
   * VGPR248-255

   .. note::

     Except the argument registers, the VGPRs clobbered and the preserved
     registers are intermixed at regular intervals in order to keep a
     similar ratio independent of the number of allocated VGPRs.

   * Lanes of all VGPRs that are inactive at the call site.

   For the AMDGPU backend, an inter-procedural register allocation (IPRA)
   optimization may mark some of the clobbered SGPR and VGPR registers as
   preserved if it can be determined that the called function does not change
   their value.

3. The PC is set to the RA provided on entry.
4. MODE register: *TBD*.
5. All other registers are clobbered.
6. Any necessary ``s_waitcnt`` has been performed to ensure memory accessed by
   the function is available to the caller.

.. TODO::

   - On gfx908 are all ACC registers clobbered?

   - How are function results returned? The address of structured types is
     passed by reference, but what about other types?

The function input arguments are made up of the formal arguments explicitly
declared by the source language function plus the implicit input arguments used
by the implementation.

The source language input arguments are:

1. Any source language implicit ``this`` or ``self`` argument comes first as a
   pointer type.
2. Followed by the function formal arguments in left to right source order.

The source language result arguments are:

1. The function result argument.

The source language input or result struct type arguments that are less than or
equal to 16 bytes are decomposed recursively into their base type fields, and
each field is passed as if a separate argument. For input arguments, if the
called function requires the struct to be in memory, for example because its
address is taken, then the function body is responsible for allocating a stack
location and copying the field arguments into it. Clang terms this *direct
struct*.

The source language input struct type arguments that are greater than 16 bytes
are passed by reference. The caller is responsible for allocating a stack
location to make a copy of the struct value and pass the address as the input
argument. The called function is responsible for performing the dereference
when accessing the input argument. Clang terms this *by-value struct*.

A source language result struct type argument that is greater than 16 bytes is
returned by reference. The caller is responsible for allocating a stack
location to hold the result value and passes the address as the last input
argument (before the implicit input arguments). In this case there are no
result arguments. The called function is responsible for performing the
dereference when storing the result value. Clang terms this *structured return
(sret)*.

*TODO: correct the ``sret`` definition.*

.. TODO::

   Is this definition correct? Or is ``sret`` only used if passing in
   registers, and pass as non-decomposed struct as stack argument? Or something
   else? Is the memory location in the caller stack frame, or a stack memory
   argument and so no address is passed as the caller can directly write to the
   argument stack location? But then the stack location is still live after
   return. If an argument stack location, is it the first stack argument or the
   last one?

Lambda argument types are treated as struct types with an implementation
defined set of fields.

.. TODO::

   Need to specify the ABI for lambda types for AMDGPU.

For the AMDGPU backend, all source language arguments (including the decomposed
struct type arguments) are passed in VGPRs, unless marked ``inreg`` in which
case they are passed in SGPRs.

The AMDGPU backend walks the function call graph from the leaves to determine
which implicit input arguments are used, propagating to each caller of the
function. The used implicit arguments are appended to the function arguments
after the source language arguments in the following order:

.. TODO::

   Is recursion or external functions supported?

1. Work-Item ID (1 VGPR)

   The X, Y and Z work-item IDs are packed into a single VGPR with the
   following layout. Only fields actually used by the function are set. The
   other bits are undefined.

   The values come from the initial kernel execution state. See
   :ref:`amdgpu-amdhsa-initial-kernel-execution-state`.

   .. table:: Work-item implicit argument layout
      :name: amdgpu-amdhsa-workitem-implicit-argument-layout-table

      ======= ======= ==============
      Bits    Size    Field Name
      ======= ======= ==============
      9:0     10 bits X Work-Item ID
      19:10   10 bits Y Work-Item ID
      29:20   10 bits Z Work-Item ID
      31:30   2 bits  Unused
      ======= ======= ==============

2. Dispatch Ptr (2 SGPRs)

   The value comes from the initial kernel execution state. See
   :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.

3. Queue Ptr (2 SGPRs)

   The value comes from the initial kernel execution state. See
   :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.

4. Kernarg Segment Ptr (2 SGPRs)

   The value comes from the initial kernel execution state. See
   :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.

5. Dispatch Id (2 SGPRs)

   The value comes from the initial kernel execution state. See
   :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.

6. Work-Group ID X (1 SGPR)

   The value comes from the initial kernel execution state. See
   :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.

7. Work-Group ID Y (1 SGPR)

   The value comes from the initial kernel execution state. See
   :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.

8. Work-Group ID Z (1 SGPR)

   The value comes from the initial kernel execution state. See
   :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.

9. Implicit Argument Ptr (2 SGPRs)

   The value is computed by adding an offset to Kernarg Segment Ptr to get the
   global address space pointer to the first kernarg implicit argument.

The input and result arguments are assigned in order in the following manner:

.. note::

   There are likely some errors and omissions in the following description that
   need correction.

   .. TODO::

      Check the Clang source code to decipher how function arguments and return
      results are handled. Also see the AMDGPU specific values used.

* VGPR arguments are assigned to consecutive VGPRs starting at VGPR0 up to
  VGPR31.

  If there are more arguments than will fit in these registers, the remaining
  arguments are allocated on the stack in order on naturally aligned
  addresses.

  .. TODO::

     How are overly aligned structures allocated on the stack?

* SGPR arguments are assigned to consecutive SGPRs starting at SGPR0 up to
  SGPR29.

  If there are more arguments than will fit in these registers, the remaining
  arguments are allocated on the stack in order on naturally aligned
  addresses.

Note that decomposed struct type arguments may have some fields passed in
registers and some in memory.

.. TODO::

   So, a struct which can pass some fields as decomposed register arguments,
   will pass the rest as decomposed stack elements? But an argument that will
   not start in registers will not be decomposed and will be passed as a
   non-decomposed stack value?
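The work-item implicit argument layout described earlier (X in bits 9:0, Y in bits 19:10, Z in bits 29:20, bits 31:30 unused) can be sketched as follows. This is an illustration of the packing, not backend code:

```python
# Sketch: packing and unpacking the three work-item IDs into the single
# VGPR layout from the "Work-item implicit argument layout" table.

def pack_workitem_id(x: int, y: int, z: int) -> int:
    """Pack X/Y/Z work-item IDs into one 32-bit value (10 bits each)."""
    assert 0 <= x < 1024 and 0 <= y < 1024 and 0 <= z < 1024
    return x | (y << 10) | (z << 20)


def unpack_workitem_id(packed: int):
    """Extract the X, Y and Z fields from the packed value."""
    return packed & 0x3FF, (packed >> 10) & 0x3FF, (packed >> 20) & 0x3FF


assert unpack_workitem_id(pack_workitem_id(5, 7, 3)) == (5, 7, 3)
```

Note that because only the fields actually used by the function are set, a real consumer must not rely on the unused bits (31:30) being zero.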
The following is not part of the AMDGPU function calling convention but
describes how the AMDGPU implements function calls:

1. SGPR33 is used as a frame pointer (FP) if necessary. Like the SP, it is an
   unswizzled scratch address. It is only needed if runtime sized ``alloca``
   are used, or for the reasons defined in ``SIFrameLowering``.
2. Runtime stack alignment is supported. SGPR34 is used as a base pointer (BP)
   to access the incoming stack arguments in the function. The BP is needed
   only when the function requires the runtime stack alignment.

3. Allocating SGPR arguments on the stack is not supported.

4. No CFI is currently generated. See
   :ref:`amdgpu-dwarf-call-frame-information`.

   .. note::

     CFI will be generated that defines the CFA as the unswizzled address
     relative to the wave scratch base in the unswizzled private address space
     of the lowest address stack allocated local variable.

     ``DW_AT_frame_base`` will be defined as the swizzled address in the
     swizzled private address space by dividing the CFA by the wavefront size
     (since the CFA is always at least dword aligned, which matches the scratch
     swizzle element size).

     If no dynamic stack alignment was performed, the stack allocated arguments
     are accessed as negative offsets relative to ``DW_AT_frame_base``, and the
     local variables and register spill slots are accessed as positive offsets
     relative to ``DW_AT_frame_base``.

5. Function argument passing is implemented by copying the input physical
   registers to virtual registers on entry. The register allocator can spill if
   necessary. These are copied back to physical registers at call sites. The
   net effect is that each function call can have these values in entirely
   distinct locations. The IPRA can help avoid shuffling argument registers.
6. Call sites are implemented by setting up the arguments at positive offsets
   from SP. Then SP is incremented to account for the known frame size before
   the call and decremented after the call.

   .. note::

     The CFI will reflect the changed calculation needed to compute the CFA
     from SP.

7. 4 byte spill slots are used in the stack frame. One slot is allocated for an
   emergency spill slot. Buffer instructions are used for stack accesses and
   not the ``flat_scratch`` instruction.

   .. TODO::

      Explain when the emergency spill slot is used.

.. TODO::

   Possible broken issues:

   - Stack arguments must be aligned to the required alignment.
   - Stack is aligned to max(16, max formal argument alignment).
   - Direct argument < 64 bits should check register budget.
   - Register budget calculation should respect ``inreg`` for SGPR.
   - SGPR overflow is not handled.
   - struct with 1 member unpeeling is not checking size of member.
   - ``sret`` is after ``this`` pointer.
   - Caller is not implementing stack realignment: need an extra pointer.
   - Should say AMDGPU passes FP rather than SP.
   - Should CFI define CFA as address of locals or arguments. Difference is
     apparent when have implemented dynamic alignment.
   - If ``SCRATCH`` instruction could allow negative offsets, then can make FP
     be highest address of stack frame and use negative offset for locals.
     Would allow SP to be the same as FP and could support signal-handler-like
     as now have a real SP for the top of the stack.
   - How is ``sret`` passed on the stack? In argument stack area? Can it
     overlay arguments?

AMDPAL
------

This section provides code conventions used when the target triple OS is
``amdpal`` (see :ref:`amdgpu-target-triples`).
.. _amdgpu-amdpal-code-object-metadata-section:

Code Object Metadata
~~~~~~~~~~~~~~~~~~~~

.. note::

  The metadata is currently in development and is subject to major
  changes. Only the current version is supported. *When this document
  was generated the version was 2.6.*

Code object metadata is specified by the ``NT_AMDGPU_METADATA`` note
record (see :ref:`amdgpu-note-records-v3-v4`).

The metadata is represented as MessagePack formatted binary data (see
[MsgPack]_). The top level is a MessagePack map that includes the keys
defined in table :ref:`amdgpu-amdpal-code-object-metadata-map-table`
and referenced tables.

Additional information can be added to the maps. To avoid conflicts, any
key names should be prefixed by "*vendor-name*." where ``vendor-name``
can be the name of the vendor and specific vendor tool that generates the
information. The prefix is abbreviated to simply "." when it appears
within a map that has been added by the same *vendor-name*.

  .. table:: AMDPAL Code Object Metadata Map
     :name: amdgpu-amdpal-code-object-metadata-map-table

     =================== ============== ========= ======================================================================
     String Key          Value Type     Required? Description
     =================== ============== ========= ======================================================================
     "amdpal.version"    sequence of    Required  PAL code object metadata (major, minor) version. The current values
                         2 integers               are defined by *Util::Abi::PipelineMetadata(Major|Minor)Version*.
     "amdpal.pipelines"  sequence of    Required  Per-pipeline metadata. See
                         map                      :ref:`amdgpu-amdpal-code-object-pipeline-metadata-map-table` for the
                                                  definition of the keys included in that map.
     =================== ============== ========= ======================================================================

..

  .. table:: AMDPAL Code Object Pipeline Metadata Map
     :name: amdgpu-amdpal-code-object-pipeline-metadata-map-table

     ====================================== ============== ========= ===================================================
     String Key                             Value Type     Required? Description
     ====================================== ============== ========= ===================================================
     ".name"                                string                   Source name of the pipeline.
     ".type"                                string                   Pipeline type, e.g. VsPs. Values include:

                                                                     - "VsPs"
                                                                     - "Gs"
                                                                     - "Cs"
                                                                     - "Ngg"
                                                                     - "Tess"
                                                                     - "GsTess"
                                                                     - "NggTess"

     ".internal_pipeline_hash"              sequence of    Required  Internal compiler hash for this pipeline. Lower
                                            2 integers               64 bits is the "stable" portion of the hash, used
                                                                     for e.g. shader replacement lookup. Upper 64 bits
                                                                     is the "unique" portion of the hash, used for
                                                                     e.g. pipeline cache lookup. The value is
                                                                     implementation defined, and cannot be relied on
                                                                     between different builds of the compiler.
     ".shaders"                             map                      Per-API shader metadata. See
                                                                     :ref:`amdgpu-amdpal-code-object-shader-map-table`
                                                                     for the definition of the keys included in that
                                                                     map.
     ".hardware_stages"                     map                      Per-hardware stage metadata. See
                                                                     :ref:`amdgpu-amdpal-code-object-hardware-stage-map-table`
                                                                     for the definition of the keys included in that
                                                                     map.
     ".shader_functions"                    map                      Per-shader function metadata. See
                                                                     :ref:`amdgpu-amdpal-code-object-shader-function-map-table`
                                                                     for the definition of the keys included in that
                                                                     map.
     ".registers"                           map            Required  Hardware register configuration. See
                                                                     :ref:`amdgpu-amdpal-code-object-register-map-table`
                                                                     for the definition of the keys included in that
                                                                     map.
     ".user_data_limit"                     integer                  Number of user data entries accessed by this
                                                                     pipeline.
     ".spill_threshold"                     integer                  The user data spill threshold. 0xFFFF for
                                                                     NoUserDataSpilling.
     ".uses_viewport_array_index"           boolean                  Indicates whether or not the pipeline uses the
                                                                     viewport array index feature. Pipelines which use
                                                                     this feature can render into all 16 viewports,
                                                                     whereas pipelines which do not use it are
                                                                     restricted to viewport #0.
     ".es_gs_lds_size"                      integer                  Size in bytes of LDS space used internally for
                                                                     handling data-passing between the ES and GS
                                                                     shader stages. This can be zero if the data is
                                                                     passed using off-chip buffers. This value should
                                                                     be used to program all user-SGPRs which have been
                                                                     marked with "UserDataMapping::EsGsLdsSize"
                                                                     (typically only the GS and VS HW stages will ever
                                                                     have a user-SGPR so marked).
     ".nggSubgroupSize"                     integer                  Explicit maximum subgroup size for NGG shaders
                                                                     (maximum number of threads in a subgroup).
     ".num_interpolants"                    integer                  Graphics only. Number of PS interpolants.
     ".mesh_scratch_memory_size"            integer                  Max mesh shader scratch memory used.
     ".api"                                 string                   Name of the client graphics API.
     ".api_create_info"                     binary                   Graphics API shader create info binary blob. Can
                                                                     be defined by the driver using the compiler if
                                                                     they want to be able to correlate API-specific
                                                                     information used during creation at a later time.
     ====================================== ============== ========= ===================================================

..

  ..
table:: AMDPAL Code Object Shader Map 11146 :name: amdgpu-amdpal-code-object-shader-map-table 11147 11148 11149 +-------------+--------------+-------------------------------------------------------------------+ 11150 |String Key |Value Type |Description | 11151 +=============+==============+===================================================================+ 11152 |- ".compute" |map |See :ref:`amdgpu-amdpal-code-object-api-shader-metadata-map-table` | 11153 |- ".vertex" | |for the definition of the keys included in that map. | 11154 |- ".hull" | | | 11155 |- ".domain" | | | 11156 |- ".geometry"| | | 11157 |- ".pixel" | | | 11158 +-------------+--------------+-------------------------------------------------------------------+ 11159 11160.. 11161 11162 .. table:: AMDPAL Code Object API Shader Metadata Map 11163 :name: amdgpu-amdpal-code-object-api-shader-metadata-map-table 11164 11165 ==================== ============== ========= ===================================================================== 11166 String Key Value Type Required? Description 11167 ==================== ============== ========= ===================================================================== 11168 ".api_shader_hash" sequence of Required Input shader hash, typically passed in from the client. The value 11169 2 integers is implementation defined, and can not be relied on between 11170 different builds of the compiler. 11171 ".hardware_mapping" sequence of Required Flags indicating the HW stages this API shader maps to. Values 11172 string include: 11173 11174 - ".ls" 11175 - ".hs" 11176 - ".es" 11177 - ".gs" 11178 - ".vs" 11179 - ".ps" 11180 - ".cs" 11181 11182 ==================== ============== ========= ===================================================================== 11183 11184.. 11185 11186 .. 
table:: AMDPAL Code Object Hardware Stage Map 11187 :name: amdgpu-amdpal-code-object-hardware-stage-map-table 11188 11189 +-------------+--------------+-----------------------------------------------------------------------+ 11190 |String Key |Value Type |Description | 11191 +=============+==============+=======================================================================+ 11192 |- ".ls" |map |See :ref:`amdgpu-amdpal-code-object-hardware-stage-metadata-map-table` | 11193 |- ".hs" | |for the definition of the keys included in that map. | 11194 |- ".es" | | | 11195 |- ".gs" | | | 11196 |- ".vs" | | | 11197 |- ".ps" | | | 11198 |- ".cs" | | | 11199 +-------------+--------------+-----------------------------------------------------------------------+ 11200 11201.. 11202 11203 .. table:: AMDPAL Code Object Hardware Stage Metadata Map 11204 :name: amdgpu-amdpal-code-object-hardware-stage-metadata-map-table 11205 11206 ========================== ============== ========= =============================================================== 11207 String Key Value Type Required? Description 11208 ========================== ============== ========= =============================================================== 11209 ".entry_point" string The ELF symbol pointing to this pipeline's stage entry point. 11210 ".scratch_memory_size" integer Scratch memory size in bytes. 11211 ".lds_size" integer Local Data Share size in bytes. 11212 ".perf_data_buffer_size" integer Performance data buffer size in bytes. 11213 ".vgpr_count" integer Number of VGPRs used. 11214 ".sgpr_count" integer Number of SGPRs used. 11215 ".vgpr_limit" integer If non-zero, indicates the shader was compiled with a 11216 directive to instruct the compiler to limit the VGPR usage to 11217 be less than or equal to the specified value (only set if 11218 different from HW default). 11219 ".sgpr_limit" integer SGPR count upper limit (only set if different from HW 11220 default). 
11221 ".threadgroup_dimensions" sequence of Thread-group X/Y/Z dimensions (Compute only). 11222 3 integers 11223 ".wavefront_size" integer Wavefront size (only set if different from HW default). 11224 ".uses_uavs" boolean The shader reads or writes UAVs. 11225 ".uses_rovs" boolean The shader reads or writes ROVs. 11226 ".writes_uavs" boolean The shader writes to one or more UAVs. 11227 ".writes_depth" boolean The shader writes out a depth value. 11228 ".uses_append_consume" boolean The shader uses append and/or consume operations, either 11229 memory or GDS. 11230 ".uses_prim_id" boolean The shader uses PrimID. 11231 ========================== ============== ========= =============================================================== 11232 11233.. 11234 11235 .. table:: AMDPAL Code Object Shader Function Map 11236 :name: amdgpu-amdpal-code-object-shader-function-map-table 11237 11238 =============== ============== ==================================================================== 11239 String Key Value Type Description 11240 =============== ============== ==================================================================== 11241 *symbol name* map *symbol name* is the ELF symbol name of the shader function code 11242 entry address. The value is the function's metadata. See 11243 :ref:`amdgpu-amdpal-code-object-shader-function-metadata-map-table`. 11244 =============== ============== ==================================================================== 11245 11246.. 11247 11248 .. 
table:: AMDPAL Code Object Shader Function Metadata Map 11249 :name: amdgpu-amdpal-code-object-shader-function-metadata-map-table 11250 11251 ============================= ============== ================================================================= 11252 String Key Value Type Description 11253 ============================= ============== ================================================================= 11254 ".api_shader_hash" sequence of Input shader hash, typically passed in from the client. The value 11255 2 integers is implementation defined, and can not be relied on between 11256 different builds of the compiler. 11257 ".scratch_memory_size" integer Size in bytes of scratch memory used by the shader. 11258 ".lds_size" integer Size in bytes of LDS memory. 11259 ".vgpr_count" integer Number of VGPRs used by the shader. 11260 ".sgpr_count" integer Number of SGPRs used by the shader. 11261 ".stack_frame_size_in_bytes" integer Amount of stack size used by the shader. 11262 ".shader_subtype" string Shader subtype/kind. Values include: 11263 11264 - "Unknown" 11265 11266 ============================= ============== ================================================================= 11267 11268.. 11269 11270 .. table:: AMDPAL Code Object Register Map 11271 :name: amdgpu-amdpal-code-object-register-map-table 11272 11273 ========================== ============== ==================================================================== 11274 32-bit Integer Key Value Type Description 11275 ========================== ============== ==================================================================== 11276 ``reg offset`` 32-bit integer ``reg offset`` is the dword offset into the GFXIP register space of 11277 a GRBM register (i.e., driver accessible GPU register number, not 11278 shader GPR register number). The driver is required to program each 11279 specified register to the corresponding specified value when 11280 executing this pipeline. 
Typically, the ``reg offsets`` are the 11281 ``uint16_t`` offsets to each register as defined by the hardware 11282 chip headers. The register is set to the provided value. However, a 11283 ``reg offset`` that specifies a user data register (e.g., 11284 COMPUTE_USER_DATA_0) needs special treatment. See 11285 :ref:`amdgpu-amdpal-code-object-user-data-section` section for more 11286 information. 11287 ========================== ============== ==================================================================== 11288 11289.. _amdgpu-amdpal-code-object-user-data-section: 11290 11291User Data 11292+++++++++ 11293 11294Each hardware stage has a set of 32-bit physical SPI *user data registers* 11295(either 16 or 32 based on graphics IP and the stage) which can be 11296written from a command buffer and then loaded into SGPRs when waves are 11297launched via a subsequent dispatch or draw operation. This is the way 11298most arguments are passed from the application/runtime to a hardware 11299shader. 11300 11301PAL abstracts this functionality by exposing a set of 128 *user data 11302entries* per pipeline a client can use to pass arguments from a command 11303buffer to one or more shaders in that pipeline. The ELF code object must 11304specify a mapping from virtualized *user data entries* to physical *user 11305data registers*, and PAL is responsible for implementing that mapping, 11306including spilling overflow *user data entries* to memory if needed. 11307 11308Since the *user data registers* are GRBM-accessible SPI registers, this 11309mapping is actually embedded in the ``.registers`` metadata entry. For 11310most registers, the value in that map is a literal 32-bit value that 11311should be written to the register by the driver. 
However, when the 11312register is a *user data register* (any USER_DATA register e.g., 11313SPI_SHADER_USER_DATA_PS_5), the value is instead an encoding that tells 11314the driver to write either a *user data entry* value or one of several 11315driver-internal values to the register. This encoding is described in 11316the following table: 11317 11318.. note:: 11319 11320 Currently, *user data registers* 0 and 1 (e.g., SPI_SHADER_USER_DATA_PS_0, 11321 and SPI_SHADER_USER_DATA_PS_1) are reserved. *User data register* 0 must 11322 always be programmed to the address of the GlobalTable, and *user data 11323 register* 1 must always be programmed to the address of the PerShaderTable. 11324 11325.. 11326 11327 .. table:: AMDPAL User Data Mapping 11328 :name: amdgpu-amdpal-code-object-metadata-user-data-mapping-table 11329 11330 ========== ================= =============================================================================== 11331 Value Name Description 11332 ========== ================= =============================================================================== 11333 0..127 *User Data Entry* 32-bit value of user_data_entry[N] as specified via *CmdSetUserData()* 11334 0x10000000 GlobalTable 32-bit pointer to GPU memory containing the global internal table (should 11335 always point to *user data register* 0). 11336 0x10000001 PerShaderTable 32-bit pointer to GPU memory containing the per-shader internal table. See 11337 :ref:`amdgpu-amdpal-code-object-metadata-user-data-per-shader-table-section` 11338 for more detail (should always point to *user data register* 1). 11339 0x10000002 SpillTable 32-bit pointer to GPU memory containing the user data spill table. See 11340 :ref:`amdgpu-amdpal-code-object-metadata-user-data-spill-table-section` for 11341 more detail. 11342 0x10000003 BaseVertex Vertex offset (32-bit unsigned integer). Not needed if the pipeline doesn't 11343 reference the draw index in the vertex shader. 
Only supported by the first 11344 stage in a graphics pipeline. 11345 0x10000004 BaseInstance Instance offset (32-bit unsigned integer). Only supported by the first stage in 11346 a graphics pipeline. 11347 0x10000005 DrawIndex Draw index (32-bit unsigned integer). Only supported by the first stage in a 11348 graphics pipeline. 11349 0x10000006 Workgroup Thread group count (32-bit unsigned integer). Low half of a 64-bit address of 11350 a buffer containing the grid dimensions for a Compute dispatch operation. The 11351 high half of the address is stored in the next sequential user-SGPR. Only 11352 supported by compute pipelines. 11353 0x1000000A EsGsLdsSize Indicates that PAL will program this user-SGPR to contain the amount of LDS 11354 space used for the ES/GS pseudo-ring-buffer for passing data between shader 11355 stages. 11356 0x1000000B ViewId View id (32-bit unsigned integer) identifies a view of graphic 11357 pipeline instancing. 11358 0x1000000C StreamOutTable 32-bit pointer to GPU memory containing the stream out target SRD table. This 11359 can only appear for one shader stage per pipeline. 11360 0x1000000D PerShaderPerfData 32-bit pointer to GPU memory containing the per-shader performance data buffer. 11361 0x1000000F VertexBufferTable 32-bit pointer to GPU memory containing the vertex buffer SRD table. This can 11362 only appear for one shader stage per pipeline. 11363 0x10000010 UavExportTable 32-bit pointer to GPU memory containing the UAV export SRD table. This can 11364 only appear for one shader stage per pipeline (PS). These replace color targets 11365 and are completely separate from any UAVs used by the shader. This is optional, 11366 and only used by the PS when UAV exports are used to replace color-target 11367 exports to optimize specific shaders. 11368 0x10000011 NggCullingData 64-bit pointer to GPU memory containing the hardware register data needed by 11369 some NGG pipelines to perform culling. 
                                   This value contains the address of the
                                   first of two consecutive registers which provide the full GPU
                                   address.
     0x10000015 FetchShaderPtr    64-bit pointer to GPU memory containing the fetch shader subroutine.
     ========== ================= ===============================================================================

.. _amdgpu-amdpal-code-object-metadata-user-data-per-shader-table-section:

Per-Shader Table
################

Low 32 bits of the GPU address for an optional buffer in the ``.data``
section of the ELF. The high 32 bits of the address match the high 32 bits
of the shader's program counter.

The buffer can be anything the shader compiler needs it for, and
allows each shader to have its own region of the ``.data`` section.
Typically, this could be a table of buffer SRDs and the data pointed to
by the buffer SRDs, but it could be a flat-address region of memory as
well. Its layout and usage are defined by the shader compiler.

Each shader's table in the ``.data`` section is referenced by the symbol
``_amdgpu_``\ *xs*\ ``_shdr_intrl_data`` where *xs* corresponds to the
hardware shader stage the data is for. E.g.,
``_amdgpu_cs_shdr_intrl_data`` for the compute shader hardware stage.

.. _amdgpu-amdpal-code-object-metadata-user-data-spill-table-section:

Spill Table
###########

It is possible for a hardware shader to need access to more *user data
entries* than there are slots available in user data registers for one
or more hardware shader stages. In that case, the PAL runtime expects
the necessary *user data entries* to be spilled to GPU memory and uses
one user data register to point to the spilled user data memory. The
value of the *user data entry* must then represent the location where
a shader expects to read the low 32 bits of the table's GPU virtual
address.
The *spill table* itself represents a set of 32-bit values
managed by the PAL runtime in GPU-accessible memory that can be made
indirectly accessible to a hardware shader.

Unspecified OS
--------------

This section provides code conventions used when the target triple OS is
empty (see :ref:`amdgpu-target-triples`).

Trap Handler ABI
~~~~~~~~~~~~~~~~

For code objects generated by the AMDGPU backend for a non-amdhsa OS, the
runtime does not install a trap handler. The ``llvm.trap`` and
``llvm.debugtrap`` instructions are handled as follows:

  .. table:: AMDGPU Trap Handler for Non-AMDHSA OS
     :name: amdgpu-trap-handler-for-non-amdhsa-os-table

     =============== =============== ===========================================
     Usage           Code Sequence   Description
     =============== =============== ===========================================
     llvm.trap       s_endpgm        Causes wavefront to be terminated.
     llvm.debugtrap  *none*          Compiler warning given that there is no
                                     trap handler installed.
     =============== =============== ===========================================

Source Languages
================

.. _amdgpu-opencl:

OpenCL
------

When the language is OpenCL, the following differences occur:

1. The OpenCL memory model is used (see :ref:`amdgpu-amdhsa-memory-model`).
2. The AMDGPU backend appends additional arguments to the kernel's explicit
   arguments for the AMDHSA OS (see
   :ref:`opencl-kernel-implicit-arguments-appended-for-amdhsa-os-table`).
3. Additional metadata is generated
   (see :ref:`amdgpu-amdhsa-code-object-metadata`).

  .. 
table:: OpenCL kernel implicit arguments appended for AMDHSA OS
     :name: opencl-kernel-implicit-arguments-appended-for-amdhsa-os-table

     ======== ==== ========= ===========================================
     Position Byte Byte      Description
              Size Alignment
     ======== ==== ========= ===========================================
     1        8    8         OpenCL Global Offset X
     2        8    8         OpenCL Global Offset Y
     3        8    8         OpenCL Global Offset Z
     4        8    8         OpenCL address of printf buffer
     5        8    8         OpenCL address of virtual queue used by
                             enqueue_kernel.
     6        8    8         OpenCL address of AqlWrap struct used by
                             enqueue_kernel.
     7        8    8         Pointer argument used for Multi-grid
                             synchronization.
     ======== ==== ========= ===========================================

.. _amdgpu-hcc:

HCC
---

When the language is HCC, the following differences occur:

1. The HSA memory model is used (see :ref:`amdgpu-amdhsa-memory-model`).

.. _amdgpu-assembler:

Assembler
---------

The AMDGPU backend has an LLVM-MC-based assembler that is currently in
development. It supports AMDGCN GFX6-GFX10.

This section describes the general syntax for instructions and operands.

Instructions
~~~~~~~~~~~~

An instruction has the following :doc:`syntax<AMDGPUInstructionSyntax>`:

  | ``<``\ *opcode*\ ``> <``\ *operand0*\ ``>, <``\ *operand1*\ ``>,...
    <``\ *modifier0*\ ``> <``\ *modifier1*\ ``>...``

:doc:`Operands<AMDGPUOperandSyntax>` are comma-separated, while
:doc:`modifiers<AMDGPUModifierSyntax>` are space-separated.

The order of operands and modifiers is fixed.
Most modifiers are optional and may be omitted.

Links to the detailed instruction syntax description may be found in the
following table. Note that features under development are not included
in this description.

  =================================== =======================================
  Core ISA                            ISA Extensions
  =================================== =======================================
  :doc:`GFX7<AMDGPU/AMDGPUAsmGFX7>`   \-
  :doc:`GFX8<AMDGPU/AMDGPUAsmGFX8>`   \-
  :doc:`GFX9<AMDGPU/AMDGPUAsmGFX9>`   :doc:`gfx900<AMDGPU/AMDGPUAsmGFX900>`

                                      :doc:`gfx902<AMDGPU/AMDGPUAsmGFX900>`

                                      :doc:`gfx904<AMDGPU/AMDGPUAsmGFX904>`

                                      :doc:`gfx906<AMDGPU/AMDGPUAsmGFX906>`

                                      :doc:`gfx908<AMDGPU/AMDGPUAsmGFX908>`

                                      :doc:`gfx909<AMDGPU/AMDGPUAsmGFX900>`

                                      :doc:`gfx90a<AMDGPU/AMDGPUAsmGFX90a>`

  :doc:`GFX10<AMDGPU/AMDGPUAsmGFX10>` :doc:`gfx1011<AMDGPU/AMDGPUAsmGFX1011>`

                                      :doc:`gfx1012<AMDGPU/AMDGPUAsmGFX1011>`
  =================================== =======================================

For more information about instructions, their semantics and supported
combinations of operands, refer to one of the instruction set architecture
manuals [AMD-GCN-GFX6]_, [AMD-GCN-GFX7]_, [AMD-GCN-GFX8]_,
[AMD-GCN-GFX900-GFX904-VEGA]_, [AMD-GCN-GFX906-VEGA7NM]_,
[AMD-GCN-GFX908-CDNA1]_, [AMD-GCN-GFX10-RDNA1]_ and [AMD-GCN-GFX10-RDNA2]_.

Operands
~~~~~~~~

A detailed description of operands may be found
:doc:`here<AMDGPUOperandSyntax>`.

Modifiers
~~~~~~~~~

A detailed description of modifiers may be found
:doc:`here<AMDGPUModifierSyntax>`.

Instruction Examples
~~~~~~~~~~~~~~~~~~~~

DS
++

.. code-block:: nasm

  ds_add_u32 v2, v4 offset:16
  ds_write_src2_b64 v2 offset0:4 offset1:8
  ds_cmpst_f32 v2, v4, v6
  ds_min_rtn_f64 v[8:9], v2, v[4:5]

For full list of supported instructions, refer to "LDS/GDS instructions" in ISA
Manual.

FLAT
++++

.. 
code-block:: nasm 11568 11569 flat_load_dword v1, v[3:4] 11570 flat_store_dwordx3 v[3:4], v[5:7] 11571 flat_atomic_swap v1, v[3:4], v5 glc 11572 flat_atomic_cmpswap v1, v[3:4], v[5:6] glc slc 11573 flat_atomic_fmax_x2 v[1:2], v[3:4], v[5:6] glc 11574 11575For full list of supported instructions, refer to "FLAT instructions" in ISA 11576Manual. 11577 11578MUBUF 11579+++++ 11580 11581.. code-block:: nasm 11582 11583 buffer_load_dword v1, off, s[4:7], s1 11584 buffer_store_dwordx4 v[1:4], v2, ttmp[4:7], s1 offen offset:4 glc tfe 11585 buffer_store_format_xy v[1:2], off, s[4:7], s1 11586 buffer_wbinvl1 11587 buffer_atomic_inc v1, v2, s[8:11], s4 idxen offset:4 slc 11588 11589For full list of supported instructions, refer to "MUBUF Instructions" in ISA 11590Manual. 11591 11592SMRD/SMEM 11593+++++++++ 11594 11595.. code-block:: nasm 11596 11597 s_load_dword s1, s[2:3], 0xfc 11598 s_load_dwordx8 s[8:15], s[2:3], s4 11599 s_load_dwordx16 s[88:103], s[2:3], s4 11600 s_dcache_inv_vol 11601 s_memtime s[4:5] 11602 11603For full list of supported instructions, refer to "Scalar Memory Operations" in 11604ISA Manual. 11605 11606SOP1 11607++++ 11608 11609.. code-block:: nasm 11610 11611 s_mov_b32 s1, s2 11612 s_mov_b64 s[0:1], 0x80000000 11613 s_cmov_b32 s1, 200 11614 s_wqm_b64 s[2:3], s[4:5] 11615 s_bcnt0_i32_b64 s1, s[2:3] 11616 s_swappc_b64 s[2:3], s[4:5] 11617 s_cbranch_join s[4:5] 11618 11619For full list of supported instructions, refer to "SOP1 Instructions" in ISA 11620Manual. 11621 11622SOP2 11623++++ 11624 11625.. code-block:: nasm 11626 11627 s_add_u32 s1, s2, s3 11628 s_and_b64 s[2:3], s[4:5], s[6:7] 11629 s_cselect_b32 s1, s2, s3 11630 s_andn2_b32 s2, s4, s6 11631 s_lshr_b64 s[2:3], s[4:5], s6 11632 s_ashr_i32 s2, s4, s6 11633 s_bfm_b64 s[2:3], s4, s6 11634 s_bfe_i64 s[2:3], s[4:5], s6 11635 s_cbranch_g_fork s[4:5], s[6:7] 11636 11637For full list of supported instructions, refer to "SOP2 Instructions" in ISA 11638Manual. 11639 11640SOPC 11641++++ 11642 11643.. 
code-block:: nasm

  s_cmp_eq_i32 s1, s2
  s_bitcmp1_b32 s1, s2
  s_bitcmp0_b64 s[2:3], s4
  s_setvskip s3, s5

For full list of supported instructions, refer to "SOPC Instructions" in ISA
Manual.

SOPP
++++

.. code-block:: nasm

  s_barrier
  s_nop 2
  s_endpgm
  s_waitcnt 0 ; Wait for all counters to be 0
  s_waitcnt vmcnt(0) & expcnt(0) & lgkmcnt(0) ; Equivalent to above
  s_waitcnt vmcnt(1) ; Wait for vmcnt counter to be 1.
  s_sethalt 9
  s_sleep 10
  s_sendmsg 0x1
  s_sendmsg sendmsg(MSG_INTERRUPT)
  s_trap 1

For full list of supported instructions, refer to "SOPP Instructions" in ISA
Manual.

Unless otherwise mentioned, little verification is performed on the operands
of SOPP instructions, so it is up to the programmer to be familiar with the
range of acceptable values.

VALU
++++

For vector ALU instruction opcodes (VOP1, VOP2, VOP3, VOPC, VOP_DPP, VOP_SDWA),
the assembler will automatically use the optimal encoding based on its
operands. To force a specific encoding, one can add a suffix to the opcode of
the instruction:

* _e32 for 32-bit VOP1/VOP2/VOPC
* _e64 for 64-bit VOP3
* _dpp for VOP_DPP
* _sdwa for VOP_SDWA

VOP1/VOP2/VOP3/VOPC examples:

.. code-block:: nasm

  v_mov_b32 v1, v2
  v_mov_b32_e32 v1, v2
  v_nop
  v_cvt_f64_i32_e32 v[1:2], v2
  v_floor_f32_e32 v1, v2
  v_bfrev_b32_e32 v1, v2
  v_add_f32_e32 v1, v2, v3
  v_mul_i32_i24_e64 v1, v2, 3
  v_mul_i32_i24_e32 v1, -3, v3
  v_mul_i32_i24_e32 v1, -100, v3
  v_addc_u32 v1, s[0:1], v2, v3, s[2:3]
  v_max_f16_e32 v1, v2, v3

VOP_DPP examples:

.. 
code-block:: nasm 11709 11710 v_mov_b32 v0, v0 quad_perm:[0,2,1,1] 11711 v_sin_f32 v0, v0 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0 11712 v_mov_b32 v0, v0 wave_shl:1 11713 v_mov_b32 v0, v0 row_mirror 11714 v_mov_b32 v0, v0 row_bcast:31 11715 v_mov_b32 v0, v0 quad_perm:[1,3,0,1] row_mask:0xa bank_mask:0x1 bound_ctrl:0 11716 v_add_f32 v0, v0, |v0| row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0 11717 v_max_f16 v1, v2, v3 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0 11718 11719VOP_SDWA examples: 11720 11721.. code-block:: nasm 11722 11723 v_mov_b32 v1, v2 dst_sel:BYTE_0 dst_unused:UNUSED_PRESERVE src0_sel:DWORD 11724 v_min_u32 v200, v200, v1 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_1 src1_sel:DWORD 11725 v_sin_f32 v0, v0 dst_unused:UNUSED_PAD src0_sel:WORD_1 11726 v_fract_f32 v0, |v0| dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1 11727 v_cmpx_le_u32 vcc, v1, v2 src0_sel:BYTE_2 src1_sel:WORD_0 11728 11729For full list of supported instructions, refer to "Vector ALU instructions". 11730 11731.. _amdgpu-amdhsa-assembler-predefined-symbols-v2: 11732 11733Code Object V2 Predefined Symbols 11734~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 11735 11736.. warning:: 11737 Code object V2 is not the default code object version emitted by 11738 this version of LLVM. 11739 11740The AMDGPU assembler defines and updates some symbols automatically. These 11741symbols do not affect code generation. 11742 11743.option.machine_version_major 11744+++++++++++++++++++++++++++++ 11745 11746Set to the GFX major generation number of the target being assembled for. For 11747example, when assembling for a "GFX9" target this will be set to the integer 11748value "9". The possible GFX major generation numbers are presented in 11749:ref:`amdgpu-processors`. 11750 11751.option.machine_version_minor 11752+++++++++++++++++++++++++++++ 11753 11754Set to the GFX minor generation number of the target being assembled for. 
For 11755example, when assembling for a "GFX810" target this will be set to the integer 11756value "1". The possible GFX minor generation numbers are presented in 11757:ref:`amdgpu-processors`. 11758 11759.option.machine_version_stepping 11760++++++++++++++++++++++++++++++++ 11761 11762Set to the GFX stepping generation number of the target being assembled for. 11763For example, when assembling for a "GFX704" target this will be set to the 11764integer value "4". The possible GFX stepping generation numbers are presented 11765in :ref:`amdgpu-processors`. 11766 11767.kernel.vgpr_count 11768++++++++++++++++++ 11769 11770Set to zero each time a 11771:ref:`amdgpu-amdhsa-assembler-directive-amdgpu_hsa_kernel` directive is 11772encountered. At each instruction, if the current value of this symbol is less 11773than or equal to the maximum VGPR number explicitly referenced within that 11774instruction then the symbol value is updated to equal that VGPR number plus 11775one. 11776 11777.kernel.sgpr_count 11778++++++++++++++++++ 11779 11780Set to zero each time a 11781:ref:`amdgpu-amdhsa-assembler-directive-amdgpu_hsa_kernel` directive is 11782encountered. At each instruction, if the current value of this symbol is less 11783than or equal to the maximum VGPR number explicitly referenced within that 11784instruction then the symbol value is updated to equal that SGPR number plus 11785one. 11786 11787.. _amdgpu-amdhsa-assembler-directives-v2: 11788 11789Code Object V2 Directives 11790~~~~~~~~~~~~~~~~~~~~~~~~~ 11791 11792.. warning:: 11793 Code object V2 is not the default code object version emitted by 11794 this version of LLVM. 11795 11796AMDGPU ABI defines auxiliary data in output code object. In assembly source, 11797one can specify them with assembler directives. 

.hsa_code_object_version major, minor
+++++++++++++++++++++++++++++++++++++

*major* and *minor* are integers that specify the version of the HSA code
object that will be generated by the assembler.

.hsa_code_object_isa [major, minor, stepping, vendor, arch]
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

*major*, *minor*, and *stepping* are all integers that describe the instruction
set architecture (ISA) version of the assembly program.

*vendor* and *arch* are quoted strings. *vendor* should always be equal to
"AMD" and *arch* should always be equal to "AMDGPU".

By default, the assembler will derive the ISA version, *vendor*, and *arch*
from the value of the -mcpu option that is passed to the assembler.

.. _amdgpu-amdhsa-assembler-directive-amdgpu_hsa_kernel:

.amdgpu_hsa_kernel (name)
+++++++++++++++++++++++++

This directive specifies that the symbol with the given name is a kernel entry
point (label) and the object should contain a corresponding symbol of type
STT_AMDGPU_HSA_KERNEL.

.amd_kernel_code_t
++++++++++++++++++

This directive marks the beginning of a list of key / value pairs that are used
to specify the amd_kernel_code_t object that will be emitted by the assembler.
The list must be terminated by the *.end_amd_kernel_code_t* directive. For any
amd_kernel_code_t values that are unspecified, a default value will be used.
The default value for all keys is 0, with the following exceptions:

- *amd_code_version_major* defaults to 1.
- *amd_kernel_code_version_minor* defaults to 2.
- *amd_machine_kind* defaults to 1.
- *amd_machine_version_major*, *amd_machine_version_minor*, and
  *amd_machine_version_stepping* are derived from the value of the -mcpu option
  that is passed to the assembler.
- *kernel_code_entry_byte_offset* defaults to 256.
- *wavefront_size* defaults to 6 for all targets before GFX10. For GFX10
  onwards defaults to 6 if target feature ``wavefrontsize64`` is enabled,
  otherwise 5. Note that wavefront size is specified as a power of two, so a
  value of **n** means a size of 2^ **n**.
- *call_convention* defaults to -1.
- *kernarg_segment_alignment*, *group_segment_alignment*, and
  *private_segment_alignment* default to 4. Note that alignments are specified
  as a power of 2, so a value of **n** means an alignment of 2^ **n**.
- *enable_tg_split* defaults to 1 if target feature ``tgsplit`` is enabled for
  GFX90A onwards.
- *enable_wgp_mode* defaults to 1 if target feature ``cumode`` is disabled for
  GFX10 onwards.
- *enable_mem_ordered* defaults to 1 for GFX10 onwards.

The *.amd_kernel_code_t* directive must be placed immediately after the
function label and before any instructions.

For a full list of amd_kernel_code_t keys, refer to the AMDGPU ABI document,
the comments in lib/Target/AMDGPU/AmdKernelCodeT.h, and
test/CodeGen/AMDGPU/hsa.s.

.. _amdgpu-amdhsa-assembler-example-v2:

Code Object V2 Example Source Code
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. warning::
  Code object V2 is not the default code object version emitted by
  this version of LLVM.

Here is an example of a minimal assembly source file, defining one HSA kernel:

.. code::
  :number-lines:

  .hsa_code_object_version 1,0
  .hsa_code_object_isa

  .hsatext
  .globl  hello_world
  .p2align 8
  .amdgpu_hsa_kernel hello_world

  hello_world:

  .amd_kernel_code_t
     enable_sgpr_kernarg_segment_ptr = 1
     is_ptr64 = 1
     compute_pgm_rsrc1_vgprs = 0
     compute_pgm_rsrc1_sgprs = 0
     compute_pgm_rsrc2_user_sgpr = 2
     compute_pgm_rsrc1_wgp_mode = 0
     compute_pgm_rsrc1_mem_ordered = 0
     compute_pgm_rsrc1_fwd_progress = 1
  .end_amd_kernel_code_t

  s_load_dwordx2 s[0:1], s[0:1] 0x0
  v_mov_b32 v0, 3.14159
  s_waitcnt lgkmcnt(0)
  v_mov_b32 v1, s0
  v_mov_b32 v2, s1
  flat_store_dword v[1:2], v0
  s_endpgm
  .Lfunc_end0:
  .size   hello_world, .Lfunc_end0-hello_world

.. _amdgpu-amdhsa-assembler-predefined-symbols-v3-v4:

Code Object V3 to V4 Predefined Symbols
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The AMDGPU assembler defines and updates some symbols automatically. These
symbols do not affect code generation.

.amdgcn.gfx_generation_number
+++++++++++++++++++++++++++++

Set to the GFX major generation number of the target being assembled for. For
example, when assembling for a "GFX9" target this will be set to the integer
value "9". The possible GFX major generation numbers are presented in
:ref:`amdgpu-processors`.

.amdgcn.gfx_generation_minor
++++++++++++++++++++++++++++

Set to the GFX minor generation number of the target being assembled for. For
example, when assembling for a "GFX810" target this will be set to the integer
value "1". The possible GFX minor generation numbers are presented in
:ref:`amdgpu-processors`.

.amdgcn.gfx_generation_stepping
+++++++++++++++++++++++++++++++

Set to the GFX stepping generation number of the target being assembled for.
For example, when assembling for a "GFX704" target this will be set to the
integer value "4". The possible GFX stepping generation numbers are presented
in :ref:`amdgpu-processors`.

.. _amdgpu-amdhsa-assembler-symbol-next_free_vgpr:

.amdgcn.next_free_vgpr
++++++++++++++++++++++

Set to zero before assembly begins. At each instruction, if the current value
of this symbol is less than or equal to the maximum VGPR number explicitly
referenced within that instruction then the symbol value is updated to equal
that VGPR number plus one.

May be used to set the ``.amdhsa_next_free_vgpr`` directive in
:ref:`amdhsa-kernel-directives-table`.

May be set at any time, e.g. manually set to zero at the start of each kernel.

.. _amdgpu-amdhsa-assembler-symbol-next_free_sgpr:

.amdgcn.next_free_sgpr
++++++++++++++++++++++

Set to zero before assembly begins. At each instruction, if the current value
of this symbol is less than or equal to the maximum SGPR number explicitly
referenced within that instruction then the symbol value is updated to equal
that SGPR number plus one.

May be used to set the ``.amdhsa_next_free_sgpr`` directive in
:ref:`amdhsa-kernel-directives-table`.

May be set at any time, e.g. manually set to zero at the start of each kernel.

.. _amdgpu-amdhsa-assembler-directives-v3-v4:

Code Object V3 to V4 Directives
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Directives which begin with ``.amdgcn`` are valid for all ``amdgcn``
architecture processors, and are not OS-specific. Directives which begin with
``.amdhsa`` are specific to ``amdgcn`` architecture processors when the
``amdhsa`` OS is specified. See :ref:`amdgpu-target-triples` and
:ref:`amdgpu-processors`.

.. _amdgpu-assembler-directive-amdgcn-target:

.amdgcn_target <target-triple> "-" <target-id>
++++++++++++++++++++++++++++++++++++++++++++++

Optional directive which declares the ``<target-triple>-<target-id>`` supported
by the containing assembler source file. Used by the assembler to validate
command-line options such as ``-triple``, ``-mcpu``, and
``--offload-arch=<target-id>``. A non-canonical target ID is allowed. See
:ref:`amdgpu-target-triples` and :ref:`amdgpu-target-id`.

.. note::

  The target ID syntax used for code object V2 to V3 for this directive differs
  from that used elsewhere. See :ref:`amdgpu-target-id-v2-v3`.

.amdhsa_kernel <name>
+++++++++++++++++++++

Creates a correctly aligned AMDHSA kernel descriptor and a symbol,
``<name>.kd``, in the current location of the current section. Only valid when
the OS is ``amdhsa``. ``<name>`` must be a symbol that labels the first
instruction to execute, and does not need to be previously defined.

Marks the beginning of a list of directives used to generate the bytes of a
kernel descriptor, as described in :ref:`amdgpu-amdhsa-kernel-descriptor`.
Directives which may appear in this list are described in
:ref:`amdhsa-kernel-directives-table`. Directives may appear in any order, must
be valid for the target being assembled for, and cannot be repeated. Directives
support the range of values specified by the field they reference in
:ref:`amdgpu-amdhsa-kernel-descriptor`. If a directive is not specified, it is
assumed to have its default value, unless it is marked as "Required", in which
case it is an error to omit the directive. This list of directives is
terminated by an ``.end_amdhsa_kernel`` directive.

  .. table:: AMDHSA Kernel Assembler Directives
     :name: amdhsa-kernel-directives-table

     ======================================================== ================= ============ ===================
     Directive                                                Default           Supported On Description
     ======================================================== ================= ============ ===================
     ``.amdhsa_group_segment_fixed_size``                     0                 GFX6-GFX10   Controls GROUP_SEGMENT_FIXED_SIZE in :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
     ``.amdhsa_private_segment_fixed_size``                   0                 GFX6-GFX10   Controls PRIVATE_SEGMENT_FIXED_SIZE in :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
     ``.amdhsa_kernarg_size``                                 0                 GFX6-GFX10   Controls KERNARG_SIZE in :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
     ``.amdhsa_user_sgpr_private_segment_buffer``             0                 GFX6-GFX10   Controls ENABLE_SGPR_PRIVATE_SEGMENT_BUFFER in :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
     ``.amdhsa_user_sgpr_dispatch_ptr``                       0                 GFX6-GFX10   Controls ENABLE_SGPR_DISPATCH_PTR in :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
     ``.amdhsa_user_sgpr_queue_ptr``                          0                 GFX6-GFX10   Controls ENABLE_SGPR_QUEUE_PTR in :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
     ``.amdhsa_user_sgpr_kernarg_segment_ptr``                0                 GFX6-GFX10   Controls ENABLE_SGPR_KERNARG_SEGMENT_PTR in :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
     ``.amdhsa_user_sgpr_dispatch_id``                        0                 GFX6-GFX10   Controls ENABLE_SGPR_DISPATCH_ID in :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
     ``.amdhsa_user_sgpr_flat_scratch_init``                  0                 GFX6-GFX10   Controls ENABLE_SGPR_FLAT_SCRATCH_INIT in :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
     ``.amdhsa_user_sgpr_private_segment_size``               0                 GFX6-GFX10   Controls ENABLE_SGPR_PRIVATE_SEGMENT_SIZE in :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
     ``.amdhsa_wavefront_size32``                             Target            GFX10        Controls ENABLE_WAVEFRONT_SIZE32 in :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
                                                              Feature
                                                              Specific
                                                              (wavefrontsize64)
     ``.amdhsa_system_sgpr_private_segment_wavefront_offset`` 0                 GFX6-GFX10   Controls ENABLE_PRIVATE_SEGMENT in :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
     ``.amdhsa_system_sgpr_workgroup_id_x``                   1                 GFX6-GFX10   Controls ENABLE_SGPR_WORKGROUP_ID_X in :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
     ``.amdhsa_system_sgpr_workgroup_id_y``                   0                 GFX6-GFX10   Controls ENABLE_SGPR_WORKGROUP_ID_Y in :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
     ``.amdhsa_system_sgpr_workgroup_id_z``                   0                 GFX6-GFX10   Controls ENABLE_SGPR_WORKGROUP_ID_Z in :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
     ``.amdhsa_system_sgpr_workgroup_info``                   0                 GFX6-GFX10   Controls ENABLE_SGPR_WORKGROUP_INFO in :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
     ``.amdhsa_system_vgpr_workitem_id``                      0                 GFX6-GFX10   Controls ENABLE_VGPR_WORKITEM_ID in :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`. Possible values are defined in :ref:`amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table`.
     ``.amdhsa_next_free_vgpr``                               Required          GFX6-GFX10   Maximum VGPR number explicitly referenced, plus one. Used to calculate GRANULATED_WORKITEM_VGPR_COUNT in :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
     ``.amdhsa_next_free_sgpr``                               Required          GFX6-GFX10   Maximum SGPR number explicitly referenced, plus one. Used to calculate GRANULATED_WAVEFRONT_SGPR_COUNT in :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
     ``.amdhsa_accum_offset``                                 Required          GFX90A       Offset of the first AccVGPR in the unified register file. Used to calculate ACCUM_OFFSET in :ref:`amdgpu-amdhsa-compute_pgm_rsrc3-gfx90a-table`.
     ``.amdhsa_reserve_vcc``                                  1                 GFX6-GFX10   Whether the kernel may use the special VCC SGPR. Used to calculate GRANULATED_WAVEFRONT_SGPR_COUNT in :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
     ``.amdhsa_reserve_flat_scratch``                         1                 GFX7-GFX10   Whether the kernel may use flat instructions to access scratch memory. Used to calculate GRANULATED_WAVEFRONT_SGPR_COUNT in :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
     ``.amdhsa_reserve_xnack_mask``                           Target            GFX8-GFX10   Whether the kernel may trigger XNACK replay. Used to calculate GRANULATED_WAVEFRONT_SGPR_COUNT in :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
                                                              Feature
                                                              Specific
                                                              (xnack)
     ``.amdhsa_float_round_mode_32``                          0                 GFX6-GFX10   Controls FLOAT_ROUND_MODE_32 in :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`. Possible values are defined in :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
     ``.amdhsa_float_round_mode_16_64``                       0                 GFX6-GFX10   Controls FLOAT_ROUND_MODE_16_64 in :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`. Possible values are defined in :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
     ``.amdhsa_float_denorm_mode_32``                         0                 GFX6-GFX10   Controls FLOAT_DENORM_MODE_32 in :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`. Possible values are defined in :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
     ``.amdhsa_float_denorm_mode_16_64``                      3                 GFX6-GFX10   Controls FLOAT_DENORM_MODE_16_64 in :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`. Possible values are defined in :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
     ``.amdhsa_dx10_clamp``                                   1                 GFX6-GFX10   Controls ENABLE_DX10_CLAMP in :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
     ``.amdhsa_ieee_mode``                                    1                 GFX6-GFX10   Controls ENABLE_IEEE_MODE in :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
     ``.amdhsa_fp16_overflow``                                0                 GFX9-GFX10   Controls FP16_OVFL in :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
     ``.amdhsa_tg_split``                                     Target            GFX90A       Controls TG_SPLIT in :ref:`amdgpu-amdhsa-compute_pgm_rsrc3-gfx90a-table`.
                                                              Feature
                                                              Specific
                                                              (tgsplit)
     ``.amdhsa_workgroup_processor_mode``                     Target            GFX10        Controls ENABLE_WGP_MODE in :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
                                                              Feature
                                                              Specific
                                                              (cumode)
     ``.amdhsa_memory_ordered``                               1                 GFX10        Controls MEM_ORDERED in :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
     ``.amdhsa_forward_progress``                             0                 GFX10        Controls FWD_PROGRESS in :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
     ``.amdhsa_exception_fp_ieee_invalid_op``                 0                 GFX6-GFX10   Controls ENABLE_EXCEPTION_IEEE_754_FP_INVALID_OPERATION in :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
     ``.amdhsa_exception_fp_denorm_src``                      0                 GFX6-GFX10   Controls ENABLE_EXCEPTION_FP_DENORMAL_SOURCE in :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
     ``.amdhsa_exception_fp_ieee_div_zero``                   0                 GFX6-GFX10   Controls ENABLE_EXCEPTION_IEEE_754_FP_DIVISION_BY_ZERO in :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
     ``.amdhsa_exception_fp_ieee_overflow``                   0                 GFX6-GFX10   Controls ENABLE_EXCEPTION_IEEE_754_FP_OVERFLOW in :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
     ``.amdhsa_exception_fp_ieee_underflow``                  0                 GFX6-GFX10   Controls ENABLE_EXCEPTION_IEEE_754_FP_UNDERFLOW in :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
     ``.amdhsa_exception_fp_ieee_inexact``                    0                 GFX6-GFX10   Controls ENABLE_EXCEPTION_IEEE_754_FP_INEXACT in :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
     ``.amdhsa_exception_int_div_zero``                       0                 GFX6-GFX10   Controls ENABLE_EXCEPTION_INT_DIVIDE_BY_ZERO in :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
     ======================================================== ================= ============ ===================

.amdgpu_metadata
++++++++++++++++

Optional directive which declares the contents of the ``NT_AMDGPU_METADATA``
note record (see :ref:`amdgpu-elf-note-records-table-v3-v4`).

The contents must be in the [YAML]_ markup format, with the same structure and
semantics described in :ref:`amdgpu-amdhsa-code-object-metadata-v3` or
:ref:`amdgpu-amdhsa-code-object-metadata-v4`.

This directive is terminated by an ``.end_amdgpu_metadata`` directive.

.. _amdgpu-amdhsa-assembler-example-v3-v4:

Code Object V3 to V4 Example Source Code
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Here is an example of a minimal assembly source file, defining one HSA kernel:

.. code::
  :number-lines:

  .amdgcn_target "amdgcn-amd-amdhsa--gfx900+xnack" // optional

  .text
  .globl hello_world
  .p2align 8
  .type hello_world,@function
  hello_world:
    s_load_dwordx2 s[0:1], s[0:1] 0x0
    v_mov_b32 v0, 3.14159
    s_waitcnt lgkmcnt(0)
    v_mov_b32 v1, s0
    v_mov_b32 v2, s1
    flat_store_dword v[1:2], v0
    s_endpgm
  .Lfunc_end0:
    .size hello_world, .Lfunc_end0-hello_world

  .rodata
  .p2align 6
  .amdhsa_kernel hello_world
    .amdhsa_user_sgpr_kernarg_segment_ptr 1
    .amdhsa_next_free_vgpr .amdgcn.next_free_vgpr
    .amdhsa_next_free_sgpr .amdgcn.next_free_sgpr
  .end_amdhsa_kernel

  .amdgpu_metadata
  ---
  amdhsa.version:
    - 1
    - 0
  amdhsa.kernels:
    - .name: hello_world
      .symbol: hello_world.kd
      .kernarg_segment_size: 48
      .group_segment_fixed_size: 0
      .private_segment_fixed_size: 0
      .kernarg_segment_align: 4
      .wavefront_size: 64
      .sgpr_count: 2
      .vgpr_count: 3
      .max_flat_workgroup_size: 256
      .args:
        - .size: 8
          .offset: 0
          .value_kind: global_buffer
          .address_space: global
          .actual_access: write_only
  //...
  .end_amdgpu_metadata

This kernel is equivalent to the following HIP program:

.. code::
  :number-lines:

  __global__ void hello_world(float *p) {
      *p = 3.14159f;
  }

If an assembly source file contains multiple kernels and/or functions, the
:ref:`amdgpu-amdhsa-assembler-symbol-next_free_vgpr` and
:ref:`amdgpu-amdhsa-assembler-symbol-next_free_sgpr` symbols may be reset using
the ``.set <symbol>, <expression>`` directive.
For example, in the case of two
kernels, where ``func1`` is only called from ``kern1``, it is sufficient
to group the function with the kernel that calls it and reset the symbols
between the two connected components:

.. code::
  :number-lines:

  .amdgcn_target "amdgcn-amd-amdhsa--gfx900+xnack" // optional

  // gpr tracking symbols are implicitly set to zero

  .text
  .globl kern0
  .p2align 8
  .type kern0,@function
  kern0:
    // ...
    s_endpgm
  .Lkern0_end:
    .size kern0, .Lkern0_end-kern0

  .rodata
  .p2align 6
  .amdhsa_kernel kern0
    // ...
    .amdhsa_next_free_vgpr .amdgcn.next_free_vgpr
    .amdhsa_next_free_sgpr .amdgcn.next_free_sgpr
  .end_amdhsa_kernel

  // reset symbols to begin tracking usage in func1 and kern1
  .set .amdgcn.next_free_vgpr, 0
  .set .amdgcn.next_free_sgpr, 0

  .text
  .hidden func1
  .global func1
  .p2align 2
  .type func1,@function
  func1:
    // ...
    s_setpc_b64 s[30:31]
  .Lfunc1_end:
    .size func1, .Lfunc1_end-func1

  .globl kern1
  .p2align 8
  .type kern1,@function
  kern1:
    // ...
    s_getpc_b64 s[4:5]
    s_add_u32 s4, s4, func1@rel32@lo+4
    s_addc_u32 s5, s5, func1@rel32@hi+4
    s_swappc_b64 s[30:31], s[4:5]
    // ...
    s_endpgm
  .Lkern1_end:
    .size kern1, .Lkern1_end-kern1

  .rodata
  .p2align 6
  .amdhsa_kernel kern1
    // ...
    .amdhsa_next_free_vgpr .amdgcn.next_free_vgpr
    .amdhsa_next_free_sgpr .amdgcn.next_free_sgpr
  .end_amdhsa_kernel

These symbols cannot identify connected components in order to automatically
track the usage for each kernel.
However, in some cases careful organization of
the kernels and functions in the source file means there is minimal additional
effort required to accurately calculate GPR usage.

Additional Documentation
========================

.. [AMD-GCN-GFX6] `AMD Southern Islands Series ISA <http://developer.amd.com/wordpress/media/2012/12/AMD_Southern_Islands_Instruction_Set_Architecture.pdf>`__
.. [AMD-GCN-GFX7] `AMD Sea Islands Series ISA <http://developer.amd.com/wordpress/media/2013/07/AMD_Sea_Islands_Instruction_Set_Architecture.pdf>`__
.. [AMD-GCN-GFX8] `AMD GCN3 Instruction Set Architecture <http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2013/12/AMD_GCN3_Instruction_Set_Architecture_rev1.1.pdf>`__
.. [AMD-GCN-GFX900-GFX904-VEGA] `AMD Vega Instruction Set Architecture <http://developer.amd.com/wordpress/media/2013/12/Vega_Shader_ISA_28July2017.pdf>`__
.. [AMD-GCN-GFX906-VEGA7NM] `AMD Vega 7nm Instruction Set Architecture <https://gpuopen.com/wp-content/uploads/2019/11/Vega_7nm_Shader_ISA_26November2019.pdf>`__
.. [AMD-GCN-GFX908-CDNA1] `AMD Instinct MI100 Instruction Set Architecture <https://developer.amd.com/wp-content/resources/CDNA1_Shader_ISA_14December2020.pdf>`__
.. [AMD-GCN-GFX10-RDNA1] `AMD RDNA 1.0 Instruction Set Architecture <https://gpuopen.com/wp-content/uploads/2019/08/RDNA_Shader_ISA_5August2019.pdf>`__
.. [AMD-GCN-GFX10-RDNA2] `AMD RDNA 2 Instruction Set Architecture <https://developer.amd.com/wp-content/resources/RDNA2_Shader_ISA_November2020.pdf>`__
.. [AMD-RADEON-HD-2000-3000] `AMD R6xx shader ISA <http://developer.amd.com/wordpress/media/2012/10/R600_Instruction_Set_Architecture.pdf>`__
.. [AMD-RADEON-HD-4000] `AMD R7xx shader ISA <http://developer.amd.com/wordpress/media/2012/10/R700-Family_Instruction_Set_Architecture.pdf>`__
.. [AMD-RADEON-HD-5000] `AMD Evergreen shader ISA <http://developer.amd.com/wordpress/media/2012/10/AMD_Evergreen-Family_Instruction_Set_Architecture.pdf>`__
.. [AMD-RADEON-HD-6000] `AMD Cayman/Trinity shader ISA <http://developer.amd.com/wordpress/media/2012/10/AMD_HD_6900_Series_Instruction_Set_Architecture.pdf>`__
.. [AMD-ROCm] `AMD ROCm™ Platform <https://rocmdocs.amd.com/>`__
.. [AMD-ROCm-github] `AMD ROCm™ github <http://github.com/RadeonOpenCompute>`__
.. [AMD-ROCm-Release-Notes] `AMD ROCm Release Notes <https://github.com/RadeonOpenCompute/ROCm>`__
.. [CLANG-ATTR] `Attributes in Clang <https://clang.llvm.org/docs/AttributeReference.html>`__
.. [DWARF] `DWARF Debugging Information Format <http://dwarfstd.org/>`__
.. [ELF] `Executable and Linkable Format (ELF) <http://www.sco.com/developers/gabi/>`__
.. [HRF] `Heterogeneous-race-free Memory Models <http://benedictgaster.org/wp-content/uploads/2014/01/asplos269-FINAL.pdf>`__
.. [HSA] `Heterogeneous System Architecture (HSA) Foundation <http://www.hsafoundation.com/>`__
.. [MsgPack] `Message Pack <http://www.msgpack.org/>`__
.. [OpenCL] `The OpenCL Specification Version 2.0 <http://www.khronos.org/registry/cl/specs/opencl-2.0.pdf>`__
.. [SEMVER] `Semantic Versioning <https://semver.org/>`__
.. [YAML] `YAML Ain't Markup Language (YAML™) Version 1.2 <http://www.yaml.org/spec/1.2/spec.html>`__