=============================
User Guide for AMDGPU Backend
=============================

.. contents::
   :local:

.. toctree::
   :hidden:

   AMDGPU/AMDGPUAsmGFX7
   AMDGPU/AMDGPUAsmGFX8
   AMDGPU/AMDGPUAsmGFX9
   AMDGPU/AMDGPUAsmGFX900
   AMDGPU/AMDGPUAsmGFX904
   AMDGPU/AMDGPUAsmGFX906
   AMDGPU/AMDGPUAsmGFX908
   AMDGPU/AMDGPUAsmGFX10
   AMDGPU/AMDGPUAsmGFX1011
   AMDGPUModifierSyntax
   AMDGPUOperandSyntax
   AMDGPUInstructionSyntax
   AMDGPUInstructionNotation
   AMDGPUDwarfProposalForHeterogeneousDebugging

Introduction
============

The AMDGPU backend provides ISA code generation for AMD GPUs, starting with the
R600 family up until the current GCN families. It lives in the
``llvm/lib/Target/AMDGPU`` directory.

LLVM
====

.. _amdgpu-target-triples:

Target Triples
--------------

Use the ``clang -target <Architecture>-<Vendor>-<OS>-<Environment>`` option to
specify the target triple:

  .. table:: AMDGPU Architectures
     :name: amdgpu-architecture-table

     ============ ==============================================================
     Architecture Description
     ============ ==============================================================
     ``r600``     AMD GPUs HD2XXX-HD6XXX for graphics and compute shaders.
     ``amdgcn``   AMD GPUs GCN GFX6 onwards for graphics and compute shaders.
     ============ ==============================================================

  .. table:: AMDGPU Vendors
     :name: amdgpu-vendor-table

     ============ ==============================================================
     Vendor       Description
     ============ ==============================================================
     ``amd``      Can be used for all AMD GPU usage.
     ``mesa3d``   Can be used if the OS is ``mesa3d``.
     ============ ==============================================================

  .. table:: AMDGPU Operating Systems
     :name: amdgpu-os-table

     ============== ============================================================
     OS             Description
     ============== ============================================================
     *<empty>*      Defaults to the *unknown* OS.
     ``amdhsa``     Compute kernels executed on HSA [HSA]_ compatible runtimes
                    such as AMD's ROCm [AMD-ROCm]_.
     ``amdpal``     Graphic shaders and compute kernels executed on AMD PAL
                    runtime.
     ``mesa3d``     Graphic shaders and compute kernels executed on Mesa 3D
                    runtime.
     ============== ============================================================

  .. table:: AMDGPU Environments
     :name: amdgpu-environment-table

     ============ ==============================================================
     Environment  Description
     ============ ==============================================================
     *<empty>*    Default.
     ============ ==============================================================

.. _amdgpu-processors:

Processors
----------

Use the ``clang -mcpu <Processor>`` option to specify the AMDGPU processor. The
names from both the *Processor* and *Alternative Processor* can be used.

  .. table:: AMDGPU Processors
     :name: amdgpu-processor-table

     =========== =============== ============ ===== ================= ======= ======================
     Processor   Alternative     Target       dGPU/ Target            ROCm    Example
                 Processor       Triple       APU   Features          Support Products
                                 Architecture       Supported
                                                    [Default]
     =========== =============== ============ ===== ================= ======= ======================
     **Radeon HD 2000/3000 Series (R600)** [AMD-RADEON-HD-2000-3000]_
     -----------------------------------------------------------------------------------------------
     ``r600``                    ``r600``     dGPU
     ``r630``                    ``r600``     dGPU
     ``rs880``                   ``r600``     dGPU
     ``rv670``                   ``r600``     dGPU
     **Radeon HD 4000 Series (R700)** [AMD-RADEON-HD-4000]_
     -----------------------------------------------------------------------------------------------
     ``rv710``                   ``r600``     dGPU
     ``rv730``                   ``r600``     dGPU
     ``rv770``                   ``r600``     dGPU
     **Radeon HD 5000 Series (Evergreen)** [AMD-RADEON-HD-5000]_
     -----------------------------------------------------------------------------------------------
     ``cedar``                   ``r600``     dGPU
     ``cypress``                 ``r600``     dGPU
     ``juniper``                 ``r600``     dGPU
     ``redwood``                 ``r600``     dGPU
     ``sumo``                    ``r600``     dGPU
     **Radeon HD 6000 Series (Northern Islands)** [AMD-RADEON-HD-6000]_
     -----------------------------------------------------------------------------------------------
     ``barts``                   ``r600``     dGPU
     ``caicos``                  ``r600``     dGPU
     ``cayman``                  ``r600``     dGPU
     ``turks``                   ``r600``     dGPU
     **GCN GFX6 (Southern Islands (SI))** [AMD-GCN-GFX6]_
     -----------------------------------------------------------------------------------------------
     ``gfx600``  - ``tahiti``    ``amdgcn``   dGPU
     ``gfx601``  - ``hainan``    ``amdgcn``   dGPU
                 - ``oland``
                 - ``pitcairn``
                 - ``verde``
     **GCN GFX7 (Sea Islands (CI))** [AMD-GCN-GFX7]_
     -----------------------------------------------------------------------------------------------
     ``gfx700``  - ``kaveri``    ``amdgcn``   APU                             - A6-7000
                                                                              - A6 Pro-7050B
                                                                              - A8-7100
                                                                              - A8 Pro-7150B
                                                                              - A10-7300
                                                                              - A10 Pro-7350B
                                                                              - FX-7500
                                                                              - A8-7200P
                                                                              - A10-7400P
                                                                              - FX-7600P
     ``gfx701``  - ``hawaii``    ``amdgcn``   dGPU                    ROCm    - FirePro W8100
                                                                              - FirePro W9100
                                                                              - FirePro S9150
                                                                              - FirePro S9170
     ``gfx702``                  ``amdgcn``   dGPU                    ROCm    - Radeon R9 290
                                                                              - Radeon R9 290x
                                                                              - Radeon R390
                                                                              - Radeon R390x
     ``gfx703``  - ``kabini``    ``amdgcn``   APU                             - E1-2100
                 - ``mullins``                                                - E1-2200
                                                                              - E1-2500
                                                                              - E2-3000
                                                                              - E2-3800
                                                                              - A4-5000
                                                                              - A4-5100
                                                                              - A6-5200
                                                                              - A4 Pro-3340B
     ``gfx704``  - ``bonaire``   ``amdgcn``   dGPU                            - Radeon HD 7790
                                                                              - Radeon HD 8770
                                                                              - R7 260
                                                                              - R7 260X
     **GCN GFX8 (Volcanic Islands (VI))** [AMD-GCN-GFX8]_
     -----------------------------------------------------------------------------------------------
     ``gfx801``  - ``carrizo``   ``amdgcn``   APU   - xnack                   - A6-8500P
                                                      [on]                    - Pro A6-8500B
                                                                              - A8-8600P
                                                                              - Pro A8-8600B
                                                                              - FX-8800P
                                                                              - Pro A12-8800B
     \                           ``amdgcn``   APU   - xnack           ROCm    - A10-8700P
                                                      [on]                    - Pro A10-8700B
                                                                              - A10-8780P
     \                           ``amdgcn``   APU   - xnack                   - A10-9600P
                                                      [on]                    - A10-9630P
                                                                              - A12-9700P
                                                                              - A12-9730P
                                                                              - FX-9800P
                                                                              - FX-9830P
     \                           ``amdgcn``   APU   - xnack                   - E2-9010
                                                      [on]                    - A6-9210
                                                                              - A9-9410
     ``gfx802``  - ``iceland``   ``amdgcn``   dGPU  - xnack           ROCm    - FirePro S7150
                 - ``tonga``                          [off]                   - FirePro S7100
                                                                              - FirePro W7100
                                                                              - Radeon R285
                                                                              - Radeon R9 380
                                                                              - Radeon R9 385
                                                                              - Mobile FirePro
                                                                                M7170
     ``gfx803``  - ``fiji``      ``amdgcn``   dGPU  - xnack           ROCm    - Radeon R9 Nano
                                                      [off]                   - Radeon R9 Fury
                                                                              - Radeon R9 FuryX
                                                                              - Radeon Pro Duo
                                                                              - FirePro S9300x2
                                                                              - Radeon Instinct MI8
     \           - ``polaris10`` ``amdgcn``   dGPU  - xnack           ROCm    - Radeon RX 470
                                                      [off]                   - Radeon RX 480
                                                                              - Radeon Instinct MI6
     \           - ``polaris11`` ``amdgcn``   dGPU  - xnack           ROCm    - Radeon RX 460
                                                      [off]
     ``gfx810``  - ``stoney``    ``amdgcn``   APU   - xnack
                                                      [on]
     **GCN GFX9** [AMD-GCN-GFX9]_
     -----------------------------------------------------------------------------------------------
     ``gfx900``                  ``amdgcn``   dGPU  - xnack           ROCm    - Radeon Vega
                                                      [off]                     Frontier Edition
                                                                              - Radeon RX Vega 56
                                                                              - Radeon RX Vega 64
                                                                              - Radeon RX Vega 64
                                                                                Liquid
                                                                              - Radeon Instinct MI25
     ``gfx902``                  ``amdgcn``   APU   - xnack                   - Ryzen 3 2200G
                                                      [on]                    - Ryzen 5 2400G
     ``gfx904``                  ``amdgcn``   dGPU  - xnack                   *TBA*
                                                      [off]
                                                                              .. TODO::
                                                                                 Add product
                                                                                 names.
     ``gfx906``                  ``amdgcn``   dGPU  - xnack                   - Radeon Instinct MI50
                                                      [off]                   - Radeon Instinct MI60
                                                                              - Radeon VII
                                                                              - Radeon Pro VII
     ``gfx908``                  ``amdgcn``   dGPU  - xnack                   *TBA*
                                                      [off]
                                                    - sram-ecc
                                                      [on]
                                                                              .. TODO::
                                                                                 Add product
                                                                                 names.
     ``gfx909``                  ``amdgcn``   APU   - xnack                   *TBA*
                                                      [on]
                                                                              .. TODO::
                                                                                 Add product
                                                                                 names.
     **GCN GFX10** [AMD-GCN-GFX10]_
     -----------------------------------------------------------------------------------------------
     ``gfx1010``                 ``amdgcn``   dGPU  - xnack                   - Radeon RX 5700
                                                      [off]                   - Radeon RX 5700 XT
                                                    - wavefrontsize64         - Radeon Pro 5600 XT
                                                      [off]
                                                    - cumode
                                                      [off]
     ``gfx1011``                 ``amdgcn``   dGPU  - xnack                   - Radeon Pro 5600M
                                                      [off]
                                                    - wavefrontsize64
                                                      [off]
                                                    - cumode
                                                      [off]
     ``gfx1012``                 ``amdgcn``   dGPU  - xnack                   - Radeon RX 5500
                                                      [off]                   - Radeon RX 5500 XT
                                                    - wavefrontsize64
                                                      [off]
                                                    - cumode
                                                      [off]
     ``gfx1030``                 ``amdgcn``   dGPU  - wavefrontsize64         *TBA*
                                                      [off]
                                                    - cumode
                                                      [off]
                                                                              .. TODO::
                                                                                 Add product
                                                                                 names.
     =========== =============== ============ ===== ================= ======= ======================

.. _amdgpu-target-features:

Target Features
---------------

Target features control how code is generated to support certain
processor specific features. Not all target features are supported by
all processors.
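As a sketch of how the triple, processor, and feature options described here
combine on a compile line (the input and output file names are hypothetical;
any processor from the table above can be substituted):

```shell
# Hypothetical device compile targeting a gfx906 dGPU on the amdhsa (ROCm) OS.
clang -target amdgcn-amd-amdhsa -mcpu=gfx906 -c kernel.cl -o kernel.o

# Same processor, with the xnack target feature explicitly disabled.
clang -target amdgcn-amd-amdhsa -mcpu=gfx906 -mno-xnack -c kernel.cl -o kernel.o
```

In practice a language driver (HIP, OpenCL, OpenMP offload) usually supplies
these options; they are shown explicitly only to illustrate how the pieces fit
together.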

The runtime must ensure that the features supported by the
device used to execute the code match the features enabled when
generating the code. A mismatch of features may result in incorrect
execution, or a reduction in performance.

The target features supported by each processor, and the default value
used if not specified explicitly, are listed in
:ref:`amdgpu-processor-table`.

Use the ``clang -m[no-]<TargetFeature>`` option to specify the AMDGPU
target features.

For example:

``-mxnack``
  Enable the ``xnack`` feature.
``-mno-xnack``
  Disable the ``xnack`` feature.

  .. table:: AMDGPU Target Features
     :name: amdgpu-target-feature-table

     ====================== ==================================================
     Target Feature         Description
     ====================== ==================================================
     -m[no-]xnack           Enable/disable generating code that has
                            memory clauses that are compatible with
                            having XNACK replay enabled.

                            This is used for demand paging and page
                            migration. If XNACK replay is enabled in
                            the device, then if a page fault occurs
                            the code may execute incorrectly if the
                            ``xnack`` feature is not enabled. Executing
                            code that has the feature enabled on a
                            device that does not have XNACK replay
                            enabled will execute correctly but may
                            be less performant than code with the
                            feature disabled.

     -m[no-]sram-ecc        Enable/disable generating code that assumes SRAM
                            ECC is enabled/disabled.

     -m[no-]wavefrontsize64 Control the default wavefront size used when
                            generating code for kernels. When disabled
                            native wavefront size 32 is used, when enabled
                            wavefront size 64 is used.

     -m[no-]cumode          Control the default wavefront execution mode used
                            when generating code for kernels. When disabled
                            native WGP wavefront execution mode is used,
                            when enabled CU wavefront execution mode is used
                            (see :ref:`amdgpu-amdhsa-memory-model`).
     ====================== ==================================================

.. _amdgpu-address-spaces:

Address Spaces
--------------

The AMDGPU architecture supports a number of memory address spaces. The address
space names use the OpenCL standard names, with some additions.

The AMDGPU address spaces correspond to target architecture specific LLVM
address space numbers used in LLVM IR.

The AMDGPU address spaces are described in
:ref:`amdgpu-address-spaces-table`. Only 64-bit process address spaces are
supported for the ``amdgcn`` target.

  .. table:: AMDGPU Address Spaces
     :name: amdgpu-address-spaces-table

     ================================= =============== =========== ================ ======= ============================
     ..                                                                             64-Bit Process Address Space
     --------------------------------- --------------- ----------- ---------------- ------------------------------------
     Address Space Name                LLVM IR Address HSA Segment Hardware         Address NULL Value
                                       Space Number    Name        Name             Size
     ================================= =============== =========== ================ ======= ============================
     Generic                           0               flat        flat             64      0x0000000000000000
     Global                            1               global      global           64      0x0000000000000000
     Region                            2               N/A         GDS              32      *not implemented for AMDHSA*
     Local                             3               group       LDS              32      0xFFFFFFFF
     Constant                          4               constant    *same as global* 64      0x0000000000000000
     Private                           5               private     scratch          32      0xFFFFFFFF
     Constant 32-bit                   6               *TODO*                               0x00000000
     Buffer Fat Pointer (experimental) 7               *TODO*
     ================================= =============== =========== ================ ======= ============================

**Generic**
  The generic address space uses the hardware flat address support available in
  GFX7-GFX10.
  This uses two fixed ranges of virtual addresses (the private and
  local apertures), that are outside the range of addressable global memory, to
  map from a flat address to a private or local address.

  FLAT instructions can take a flat address and access global, private
  (scratch), and group (LDS) memory depending on whether the address is within
  one of the aperture ranges. Flat access to scratch requires hardware aperture
  setup and setup in the kernel prologue (see
  :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`). Flat access to LDS requires
  hardware aperture setup and M0 (GFX7-GFX8) register setup (see
  :ref:`amdgpu-amdhsa-kernel-prolog-m0`).

  To convert between a private or group address space address (termed a segment
  address) and a flat address, the base address of the corresponding aperture
  can be used. For GFX7-GFX8 these are available in the
  :ref:`amdgpu-amdhsa-hsa-aql-queue`, the address of which can be obtained with
  Queue Ptr SGPR (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). For
  GFX9-GFX10 the aperture base addresses are directly available as inline
  constant registers ``SRC_SHARED_BASE/LIMIT`` and ``SRC_PRIVATE_BASE/LIMIT``.
  In 64-bit address mode the aperture sizes are 2^32 bytes and the base is
  aligned to 2^32, which makes it easier to convert between flat and segment
  addresses.

  A global address space address has the same value when used as a flat address
  so no conversion is needed.

**Global and Constant**
  The global and constant address spaces both use global virtual addresses,
  which are the same virtual address space used by the CPU. However, some
  virtual addresses may only be accessible to the CPU, some only accessible
  by the GPU, and some by both.

  Using the constant address space indicates that the data will not change
  during the execution of the kernel. This allows scalar read instructions to
  be used.
  The vector and scalar L1 caches are invalidated of volatile data
  before each kernel dispatch execution to allow constant memory to change
  values between kernel dispatches.

**Region**
  The region address space uses the hardware Global Data Store (GDS). All
  wavefronts executing on the same device will access the same memory for any
  given region address. However, the same region address accessed by wavefronts
  executing on different devices will access different memory. It is higher
  performance than global memory. It is allocated by the runtime. The data
  store (DS) instructions can be used to access it.

**Local**
  The local address space uses the hardware Local Data Store (LDS) which is
  automatically allocated when the hardware creates the wavefronts of a
  work-group, and freed when all the wavefronts of a work-group have
  terminated. All wavefronts belonging to the same work-group will access the
  same memory for any given local address. However, the same local address
  accessed by wavefronts belonging to different work-groups will access
  different memory. It is higher performance than global memory. The data store
  (DS) instructions can be used to access it.

**Private**
  The private address space uses the hardware scratch memory support which
  automatically allocates memory when it creates a wavefront and frees it when
  a wavefront terminates. The memory accessed by a lane of a wavefront for any
  given private address will be different from the memory accessed by another
  lane of the same or different wavefront for the same private address.

  If a kernel dispatch uses scratch, then the hardware allocates memory from a
  pool of backing memory allocated by the runtime for each wavefront. The lanes
  of the wavefront access this using dword (4 byte) interleaving.
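As a quick sanity check of the dword interleaving described in this section,
the per-lane backing-memory address can be computed in shell arithmetic. The
scratch base address and wavefront size below are made-up example values, not
taken from this document:

```shell
# Backing-memory address for (lane, private-address) under the dword-interleaved
# scratch mapping: base + (private/4)*wavefront_size*4 + lane*4 + private%4.
# A base of 0x100000 and wavefront size of 64 are illustrative assumptions.
backing_addr() {
  base=$1; wavefront_size=$2; lane=$3; private=$4
  echo $(( base + (private / 4) * wavefront_size * 4 + lane * 4 + private % 4 ))
}

backing_addr $((0x100000)) 64 0 0   # lane 0, private address 0
backing_addr $((0x100000)) 64 1 0   # lane 1, private address 0
backing_addr $((0x100000)) 64 0 4   # lane 0, private address 4
```

Lanes 0 and 1 at the same private address land 4 bytes apart, so a wavefront
reading one private dword touches a single contiguous
``wavefront-size * 4`` byte span rather than scattered cache lines.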
  The mapping used from private address to backing memory address is:

  ``wavefront-scratch-base +
  ((private-address / 4) * wavefront-size * 4) +
  (wavefront-lane-id * 4) + (private-address % 4)``

  If each lane of a wavefront accesses the same private address, the
  interleaving results in adjacent dwords being accessed and hence requires
  fewer cache lines to be fetched.

  There are different ways that the wavefront scratch base address is
  determined by a wavefront (see
  :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).

  Scratch memory can be accessed in an interleaved manner using buffer
  instructions with the scratch buffer descriptor and per wavefront scratch
  offset, by the scratch instructions, or by flat instructions. Multi-dword
  access is not supported except by flat and scratch instructions in
  GFX9-GFX10.

**Constant 32-bit**
  *TODO*

**Buffer Fat Pointer**
  The buffer fat pointer is an experimental address space that is currently
  unsupported in the backend. It exposes a non-integral pointer that is in
  the future intended to support the modelling of 128-bit buffer descriptors
  plus a 32-bit offset into the buffer (in total encapsulating a 160-bit
  *pointer*), allowing normal LLVM load/store/atomic operations to be used to
  model the buffer descriptors used heavily in graphics workloads targeting
  the backend.

.. _amdgpu-memory-scopes:

Memory Scopes
-------------

This section provides LLVM memory synchronization scopes supported by the AMDGPU
backend memory model when the target triple OS is ``amdhsa`` (see
:ref:`amdgpu-amdhsa-memory-model` and :ref:`amdgpu-target-triples`).

The memory model supported is based on the HSA memory model [HSA]_ which is
based in turn on HRF-indirect with scope inclusion [HRF]_.
The happens-before
relation is transitive over the synchronizes-with relation independent of scope,
and synchronizes-with allows the memory scope instances to be inclusive (see
table :ref:`amdgpu-amdhsa-llvm-sync-scopes-table`).

This is different from the OpenCL [OpenCL]_ memory model, which does not have
scope inclusion and requires the memory scopes to exactly match. However, this
is conservatively correct for OpenCL.

  .. table:: AMDHSA LLVM Sync Scopes
     :name: amdgpu-amdhsa-llvm-sync-scopes-table

     ======================= ===================================================
     LLVM Sync Scope         Description
     ======================= ===================================================
     *none*                  The default: ``system``.

                             Synchronizes with, and participates in modification
                             and seq_cst total orderings with, other operations
                             (except image operations) for all address spaces
                             (except private, or generic that accesses private)
                             provided the other operation's sync scope is:

                             - ``system``.
                             - ``agent`` and executed by a thread on the same
                               agent.
                             - ``workgroup`` and executed by a thread in the
                               same work-group.
                             - ``wavefront`` and executed by a thread in the
                               same wavefront.

     ``agent``               Synchronizes with, and participates in modification
                             and seq_cst total orderings with, other operations
                             (except image operations) for all address spaces
                             (except private, or generic that accesses private)
                             provided the other operation's sync scope is:

                             - ``system`` or ``agent`` and executed by a thread
                               on the same agent.
                             - ``workgroup`` and executed by a thread in the
                               same work-group.
                             - ``wavefront`` and executed by a thread in the
                               same wavefront.

     ``workgroup``           Synchronizes with, and participates in modification
                             and seq_cst total orderings with, other operations
                             (except image operations) for all address spaces
                             (except private, or generic that accesses private)
                             provided the other operation's sync scope is:

                             - ``system``, ``agent`` or ``workgroup`` and
                               executed by a thread in the same work-group.
                             - ``wavefront`` and executed by a thread in the
                               same wavefront.

     ``wavefront``           Synchronizes with, and participates in modification
                             and seq_cst total orderings with, other operations
                             (except image operations) for all address spaces
                             (except private, or generic that accesses private)
                             provided the other operation's sync scope is:

                             - ``system``, ``agent``, ``workgroup`` or
                               ``wavefront`` and executed by a thread in the
                               same wavefront.

     ``singlethread``        Only synchronizes with and participates in
                             modification and seq_cst total orderings with,
                             other operations (except image operations) running
                             in the same thread for all address spaces (for
                             example, in signal handlers).

     ``one-as``              Same as ``system`` but only synchronizes with other
                             operations within the same address space.

     ``agent-one-as``        Same as ``agent`` but only synchronizes with other
                             operations within the same address space.

     ``workgroup-one-as``    Same as ``workgroup`` but only synchronizes with
                             other operations within the same address space.

     ``wavefront-one-as``    Same as ``wavefront`` but only synchronizes with
                             other operations within the same address space.

     ``singlethread-one-as`` Same as ``singlethread`` but only synchronizes with
                             other operations within the same address space.
     ======================= ===================================================

LLVM IR Intrinsics
------------------

The AMDGPU backend implements the following LLVM IR intrinsics.

*This section is WIP.*

.. TODO::

   List AMDGPU intrinsics.

LLVM IR Attributes
------------------

The AMDGPU backend supports the following LLVM IR attributes.

  .. table:: AMDGPU LLVM IR Attributes
     :name: amdgpu-llvm-ir-attributes-table

     ======================================= ==========================================================
     LLVM Attribute                          Description
     ======================================= ==========================================================
     "amdgpu-flat-work-group-size"="min,max" Specify the minimum and maximum flat work group sizes that
                                             will be specified when the kernel is dispatched. Generated
                                             by the ``amdgpu_flat_work_group_size`` CLANG attribute [CLANG-ATTR]_.
     "amdgpu-implicitarg-num-bytes"="n"      Number of kernel argument bytes to add to the kernel
                                             argument block size for the implicit arguments. This
                                             varies by OS and language (for OpenCL see
                                             :ref:`opencl-kernel-implicit-arguments-appended-for-amdhsa-os-table`).
     "amdgpu-num-sgpr"="n"                   Specifies the number of SGPRs to use. Generated by
                                             the ``amdgpu_num_sgpr`` CLANG attribute [CLANG-ATTR]_.
     "amdgpu-num-vgpr"="n"                   Specifies the number of VGPRs to use. Generated by the
                                             ``amdgpu_num_vgpr`` CLANG attribute [CLANG-ATTR]_.
     "amdgpu-waves-per-eu"="m,n"             Specify the minimum and maximum number of waves per
                                             execution unit. Generated by the ``amdgpu_waves_per_eu``
                                             CLANG attribute [CLANG-ATTR]_.
     "amdgpu-ieee" true/false.               Specify whether the function expects the IEEE field of the
                                             mode register to be set on entry. Overrides the default for
                                             the calling convention.
     "amdgpu-dx10-clamp" true/false.         Specify whether the function expects the DX10_CLAMP field of
                                             the mode register to be set on entry. Overrides the default
                                             for the calling convention.
     ======================================= ==========================================================

.. _amdgpu-elf-code-object:

ELF Code Object
===============

The AMDGPU backend generates a standard ELF [ELF]_ relocatable code object that
can be linked by ``lld`` to produce a standard ELF shared code object which can
be loaded and executed on an AMDGPU target.

.. _amdgpu-elf-header:

Header
------

The AMDGPU backend uses the following ELF header:

  .. table:: AMDGPU ELF Header
     :name: amdgpu-elf-header-table

     ========================== ===============================
     Field                      Value
     ========================== ===============================
     ``e_ident[EI_CLASS]``      ``ELFCLASS64``
     ``e_ident[EI_DATA]``       ``ELFDATA2LSB``
     ``e_ident[EI_OSABI]``      - ``ELFOSABI_NONE``
                                - ``ELFOSABI_AMDGPU_HSA``
                                - ``ELFOSABI_AMDGPU_PAL``
                                - ``ELFOSABI_AMDGPU_MESA3D``
     ``e_ident[EI_ABIVERSION]`` - ``ELFABIVERSION_AMDGPU_HSA``
                                - ``ELFABIVERSION_AMDGPU_PAL``
                                - ``ELFABIVERSION_AMDGPU_MESA3D``
     ``e_type``                 - ``ET_REL``
                                - ``ET_DYN``
     ``e_machine``              ``EM_AMDGPU``
     ``e_entry``                0
     ``e_flags``                See :ref:`amdgpu-elf-header-e_flags-table`
     ========================== ===============================

..

  .. table:: AMDGPU ELF Header Enumeration Values
     :name: amdgpu-elf-header-enumeration-values-table

     =============================== =====
     Name                            Value
     =============================== =====
     ``EM_AMDGPU``                   224
     ``ELFOSABI_NONE``               0
     ``ELFOSABI_AMDGPU_HSA``         64
     ``ELFOSABI_AMDGPU_PAL``         65
     ``ELFOSABI_AMDGPU_MESA3D``      66
     ``ELFABIVERSION_AMDGPU_HSA``    1
     ``ELFABIVERSION_AMDGPU_PAL``    0
     ``ELFABIVERSION_AMDGPU_MESA3D`` 0
     =============================== =====

``e_ident[EI_CLASS]``
  The ELF class is:

  * ``ELFCLASS32`` for the ``r600`` architecture.

  * ``ELFCLASS64`` for the ``amdgcn`` architecture, which only supports 64-bit
    process address space applications.
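To see these header fields on an actual code object, the standard LLVM binary
utilities can be used; ``kernel.o`` here is a hypothetical relocatable code
object produced by the backend, not a file from this document:

```shell
# Dump the ELF file header (class, data encoding, OS ABI, e_machine, e_flags)
# of a hypothetical AMDGPU code object.
llvm-readobj --file-headers kernel.o

# Equivalent GNU readelf-style output.
llvm-readelf --file-headers kernel.o
```

The ``Machine`` field should report ``EM_AMDGPU`` and the flags word encodes
the processor as described under ``e_flags`` below.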

``e_ident[EI_DATA]``
  All AMDGPU targets use ``ELFDATA2LSB`` for little-endian byte ordering.

``e_ident[EI_OSABI]``
  One of the following AMDGPU target architecture specific OS ABIs
  (see :ref:`amdgpu-os-table`):

  * ``ELFOSABI_NONE`` for the *unknown* OS.

  * ``ELFOSABI_AMDGPU_HSA`` for the ``amdhsa`` OS.

  * ``ELFOSABI_AMDGPU_PAL`` for the ``amdpal`` OS.

  * ``ELFOSABI_AMDGPU_MESA3D`` for the ``mesa3d`` OS.

``e_ident[EI_ABIVERSION]``
  The ABI version of the AMDGPU target architecture specific OS ABI to which
  the code object conforms:

  * ``ELFABIVERSION_AMDGPU_HSA`` is used to specify the version of the AMD HSA
    runtime ABI.

  * ``ELFABIVERSION_AMDGPU_PAL`` is used to specify the version of the AMD PAL
    runtime ABI.

  * ``ELFABIVERSION_AMDGPU_MESA3D`` is used to specify the version of the AMD
    Mesa 3D runtime ABI.

``e_type``
  Can be one of the following values:

  ``ET_REL``
    The type produced by the AMDGPU backend compiler, as it is a relocatable
    code object.

  ``ET_DYN``
    The type produced by the linker, as it is a shared code object.

  The AMD HSA runtime loader requires an ``ET_DYN`` code object.

``e_machine``
  The value ``EM_AMDGPU`` is used for the machine for all processors supported
  by the ``r600`` and ``amdgcn`` architectures (see
  :ref:`amdgpu-processor-table`). The specific processor is specified in the
  ``EF_AMDGPU_MACH`` bit field of ``e_flags`` (see
  :ref:`amdgpu-elf-header-e_flags-table`).

``e_entry``
  The entry point is 0, as the entry points for individual kernels must be
  selected in order to invoke them through AQL packets.

``e_flags``
  The AMDGPU backend uses the following ELF header flags:

  .. table:: AMDGPU ELF Header ``e_flags``
     :name: amdgpu-elf-header-e_flags-table

     ================================= ========== =============================
     Name                              Value      Description
     ================================= ========== =============================
     **AMDGPU Processor Flag**                    See :ref:`amdgpu-processor-table`.
     -------------------------------------------- -----------------------------
     ``EF_AMDGPU_MACH``                0x000000ff AMDGPU processor selection
                                                  mask for
                                                  ``EF_AMDGPU_MACH_xxx`` values
                                                  defined in
                                                  :ref:`amdgpu-ef-amdgpu-mach-table`.
     ``EF_AMDGPU_XNACK``               0x00000100 Indicates if the ``xnack``
                                                  target feature is
                                                  enabled for all code
                                                  contained in the code object.
                                                  If the processor
                                                  does not support the
                                                  ``xnack`` target
                                                  feature then must
                                                  be 0.
                                                  See
                                                  :ref:`amdgpu-target-features`.
     ``EF_AMDGPU_SRAM_ECC``            0x00000200 Indicates if the ``sram-ecc``
                                                  target feature is
                                                  enabled for all code
                                                  contained in the code object.
                                                  If the processor
                                                  does not support the
                                                  ``sram-ecc`` target
                                                  feature then must
                                                  be 0.
                                                  See
                                                  :ref:`amdgpu-target-features`.
     ================================= ========== =============================

  .. table:: AMDGPU ``EF_AMDGPU_MACH`` Values
     :name: amdgpu-ef-amdgpu-mach-table

     ================================= ========== =============================
     Name                              Value      Description (see
                                                  :ref:`amdgpu-processor-table`)
     ================================= ========== =============================
     ``EF_AMDGPU_MACH_NONE``           0x000      *not specified*
     ``EF_AMDGPU_MACH_R600_R600``      0x001      ``r600``
     ``EF_AMDGPU_MACH_R600_R630``      0x002      ``r630``
     ``EF_AMDGPU_MACH_R600_RS880``     0x003      ``rs880``
     ``EF_AMDGPU_MACH_R600_RV670``     0x004      ``rv670``
     ``EF_AMDGPU_MACH_R600_RV710``     0x005      ``rv710``
     ``EF_AMDGPU_MACH_R600_RV730``     0x006      ``rv730``
     ``EF_AMDGPU_MACH_R600_RV770``     0x007      ``rv770``
     ``EF_AMDGPU_MACH_R600_CEDAR``     0x008      ``cedar``
     ``EF_AMDGPU_MACH_R600_CYPRESS``   0x009      ``cypress``
     ``EF_AMDGPU_MACH_R600_JUNIPER``   0x00a      ``juniper``
     ``EF_AMDGPU_MACH_R600_REDWOOD``   0x00b      ``redwood``
     ``EF_AMDGPU_MACH_R600_SUMO``      0x00c      ``sumo``
     ``EF_AMDGPU_MACH_R600_BARTS``     0x00d      ``barts``
     ``EF_AMDGPU_MACH_R600_CAICOS``    0x00e      ``caicos``
     ``EF_AMDGPU_MACH_R600_CAYMAN``    0x00f      ``cayman``
     ``EF_AMDGPU_MACH_R600_TURKS``     0x010      ``turks``
     *reserved*                        0x011 -    Reserved for ``r600``
                                       0x01f      architecture processors.
     ``EF_AMDGPU_MACH_AMDGCN_GFX600``  0x020      ``gfx600``
     ``EF_AMDGPU_MACH_AMDGCN_GFX601``  0x021      ``gfx601``
     ``EF_AMDGPU_MACH_AMDGCN_GFX700``  0x022      ``gfx700``
     ``EF_AMDGPU_MACH_AMDGCN_GFX701``  0x023      ``gfx701``
     ``EF_AMDGPU_MACH_AMDGCN_GFX702``  0x024      ``gfx702``
     ``EF_AMDGPU_MACH_AMDGCN_GFX703``  0x025      ``gfx703``
     ``EF_AMDGPU_MACH_AMDGCN_GFX704``  0x026      ``gfx704``
     *reserved*                        0x027      Reserved.
     ``EF_AMDGPU_MACH_AMDGCN_GFX801``  0x028      ``gfx801``
     ``EF_AMDGPU_MACH_AMDGCN_GFX802``  0x029      ``gfx802``
     ``EF_AMDGPU_MACH_AMDGCN_GFX803``  0x02a      ``gfx803``
     ``EF_AMDGPU_MACH_AMDGCN_GFX810``  0x02b      ``gfx810``
     ``EF_AMDGPU_MACH_AMDGCN_GFX900``  0x02c      ``gfx900``
     ``EF_AMDGPU_MACH_AMDGCN_GFX902``  0x02d      ``gfx902``
     ``EF_AMDGPU_MACH_AMDGCN_GFX904``  0x02e      ``gfx904``
     ``EF_AMDGPU_MACH_AMDGCN_GFX906``  0x02f      ``gfx906``
     ``EF_AMDGPU_MACH_AMDGCN_GFX908``  0x030      ``gfx908``
     ``EF_AMDGPU_MACH_AMDGCN_GFX909``  0x031      ``gfx909``
     *reserved*                        0x032      Reserved.
     ``EF_AMDGPU_MACH_AMDGCN_GFX1010`` 0x033      ``gfx1010``
     ``EF_AMDGPU_MACH_AMDGCN_GFX1011`` 0x034      ``gfx1011``
     ``EF_AMDGPU_MACH_AMDGCN_GFX1012`` 0x035      ``gfx1012``
     ``EF_AMDGPU_MACH_AMDGCN_GFX1030`` 0x036      ``gfx1030``
     ================================= ========== =============================

Sections
--------

An AMDGPU target ELF code object has the standard ELF sections which include:

  .. table:: AMDGPU ELF Sections
     :name: amdgpu-elf-sections-table

     ================== ================ =================================
     Name               Type             Attributes
     ================== ================ =================================
     ``.bss``           ``SHT_NOBITS``   ``SHF_ALLOC`` + ``SHF_WRITE``
     ``.data``          ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
     ``.debug_``\ *\**  ``SHT_PROGBITS`` *none*
     ``.dynamic``       ``SHT_DYNAMIC``  ``SHF_ALLOC``
     ``.dynstr``        ``SHT_PROGBITS`` ``SHF_ALLOC``
     ``.dynsym``        ``SHT_PROGBITS`` ``SHF_ALLOC``
     ``.got``           ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
     ``.hash``          ``SHT_HASH``     ``SHF_ALLOC``
     ``.note``          ``SHT_NOTE``     *none*
     ``.rela``\ *name*  ``SHT_RELA``     *none*
     ``.rela.dyn``      ``SHT_RELA``     *none*
     ``.rodata``        ``SHT_PROGBITS`` ``SHF_ALLOC``
     ``.shstrtab``      ``SHT_STRTAB``   *none*
     ``.strtab``        ``SHT_STRTAB``   *none*
     ``.symtab``        ``SHT_SYMTAB``   *none*
     ``.text``          ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_EXECINSTR``
     ================== ================ =================================

These sections have their standard meanings (see [ELF]_) and are only generated
if needed.

``.debug``\ *\**
  The standard DWARF sections. See :ref:`amdgpu-dwarf-debug-information` for
  information on the DWARF produced by the AMDGPU backend.

``.dynamic``, ``.dynstr``, ``.dynsym``, ``.hash``
  The standard sections used by a dynamic loader.

``.note``
  See :ref:`amdgpu-note-records` for the note records supported by the AMDGPU
  backend.

``.rela``\ *name*, ``.rela.dyn``
  For relocatable code objects, *name* is the name of the section to which the
  relocation records apply. For example, ``.rela.text`` is the section name for
  relocation records associated with the ``.text`` section.

  For linked shared code objects, ``.rela.dyn`` contains all the relocation
  records from each of the relocatable code object's ``.rela``\ *name* sections.

  See :ref:`amdgpu-relocation-records` for the relocation records supported by
  the AMDGPU backend.

``.text``
  The executable machine code for the kernels and the functions they call.
  Generated as position independent code. See :ref:`amdgpu-code-conventions`
  for information on conventions used in the ISA generation.

.. _amdgpu-note-records:

Note Records
------------

The AMDGPU backend code object contains ELF note records in the ``.note``
section. The set of generated notes and their semantics depend on the code
object version; see :ref:`amdgpu-note-records-v2` and
:ref:`amdgpu-note-records-v3`.

As required by ``ELFCLASS32`` and ``ELFCLASS64``, minimal zero-byte padding
must be generated after the ``name`` field to ensure the ``desc`` field is 4
byte aligned. In addition, minimal zero-byte padding must be generated to
ensure the ``desc`` field size is a multiple of 4 bytes.
The ``sh_addralign`` 888field of the ``.note`` section must be at least 4 to indicate at least 8 byte 889alignment. 890 891.. _amdgpu-note-records-v2: 892 893Code Object V2 Note Records (-mattr=-code-object-v3) 894~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 895 896.. warning:: Code Object V2 is not the default code object version emitted by 897 this version of LLVM. For a description of the notes generated with the 898 default configuration (Code Object V3) see :ref:`amdgpu-note-records-v3`. 899 900The AMDGPU backend code object uses the following ELF note record in the 901``.note`` section when compiling for Code Object V2 (-mattr=-code-object-v3). 902 903Additional note records may be present, but any which are not documented here 904are deprecated and should not be used. 905 906 .. table:: AMDGPU Code Object V2 ELF Note Records 907 :name: amdgpu-elf-note-records-table-v2 908 909 ===== ============================== ====================================== 910 Name Type Description 911 ===== ============================== ====================================== 912 "AMD" ``NT_AMD_AMDGPU_HSA_METADATA`` <metadata null terminated string> 913 ===== ============================== ====================================== 914 915.. 916 917 .. table:: AMDGPU Code Object V2 ELF Note Record Enumeration Values 918 :name: amdgpu-elf-note-record-enumeration-values-table-v2 919 920 ============================== ===== 921 Name Value 922 ============================== ===== 923 *reserved* 0-9 924 ``NT_AMD_AMDGPU_HSA_METADATA`` 10 925 *reserved* 11 926 ============================== ===== 927 928``NT_AMD_AMDGPU_HSA_METADATA`` 929 Specifies extensible metadata associated with the code objects executed on HSA 930 [HSA]_ compatible runtimes such as AMD's ROCm [AMD-ROCm]_. It is required when 931 the target triple OS is ``amdhsa`` (see :ref:`amdgpu-target-triples`). See 932 :ref:`amdgpu-amdhsa-code-object-metadata-v2` for the syntax of the code 933 object metadata string. 
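The padding rules above can be made concrete with a small sketch. The following illustrative Python function (not part of LLVM) encodes a single note per the generic ELF note layout, zero-padding the ``name`` and ``desc`` fields to 4-byte boundaries; the metadata string passed in is a hypothetical placeholder:

```python
import struct

def encode_elf_note(name: bytes, note_type: int, desc: bytes) -> bytes:
    """Encode one ELF note: 12-byte header, then name and desc,
    each zero-padded so the following field is 4-byte aligned."""
    name_z = name + b"\x00"               # name is null terminated
    pad_name = (-len(name_z)) % 4         # minimal zero bytes before desc
    pad_desc = (-len(desc)) % 4           # minimal zero bytes after desc
    header = struct.pack("<III", len(name_z), len(desc), note_type)
    return header + name_z + b"\x00" * pad_name + desc + b"\x00" * pad_desc

# "AMD" / NT_AMD_AMDGPU_HSA_METADATA (10) per the table above;
# the desc content here is a made-up stand-in for the metadata string.
note = encode_elf_note(b"AMD", 10, b"{ metadata }")
assert len(note) % 4 == 0
```

The same helper applies unchanged to the Code Object V3 ``"AMDGPU"`` / ``NT_AMDGPU_METADATA`` (32) note, since only the name, type, and payload differ.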
.. _amdgpu-note-records-v3:

Code Object V3 Note Records (-mattr=+code-object-v3)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The AMDGPU backend code object uses the following ELF note record in the
``.note`` section when compiling for Code Object V3 (-mattr=+code-object-v3).

Additional note records may be present, but any which are not documented here
are deprecated and should not be used.

  .. table:: AMDGPU Code Object V3 ELF Note Records
     :name: amdgpu-elf-note-records-table-v3

     ======== ============================== ======================================
     Name     Type                           Description
     ======== ============================== ======================================
     "AMDGPU" ``NT_AMDGPU_METADATA``         Metadata in Message Pack [MsgPack]_
                                             binary format.
     ======== ============================== ======================================

..

  .. table:: AMDGPU Code Object V3 ELF Note Record Enumeration Values
     :name: amdgpu-elf-note-record-enumeration-values-table-v3

     ============================== =====
     Name                           Value
     ============================== =====
     *reserved*                     0-31
     ``NT_AMDGPU_METADATA``         32
     ============================== =====

``NT_AMDGPU_METADATA``
  Specifies extensible metadata associated with an AMDGPU code
  object. It is encoded as a map in the Message Pack [MsgPack]_ binary
  data format. See :ref:`amdgpu-amdhsa-code-object-metadata-v3` for the
  map keys defined for the ``amdhsa`` OS.

.. _amdgpu-symbols:

Symbols
-------

Symbols include the following:

  .. table:: AMDGPU ELF Symbols
     :name: amdgpu-elf-symbols-table

     ===================== ================== ================ ==================
     Name                  Type               Section          Description
     ===================== ================== ================ ==================
     *link-name*           ``STT_OBJECT``     - ``.data``      Global variable
                                              - ``.rodata``
                                              - ``.bss``
     *link-name*\ ``.kd``  ``STT_OBJECT``     - ``.rodata``    Kernel descriptor
     *link-name*           ``STT_FUNC``       - ``.text``      Kernel entry point
     *link-name*           ``STT_OBJECT``     - SHN_AMDGPU_LDS Global variable in LDS
     ===================== ================== ================ ==================

Global variable
  Global variables both used and defined by the compilation unit.

  If the symbol is defined in the compilation unit then it is allocated in the
  appropriate section according to whether it has initialized data or is
  read-only.

  If the symbol is external then its section is ``STN_UNDEF`` and the loader
  will resolve relocations using the definition provided by another code
  object or explicitly defined by the runtime.

  If the symbol resides in local/group memory (LDS) then its section is the
  special processor-specific section name ``SHN_AMDGPU_LDS``, and the
  ``st_value`` field describes alignment requirements as it does for common
  symbols.

  .. TODO::

     Add description of linked shared object symbols. Seems undefined symbols
     are marked as STT_NOTYPE.

Kernel descriptor
  Every HSA kernel has an associated kernel descriptor. It is the address of
  the kernel descriptor that is used in the AQL dispatch packet used to invoke
  the kernel, not the kernel entry point. The layout of the HSA kernel
  descriptor is defined in :ref:`amdgpu-amdhsa-kernel-descriptor`.

Kernel entry point
  Every HSA kernel also has a symbol for its machine code entry point.

.. _amdgpu-relocation-records:

Relocation Records
------------------

The AMDGPU backend generates ``Elf64_Rela`` relocation records. Supported
relocatable fields are:

``word32``
  This specifies a 32-bit field occupying 4 bytes with arbitrary byte
  alignment. These values use the same byte order as other word values in the
  AMDGPU architecture.

``word64``
  This specifies a 64-bit field occupying 8 bytes with arbitrary byte
  alignment. These values use the same byte order as other word values in the
  AMDGPU architecture.

The following notations are used for specifying relocation calculations:

**A**
  Represents the addend used to compute the value of the relocatable field.

**G**
  Represents the offset into the global offset table at which the relocation
  entry's symbol will reside during execution.

**GOT**
  Represents the address of the global offset table.

**P**
  Represents the place (section offset for ``et_rel`` or address for
  ``et_dyn``) of the storage unit being relocated (computed using
  ``r_offset``).

**S**
  Represents the value of the symbol whose index resides in the relocation
  entry. Relocations not using this must specify a symbol index of
  ``STN_UNDEF``.

**B**
  Represents the base address of a loaded executable or shared object, which
  is the difference between the ELF address and the actual load address.
  Relocations using this are only valid in executable or shared objects.

The following relocation types are supported:

  .. table:: AMDGPU ELF Relocation Records
     :name: amdgpu-elf-relocation-records-table

     ========================== ======= ===== ========== ==============================
     Relocation Type            Kind    Value Field      Calculation
     ========================== ======= ===== ========== ==============================
     ``R_AMDGPU_NONE``                  0     *none*     *none*
     ``R_AMDGPU_ABS32_LO``      Static, 1     ``word32`` (S + A) & 0xFFFFFFFF
                                Dynamic
     ``R_AMDGPU_ABS32_HI``      Static, 2     ``word32`` (S + A) >> 32
                                Dynamic
     ``R_AMDGPU_ABS64``         Static, 3     ``word64`` S + A
                                Dynamic
     ``R_AMDGPU_REL32``         Static  4     ``word32`` S + A - P
     ``R_AMDGPU_REL64``         Static  5     ``word64`` S + A - P
     ``R_AMDGPU_ABS32``         Static, 6     ``word32`` S + A
                                Dynamic
     ``R_AMDGPU_GOTPCREL``      Static  7     ``word32`` G + GOT + A - P
     ``R_AMDGPU_GOTPCREL32_LO`` Static  8     ``word32`` (G + GOT + A - P) & 0xFFFFFFFF
     ``R_AMDGPU_GOTPCREL32_HI`` Static  9     ``word32`` (G + GOT + A - P) >> 32
     ``R_AMDGPU_REL32_LO``      Static  10    ``word32`` (S + A - P) & 0xFFFFFFFF
     ``R_AMDGPU_REL32_HI``      Static  11    ``word32`` (S + A - P) >> 32
     *reserved*                         12
     ``R_AMDGPU_RELATIVE64``    Dynamic 13    ``word64`` B + A
     ========================== ======= ===== ========== ==============================

``R_AMDGPU_ABS32_LO`` and ``R_AMDGPU_ABS32_HI`` are only supported by
the ``mesa3d`` OS, which does not support ``R_AMDGPU_ABS64``.

There is no current OS loader support for 32-bit programs and so
``R_AMDGPU_ABS32`` is not used.

.. _amdgpu-loaded-code-object-path-uniform-resource-identifier:

Loaded Code Object Path Uniform Resource Identifier (URI)
---------------------------------------------------------

The AMD GPU code object loader represents the path of the ELF shared object
from which the code object was loaded as a textual Uniform Resource Identifier
(URI).
Note that the code object is the in-memory loaded relocated form of the ELF
shared object. Multiple code objects may be loaded at different memory
addresses in the same process from the same ELF shared object.

The loaded code object path URI syntax is defined by the following BNF syntax:

.. code::

  code_object_uri ::== file_uri | memory_uri
  file_uri        ::== "file://" file_path [ range_specifier ]
  memory_uri      ::== "memory://" process_id range_specifier
  range_specifier ::== [ "#" | "?" ] "offset=" number "&" "size=" number
  file_path       ::== URI_ENCODED_OS_FILE_PATH
  process_id      ::== DECIMAL_NUMBER
  number          ::== HEX_NUMBER | DECIMAL_NUMBER | OCTAL_NUMBER

**number**
  Is a C integral literal where hexadecimal values are prefixed by "0x" or
  "0X", and octal values by "0".

**file_path**
  Is the file's path specified as a URI encoded UTF-8 string. In URI encoding,
  every character that is not in the regular expression ``[a-zA-Z0-9/_.~-]``
  is encoded as two uppercase hexadecimal digits preceded by "%". Directories
  in the path are separated by "/".

**offset**
  Is a 0-based byte offset to the start of the code object. For a file URI, it
  is from the start of the file specified by the ``file_path``, and if omitted
  defaults to 0. For a memory URI, it is the memory address and is required.

**size**
  Is the number of bytes in the code object. For a file URI, if omitted it
  defaults to the size of the file. It is required for a memory URI.

**process_id**
  Is the identity of the process owning the memory. For Linux it is the C
  unsigned integral decimal literal for the process ID (PID).

For example:

.. code::

  file:///dir1/dir2/file1
  file:///dir3/dir4/file2#offset=0x2000&size=3000
  memory://1234#offset=0x20000&size=3000

.. _amdgpu-dwarf-debug-information:

DWARF Debug Information
=======================

.. warning::

   This section describes a **provisional proposal** for AMDGPU DWARF [DWARF]_
   that is not currently fully implemented and is subject to change.

AMDGPU generates DWARF [DWARF]_ debugging information ELF sections (see
:ref:`amdgpu-elf-code-object`) which contain information that maps the code
object executable code and data to the source language constructs. It can be
used by tools such as debuggers and profilers. It uses features defined in
:doc:`AMDGPUDwarfProposalForHeterogeneousDebugging` that are made available in
DWARF Version 4 and DWARF Version 5 as an LLVM vendor extension.

This section defines the AMDGPU target architecture specific DWARF mappings.

.. _amdgpu-dwarf-register-identifier:

Register Identifier
-------------------

This section defines the AMDGPU target architecture register numbers used in
DWARF operation expressions (see DWARF Version 5 section 2.5 and
:ref:`amdgpu-dwarf-operation-expressions`) and Call Frame Information
instructions (see DWARF Version 5 section 6.4 and
:ref:`amdgpu-dwarf-call-frame-information`).

A single code object can contain code for kernels that have different
wavefront sizes. The vector registers and some scalar registers are based on
the wavefront size. AMDGPU defines distinct DWARF registers for each wavefront
size. This simplifies the consumer of the DWARF so that each register has a
fixed size, rather than being dynamic according to the wavefront size mode.
Similarly, distinct DWARF registers are defined for those registers that vary
in size according to the process address size. This allows a consumer to treat
a specific AMDGPU processor as a single architecture regardless of how it is
configured at run time.
The compiler explicitly specifies the DWARF registers that match the mode in
which the code it is generating will be executed.

DWARF registers are encoded as numbers, which are mapped to architecture
registers. The mapping for AMDGPU is defined in
:ref:`amdgpu-dwarf-register-mapping-table`. All AMDGPU targets use the same
mapping.

.. table:: AMDGPU DWARF Register Mapping
   :name: amdgpu-dwarf-register-mapping-table

   ============== ================= ======== ==================================
   DWARF Register AMDGPU Register   Bit Size Description
   ============== ================= ======== ==================================
   0              PC_32             32       Program Counter (PC) when
                                             executing in a 32-bit process
                                             address space. Used in the CFI to
                                             describe the PC of the calling
                                             frame.
   1              EXEC_MASK_32      32       Execution Mask Register when
                                             executing in wavefront 32 mode.
   2-15           *Reserved*                 *Reserved for highly accessed
                                             registers using DWARF shortcut.*
   16             PC_64             64       Program Counter (PC) when
                                             executing in a 64-bit process
                                             address space. Used in the CFI to
                                             describe the PC of the calling
                                             frame.
   17             EXEC_MASK_64      64       Execution Mask Register when
                                             executing in wavefront 64 mode.
   18-31          *Reserved*                 *Reserved for highly accessed
                                             registers using DWARF shortcut.*
   32-95          SGPR0-SGPR63      32       Scalar General Purpose
                                             Registers.
   96-127         *Reserved*                 *Reserved for frequently accessed
                                             registers using DWARF 1-byte ULEB.*
   128            SCC               32       Scalar Condition Code Register.
   129-511        *Reserved*                 *Reserved for future Scalar
                                             Architectural Registers.*
   512            VCC_32            32       Vector Condition Code Register
                                             when executing in wavefront 32
                                             mode.
   513-767        *Reserved*                 *Reserved for future Vector
                                             Architectural Registers when
                                             executing in wavefront 32 mode.*
   768            VCC_64            64       Vector Condition Code Register
                                             when executing in wavefront 64
                                             mode.
   769-1023       *Reserved*                 *Reserved for future Vector
                                             Architectural Registers when
                                             executing in wavefront 64 mode.*
   1024-1087      *Reserved*                 *Reserved for padding.*
   1088-1129      SGPR64-SGPR105    32       Scalar General Purpose Registers.
   1130-1535      *Reserved*                 *Reserved for future Scalar
                                             General Purpose Registers.*
   1536-1791      VGPR0-VGPR255     32*32    Vector General Purpose Registers
                                             when executing in wavefront 32
                                             mode.
   1792-2047      *Reserved*                 *Reserved for future Vector
                                             General Purpose Registers when
                                             executing in wavefront 32 mode.*
   2048-2303      AGPR0-AGPR255     32*32    Vector Accumulation Registers
                                             when executing in wavefront 32
                                             mode.
   2304-2559      *Reserved*                 *Reserved for future Vector
                                             Accumulation Registers when
                                             executing in wavefront 32 mode.*
   2560-2815      VGPR0-VGPR255     64*32    Vector General Purpose Registers
                                             when executing in wavefront 64
                                             mode.
   2816-3071      *Reserved*                 *Reserved for future Vector
                                             General Purpose Registers when
                                             executing in wavefront 64 mode.*
   3072-3327      AGPR0-AGPR255     64*32    Vector Accumulation Registers
                                             when executing in wavefront 64
                                             mode.
   3328-3583      *Reserved*                 *Reserved for future Vector
                                             Accumulation Registers when
                                             executing in wavefront 64 mode.*
   ============== ================= ======== ==================================

The vector registers are represented as the full size for the wavefront. They
are organized as consecutive dwords (32 bits), one per lane, with the dword at
the least significant bit position corresponding to lane 0 and so forth. DWARF
location expressions involving the ``DW_OP_LLVM_offset`` and
``DW_OP_LLVM_push_lane`` operations are used to select the part of the vector
register corresponding to the lane that is executing the current thread of
execution in languages that are implemented using a SIMD or SIMT execution
model.
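The fixed base numbers in the mapping table make the register encoding a
simple offset calculation. The following illustrative Python helpers (not part
of the LLVM sources) sketch the lookup for the scalar and vector general
purpose registers:

```python
def dwarf_sgpr(n: int) -> int:
    """DWARF register number for SGPRn, per the AMDGPU mapping table."""
    if 0 <= n <= 63:
        return 32 + n            # SGPR0-SGPR63  -> 32-95
    if 64 <= n <= 105:
        return 1088 + (n - 64)   # SGPR64-SGPR105 -> 1088-1129
    raise ValueError(f"no DWARF register defined for SGPR{n}")

def dwarf_vgpr(n: int, wavefront_size: int) -> int:
    """DWARF register number for VGPRn in wavefront 32 or 64 mode."""
    if not 0 <= n <= 255:
        raise ValueError("VGPR index out of range")
    if wavefront_size == 32:
        return 1536 + n          # VGPR0-VGPR255 (wave32) -> 1536-1791
    if wavefront_size == 64:
        return 2560 + n          # VGPR0-VGPR255 (wave64) -> 2560-2815
    raise ValueError("wavefront size must be 32 or 64")
```

Note how the same architectural VGPR gets a different DWARF number in each
wavefront mode, which is what lets the DWARF consumer treat each register as
having a fixed size.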
If the wavefront size is 32 lanes then the wavefront 32 mode register
definitions are used. If the wavefront size is 64 lanes then the wavefront 64
mode register definitions are used. Some AMDGPU targets support executing in
both wavefront 32 and wavefront 64 modes. The register definitions
corresponding to the wavefront mode of the generated code will be used.

If code is generated to execute in a 32-bit process address space, then the
32-bit process address space register definitions are used. If code is
generated to execute in a 64-bit process address space, then the 64-bit
process address space register definitions are used. The ``amdgcn`` target
only supports the 64-bit process address space.

.. _amdgpu-dwarf-address-class-identifier:

Address Class Identifier
------------------------

The DWARF address class represents the source language memory space. See
DWARF Version 5 section 2.12, which is updated by the proposal in
:ref:`amdgpu-dwarf-segment_addresses`.

The DWARF address class mapping used for AMDGPU is defined in
:ref:`amdgpu-dwarf-address-class-mapping-table`.

.. table:: AMDGPU DWARF Address Class Mapping
   :name: amdgpu-dwarf-address-class-mapping-table

   ========================= ====== =================
   DWARF                            AMDGPU
   -------------------------------- -----------------
   Address Class Name        Value  Address Space
   ========================= ====== =================
   ``DW_ADDR_none``          0x0000 Generic (Flat)
   ``DW_ADDR_LLVM_global``   0x0001 Global
   ``DW_ADDR_LLVM_constant`` 0x0002 Global
   ``DW_ADDR_LLVM_group``    0x0003 Local (group/LDS)
   ``DW_ADDR_LLVM_private``  0x0004 Private (Scratch)
   ``DW_ADDR_AMDGPU_region`` 0x8000 Region (GDS)
   ========================= ====== =================

The DWARF address class values defined in the proposal at
:ref:`amdgpu-dwarf-segment_addresses` are used.
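For a tool consuming this DWARF, the address class table amounts to a small
lookup; a minimal illustrative sketch (the dictionary and function names are
not from any real API):

```python
# DWARF address class value -> AMDGPU address space, per the mapping table.
AMDGPU_ADDRESS_CLASS_TO_SPACE = {
    0x0000: "Generic (Flat)",     # DW_ADDR_none
    0x0001: "Global",             # DW_ADDR_LLVM_global
    0x0002: "Global",             # DW_ADDR_LLVM_constant
    0x0003: "Local (group/LDS)",  # DW_ADDR_LLVM_group
    0x0004: "Private (Scratch)",  # DW_ADDR_LLVM_private
    0x8000: "Region (GDS)",       # DW_ADDR_AMDGPU_region (vendor extension)
}

def amdgpu_address_space(address_class: int = 0x0000) -> str:
    """Default to DW_ADDR_none when no DW_AT_address_class is present."""
    return AMDGPU_ADDRESS_CLASS_TO_SPACE[address_class]
```

Note that two distinct address classes (``global`` and ``constant``) map to
the same AMDGPU address space, so the mapping is not invertible.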
In addition, ``DW_ADDR_AMDGPU_region`` is encoded as a vendor extension. It is
available for use by the AMD extension for access to the hardware GDS memory,
which is scratchpad memory allocated per device.

For AMDGPU, if no ``DW_AT_address_class`` attribute is present, then the
default address class of ``DW_ADDR_none`` is used.

See :ref:`amdgpu-dwarf-address-space-identifier` for information on the AMDGPU
mapping of DWARF address classes to DWARF address spaces, including address
size and NULL value.

.. _amdgpu-dwarf-address-space-identifier:

Address Space Identifier
------------------------

DWARF address spaces correspond to target architecture specific linear
addressable memory areas. See DWARF Version 5 section 2.12 and
:ref:`amdgpu-dwarf-segment_addresses`.

The DWARF address space mapping used for AMDGPU is defined in
:ref:`amdgpu-dwarf-address-space-mapping-table`.

.. table:: AMDGPU DWARF Address Space Mapping
   :name: amdgpu-dwarf-address-space-mapping-table

   ======================================= ===== ======= ======== ================= =======================
   DWARF                                         AMDGPU                             Notes
   --------------------------------------- ----- ---------------- ----------------- -----------------------
   Address Space Name                      Value Address Bit Size Address Space
   --------------------------------------- ----- ------- -------- ----------------- -----------------------
   ..                                            64-bit  32-bit
                                                 process process
                                                 address address
                                                 space   space
   ======================================= ===== ======= ======== ================= =======================
   ``DW_ASPACE_none``                      0x00  8       4        Global            *default address space*
   ``DW_ASPACE_AMDGPU_generic``            0x01  8       4        Generic (Flat)
   ``DW_ASPACE_AMDGPU_region``             0x02  4       4        Region (GDS)
   ``DW_ASPACE_AMDGPU_local``              0x03  4       4        Local (group/LDS)
   *Reserved*                              0x04
   ``DW_ASPACE_AMDGPU_private_lane``       0x05  4       4        Private (Scratch) *focused lane*
   ``DW_ASPACE_AMDGPU_private_wave``       0x06  4       4        Private (Scratch) *unswizzled wavefront*
   *Reserved*                              0x07-
                                           0x1F
   ``DW_ASPACE_AMDGPU_private_lane<0-63>`` 0x20- 4       4        Private (Scratch) *specific lane*
                                           0x5F
   ======================================= ===== ======= ======== ================= =======================

See :ref:`amdgpu-address-spaces` for information on the AMDGPU address spaces,
including address size and NULL value.

The ``DW_ASPACE_none`` address space is the default target architecture
address space used in DWARF operations that do not specify an address space.
It therefore has to map to the global address space so that the
``DW_OP_addr*`` and related operations can refer to addresses in the program
code.

The ``DW_ASPACE_AMDGPU_generic`` address space allows location expressions to
specify the flat address space. If the address corresponds to an address in
the local address space, then it corresponds to the wavefront that is
executing the focused thread of execution. If the address corresponds to an
address in the private address space, then it corresponds to the lane that is
executing the focused thread of execution for languages that are implemented
using a SIMD or SIMT execution model.

.. note::

   CUDA-like languages such as HIP that do not have address spaces in the
   language type system, but do allow variables to be allocated in different
   address spaces, need to explicitly specify the ``DW_ASPACE_AMDGPU_generic``
   address space in the DWARF expression operations as the default address
   space is the global address space.

The ``DW_ASPACE_AMDGPU_local`` address space allows location expressions to
specify the local address space corresponding to the wavefront that is
executing the focused thread of execution.

The ``DW_ASPACE_AMDGPU_private_lane`` address space allows location
expressions to specify the private address space corresponding to the lane
that is executing the focused thread of execution for languages that are
implemented using a SIMD or SIMT execution model.

The ``DW_ASPACE_AMDGPU_private_wave`` address space allows location
expressions to specify the unswizzled private address space corresponding to
the wavefront that is executing the focused thread of execution. The wavefront
view of private memory is the per-wavefront unswizzled backing memory layout
defined in :ref:`amdgpu-address-spaces`, such that address 0 corresponds to
the first location for the backing memory of the wavefront (namely the address
is not offset by ``wavefront-scratch-base``).
The following formula can be used to convert from a
``DW_ASPACE_AMDGPU_private_lane`` address to a
``DW_ASPACE_AMDGPU_private_wave`` address:

::

  private-address-wavefront =
    ((private-address-lane / 4) * wavefront-size * 4) +
    (wavefront-lane-id * 4) + (private-address-lane % 4)

If the ``DW_ASPACE_AMDGPU_private_lane`` address is dword aligned, and the
start of the dwords for each lane starting with lane 0 is required, then this
simplifies to:

::

  private-address-wavefront =
    private-address-lane * wavefront-size

A compiler can use the ``DW_ASPACE_AMDGPU_private_wave`` address space to read
a complete spilled vector register back into a complete vector register in the
CFI. The frame pointer can be a private lane address which is dword aligned,
which can be shifted to multiply by the wavefront size, and then used to form
a private wavefront address that gives a location for a contiguous set of
dwords, one per lane, where the vector register dwords are spilled. The
compiler knows the wavefront size since it generates the code. Note that the
type of the address may have to be converted as the size of a
``DW_ASPACE_AMDGPU_private_lane`` address may be smaller than the size of a
``DW_ASPACE_AMDGPU_private_wave`` address.

The ``DW_ASPACE_AMDGPU_private_lane<N>`` address space allows location
expressions to specify the private address space corresponding to a specific
lane N. For example, this can be used when the compiler spills scalar
registers to scratch memory, with each scalar register being saved to a
different lane's scratch memory.

.. _amdgpu-dwarf-lane-identifier:

Lane Identifier
---------------

DWARF lane identifiers specify a target architecture lane position for
hardware that executes in a SIMD or SIMT manner, and onto which a source
language maps its threads of execution. The DWARF lane identifier is pushed by
the ``DW_OP_LLVM_push_lane`` DWARF expression operation. See DWARF Version 5
section 2.5, which is updated by the proposal in
:ref:`amdgpu-dwarf-operation-expressions`.

For AMDGPU, the lane identifier corresponds to the hardware lane ID of a
wavefront. It is numbered from 0 to the wavefront size minus 1.

Operation Expressions
---------------------

DWARF expressions are used to compute program values and the locations of
program objects. See DWARF Version 5 section 2.5 and
:ref:`amdgpu-dwarf-operation-expressions`.

DWARF location descriptions describe how to access storage, which includes
memory and registers. When accessing storage on AMDGPU, bytes are ordered with
least significant bytes first, and bits are ordered within bytes with least
significant bits first.

For AMDGPU CFI expressions, ``DW_OP_LLVM_select_bit_piece`` is used to
describe unwinding vector registers that are spilled under the execution mask
to memory: the zero-single location description is the vector register, and
the one-single location description is the spilled memory location
description. The ``DW_OP_LLVM_form_aspace_address`` operation is used to
specify the address space of the memory location description.

In AMDGPU expressions, ``DW_OP_LLVM_select_bit_piece`` is used by the
``DW_AT_LLVM_lane_pc`` attribute expression where divergent control flow is
controlled by the execution mask. An undefined location description together
with ``DW_OP_LLVM_extend`` is used to indicate the lane was not active on
entry to the subprogram.
See :ref:`amdgpu-dwarf-dw-at-llvm-lane-pc` for an example.

Debugger Information Entry Attributes
-------------------------------------

This section describes how certain debugger information entry attributes are
used by AMDGPU. See the sections in DWARF Version 5 section 2 which are
updated by the proposal in
:ref:`amdgpu-dwarf-debugging-information-entry-attributes`.

.. _amdgpu-dwarf-dw-at-llvm-lane-pc:

``DW_AT_LLVM_lane_pc``
~~~~~~~~~~~~~~~~~~~~~~

For AMDGPU, the ``DW_AT_LLVM_lane_pc`` attribute is used to specify the
program location of the separate lanes of a SIMT thread.

If the lane is an active lane then this will be the same as the current
program location.

If the lane is inactive, but was active on entry to the subprogram, then this
is the program location in the subprogram at which execution of the lane is
conceptually positioned.

If the lane was not active on entry to the subprogram, then this will be the
undefined location. A client debugger can check if the lane is part of a valid
work-group by checking that the lane is in the range of the associated
work-group within the grid, accounting for partial work-groups. If it is not,
then the debugger can omit any information for the lane. Otherwise, the
debugger may repeatedly unwind the stack and inspect the
``DW_AT_LLVM_lane_pc`` of the calling subprogram until it finds a
non-undefined location. Conceptually the lane only has the call frames for
which it has a non-undefined ``DW_AT_LLVM_lane_pc``.

The following example illustrates how the AMDGPU backend can generate a DWARF
location list expression for the nested ``IF/THEN/ELSE`` structures of the
following subprogram pseudo code for a target with 64 lanes per wavefront.

.. code::
  :number-lines:

  SUBPROGRAM X
  BEGIN
    a;
    IF (c1) THEN
      b;
      IF (c2) THEN
        c;
      ELSE
        d;
      ENDIF
      e;
    ELSE
      f;
    ENDIF
    g;
  END

The AMDGPU backend may generate the following pseudo LLVM MIR to manipulate
the execution mask (``EXEC``) to linearize the control flow. The condition is
evaluated to make a mask of the lanes for which the condition evaluates to
true. First the ``THEN`` region is executed by setting the ``EXEC`` mask to
the logical ``AND`` of the current ``EXEC`` mask with the condition mask. Then
the ``ELSE`` region is executed by negating the ``EXEC`` mask and taking the
logical ``AND`` of it with the ``EXEC`` mask saved at the start of the region.
After the ``IF/THEN/ELSE`` region the ``EXEC`` mask is restored to the value
it had at the beginning of the region. This is shown below. Other approaches
are possible, but the basic concept is the same.

.. code::
  :number-lines:

  $lex_start:
    a;
    %1 = EXEC
    %2 = c1
  $lex_1_start:
    EXEC = %1 & %2
  $if_1_then:
    b;
    %3 = EXEC
    %4 = c2
  $lex_1_1_start:
    EXEC = %3 & %4
  $lex_1_1_then:
    c;
    EXEC = ~EXEC & %3
  $lex_1_1_else:
    d;
    EXEC = %3
  $lex_1_1_end:
    e;
    EXEC = ~EXEC & %1
  $lex_1_else:
    f;
    EXEC = %1
  $lex_1_end:
    g;
  $lex_end:

To create the DWARF location list expression that defines the location
description of a vector of lane program locations, the LLVM MIR ``DBG_VALUE``
pseudo instruction can be used to annotate the linearized control flow. This
can be done by defining an artificial variable for the lane PC. The DWARF
location list expression created for it is used as the value of the
``DW_AT_LLVM_lane_pc`` attribute on the subprogram's debugger information
entry.
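The per-lane selection that ``DW_OP_LLVM_select_bit_piece`` performs between
the active-lane PC and the divergent-lane PCs can be modeled directly. The
following is an illustrative sketch of that semantics only (not DWARF
evaluation code; all names are invented for the illustration):

```python
def lane_pcs(exec_mask: int, current_pc, divergent_pcs, wavefront_size=64):
    """Build the lane PC vector: for each lane, take the current PC if the
    lane's EXEC bit is set, otherwise that lane's divergent (or undefined)
    PC, mirroring DW_OP_LLVM_select_bit_piece selecting between two
    location descriptions under the execution mask."""
    return [
        current_pc if (exec_mask >> lane) & 1 else divergent_pcs[lane]
        for lane in range(wavefront_size)
    ]

# Lanes 0 and 1 active at PC 0x1000; the remaining lanes carry None,
# modeling the undefined location for lanes not active on subprogram entry.
pcs = lane_pcs(0b11, 0x1000, [None] * 64)
```

A debugger would read ``None`` here as "unwind further to find this lane's
conceptual position", exactly as described for the undefined location above.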
A DWARF procedure is defined for each well-nested structured control flow region
which provides the conceptual lane program location for a lane if it is not
active (namely it is divergent). The DWARF operation expression for each region
conceptually inherits the value of the immediately enclosing region and modifies
it according to the semantics of the region.

For an ``IF/THEN/ELSE`` region the divergent program location is at the start of
the region for the ``THEN`` region since it is executed first. For the ``ELSE``
region the divergent program location is at the end of the ``IF/THEN/ELSE``
region since the ``THEN`` region has completed.

The lane PC artificial variable is assigned at each region transition. It uses
the immediately enclosing region's DWARF procedure to compute the program
location for each lane assuming they are divergent, and then modifies the result
by inserting the current program location for each lane that the ``EXEC`` mask
indicates is active.

By having separate DWARF procedures for each region, they can be reused to
define the value for any nested region. This reduces the total size of the DWARF
operation expressions.

The following provides an example using pseudo LLVM MIR.

.. code::
  :number-lines:

  $lex_start:
    DEFINE_DWARF %__uint_64 = DW_TAG_base_type[
      DW_AT_name = "__uint64";
      DW_AT_byte_size = 8;
      DW_AT_encoding = DW_ATE_unsigned;
    ];
    DEFINE_DWARF %__active_lane_pc = DW_TAG_dwarf_procedure[
      DW_AT_name = "__active_lane_pc";
      DW_AT_location = [
        DW_OP_regx PC;
        DW_OP_LLVM_extend 64, 64;
        DW_OP_regval_type EXEC, %__uint_64;
        DW_OP_LLVM_select_bit_piece 64, 64;
      ];
    ];
    DEFINE_DWARF %__divergent_lane_pc = DW_TAG_dwarf_procedure[
      DW_AT_name = "__divergent_lane_pc";
      DW_AT_location = [
        DW_OP_LLVM_undefined;
        DW_OP_LLVM_extend 64, 64;
      ];
    ];
    DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
      DW_OP_call_ref %__divergent_lane_pc;
      DW_OP_call_ref %__active_lane_pc;
    ];
    a;
    %1 = EXEC;
    DBG_VALUE %1, $noreg, %__lex_1_save_exec;
    %2 = c1;
  $lex_1_start:
    EXEC = %1 & %2;
  $lex_1_then:
    DEFINE_DWARF %__divergent_lane_pc_1_then = DW_TAG_dwarf_procedure[
      DW_AT_name = "__divergent_lane_pc_1_then";
      DW_AT_location = DIExpression[
        DW_OP_call_ref %__divergent_lane_pc;
        DW_OP_addrx &lex_1_start;
        DW_OP_stack_value;
        DW_OP_LLVM_extend 64, 64;
        DW_OP_call_ref %__lex_1_save_exec;
        DW_OP_deref_type 64, %__uint_64;
        DW_OP_LLVM_select_bit_piece 64, 64;
      ];
    ];
    DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
      DW_OP_call_ref %__divergent_lane_pc_1_then;
      DW_OP_call_ref %__active_lane_pc;
    ];
    b;
    %3 = EXEC;
    DBG_VALUE %3, $noreg, %__lex_1_1_save_exec;
    %4 = c2;
  $lex_1_1_start:
    EXEC = %3 & %4;
  $lex_1_1_then:
    DEFINE_DWARF %__divergent_lane_pc_1_1_then = DW_TAG_dwarf_procedure[
      DW_AT_name = "__divergent_lane_pc_1_1_then";
      DW_AT_location = DIExpression[
        DW_OP_call_ref %__divergent_lane_pc_1_then;
        DW_OP_addrx &lex_1_1_start;
        DW_OP_stack_value;
        DW_OP_LLVM_extend 64, 64;
        DW_OP_call_ref %__lex_1_1_save_exec;
        DW_OP_deref_type 64, %__uint_64;
        DW_OP_LLVM_select_bit_piece 64, 64;
      ];
    ];
    DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
      DW_OP_call_ref %__divergent_lane_pc_1_1_then;
      DW_OP_call_ref %__active_lane_pc;
    ];
    c;
    EXEC = ~EXEC & %3;
  $lex_1_1_else:
    DEFINE_DWARF %__divergent_lane_pc_1_1_else = DW_TAG_dwarf_procedure[
      DW_AT_name = "__divergent_lane_pc_1_1_else";
      DW_AT_location = DIExpression[
        DW_OP_call_ref %__divergent_lane_pc_1_then;
        DW_OP_addrx &lex_1_1_end;
        DW_OP_stack_value;
        DW_OP_LLVM_extend 64, 64;
        DW_OP_call_ref %__lex_1_1_save_exec;
        DW_OP_deref_type 64, %__uint_64;
        DW_OP_LLVM_select_bit_piece 64, 64;
      ];
    ];
    DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
      DW_OP_call_ref %__divergent_lane_pc_1_1_else;
      DW_OP_call_ref %__active_lane_pc;
    ];
    d;
    EXEC = %3;
  $lex_1_1_end:
    DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
      DW_OP_call_ref %__divergent_lane_pc;
      DW_OP_call_ref %__active_lane_pc;
    ];
    e;
    EXEC = ~EXEC & %1;
  $lex_1_else:
    DEFINE_DWARF %__divergent_lane_pc_1_else = DW_TAG_dwarf_procedure[
      DW_AT_name = "__divergent_lane_pc_1_else";
      DW_AT_location = DIExpression[
        DW_OP_call_ref %__divergent_lane_pc;
        DW_OP_addrx &lex_1_end;
        DW_OP_stack_value;
        DW_OP_LLVM_extend 64, 64;
        DW_OP_call_ref %__lex_1_save_exec;
        DW_OP_deref_type 64, %__uint_64;
        DW_OP_LLVM_select_bit_piece 64, 64;
      ];
    ];
    DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
      DW_OP_call_ref %__divergent_lane_pc_1_else;
      DW_OP_call_ref %__active_lane_pc;
    ];
    f;
    EXEC = %1;
  $lex_1_end:
    DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
      DW_OP_call_ref %__divergent_lane_pc;
      DW_OP_call_ref %__active_lane_pc;
    ];
    g;
  $lex_end:

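
The composition of these DWARF procedures can be mimicked in a few lines of
Python. This is an illustrative model only: the lane count, addresses, and
masks below are invented, ``None`` stands in for the undefined location, and
the helper takes a scalar override where the DWARF operation operates on two
vectors of location descriptions.

```python
# Illustrative model (not part of LLVM) of how the lane PC vector is built.
# DW_OP_LLVM_select_bit_piece conceptually picks, per lane, the override
# value when the mask bit is set and the default value otherwise.
LANES = 4

def select_bit_piece(default_pcs, override_pc, mask):
    return [override_pc if (mask >> lane) & 1 else default_pcs[lane]
            for lane in range(LANES)]

UNDEFINED = None            # models DW_OP_LLVM_undefined
lex_1_start = 0x100         # hypothetical address of $lex_1_start
current_pc = 0x120          # hypothetical PC inside the THEN region

# %__divergent_lane_pc: every lane starts with the undefined location.
divergent = [UNDEFINED] * LANES

# %__divergent_lane_pc_1_then: lanes that were active on entry to the
# region (%__lex_1_save_exec) are positioned at $lex_1_start.
saved_exec_1 = 0b0111
divergent = select_bit_piece(divergent, lex_1_start, saved_exec_1)

# %__active_lane_pc: lanes currently in EXEC get the current PC.
exec_mask = 0b0011
lane_pcs = select_bit_piece(divergent, current_pc, exec_mask)
# Lanes 0-1 are at the current PC, lane 2 is parked at $lex_1_start,
# and lane 3 is undefined because it was never active in the subprogram.
```

Nesting deeper regions simply applies further ``select_bit_piece`` steps with
the corresponding saved execution masks, which is exactly the structure of the
chained ``DW_OP_call_ref`` operations in the listing above.
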
The DWARF procedure ``%__active_lane_pc`` is used to update the lane PC elements
that are active with the current program location.

Artificial variables ``%__lex_1_save_exec`` and ``%__lex_1_1_save_exec`` are
created for the execution masks saved on entry to a region. Using the
``DBG_VALUE`` pseudo instruction, location list entries will be created that
describe where the artificial variables are allocated at any given program
location. The compiler may allocate them to registers or spill them to memory.

The DWARF procedures for each region use the values of the saved execution mask
artificial variables to only update the lanes that are active on entry to the
region. All other lanes retain the value of the enclosing region where they were
last active. If they were not active on entry to the subprogram, then they will
have the undefined location description.

Other structured control flow regions can be handled similarly. For example,
loops would set the divergent program location for the region at the end of the
loop. Any lanes active will be in the loop, and any lanes not active must have
exited the loop.

An ``IF/THEN/ELSEIF/ELSEIF/...`` region can be treated as a nest of
``IF/THEN/ELSE`` regions.

The DWARF procedures can use the active lane artificial variable described in
:ref:`amdgpu-dwarf-amdgpu-dw-at-llvm-active-lane` rather than the actual
``EXEC`` mask in order to support whole or quad wavefront mode.

.. _amdgpu-dwarf-amdgpu-dw-at-llvm-active-lane:

``DW_AT_LLVM_active_lane``
~~~~~~~~~~~~~~~~~~~~~~~~~~

The ``DW_AT_LLVM_active_lane`` attribute on a subprogram debugger information
entry is used to specify the lanes that are conceptually active for a SIMT
thread.

The execution mask may be modified to implement whole or quad wavefront mode
operations. For example, all lanes may need to temporarily be made active to
execute a whole wavefront operation. Such regions would save the ``EXEC`` mask,
update it to enable the necessary lanes, perform the operations, and then
restore the ``EXEC`` mask from the saved value. While executing the whole
wavefront region, the conceptual execution mask is the saved value, not the
``EXEC`` value.

This is handled by defining an artificial variable for the active lane mask. The
active lane mask artificial variable would be the actual ``EXEC`` mask for
normal regions, and the saved execution mask for regions where the mask is
temporarily updated. The location list expression created for this artificial
variable is used to define the value of the ``DW_AT_LLVM_active_lane``
attribute.

``DW_AT_LLVM_augmentation``
~~~~~~~~~~~~~~~~~~~~~~~~~~~

For AMDGPU, the ``DW_AT_LLVM_augmentation`` attribute of a compilation unit
debugger information entry has the following value for the augmentation string:

::

  [amdgpu:v0.0]

The "vX.Y" specifies the major X and minor Y version number of the AMDGPU
extensions used in the DWARF of the compilation unit. The version number
conforms to [SEMVER]_.

Call Frame Information
----------------------

DWARF Call Frame Information (CFI) describes how a consumer can virtually
*unwind* call frames in a running process or core dump. See DWARF Version 5
section 6.4 and :ref:`amdgpu-dwarf-call-frame-information`.

For AMDGPU, the Common Information Entry (CIE) fields have the following values:

1. ``augmentation`` string contains the following null-terminated UTF-8 string:

   ::

     [amd:v0.0]

   The ``vX.Y`` specifies the major X and minor Y version number of the AMDGPU
   extensions used in this CIE or in the FDEs that use it. The version number
   conforms to [SEMVER]_.

2. ``address_size`` for the ``Global`` address space is defined in
   :ref:`amdgpu-dwarf-address-space-identifier`.

3. ``segment_selector_size`` is 0 as AMDGPU does not use a segment selector.

4. ``code_alignment_factor`` is 4 bytes.

   .. TODO::

      Add to :ref:`amdgpu-processor-table` table.

5. ``data_alignment_factor`` is 4 bytes.

   .. TODO::

      Add to :ref:`amdgpu-processor-table` table.

6. ``return_address_register`` is ``PC_32`` for 32-bit processes and ``PC_64``
   for 64-bit processes defined in :ref:`amdgpu-dwarf-register-identifier`.

7. ``initial_instructions``: Since a subprogram X with fewer registers can be
   called from a subprogram Y that has more allocated, X will not change any of
   the extra registers as it cannot access them. Therefore, the default rule
   for all columns is ``same value``.

For AMDGPU the register number follows the numbering defined in
:ref:`amdgpu-dwarf-register-identifier`.

For AMDGPU the instructions are variable size. A consumer can subtract 1 from
the return address to get the address of a byte within the call site
instructions. See DWARF Version 5 section 6.4.4.

Accelerated Access
------------------

See DWARF Version 5 section 6.1.

Lookup By Name Section Header
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

See DWARF Version 5 section 6.1.1.4.1 and :ref:`amdgpu-dwarf-lookup-by-name`.

For AMDGPU the lookup by name section header table has the following fields:

``augmentation_string_size`` (uword)

  Set to the length of the ``augmentation_string`` value, which is always a
  multiple of 4.

``augmentation_string`` (sequence of UTF-8 characters)

  Contains the following UTF-8 string null padded to a multiple of 4 bytes:

  ::

    [amdgpu:v0.0]

  The "vX.Y" specifies the major X and minor Y version number of the AMDGPU
  extensions used in the DWARF of this index. The version number conforms to
  [SEMVER]_.

  .. note::

     This is different from the DWARF Version 5 definition, which requires the
     first 4 characters to be the vendor ID. But this is consistent with the
     other augmentation strings and does allow multiple vendor contributions.
     However, backwards compatibility may be more desirable.

Lookup By Address Section Header
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

See DWARF Version 5 section 6.1.2.

For AMDGPU the lookup by address section header table has the following fields:

``address_size`` (ubyte)

  Matches the address size for the ``Global`` address space defined in
  :ref:`amdgpu-dwarf-address-space-identifier`.

``segment_selector_size`` (ubyte)

  AMDGPU does not use a segment selector so this is 0. The entries in the
  ``.debug_aranges`` section do not have a segment selector.

Line Number Information
-----------------------

See DWARF Version 5 section 6.2 and :ref:`amdgpu-dwarf-line-number-information`.

AMDGPU does not use the ``isa`` state machine register and always sets it to 0.
The instruction set must be obtained from the ELF file header ``e_flags`` field
in the ``EF_AMDGPU_MACH`` bit position (see :ref:`ELF Header
<amdgpu-elf-header>`). See DWARF Version 5 section 6.2.2.

.. TODO::

   Should the ``isa`` state machine register be used to indicate if the code is
   in wavefront32 or wavefront64 mode? Or used to specify the architecture ISA?

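
As a sketch of what obtaining the instruction set from ``e_flags`` involves,
the following reads the field from a little-endian ELF64 header. The header
offsets follow the generic ELF64 layout; the ``EF_AMDGPU_MACH`` mask value and
the sample flags are assumptions modeled on LLVM's ``BinaryFormat/ELF.h``, not
normative values.

```python
import struct

EF_AMDGPU_MACH = 0x0ff  # assumed mask for the processor field of e_flags

def amdgpu_mach(elf_header):
    """Extract the AMDGPU machine id from a little-endian ELF64 header."""
    if elf_header[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    # e_flags lives at offset 0x30 in an ELF64 header: 16-byte e_ident,
    # then e_type(2) e_machine(2) e_version(4) e_entry(8) e_phoff(8) e_shoff(8).
    (e_flags,) = struct.unpack_from("<I", elf_header, 0x30)
    return e_flags & EF_AMDGPU_MACH

# Synthetic header: magic, zero padding up to e_flags, then e_flags = 0x12f.
header = b"\x7fELF" + bytes(0x2c) + struct.pack("<I", 0x12f)
```

A real consumer would of course use an ELF library and also check
``e_machine`` and the ELF class byte before trusting the offset.
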
For AMDGPU the line number program header fields have the following values (see
DWARF Version 5 section 6.2.4):

``address_size`` (ubyte)
  Matches the address size for the ``Global`` address space defined in
  :ref:`amdgpu-dwarf-address-space-identifier`.

``segment_selector_size`` (ubyte)
  AMDGPU does not use a segment selector so this is 0.

``minimum_instruction_length`` (ubyte)
  For GFX9-GFX10 this is 4.

``maximum_operations_per_instruction`` (ubyte)
  For GFX9-GFX10 this is 1.

Source text for online-compiled programs (for example, those compiled by the
OpenCL language runtime) may be embedded into the DWARF Version 5 line table.
See DWARF Version 5 section 6.2.4.1 which is updated by the proposal in
:ref:`DW_LNCT_LLVM_source
<amdgpu-dwarf-line-number-information-dw-lnct-llvm-source>`.

The Clang option used to control source embedding in AMDGPU is defined in
:ref:`amdgpu-clang-debug-options-table`.

  .. table:: AMDGPU Clang Debug Options
     :name: amdgpu-clang-debug-options-table

     ==================== ==================================================
     Debug Flag           Description
     ==================== ==================================================
     -g[no-]embed-source  Enable/disable embedding source text in DWARF
                          debug sections. Useful for environments where
                          source cannot be written to disk, such as
                          when performing online compilation.
     ==================== ==================================================

For example:

``-gembed-source``
  Enable the embedded source.

``-gno-embed-source``
  Disable the embedded source.

32-Bit and 64-Bit DWARF Formats
-------------------------------

See DWARF Version 5 section 7.4 and
:ref:`amdgpu-dwarf-32-bit-and-64-bit-dwarf-formats`.

For AMDGPU:

* For the ``amdgcn`` target architecture only the 64-bit process address space
  is supported.

* The producer can generate either 32-bit or 64-bit DWARF format. LLVM
  generates the 32-bit DWARF format.

Unit Headers
------------

For AMDGPU the following values apply for each of the unit headers described in
DWARF Version 5 sections 7.5.1.1, 7.5.1.2, and 7.5.1.3:

``address_size`` (ubyte)
  Matches the address size for the ``Global`` address space defined in
  :ref:`amdgpu-dwarf-address-space-identifier`.

.. _amdgpu-code-conventions:

Code Conventions
================

This section provides code conventions used for each supported target triple OS
(see :ref:`amdgpu-target-triples`).

AMDHSA
------

This section provides code conventions used when the target triple OS is
``amdhsa`` (see :ref:`amdgpu-target-triples`).

.. _amdgpu-amdhsa-code-object-target-identification:

Code Object Target Identification
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The AMDHSA OS uses the following syntax to specify the code object
target as a single string:

  ``<Architecture>-<Vendor>-<OS>-<Environment>-<Processor><Target Features>``

Where:

  - ``<Architecture>``, ``<Vendor>``, ``<OS>`` and ``<Environment>``
    are the same as the *Target Triple* (see
    :ref:`amdgpu-target-triples`).

  - ``<Processor>`` is the same as the *Processor* (see
    :ref:`amdgpu-processors`).

  - ``<Target Features>`` is a list of the enabled *Target Features*
    (see :ref:`amdgpu-target-features`), each prefixed by a plus, that
    apply to *Processor*. The list must be in the same order as listed
    in the table :ref:`amdgpu-target-feature-table`. Note that *Target
    Features* must be included in the list if they are enabled even if
    that is the default for *Processor*.

For example:

  ``"amdgcn-amd-amdhsa--gfx902+xnack"``

.. _amdgpu-amdhsa-code-object-metadata:

Code Object Metadata
~~~~~~~~~~~~~~~~~~~~

The code object metadata specifies extensible metadata associated with the code
objects executed on HSA [HSA]_ compatible runtimes such as AMD's ROCm
[AMD-ROCm]_. The encoding and semantics of this metadata depend on the code
object version; see :ref:`amdgpu-amdhsa-code-object-metadata-v2` and
:ref:`amdgpu-amdhsa-code-object-metadata-v3`.

Code object metadata is specified in a note record (see
:ref:`amdgpu-note-records`) and is required when the target triple OS is
``amdhsa`` (see :ref:`amdgpu-target-triples`). It must contain the minimum
information necessary to support the ROCm kernel queries. For example, the
segment sizes needed in a dispatch packet. In addition, a high-level language
runtime may require other information to be included. For example, the AMD
OpenCL runtime records kernel argument information.

.. _amdgpu-amdhsa-code-object-metadata-v2:

Code Object V2 Metadata (-mattr=-code-object-v3)
++++++++++++++++++++++++++++++++++++++++++++++++

.. warning:: Code Object V2 is not the default code object version emitted by
   this version of LLVM. For a description of the metadata generated with the
   default configuration (Code Object V3) see
   :ref:`amdgpu-amdhsa-code-object-metadata-v3`.

Code object V2 metadata is specified by the ``NT_AMD_AMDGPU_METADATA`` note
record (see :ref:`amdgpu-note-records-v2`).

The metadata is specified as a YAML formatted string (see [YAML]_ and
:doc:`YamlIO`).

.. TODO::

   Is the string null terminated? It probably should not be if YAML allows it
   to contain null characters, otherwise it should be.

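
The ``Printf`` entries described in the metadata maps below encode each call
as colon-separated fields of the form ``ID:N:S[0]:...:S[N-1]:FormatString``.
A sketch of decoding one such record (a hypothetical helper, not part of any
AMD runtime):

```python
def parse_printf_record(record):
    """Split an encoded printf metadata record into (id, arg_sizes, format)."""
    fields = record.split(":")
    printf_id = int(fields[0])
    nargs = int(fields[1])               # number of printf arguments minus 1
    sizes = [int(s) for s in fields[2:2 + nargs]]
    # The format string may itself contain ':', so rejoin the remainder.
    fmt = ":".join(fields[2 + nargs:])
    return printf_id, sizes, fmt
```

For instance, ``parse_printf_record("1:2:4:8:value %d:%ld")`` yields ID 1,
argument sizes ``[4, 8]``, and the format string ``"value %d:%ld"``, with the
embedded colon in the format string preserved.
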
The metadata is represented as a single YAML document comprising the mapping
defined in table :ref:`amdgpu-amdhsa-code-object-metadata-map-table-v2` and
referenced tables.

For boolean values, the string values of ``false`` and ``true`` are used for
false and true respectively.

Additional information can be added to the mappings. To avoid conflicts, any
non-AMD key names should be prefixed by "*vendor-name*.".

  .. table:: AMDHSA Code Object V2 Metadata Map
     :name: amdgpu-amdhsa-code-object-metadata-map-table-v2

     ========== ============== ========= =======================================
     String Key Value Type     Required? Description
     ========== ============== ========= =======================================
     "Version"  sequence of    Required  - The first integer is the major
                2 integers                 version. Currently 1.
                                         - The second integer is the minor
                                           version. Currently 0.
     "Printf"   sequence of              Each string is encoded information
                strings                  about a printf function call. The
                                         encoded information is organized as
                                         fields separated by colon (':'):

                                         ``ID:N:S[0]:S[1]:...:S[N-1]:FormatString``

                                         where:

                                         ``ID``
                                           A 32-bit integer as a unique id for
                                           each printf function call

                                         ``N``
                                           A 32-bit integer equal to the number
                                           of arguments of the printf function
                                           call minus 1

                                         ``S[i]`` (where i = 0, 1, ..., N-1)
                                           32-bit integers for the size in bytes
                                           of the i-th FormatString argument of
                                           the printf function call

                                         ``FormatString``
                                           The format string passed to the
                                           printf function call.
     "Kernels"  sequence of    Required  Sequence of the mappings for each
                mapping                  kernel in the code object. See
                                         :ref:`amdgpu-amdhsa-code-object-kernel-metadata-map-table-v2`
                                         for the definition of the mapping.
     ========== ============== ========= =======================================

..

  .. table:: AMDHSA Code Object V2 Kernel Metadata Map
     :name: amdgpu-amdhsa-code-object-kernel-metadata-map-table-v2

     ================= ============== ========= ================================
     String Key        Value Type     Required? Description
     ================= ============== ========= ================================
     "Name"            string         Required  Source name of the kernel.
     "SymbolName"      string         Required  Name of the kernel
                                                descriptor ELF symbol.
     "Language"        string                   Source language of the kernel.
                                                Values include:

                                                - "OpenCL C"
                                                - "OpenCL C++"
                                                - "HCC"
                                                - "OpenMP"

     "LanguageVersion" sequence of              - The first integer is the major
                       2 integers                 version.
                                                - The second integer is the
                                                  minor version.
     "Attrs"           mapping                  Mapping of kernel attributes.
                                                See
                                                :ref:`amdgpu-amdhsa-code-object-kernel-attribute-metadata-map-table-v2`
                                                for the mapping definition.
     "Args"            sequence of              Sequence of mappings of the
                       mapping                  kernel arguments. See
                                                :ref:`amdgpu-amdhsa-code-object-kernel-argument-metadata-map-table-v2`
                                                for the definition of the
                                                mapping.
     "CodeProps"       mapping                  Mapping of properties related to
                                                the kernel code. See
                                                :ref:`amdgpu-amdhsa-code-object-kernel-code-properties-metadata-map-table-v2`
                                                for the mapping definition.
     ================= ============== ========= ================================

..

  .. table:: AMDHSA Code Object V2 Kernel Attribute Metadata Map
     :name: amdgpu-amdhsa-code-object-kernel-attribute-metadata-map-table-v2

     =================== ============== ========= ==============================
     String Key          Value Type     Required? Description
     =================== ============== ========= ==============================
     "ReqdWorkGroupSize" sequence of              If not 0, 0, 0 then all values
                         3 integers               must be >=1 and the dispatch
                                                  work-group size X, Y, Z must
                                                  correspond to the specified
                                                  values. Defaults to 0, 0, 0.

                                                  Corresponds to the OpenCL
                                                  ``reqd_work_group_size``
                                                  attribute.
     "WorkGroupSizeHint" sequence of              The dispatch work-group size
                         3 integers               X, Y, Z is likely to be the
                                                  specified values.

                                                  Corresponds to the OpenCL
                                                  ``work_group_size_hint``
                                                  attribute.
     "VecTypeHint"       string                   The name of a scalar or vector
                                                  type.

                                                  Corresponds to the OpenCL
                                                  ``vec_type_hint`` attribute.
     "RuntimeHandle"     string                   The external symbol name
                                                  associated with a kernel.
                                                  The OpenCL runtime allocates a
                                                  global buffer for the symbol
                                                  and saves the kernel's address
                                                  to it, which is used for
                                                  device side enqueueing. Only
                                                  available for device side
                                                  enqueued kernels.
     =================== ============== ========= ==============================

..

  .. table:: AMDHSA Code Object V2 Kernel Argument Metadata Map
     :name: amdgpu-amdhsa-code-object-kernel-argument-metadata-map-table-v2

     ================= ============== ========= ================================
     String Key        Value Type     Required? Description
     ================= ============== ========= ================================
     "Name"            string                   Kernel argument name.
     "TypeName"        string                   Kernel argument type name.
     "Size"            integer        Required  Kernel argument size in bytes.
     "Align"           integer        Required  Kernel argument alignment in
                                                bytes. Must be a power of two.
     "ValueKind"       string         Required  Kernel argument kind that
                                                specifies how to set up the
                                                corresponding argument.
                                                Values include:

                                                "ByValue"
                                                  The argument is copied
                                                  directly into the kernarg.

                                                "GlobalBuffer"
                                                  A global address space pointer
                                                  to the buffer data is passed
                                                  in the kernarg.

                                                "DynamicSharedPointer"
                                                  A group address space pointer
                                                  to dynamically allocated LDS
                                                  is passed in the kernarg.

                                                "Sampler"
                                                  A global address space
                                                  pointer to an S# is passed in
                                                  the kernarg.

                                                "Image"
                                                  A global address space
                                                  pointer to a T# is passed in
                                                  the kernarg.

                                                "Pipe"
                                                  A global address space pointer
                                                  to an OpenCL pipe is passed in
                                                  the kernarg.

                                                "Queue"
                                                  A global address space pointer
                                                  to an OpenCL device enqueue
                                                  queue is passed in the
                                                  kernarg.

                                                "HiddenGlobalOffsetX"
                                                  The OpenCL grid dispatch
                                                  global offset for the X
                                                  dimension is passed in the
                                                  kernarg.

                                                "HiddenGlobalOffsetY"
                                                  The OpenCL grid dispatch
                                                  global offset for the Y
                                                  dimension is passed in the
                                                  kernarg.

                                                "HiddenGlobalOffsetZ"
                                                  The OpenCL grid dispatch
                                                  global offset for the Z
                                                  dimension is passed in the
                                                  kernarg.

                                                "HiddenNone"
                                                  An argument that is not used
                                                  by the kernel. Space needs to
                                                  be left for it, but it does
                                                  not need to be set up.

                                                "HiddenPrintfBuffer"
                                                  A global address space pointer
                                                  to the runtime printf buffer
                                                  is passed in the kernarg.

                                                "HiddenHostcallBuffer"
                                                  A global address space pointer
                                                  to the runtime hostcall buffer
                                                  is passed in the kernarg.

                                                "HiddenDefaultQueue"
                                                  A global address space pointer
                                                  to the OpenCL device enqueue
                                                  queue that should be used by
                                                  the kernel by default is
                                                  passed in the kernarg.

                                                "HiddenCompletionAction"
                                                  A global address space pointer
                                                  used to help link enqueued
                                                  kernels into the ancestor tree
                                                  for determining when the
                                                  parent kernel has finished.

                                                "HiddenMultiGridSyncArg"
                                                  A global address space pointer
                                                  for multi-grid synchronization
                                                  is passed in the kernarg.

     "ValueType"       string                   Unused and deprecated. This
                                                should no longer be emitted, but
                                                is accepted for compatibility.

     "PointeeAlign"    integer                  Alignment in bytes of the
                                                pointee type for a pointer type
                                                kernel argument. Must be a power
                                                of 2. Only present if
                                                "ValueKind" is
                                                "DynamicSharedPointer".
     "AddrSpaceQual"   string                   Kernel argument address space
                                                qualifier. Only present if
                                                "ValueKind" is "GlobalBuffer" or
                                                "DynamicSharedPointer". Values
                                                are:

                                                - "Private"
                                                - "Global"
                                                - "Constant"
                                                - "Local"
                                                - "Generic"
                                                - "Region"

                                                .. TODO::
                                                   Is GlobalBuffer only Global
                                                   or Constant? Is
                                                   DynamicSharedPointer always
                                                   Local? Can HCC allow Generic?
                                                   How can Private or Region
                                                   ever happen?
     "AccQual"         string                   Kernel argument access
                                                qualifier. Only present if
                                                "ValueKind" is "Image" or
                                                "Pipe". Values are:

                                                - "ReadOnly"
                                                - "WriteOnly"
                                                - "ReadWrite"

                                                .. TODO::
                                                   Does this apply to
                                                   GlobalBuffer?
     "ActualAccQual"   string                   The actual memory accesses
                                                performed by the kernel on the
                                                kernel argument. Only present if
                                                "ValueKind" is "GlobalBuffer",
                                                "Image", or "Pipe". This may be
                                                more restrictive than indicated
                                                by "AccQual" to reflect what the
                                                kernel actually does. If not
                                                present then the runtime must
                                                assume what is implied by
                                                "AccQual" and "IsConst". Values
                                                are:

                                                - "ReadOnly"
                                                - "WriteOnly"
                                                - "ReadWrite"

     "IsConst"         boolean                  Indicates if the kernel argument
                                                is const qualified. Only present
                                                if "ValueKind" is
                                                "GlobalBuffer".

     "IsRestrict"      boolean                  Indicates if the kernel argument
                                                is restrict qualified. Only
                                                present if "ValueKind" is
                                                "GlobalBuffer".

     "IsVolatile"      boolean                  Indicates if the kernel argument
                                                is volatile qualified. Only
                                                present if "ValueKind" is
                                                "GlobalBuffer".

     "IsPipe"          boolean                  Indicates if the kernel argument
                                                is pipe qualified. Only present
                                                if "ValueKind" is "Pipe".

                                                .. TODO::
                                                   Can GlobalBuffer be pipe
                                                   qualified?
     ================= ============== ========= ================================

..

  .. table:: AMDHSA Code Object V2 Kernel Code Properties Metadata Map
     :name: amdgpu-amdhsa-code-object-kernel-code-properties-metadata-map-table-v2

     ============================ ============== ========= =====================
     String Key                   Value Type     Required? Description
     ============================ ============== ========= =====================
     "KernargSegmentSize"         integer        Required  The size in bytes of
                                                           the kernarg segment
                                                           that holds the values
                                                           of the arguments to
                                                           the kernel.
     "GroupSegmentFixedSize"      integer        Required  The amount of group
                                                           segment memory
                                                           required by a
                                                           work-group in
                                                           bytes. This does not
                                                           include any
                                                           dynamically allocated
                                                           group segment memory
                                                           that may be added
                                                           when the kernel is
                                                           dispatched.
     "PrivateSegmentFixedSize"    integer        Required  The amount of fixed
                                                           private address space
                                                           memory required for a
                                                           work-item in
                                                           bytes. If the kernel
                                                           uses a dynamic call
                                                           stack then additional
                                                           space must be added
                                                           to this value for the
                                                           call stack.
     "KernargSegmentAlign"        integer        Required  The maximum byte
                                                           alignment of
                                                           arguments in the
                                                           kernarg segment. Must
                                                           be a power of 2.
     "WavefrontSize"              integer        Required  Wavefront size. Must
                                                           be a power of 2.
     "NumSGPRs"                   integer        Required  Number of scalar
                                                           registers used by a
                                                           wavefront for
                                                           GFX6-GFX10. This
                                                           includes the special
                                                           SGPRs for VCC, Flat
                                                           Scratch (GFX7-GFX10)
                                                           and XNACK (for
                                                           GFX8-GFX10). It does
                                                           not include the 16
                                                           SGPRs added if a trap
                                                           handler is
                                                           enabled. It is not
                                                           rounded up to the
                                                           allocation
                                                           granularity.
     "NumVGPRs"                   integer        Required  Number of vector
                                                           registers used by
                                                           each work-item for
                                                           GFX6-GFX10.
     "MaxFlatWorkGroupSize"       integer        Required  Maximum flat
                                                           work-group size
                                                           supported by the
                                                           kernel in work-items.
                                                           Must be >=1 and
                                                           consistent with
                                                           ReqdWorkGroupSize if
                                                           not 0, 0, 0.
     "NumSpilledSGPRs"            integer                  Number of stores from
                                                           a scalar register to
                                                           a register allocator
                                                           created spill
                                                           location.
     "NumSpilledVGPRs"            integer                  Number of stores from
                                                           a vector register to
                                                           a register allocator
                                                           created spill
                                                           location.
     ============================ ============== ========= =====================

.. _amdgpu-amdhsa-code-object-metadata-v3:

Code Object V3 Metadata (-mattr=+code-object-v3)
++++++++++++++++++++++++++++++++++++++++++++++++

Code object V3 metadata is specified by the ``NT_AMDGPU_METADATA`` note record
(see :ref:`amdgpu-note-records-v3`).

The metadata is represented as Message Pack formatted binary data (see
[MsgPack]_). The top level is a Message Pack map that includes the
keys defined in table
:ref:`amdgpu-amdhsa-code-object-metadata-map-table-v3` and referenced
tables.

Additional information can be added to the maps. To avoid conflicts,
any key names should be prefixed by "*vendor-name*." where
``vendor-name`` can be the name of the vendor and the specific vendor
tool that generates the information. The prefix is abbreviated to
simply "." when it appears within a map that has been added by the
same *vendor-name*.

  .. table:: AMDHSA Code Object V3 Metadata Map
     :name: amdgpu-amdhsa-code-object-metadata-map-table-v3

     ================= ============== ========= =======================================
     String Key        Value Type     Required? Description
     ================= ============== ========= =======================================
     "amdhsa.version"  sequence of    Required  - The first integer is the major
                       2 integers                 version. Currently 1.
                                                - The second integer is the minor
                                                  version. Currently 0.
     "amdhsa.printf"   sequence of              Each string is encoded information
                       strings                  about a printf function call. The
                                                encoded information is organized as
                                                fields separated by colon (':'):

                                                ``ID:N:S[0]:S[1]:...:S[N-1]:FormatString``

                                                where:

                                                ``ID``
                                                  A 32-bit integer as a unique id for
                                                  each printf function call

                                                ``N``
                                                  A 32-bit integer equal to the number
                                                  of arguments of the printf function
                                                  call minus 1

                                                ``S[i]`` (where i = 0, 1, ..., N-1)
                                                  32-bit integers for the size in bytes
                                                  of the i-th FormatString argument of
                                                  the printf function call

                                                ``FormatString``
                                                  The format string passed to the
                                                  printf function call.
     "amdhsa.kernels"  sequence of    Required  Sequence of the maps for each
                       map                      kernel in the code object. See
                                                :ref:`amdgpu-amdhsa-code-object-kernel-metadata-map-table-v3`
                                                for the definition of the keys included
                                                in that map.
     ================= ============== ========= =======================================

..

  .. table:: AMDHSA Code Object V3 Kernel Metadata Map
     :name: amdgpu-amdhsa-code-object-kernel-metadata-map-table-v3

     =================================== ============== ========= ================================
     String Key                          Value Type     Required? Description
     =================================== ============== ========= ================================
     ".name"                             string         Required  Source name of the kernel.
     ".symbol"                           string         Required  Name of the kernel
                                                                  descriptor ELF symbol.
     ".language"                         string                   Source language of the kernel.
                                                                  Values include:

                                                                  - "OpenCL C"
                                                                  - "OpenCL C++"
                                                                  - "HCC"
                                                                  - "HIP"
                                                                  - "OpenMP"
                                                                  - "Assembler"

     ".language_version"                 sequence of              - The first integer is the major
                                         2 integers                 version.
                                                                  - The second integer is the
                                                                    minor version.
     ".args"                             sequence of              Sequence of maps of the
                                         map                      kernel arguments. See
                                                                  :ref:`amdgpu-amdhsa-code-object-kernel-argument-metadata-map-table-v3`
                                                                  for the definition of the keys
                                                                  included in that map.
     ".reqd_workgroup_size"              sequence of              If not 0, 0, 0 then all values
                                         3 integers               must be >=1 and the dispatch
                                                                  work-group size X, Y, Z must
                                                                  correspond to the specified
                                                                  values. Defaults to 0, 0, 0.

                                                                  Corresponds to the OpenCL
                                                                  ``reqd_work_group_size``
                                                                  attribute.
     ".workgroup_size_hint"              sequence of              The dispatch work-group size
                                         3 integers               X, Y, Z is likely to be the
                                                                  specified values.

                                                                  Corresponds to the OpenCL
                                                                  ``work_group_size_hint``
                                                                  attribute.
     ".vec_type_hint"                    string                   The name of a scalar or vector
                                                                  type.

                                                                  Corresponds to the OpenCL
                                                                  ``vec_type_hint`` attribute.
     ".device_enqueue_symbol"            string                   The external symbol name
                                                                  associated with a kernel.
                                                                  The OpenCL runtime allocates a
                                                                  global buffer for the symbol
                                                                  and saves the kernel's address
                                                                  to it, which is used for
                                                                  device side enqueueing. Only
                                                                  available for device side
                                                                  enqueued kernels.
     ".kernarg_segment_size"             integer        Required  The size in bytes of
                                                                  the kernarg segment
                                                                  that holds the values
                                                                  of the arguments to
                                                                  the kernel.
     ".group_segment_fixed_size"         integer        Required  The amount of group
                                                                  segment memory
                                                                  required by a
                                                                  work-group in
                                                                  bytes. This does not
                                                                  include any
                                                                  dynamically allocated
                                                                  group segment memory
                                                                  that may be added
                                                                  when the kernel is
                                                                  dispatched.
2627 ".private_segment_fixed_size" integer Required The amount of fixed 2628 private address space 2629 memory required for a 2630 work-item in 2631 bytes. If the kernel 2632 uses a dynamic call 2633 stack then additional 2634 space must be added 2635 to this value for the 2636 call stack. 2637 ".kernarg_segment_align" integer Required The maximum byte 2638 alignment of 2639 arguments in the 2640 kernarg segment. Must 2641 be a power of 2. 2642 ".wavefront_size" integer Required Wavefront size. Must 2643 be a power of 2. 2644 ".sgpr_count" integer Required Number of scalar 2645 registers required by a 2646 wavefront for 2647 GFX6-GFX9. A register 2648 is required if it is 2649 used explicitly, or 2650 if a higher numbered 2651 register is used 2652 explicitly. This 2653 includes the special 2654 SGPRs for VCC, Flat 2655 Scratch (GFX7-GFX9) 2656 and XNACK (for 2657 GFX8-GFX9). It does 2658 not include the 16 2659 SGPR added if a trap 2660 handler is 2661 enabled. It is not 2662 rounded up to the 2663 allocation 2664 granularity. 2665 ".vgpr_count" integer Required Number of vector 2666 registers required by 2667 each work-item for 2668 GFX6-GFX9. A register 2669 is required if it is 2670 used explicitly, or 2671 if a higher numbered 2672 register is used 2673 explicitly. 2674 ".max_flat_workgroup_size" integer Required Maximum flat 2675 work-group size 2676 supported by the 2677 kernel in work-items. 2678 Must be >=1 and 2679 consistent with 2680 ReqdWorkGroupSize if 2681 not 0, 0, 0. 2682 ".sgpr_spill_count" integer Number of stores from 2683 a scalar register to 2684 a register allocator 2685 created spill 2686 location. 2687 ".vgpr_spill_count" integer Number of stores from 2688 a vector register to 2689 a register allocator 2690 created spill 2691 location. 2692 =================================== ============== ========= ================================ 2693 2694.. 2695 2696 .. 
table:: AMDHSA Code Object V3 Kernel Argument Metadata Map 2697 :name: amdgpu-amdhsa-code-object-kernel-argument-metadata-map-table-v3 2698 2699 ====================== ============== ========= ================================ 2700 String Key Value Type Required? Description 2701 ====================== ============== ========= ================================ 2702 ".name" string Kernel argument name. 2703 ".type_name" string Kernel argument type name. 2704 ".size" integer Required Kernel argument size in bytes. 2705 ".offset" integer Required Kernel argument offset in 2706 bytes. The offset must be a 2707 multiple of the alignment 2708 required by the argument. 2709 ".value_kind" string Required Kernel argument kind that 2710 specifies how to set up the 2711 corresponding argument. 2712 Values include: 2713 2714 "by_value" 2715 The argument is copied 2716 directly into the kernarg. 2717 2718 "global_buffer" 2719 A global address space pointer 2720 to the buffer data is passed 2721 in the kernarg. 2722 2723 "dynamic_shared_pointer" 2724 A group address space pointer 2725 to dynamically allocated LDS 2726 is passed in the kernarg. 2727 2728 "sampler" 2729 A global address space 2730 pointer to a S# is passed in 2731 the kernarg. 2732 2733 "image" 2734 A global address space 2735 pointer to a T# is passed in 2736 the kernarg. 2737 2738 "pipe" 2739 A global address space pointer 2740 to an OpenCL pipe is passed in 2741 the kernarg. 2742 2743 "queue" 2744 A global address space pointer 2745 to an OpenCL device enqueue 2746 queue is passed in the 2747 kernarg. 2748 2749 "hidden_global_offset_x" 2750 The OpenCL grid dispatch 2751 global offset for the X 2752 dimension is passed in the 2753 kernarg. 2754 2755 "hidden_global_offset_y" 2756 The OpenCL grid dispatch 2757 global offset for the Y 2758 dimension is passed in the 2759 kernarg. 2760 2761 "hidden_global_offset_z" 2762 The OpenCL grid dispatch 2763 global offset for the Z 2764 dimension is passed in the 2765 kernarg. 
2766 2767 "hidden_none" 2768 An argument that is not used 2769 by the kernel. Space needs to 2770 be left for it, but it does 2771 not need to be set up. 2772 2773 "hidden_printf_buffer" 2774 A global address space pointer 2775 to the runtime printf buffer 2776 is passed in kernarg. 2777 2778 "hidden_hostcall_buffer" 2779 A global address space pointer 2780 to the runtime hostcall buffer 2781 is passed in kernarg. 2782 2783 "hidden_default_queue" 2784 A global address space pointer 2785 to the OpenCL device enqueue 2786 queue that should be used by 2787 the kernel by default is 2788 passed in the kernarg. 2789 2790 "hidden_completion_action" 2791 A global address space pointer 2792 to help link enqueued kernels into 2793 the ancestor tree for determining 2794 when the parent kernel has finished. 2795 2796 "hidden_multigrid_sync_arg" 2797 A global address space pointer for 2798 multi-grid synchronization is 2799 passed in the kernarg. 2800 2801 ".value_type" string Unused and deprecated. This should no longer 2802 be emitted, but is accepted for compatibility. 2803 2804 ".pointee_align" integer Alignment in bytes of pointee 2805 type for pointer type kernel 2806 argument. Must be a power 2807 of 2. Only present if 2808 ".value_kind" is 2809 "dynamic_shared_pointer". 2810 ".address_space" string Kernel argument address space 2811 qualifier. Only present if 2812 ".value_kind" is "global_buffer" or 2813 "dynamic_shared_pointer". Values 2814 are: 2815 2816 - "private" 2817 - "global" 2818 - "constant" 2819 - "local" 2820 - "generic" 2821 - "region" 2822 2823 .. TODO:: 2824 Is "global_buffer" only "global" 2825 or "constant"? Is 2826 "dynamic_shared_pointer" always 2827 "local"? Can HCC allow "generic"? 2828 How can "private" or "region" 2829 ever happen? 2830 ".access" string Kernel argument access 2831 qualifier. Only present if 2832 ".value_kind" is "image" or 2833 "pipe". Values 2834 are: 2835 2836 - "read_only" 2837 - "write_only" 2838 - "read_write" 2839 2840 .. 
TODO:: 2841 Does this apply to 2842 "global_buffer"? 2843 ".actual_access" string The actual memory accesses 2844 performed by the kernel on the 2845 kernel argument. Only present if 2846 ".value_kind" is "global_buffer", 2847 "image", or "pipe". This may be 2848 more restrictive than indicated 2849 by ".access" to reflect what the 2850 kernel actual does. If not 2851 present then the runtime must 2852 assume what is implied by 2853 ".access" and ".is_const" . Values 2854 are: 2855 2856 - "read_only" 2857 - "write_only" 2858 - "read_write" 2859 2860 ".is_const" boolean Indicates if the kernel argument 2861 is const qualified. Only present 2862 if ".value_kind" is 2863 "global_buffer". 2864 2865 ".is_restrict" boolean Indicates if the kernel argument 2866 is restrict qualified. Only 2867 present if ".value_kind" is 2868 "global_buffer". 2869 2870 ".is_volatile" boolean Indicates if the kernel argument 2871 is volatile qualified. Only 2872 present if ".value_kind" is 2873 "global_buffer". 2874 2875 ".is_pipe" boolean Indicates if the kernel argument 2876 is pipe qualified. Only present 2877 if ".value_kind" is "pipe". 2878 2879 .. TODO:: 2880 Can "global_buffer" be pipe 2881 qualified? 2882 ====================== ============== ========= ================================ 2883 2884.. 2885 2886Kernel Dispatch 2887~~~~~~~~~~~~~~~ 2888 2889The HSA architected queuing language (AQL) defines a user space memory 2890interface that can be used to control the dispatch of kernels, in an agent 2891independent way. An agent can have zero or more AQL queues created for it using 2892the ROCm runtime, in which AQL packets (all of which are 64 bytes) can be 2893placed. See the *HSA Platform System Architecture Specification* [HSA]_ for the 2894AQL queue mechanics and packet layouts. 2895 2896The packet processor of a kernel agent is responsible for detecting and 2897dispatching HSA kernels from the AQL queues associated with it. 
For AMD GPUs the 2898packet processor is implemented by the hardware command processor (CP), 2899asynchronous dispatch controller (ADC) and shader processor input controller 2900(SPI). 2901 2902The ROCm runtime can be used to allocate an AQL queue object. It uses the kernel 2903mode driver to initialize and register the AQL queue with CP. 2904 2905To dispatch a kernel the following actions are performed. This can occur in the 2906CPU host program, or from an HSA kernel executing on a GPU. 2907 29081. A pointer to an AQL queue for the kernel agent on which the kernel is to be 2909 executed is obtained. 29102. A pointer to the kernel descriptor (see 2911 :ref:`amdgpu-amdhsa-kernel-descriptor`) of the kernel to execute is obtained. 2912 It must be for a kernel that is contained in a code object that that was 2913 loaded by the ROCm runtime on the kernel agent with which the AQL queue is 2914 associated. 29153. Space is allocated for the kernel arguments using the ROCm runtime allocator 2916 for a memory region with the kernarg property for the kernel agent that will 2917 execute the kernel. It must be at least 16-byte aligned. 29184. Kernel argument values are assigned to the kernel argument memory 2919 allocation. The layout is defined in the *HSA Programmer's Language 2920 Reference* [HSA]_. For AMDGPU the kernel execution directly accesses the 2921 kernel argument memory in the same way constant memory is accessed. (Note 2922 that the HSA specification allows an implementation to copy the kernel 2923 argument contents to another location that is accessed by the kernel.) 29245. An AQL kernel dispatch packet is created on the AQL queue. The ROCm runtime 2925 api uses 64-bit atomic operations to reserve space in the AQL queue for the 2926 packet. The packet must be set up, and the final write must use an atomic 2927 store release to set the packet kind to ensure the packet contents are 2928 visible to the kernel agent. 
AQL defines a doorbell signal mechanism to 2929 notify the kernel agent that the AQL queue has been updated. These rules, and 2930 the layout of the AQL queue and kernel dispatch packet is defined in the *HSA 2931 System Architecture Specification* [HSA]_. 29326. A kernel dispatch packet includes information about the actual dispatch, 2933 such as grid and work-group size, together with information from the code 2934 object about the kernel, such as segment sizes. The ROCm runtime queries on 2935 the kernel symbol can be used to obtain the code object values which are 2936 recorded in the :ref:`amdgpu-amdhsa-code-object-metadata`. 29377. CP executes micro-code and is responsible for detecting and setting up the 2938 GPU to execute the wavefronts of a kernel dispatch. 29398. CP ensures that when the a wavefront starts executing the kernel machine 2940 code, the scalar general purpose registers (SGPR) and vector general purpose 2941 registers (VGPR) are set up as required by the machine code. The required 2942 setup is defined in the :ref:`amdgpu-amdhsa-kernel-descriptor`. The initial 2943 register state is defined in 2944 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`. 29459. The prolog of the kernel machine code (see 2946 :ref:`amdgpu-amdhsa-kernel-prolog`) sets up the machine state as necessary 2947 before continuing executing the machine code that corresponds to the kernel. 294810. When the kernel dispatch has completed execution, CP signals the completion 2949 signal specified in the kernel dispatch packet if not 0. 2950 2951Image and Samplers 2952~~~~~~~~~~~~~~~~~~ 2953 2954Image and sample handles created by the ROCm runtime are 64-bit addresses of a 2955hardware 32-byte V# and 48 byte S# object respectively. In order to support the 2956HSA ``query_sampler`` operations two extra dwords are used to store the HSA BRIG 2957enumeration values for the queries that are not trivially deducible from the S# 2958representation. 
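
The handle layout just described can be sketched in Python, assuming only the
sizes stated above (a 48-byte S# followed by two query dwords); the helper
names are illustrative and not part of the ROCm runtime API:

```python
import struct

# Sizes stated in the text above: the hardware V# is 32 bytes, the S# is
# 48 bytes, and two extra dwords follow the S# to hold the HSA BRIG
# enumeration values for query_sampler.
VSHARP_SIZE = 32     # image vector descriptor (V#)
SSHARP_SIZE = 48     # sampler descriptor (S#)
QUERY_DWORDS = 2     # extra dwords appended for query_sampler support

def sampler_object_size():
    # Total size of the backing allocation a 64-bit sampler handle
    # points at (a sketch; the exact layout is runtime-defined).
    return SSHARP_SIZE + QUERY_DWORDS * 4

def read_query_dwords(blob):
    # Extract the two query dwords stored immediately after the S#.
    return struct.unpack_from("<2I", blob, SSHARP_SIZE)
```
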

HSA Signals
~~~~~~~~~~~

HSA signal handles created by the ROCm runtime are 64-bit addresses of a
structure allocated in memory accessible from both the CPU and GPU. The
structure is defined by the ROCm runtime and subject to change between releases
(see [AMD-ROCm-github]_).

.. _amdgpu-amdhsa-hsa-aql-queue:

HSA AQL Queue
~~~~~~~~~~~~~

The HSA AQL queue structure is defined by the ROCm runtime and subject to change
between releases (see [AMD-ROCm-github]_). For some processors it contains
fields needed to implement certain language features such as the flat address
aperture bases. It also contains fields used by CP, such as those for managing
the allocation of scratch memory.

.. _amdgpu-amdhsa-kernel-descriptor:

Kernel Descriptor
~~~~~~~~~~~~~~~~~

A kernel descriptor consists of the information needed by CP to initiate the
execution of a kernel, including the entry point address of the machine code
that implements the kernel.

Kernel Descriptor for GFX6-GFX10
++++++++++++++++++++++++++++++++

CP microcode requires the kernel descriptor to be allocated on 64-byte
alignment.

  .. table:: Kernel Descriptor for GFX6-GFX10
     :name: amdgpu-amdhsa-kernel-descriptor-gfx6-gfx10-table

     ======= ======= =============================== ============================
     Bits    Size    Field Name                      Description
     ======= ======= =============================== ============================
     31:0    4 bytes GROUP_SEGMENT_FIXED_SIZE        The amount of fixed local
                                                     address space memory
                                                     required for a work-group
                                                     in bytes. This does not
                                                     include any dynamically
                                                     allocated local address
                                                     space memory that may be
                                                     added when the kernel is
                                                     dispatched.
     63:32   4 bytes PRIVATE_SEGMENT_FIXED_SIZE      The amount of fixed
                                                     private address space
                                                     memory required for a
                                                     work-item in bytes. If
                                                     is_dynamic_callstack is 1
                                                     then additional space must
                                                     be added to this value for
                                                     the call stack.
     127:64  8 bytes Reserved, must be 0.
     191:128 8 bytes KERNEL_CODE_ENTRY_BYTE_OFFSET   Byte offset (possibly
                                                     negative) from base
                                                     address of kernel
                                                     descriptor to kernel's
                                                     entry point instruction
                                                     which must be 256 byte
                                                     aligned.
     351:192 20      Reserved, must be 0.
             bytes
     383:352 4 bytes COMPUTE_PGM_RSRC3               GFX6-GFX9
                                                       Reserved, must be 0.
                                                     GFX10
                                                       Compute Shader (CS)
                                                       program settings used by
                                                       CP to set up
                                                       ``COMPUTE_PGM_RSRC3``
                                                       configuration
                                                       register. See
                                                       :ref:`amdgpu-amdhsa-compute_pgm_rsrc3-gfx10-table`.
     415:384 4 bytes COMPUTE_PGM_RSRC1               Compute Shader (CS)
                                                     program settings used by
                                                     CP to set up
                                                     ``COMPUTE_PGM_RSRC1``
                                                     configuration
                                                     register. See
                                                     :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
     447:416 4 bytes COMPUTE_PGM_RSRC2               Compute Shader (CS)
                                                     program settings used by
                                                     CP to set up
                                                     ``COMPUTE_PGM_RSRC2``
                                                     configuration
                                                     register. See
                                                     :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
     448     1 bit   ENABLE_SGPR_PRIVATE_SEGMENT     Enable the setup of the
             _BUFFER                                 SGPR user data registers
                                                     (see
                                                     :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).

                                                     The total number of SGPR
                                                     user data registers
                                                     requested must not exceed
                                                     16 and must match the
                                                     value in
                                                     ``compute_pgm_rsrc2.user_sgpr.user_sgpr_count``.
                                                     Any requests beyond 16
                                                     will be ignored.
     449     1 bit   ENABLE_SGPR_DISPATCH_PTR        *see above*
     450     1 bit   ENABLE_SGPR_QUEUE_PTR           *see above*
     451     1 bit   ENABLE_SGPR_KERNARG_SEGMENT_PTR *see above*
     452     1 bit   ENABLE_SGPR_DISPATCH_ID         *see above*
     453     1 bit   ENABLE_SGPR_FLAT_SCRATCH_INIT   *see above*
     454     1 bit   ENABLE_SGPR_PRIVATE_SEGMENT     *see above*
             _SIZE
     457:455 3 bits  Reserved, must be 0.
     458     1 bit   ENABLE_WAVEFRONT_SIZE32         GFX6-GFX9
                                                       Reserved, must be 0.
                                                     GFX10
                                                       - If 0 execute in
                                                         wavefront size 64 mode.
                                                       - If 1 execute in
                                                         native wavefront size
                                                         32 mode.
     463:459 5 bits  Reserved, must be 0.
     511:464 6 bytes Reserved, must be 0.
     512             **Total size 64 bytes.**
     ======= ====================================================================

..

  .. table:: compute_pgm_rsrc1 for GFX6-GFX10
     :name: amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table

     ======= ======= =============================== ===========================================================================
     Bits    Size    Field Name                      Description
     ======= ======= =============================== ===========================================================================
     5:0     6 bits  GRANULATED_WORKITEM_VGPR_COUNT  Number of vector register
                                                     blocks used by each work-item;
                                                     granularity is device
                                                     specific:

                                                     GFX6-GFX9
                                                       - vgprs_used 0..256
                                                       - max(0, ceil(vgprs_used / 4) - 1)
                                                     GFX10 (wavefront size 64)
                                                       - max_vgpr 1..256
                                                       - max(0, ceil(vgprs_used / 4) - 1)
                                                     GFX10 (wavefront size 32)
                                                       - max_vgpr 1..256
                                                       - max(0, ceil(vgprs_used / 8) - 1)

                                                     Where vgprs_used is defined
                                                     as the highest VGPR number
                                                     explicitly referenced plus
                                                     one.

                                                     Used by CP to set up
                                                     ``COMPUTE_PGM_RSRC1.VGPRS``.

                                                     The :ref:`amdgpu-assembler`
                                                     calculates this
                                                     automatically for the
                                                     selected processor from
                                                     values provided to the
                                                     `.amdhsa_kernel` directive
                                                     by the
                                                     `.amdhsa_next_free_vgpr`
                                                     nested directive (see
                                                     :ref:`amdhsa-kernel-directives-table`).
     9:6     4 bits  GRANULATED_WAVEFRONT_SGPR_COUNT Number of scalar register
                                                     blocks used by a wavefront;
                                                     granularity is device
                                                     specific:

                                                     GFX6-GFX8
                                                       - sgprs_used 0..112
                                                       - max(0, ceil(sgprs_used / 8) - 1)
                                                     GFX9
                                                       - sgprs_used 0..112
                                                       - 2 * max(0, ceil(sgprs_used / 16) - 1)
                                                     GFX10
                                                       Reserved, must be 0.
                                                       (128 SGPRs always
                                                       allocated.)

                                                     Where sgprs_used is
                                                     defined as the highest
                                                     SGPR number explicitly
                                                     referenced plus one, plus
                                                     a target specific number
                                                     of additional special
                                                     SGPRs for VCC,
                                                     FLAT_SCRATCH (GFX7+) and
                                                     XNACK_MASK (GFX8+), and
                                                     any additional
                                                     target specific
                                                     limitations. It does not
                                                     include the 16 SGPRs added
                                                     if a trap handler is
                                                     enabled.

                                                     The target specific
                                                     limitations and special
                                                     SGPR layout are defined in
                                                     the hardware
                                                     documentation, which can
                                                     be found in the
                                                     :ref:`amdgpu-processors`
                                                     table.

                                                     Used by CP to set up
                                                     ``COMPUTE_PGM_RSRC1.SGPRS``.

                                                     The :ref:`amdgpu-assembler`
                                                     calculates this
                                                     automatically for the
                                                     selected processor from
                                                     values provided to the
                                                     `.amdhsa_kernel` directive
                                                     by the
                                                     `.amdhsa_next_free_sgpr`
                                                     and `.amdhsa_reserve_*`
                                                     nested directives (see
                                                     :ref:`amdhsa-kernel-directives-table`).
     11:10   2 bits  PRIORITY                        Must be 0.

                                                     Start executing wavefront
                                                     at the specified priority.

                                                     CP is responsible for
                                                     filling in
                                                     ``COMPUTE_PGM_RSRC1.PRIORITY``.
     13:12   2 bits  FLOAT_ROUND_MODE_32             Wavefront starts execution
                                                     with specified rounding
                                                     mode for single precision
                                                     (32 bit) floating point
                                                     operations.

                                                     Floating point rounding
                                                     mode values are defined in
                                                     :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.

                                                     Used by CP to set up
                                                     ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
     15:14   2 bits  FLOAT_ROUND_MODE_16_64          Wavefront starts execution
                                                     with specified rounding
                                                     mode for half/double
                                                     precision (16 and 64 bit)
                                                     floating point operations.

                                                     Floating point rounding
                                                     mode values are defined in
                                                     :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.

                                                     Used by CP to set up
                                                     ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
     17:16   2 bits  FLOAT_DENORM_MODE_32            Wavefront starts execution
                                                     with specified denorm mode
                                                     for single precision
                                                     (32 bit) floating point
                                                     operations.

                                                     Floating point denorm mode
                                                     values are defined in
                                                     :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.

                                                     Used by CP to set up
                                                     ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
     19:18   2 bits  FLOAT_DENORM_MODE_16_64         Wavefront starts execution
                                                     with specified denorm mode
                                                     for half/double
                                                     precision (16 and 64 bit)
                                                     floating point operations.

                                                     Floating point denorm mode
                                                     values are defined in
                                                     :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.

                                                     Used by CP to set up
                                                     ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
     20      1 bit   PRIV                            Must be 0.

                                                     Start executing wavefront
                                                     in privilege trap handler
                                                     mode.

                                                     CP is responsible for
                                                     filling in
                                                     ``COMPUTE_PGM_RSRC1.PRIV``.
     21      1 bit   ENABLE_DX10_CLAMP               Wavefront starts execution
                                                     with DX10 clamp mode
                                                     enabled. Used by the vector
                                                     ALU to force DX10 style
                                                     treatment of NaNs (when
                                                     set, clamp NaN to zero,
                                                     otherwise pass NaN
                                                     through).

                                                     Used by CP to set up
                                                     ``COMPUTE_PGM_RSRC1.DX10_CLAMP``.
     22      1 bit   DEBUG_MODE                      Must be 0.

                                                     Start executing wavefront
                                                     in single step mode.

                                                     CP is responsible for
                                                     filling in
                                                     ``COMPUTE_PGM_RSRC1.DEBUG_MODE``.
     23      1 bit   ENABLE_IEEE_MODE                Wavefront starts execution
                                                     with IEEE mode
                                                     enabled. Floating point
                                                     opcodes that support
                                                     exception flag gathering
                                                     will quiet and propagate
                                                     signaling-NaN inputs per
                                                     IEEE 754-2008. Min_dx10 and
                                                     max_dx10 become IEEE
                                                     754-2008 compliant due to
                                                     signaling-NaN propagation
                                                     and quieting.

                                                     Used by CP to set up
                                                     ``COMPUTE_PGM_RSRC1.IEEE_MODE``.
     24      1 bit   BULKY                           Must be 0.

                                                     Only one work-group allowed
                                                     to execute on a compute
                                                     unit.

                                                     CP is responsible for
                                                     filling in
                                                     ``COMPUTE_PGM_RSRC1.BULKY``.
     25      1 bit   CDBG_USER                       Must be 0.

                                                     Flag that can be used to
                                                     control debugging code.

                                                     CP is responsible for
                                                     filling in
                                                     ``COMPUTE_PGM_RSRC1.CDBG_USER``.
     26      1 bit   FP16_OVFL                       GFX6-GFX8
                                                       Reserved, must be 0.
                                                     GFX9-GFX10
                                                       Wavefront starts execution
                                                       with specified fp16 overflow
                                                       mode.

                                                       - If 0, fp16 overflow generates
                                                         +/-INF values.
                                                       - If 1, fp16 overflow that is the
                                                         result of a +/-INF input value
                                                         or divide by 0 produces a +/-INF,
                                                         otherwise clamps computed
                                                         overflow to +/-MAX_FP16 as
                                                         appropriate.

                                                     Used by CP to set up
                                                     ``COMPUTE_PGM_RSRC1.FP16_OVFL``.
     28:27   2 bits  Reserved, must be 0.
     29      1 bit   WGP_MODE                        GFX6-GFX9
                                                       Reserved, must be 0.
                                                     GFX10
                                                       - If 0 execute work-groups in
                                                         CU wavefront execution mode.
                                                       - If 1 execute work-groups
                                                         in WGP wavefront execution mode.

                                                     See :ref:`amdgpu-amdhsa-memory-model`.

                                                     Used by CP to set up
                                                     ``COMPUTE_PGM_RSRC1.WGP_MODE``.
     30      1 bit   MEM_ORDERED                     GFX6-GFX9
                                                       Reserved, must be 0.
                                                     GFX10
                                                       Controls the behavior of the
                                                       waitcnt's vmcnt and vscnt
                                                       counters.

                                                       - If 0 vmcnt reports completion
                                                         of load and atomic with return
                                                         out of order with sample
                                                         instructions, and the vscnt
                                                         reports the completion of
                                                         store and atomic without
                                                         return in order.
                                                       - If 1 vmcnt reports completion
                                                         of load, atomic with return
                                                         and sample instructions in
                                                         order, and the vscnt reports
                                                         the completion of store and
                                                         atomic without return in order.

                                                     Used by CP to set up
                                                     ``COMPUTE_PGM_RSRC1.MEM_ORDERED``.
     31      1 bit   FWD_PROGRESS                    GFX6-GFX9
                                                       Reserved, must be 0.
                                                     GFX10
                                                       - If 0 execute SIMD wavefronts
                                                         using oldest first policy.
                                                       - If 1 execute SIMD wavefronts to
                                                         ensure wavefronts will make some
                                                         forward progress.

                                                     Used by CP to set up
                                                     ``COMPUTE_PGM_RSRC1.FWD_PROGRESS``.
     32              **Total size 4 bytes**
     ======= ===================================================================================================================

..

  .. table:: compute_pgm_rsrc2 for GFX6-GFX10
     :name: amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table

     ======= ======= =============================== ===========================================================================
     Bits    Size    Field Name                      Description
     ======= ======= =============================== ===========================================================================
     0       1 bit   ENABLE_SGPR_PRIVATE_SEGMENT     Enable the setup of the
             _WAVEFRONT_OFFSET                       SGPR wavefront scratch offset
                                                     system register (see
                                                     :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).

                                                     Used by CP to set up
                                                     ``COMPUTE_PGM_RSRC2.SCRATCH_EN``.
     5:1     5 bits  USER_SGPR_COUNT                 The total number of SGPR
                                                     user data registers
                                                     requested. This number must
                                                     match the number of user
                                                     data registers enabled.

                                                     Used by CP to set up
                                                     ``COMPUTE_PGM_RSRC2.USER_SGPR``.
     6       1 bit   ENABLE_TRAP_HANDLER             Must be 0.

                                                     This bit represents
                                                     ``COMPUTE_PGM_RSRC2.TRAP_PRESENT``,
                                                     which is set by the CP if
                                                     the runtime has installed a
                                                     trap handler.
     7       1 bit   ENABLE_SGPR_WORKGROUP_ID_X      Enable the setup of the
                                                     system SGPR register for
                                                     the work-group id in the X
                                                     dimension (see
                                                     :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).

                                                     Used by CP to set up
                                                     ``COMPUTE_PGM_RSRC2.TGID_X_EN``.
     8       1 bit   ENABLE_SGPR_WORKGROUP_ID_Y      Enable the setup of the
                                                     system SGPR register for
                                                     the work-group id in the Y
                                                     dimension (see
                                                     :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).

                                                     Used by CP to set up
                                                     ``COMPUTE_PGM_RSRC2.TGID_Y_EN``.
     9       1 bit   ENABLE_SGPR_WORKGROUP_ID_Z      Enable the setup of the
                                                     system SGPR register for
                                                     the work-group id in the Z
                                                     dimension (see
                                                     :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).

                                                     Used by CP to set up
                                                     ``COMPUTE_PGM_RSRC2.TGID_Z_EN``.
     10      1 bit   ENABLE_SGPR_WORKGROUP_INFO      Enable the setup of the
                                                     system SGPR register for
                                                     work-group information (see
                                                     :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).

                                                     Used by CP to set up
                                                     ``COMPUTE_PGM_RSRC2.TGID_SIZE_EN``.
     12:11   2 bits  ENABLE_VGPR_WORKITEM_ID         Enable the setup of the
                                                     VGPR system registers used
                                                     for the work-item ID.
                                                     :ref:`amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table`
                                                     defines the values.

                                                     Used by CP to set up
                                                     ``COMPUTE_PGM_RSRC2.TIDIG_CMP_CNT``.
     13      1 bit   ENABLE_EXCEPTION_ADDRESS_WATCH  Must be 0.

                                                     Wavefront starts execution
                                                     with address watch
                                                     exceptions enabled which
                                                     are generated when L1 has
                                                     witnessed a thread access
                                                     an *address of
                                                     interest*.

                                                     CP is responsible for
                                                     filling in the address
                                                     watch bit in
                                                     ``COMPUTE_PGM_RSRC2.EXCP_EN_MSB``
                                                     according to what the
                                                     runtime requests.
     14      1 bit   ENABLE_EXCEPTION_MEMORY         Must be 0.

                                                     Wavefront starts execution
                                                     with memory violation
                                                     exceptions enabled which
                                                     are generated when a memory
                                                     violation has occurred for
                                                     this wavefront from L1 or
                                                     LDS
                                                     (write-to-read-only-memory,
                                                     mis-aligned atomic, LDS
                                                     address out of range,
                                                     illegal address, etc.).

                                                     CP sets the memory
                                                     violation bit in
                                                     ``COMPUTE_PGM_RSRC2.EXCP_EN_MSB``
                                                     according to what the
                                                     runtime requests.
     23:15   9 bits  GRANULATED_LDS_SIZE             Must be 0.

                                                     CP uses the rounded value
                                                     from the dispatch packet,
                                                     not this value, as the
                                                     dispatch may contain
                                                     dynamically allocated group
                                                     segment memory. CP writes
                                                     directly to
                                                     ``COMPUTE_PGM_RSRC2.LDS_SIZE``.

                                                     Amount of group segment
                                                     (LDS) to allocate for each
                                                     work-group. Granularity is
                                                     device specific:

                                                     GFX6:
                                                       roundup(lds-size / (64 * 4))
                                                     GFX7-GFX10:
                                                       roundup(lds-size / (128 * 4))

     24      1 bit   ENABLE_EXCEPTION_IEEE_754_FP    Wavefront starts execution
             _INVALID_OPERATION                      with specified exceptions
                                                     enabled.

                                                     Used by CP to set up
                                                     ``COMPUTE_PGM_RSRC2.EXCP_EN``
                                                     (set from bits 0..6).

                                                     IEEE 754 FP Invalid
                                                     Operation
     25      1 bit   ENABLE_EXCEPTION_FP_DENORMAL    FP Denormal: one or more
             _SOURCE                                 input operands is a
                                                     denormal number
     26      1 bit   ENABLE_EXCEPTION_IEEE_754_FP    IEEE 754 FP Division by
             _DIVISION_BY_ZERO                       Zero
     27      1 bit   ENABLE_EXCEPTION_IEEE_754_FP    IEEE 754 FP Overflow
             _OVERFLOW
     28      1 bit   ENABLE_EXCEPTION_IEEE_754_FP    IEEE 754 FP Underflow
             _UNDERFLOW
     29      1 bit   ENABLE_EXCEPTION_IEEE_754_FP    IEEE 754 FP Inexact
             _INEXACT
     30      1 bit   ENABLE_EXCEPTION_INT_DIVIDE_BY  Integer Division by Zero
             _ZERO                                   (rcp_iflag_f32 instruction
                                                     only)
     31      1 bit   Reserved, must be 0.
     32              **Total size 4 bytes.**
     ======= ===================================================================================================================

..

  .. table:: compute_pgm_rsrc3 for GFX10
     :name: amdgpu-amdhsa-compute_pgm_rsrc3-gfx10-table

     ======= ======= =============================== ===========================================================================
     Bits    Size    Field Name                      Description
     ======= ======= =============================== ===========================================================================
     3:0     4 bits  SHARED_VGPR_COUNT               Number of shared VGPRs for
                                                     wavefront size 64. Granularity 8.
                                                     Value 0-120.
                                                     compute_pgm_rsrc1.vgprs +
                                                     shared_vgpr_cnt cannot exceed 64.
     31:4    28      Reserved, must be 0.
             bits
     32              **Total size 4 bytes.**
     ======= ===================================================================================================================

..

  .. table:: Floating Point Rounding Mode Enumeration Values
     :name: amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table

     ====================================== ===== ==============================
     Enumeration Name                       Value Description
     ====================================== ===== ==============================
     FLOAT_ROUND_MODE_NEAR_EVEN             0     Round Ties To Even
     FLOAT_ROUND_MODE_PLUS_INFINITY         1     Round Toward +infinity
     FLOAT_ROUND_MODE_MINUS_INFINITY        2     Round Toward -infinity
     FLOAT_ROUND_MODE_ZERO                  3     Round Toward 0
     ====================================== ===== ==============================

..

  .. table:: Floating Point Denorm Mode Enumeration Values
     :name: amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table

     ====================================== ===== ==============================
     Enumeration Name                       Value Description
     ====================================== ===== ==============================
     FLOAT_DENORM_MODE_FLUSH_SRC_DST        0     Flush Source and Destination
                                                  Denorms
     FLOAT_DENORM_MODE_FLUSH_DST            1     Flush Output Denorms
     FLOAT_DENORM_MODE_FLUSH_SRC            2     Flush Source Denorms
     FLOAT_DENORM_MODE_FLUSH_NONE           3     No Flush
     ====================================== ===== ==============================

..

  .. table:: System VGPR Work-Item ID Enumeration Values
     :name: amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table

     ======================================== ===== ============================
     Enumeration Name                         Value Description
     ======================================== ===== ============================
     SYSTEM_VGPR_WORKITEM_ID_X                0     Set work-item X dimension
                                                    ID.
     SYSTEM_VGPR_WORKITEM_ID_X_Y              1     Set work-item X and Y
                                                    dimensions ID.
     SYSTEM_VGPR_WORKITEM_ID_X_Y_Z            2     Set work-item X, Y and Z
                                                    dimensions ID.
     SYSTEM_VGPR_WORKITEM_ID_UNDEFINED        3     Undefined.
     ======================================== ===== ============================

.. _amdgpu-amdhsa-initial-kernel-execution-state:

Initial Kernel Execution State
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This section defines the register state that will be set up by the packet
processor prior to the start of execution of every wavefront. This is limited by
the constraints of the hardware controllers of CP/ADC/SPI.

The order of the SGPR registers is defined, but the compiler can specify which
ones are actually set up in the kernel descriptor using the ``enable_sgpr_*``
bit fields (see :ref:`amdgpu-amdhsa-kernel-descriptor`).
The register numbers used 3596for enabled registers are dense starting at SGPR0: the first enabled register is 3597SGPR0, the next enabled register is SGPR1 etc.; disabled registers do not have 3598an SGPR number. 3599 3600The initial SGPRs comprise up to 16 User SRGPs that are set by CP and apply to 3601all wavefronts of the grid. It is possible to specify more than 16 User SGPRs 3602using the ``enable_sgpr_*`` bit fields, in which case only the first 16 are 3603actually initialized. These are then immediately followed by the System SGPRs 3604that are set up by ADC/SPI and can have different values for each wavefront of 3605the grid dispatch. 3606 3607SGPR register initial state is defined in 3608:ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`. 3609 3610 .. table:: SGPR Register Set Up Order 3611 :name: amdgpu-amdhsa-sgpr-register-set-up-order-table 3612 3613 ========== ========================== ====== ============================== 3614 SGPR Order Name Number Description 3615 (kernel descriptor enable of 3616 field) SGPRs 3617 ========== ========================== ====== ============================== 3618 First Private Segment Buffer 4 V# that can be used, together 3619 (enable_sgpr_private with Scratch Wavefront Offset 3620 _segment_buffer) as an offset, to access the 3621 private address space using a 3622 segment address. 3623 3624 CP uses the value provided by 3625 the runtime. 3626 then Dispatch Ptr 2 64-bit address of AQL dispatch 3627 (enable_sgpr_dispatch_ptr) packet for kernel dispatch 3628 actually executing. 3629 then Queue Ptr 2 64-bit address of amd_queue_t 3630 (enable_sgpr_queue_ptr) object for AQL queue on which 3631 the dispatch packet was 3632 queued. 3633 then Kernarg Segment Ptr 2 64-bit address of Kernarg 3634 (enable_sgpr_kernarg segment. This is directly 3635 _segment_ptr) copied from the 3636 kernarg_address in the kernel 3637 dispatch packet. 
3638 3639 Having CP load it once avoids 3640 loading it at the beginning of 3641 every wavefront. 3642 then Dispatch Id 2 64-bit Dispatch ID of the 3643 (enable_sgpr_dispatch_id) dispatch packet being 3644 executed. 3645 then Flat Scratch Init 2 This is 2 SGPRs: 3646 (enable_sgpr_flat_scratch 3647 _init) GFX6 3648 Not supported. 3649 GFX7-GFX8 3650 The first SGPR is a 32-bit 3651 byte offset from 3652 ``SH_HIDDEN_PRIVATE_BASE_VIMID`` 3653 to per SPI base of memory 3654 for scratch for the queue 3655 executing the kernel 3656 dispatch. CP obtains this 3657 from the runtime. (The 3658 Scratch Segment Buffer base 3659 address is 3660 ``SH_HIDDEN_PRIVATE_BASE_VIMID`` 3661 plus this offset.) The value 3662 of Scratch Wavefront Offset must 3663 be added to this offset by 3664 the kernel machine code, 3665 right shifted by 8, and 3666 moved to the FLAT_SCRATCH_HI 3667 SGPR register. 3668 FLAT_SCRATCH_HI corresponds 3669 to SGPRn-4 on GFX7, and 3670 SGPRn-6 on GFX8 (where SGPRn 3671 is the highest numbered SGPR 3672 allocated to the wavefront). 3673 FLAT_SCRATCH_HI is 3674 multiplied by 256 (as it is 3675 in units of 256 bytes) and 3676 added to 3677 ``SH_HIDDEN_PRIVATE_BASE_VIMID`` 3678 to calculate the per wavefront 3679 FLAT SCRATCH BASE in flat 3680 memory instructions that 3681 access the scratch 3682 aperture. 3683 3684 The second SGPR is 32-bit 3685 byte size of a single 3686 work-item's scratch memory 3687 usage. CP obtains this from 3688 the runtime, and it is 3689 always a multiple of DWORD. 3690 CP checks that the value in 3691 the kernel dispatch packet 3692 Private Segment Byte Size is 3693 not larger and requests the 3694 runtime to increase the 3695 queue's scratch size if 3696 necessary. The kernel code 3697 must move it to 3698 FLAT_SCRATCH_LO which is 3699 SGPRn-3 on GFX7 and SGPRn-5 3700 on GFX8. FLAT_SCRATCH_LO is 3701 used as the FLAT SCRATCH 3702 SIZE in flat memory 3703 instructions. 
Having CP load 3704 it once avoids loading it at 3705 the beginning of every 3706 wavefront. 3707 GFX9-GFX10 3708 This is the 3709 64-bit base address of the 3710 per SPI scratch backing 3711 memory managed by SPI for 3712 the queue executing the 3713 kernel dispatch. CP obtains 3714 this from the runtime (and 3715 divides it if there are 3716 multiple Shader Arrays each 3717 with its own SPI). The value 3718 of Scratch Wavefront Offset must 3719 be added by the kernel 3720 machine code and the result 3721 moved to the FLAT_SCRATCH 3722 SGPR which is SGPRn-6 and 3723 SGPRn-5. It is used as the 3724 FLAT SCRATCH BASE in flat 3725 memory instructions. 3726 then Private Segment Size 1 The 32-bit byte size of a 3727 (enable_sgpr_private single 3728 work-item's 3729 scratch_segment_size) memory 3730 allocation. This is the 3731 value from the kernel 3732 dispatch packet Private 3733 Segment Byte Size rounded up 3734 by CP to a multiple of 3735 DWORD. 3736 3737 Having CP load it once avoids 3738 loading it at the beginning of 3739 every wavefront. 3740 3741 This is not used for 3742 GFX7-GFX8 since it is the same 3743 value as the second SGPR of 3744 Flat Scratch Init. However, it 3745 may be needed for GFX9-GFX10 which 3746 changes the meaning of the 3747 Flat Scratch Init value. 3748 then Grid Work-Group Count X 1 32-bit count of the number of 3749 (enable_sgpr_grid work-groups in the X dimension 3750 _workgroup_count_X) for the grid being 3751 executed. Computed from the 3752 fields in the kernel dispatch 3753 packet as ((grid_size.x + 3754 workgroup_size.x - 1) / 3755 workgroup_size.x). 3756 then Grid Work-Group Count Y 1 32-bit count of the number of 3757 (enable_sgpr_grid work-groups in the Y dimension 3758 _workgroup_count_Y && for the grid being 3759 less than 16 previous executed. Computed from the 3760 SGPRs) fields in the kernel dispatch 3761 packet as ((grid_size.y + 3762 workgroup_size.y - 1) / 3763 workgroupSize.y). 
3764 3765 Only initialized if <16 3766 previous SGPRs initialized. 3767 then Grid Work-Group Count Z 1 32-bit count of the number of 3768 (enable_sgpr_grid work-groups in the Z dimension 3769 _workgroup_count_Z && for the grid being 3770 less than 16 previous executed. Computed from the 3771 SGPRs) fields in the kernel dispatch 3772 packet as ((grid_size.z + 3773 workgroup_size.z - 1) / 3774 workgroupSize.z). 3775 3776 Only initialized if <16 3777 previous SGPRs initialized. 3778 then Work-Group Id X 1 32-bit work-group id in X 3779 (enable_sgpr_workgroup_id dimension of grid for 3780 _X) wavefront. 3781 then Work-Group Id Y 1 32-bit work-group id in Y 3782 (enable_sgpr_workgroup_id dimension of grid for 3783 _Y) wavefront. 3784 then Work-Group Id Z 1 32-bit work-group id in Z 3785 (enable_sgpr_workgroup_id dimension of grid for 3786 _Z) wavefront. 3787 then Work-Group Info 1 {first_wavefront, 14'b0000, 3788 (enable_sgpr_workgroup ordered_append_term[10:0], 3789 _info) threadgroup_size_in_wavefronts[5:0]} 3790 then Scratch Wavefront Offset 1 32-bit byte offset from base 3791 (enable_sgpr_private of scratch base of queue 3792 _segment_wavefront_offset) executing the kernel 3793 dispatch. Must be used as an 3794 offset with Private 3795 segment address when using 3796 Scratch Segment Buffer. It 3797 must be used to set up FLAT 3798 SCRATCH for flat addressing 3799 (see 3800 :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`). 3801 ========== ========================== ====== ============================== 3802 3803The order of the VGPR registers is defined, but the compiler can specify which 3804ones are actually setup in the kernel descriptor using the ``enable_vgpr*`` bit 3805fields (see :ref:`amdgpu-amdhsa-kernel-descriptor`). The register numbers used 3806for enabled registers are dense starting at VGPR0: the first enabled register is 3807VGPR0, the next enabled register is VGPR1 etc.; disabled registers do not have a 3808VGPR number. 
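The Grid Work-Group Count fields in the SGPR table above are computed with a
ceiling division of grid size by work-group size. A minimal Python sketch of
that arithmetic (the helper name is hypothetical, for illustration only):

```python
def grid_workgroup_count(grid_size: int, workgroup_size: int) -> int:
    # ((grid_size + workgroup_size - 1) / workgroup_size) from the kernel
    # dispatch packet: a ceiling division, so a partially filled final
    # work-group still counts as a whole work-group.
    return (grid_size + workgroup_size - 1) // workgroup_size

print(grid_workgroup_count(1000, 256))  # 4: three full work-groups plus one partial
print(grid_workgroup_count(1024, 256))  # 4: all work-groups exactly full
```

The same computation is performed per dimension (X, Y, Z) using the
corresponding ``grid_size`` and ``workgroup_size`` packet fields.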
VGPR register initial state is defined in
:ref:`amdgpu-amdhsa-vgpr-register-set-up-order-table`.

    .. table:: VGPR Register Set Up Order
       :name: amdgpu-amdhsa-vgpr-register-set-up-order-table

       ========== ========================== ====== ==============================
       VGPR Order Name                       Number Description
                  (kernel descriptor enable  of
                  field)                     VGPRs
       ========== ========================== ====== ==============================
       First      Work-Item Id X             1      32-bit work item id in X
                  (Always initialized)              dimension of work-group for
                                                    wavefront lane.
       then       Work-Item Id Y             1      32-bit work item id in Y
                  (enable_vgpr_workitem_id          dimension of work-group for
                  > 0)                              wavefront lane.
       then       Work-Item Id Z             1      32-bit work item id in Z
                  (enable_vgpr_workitem_id          dimension of work-group for
                  > 1)                              wavefront lane.
       ========== ========================== ====== ==============================

The setting of registers is done by GPU CP/ADC/SPI hardware as follows:

1. SGPRs before the Work-Group Ids are set by CP using the 16 User Data
   registers.
2. Work-group Id registers X, Y, Z are set by ADC which supports any
   combination including none.
3. Scratch Wavefront Offset is set by SPI on a per-wavefront basis, which is
   why its value cannot be included with the flat scratch init value, which is
   per queue.
4. The VGPRs are set by SPI which only supports specifying either (X), (X, Y)
   or (X, Y, Z).

The Flat Scratch register pair comprises adjacent SGPRs so they can be moved as
a 64-bit value to the hardware-required SGPRn-3 and SGPRn-4 respectively.

The global segment can be accessed either using buffer instructions (GFX6,
which has V# 64-bit address support), flat instructions (GFX7-GFX10), or global
instructions (GFX9-GFX10).
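The GFX7-GFX8 Flat Scratch Init arithmetic described in the SGPR table above
(add the Scratch Wavefront Offset, then shift right by 8 because the register
holds units of 256 bytes) can be modeled in Python. This is a hypothetical
sketch of the arithmetic only, not actual prolog machine code:

```python
def flat_scratch_base_units(queue_scratch_offset: int, wavefront_offset: int) -> int:
    # GFX7-GFX8 model: the per-queue byte offset from
    # SH_HIDDEN_PRIVATE_BASE_VIMID plus the per-wavefront Scratch Wavefront
    # Offset, right shifted by 8 since the register is in units of 256 bytes.
    return (queue_scratch_offset + wavefront_offset) >> 8

# Hardware multiplies the register value back by 256 when forming the per
# wavefront FLAT SCRATCH BASE; for 256-byte-aligned offsets this recovers
# the original byte offset.
units = flat_scratch_base_units(0x10000, 0x400)
print(hex(units * 256))  # 0x10400
```

Note the truncation: offsets that are not a multiple of 256 lose their low 8
bits in the shift, which is why the hardware works in 256-byte granules.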
3850 3851If buffer operations are used, then the compiler can generate a V# with the 3852following properties: 3853 3854* base address of 0 3855* no swizzle 3856* ATC: 1 if IOMMU present (such as APU) 3857* ptr64: 1 3858* MTYPE set to support memory coherence that matches the runtime (such as CC for 3859 APU and NC for dGPU). 3860 3861.. _amdgpu-amdhsa-kernel-prolog: 3862 3863Kernel Prolog 3864~~~~~~~~~~~~~ 3865 3866The compiler performs initialization in the kernel prologue depending on the 3867target and information about things like stack usage in the kernel and called 3868functions. Some of this initialization requires the compiler to request certain 3869User and System SGPRs be present in the 3870:ref:`amdgpu-amdhsa-initial-kernel-execution-state` via the 3871:ref:`amdgpu-amdhsa-kernel-descriptor`. 3872 3873.. _amdgpu-amdhsa-kernel-prolog-cfi: 3874 3875CFI 3876+++ 3877 38781. The CFI return address is undefined. 3879 38802. The CFI CFA is defined using an expression which evaluates to a location 3881 description that comprises one memory location description for the 3882 ``DW_ASPACE_AMDGPU_private_lane`` address space address ``0``. 3883 3884.. _amdgpu-amdhsa-kernel-prolog-m0: 3885 3886M0 3887++ 3888 3889GFX6-GFX8 3890 The M0 register must be initialized with a value at least the total LDS size 3891 if the kernel may access LDS via DS or flat operations. Total LDS size is 3892 available in dispatch packet. For M0, it is also possible to use maximum 3893 possible value of LDS for given target (0x7FFF for GFX6 and 0xFFFF for 3894 GFX7-GFX8). 3895GFX9-GFX10 3896 The M0 register is not used for range checking LDS accesses and so does not 3897 need to be initialized in the prolog. 3898 3899.. 
_amdgpu-amdhsa-kernel-prolog-stack-pointer: 3900 3901Stack Pointer 3902+++++++++++++ 3903 3904If the kernel has function calls it must set up the ABI stack pointer described 3905in :ref:`amdgpu-amdhsa-function-call-convention-non-kernel-functions` by setting 3906SGPR32 to the unswizzled scratch offset of the address past the last local 3907allocation. 3908 3909.. _amdgpu-amdhsa-kernel-prolog-frame-pointer: 3910 3911Frame Pointer 3912+++++++++++++ 3913 3914If the kernel needs a frame pointer for the reasons defined in 3915``SIFrameLowering`` then SGPR33 is used and is always set to ``0`` in the 3916kernel prolog. If a frame pointer is not required then all uses of the frame 3917pointer are replaced with immediate ``0`` offsets. 3918 3919.. _amdgpu-amdhsa-kernel-prolog-flat-scratch: 3920 3921Flat Scratch 3922++++++++++++ 3923 3924If the kernel or any function it calls may use flat operations to access 3925scratch memory, the prolog code must set up the FLAT_SCRATCH register pair 3926(FLAT_SCRATCH_LO/FLAT_SCRATCH_HI which are in SGPRn-4/SGPRn-3). Initialization 3927uses Flat Scratch Init and Scratch Wavefront Offset SGPR registers (see 3928:ref:`amdgpu-amdhsa-initial-kernel-execution-state`): 3929 3930GFX6 3931 Flat scratch is not supported. 3932 3933GFX7-GFX8 3934 3935 1. The low word of Flat Scratch Init is 32-bit byte offset from 3936 ``SH_HIDDEN_PRIVATE_BASE_VIMID`` to the base of scratch backing memory 3937 being managed by SPI for the queue executing the kernel dispatch. This is 3938 the same value used in the Scratch Segment Buffer V# base address. The 3939 prolog must add the value of Scratch Wavefront Offset to get the 3940 wavefront's byte scratch backing memory offset from 3941 ``SH_HIDDEN_PRIVATE_BASE_VIMID``. Since FLAT_SCRATCH_LO is in units of 256 3942 bytes, the offset must be right shifted by 8 before moving into 3943 FLAT_SCRATCH_LO. 3944 2. The second word of Flat Scratch Init is 32-bit byte size of a single 3945 work-items scratch memory usage. 
This is directly loaded from the kernel 3946 dispatch packet Private Segment Byte Size and rounded up to a multiple of 3947 DWORD. Having CP load it once avoids loading it at the beginning of every 3948 wavefront. The prolog must move it to FLAT_SCRATCH_LO for use as FLAT 3949 SCRATCH SIZE. 3950 3951GFX9-GFX10 3952 The Flat Scratch Init is the 64-bit address of the base of scratch backing 3953 memory being managed by SPI for the queue executing the kernel dispatch. The 3954 prolog must add the value of Scratch Wavefront Offset and moved to the 3955 FLAT_SCRATCH pair for use as the flat scratch base in flat memory 3956 instructions. 3957 3958.. _amdgpu-amdhsa-kernel-prolog-private-segment-buffer: 3959 3960Private Segment Buffer 3961++++++++++++++++++++++ 3962 3963A set of four SGPRs beginning at a four-aligned SGPR index are always selected 3964to serve as the scratch V# for the kernel as follows: 3965 3966 - If it is known during instruction selection that there is stack usage, 3967 SGPR0-3 is reserved for use as the scratch V#. Stack usage is assumed if 3968 optimizations are disabled (``-O0``), if stack objects already exist (for 3969 locals, etc.), or if there are any function calls. 3970 3971 - Otherwise, four high numbered SGPRs beginning at a four-aligned SGPR index 3972 are reserved for the tentative scratch V#. These will be used if it is 3973 determined that spilling is needed. 3974 3975 - If no use is made of the tentative scratch V#, then it is unreserved, 3976 and the register count is determined ignoring it. 3977 - If use is made of the tentative scratch V#, then its register numbers 3978 are shifted to the first four-aligned SGPR index after the highest one 3979 allocated by the register allocator, and all uses are updated. The 3980 register count includes them in the shifted location. 
3981 - In either case, if the processor has the SGPR allocation bug, the 3982 tentative allocation is not shifted or unreserved in order to ensure 3983 the register count is higher to workaround the bug. 3984 3985 .. note:: 3986 3987 This approach of using a tentative scratch V# and shifting the register 3988 numbers if used avoids having to perform register allocation a second 3989 time if the tentative V# is eliminated. This is more efficient and 3990 avoids the problem that the second register allocation may perform 3991 spilling which will fail as there is no longer a scratch V#. 3992 3993When the kernel prolog code is being emitted it is known whether the scratch V# 3994described above is actually used. If it is, the prolog code must set it up by 3995copying the Private Segment Buffer to the scratch V# registers and then adding 3996the Private Segment Wavefront Offset to the queue base address in the V#. The 3997result is a V# with a base address pointing to the beginning of the wavefront 3998scratch backing memory. 3999 4000The Private Segment Buffer is always requested, but the Private Segment 4001Wavefront Offset is only requested if it is used (see 4002:ref:`amdgpu-amdhsa-initial-kernel-execution-state`). 4003 4004.. _amdgpu-amdhsa-memory-model: 4005 4006Memory Model 4007~~~~~~~~~~~~ 4008 4009This section describes the mapping of LLVM memory model onto AMDGPU machine code 4010(see :ref:`memmodel`). 4011 4012The AMDGPU backend supports the memory synchronization scopes specified in 4013:ref:`amdgpu-memory-scopes`. 4014 4015The code sequences used to implement the memory model are defined in table 4016:ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx10-table`. 4017 4018The sequences specify the order of instructions that a single thread must 4019execute. The ``s_waitcnt`` and ``buffer_wbinvl1_vol`` are defined with respect 4020to other memory instructions executed by the same thread. 
This allows them to be 4021moved earlier or later which can allow them to be combined with other instances 4022of the same instruction, or hoisted/sunk out of loops to improve 4023performance. Only the instructions related to the memory model are given; 4024additional ``s_waitcnt`` instructions are required to ensure registers are 4025defined before being used. These may be able to be combined with the memory 4026model ``s_waitcnt`` instructions as described above. 4027 4028The AMDGPU backend supports the following memory models: 4029 4030 HSA Memory Model [HSA]_ 4031 The HSA memory model uses a single happens-before relation for all address 4032 spaces (see :ref:`amdgpu-address-spaces`). 4033 OpenCL Memory Model [OpenCL]_ 4034 The OpenCL memory model which has separate happens-before relations for the 4035 global and local address spaces. Only a fence specifying both global and 4036 local address space, and seq_cst instructions join the relationships. Since 4037 the LLVM ``memfence`` instruction does not allow an address space to be 4038 specified the OpenCL fence has to conservatively assume both local and 4039 global address space was specified. However, optimizations can often be 4040 done to eliminate the additional ``s_waitcnt`` instructions when there are 4041 no intervening memory instructions which access the corresponding address 4042 space. The code sequences in the table indicate what can be omitted for the 4043 OpenCL memory. The target triple environment is used to determine if the 4044 source language is OpenCL (see :ref:`amdgpu-opencl`). 4045 4046``ds/flat_load/store/atomic`` instructions to local memory are termed LDS 4047operations. 4048 4049``buffer/global/flat_load/store/atomic`` instructions to global memory are 4050termed vector memory operations. 4051 4052For GFX6-GFX9: 4053 4054* Each agent has multiple shader arrays (SA). 4055* Each SA has multiple compute units (CU). 4056* Each CU has multiple SIMDs that execute wavefronts. 
4057* The wavefronts for a single work-group are executed in the same CU but may be 4058 executed by different SIMDs. 4059* Each CU has a single LDS memory shared by the wavefronts of the work-groups 4060 executing on it. 4061* All LDS operations of a CU are performed as wavefront wide operations in a 4062 global order and involve no caching. Completion is reported to a wavefront in 4063 execution order. 4064* The LDS memory has multiple request queues shared by the SIMDs of a 4065 CU. Therefore, the LDS operations performed by different wavefronts of a 4066 work-group can be reordered relative to each other, which can result in 4067 reordering the visibility of vector memory operations with respect to LDS 4068 operations of other wavefronts in the same work-group. A ``s_waitcnt 4069 lgkmcnt(0)`` is required to ensure synchronization between LDS operations and 4070 vector memory operations between wavefronts of a work-group, but not between 4071 operations performed by the same wavefront. 4072* The vector memory operations are performed as wavefront wide operations and 4073 completion is reported to a wavefront in execution order. The exception is 4074 that for GFX7-GFX9 ``flat_load/store/atomic`` instructions can report out of 4075 vector memory order if they access LDS memory, and out of LDS operation order 4076 if they access global memory. 4077* The vector memory operations access a single vector L1 cache shared by all 4078 SIMDs a CU. Therefore, no special action is required for coherence between the 4079 lanes of a single wavefront, or for coherence between wavefronts in the same 4080 work-group. A ``buffer_wbinvl1_vol`` is required for coherence between 4081 wavefronts executing in different work-groups as they may be executing on 4082 different CUs. 4083* The scalar memory operations access a scalar L1 cache shared by all wavefronts 4084 on a group of CUs. The scalar and vector L1 caches are not coherent. 
However, 4085 scalar operations are used in a restricted way so do not impact the memory 4086 model. See :ref:`amdgpu-address-spaces`. 4087* The vector and scalar memory operations use an L2 cache shared by all CUs on 4088 the same agent. 4089* The L2 cache has independent channels to service disjoint ranges of virtual 4090 addresses. 4091* Each CU has a separate request queue per channel. Therefore, the vector and 4092 scalar memory operations performed by wavefronts executing in different 4093 work-groups (which may be executing on different CUs) of an agent can be 4094 reordered relative to each other. A ``s_waitcnt vmcnt(0)`` is required to 4095 ensure synchronization between vector memory operations of different CUs. It 4096 ensures a previous vector memory operation has completed before executing a 4097 subsequent vector memory or LDS operation and so can be used to meet the 4098 requirements of acquire and release. 4099* The L2 cache can be kept coherent with other agents on some targets, or ranges 4100 of virtual addresses can be set up to bypass it to ensure system coherence. 4101 4102For GFX10: 4103 4104* Each agent has multiple shader arrays (SA). 4105* Each SA has multiple work-group processors (WGP). 4106* Each WGP has multiple compute units (CU). 4107* Each CU has multiple SIMDs that execute wavefronts. 4108* The wavefronts for a single work-group are executed in the same 4109 WGP. In CU wavefront execution mode the wavefronts may be executed by 4110 different SIMDs in the same CU. In WGP wavefront execution mode the 4111 wavefronts may be executed by different SIMDs in different CUs in the same 4112 WGP. 4113* Each WGP has a single LDS memory shared by the wavefronts of the work-groups 4114 executing on it. 4115* All LDS operations of a WGP are performed as wavefront wide operations in a 4116 global order and involve no caching. Completion is reported to a wavefront in 4117 execution order. 
4118* The LDS memory has multiple request queues shared by the SIMDs of a 4119 WGP. Therefore, the LDS operations performed by different wavefronts of a 4120 work-group can be reordered relative to each other, which can result in 4121 reordering the visibility of vector memory operations with respect to LDS 4122 operations of other wavefronts in the same work-group. A ``s_waitcnt 4123 lgkmcnt(0)`` is required to ensure synchronization between LDS operations and 4124 vector memory operations between wavefronts of a work-group, but not between 4125 operations performed by the same wavefront. 4126* The vector memory operations are performed as wavefront wide operations. 4127 Completion of load/store/sample operations are reported to a wavefront in 4128 execution order of other load/store/sample operations performed by that 4129 wavefront. 4130* The vector memory operations access a vector L0 cache. There is a single L0 4131 cache per CU. Each SIMD of a CU accesses the same L0 cache. Therefore, no 4132 special action is required for coherence between the lanes of a single 4133 wavefront. However, a ``BUFFER_GL0_INV`` is required for coherence between 4134 wavefronts executing in the same work-group as they may be executing on SIMDs 4135 of different CUs that access different L0s. A ``BUFFER_GL0_INV`` is also 4136 required for coherence between wavefronts executing in different work-groups 4137 as they may be executing on different WGPs. 4138* The scalar memory operations access a scalar L0 cache shared by all wavefronts 4139 on a WGP. The scalar and vector L0 caches are not coherent. However, scalar 4140 operations are used in a restricted way so do not impact the memory model. See 4141 :ref:`amdgpu-address-spaces`. 4142* The vector and scalar memory L0 caches use an L1 cache shared by all WGPs on 4143 the same SA. Therefore, no special action is required for coherence between 4144 the wavefronts of a single work-group. 
However, a ``BUFFER_GL1_INV`` is 4145 required for coherence between wavefronts executing in different work-groups 4146 as they may be executing on different SAs that access different L1s. 4147* The L1 caches have independent quadrants to service disjoint ranges of virtual 4148 addresses. 4149* Each L0 cache has a separate request queue per L1 quadrant. Therefore, the 4150 vector and scalar memory operations performed by different wavefronts, whether 4151 executing in the same or different work-groups (which may be executing on 4152 different CUs accessing different L0s), can be reordered relative to each 4153 other. A ``s_waitcnt vmcnt(0) & vscnt(0)`` is required to ensure 4154 synchronization between vector memory operations of different wavefronts. It 4155 ensures a previous vector memory operation has completed before executing a 4156 subsequent vector memory or LDS operation and so can be used to meet the 4157 requirements of acquire, release and sequential consistency. 4158* The L1 caches use an L2 cache shared by all SAs on the same agent. 4159* The L2 cache has independent channels to service disjoint ranges of virtual 4160 addresses. 4161* Each L1 quadrant of a single SA accesses a different L2 channel. Each L1 4162 quadrant has a separate request queue per L2 channel. Therefore, the vector 4163 and scalar memory operations performed by wavefronts executing in different 4164 work-groups (which may be executing on different SAs) of an agent can be 4165 reordered relative to each other. A ``s_waitcnt vmcnt(0) & vscnt(0)`` is 4166 required to ensure synchronization between vector memory operations of 4167 different SAs. It ensures a previous vector memory operation has completed 4168 before executing a subsequent vector memory and so can be used to meet the 4169 requirements of acquire, release and sequential consistency. 
4170* The L2 cache can be kept coherent with other agents on some targets, or ranges 4171 of virtual addresses can be set up to bypass it to ensure system coherence. 4172 4173Private address space uses ``buffer_load/store`` using the scratch V# 4174(GFX6-GFX8), or ``scratch_load/store`` (GFX9-GFX10). Since only a single thread 4175is accessing the memory, atomic memory orderings are not meaningful, and all 4176accesses are treated as non-atomic. 4177 4178Constant address space uses ``buffer/global_load`` instructions (or equivalent 4179scalar memory instructions). Since the constant address space contents do not 4180change during the execution of a kernel dispatch it is not legal to perform 4181stores, and atomic memory orderings are not meaningful, and all access are 4182treated as non-atomic. 4183 4184A memory synchronization scope wider than work-group is not meaningful for the 4185group (LDS) address space and is treated as work-group. 4186 4187The memory model does not support the region address space which is treated as 4188non-atomic. 4189 4190Acquire memory ordering is not meaningful on store atomic instructions and is 4191treated as non-atomic. 4192 4193Release memory ordering is not meaningful on load atomic instructions and is 4194treated a non-atomic. 4195 4196Acquire-release memory ordering is not meaningful on load or store atomic 4197instructions and is treated as acquire and release respectively. 4198 4199AMDGPU backend only uses scalar memory operations to access memory that is 4200proven to not change during the execution of the kernel dispatch. This includes 4201constant address space and global address space for program scope const 4202variables. Therefore, the kernel machine code does not have to maintain the 4203scalar L1 cache to ensure it is coherent with the vector L1 cache. The scalar 4204and vector L1 caches are invalidated between kernel dispatches by CP since 4205constant address space data may change between kernel dispatch executions. 
See 4206:ref:`amdgpu-address-spaces`. 4207 4208The one exception is if scalar writes are used to spill SGPR registers. In this 4209case the AMDGPU backend ensures the memory location used to spill is never 4210accessed by vector memory operations at the same time. If scalar writes are used 4211then a ``s_dcache_wb`` is inserted before the ``s_endpgm`` and before a function 4212return since the locations may be used for vector memory instructions by a 4213future wavefront that uses the same scratch area, or a function call that 4214creates a frame at the same address, respectively. There is no need for a 4215``s_dcache_inv`` as all scalar writes are write-before-read in the same thread. 4216 4217For GFX6-GFX9, scratch backing memory (which is used for the private address 4218space) is accessed with MTYPE NC_NV (non-coherent non-volatile). Since the 4219private address space is only accessed by a single thread, and is always 4220write-before-read, there is never a need to invalidate these entries from the L1 4221cache. Hence all cache invalidates are done as ``*_vol`` to only invalidate the 4222volatile cache lines. 4223 4224For GFX10, scratch backing memory (which is used for the private address space) 4225is accessed with MTYPE NC (non-coherent). Since the private address space is 4226only accessed by a single thread, and is always write-before-read, there is 4227never a need to invalidate these entries from the L0 or L1 caches. 4228 4229For GFX10, wavefronts are executed in native mode with in-order reporting of 4230loads and sample instructions. In this mode vmcnt reports completion of load, 4231atomic with return and sample instructions in order, and the vscnt reports the 4232completion of store and atomic without return in order. See ``MEM_ORDERED`` 4233field in :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`. 
In GFX10, wavefronts can be executed in WGP or CU wavefront execution mode:

* In WGP wavefront execution mode, the wavefronts of a work-group are executed
  on the SIMDs of both CUs of the WGP. Therefore, explicit management of the
  per-CU L0 caches is required for work-group synchronization. Also, accesses
  to L1 at work-group scope need to be explicitly ordered, as the accesses from
  different CUs are not ordered.
* In CU wavefront execution mode, the wavefronts of a work-group are executed
  on the SIMDs of a single CU of the WGP. Therefore, all global memory accesses
  by the work-group access the same L0, which in turn ensures L1 accesses are
  ordered and so do not require explicit management of the caches for
  work-group synchronization.

See the ``WGP_MODE`` field in
:ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table` and
:ref:`amdgpu-target-features`.

On dGPU the kernarg backing memory is accessed as UC (uncached) to avoid
needing to invalidate the L2 cache. For GFX6-GFX9, this also causes it to be
treated as non-volatile and so is not invalidated by ``*_vol``. On APU it is
accessed as CC (cache coherent) and so the L2 cache will be coherent with the
CPU and other agents.

  ..
table:: AMDHSA Memory Model Code Sequences GFX6-GFX10 4259 :name: amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx10-table 4260 4261 ============ ============ ============== ========== =============================== ================================== 4262 LLVM Instr LLVM Memory LLVM Memory AMDGPU AMDGPU Machine Code AMDGPU Machine Code 4263 Ordering Sync Scope Address GFX6-9 GFX10 4264 Space 4265 ============ ============ ============== ========== =============================== ================================== 4266 **Non-Atomic** 4267 ---------------------------------------------------------------------------------------------------------------------- 4268 load *none* *none* - global - !volatile & !nontemporal - !volatile & !nontemporal 4269 - generic 4270 - private 1. buffer/global/flat_load 1. buffer/global/flat_load 4271 - constant 4272 - volatile & !nontemporal - volatile & !nontemporal 4273 4274 1. buffer/global/flat_load 1. buffer/global/flat_load 4275 glc=1 glc=1 dlc=1 4276 4277 - nontemporal - nontemporal 4278 4279 1. buffer/global/flat_load 1. buffer/global/flat_load 4280 glc=1 slc=1 slc=1 4281 4282 load *none* *none* - local 1. ds_load 1. ds_load 4283 store *none* *none* - global - !nontemporal - !nontemporal 4284 - generic 4285 - private 1. buffer/global/flat_store 1. buffer/global/flat_store 4286 - constant 4287 - nontemporal - nontemporal 4288 4289 1. buffer/global/flat_store 1. buffer/global/flat_store 4290 glc=1 slc=1 slc=1 4291 4292 store *none* *none* - local 1. ds_store 1. ds_store 4293 **Unordered Atomic** 4294 ---------------------------------------------------------------------------------------------------------------------- 4295 load atomic unordered *any* *any* *Same as non-atomic*. *Same as non-atomic*. 4296 store atomic unordered *any* *any* *Same as non-atomic*. *Same as non-atomic*. 4297 atomicrmw unordered *any* *any* *Same as monotonic *Same as monotonic 4298 atomic*. atomic*. 
4299 **Monotonic Atomic** 4300 ---------------------------------------------------------------------------------------------------------------------- 4301 load atomic monotonic - singlethread - global 1. buffer/global/flat_load 1. buffer/global/flat_load 4302 - wavefront - generic 4303 load atomic monotonic - workgroup - global 1. buffer/global/flat_load 1. buffer/global/flat_load 4304 - generic glc=1 4305 4306 - If CU wavefront execution mode, omit glc=1. 4307 4308 load atomic monotonic - singlethread - local 1. ds_load 1. ds_load 4309 - wavefront 4310 - workgroup 4311 load atomic monotonic - agent - global 1. buffer/global/flat_load 1. buffer/global/flat_load 4312 - system - generic glc=1 glc=1 dlc=1 4313 store atomic monotonic - singlethread - global 1. buffer/global/flat_store 1. buffer/global/flat_store 4314 - wavefront - generic 4315 - workgroup 4316 - agent 4317 - system 4318 store atomic monotonic - singlethread - local 1. ds_store 1. ds_store 4319 - wavefront 4320 - workgroup 4321 atomicrmw monotonic - singlethread - global 1. buffer/global/flat_atomic 1. buffer/global/flat_atomic 4322 - wavefront - generic 4323 - workgroup 4324 - agent 4325 - system 4326 atomicrmw monotonic - singlethread - local 1. ds_atomic 1. ds_atomic 4327 - wavefront 4328 - workgroup 4329 **Acquire Atomic** 4330 ---------------------------------------------------------------------------------------------------------------------- 4331 load atomic acquire - singlethread - global 1. buffer/global/ds/flat_load 1. buffer/global/ds/flat_load 4332 - wavefront - local 4333 - generic 4334 load atomic acquire - workgroup - global 1. buffer/global/flat_load 1. buffer/global_load glc=1 4335 4336 - If CU wavefront execution mode, omit glc=1. 4337 4338 2. s_waitcnt vmcnt(0) 4339 4340 - If CU wavefront execution mode, omit. 4341 - Must happen before 4342 the following buffer_gl0_inv 4343 and before any following 4344 global/generic 4345 load/load 4346 atomic/store/store 4347 atomic/atomicrmw. 
4348 4349 3. buffer_gl0_inv 4350 4351 - If CU wavefront execution mode, omit. 4352 - Ensures that 4353 following 4354 loads will not see 4355 stale data. 4356 4357 load atomic acquire - workgroup - local 1. ds_load 1. ds_load 4358 2. s_waitcnt lgkmcnt(0) 2. s_waitcnt lgkmcnt(0) 4359 4360 - If OpenCL, omit. - If OpenCL, omit. 4361 - Must happen before - Must happen before 4362 any following the following buffer_gl0_inv 4363 global/generic and before any following 4364 load/load global/generic load/load 4365 atomic/store/store atomic/store/store 4366 atomic/atomicrmw. atomic/atomicrmw. 4367 - Ensures any - Ensures any 4368 following global following global 4369 data read is no data read is no 4370 older than the load older than the load 4371 atomic value being atomic value being 4372 acquired. acquired. 4373 4374 3. buffer_gl0_inv 4375 4376 - If CU wavefront execution mode, omit. 4377 - If OpenCL, omit. 4378 - Ensures that 4379 following 4380 loads will not see 4381 stale data. 4382 4383 load atomic acquire - workgroup - generic 1. flat_load 1. flat_load glc=1 4384 4385 - If CU wavefront execution mode, omit glc=1. 4386 4387 2. s_waitcnt lgkmcnt(0) 2. s_waitcnt lgkmcnt(0) & 4388 vmcnt(0) 4389 4390 - If CU wavefront execution mode, omit vmcnt. 4391 - If OpenCL, omit. - If OpenCL, omit 4392 lgkmcnt(0). 4393 - Must happen before - Must happen before 4394 any following the following 4395 global/generic buffer_gl0_inv and any 4396 load/load following global/generic 4397 atomic/store/store load/load 4398 atomic/atomicrmw. atomic/store/store 4399 atomic/atomicrmw. 4400 - Ensures any - Ensures any 4401 following global following global 4402 data read is no data read is no 4403 older than the load older than the load 4404 atomic value being atomic value being 4405 acquired. acquired. 4406 4407 3. buffer_gl0_inv 4408 4409 - If CU wavefront execution mode, omit. 4410 - Ensures that 4411 following 4412 loads will not see 4413 stale data. 
4414 4415 load atomic acquire - agent - global 1. buffer/global/flat_load 1. buffer/global_load
4416 - system glc=1 glc=1 dlc=1
4417 2. s_waitcnt vmcnt(0) 2. s_waitcnt vmcnt(0)
4418 4419 - Must happen before - Must happen before
4420 following following
4421 buffer_wbinvl1_vol. buffer_gl*_inv.
4422 - Ensures the load - Ensures the load
4423 has completed has completed
4424 before invalidating before invalidating
4425 the cache. the caches.
4426 4427 3. buffer_wbinvl1_vol 3. buffer_gl0_inv;
4428 buffer_gl1_inv
4429 4430 - Must happen before - Must happen before
4431 any following any following
4432 global/generic global/generic
4433 load/load load/load
4434 atomic/atomicrmw. atomic/atomicrmw.
4435 - Ensures that - Ensures that
4436 following following
4437 loads will not see loads will not see
4438 stale global data. stale global data.
4439 4440 load atomic acquire - agent - generic 1. flat_load glc=1 1. flat_load glc=1 dlc=1
4441 - system 2. s_waitcnt vmcnt(0) & 2. s_waitcnt vmcnt(0) &
4442 lgkmcnt(0) lgkmcnt(0)
4443 4444 - If OpenCL, omit - If OpenCL, omit
4445 lgkmcnt(0). lgkmcnt(0).
4446 - Must happen before - Must happen before
4447 following following
4448 buffer_wbinvl1_vol. buffer_gl*_inv.
4449 - Ensures the flat_load - Ensures the flat_load
4450 has completed has completed
4451 before invalidating before invalidating
4452 the cache. the caches.
4453 4454 3. buffer_wbinvl1_vol 3. buffer_gl0_inv;
4455 buffer_gl1_inv
4456 4457 - Must happen before - Must happen before
4458 any following any following
4459 global/generic global/generic
4460 load/load load/load
4461 atomic/atomicrmw. atomic/atomicrmw.
4462 - Ensures that - Ensures that
4463 following loads following loads
4464 will not see stale will not see stale
4465 global data. global data.
4466 4467 atomicrmw acquire - singlethread - global 1. buffer/global/ds/flat_atomic 1. buffer/global/ds/flat_atomic
4468 - wavefront - local
4469 - generic
4470 atomicrmw acquire - workgroup - global 1.
buffer/global/flat_atomic 1. buffer/global_atomic
4471 2. s_waitcnt vm/vscnt(0)
4472 4473 - If CU wavefront execution mode, omit.
4474 - Use vmcnt if atomic with
4475 return and vscnt if atomic
4476 with no-return.
4477 - Must happen before
4478 the following buffer_gl0_inv
4479 and before any following
4480 global/generic
4481 load/load
4482 atomic/store/store
4483 atomic/atomicrmw.
4484 4485 3. buffer_gl0_inv
4486 4487 - If CU wavefront execution mode, omit.
4488 - Ensures that
4489 following
4490 loads will not see
4491 stale data.
4492 4493 atomicrmw acquire - workgroup - local 1. ds_atomic 1. ds_atomic
4494 2. waitcnt lgkmcnt(0) 2. waitcnt lgkmcnt(0)
4495 4496 - If OpenCL, omit. - If OpenCL, omit.
4497 - Must happen before - Must happen before
4498 any following the following
4499 global/generic buffer_gl0_inv.
4500 load/load
4501 atomic/store/store
4502 atomic/atomicrmw.
4503 - Ensures any - Ensures any
4504 following global following global
4505 data read is no data read is no
4506 older than the older than the
4507 atomicrmw value atomicrmw value
4508 being acquired. being acquired.
4509 4510 3. buffer_gl0_inv
4511 4512 - If OpenCL, omit.
4513 - Ensures that
4514 following
4515 loads will not see
4516 stale data.
4517 4518 atomicrmw acquire - workgroup - generic 1. flat_atomic 1. flat_atomic
4519 2. waitcnt lgkmcnt(0) 2. waitcnt lgkmcnt(0) &
4520 vm/vscnt(0)
4521 4522 - If CU wavefront execution mode, omit vm/vscnt.
4523 - If OpenCL, omit. - If OpenCL, omit
4524 waitcnt lgkmcnt(0).
4525 - Use vmcnt if atomic with
4526 return and vscnt if atomic
4527 with no-return.
4529 - Must happen before - Must happen before
4530 any following the following
4531 global/generic buffer_gl0_inv.
4532 load/load
4533 atomic/store/store
4534 atomic/atomicrmw.
4535 - Ensures any - Ensures any
4536 following global following global
4537 data read is no data read is no
4538 older than the older than the
4539 atomicrmw value atomicrmw value
4540 being acquired. being acquired.
4541 4542 3. buffer_gl0_inv
4543 4544 - If CU wavefront execution mode, omit.
4545 - Ensures that
4546 following
4547 loads will not see
4548 stale data.
4549 4550 atomicrmw acquire - agent - global 1. buffer/global/flat_atomic 1. buffer/global_atomic
4551 - system 2. s_waitcnt vmcnt(0) 2. s_waitcnt vm/vscnt(0)
4552 4553 - Use vmcnt if atomic with
4554 return and vscnt if atomic
4555 with no-return.
4557 - Must happen before - Must happen before
4558 following following
4559 buffer_wbinvl1_vol. buffer_gl*_inv.
4560 - Ensures the - Ensures the
4561 atomicrmw has atomicrmw has
4562 completed before completed before
4563 invalidating the invalidating the
4564 cache. caches.
4565 4566 3. buffer_wbinvl1_vol 3. buffer_gl0_inv;
4567 buffer_gl1_inv
4568 4569 - Must happen before - Must happen before
4570 any following any following
4571 global/generic global/generic
4572 load/load load/load
4573 atomic/atomicrmw. atomic/atomicrmw.
4574 - Ensures that - Ensures that
4575 following loads following loads
4576 will not see stale will not see stale
4577 global data. global data.
4578 4579 atomicrmw acquire - agent - generic 1. flat_atomic 1. flat_atomic
4580 - system 2. s_waitcnt vmcnt(0) & 2. s_waitcnt vm/vscnt(0) &
4581 lgkmcnt(0) lgkmcnt(0)
4582 4583 - If OpenCL, omit - If OpenCL, omit
4584 lgkmcnt(0). lgkmcnt(0).
4585 - Use vmcnt if atomic with
4586 return and vscnt if atomic
4587 with no-return.
4588 - Must happen before - Must happen before
4589 following following
4590 buffer_wbinvl1_vol. buffer_gl*_inv.
4591 - Ensures the - Ensures the
4592 atomicrmw has atomicrmw has
4593 completed before completed before
4594 invalidating the invalidating the
4595 cache. caches.
4596 4597 3. buffer_wbinvl1_vol 3.
buffer_gl0_inv; 4598 buffer_gl1_inv 4599 4600 - Must happen before - Must happen before 4601 any following any following 4602 global/generic global/generic 4603 load/load load/load 4604 atomic/atomicrmw. atomic/atomicrmw. 4605 - Ensures that - Ensures that 4606 following loads following loads 4607 will not see stale will not see stale 4608 global data. global data. 4609 4610 fence acquire - singlethread *none* *none* *none* 4611 - wavefront 4612 fence acquire - workgroup *none* 1. s_waitcnt lgkmcnt(0) 1. s_waitcnt lgkmcnt(0) & 4613 vmcnt(0) & vscnt(0) 4614 4615 - If CU wavefront execution mode, omit vmcnt and 4616 vscnt. 4617 - If OpenCL and - If OpenCL and 4618 address space is address space is 4619 not generic, omit. not generic, omit 4620 lgkmcnt(0). 4621 - If OpenCL and 4622 address space is 4623 local, omit 4624 vmcnt(0) and vscnt(0). 4625 - However, since LLVM - However, since LLVM 4626 currently has no currently has no 4627 address space on address space on 4628 the fence need to the fence need to 4629 conservatively conservatively 4630 always generate. If always generate. If 4631 fence had an fence had an 4632 address space then address space then 4633 set to address set to address 4634 space of OpenCL space of OpenCL 4635 fence flag, or to fence flag, or to 4636 generic if both generic if both 4637 local and global local and global 4638 flags are flags are 4639 specified. specified. 4640 - Must happen after 4641 any preceding 4642 local/generic load 4643 atomic/atomicrmw 4644 with an equal or 4645 wider sync scope 4646 and memory ordering 4647 stronger than 4648 unordered (this is 4649 termed the 4650 fence-paired-atomic). 4651 - Must happen before 4652 any following 4653 global/generic 4654 load/load 4655 atomic/store/store 4656 atomic/atomicrmw. 4657 - Ensures any 4658 following global 4659 data read is no 4660 older than the 4661 value read by the 4662 fence-paired-atomic. 
4663 - Could be split into
4664 separate s_waitcnt
4665 vmcnt(0), s_waitcnt
4666 vscnt(0) and s_waitcnt
4667 lgkmcnt(0) to allow
4668 them to be
4669 independently moved
4670 according to the
4671 following rules.
4672 - s_waitcnt vmcnt(0)
4673 must happen after
4674 any preceding
4675 global/generic load
4676 atomic/
4677 atomicrmw-with-return-value
4678 with an equal or
4679 wider sync scope
4680 and memory ordering
4681 stronger than
4682 unordered (this is
4683 termed the
4684 fence-paired-atomic).
4685 - s_waitcnt vscnt(0)
4686 must happen after
4687 any preceding
4688 global/generic
4689 atomicrmw-no-return-value
4690 with an equal or
4691 wider sync scope
4692 and memory ordering
4693 stronger than
4694 unordered (this is
4695 termed the
4696 fence-paired-atomic).
4697 - s_waitcnt lgkmcnt(0)
4698 must happen after
4699 any preceding
4700 local/generic load
4701 atomic/atomicrmw
4702 with an equal or
4703 wider sync scope
4704 and memory ordering
4705 stronger than
4706 unordered (this is
4707 termed the
4708 fence-paired-atomic).
4709 - Must happen before
4710 the following
4711 buffer_gl0_inv.
4712 - Ensures that the
4713 fence-paired atomic
4714 has completed
4715 before invalidating
4716 the
4717 cache. Therefore
4718 any following
4719 locations read must
4720 be no older than
4721 the value read by
4722 the
4723 fence-paired-atomic.
4724 4725 2. buffer_gl0_inv
4726 4727 - If CU wavefront execution mode, omit.
4728 - Ensures that
4729 following
4730 loads will not see
4731 stale data.
4732 4733 fence acquire - agent *none* 1. s_waitcnt lgkmcnt(0) & 1. s_waitcnt lgkmcnt(0) &
4734 - system vmcnt(0) vmcnt(0) & vscnt(0)
4735 4736 - If OpenCL and - If OpenCL and
4737 address space is address space is
4738 not generic, omit not generic, omit
4739 lgkmcnt(0). lgkmcnt(0).
4740 - If OpenCL and
4741 address space is
4742 local, omit
4743 vmcnt(0) and vscnt(0).
4744 - However, since LLVM - However, since LLVM 4745 currently has no currently has no 4746 address space on address space on 4747 the fence need to the fence need to 4748 conservatively conservatively 4749 always generate always generate 4750 (see comment for (see comment for 4751 previous fence). previous fence). 4752 - Could be split into 4753 separate s_waitcnt 4754 vmcnt(0) and 4755 s_waitcnt 4756 lgkmcnt(0) to allow 4757 them to be 4758 independently moved 4759 according to the 4760 following rules. 4761 - s_waitcnt vmcnt(0) 4762 must happen after 4763 any preceding 4764 global/generic load 4765 atomic/atomicrmw 4766 with an equal or 4767 wider sync scope 4768 and memory ordering 4769 stronger than 4770 unordered (this is 4771 termed the 4772 fence-paired-atomic). 4773 - s_waitcnt lgkmcnt(0) 4774 must happen after 4775 any preceding 4776 local/generic load 4777 atomic/atomicrmw 4778 with an equal or 4779 wider sync scope 4780 and memory ordering 4781 stronger than 4782 unordered (this is 4783 termed the 4784 fence-paired-atomic). 4785 - Must happen before 4786 the following 4787 buffer_wbinvl1_vol. 4788 - Ensures that the 4789 fence-paired atomic 4790 has completed 4791 before invalidating 4792 the 4793 cache. Therefore 4794 any following 4795 locations read must 4796 be no older than 4797 the value read by 4798 the 4799 fence-paired-atomic. 4800 - Could be split into 4801 separate s_waitcnt 4802 vmcnt(0), s_waitcnt 4803 vscnt(0) and s_waitcnt 4804 lgkmcnt(0) to allow 4805 them to be 4806 independently moved 4807 according to the 4808 following rules. 4809 - s_waitcnt vmcnt(0) 4810 must happen after 4811 any preceding 4812 global/generic load 4813 atomic/ 4814 atomicrmw-with-return-value 4815 with an equal or 4816 wider sync scope 4817 and memory ordering 4818 stronger than 4819 unordered (this is 4820 termed the 4821 fence-paired-atomic). 
4822 - s_waitcnt vscnt(0) 4823 must happen after 4824 any preceding 4825 global/generic 4826 atomicrmw-no-return-value 4827 with an equal or 4828 wider sync scope 4829 and memory ordering 4830 stronger than 4831 unordered (this is 4832 termed the 4833 fence-paired-atomic). 4834 - s_waitcnt lgkmcnt(0) 4835 must happen after 4836 any preceding 4837 local/generic load 4838 atomic/atomicrmw 4839 with an equal or 4840 wider sync scope 4841 and memory ordering 4842 stronger than 4843 unordered (this is 4844 termed the 4845 fence-paired-atomic). 4846 - Must happen before 4847 the following 4848 buffer_gl*_inv. 4849 - Ensures that the 4850 fence-paired atomic 4851 has completed 4852 before invalidating 4853 the 4854 caches. Therefore 4855 any following 4856 locations read must 4857 be no older than 4858 the value read by 4859 the 4860 fence-paired-atomic. 4861 4862 2. buffer_wbinvl1_vol 2. buffer_gl0_inv; 4863 buffer_gl1_inv 4864 4865 - Must happen before any - Must happen before any 4866 following global/generic following global/generic 4867 load/load load/load 4868 atomic/store/store atomic/store/store 4869 atomic/atomicrmw. atomic/atomicrmw. 4870 - Ensures that - Ensures that 4871 following loads following loads 4872 will not see stale will not see stale 4873 global data. global data. 4874 4875 **Release Atomic** 4876 ---------------------------------------------------------------------------------------------------------------------- 4877 store atomic release - singlethread - global 1. buffer/global/ds/flat_store 1. buffer/global/ds/flat_store 4878 - wavefront - local 4879 - generic 4880 store atomic release - workgroup - global 1. s_waitcnt lgkmcnt(0) 1. s_waitcnt lgkmcnt(0) & 4881 vmcnt(0) & vscnt(0) 4882 4883 - If CU wavefront execution mode, omit vmcnt and 4884 vscnt. 4885 - If OpenCL, omit. - If OpenCL, omit 4886 lgkmcnt(0). 4887 - Must happen after 4888 any preceding 4889 local/generic 4890 load/store/load 4891 atomic/store 4892 atomic/atomicrmw. 
4893 - Could be split into 4894 separate s_waitcnt 4895 vmcnt(0), s_waitcnt 4896 vscnt(0) and s_waitcnt 4897 lgkmcnt(0) to allow 4898 them to be 4899 independently moved 4900 according to the 4901 following rules. 4902 - s_waitcnt vmcnt(0) 4903 must happen after 4904 any preceding 4905 global/generic load/load 4906 atomic/ 4907 atomicrmw-with-return-value. 4908 - s_waitcnt vscnt(0) 4909 must happen after 4910 any preceding 4911 global/generic 4912 store/store 4913 atomic/ 4914 atomicrmw-no-return-value. 4915 - s_waitcnt lgkmcnt(0) 4916 must happen after 4917 any preceding 4918 local/generic 4919 load/store/load 4920 atomic/store 4921 atomic/atomicrmw. 4922 - Must happen before - Must happen before 4923 the following the following 4924 store. store. 4925 - Ensures that all - Ensures that all 4926 memory operations memory operations 4927 to local have have 4928 completed before completed before 4929 performing the performing the 4930 store that is being store that is being 4931 released. released. 4932 4933 2. buffer/global/flat_store 2. buffer/global_store 4934 store atomic release - workgroup - local 1. waitcnt vmcnt(0) & vscnt(0) 4935 4936 - If CU wavefront execution mode, omit. 4937 - If OpenCL, omit. 4938 - Could be split into 4939 separate s_waitcnt 4940 vmcnt(0) and s_waitcnt 4941 vscnt(0) to allow 4942 them to be 4943 independently moved 4944 according to the 4945 following rules. 4946 - s_waitcnt vmcnt(0) 4947 must happen after 4948 any preceding 4949 global/generic load/load 4950 atomic/ 4951 atomicrmw-with-return-value. 4952 - s_waitcnt vscnt(0) 4953 must happen after 4954 any preceding 4955 global/generic 4956 store/store atomic/ 4957 atomicrmw-no-return-value. 4958 - Must happen before 4959 the following 4960 store. 4961 - Ensures that all 4962 global memory 4963 operations have 4964 completed before 4965 performing the 4966 store that is being 4967 released. 4968 4969 1. ds_store 2. ds_store 4970 store atomic release - workgroup - generic 1. 
s_waitcnt lgkmcnt(0) 1. s_waitcnt lgkmcnt(0) & 4971 vmcnt(0) & vscnt(0) 4972 4973 - If CU wavefront execution mode, omit vmcnt and 4974 vscnt. 4975 - If OpenCL, omit. - If OpenCL, omit 4976 lgkmcnt(0). 4977 - Must happen after 4978 any preceding 4979 local/generic 4980 load/store/load 4981 atomic/store 4982 atomic/atomicrmw. 4983 - Could be split into 4984 separate s_waitcnt 4985 vmcnt(0), s_waitcnt 4986 vscnt(0) and s_waitcnt 4987 lgkmcnt(0) to allow 4988 them to be 4989 independently moved 4990 according to the 4991 following rules. 4992 - s_waitcnt vmcnt(0) 4993 must happen after 4994 any preceding 4995 global/generic load/load 4996 atomic/ 4997 atomicrmw-with-return-value. 4998 - s_waitcnt vscnt(0) 4999 must happen after 5000 any preceding 5001 global/generic 5002 store/store 5003 atomic/ 5004 atomicrmw-no-return-value. 5005 - s_waitcnt lgkmcnt(0) 5006 must happen after 5007 any preceding 5008 local/generic load/store/load 5009 atomic/store atomic/atomicrmw. 5010 - Must happen before - Must happen before 5011 the following the following 5012 store. store. 5013 - Ensures that all - Ensures that all 5014 memory operations memory operations 5015 to local have have 5016 completed before completed before 5017 performing the performing the 5018 store that is being store that is being 5019 released. released. 5020 5021 2. flat_store 2. flat_store 5022 store atomic release - agent - global 1. s_waitcnt lgkmcnt(0) & 1. s_waitcnt lgkmcnt(0) & 5023 - system - generic vmcnt(0) vmcnt(0) & vscnt(0) 5024 5025 - If OpenCL, omit - If OpenCL, omit 5026 lgkmcnt(0). lgkmcnt(0). 5027 - Could be split into - Could be split into 5028 separate s_waitcnt separate s_waitcnt 5029 vmcnt(0) and vmcnt(0), s_waitcnt vscnt(0) 5030 s_waitcnt and s_waitcnt 5031 lgkmcnt(0) to allow lgkmcnt(0) to allow 5032 them to be them to be 5033 independently moved independently moved 5034 according to the according to the 5035 following rules. following rules. 
5036 - s_waitcnt vmcnt(0) - s_waitcnt vmcnt(0) 5037 must happen after must happen after 5038 any preceding any preceding 5039 global/generic global/generic 5040 load/store/load load/load 5041 atomic/store atomic/ 5042 atomic/atomicrmw. atomicrmw-with-return-value. 5043 - s_waitcnt vscnt(0) 5044 must happen after 5045 any preceding 5046 global/generic 5047 store/store atomic/ 5048 atomicrmw-no-return-value. 5049 - s_waitcnt lgkmcnt(0) - s_waitcnt lgkmcnt(0) 5050 must happen after must happen after 5051 any preceding any preceding 5052 local/generic local/generic 5053 load/store/load load/store/load 5054 atomic/store atomic/store 5055 atomic/atomicrmw. atomic/atomicrmw. 5056 - Must happen before - Must happen before 5057 the following the following 5058 store. store. 5059 - Ensures that all - Ensures that all 5060 memory operations memory operations 5061 to memory have to memory have 5062 completed before completed before 5063 performing the performing the 5064 store that is being store that is being 5065 released. released. 5066 5067 2. buffer/global/ds/flat_store 2. buffer/global/ds/flat_store 5068 atomicrmw release - singlethread - global 1. buffer/global/ds/flat_atomic 1. buffer/global/ds/flat_atomic 5069 - wavefront - local 5070 - generic 5071 atomicrmw release - workgroup - global 1. s_waitcnt lgkmcnt(0) 1. s_waitcnt lgkmcnt(0) & 5072 vmcnt(0) & vscnt(0) 5073 5074 - If CU wavefront execution mode, omit vmcnt and 5075 vscnt. 5076 - If OpenCL, omit. 5077 5078 - Must happen after 5079 any preceding 5080 local/generic 5081 load/store/load 5082 atomic/store 5083 atomic/atomicrmw. 5084 - Could be split into 5085 separate s_waitcnt 5086 vmcnt(0), s_waitcnt 5087 vscnt(0) and s_waitcnt 5088 lgkmcnt(0) to allow 5089 them to be 5090 independently moved 5091 according to the 5092 following rules. 5093 - s_waitcnt vmcnt(0) 5094 must happen after 5095 any preceding 5096 global/generic load/load 5097 atomic/ 5098 atomicrmw-with-return-value. 
5099 - s_waitcnt vscnt(0) 5100 must happen after 5101 any preceding 5102 global/generic 5103 store/store 5104 atomic/ 5105 atomicrmw-no-return-value. 5106 - s_waitcnt lgkmcnt(0) 5107 must happen after 5108 any preceding 5109 local/generic 5110 load/store/load 5111 atomic/store 5112 atomic/atomicrmw. 5113 - Must happen before - Must happen before 5114 the following the following 5115 atomicrmw. atomicrmw. 5116 - Ensures that all - Ensures that all 5117 memory operations memory operations 5118 to local have have 5119 completed before completed before 5120 performing the performing the 5121 atomicrmw that is atomicrmw that is 5122 being released. being released. 5123 5124 2. buffer/global/flat_atomic 2. buffer/global_atomic 5125 atomicrmw release - workgroup - local 1. waitcnt vmcnt(0) & vscnt(0) 5126 5127 - If CU wavefront execution mode, omit. 5128 - If OpenCL, omit. 5129 - Could be split into 5130 separate s_waitcnt 5131 vmcnt(0) and s_waitcnt 5132 vscnt(0) to allow 5133 them to be 5134 independently moved 5135 according to the 5136 following rules. 5137 - s_waitcnt vmcnt(0) 5138 must happen after 5139 any preceding 5140 global/generic load/load 5141 atomic/ 5142 atomicrmw-with-return-value. 5143 - s_waitcnt vscnt(0) 5144 must happen after 5145 any preceding 5146 global/generic 5147 store/store atomic/ 5148 atomicrmw-no-return-value. 5149 - Must happen before 5150 the following 5151 store. 5152 - Ensures that all 5153 global memory 5154 operations have 5155 completed before 5156 performing the 5157 store that is being 5158 released. 5159 5160 1. ds_atomic 2. ds_atomic 5161 atomicrmw release - workgroup - generic 1. s_waitcnt lgkmcnt(0) 1. s_waitcnt lgkmcnt(0) & 5162 vmcnt(0) & vscnt(0) 5163 5164 - If CU wavefront execution mode, omit vmcnt and 5165 vscnt. 5166 - If OpenCL, omit. - If OpenCL, omit 5167 waitcnt lgkmcnt(0). 5168 - Must happen after 5169 any preceding 5170 local/generic 5171 load/store/load 5172 atomic/store 5173 atomic/atomicrmw. 
5174 - Could be split into
5175 separate s_waitcnt
5176 vmcnt(0), s_waitcnt
5177 vscnt(0) and s_waitcnt
5178 lgkmcnt(0) to allow
5179 them to be
5180 independently moved
5181 according to the
5182 following rules.
5183 - s_waitcnt vmcnt(0)
5184 must happen after
5185 any preceding
5186 global/generic load/load
5187 atomic/
5188 atomicrmw-with-return-value.
5189 - s_waitcnt vscnt(0)
5190 must happen after
5191 any preceding
5192 global/generic
5193 store/store
5194 atomic/
5195 atomicrmw-no-return-value.
5196 - s_waitcnt lgkmcnt(0)
5197 must happen after
5198 any preceding
5199 local/generic load/store/load
5200 atomic/store atomic/atomicrmw.
5201 - Must happen before - Must happen before
5202 the following the following
5203 atomicrmw. atomicrmw.
5204 - Ensures that all - Ensures that all
5205 memory operations memory operations
5206 to local have have
5207 completed before completed before
5208 performing the performing the
5209 atomicrmw that is atomicrmw that is
5210 being released. being released.
5211 5212 2. flat_atomic 2. flat_atomic
5213 atomicrmw release - agent - global 1. s_waitcnt lgkmcnt(0) & 1. s_waitcnt lgkmcnt(0) &
5214 - system - generic vmcnt(0) vmcnt(0) & vscnt(0)
5215 5216 - If OpenCL, omit - If OpenCL, omit
5217 lgkmcnt(0). lgkmcnt(0).
5218 - Could be split into - Could be split into
5219 separate s_waitcnt separate s_waitcnt
5220 vmcnt(0) and vmcnt(0), s_waitcnt
5221 s_waitcnt vscnt(0) and s_waitcnt
5222 lgkmcnt(0) to allow lgkmcnt(0) to allow
5223 them to be them to be
5224 independently moved independently moved
5225 according to the according to the
5226 following rules. following rules.
5227 - s_waitcnt vmcnt(0) - s_waitcnt vmcnt(0)
5228 must happen after must happen after
5229 any preceding any preceding
5230 global/generic global/generic
5231 load/store/load load/load atomic/
5232 atomic/store atomicrmw-with-return-value.
5233 atomic/atomicrmw.
5234 - s_waitcnt vscnt(0) 5235 must happen after 5236 any preceding 5237 global/generic 5238 store/store atomic/ 5239 atomicrmw-no-return-value. 5240 - s_waitcnt lgkmcnt(0) - s_waitcnt lgkmcnt(0) 5241 must happen after must happen after 5242 any preceding any preceding 5243 local/generic local/generic 5244 load/store/load load/store/load 5245 atomic/store atomic/store 5246 atomic/atomicrmw. atomic/atomicrmw. 5247 - Must happen before - Must happen before 5248 the following the following 5249 atomicrmw. atomicrmw. 5250 - Ensures that all - Ensures that all 5251 memory operations memory operations 5252 to global and local to global and local 5253 have completed have completed 5254 before performing before performing 5255 the atomicrmw that the atomicrmw that 5256 is being released. is being released. 5257 5258 2. buffer/global/ds/flat_atomic 2. buffer/global/ds/flat_atomic 5259 fence release - singlethread *none* *none* *none* 5260 - wavefront 5261 fence release - workgroup *none* 1. s_waitcnt lgkmcnt(0) 1. s_waitcnt lgkmcnt(0) & 5262 vmcnt(0) & vscnt(0) 5263 5264 - If CU wavefront execution mode, omit vmcnt and 5265 vscnt. 5266 - If OpenCL and - If OpenCL and 5267 address space is address space is 5268 not generic, omit. not generic, omit 5269 lgkmcnt(0). 5270 - If OpenCL and 5271 address space is 5272 local, omit 5273 vmcnt(0) and vscnt(0). 5274 - However, since LLVM - However, since LLVM 5275 currently has no currently has no 5276 address space on address space on 5277 the fence need to the fence need to 5278 conservatively conservatively 5279 always generate. If always generate. If 5280 fence had an fence had an 5281 address space then address space then 5282 set to address set to address 5283 space of OpenCL space of OpenCL 5284 fence flag, or to fence flag, or to 5285 generic if both generic if both 5286 local and global local and global 5287 flags are flags are 5288 specified. specified. 
5289 - Must happen after 5290 any preceding 5291 local/generic 5292 load/load 5293 atomic/store/store 5294 atomic/atomicrmw. 5295 - Could be split into 5296 separate s_waitcnt 5297 vmcnt(0), s_waitcnt 5298 vscnt(0) and s_waitcnt 5299 lgkmcnt(0) to allow 5300 them to be 5301 independently moved 5302 according to the 5303 following rules. 5304 - s_waitcnt vmcnt(0) 5305 must happen after 5306 any preceding 5307 global/generic 5308 load/load 5309 atomic/ 5310 atomicrmw-with-return-value. 5311 - s_waitcnt vscnt(0) 5312 must happen after 5313 any preceding 5314 global/generic 5315 store/store atomic/ 5316 atomicrmw-no-return-value. 5317 - s_waitcnt lgkmcnt(0) 5318 must happen after 5319 any preceding 5320 local/generic 5321 load/store/load 5322 atomic/store atomic/ 5323 atomicrmw. 5324 - Must happen before - Must happen before 5325 any following store any following store 5326 atomic/atomicrmw atomic/atomicrmw 5327 with an equal or with an equal or 5328 wider sync scope wider sync scope 5329 and memory ordering and memory ordering 5330 stronger than stronger than 5331 unordered (this is unordered (this is 5332 termed the termed the 5333 fence-paired-atomic). fence-paired-atomic). 5334 - Ensures that all - Ensures that all 5335 memory operations memory operations 5336 to local have have 5337 completed before completed before 5338 performing the performing the 5339 following following 5340 fence-paired-atomic. fence-paired-atomic. 5341 5342 fence release - agent *none* 1. s_waitcnt lgkmcnt(0) & 1. s_waitcnt lgkmcnt(0) & 5343 - system vmcnt(0) vmcnt(0) & vscnt(0) 5344 5345 - If OpenCL and - If OpenCL and 5346 address space is address space is 5347 not generic, omit not generic, omit 5348 lgkmcnt(0). lgkmcnt(0). 5349 - If OpenCL and - If OpenCL and 5350 address space is address space is 5351 local, omit local, omit 5352 vmcnt(0). vmcnt(0) and vscnt(0). 
5353 - However, since LLVM - However, since LLVM 5354 currently has no currently has no 5355 address space on address space on 5356 the fence need to the fence need to 5357 conservatively conservatively 5358 always generate. If always generate. If 5359 fence had an fence had an 5360 address space then address space then 5361 set to address set to address 5362 space of OpenCL space of OpenCL 5363 fence flag, or to fence flag, or to 5364 generic if both generic if both 5365 local and global local and global 5366 flags are flags are 5367 specified. specified. 5368 - Could be split into - Could be split into 5369 separate s_waitcnt separate s_waitcnt 5370 vmcnt(0) and vmcnt(0), s_waitcnt 5371 s_waitcnt vscnt(0) and s_waitcnt 5372 lgkmcnt(0) to allow lgkmcnt(0) to allow 5373 them to be them to be 5374 independently moved independently moved 5375 according to the according to the 5376 following rules. following rules. 5377 - s_waitcnt vmcnt(0) - s_waitcnt vmcnt(0) 5378 must happen after must happen after 5379 any preceding any preceding 5380 global/generic global/generic 5381 load/store/load load/load atomic/ 5382 atomic/store atomicrmw-with-return-value. 5383 atomic/atomicrmw. 5384 - s_waitcnt vscnt(0) 5385 must happen after 5386 any preceding 5387 global/generic 5388 store/store atomic/ 5389 atomicrmw-no-return-value. 5390 - s_waitcnt lgkmcnt(0) - s_waitcnt lgkmcnt(0) 5391 must happen after must happen after 5392 any preceding any preceding 5393 local/generic local/generic 5394 load/store/load load/store/load 5395 atomic/store atomic/store 5396 atomic/atomicrmw. atomic/atomicrmw. 5397 - Must happen before - Must happen before 5398 any following store any following store 5399 atomic/atomicrmw atomic/atomicrmw 5400 with an equal or with an equal or 5401 wider sync scope wider sync scope 5402 and memory ordering and memory ordering 5403 stronger than stronger than 5404 unordered (this is unordered (this is 5405 termed the termed the 5406 fence-paired-atomic). 
fence-paired-atomic). 5407 - Ensures that all - Ensures that all 5408 memory operations memory operations 5409 have have 5410 completed before completed before 5411 performing the performing the 5412 following following 5413 fence-paired-atomic. fence-paired-atomic. 5414 5415 **Acquire-Release Atomic** 5416 ---------------------------------------------------------------------------------------------------------------------- 5417 atomicrmw acq_rel - singlethread - global 1. buffer/global/ds/flat_atomic 1. buffer/global/ds/flat_atomic 5418 - wavefront - local 5419 - generic 5420 atomicrmw acq_rel - workgroup - global 1. s_waitcnt lgkmcnt(0) 1. s_waitcnt lgkmcnt(0) & 5421 vmcnt(0) & vscnt(0) 5422 5423 - If CU wavefront execution mode, omit vmcnt and 5424 vscnt. 5425 - If OpenCL, omit. - If OpenCL, omit 5426 s_waitcnt lgkmcnt(0). 5427 - Must happen after - Must happen after 5428 any preceding any preceding 5429 local/generic local/generic 5430 load/store/load load/store/load 5431 atomic/store atomic/store 5432 atomic/atomicrmw. atomic/atomicrmw. 5433 - Could be split into 5434 separate s_waitcnt 5435 vmcnt(0), s_waitcnt 5436 vscnt(0) and s_waitcnt 5437 lgkmcnt(0) to allow 5438 them to be 5439 independently moved 5440 according to the 5441 following rules. 5442 - s_waitcnt vmcnt(0) 5443 must happen after 5444 any preceding 5445 global/generic load/load 5446 atomic/ 5447 atomicrmw-with-return-value. 5448 - s_waitcnt vscnt(0) 5449 must happen after 5450 any preceding 5451 global/generic 5452 store/store 5453 atomic/ 5454 atomicrmw-no-return-value. 5455 - s_waitcnt lgkmcnt(0) 5456 must happen after 5457 any preceding 5458 local/generic load/store/load 5459 atomic/store atomic/atomicrmw. 5460 - Must happen before - Must happen before 5461 the following the following 5462 atomicrmw. atomicrmw. 
5463 - Ensures that all - Ensures that all 5464 memory operations memory operations 5465 to local have have 5466 completed before completed before 5467 performing the performing the 5468 atomicrmw that is atomicrmw that is 5469 being released. being released. 5470 5471 2. buffer/global/flat_atomic 2. buffer/global_atomic 5472 3. s_waitcnt vm/vscnt(0) 5473 5474 - If CU wavefront execution mode, omit vm/vscnt. 5475 - Use vmcnt if atomic with 5476 return and vscnt if atomic 5477 with no-return. 5478 waitcnt lgkmcnt(0). 5479 - Must happen before 5480 the following 5481 buffer_gl0_inv. 5482 - Ensures any 5483 following global 5484 data read is no 5485 older than the 5486 atomicrmw value 5487 being acquired. 5488 5489 4. buffer_gl0_inv 5490 5491 - If CU wavefront execution mode, omit. 5492 - Ensures that 5493 following 5494 loads will not see 5495 stale data. 5496 5497 atomicrmw acq_rel - workgroup - local 1. waitcnt vmcnt(0) & vscnt(0) 5498 5499 - If CU wavefront execution mode, omit. 5500 - If OpenCL, omit. 5501 - Could be split into 5502 separate s_waitcnt 5503 vmcnt(0) and s_waitcnt 5504 vscnt(0) to allow 5505 them to be 5506 independently moved 5507 according to the 5508 following rules. 5509 - s_waitcnt vmcnt(0) 5510 must happen after 5511 any preceding 5512 global/generic load/load 5513 atomic/ 5514 atomicrmw-with-return-value. 5515 - s_waitcnt vscnt(0) 5516 must happen after 5517 any preceding 5518 global/generic 5519 store/store atomic/ 5520 atomicrmw-no-return-value. 5521 - Must happen before 5522 the following 5523 store. 5524 - Ensures that all 5525 global memory 5526 operations have 5527 completed before 5528 performing the 5529 store that is being 5530 released. 5531 5532 1. ds_atomic 2. ds_atomic 5533 2. s_waitcnt lgkmcnt(0) 3. s_waitcnt lgkmcnt(0) 5534 5535 - If OpenCL, omit. - If OpenCL, omit. 5536 - Must happen before - Must happen before 5537 any following the following 5538 global/generic buffer_gl0_inv. 
5539 load/load 5540 atomic/store/store 5541 atomic/atomicrmw. 5542 - Ensures any - Ensures any 5543 following global following global 5544 data read is no data read is no 5545 older than the load older than the load 5546 atomic value being atomic value being 5547 acquired. acquired. 5548 5549 4. buffer_gl0_inv 5550 5551 - If CU wavefront execution mode, omit. 5552 - If OpenCL omit. 5553 - Ensures that 5554 following 5555 loads will not see 5556 stale data. 5557 5558 atomicrmw acq_rel - workgroup - generic 1. s_waitcnt lgkmcnt(0) 1. s_waitcnt lgkmcnt(0) & 5559 vmcnt(0) & vscnt(0) 5560 5561 - If CU wavefront execution mode, omit vmcnt and 5562 vscnt. 5563 - If OpenCL, omit. - If OpenCL, omit 5564 waitcnt lgkmcnt(0). 5565 - Must happen after 5566 any preceding 5567 local/generic 5568 load/store/load 5569 atomic/store 5570 atomic/atomicrmw. 5571 - Could be split into 5572 separate s_waitcnt 5573 vmcnt(0), s_waitcnt 5574 vscnt(0) and s_waitcnt 5575 lgkmcnt(0) to allow 5576 them to be 5577 independently moved 5578 according to the 5579 following rules. 5580 - s_waitcnt vmcnt(0) 5581 must happen after 5582 any preceding 5583 global/generic load/load 5584 atomic/ 5585 atomicrmw-with-return-value. 5586 - s_waitcnt vscnt(0) 5587 must happen after 5588 any preceding 5589 global/generic 5590 store/store 5591 atomic/ 5592 atomicrmw-no-return-value. 5593 - s_waitcnt lgkmcnt(0) 5594 must happen after 5595 any preceding 5596 local/generic load/store/load 5597 atomic/store atomic/atomicrmw. 5598 - Must happen before - Must happen before 5599 the following the following 5600 atomicrmw. atomicrmw. 5601 - Ensures that all - Ensures that all 5602 memory operations memory operations 5603 to local have have 5604 completed before completed before 5605 performing the performing the 5606 atomicrmw that is atomicrmw that is 5607 being released. being released. 5608 5609 2. flat_atomic 2. flat_atomic 5610 3. s_waitcnt lgkmcnt(0) 3. 
s_waitcnt lgkmcnt(0) & 5611 vm/vscnt(0) 5612 5613 - If CU wavefront execution mode, omit vm/vscnt. 5614 - If OpenCL, omit. - If OpenCL, omit 5615 waitcnt lgkmcnt(0). 5616 - Must happen before - Must happen before 5617 any following the following 5618 global/generic buffer_gl0_inv. 5619 load/load 5620 atomic/store/store 5621 atomic/atomicrmw. 5622 - Ensures any - Ensures any 5623 following global following global 5624 data read is no data read is no 5625 older than the load older than the load 5626 atomic value being atomic value being 5627 acquired. acquired. 5628 5629 4. buffer_gl0_inv 5630 5631 - If CU wavefront execution mode, omit. 5632 - Ensures that 5633 following 5634 loads will not see 5635 stale data. 5636 5637 atomicrmw acq_rel - agent - global 1. s_waitcnt lgkmcnt(0) & 1. s_waitcnt lgkmcnt(0) & 5638 - system vmcnt(0) vmcnt(0) & vscnt(0) 5639 5640 - If OpenCL, omit - If OpenCL, omit 5641 lgkmcnt(0). lgkmcnt(0). 5642 - Could be split into - Could be split into 5643 separate s_waitcnt separate s_waitcnt 5644 vmcnt(0) and vmcnt(0), s_waitcnt 5645 s_waitcnt vscnt(0) and s_waitcnt 5646 lgkmcnt(0) to allow lgkmcnt(0) to allow 5647 them to be them to be 5648 independently moved independently moved 5649 according to the according to the 5650 following rules. following rules. 5651 - s_waitcnt vmcnt(0) - s_waitcnt vmcnt(0) 5652 must happen after must happen after 5653 any preceding any preceding 5654 global/generic global/generic 5655 load/store/load load/load atomic/ 5656 atomic/store atomicrmw-with-return-value. 5657 atomic/atomicrmw. 5658 - s_waitcnt vscnt(0) 5659 must happen after 5660 any preceding 5661 global/generic 5662 store/store atomic/ 5663 atomicrmw-no-return-value. 5664 - s_waitcnt lgkmcnt(0) - s_waitcnt lgkmcnt(0) 5665 must happen after must happen after 5666 any preceding any preceding 5667 local/generic local/generic 5668 load/store/load load/store/load 5669 atomic/store atomic/store 5670 atomic/atomicrmw. atomic/atomicrmw.
5671 - Must happen before - Must happen before 5672 the following the following 5673 atomicrmw. atomicrmw. 5674 - Ensures that all - Ensures that all 5675 memory operations memory operations 5676 to global have to global have 5677 completed before completed before 5678 performing the performing the 5679 atomicrmw that is atomicrmw that is 5680 being released. being released. 5681 5682 2. buffer/global/flat_atomic 2. buffer/global_atomic 5683 3. s_waitcnt vmcnt(0) 3. s_waitcnt vm/vscnt(0) 5684 5685 - Use vmcnt if atomic with 5686 return and vscnt if atomic 5687 with no-return. 5688 waitcnt lgkmcnt(0). 5689 - Must happen before - Must happen before 5690 following following 5691 buffer_wbinvl1_vol. buffer_gl*_inv. 5692 - Ensures the - Ensures the 5693 atomicrmw has atomicrmw has 5694 completed before completed before 5695 invalidating the invalidating the 5696 cache. caches. 5697 5698 4. buffer_wbinvl1_vol 4. buffer_gl0_inv; 5699 buffer_gl1_inv 5700 5701 - Must happen before - Must happen before 5702 any following any following 5703 global/generic global/generic 5704 load/load load/load 5705 atomic/atomicrmw. atomic/atomicrmw. 5706 - Ensures that - Ensures that 5707 following loads following loads 5708 will not see stale will not see stale 5709 global data. global data. 5710 5711 atomicrmw acq_rel - agent - generic 1. s_waitcnt lgkmcnt(0) & 1. s_waitcnt lgkmcnt(0) & 5712 - system vmcnt(0) vmcnt(0) & vscnt(0) 5713 5714 - If OpenCL, omit - If OpenCL, omit 5715 lgkmcnt(0). lgkmcnt(0). 5716 - Could be split into - Could be split into 5717 separate s_waitcnt separate s_waitcnt 5718 vmcnt(0) and vmcnt(0), s_waitcnt 5719 s_waitcnt vscnt(0) and s_waitcnt 5720 lgkmcnt(0) to allow lgkmcnt(0) to allow 5721 them to be them to be 5722 independently moved independently moved 5723 according to the according to the 5724 following rules. following rules. 
5725 - s_waitcnt vmcnt(0) - s_waitcnt vmcnt(0) 5726 must happen after must happen after 5727 any preceding any preceding 5728 global/generic global/generic 5729 load/store/load load/load atomic 5730 atomic/store atomicrmw-with-return-value. 5731 atomic/atomicrmw. 5732 - s_waitcnt vscnt(0) 5733 must happen after 5734 any preceding 5735 global/generic 5736 store/store atomic/ 5737 atomicrmw-no-return-value. 5738 - s_waitcnt lgkmcnt(0) - s_waitcnt lgkmcnt(0) 5739 must happen after must happen after 5740 any preceding any preceding 5741 local/generic local/generic 5742 load/store/load load/store/load 5743 atomic/store atomic/store 5744 atomic/atomicrmw. atomic/atomicrmw. 5745 - Must happen before - Must happen before 5746 the following the following 5747 atomicrmw. atomicrmw. 5748 - Ensures that all - Ensures that all 5749 memory operations memory operations 5750 to global have have 5751 completed before completed before 5752 performing the performing the 5753 atomicrmw that is atomicrmw that is 5754 being released. being released. 5755 5756 2. flat_atomic 2. flat_atomic 5757 3. s_waitcnt vmcnt(0) & 3. s_waitcnt vm/vscnt(0) & 5758 lgkmcnt(0) lgkmcnt(0) 5759 5760 - If OpenCL, omit - If OpenCL, omit 5761 lgkmcnt(0). lgkmcnt(0). 5762 - Use vmcnt if atomic with 5763 return and vscnt if atomic 5764 with no-return. 5765 - Must happen before - Must happen before 5766 following following 5767 buffer_wbinvl1_vol. buffer_gl*_inv. 5768 - Ensures the - Ensures the 5769 atomicrmw has atomicrmw has 5770 completed before completed before 5771 invalidating the invalidating the 5772 cache. caches. 5773 5774 4. buffer_wbinvl1_vol 4. buffer_gl0_inv; 5775 buffer_gl1_inv 5776 5777 - Must happen before - Must happen before 5778 any following any following 5779 global/generic global/generic 5780 load/load load/load 5781 atomic/atomicrmw. atomic/atomicrmw. 5782 - Ensures that - Ensures that 5783 following loads following loads 5784 will not see stale will not see stale 5785 global data. 
global data. 5786 5787 fence acq_rel - singlethread *none* *none* *none* 5788 - wavefront 5789 fence acq_rel - workgroup *none* 1. s_waitcnt lgkmcnt(0) 1. s_waitcnt lgkmcnt(0) & 5790 vmcnt(0) & vscnt(0) 5791 5792 - If CU wavefront execution mode, omit vmcnt and 5793 vscnt. 5794 - If OpenCL and - If OpenCL and 5795 address space is address space is 5796 not generic, omit. not generic, omit 5797 lgkmcnt(0). 5798 - If OpenCL and 5799 address space is 5800 local, omit 5801 vmcnt(0) and vscnt(0). 5802 - However, - However, 5803 since LLVM since LLVM 5804 currently has no currently has no 5805 address space on address space on 5806 the fence need to the fence need to 5807 conservatively conservatively 5808 always generate always generate 5809 (see comment for (see comment for 5810 previous fence). previous fence). 5811 - Must happen after 5812 any preceding 5813 local/generic 5814 load/load 5815 atomic/store/store 5816 atomic/atomicrmw. 5817 - Could be split into 5818 separate s_waitcnt 5819 vmcnt(0), s_waitcnt 5820 vscnt(0) and s_waitcnt 5821 lgkmcnt(0) to allow 5822 them to be 5823 independently moved 5824 according to the 5825 following rules. 5826 - s_waitcnt vmcnt(0) 5827 must happen after 5828 any preceding 5829 global/generic 5830 load/load 5831 atomic/ 5832 atomicrmw-with-return-value. 5833 - s_waitcnt vscnt(0) 5834 must happen after 5835 any preceding 5836 global/generic 5837 store/store atomic/ 5838 atomicrmw-no-return-value. 5839 - s_waitcnt lgkmcnt(0) 5840 must happen after 5841 any preceding 5842 local/generic 5843 load/store/load 5844 atomic/store atomic/ 5845 atomicrmw. 5846 - Must happen before - Must happen before 5847 any following any following 5848 global/generic global/generic 5849 load/load load/load 5850 atomic/store/store atomic/store/store 5851 atomic/atomicrmw. atomic/atomicrmw. 
5852 - Ensures that all - Ensures that all 5853 memory operations memory operations 5854 to local have have 5855 completed before completed before 5856 performing any performing any 5857 following global following global 5858 memory operations. memory operations. 5859 - Ensures that the - Ensures that the 5860 preceding preceding 5861 local/generic load local/generic load 5862 atomic/atomicrmw atomic/atomicrmw 5863 with an equal or with an equal or 5864 wider sync scope wider sync scope 5865 and memory ordering and memory ordering 5866 stronger than stronger than 5867 unordered (this is unordered (this is 5868 termed the termed the 5869 acquire-fence-paired-atomic acquire-fence-paired-atomic 5870 ) has completed ) has completed 5871 before following before following 5872 global memory global memory 5873 operations. This operations. This 5874 satisfies the satisfies the 5875 requirements of requirements of 5876 acquire. acquire. 5877 - Ensures that all - Ensures that all 5878 previous memory previous memory 5879 operations have operations have 5880 completed before a completed before a 5881 following following 5882 local/generic store local/generic store 5883 atomic/atomicrmw atomic/atomicrmw 5884 with an equal or with an equal or 5885 wider sync scope wider sync scope 5886 and memory ordering and memory ordering 5887 stronger than stronger than 5888 unordered (this is unordered (this is 5889 termed the termed the 5890 release-fence-paired-atomic release-fence-paired-atomic 5891 ). This satisfies the ). This satisfies the 5892 requirements of requirements of 5893 release. release. 5894 - Must happen before 5895 the following 5896 buffer_gl0_inv. 5897 - Ensures that the 5898 acquire-fence-paired 5899 atomic has completed 5900 before invalidating 5901 the 5902 cache. Therefore 5903 any following 5904 locations read must 5905 be no older than 5906 the value read by 5907 the 5908 acquire-fence-paired-atomic. 5909 5910 3. 
buffer_gl0_inv 5911 5912 - If CU wavefront execution mode, omit. 5913 - Ensures that 5914 following 5915 loads will not see 5916 stale data. 5917 5918 fence acq_rel - agent *none* 1. s_waitcnt lgkmcnt(0) & 1. s_waitcnt lgkmcnt(0) & 5919 - system vmcnt(0) vmcnt(0) & vscnt(0) 5920 5921 - If OpenCL and - If OpenCL and 5922 address space is address space is 5923 not generic, omit not generic, omit 5924 lgkmcnt(0). lgkmcnt(0). 5925 - If OpenCL and 5926 address space is 5927 local, omit 5928 vmcnt(0) and vscnt(0). 5929 - However, since LLVM - However, since LLVM 5930 currently has no currently has no 5931 address space on address space on 5932 the fence need to the fence need to 5933 conservatively conservatively 5934 always generate always generate 5935 (see comment for (see comment for 5936 previous fence). previous fence). 5937 - Could be split into - Could be split into 5938 separate s_waitcnt separate s_waitcnt 5939 vmcnt(0) and vmcnt(0), s_waitcnt 5940 s_waitcnt vscnt(0) and s_waitcnt 5941 lgkmcnt(0) to allow lgkmcnt(0) to allow 5942 them to be them to be 5943 independently moved independently moved 5944 according to the according to the 5945 following rules. following rules. 5946 - s_waitcnt vmcnt(0) - s_waitcnt vmcnt(0) 5947 must happen after must happen after 5948 any preceding any preceding 5949 global/generic global/generic 5950 load/store/load load/load 5951 atomic/store atomic/ 5952 atomic/atomicrmw. atomicrmw-with-return-value. 5953 - s_waitcnt vscnt(0) 5954 must happen after 5955 any preceding 5956 global/generic 5957 store/store atomic/ 5958 atomicrmw-no-return-value. 5959 - s_waitcnt lgkmcnt(0) - s_waitcnt lgkmcnt(0) 5960 must happen after must happen after 5961 any preceding any preceding 5962 local/generic local/generic 5963 load/store/load load/store/load 5964 atomic/store atomic/store 5965 atomic/atomicrmw. atomic/atomicrmw. 5966 - Must happen before - Must happen before 5967 the following the following 5968 buffer_wbinvl1_vol. buffer_gl*_inv. 
5969 - Ensures that the - Ensures that the 5970 preceding preceding 5971 global/local/generic global/local/generic 5972 load load 5973 atomic/atomicrmw atomic/atomicrmw 5974 with an equal or with an equal or 5975 wider sync scope wider sync scope 5976 and memory ordering and memory ordering 5977 stronger than stronger than 5978 unordered (this is unordered (this is 5979 termed the termed the 5980 acquire-fence-paired-atomic acquire-fence-paired-atomic 5981 ) has completed ) has completed 5982 before invalidating before invalidating 5983 the cache. This the caches. This 5984 satisfies the satisfies the 5985 requirements of requirements of 5986 acquire. acquire. 5987 - Ensures that all - Ensures that all 5988 previous memory previous memory 5989 operations have operations have 5990 completed before a completed before a 5991 following following 5992 global/local/generic global/local/generic 5993 store store 5994 atomic/atomicrmw atomic/atomicrmw 5995 with an equal or with an equal or 5996 wider sync scope wider sync scope 5997 and memory ordering and memory ordering 5998 stronger than stronger than 5999 unordered (this is unordered (this is 6000 termed the termed the 6001 release-fence-paired-atomic release-fence-paired-atomic 6002 ). This satisfies the ). This satisfies the 6003 requirements of requirements of 6004 release. release. 6005 6006 2. buffer_wbinvl1_vol 2. buffer_gl0_inv; 6007 buffer_gl1_inv 6008 6009 - Must happen before - Must happen before 6010 any following any following 6011 global/generic global/generic 6012 load/load load/load 6013 atomic/store/store atomic/store/store 6014 atomic/atomicrmw. atomic/atomicrmw. 6015 - Ensures that - Ensures that 6016 following loads following loads 6017 will not see stale will not see stale 6018 global data. This global data. This 6019 satisfies the satisfies the 6020 requirements of requirements of 6021 acquire. acquire. 
6022 6023 **Sequential Consistent Atomic** 6024 ---------------------------------------------------------------------------------------------------------------------- 6025 load atomic seq_cst - singlethread - global *Same as corresponding *Same as corresponding 6026 - wavefront - local load atomic acquire, load atomic acquire, 6027 - generic except must generate except must generate 6028 all instructions even all instructions even 6029 for OpenCL.* for OpenCL.* 6030 load atomic seq_cst - workgroup - global 1. s_waitcnt lgkmcnt(0) 1. s_waitcnt lgkmcnt(0) & 6031 - generic vmcnt(0) & vscnt(0) 6032 6033 - If CU wavefront execution mode, omit vmcnt and 6034 vscnt. 6035 - Could be split into 6036 separate s_waitcnt 6037 vmcnt(0), s_waitcnt 6038 vscnt(0) and s_waitcnt 6039 lgkmcnt(0) to allow 6040 them to be 6041 independently moved 6042 according to the 6043 following rules. 6044 - Must - waitcnt lgkmcnt(0) must 6045 happen after happen after 6046 preceding preceding 6047 global/generic load local load 6048 atomic/store atomic/store 6049 atomic/atomicrmw atomic/atomicrmw 6050 with memory with memory 6051 ordering of seq_cst ordering of seq_cst 6052 and with equal or and with equal or 6053 wider sync scope. wider sync scope. 6054 (Note that seq_cst (Note that seq_cst 6055 fences have their fences have their 6056 own s_waitcnt own s_waitcnt 6057 lgkmcnt(0) and so do lgkmcnt(0) and so do 6058 not need to be not need to be 6059 considered.) considered.) 6060 - waitcnt vmcnt(0) 6061 Must happen after 6062 preceding 6063 global/generic load 6064 atomic/ 6065 atomicrmw-with-return-value 6066 with memory 6067 ordering of seq_cst 6068 and with equal or 6069 wider sync scope. 6070 (Note that seq_cst 6071 fences have their 6072 own s_waitcnt 6073 vmcnt(0) and so do 6074 not need to be 6075 considered.)
6076 - waitcnt vscnt(0) 6077 Must happen after 6078 preceding 6079 global/generic store 6080 atomic/ 6081 atomicrmw-no-return-value 6082 with memory 6083 ordering of seq_cst 6084 and with equal or 6085 wider sync scope. 6086 (Note that seq_cst 6087 fences have their 6088 own s_waitcnt 6089 vscnt(0) and so do 6090 not need to be 6091 considered.) 6092 - Ensures any - Ensures any 6093 preceding preceding 6094 sequential sequential 6095 consistent local consistent global/local 6096 memory instructions memory instructions 6097 have completed have completed 6098 before executing before executing 6099 this sequentially this sequentially 6100 consistent consistent 6101 instruction. This instruction. This 6102 prevents reordering prevents reordering 6103 a seq_cst store a seq_cst store 6104 followed by a followed by a 6105 seq_cst load. (Note seq_cst load. (Note 6106 that seq_cst is that seq_cst is 6107 stronger than stronger than 6108 acquire/release as acquire/release as 6109 the reordering of the reordering of 6110 load acquire load acquire 6111 followed by a store followed by a store 6112 release is release is 6113 prevented by the prevented by the 6114 waitcnt of waitcnt of 6115 the release, but the release, but 6116 there is nothing there is nothing 6117 preventing a store preventing a store 6118 release followed by release followed by 6119 load acquire from load acquire from 6120 completing out of completing out of 6121 order.) order.) 6122 6123 2. *Following 2. *Following 6124 instructions same as instructions same as 6125 corresponding load corresponding load 6126 atomic acquire, atomic acquire, 6127 except must generate except must generate 6128 all instructions even all instructions even 6129 for OpenCL.* for OpenCL.* 6130 load atomic seq_cst - workgroup - local *Same as corresponding 6131 load atomic acquire, 6132 except must generate 6133 all instructions even 6134 for OpenCL.* 6135 6136 1. 
s_waitcnt vmcnt(0) & vscnt(0) 6137 6138 - If CU wavefront execution mode, omit. 6139 - Could be split into 6140 separate s_waitcnt 6141 vmcnt(0) and s_waitcnt 6142 vscnt(0) to allow 6143 them to be 6144 independently moved 6145 according to the 6146 following rules. 6147 - waitcnt vmcnt(0) 6148 Must happen after 6149 preceding 6150 global/generic load 6151 atomic/ 6152 atomicrmw-with-return-value 6153 with memory 6154 ordering of seq_cst 6155 and with equal or 6156 wider sync scope. 6157 (Note that seq_cst 6158 fences have their 6159 own s_waitcnt 6160 vmcnt(0) and so do 6161 not need to be 6162 considered.) 6163 - waitcnt vscnt(0) 6164 Must happen after 6165 preceding 6166 global/generic store 6167 atomic/ 6168 atomicrmw-no-return-value 6169 with memory 6170 ordering of seq_cst 6171 and with equal or 6172 wider sync scope. 6173 (Note that seq_cst 6174 fences have their 6175 own s_waitcnt 6176 vscnt(0) and so do 6177 not need to be 6178 considered.) 6179 - Ensures any 6180 preceding 6181 sequential 6182 consistent global 6183 memory instructions 6184 have completed 6185 before executing 6186 this sequentially 6187 consistent 6188 instruction. This 6189 prevents reordering 6190 a seq_cst store 6191 followed by a 6192 seq_cst load. (Note 6193 that seq_cst is 6194 stronger than 6195 acquire/release as 6196 the reordering of 6197 load acquire 6198 followed by a store 6199 release is 6200 prevented by the 6201 waitcnt of 6202 the release, but 6203 there is nothing 6204 preventing a store 6205 release followed by 6206 load acquire from 6207 completing out of 6208 order.) 6209 6210 2. *Following 6211 instructions same as 6212 corresponding load 6213 atomic acquire, 6214 except must generate 6215 all instructions even 6216 for OpenCL.* 6217 6218 load atomic seq_cst - agent - global 1. s_waitcnt lgkmcnt(0) & 1. 
s_waitcnt lgkmcnt(0) & 6219 - system - generic vmcnt(0) vmcnt(0) & vscnt(0) 6220 6221 - Could be split into - Could be split into 6222 separate s_waitcnt separate s_waitcnt 6223 vmcnt(0) vmcnt(0), s_waitcnt 6224 and s_waitcnt vscnt(0) and s_waitcnt 6225 lgkmcnt(0) to allow lgkmcnt(0) to allow 6226 them to be them to be 6227 independently moved independently moved 6228 according to the according to the 6229 following rules. following rules. 6230 - waitcnt lgkmcnt(0) - waitcnt lgkmcnt(0) 6231 must happen after must happen after 6232 preceding preceding 6233 global/generic load local load 6234 atomic/store atomic/store 6235 atomic/atomicrmw atomic/atomicrmw 6236 with memory with memory 6237 ordering of seq_cst ordering of seq_cst 6238 and with equal or and with equal or 6239 wider sync scope. wider sync scope. 6240 (Note that seq_cst (Note that seq_cst 6241 fences have their fences have their 6242 own s_waitcnt own s_waitcnt 6243 lgkmcnt(0) and so do lgkmcnt(0) and so do 6244 not need to be not need to be 6245 considered.) considered.) 6246 - waitcnt vmcnt(0) - waitcnt vmcnt(0) 6247 must happen after must happen after 6248 preceding preceding 6249 global/generic load global/generic load 6250 atomic/store atomic/ 6251 atomic/atomicrmw atomicrmw-with-return-value 6252 with memory with memory 6253 ordering of seq_cst ordering of seq_cst 6254 and with equal or and with equal or 6255 wider sync scope. wider sync scope. 6256 (Note that seq_cst (Note that seq_cst 6257 fences have their fences have their 6258 own s_waitcnt own s_waitcnt 6259 vmcnt(0) and so do vmcnt(0) and so do 6260 not need to be not need to be 6261 considered.) considered.) 6262 - waitcnt vscnt(0) 6263 Must happen after 6264 preceding 6265 global/generic store 6266 atomic/ 6267 atomicrmw-no-return-value 6268 with memory 6269 ordering of seq_cst 6270 and with equal or 6271 wider sync scope. 
6272 (Note that seq_cst 6273 fences have their 6274 own s_waitcnt 6275 vscnt(0) and so do 6276 not need to be 6277 considered.) 6278 - Ensures any - Ensures any 6279 preceding preceding 6280 sequential sequential 6281 consistent global consistent global 6282 memory instructions memory instructions 6283 have completed have completed 6284 before executing before executing 6285 this sequentially this sequentially 6286 consistent consistent 6287 instruction. This instruction. This 6288 prevents reordering prevents reordering 6289 a seq_cst store a seq_cst store 6290 followed by a followed by a 6291 seq_cst load. (Note seq_cst load. (Note 6292 that seq_cst is that seq_cst is 6293 stronger than stronger than 6294 acquire/release as acquire/release as 6295 the reordering of the reordering of 6296 load acquire load acquire 6297 followed by a store followed by a store 6298 release is release is 6299 prevented by the prevented by the 6300 waitcnt of waitcnt of 6301 the release, but the release, but 6302 there is nothing there is nothing 6303 preventing a store preventing a store 6304 release followed by release followed by 6305 load acquire from load acquire from 6306 competing out of competing out of 6307 order.) order.) 6308 6309 2. *Following 2. 
*Following 6310 instructions same as instructions same as 6311 corresponding load corresponding load 6312 atomic acquire, atomic acquire, 6313 except must generated except must generated 6314 all instructions even all instructions even 6315 for OpenCL.* for OpenCL.* 6316 store atomic seq_cst - singlethread - global *Same as corresponding *Same as corresponding 6317 - wavefront - local store atomic release, store atomic release, 6318 - workgroup - generic except must generated except must generated 6319 all instructions even all instructions even 6320 for OpenCL.* for OpenCL.* 6321 store atomic seq_cst - agent - global *Same as corresponding *Same as corresponding 6322 - system - generic store atomic release, store atomic release, 6323 except must generated except must generated 6324 all instructions even all instructions even 6325 for OpenCL.* for OpenCL.* 6326 atomicrmw seq_cst - singlethread - global *Same as corresponding *Same as corresponding 6327 - wavefront - local atomicrmw acq_rel, atomicrmw acq_rel, 6328 - workgroup - generic except must generated except must generated 6329 all instructions even all instructions even 6330 for OpenCL.* for OpenCL.* 6331 atomicrmw seq_cst - agent - global *Same as corresponding *Same as corresponding 6332 - system - generic atomicrmw acq_rel, atomicrmw acq_rel, 6333 except must generated except must generated 6334 all instructions even all instructions even 6335 for OpenCL.* for OpenCL.* 6336 fence seq_cst - singlethread *none* *Same as corresponding *Same as corresponding 6337 - wavefront fence acq_rel, fence acq_rel, 6338 - workgroup except must generated except must generated 6339 - agent all instructions even all instructions even 6340 - system for OpenCL.* for OpenCL.* 6341 ============ ============ ============== ========== =============================== ================================== 6342 6343The memory order also adds the single thread optimization constrains defined in 6344table 
:ref:`amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-gfx6-gfx10-table`.

  .. table:: AMDHSA Memory Model Single Thread Optimization Constraints GFX6-GFX10
     :name: amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-gfx6-gfx10-table

     ============ ==============================================================
     LLVM Memory  Optimization Constraints
     Ordering
     ============ ==============================================================
     unordered    *none*
     monotonic    *none*
     acquire      - If a load atomic/atomicrmw then no following load/load
                    atomic/store/store atomic/atomicrmw/fence instruction can
                    be moved before the acquire.
                  - If a fence then same as load atomic, plus no preceding
                    associated fence-paired-atomic can be moved after the fence.
     release      - If a store atomic/atomicrmw then no preceding load/load
                    atomic/store/store atomic/atomicrmw/fence instruction can
                    be moved after the release.
                  - If a fence then same as store atomic, plus no following
                    associated fence-paired-atomic can be moved before the
                    fence.
     acq_rel      Same constraints as both acquire and release.
     seq_cst      - If a load atomic then same constraints as acquire, plus no
                    preceding sequentially consistent load atomic/store
                    atomic/atomicrmw/fence instruction can be moved after the
                    seq_cst.
                  - If a store atomic then the same constraints as release, plus
                    no following sequentially consistent load atomic/store
                    atomic/atomicrmw/fence instruction can be moved before the
                    seq_cst.
                  - If an atomicrmw/fence then same constraints as acq_rel.
     ============ ==============================================================

Trap Handler ABI
~~~~~~~~~~~~~~~~

For code objects generated by the AMDGPU backend for HSA [HSA]_ compatible
runtimes (such as ROCm [AMD-ROCm]_), the runtime installs a trap handler that
supports the ``s_trap`` instruction with the following usage:

  .. table:: AMDGPU Trap Handler for AMDHSA OS
     :name: amdgpu-trap-handler-for-amdhsa-os-table

     =================== =============== =============== =======================
     Usage               Code Sequence   Trap Handler    Description
                                         Inputs
     =================== =============== =============== =======================
     reserved            ``s_trap 0x00``                 Reserved by hardware.
     ``debugtrap(arg)``  ``s_trap 0x01`` ``SGPR0-1``:    Reserved for HSA
                                           ``queue_ptr`` ``debugtrap``
                                         ``VGPR0``:      intrinsic (not
                                           ``arg``       implemented).
     ``llvm.trap``       ``s_trap 0x02`` ``SGPR0-1``:    Causes dispatch to be
                                           ``queue_ptr`` terminated and its
                                                         associated queue put
                                                         into the error state.
     ``llvm.debugtrap``  ``s_trap 0x03``                 - If debugger not
                                                           installed then
                                                           behaves as a
                                                           no-operation. The
                                                           trap handler is
                                                           entered and
                                                           immediately returns
                                                           to continue
                                                           execution of the
                                                           wavefront.
                                                         - If the debugger is
                                                           installed, causes
                                                           the debug trap to be
                                                           reported by the
                                                           debugger and the
                                                           wavefront is put in
                                                           the halt state until
                                                           resumed by the
                                                           debugger.
     reserved            ``s_trap 0x04``                 Reserved.
     reserved            ``s_trap 0x05``                 Reserved.
     reserved            ``s_trap 0x06``                 Reserved.
     debugger breakpoint ``s_trap 0x07``                 Reserved for debugger
                                                         breakpoints.
     reserved            ``s_trap 0x08``                 Reserved.
     reserved            ``s_trap 0xfe``                 Reserved.
     reserved            ``s_trap 0xff``                 Reserved.
     =================== =============== =============== =======================

.. _amdgpu-amdhsa-function-call-convention:

Call Convention
~~~~~~~~~~~~~~~

.. note::

  This section is currently incomplete and has inaccuracies. It is a WIP that
  will be updated as information is determined.

See :ref:`amdgpu-dwarf-address-space-identifier` for information on swizzled
addresses. Unswizzled addresses are normal linear addresses.

.. _amdgpu-amdhsa-function-call-convention-kernel-functions:

Kernel Functions
++++++++++++++++

This section describes the call convention ABI for the outer kernel function.

See :ref:`amdgpu-amdhsa-initial-kernel-execution-state` for the kernel call
convention.

The following is not part of the AMDGPU kernel calling convention but describes
how the AMDGPU implements function calls:

1. Clang decides the kernarg layout to match the *HSA Programmer's Language
   Reference* [HSA]_.

   - All structs are passed directly.
   - Lambda values are passed *TBA*.

   .. TODO::

      - Does this really follow HSA rules? Or are structs >16 bytes passed as
        by-value struct?
      - What is the ABI for lambda values?

2. The kernel performs certain setup in its prolog, as described in
   :ref:`amdgpu-amdhsa-kernel-prolog`.

.. _amdgpu-amdhsa-function-call-convention-non-kernel-functions:

Non-Kernel Functions
++++++++++++++++++++

This section describes the call convention ABI for functions other than the
outer kernel function.

If a kernel has function calls then scratch is always allocated and used for
the call stack, which grows from low address to high address using the swizzled
scratch address space.

On entry to a function:

1. SGPR0-3 contain a V# with the following properties (see
   :ref:`amdgpu-amdhsa-kernel-prolog-private-segment-buffer`):

   * Base address pointing to the beginning of the wavefront scratch backing
     memory.
   * Swizzled with dword element size and stride of wavefront size elements.

2. The FLAT_SCRATCH register pair is set up. See
   :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`.
3. GFX6-8: The M0 register is set to the size of LDS in bytes. See
   :ref:`amdgpu-amdhsa-kernel-prolog-m0`.
4. The EXEC register is set to the lanes active on entry to the function.
5. MODE register: *TBD*
6. VGPR0-31 and SGPR4-29 are used to pass function input arguments as described
   below.
7. SGPR30-31 hold the return address (RA), the code address that the function
   must return to when it completes. The value is undefined if the function is
   *no return*.
8. SGPR32 is used for the stack pointer (SP). It is an unswizzled scratch
   offset relative to the beginning of the wavefront scratch backing memory.

   The unswizzled SP can be used with buffer instructions as an unswizzled SGPR
   offset with the scratch V# in SGPR0-3 to access the stack in a swizzled
   manner.

   The unswizzled SP value can be converted into the swizzled SP value by:

     | swizzled SP = unswizzled SP / wavefront size

   This may be used to obtain the private address space address of stack
   objects and to convert this address to a flat address by adding the flat
   scratch aperture base address.

   The swizzled SP value is always 4 bytes aligned for the ``r600``
   architecture and 16 byte aligned for the ``amdgcn`` architecture.

   .. note::

     The ``amdgcn`` value is selected to avoid dynamic stack alignment for the
     OpenCL language, which has the largest base type defined as 16 bytes.

   On entry, the swizzled SP value is the address of the first function
   argument passed on the stack.
   Other stack passed arguments are positive offsets from the entry swizzled
   SP value.

   The function may use positive offsets beyond the last stack passed argument
   for stack allocated local variables and register spill slots. If necessary,
   the function may align these to greater alignment than 16 bytes. After
   these, the function may dynamically allocate space for such things as
   runtime sized ``alloca`` local allocations.

   If the function calls another function, it will place any stack allocated
   arguments after the last local allocation and adjust SGPR32 to the address
   after the last local allocation.

9. All other registers are unspecified.
10. Any necessary ``waitcnt`` has been performed to ensure memory is available
    to the function.

On exit from a function:

1. VGPR0-31 and SGPR4-29 are used to pass function result arguments as
   described below. Any registers used are considered clobbered registers.
2. The following registers are preserved and have the same value as on entry:

   * FLAT_SCRATCH
   * EXEC
   * GFX6-8: M0
   * All SGPR registers except the clobbered registers of SGPR4-31.
   * VGPR40-47
     VGPR56-63
     VGPR72-79
     VGPR88-95
     VGPR104-111
     VGPR120-127
     VGPR136-143
     VGPR152-159
     VGPR168-175
     VGPR184-191
     VGPR200-207
     VGPR216-223
     VGPR232-239
     VGPR248-255

     *Except for the argument registers, the clobbered and the preserved VGPR
     registers are intermixed at regular intervals in order to get better
     occupancy.*

   For the AMDGPU backend, an inter-procedural register allocation (IPRA)
   optimization may mark some of the clobbered SGPR and VGPR registers as
   preserved if it can be determined that the called function does not change
   their value.

3. The PC is set to the RA provided on entry.
4. MODE register: *TBD*.
5. All other registers are clobbered.
6. Any necessary ``waitcnt`` has been performed to ensure memory accessed by
   the function is available to the caller.

.. TODO::

   - On gfx908 are all ACC registers clobbered?

   - How are function results returned? The address of structured types is
     passed by reference, but what about other types?

The function input arguments are made up of the formal arguments explicitly
declared by the source language function plus the implicit input arguments used
by the implementation.

The source language input arguments are:

1. Any source language implicit ``this`` or ``self`` argument comes first as a
   pointer type.
2. Followed by the function formal arguments in left to right source order.

The source language result arguments are:

1. The function result argument.

The source language input or result struct type arguments that are less than or
equal to 16 bytes are decomposed recursively into their base type fields, and
each field is passed as if a separate argument. For input arguments, if the
called function requires the struct to be in memory, for example because its
address is taken, then the function body is responsible for allocating a stack
location and copying the field arguments into it. Clang terms this *direct
struct*.

The source language input struct type arguments that are greater than 16 bytes
are passed by reference. The caller is responsible for allocating a stack
location to make a copy of the struct value and passing the address as the
input argument. The called function is responsible for performing the
dereference when accessing the input argument. Clang terms this *by-value
struct*.

A source language result struct type argument that is greater than 16 bytes is
returned by reference.
The caller is responsible for allocating a stack location to hold the result
value and passes the address as the last input argument (before the implicit
input arguments). In this case there are no result arguments. The called
function is responsible for performing the dereference when storing the result
value. Clang terms this *structured return (sret)*.

*TODO: correct the ``sret`` definition.*

.. TODO::

   Is this definition correct? Or is ``sret`` only used if passing in
   registers, and passed as a non-decomposed struct as a stack argument
   otherwise? Or something else? Is the memory location in the caller stack
   frame, or a stack memory argument and so no address is passed as the caller
   can directly write to the argument stack location? But then the stack
   location is still live after return. If an argument stack location, is it
   the first stack argument or the last one?

Lambda argument types are treated as struct types with an implementation
defined set of fields.

.. TODO::

   Need to specify the ABI for lambda types for AMDGPU.

For the AMDGPU backend, all source language arguments (including the decomposed
struct type arguments) are passed in VGPRs unless marked ``inreg``, in which
case they are passed in SGPRs.

The AMDGPU backend walks the function call graph from the leaves to determine
which implicit input arguments are used, propagating to each caller of the
function. The used implicit arguments are appended to the function arguments
after the source language arguments in the following order:

.. TODO::

   Is recursion or external functions supported?

1. Work-Item ID (1 VGPR)

   The X, Y and Z work-item IDs are packed into a single VGPR with the
   following layout. Only fields actually used by the function are set. The
   other bits are undefined.

   The values come from the initial kernel execution state. See
   :ref:`amdgpu-amdhsa-vgpr-register-set-up-order-table`.

   .. table:: Work-item implicit argument layout
      :name: amdgpu-amdhsa-workitem-implicit-argument-layout-table

      ======= ======= ==============
      Bits    Size    Field Name
      ======= ======= ==============
      9:0     10 bits X Work-Item ID
      19:10   10 bits Y Work-Item ID
      29:20   10 bits Z Work-Item ID
      31:30   2 bits  Unused
      ======= ======= ==============

2. Dispatch Ptr (2 SGPRs)

   The value comes from the initial kernel execution state. See
   :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.

3. Queue Ptr (2 SGPRs)

   The value comes from the initial kernel execution state. See
   :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.

4. Kernarg Segment Ptr (2 SGPRs)

   The value comes from the initial kernel execution state. See
   :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.

5. Dispatch Id (2 SGPRs)

   The value comes from the initial kernel execution state. See
   :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.

6. Work-Group ID X (1 SGPR)

   The value comes from the initial kernel execution state. See
   :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.

7. Work-Group ID Y (1 SGPR)

   The value comes from the initial kernel execution state. See
   :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.

8. Work-Group ID Z (1 SGPR)

   The value comes from the initial kernel execution state. See
   :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.

9. Implicit Argument Ptr (2 SGPRs)

   The value is computed by adding an offset to the Kernarg Segment Ptr to get
   the global address space pointer to the first kernarg implicit argument.

The input and result arguments are assigned in order in the following manner:

.. note::

  There are likely some errors and omissions in the following description that
  need correction.

  .. TODO::

     Check the clang source code to decipher how function arguments and return
     results are handled. Also see the AMDGPU specific values used.

* VGPR arguments are assigned to consecutive VGPRs starting at VGPR0 up to
  VGPR31.

  If there are more arguments than will fit in these registers, the remaining
  arguments are allocated on the stack in order on naturally aligned
  addresses.

  .. TODO::

     How are overly aligned structures allocated on the stack?

* SGPR arguments are assigned to consecutive SGPRs starting at SGPR0 up to
  SGPR29.

  If there are more arguments than will fit in these registers, the remaining
  arguments are allocated on the stack in order on naturally aligned
  addresses.

Note that decomposed struct type arguments may have some fields passed in
registers and some in memory.

.. TODO::

   So, a struct which can pass some fields as decomposed register arguments
   will pass the rest as decomposed stack elements? But an argument that will
   not start in registers will not be decomposed and will be passed as a
   non-decomposed stack value?

The following is not part of the AMDGPU function calling convention but
describes how the AMDGPU implements function calls:

1. SGPR33 is used as a frame pointer (FP) if necessary. Like the SP, it is an
   unswizzled scratch address. It is only needed if runtime sized ``alloca``
   are used, or for the reasons defined in ``SIFrameLowering``.
2. Runtime stack alignment is supported. SGPR34 is used as a base pointer (BP)
   to access the incoming stack arguments in the function. The BP is needed
   only when the function requires the runtime stack alignment.

3. Allocating SGPR arguments on the stack is not supported.
4. No CFI is currently generated. See
   :ref:`amdgpu-dwarf-call-frame-information`.

   .. note::

     CFI will be generated that defines the CFA as the unswizzled address
     relative to the wave scratch base in the unswizzled private address space
     of the lowest address stack allocated local variable.

     ``DW_AT_frame_base`` will be defined as the swizzled address in the
     swizzled private address space by dividing the CFA by the wavefront size
     (since the CFA is always at least dword aligned, which matches the
     scratch swizzle element size).

     If no dynamic stack alignment was performed, the stack allocated
     arguments are accessed as negative offsets relative to
     ``DW_AT_frame_base``, and the local variables and register spill slots
     are accessed as positive offsets relative to ``DW_AT_frame_base``.

5. Function argument passing is implemented by copying the input physical
   registers to virtual registers on entry. The register allocator can spill
   if necessary. These are copied back to physical registers at call sites.
   The net effect is that each function call can have these values in entirely
   distinct locations. The IPRA can help avoid shuffling argument registers.
6. Call sites are implemented by setting up the arguments at positive offsets
   from SP. Then SP is incremented to account for the known frame size before
   the call and decremented after the call.

   .. note::

     The CFI will reflect the changed calculation needed to compute the CFA
     from SP.

7. 4 byte spill slots are used in the stack frame. One slot is allocated for
   an emergency spill slot. Buffer instructions are used for stack accesses
   and not the ``flat_scratch`` instruction.

   .. TODO::

      Explain when the emergency spill slot is used.

.. TODO::

   Possible broken issues:

   - Stack arguments must be aligned to required alignment.
   - Stack is aligned to max(16, max formal argument alignment).
   - Direct argument < 64 bits should check register budget.
   - Register budget calculation should respect ``inreg`` for SGPR.
   - SGPR overflow is not handled.
   - struct with 1 member unpeeling is not checking size of member.
   - ``sret`` is after ``this`` pointer.
   - Caller is not implementing stack realignment: need an extra pointer.
   - Should say AMDGPU passes FP rather than SP.
   - Should CFI define CFA as address of locals or arguments? The difference
     becomes apparent once dynamic alignment is implemented.
   - If the ``SCRATCH`` instruction could allow negative offsets, then the FP
     could be made the highest address of the stack frame, with negative
     offsets used for locals. This would allow SP to be the same as FP and
     could support signal-handler-like usage, as there would then be a real
     SP for the top of the stack.
   - How is ``sret`` passed on the stack? In the argument stack area? Can it
     overlay arguments?

AMDPAL
------

This section provides code conventions used when the target triple OS is
``amdpal`` (see :ref:`amdgpu-target-triples`) for passing runtime parameters
from the application/runtime to each invocation of a hardware shader. These
parameters include both generic, application-controlled parameters called
*user data* as well as system-generated parameters that are a product of the
draw or dispatch execution.

User Data
~~~~~~~~~

Each hardware stage has a set of 32-bit *user data registers* which can be
written from a command buffer and then loaded into SGPRs when waves are
launched via a subsequent dispatch or draw operation. This is the way most
arguments are passed from the application/runtime to a hardware shader.
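As a hedged illustration of how a shader might consume these registers (the
specific register numbers, the use of ``s2``/``s3`` as scratch, and the
descriptor load are assumptions for this sketch, not something mandated by
PAL), a wave that receives a 32-bit table pointer in its first user data SGPR
could rebuild a 64-bit address and fetch a descriptor through it:

.. code-block:: nasm

  ; Hypothetical sketch: assumes the 32-bit table pointer was loaded into s0
  ; and that its high 32 bits match the program counter's high 32 bits.
  ; (s2/s3 are used as scratch here and would clobber user data registers 2-3.)
  s_getpc_b64 s[2:3]                  ; s3 = high 32 bits of the PC
  s_mov_b32 s2, s0                    ; replace low 32 bits with the pointer
  s_load_dwordx4 s[4:7], s[2:3], 0x0  ; fetch the first resource descriptor
  s_waitcnt lgkmcnt(0)                ; wait for the scalar load to complete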
Compute User Data
~~~~~~~~~~~~~~~~~

Compute shader user data mappings are simpler than those of graphics shaders,
and have a fixed mapping.

Note that there are always 10 available *user data entries* in registers -
entries beyond that limit must be fetched from memory (via the spill table
pointer) by the shader.

  .. table:: PAL Compute Shader User Data Registers
     :name: pal-compute-user-data-registers

     ============= ================================
     User Register Description
     ============= ================================
     0             Global Internal Table (32-bit pointer)
     1             Per-Shader Internal Table (32-bit pointer)
     2 - 11        Application-Controlled User Data (10 32-bit values)
     12            Spill Table (32-bit pointer)
     13 - 14       Thread Group Count (64-bit pointer)
     15            GDS Range
     ============= ================================

Graphics User Data
~~~~~~~~~~~~~~~~~~

Graphics pipelines support a much more flexible user data mapping:

  .. table:: PAL Graphics Shader User Data Registers
     :name: pal-graphics-user-data-registers

     ============= ================================
     User Register Description
     ============= ================================
     0             Global Internal Table (32-bit pointer)
     +             Per-Shader Internal Table (32-bit pointer)
     + 1-15        Application Controlled User Data
                   (1-15 Contiguous 32-bit Values in Registers)
     +             Spill Table (32-bit pointer)
     +             Draw Index (First Stage Only)
     +             Vertex Offset (First Stage Only)
     +             Instance Offset (First Stage Only)
     ============= ================================

  The placement of the global internal table remains fixed in the first *user
  data SGPR register*.
  Otherwise all parameters are optional, and can be mapped to any desired
  *user data SGPR register*, with the following restrictions:

  * Draw Index, Vertex Offset, and Instance Offset can only be used by the
    first active hardware stage in a graphics pipeline (i.e. where the API
    vertex shader runs).

  * Application-controlled user data must be mapped into a contiguous range of
    user data registers.

  * The application-controlled user data range supports compaction remapping,
    so only *entries* that are actually consumed by the shader must be
    assigned to corresponding *registers*. Note that in order to support an
    efficient runtime implementation, the remapping must pack *registers* in
    the same order as *entries*, with unused *entries* removed.

.. _pal_global_internal_table:

Global Internal Table
~~~~~~~~~~~~~~~~~~~~~

The global internal table is a table of *shader resource descriptors* (SRDs)
that define how certain engine-wide, runtime-managed resources should be
accessed from a shader. The majority of these resources have HW-defined
formats, and it is up to the compiler to write/read data as required by the
target hardware.

The following table illustrates the required format:

  .. table:: PAL Global Internal Table
     :name: pal-git-table

     ============= ================================
     Offset        Description
     ============= ================================
     0-3           Graphics Scratch SRD
     4-7           Compute Scratch SRD
     8-11          ES/GS Ring Output SRD
     12-15         ES/GS Ring Input SRD
     16-19         GS/VS Ring Output #0
     20-23         GS/VS Ring Output #1
     24-27         GS/VS Ring Output #2
     28-31         GS/VS Ring Output #3
     32-35         GS/VS Ring Input SRD
     36-39         Tessellation Factor Buffer SRD
     40-43         Off-Chip LDS Buffer SRD
     44-47         Off-Chip Param Cache Buffer SRD
     48-51         Sample Position Buffer SRD
     52            vaRange::ShadowDescriptorTable High Bits
     ============= ================================

  The pointer to the global internal table passed to the shader as user data
  is a 32-bit pointer. The top 32 bits should be assumed to be the same as
  the top 32 bits of the pipeline, so the shader may use the program
  counter's top 32 bits.

Unspecified OS
--------------

This section provides code conventions used when the target triple OS is
empty (see :ref:`amdgpu-target-triples`).

Trap Handler ABI
~~~~~~~~~~~~~~~~

For code objects generated by the AMDGPU backend for a non-amdhsa OS, the
runtime does not install a trap handler. The ``llvm.trap`` and
``llvm.debugtrap`` instructions are handled as follows:

  .. table:: AMDGPU Trap Handler for Non-AMDHSA OS
     :name: amdgpu-trap-handler-for-non-amdhsa-os-table

     =============== =============== ===========================================
     Usage           Code Sequence   Description
     =============== =============== ===========================================
     llvm.trap       s_endpgm        Causes wavefront to be terminated.
     llvm.debugtrap  *none*          Compiler warning given that there is no
                                     trap handler installed.
     =============== =============== ===========================================

Source Languages
================

.. _amdgpu-opencl:

OpenCL
------

When the language is OpenCL, the following differences occur:

1. The OpenCL memory model is used (see :ref:`amdgpu-amdhsa-memory-model`).
2. The AMDGPU backend appends additional arguments to the kernel's explicit
   arguments for the AMDHSA OS (see
   :ref:`opencl-kernel-implicit-arguments-appended-for-amdhsa-os-table`).
3. Additional metadata is generated
   (see :ref:`amdgpu-amdhsa-code-object-metadata`).

  .. table:: OpenCL kernel implicit arguments appended for AMDHSA OS
     :name: opencl-kernel-implicit-arguments-appended-for-amdhsa-os-table

     ======== ==== ========= ===========================================
     Position Byte Byte      Description
              Size Alignment
     ======== ==== ========= ===========================================
     1        8    8         OpenCL Global Offset X
     2        8    8         OpenCL Global Offset Y
     3        8    8         OpenCL Global Offset Z
     4        8    8         OpenCL address of printf buffer
     5        8    8         OpenCL address of virtual queue used by
                             enqueue_kernel.
     6        8    8         OpenCL address of AqlWrap struct used by
                             enqueue_kernel.
     7        8    8         Pointer argument used for Multi-grid
                             synchronization.
     ======== ==== ========= ===========================================

.. _amdgpu-hcc:

HCC
---

When the language is HCC, the following differences occur:

1. The HSA memory model is used (see :ref:`amdgpu-amdhsa-memory-model`).

.. _amdgpu-assembler:

Assembler
---------

The AMDGPU backend has an LLVM-MC based assembler which is currently in
development. It supports AMDGCN GFX6-GFX10.

This section describes the general syntax for instructions and operands.
Instructions
~~~~~~~~~~~~

An instruction has the following :doc:`syntax<AMDGPUInstructionSyntax>`:

  | ``<``\ *opcode*\ ``> <``\ *operand0*\ ``>, <``\ *operand1*\ ``>,...
    <``\ *modifier0*\ ``> <``\ *modifier1*\ ``>...``

:doc:`Operands<AMDGPUOperandSyntax>` are comma-separated, while
:doc:`modifiers<AMDGPUModifierSyntax>` are space-separated.

The order of operands and modifiers is fixed.
Most modifiers are optional and may be omitted.

Links to detailed instruction syntax descriptions may be found in the
following table. Note that features under development are not included in
this description.

  =================================== =======================================
  Core ISA                            ISA Extensions
  =================================== =======================================
  :doc:`GFX7<AMDGPU/AMDGPUAsmGFX7>`   \-
  :doc:`GFX8<AMDGPU/AMDGPUAsmGFX8>`   \-
  :doc:`GFX9<AMDGPU/AMDGPUAsmGFX9>`   :doc:`gfx900<AMDGPU/AMDGPUAsmGFX900>`

                                      :doc:`gfx902<AMDGPU/AMDGPUAsmGFX900>`

                                      :doc:`gfx904<AMDGPU/AMDGPUAsmGFX904>`

                                      :doc:`gfx906<AMDGPU/AMDGPUAsmGFX906>`

                                      :doc:`gfx908<AMDGPU/AMDGPUAsmGFX908>`

                                      :doc:`gfx909<AMDGPU/AMDGPUAsmGFX900>`

  :doc:`GFX10<AMDGPU/AMDGPUAsmGFX10>` :doc:`gfx1011<AMDGPU/AMDGPUAsmGFX1011>`

                                      :doc:`gfx1012<AMDGPU/AMDGPUAsmGFX1011>`
  =================================== =======================================

For more information about instructions, their semantics and supported
combinations of operands, refer to one of the instruction set architecture
manuals [AMD-GCN-GFX6]_, [AMD-GCN-GFX7]_, [AMD-GCN-GFX8]_, [AMD-GCN-GFX9]_ and
[AMD-GCN-GFX10]_.

Operands
~~~~~~~~

A detailed description of operands may be found
:doc:`here<AMDGPUOperandSyntax>`.

Modifiers
~~~~~~~~~

A detailed description of modifiers may be found
:doc:`here<AMDGPUModifierSyntax>`.
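As an illustration of the syntax described above, the following instruction
(one of the DS examples below, here with an additional ``gds`` modifier added
purely for the sake of the example) combines comma-separated operands with
space-separated modifiers:

.. code-block:: nasm

  ; opcode     operands  modifiers
  ds_add_u32   v2, v4    offset:16 gds

``v2`` and ``v4`` are operands; ``offset:16`` and ``gds`` are modifiers and
must follow the operands.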
Instruction Examples
~~~~~~~~~~~~~~~~~~~~

DS
++

.. code-block:: nasm

  ds_add_u32 v2, v4 offset:16
  ds_write_src2_b64 v2 offset0:4 offset1:8
  ds_cmpst_f32 v2, v4, v6
  ds_min_rtn_f64 v[8:9], v2, v[4:5]

For a full list of supported instructions, refer to "LDS/GDS instructions" in
the ISA Manual.

FLAT
++++

.. code-block:: nasm

  flat_load_dword v1, v[3:4]
  flat_store_dwordx3 v[3:4], v[5:7]
  flat_atomic_swap v1, v[3:4], v5 glc
  flat_atomic_cmpswap v1, v[3:4], v[5:6] glc slc
  flat_atomic_fmax_x2 v[1:2], v[3:4], v[5:6] glc

For a full list of supported instructions, refer to "FLAT instructions" in
the ISA Manual.

MUBUF
+++++

.. code-block:: nasm

  buffer_load_dword v1, off, s[4:7], s1
  buffer_store_dwordx4 v[1:4], v2, ttmp[4:7], s1 offen offset:4 glc tfe
  buffer_store_format_xy v[1:2], off, s[4:7], s1
  buffer_wbinvl1
  buffer_atomic_inc v1, v2, s[8:11], s4 idxen offset:4 slc

For a full list of supported instructions, refer to "MUBUF Instructions" in
the ISA Manual.

SMRD/SMEM
+++++++++

.. code-block:: nasm

  s_load_dword s1, s[2:3], 0xfc
  s_load_dwordx8 s[8:15], s[2:3], s4
  s_load_dwordx16 s[88:103], s[2:3], s4
  s_dcache_inv_vol
  s_memtime s[4:5]

For a full list of supported instructions, refer to "Scalar Memory Operations"
in the ISA Manual.

SOP1
++++

.. code-block:: nasm

  s_mov_b32 s1, s2
  s_mov_b64 s[0:1], 0x80000000
  s_cmov_b32 s1, 200
  s_wqm_b64 s[2:3], s[4:5]
  s_bcnt0_i32_b64 s1, s[2:3]
  s_swappc_b64 s[2:3], s[4:5]
  s_cbranch_join s[4:5]

For a full list of supported instructions, refer to "SOP1 Instructions" in
the ISA Manual.

SOP2
++++

..
code-block:: nasm

  s_add_u32 s1, s2, s3
  s_and_b64 s[2:3], s[4:5], s[6:7]
  s_cselect_b32 s1, s2, s3
  s_andn2_b32 s2, s4, s6
  s_lshr_b64 s[2:3], s[4:5], s6
  s_ashr_i32 s2, s4, s6
  s_bfm_b64 s[2:3], s4, s6
  s_bfe_i64 s[2:3], s[4:5], s6
  s_cbranch_g_fork s[4:5], s[6:7]

For a full list of supported instructions, refer to "SOP2 Instructions" in
the ISA Manual.

SOPC
++++

.. code-block:: nasm

  s_cmp_eq_i32 s1, s2
  s_bitcmp1_b32 s1, s2
  s_bitcmp0_b64 s[2:3], s4
  s_setvskip s3, s5

For a full list of supported instructions, refer to "SOPC Instructions" in
the ISA Manual.

SOPP
++++

.. code-block:: nasm

  s_barrier
  s_nop 2
  s_endpgm
  s_waitcnt 0 ; Wait for all counters to be 0
  s_waitcnt vmcnt(0) & expcnt(0) & lgkmcnt(0) ; Equivalent to above
  s_waitcnt vmcnt(1) ; Wait for vmcnt counter to be 1.
  s_sethalt 9
  s_sleep 10
  s_sendmsg 0x1
  s_sendmsg sendmsg(MSG_INTERRUPT)
  s_trap 1

For a full list of supported instructions, refer to "SOPP Instructions" in
the ISA Manual.

Unless otherwise mentioned, little verification is performed on the operands
of SOPP instructions, so it is up to the programmer to be familiar with the
range of acceptable values.

VALU
++++

For vector ALU instruction opcodes (VOP1, VOP2, VOP3, VOPC, VOP_DPP, VOP_SDWA),
the assembler will automatically use the optimal encoding based on the
operands. To force a specific encoding, add one of the following suffixes to
the opcode of the instruction:

* ``_e32`` for 32-bit VOP1/VOP2/VOPC
* ``_e64`` for 64-bit VOP3
* ``_dpp`` for VOP_DPP
* ``_sdwa`` for VOP_SDWA

VOP1/VOP2/VOP3/VOPC examples:

..
code-block:: nasm

  v_mov_b32 v1, v2
  v_mov_b32_e32 v1, v2
  v_nop
  v_cvt_f64_i32_e32 v[1:2], v2
  v_floor_f32_e32 v1, v2
  v_bfrev_b32_e32 v1, v2
  v_add_f32_e32 v1, v2, v3
  v_mul_i32_i24_e64 v1, v2, 3
  v_mul_i32_i24_e32 v1, -3, v3
  v_mul_i32_i24_e32 v1, -100, v3
  v_addc_u32 v1, s[0:1], v2, v3, s[2:3]
  v_max_f16_e32 v1, v2, v3

VOP_DPP examples:

.. code-block:: nasm

  v_mov_b32 v0, v0 quad_perm:[0,2,1,1]
  v_sin_f32 v0, v0 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
  v_mov_b32 v0, v0 wave_shl:1
  v_mov_b32 v0, v0 row_mirror
  v_mov_b32 v0, v0 row_bcast:31
  v_mov_b32 v0, v0 quad_perm:[1,3,0,1] row_mask:0xa bank_mask:0x1 bound_ctrl:0
  v_add_f32 v0, v0, |v0| row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
  v_max_f16 v1, v2, v3 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0

VOP_SDWA examples:

.. code-block:: nasm

  v_mov_b32 v1, v2 dst_sel:BYTE_0 dst_unused:UNUSED_PRESERVE src0_sel:DWORD
  v_min_u32 v200, v200, v1 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_1 src1_sel:DWORD
  v_sin_f32 v0, v0 dst_unused:UNUSED_PAD src0_sel:WORD_1
  v_fract_f32 v0, |v0| dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1
  v_cmpx_le_u32 vcc, v1, v2 src0_sel:BYTE_2 src1_sel:WORD_0

For a full list of supported instructions, refer to "Vector ALU instructions"
in the ISA Manual.

.. TODO::

   Remove once we switch to code object v3 by default.

.. _amdgpu-amdhsa-assembler-predefined-symbols-v2:

Code Object V2 Predefined Symbols (-mattr=-code-object-v3)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. warning:: Code Object V2 is not the default code object version emitted by
   this version of LLVM. For a description of the predefined symbols available
   with the default configuration (Code Object V3) see
   :ref:`amdgpu-amdhsa-assembler-predefined-symbols-v3`.
The AMDGPU assembler defines and updates some symbols automatically. These
symbols do not affect code generation.

.option.machine_version_major
+++++++++++++++++++++++++++++

Set to the GFX major generation number of the target being assembled for. For
example, when assembling for a "GFX9" target this will be set to the integer
value "9". The possible GFX major generation numbers are presented in
:ref:`amdgpu-processors`.

.option.machine_version_minor
+++++++++++++++++++++++++++++

Set to the GFX minor generation number of the target being assembled for. For
example, when assembling for a "GFX810" target this will be set to the integer
value "1". The possible GFX minor generation numbers are presented in
:ref:`amdgpu-processors`.

.option.machine_version_stepping
++++++++++++++++++++++++++++++++

Set to the GFX stepping generation number of the target being assembled for.
For example, when assembling for a "GFX704" target this will be set to the
integer value "4". The possible GFX stepping generation numbers are presented
in :ref:`amdgpu-processors`.

.kernel.vgpr_count
++++++++++++++++++

Set to zero each time a
:ref:`amdgpu-amdhsa-assembler-directive-amdgpu_hsa_kernel` directive is
encountered. At each instruction, if the current value of this symbol is less
than or equal to the maximum VGPR number explicitly referenced within that
instruction, then the symbol value is updated to equal that VGPR number plus
one.

.kernel.sgpr_count
++++++++++++++++++

Set to zero each time a
:ref:`amdgpu-amdhsa-assembler-directive-amdgpu_hsa_kernel` directive is
encountered. At each instruction, if the current value of this symbol is less
than or equal to the maximum SGPR number explicitly referenced within that
instruction, then the symbol value is updated to equal that SGPR number plus
one.
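The update rule for these counters can be sketched as follows. ``update_gpr_count`` is a hypothetical model of the VGPR case, not the assembler's actual implementation: the counter records one past the highest register index referenced so far.

```python
import re

def update_gpr_count(current, instruction):
    """Model of the .kernel.vgpr_count update rule described above.

    Scans an instruction for VGPR references such as 'v7' or 'v[8:9]' and
    returns max(current, highest referenced VGPR number + 1). This is an
    illustrative sketch, not the LLVM-MC implementation.
    """
    highest = -1
    for match in re.finditer(r"\bv(?:(\d+)|\[(\d+):(\d+)\])", instruction):
        nums = [int(g) for g in match.groups() if g is not None]
        highest = max(highest, *nums)
    if current <= highest:
        current = highest + 1
    return current

# Track usage across a small kernel body.
count = 0
for inst in ["v_mov_b32 v1, s0",
             "v_mov_b32 v2, s1",
             "flat_store_dword v[1:2], v0"]:
    count = update_gpr_count(count, inst)
# count ends at 3, since v0..v2 were referenced.
```

The SGPR counter follows the same rule with ``s`` registers in place of ``v`` registers.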
.. _amdgpu-amdhsa-assembler-directives-v2:

Code Object V2 Directives (-mattr=-code-object-v3)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. warning:: Code Object V2 is not the default code object version emitted by
   this version of LLVM. For a description of the directives supported with
   the default configuration (Code Object V3) see
   :ref:`amdgpu-amdhsa-assembler-directives-v3`.

The AMDGPU ABI defines auxiliary data in the output code object. In assembly
source, one can specify them with assembler directives.

.hsa_code_object_version major, minor
+++++++++++++++++++++++++++++++++++++

*major* and *minor* are integers that specify the version of the HSA code
object that will be generated by the assembler.

.hsa_code_object_isa [major, minor, stepping, vendor, arch]
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

*major*, *minor*, and *stepping* are all integers that describe the instruction
set architecture (ISA) version of the assembly program.

*vendor* and *arch* are quoted strings. *vendor* should always be equal to
"AMD" and *arch* should always be equal to "AMDGPU".

By default, the assembler will derive the ISA version, *vendor*, and *arch*
from the value of the -mcpu option that is passed to the assembler.

.. _amdgpu-amdhsa-assembler-directive-amdgpu_hsa_kernel:

.amdgpu_hsa_kernel (name)
+++++++++++++++++++++++++

This directive specifies that the symbol with the given name is a kernel entry
point (label) and that the object should contain a corresponding symbol of
type STT_AMDGPU_HSA_KERNEL.

.amd_kernel_code_t
++++++++++++++++++

This directive marks the beginning of a list of key / value pairs that are
used to specify the amd_kernel_code_t object that will be emitted by the
assembler. The list must be terminated by the *.end_amd_kernel_code_t*
directive. For any amd_kernel_code_t values that are unspecified, a default
value will be used. The default value for all keys is 0, with the following
exceptions:

- *amd_code_version_major* defaults to 1.
- *amd_kernel_code_version_minor* defaults to 2.
- *amd_machine_kind* defaults to 1.
- *amd_machine_version_major*, *amd_machine_version_minor*, and
  *amd_machine_version_stepping* are derived from the value of the -mcpu
  option that is passed to the assembler.
- *kernel_code_entry_byte_offset* defaults to 256.
- *wavefront_size* defaults to 6 for all targets before GFX10. For GFX10
  onwards it defaults to 6 if target feature ``wavefrontsize64`` is enabled,
  otherwise 5. Note that the wavefront size is specified as a power of two,
  so a value of **n** means a size of 2^ **n**.
- *call_convention* defaults to -1.
- *kernarg_segment_alignment*, *group_segment_alignment*, and
  *private_segment_alignment* default to 4. Note that alignments are specified
  as a power of 2, so a value of **n** means an alignment of 2^ **n**.
- *enable_wgp_mode* defaults to 1 if target feature ``cumode`` is disabled for
  GFX10 onwards.
- *enable_mem_ordered* defaults to 1 for GFX10 onwards.

The *.amd_kernel_code_t* directive must be placed immediately after the
function label and before any instructions.

For a full list of amd_kernel_code_t keys, refer to the AMDGPU ABI document,
comments in lib/Target/AMDGPU/AmdKernelCodeT.h and test/CodeGen/AMDGPU/hsa.s.

.. _amdgpu-amdhsa-assembler-example-v2:

Code Object V2 Example Source Code (-mattr=-code-object-v3)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. warning:: Code Object V2 is not the default code object version emitted by
   this version of LLVM. For a description of the directives supported with
   the default configuration (Code Object V3) see
   :ref:`amdgpu-amdhsa-assembler-example-v3`.
7417 7418Here is an example of a minimal assembly source file, defining one HSA kernel: 7419 7420.. code:: 7421 :number-lines: 7422 7423 .hsa_code_object_version 1,0 7424 .hsa_code_object_isa 7425 7426 .hsatext 7427 .globl hello_world 7428 .p2align 8 7429 .amdgpu_hsa_kernel hello_world 7430 7431 hello_world: 7432 7433 .amd_kernel_code_t 7434 enable_sgpr_kernarg_segment_ptr = 1 7435 is_ptr64 = 1 7436 compute_pgm_rsrc1_vgprs = 0 7437 compute_pgm_rsrc1_sgprs = 0 7438 compute_pgm_rsrc2_user_sgpr = 2 7439 compute_pgm_rsrc1_wgp_mode = 0 7440 compute_pgm_rsrc1_mem_ordered = 0 7441 compute_pgm_rsrc1_fwd_progress = 1 7442 .end_amd_kernel_code_t 7443 7444 s_load_dwordx2 s[0:1], s[0:1] 0x0 7445 v_mov_b32 v0, 3.14159 7446 s_waitcnt lgkmcnt(0) 7447 v_mov_b32 v1, s0 7448 v_mov_b32 v2, s1 7449 flat_store_dword v[1:2], v0 7450 s_endpgm 7451 .Lfunc_end0: 7452 .size hello_world, .Lfunc_end0-hello_world 7453 7454.. _amdgpu-amdhsa-assembler-predefined-symbols-v3: 7455 7456Code Object V3 Predefined Symbols (-mattr=+code-object-v3) 7457~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 7458 7459The AMDGPU assembler defines and updates some symbols automatically. These 7460symbols do not affect code generation. 7461 7462.amdgcn.gfx_generation_number 7463+++++++++++++++++++++++++++++ 7464 7465Set to the GFX major generation number of the target being assembled for. For 7466example, when assembling for a "GFX9" target this will be set to the integer 7467value "9". The possible GFX major generation numbers are presented in 7468:ref:`amdgpu-processors`. 7469 7470.amdgcn.gfx_generation_minor 7471++++++++++++++++++++++++++++ 7472 7473Set to the GFX minor generation number of the target being assembled for. For 7474example, when assembling for a "GFX810" target this will be set to the integer 7475value "1". The possible GFX minor generation numbers are presented in 7476:ref:`amdgpu-processors`. 
.amdgcn.gfx_generation_stepping
+++++++++++++++++++++++++++++++

Set to the GFX stepping generation number of the target being assembled for.
For example, when assembling for a "GFX704" target this will be set to the
integer value "4". The possible GFX stepping generation numbers are presented
in :ref:`amdgpu-processors`.

.. _amdgpu-amdhsa-assembler-symbol-next_free_vgpr:

.amdgcn.next_free_vgpr
++++++++++++++++++++++

Set to zero before assembly begins. At each instruction, if the current value
of this symbol is less than or equal to the maximum VGPR number explicitly
referenced within that instruction, then the symbol value is updated to equal
that VGPR number plus one.

May be used to set the `.amdhsa_next_free_vgpr` directive in
:ref:`amdhsa-kernel-directives-table`.

May be set at any time, e.g. manually set to zero at the start of each kernel.

.. _amdgpu-amdhsa-assembler-symbol-next_free_sgpr:

.amdgcn.next_free_sgpr
++++++++++++++++++++++

Set to zero before assembly begins. At each instruction, if the current value
of this symbol is less than or equal to the maximum SGPR number explicitly
referenced within that instruction, then the symbol value is updated to equal
that SGPR number plus one.

May be used to set the `.amdhsa_next_free_sgpr` directive in
:ref:`amdhsa-kernel-directives-table`.

May be set at any time, e.g. manually set to zero at the start of each kernel.

.. _amdgpu-amdhsa-assembler-directives-v3:

Code Object V3 Directives (-mattr=+code-object-v3)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Directives which begin with ``.amdgcn`` are valid for all ``amdgcn``
architecture processors, and are not OS-specific. Directives which begin with
``.amdhsa`` are specific to ``amdgcn`` architecture processors when the
``amdhsa`` OS is specified.
See :ref:`amdgpu-target-triples` and 7525:ref:`amdgpu-processors`. 7526 7527.amdgcn_target <target> 7528+++++++++++++++++++++++ 7529 7530Optional directive which declares the target supported by the containing 7531assembler source file. Valid values are described in 7532:ref:`amdgpu-amdhsa-code-object-target-identification`. Used by the assembler 7533to validate command-line options such as ``-triple``, ``-mcpu``, and those 7534which specify target features. 7535 7536.amdhsa_kernel <name> 7537+++++++++++++++++++++ 7538 7539Creates a correctly aligned AMDHSA kernel descriptor and a symbol, 7540``<name>.kd``, in the current location of the current section. Only valid when 7541the OS is ``amdhsa``. ``<name>`` must be a symbol that labels the first 7542instruction to execute, and does not need to be previously defined. 7543 7544Marks the beginning of a list of directives used to generate the bytes of a 7545kernel descriptor, as described in :ref:`amdgpu-amdhsa-kernel-descriptor`. 7546Directives which may appear in this list are described in 7547:ref:`amdhsa-kernel-directives-table`. Directives may appear in any order, must 7548be valid for the target being assembled for, and cannot be repeated. Directives 7549support the range of values specified by the field they reference in 7550:ref:`amdgpu-amdhsa-kernel-descriptor`. If a directive is not specified, it is 7551assumed to have its default value, unless it is marked as "Required", in which 7552case it is an error to omit the directive. This list of directives is 7553terminated by an ``.end_amdhsa_kernel`` directive. 7554 7555 .. 
table:: AMDHSA Kernel Assembler Directives 7556 :name: amdhsa-kernel-directives-table 7557 7558 ======================================================== =================== ============ =================== 7559 Directive Default Supported On Description 7560 ======================================================== =================== ============ =================== 7561 ``.amdhsa_group_segment_fixed_size`` 0 GFX6-GFX10 Controls GROUP_SEGMENT_FIXED_SIZE in 7562 :ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx10-table`. 7563 ``.amdhsa_private_segment_fixed_size`` 0 GFX6-GFX10 Controls PRIVATE_SEGMENT_FIXED_SIZE in 7564 :ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx10-table`. 7565 ``.amdhsa_user_sgpr_private_segment_buffer`` 0 GFX6-GFX10 Controls ENABLE_SGPR_PRIVATE_SEGMENT_BUFFER in 7566 :ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx10-table`. 7567 ``.amdhsa_user_sgpr_dispatch_ptr`` 0 GFX6-GFX10 Controls ENABLE_SGPR_DISPATCH_PTR in 7568 :ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx10-table`. 7569 ``.amdhsa_user_sgpr_queue_ptr`` 0 GFX6-GFX10 Controls ENABLE_SGPR_QUEUE_PTR in 7570 :ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx10-table`. 7571 ``.amdhsa_user_sgpr_kernarg_segment_ptr`` 0 GFX6-GFX10 Controls ENABLE_SGPR_KERNARG_SEGMENT_PTR in 7572 :ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx10-table`. 7573 ``.amdhsa_user_sgpr_dispatch_id`` 0 GFX6-GFX10 Controls ENABLE_SGPR_DISPATCH_ID in 7574 :ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx10-table`. 7575 ``.amdhsa_user_sgpr_flat_scratch_init`` 0 GFX6-GFX10 Controls ENABLE_SGPR_FLAT_SCRATCH_INIT in 7576 :ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx10-table`. 7577 ``.amdhsa_user_sgpr_private_segment_size`` 0 GFX6-GFX10 Controls ENABLE_SGPR_PRIVATE_SEGMENT_SIZE in 7578 :ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx10-table`. 7579 ``.amdhsa_wavefront_size32`` Target GFX10 Controls ENABLE_WAVEFRONT_SIZE32 in 7580 Feature :ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx10-table`. 
7581 Specific 7582 (-wavefrontsize64) 7583 ``.amdhsa_system_sgpr_private_segment_wavefront_offset`` 0 GFX6-GFX10 Controls ENABLE_SGPR_PRIVATE_SEGMENT_WAVEFRONT_OFFSET in 7584 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`. 7585 ``.amdhsa_system_sgpr_workgroup_id_x`` 1 GFX6-GFX10 Controls ENABLE_SGPR_WORKGROUP_ID_X in 7586 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`. 7587 ``.amdhsa_system_sgpr_workgroup_id_y`` 0 GFX6-GFX10 Controls ENABLE_SGPR_WORKGROUP_ID_Y in 7588 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`. 7589 ``.amdhsa_system_sgpr_workgroup_id_z`` 0 GFX6-GFX10 Controls ENABLE_SGPR_WORKGROUP_ID_Z in 7590 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`. 7591 ``.amdhsa_system_sgpr_workgroup_info`` 0 GFX6-GFX10 Controls ENABLE_SGPR_WORKGROUP_INFO in 7592 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`. 7593 ``.amdhsa_system_vgpr_workitem_id`` 0 GFX6-GFX10 Controls ENABLE_VGPR_WORKITEM_ID in 7594 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`. 7595 Possible values are defined in 7596 :ref:`amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table`. 7597 ``.amdhsa_next_free_vgpr`` Required GFX6-GFX10 Maximum VGPR number explicitly referenced, plus one. 7598 Used to calculate GRANULATED_WORKITEM_VGPR_COUNT in 7599 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`. 7600 ``.amdhsa_next_free_sgpr`` Required GFX6-GFX10 Maximum SGPR number explicitly referenced, plus one. 7601 Used to calculate GRANULATED_WAVEFRONT_SGPR_COUNT in 7602 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`. 7603 ``.amdhsa_reserve_vcc`` 1 GFX6-GFX10 Whether the kernel may use the special VCC SGPR. 7604 Used to calculate GRANULATED_WAVEFRONT_SGPR_COUNT in 7605 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`. 7606 ``.amdhsa_reserve_flat_scratch`` 1 GFX7-GFX10 Whether the kernel may use flat instructions to access 7607 scratch memory. 
Used to calculate 7608 GRANULATED_WAVEFRONT_SGPR_COUNT in 7609 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`. 7610 ``.amdhsa_reserve_xnack_mask`` Target GFX8-GFX10 Whether the kernel may trigger XNACK replay. 7611 Feature Used to calculate GRANULATED_WAVEFRONT_SGPR_COUNT in 7612 Specific :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`. 7613 (+xnack) 7614 ``.amdhsa_float_round_mode_32`` 0 GFX6-GFX10 Controls FLOAT_ROUND_MODE_32 in 7615 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`. 7616 Possible values are defined in 7617 :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`. 7618 ``.amdhsa_float_round_mode_16_64`` 0 GFX6-GFX10 Controls FLOAT_ROUND_MODE_16_64 in 7619 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`. 7620 Possible values are defined in 7621 :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`. 7622 ``.amdhsa_float_denorm_mode_32`` 0 GFX6-GFX10 Controls FLOAT_DENORM_MODE_32 in 7623 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`. 7624 Possible values are defined in 7625 :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`. 7626 ``.amdhsa_float_denorm_mode_16_64`` 3 GFX6-GFX10 Controls FLOAT_DENORM_MODE_16_64 in 7627 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`. 7628 Possible values are defined in 7629 :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`. 7630 ``.amdhsa_dx10_clamp`` 1 GFX6-GFX10 Controls ENABLE_DX10_CLAMP in 7631 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`. 7632 ``.amdhsa_ieee_mode`` 1 GFX6-GFX10 Controls ENABLE_IEEE_MODE in 7633 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`. 7634 ``.amdhsa_fp16_overflow`` 0 GFX9-GFX10 Controls FP16_OVFL in 7635 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`. 7636 ``.amdhsa_workgroup_processor_mode`` Target GFX10 Controls ENABLE_WGP_MODE in 7637 Feature :ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx10-table`. 
7638 Specific 7639 (-cumode) 7640 ``.amdhsa_memory_ordered`` 1 GFX10 Controls MEM_ORDERED in 7641 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`. 7642 ``.amdhsa_forward_progress`` 0 GFX10 Controls FWD_PROGRESS in 7643 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`. 7644 ``.amdhsa_exception_fp_ieee_invalid_op`` 0 GFX6-GFX10 Controls ENABLE_EXCEPTION_IEEE_754_FP_INVALID_OPERATION in 7645 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`. 7646 ``.amdhsa_exception_fp_denorm_src`` 0 GFX6-GFX10 Controls ENABLE_EXCEPTION_FP_DENORMAL_SOURCE in 7647 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`. 7648 ``.amdhsa_exception_fp_ieee_div_zero`` 0 GFX6-GFX10 Controls ENABLE_EXCEPTION_IEEE_754_FP_DIVISION_BY_ZERO in 7649 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`. 7650 ``.amdhsa_exception_fp_ieee_overflow`` 0 GFX6-GFX10 Controls ENABLE_EXCEPTION_IEEE_754_FP_OVERFLOW in 7651 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`. 7652 ``.amdhsa_exception_fp_ieee_underflow`` 0 GFX6-GFX10 Controls ENABLE_EXCEPTION_IEEE_754_FP_UNDERFLOW in 7653 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`. 7654 ``.amdhsa_exception_fp_ieee_inexact`` 0 GFX6-GFX10 Controls ENABLE_EXCEPTION_IEEE_754_FP_INEXACT in 7655 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`. 7656 ``.amdhsa_exception_int_div_zero`` 0 GFX6-GFX10 Controls ENABLE_EXCEPTION_INT_DIVIDE_BY_ZERO in 7657 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`. 7658 ======================================================== =================== ============ =================== 7659 7660.amdgpu_metadata 7661++++++++++++++++ 7662 7663Optional directive which declares the contents of the ``NT_AMDGPU_METADATA`` 7664note record (see :ref:`amdgpu-elf-note-records-table-v3`). 7665 7666The contents must be in the [YAML]_ markup format, with the same structure and 7667semantics described in :ref:`amdgpu-amdhsa-code-object-metadata-v3`. 
This directive is terminated by an ``.end_amdgpu_metadata`` directive.

.. _amdgpu-amdhsa-assembler-example-v3:

Code Object V3 Example Source Code (-mattr=+code-object-v3)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Here is an example of a minimal assembly source file, defining one HSA kernel:

.. code::
   :number-lines:

   .amdgcn_target "amdgcn-amd-amdhsa--gfx900+xnack" // optional

   .text
   .globl hello_world
   .p2align 8
   .type hello_world,@function
   hello_world:
     s_load_dwordx2 s[0:1], s[0:1] 0x0
     v_mov_b32 v0, 3.14159
     s_waitcnt lgkmcnt(0)
     v_mov_b32 v1, s0
     v_mov_b32 v2, s1
     flat_store_dword v[1:2], v0
     s_endpgm
   .Lfunc_end0:
     .size hello_world, .Lfunc_end0-hello_world

   .rodata
   .p2align 6
   .amdhsa_kernel hello_world
     .amdhsa_user_sgpr_kernarg_segment_ptr 1
     .amdhsa_next_free_vgpr .amdgcn.next_free_vgpr
     .amdhsa_next_free_sgpr .amdgcn.next_free_sgpr
   .end_amdhsa_kernel

   .amdgpu_metadata
   ---
   amdhsa.version:
     - 1
     - 0
   amdhsa.kernels:
     - .name: hello_world
       .symbol: hello_world.kd
       .kernarg_segment_size: 48
       .group_segment_fixed_size: 0
       .private_segment_fixed_size: 0
       .kernarg_segment_align: 4
       .wavefront_size: 64
       .sgpr_count: 2
       .vgpr_count: 3
       .max_flat_workgroup_size: 256
   ...
   .end_amdgpu_metadata

If an assembly source file contains multiple kernels and/or functions, the
:ref:`amdgpu-amdhsa-assembler-symbol-next_free_vgpr` and
:ref:`amdgpu-amdhsa-assembler-symbol-next_free_sgpr` symbols may be reset
using the ``.set <symbol>, <expression>`` directive. For example, in the case
of two kernels, where ``func1`` is only called from ``kern1``, it is
sufficient to group the function with the kernel that calls it and reset the
symbols between the two connected components:

..
code::
   :number-lines:

   .amdgcn_target "amdgcn-amd-amdhsa--gfx900+xnack" // optional

   // gpr tracking symbols are implicitly set to zero

   .text
   .globl kern0
   .p2align 8
   .type kern0,@function
   kern0:
     // ...
     s_endpgm
   .Lkern0_end:
     .size kern0, .Lkern0_end-kern0

   .rodata
   .p2align 6
   .amdhsa_kernel kern0
     // ...
     .amdhsa_next_free_vgpr .amdgcn.next_free_vgpr
     .amdhsa_next_free_sgpr .amdgcn.next_free_sgpr
   .end_amdhsa_kernel

   // reset symbols to begin tracking usage in func1 and kern1
   .set .amdgcn.next_free_vgpr, 0
   .set .amdgcn.next_free_sgpr, 0

   .text
   .hidden func1
   .global func1
   .p2align 2
   .type func1,@function
   func1:
     // ...
     s_setpc_b64 s[30:31]
   .Lfunc1_end:
     .size func1, .Lfunc1_end-func1

   .globl kern1
   .p2align 8
   .type kern1,@function
   kern1:
     // ...
     s_getpc_b64 s[4:5]
     s_add_u32 s4, s4, func1@rel32@lo+4
     s_addc_u32 s5, s5, func1@rel32@hi+4
     s_swappc_b64 s[30:31], s[4:5]
     // ...
     s_endpgm
   .Lkern1_end:
     .size kern1, .Lkern1_end-kern1

   .rodata
   .p2align 6
   .amdhsa_kernel kern1
     // ...
     .amdhsa_next_free_vgpr .amdgcn.next_free_vgpr
     .amdhsa_next_free_sgpr .amdgcn.next_free_sgpr
   .end_amdhsa_kernel

These symbols cannot automatically identify the connected components of
kernels and functions in order to track usage for each kernel separately.
However, in some cases careful organization of the kernels and functions in
the source file means there is minimal additional effort required to
accurately calculate GPR usage.

Additional Documentation
========================

.. [AMD-GCN-GFX6] `AMD Southern Islands Series ISA <http://developer.amd.com/wordpress/media/2012/12/AMD_Southern_Islands_Instruction_Set_Architecture.pdf>`__
..
[AMD-GCN-GFX7] `AMD Sea Islands Series ISA <http://developer.amd.com/wordpress/media/2013/07/AMD_Sea_Islands_Instruction_Set_Architecture.pdf>`__
.. [AMD-GCN-GFX8] `AMD GCN3 Instruction Set Architecture <http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2013/12/AMD_GCN3_Instruction_Set_Architecture_rev1.1.pdf>`__
.. [AMD-GCN-GFX9] `AMD "Vega" Instruction Set Architecture <http://developer.amd.com/wordpress/media/2013/12/Vega_Shader_ISA_28July2017.pdf>`__
.. [AMD-GCN-GFX10] `AMD "RDNA 1.0" Instruction Set Architecture <https://gpuopen.com/wp-content/uploads/2019/08/RDNA_Shader_ISA_5August2019.pdf>`__
.. [AMD-RADEON-HD-2000-3000] `AMD R6xx shader ISA <http://developer.amd.com/wordpress/media/2012/10/R600_Instruction_Set_Architecture.pdf>`__
.. [AMD-RADEON-HD-4000] `AMD R7xx shader ISA <http://developer.amd.com/wordpress/media/2012/10/R700-Family_Instruction_Set_Architecture.pdf>`__
.. [AMD-RADEON-HD-5000] `AMD Evergreen shader ISA <http://developer.amd.com/wordpress/media/2012/10/AMD_Evergreen-Family_Instruction_Set_Architecture.pdf>`__
.. [AMD-RADEON-HD-6000] `AMD Cayman/Trinity shader ISA <http://developer.amd.com/wordpress/media/2012/10/AMD_HD_6900_Series_Instruction_Set_Architecture.pdf>`__
.. [AMD-ROCm] `AMD ROCm Platform <https://rocm-documentation.readthedocs.io>`__
.. [AMD-ROCm-github] `ROCm github <http://github.com/RadeonOpenCompute>`__
.. [CLANG-ATTR] `Attributes in Clang <https://clang.llvm.org/docs/AttributeReference.html>`__
.. [DWARF] `DWARF Debugging Information Format <http://dwarfstd.org/>`__
.. [ELF] `Executable and Linkable Format (ELF) <http://www.sco.com/developers/gabi/>`__
.. [HRF] `Heterogeneous-race-free Memory Models <http://benedictgaster.org/wp-content/uploads/2014/01/asplos269-FINAL.pdf>`__
.. [HSA] `Heterogeneous System Architecture (HSA) Foundation <http://www.hsafoundation.com/>`__
.. [MsgPack] `Message Pack <http://www.msgpack.org/>`__
.. [OpenCL] `The OpenCL Specification Version 2.0 <http://www.khronos.org/registry/cl/specs/opencl-2.0.pdf>`__
.. [SEMVER] `Semantic Versioning <https://semver.org/>`__
.. [YAML] `YAML Ain't Markup Language (YAML™) Version 1.2 <http://www.yaml.org/spec/1.2/spec.html>`__