1============================= 2User Guide for AMDGPU Backend 3============================= 4 5.. contents:: 6 :local: 7 8Introduction 9============ 10 11The AMDGPU backend provides ISA code generation for AMD GPUs, starting with the 12R600 family up until the current GCN families. It lives in the 13``llvm/lib/Target/AMDGPU`` directory. 14 15LLVM 16==== 17 18.. _amdgpu-target-triples: 19 20Target Triples 21-------------- 22 23Use the ``clang -target <Architecture>-<Vendor>-<OS>-<Environment>`` option to 24specify the target triple: 25 26 .. table:: AMDGPU Architectures 27 :name: amdgpu-architecture-table 28 29 ============ ============================================================== 30 Architecture Description 31 ============ ============================================================== 32 ``r600`` AMD GPUs HD2XXX-HD6XXX for graphics and compute shaders. 33 ``amdgcn`` AMD GPUs GCN GFX6 onwards for graphics and compute shaders. 34 ============ ============================================================== 35 36 .. table:: AMDGPU Vendors 37 :name: amdgpu-vendor-table 38 39 ============ ============================================================== 40 Vendor Description 41 ============ ============================================================== 42 ``amd`` Can be used for all AMD GPU usage. 43 ``mesa3d`` Can be used if the OS is ``mesa3d``. 44 ============ ============================================================== 45 46 .. table:: AMDGPU Operating Systems 47 :name: amdgpu-os-table 48 49 ============== ============================================================ 50 OS Description 51 ============== ============================================================ 52 *<empty>* Defaults to the *unknown* OS. 53 ``amdhsa`` Compute kernels executed on HSA [HSA]_ compatible runtimes 54 such as AMD's ROCm [AMD-ROCm]_. 55 ``amdpal`` Graphic shaders and compute kernels executed on AMD PAL 56 runtime. 57 ``mesa3d`` Graphic shaders and compute kernels executed on Mesa 3D 58 runtime. 59 ============== ============================================================ 60 61 .. table:: AMDGPU Environments 62 :name: amdgpu-environment-table 63 64 ============ ============================================================== 65 Environment Description 66 ============ ============================================================== 67 *<empty>* Default. 68 ============ ============================================================== 69 70.. _amdgpu-processors: 71 72Processors 73---------- 74 75Use the ``clang -mcpu <Processor>`` option to specify the AMDGPU processor. The 76names from both the *Processor* and *Alternative Processor* can be used. 77 78 .. table:: AMDGPU Processors 79 :name: amdgpu-processor-table 80 81 =========== =============== ============ ===== ================= ======= ====================== 82 Processor Alternative Target dGPU/ Target ROCm Example 83 Processor Triple APU Features Support Products 84 Architecture Supported 85 [Default] 86 =========== =============== ============ ===== ================= ======= ====================== 87 **Radeon HD 2000/3000 Series (R600)** [AMD-RADEON-HD-2000-3000]_ 88 ----------------------------------------------------------------------------------------------- 89 ``r600`` ``r600`` dGPU 90 ``r630`` ``r600`` dGPU 91 ``rs880`` ``r600`` dGPU 92 ``rv670`` ``r600`` dGPU 93 **Radeon HD 4000 Series (R700)** [AMD-RADEON-HD-4000]_ 94 ----------------------------------------------------------------------------------------------- 95 ``rv710`` ``r600`` dGPU 96 ``rv730`` ``r600`` dGPU 97 ``rv770`` ``r600`` dGPU 98 **Radeon HD 5000 Series (Evergreen)** [AMD-RADEON-HD-5000]_ 99 ----------------------------------------------------------------------------------------------- 100 ``cedar`` ``r600`` dGPU 101 ``cypress`` ``r600`` dGPU 102 ``juniper`` ``r600`` dGPU 103 ``redwood`` ``r600`` dGPU 104 ``sumo`` ``r600`` dGPU 105 **Radeon HD 6000 Series (Northern Islands)** [AMD-RADEON-HD-6000]_ 106 ----------------------------------------------------------------------------------------------- 107 ``barts`` ``r600`` dGPU 108 ``caicos`` ``r600`` dGPU 109 ``cayman`` ``r600`` dGPU 110 ``turks`` ``r600`` dGPU 111 **GCN GFX6 (Southern Islands (SI))** [AMD-GCN-GFX6]_ 112 ----------------------------------------------------------------------------------------------- 113 ``gfx600`` - ``tahiti`` ``amdgcn`` dGPU 114 ``gfx601`` - ``hainan`` ``amdgcn`` dGPU 115 - ``oland`` 116 - ``pitcairn`` 117 - ``verde`` 118 **GCN GFX7 (Sea Islands (CI))** [AMD-GCN-GFX7]_ 119 ----------------------------------------------------------------------------------------------- 120 ``gfx700`` - ``kaveri`` ``amdgcn`` APU - A6-7000 121 - A6 Pro-7050B 122 - A8-7100 123 - A8 Pro-7150B 124 - A10-7300 125 - A10 Pro-7350B 126 - FX-7500 127 - A8-7200P 128 - A10-7400P 129 - FX-7600P 130 ``gfx701`` - ``hawaii`` ``amdgcn`` dGPU ROCm - FirePro W8100 131 - FirePro W9100 132 - FirePro S9150 133 - FirePro S9170 134 ``gfx702`` ``amdgcn`` dGPU ROCm - Radeon R9 290 135 - Radeon R9 290x 136 - Radeon R390 137 - Radeon R390x 138 ``gfx703`` - ``kabini`` ``amdgcn`` APU - E1-2100 139 - ``mullins`` - E1-2200 140 - E1-2500 141 - E2-3000 142 - E2-3800 143 - A4-5000 144 - A4-5100 145 - A6-5200 146 - A4 Pro-3340B 147 ``gfx704`` - ``bonaire`` ``amdgcn`` dGPU - Radeon HD 7790 148 - Radeon HD 8770 149 - R7 260 150 - R7 260X 151 **GCN GFX8 (Volcanic Islands (VI))** [AMD-GCN-GFX8]_ 152 ----------------------------------------------------------------------------------------------- 153 ``gfx801`` - ``carrizo`` ``amdgcn`` APU - xnack - A6-8500P 154 [on] - Pro A6-8500B 155 - A8-8600P 156 - Pro A8-8600B 157 - FX-8800P 158 - Pro A12-8800B 159 \ ``amdgcn`` APU - xnack ROCm - A10-8700P 160 [on] - Pro A10-8700B 161 - A10-8780P 162 \ ``amdgcn`` APU - xnack - A10-9600P 163 [on] - A10-9630P 164 - A12-9700P 165 - A12-9730P 166 - FX-9800P 167 - FX-9830P 168 \ ``amdgcn`` APU - xnack - E2-9010 169 [on] - A6-9210 170 - A9-9410 171 ``gfx802`` - ``iceland`` ``amdgcn`` dGPU - xnack ROCm - FirePro S7150 172 - ``tonga`` [off] - FirePro S7100 173 - FirePro W7100 174 - Radeon R285 175 - Radeon R9 380 176 - Radeon R9 385 177 - Mobile FirePro 178 M7170 179 ``gfx803`` - ``fiji`` ``amdgcn`` dGPU - xnack ROCm - Radeon R9 Nano 180 [off] - Radeon R9 Fury 181 - Radeon R9 FuryX 182 - Radeon Pro Duo 183 - FirePro S9300x2 184 - Radeon Instinct MI8 185 \ - ``polaris10`` ``amdgcn`` dGPU - xnack ROCm - Radeon RX 470 186 [off] - Radeon RX 480 187 - Radeon Instinct MI6 188 \ - ``polaris11`` ``amdgcn`` dGPU - xnack ROCm - Radeon RX 460 189 [off] 190 ``gfx810`` - ``stoney`` ``amdgcn`` APU - xnack 191 [on] 192 **GCN GFX9** [AMD-GCN-GFX9]_ 193 ----------------------------------------------------------------------------------------------- 194 ``gfx900`` ``amdgcn`` dGPU - xnack ROCm - Radeon Vega 195 [off] Frontier Edition 196 - Radeon RX Vega 56 197 - Radeon RX Vega 64 198 - Radeon RX Vega 64 199 Liquid 200 - Radeon Instinct MI25 201 ``gfx902`` ``amdgcn`` APU - xnack - Ryzen 3 2200G 202 [on] - Ryzen 5 2400G 203 ``gfx904`` ``amdgcn`` dGPU - xnack *TBA* 204 [off] 205 .. TODO:: 206 Add product 207 names. 208 ``gfx906`` ``amdgcn`` dGPU - xnack - Radeon Instinct MI50 209 [off] - Radeon Instinct MI60 210 ``gfx908`` ``amdgcn`` dGPU - xnack *TBA* 211 [off] 212 sram-ecc 213 [on] 214 ``gfx909`` ``amdgcn`` APU - xnack *TBA* (Raven Ridge 2) 215 [on] 216 .. TODO:: 217 Add product 218 names. 219 **GCN GFX10** [AMD-GCN-GFX10]_ 220 ----------------------------------------------------------------------------------------------- 221 ``gfx1010`` ``amdgcn`` dGPU - xnack *TBA* 222 [off] 223 - wavefrontsize64 224 [off] 225 - cumode 226 [off] 227 .. TODO:: 228 Add product 229 names. 230 ``gfx1011`` ``amdgcn`` dGPU - xnack *TBA* 231 [off] 232 - wavefrontsize64 233 [off] 234 - cumode 235 [off] 236 .. TODO:: 237 Add product 238 names. 239 ``gfx1012`` ``amdgcn`` dGPU - xnack *TBA* 240 [off] 241 - wavefrontsize64 242 [off] 243 - cumode 244 [off] 245 .. TODO:: 246 Add product 247 names. 248 =========== =============== ============ ===== ================= ======= ====================== 249 250.. _amdgpu-target-features: 251 252Target Features 253--------------- 254 255Target features control how code is generated to support certain 256processor specific features. Not all target features are supported by 257all processors. The runtime must ensure that the features supported by 258the device used to execute the code match the features enabled when 259generating the code. A mismatch of features may result in incorrect 260execution, or a reduction in performance. 261 262The target features supported by each processor, and the default value 263used if not specified explicitly, is listed in 264:ref:`amdgpu-processor-table`. 265 266Use the ``clang -m[no-]<TargetFeature>`` option to specify the AMDGPU 267target features. 268 269For example: 270 271``-mxnack`` 272 Enable the ``xnack`` feature. 273``-mno-xnack`` 274 Disable the ``xnack`` feature. 275 276 .. table:: AMDGPU Target Features 277 :name: amdgpu-target-feature-table 278 279 ====================== ================================================== 280 Target Feature Description 281 ====================== ================================================== 282 -m[no-]xnack Enable/disable generating code that has 283 memory clauses that are compatible with 284 having XNACK replay enabled. 285 286 This is used for demand paging and page 287 migration. If XNACK replay is enabled in 288 the device, then if a page fault occurs 289 the code may execute incorrectly if the 290 ``xnack`` feature is not enabled. Executing 291 code that has the feature enabled on a 292 device that does not have XNACK replay 293 enabled will execute correctly, but may 294 be less performant than code with the 295 feature disabled. 296 297 -m[no-]sram-ecc Enable/disable generating code that assumes SRAM 298 ECC is enabled/disabled. 299 300 -m[no-]wavefrontsize64 Control the default wavefront size used when 301 generating code for kernels. When disabled 302 native wavefront size 32 is used, when enabled 303 wavefront size 64 is used. 304 305 -m[no-]cumode Control the default wavefront execution mode used 306 when generating code for kernels. When disabled 307 native WGP wavefront execution mode is used, 308 when enabled CU wavefront execution mode is used 309 (see :ref:`amdgpu-amdhsa-memory-model`). 310 ====================== ================================================== 311 312.. _amdgpu-address-spaces: 313 314Address Spaces 315-------------- 316 317The AMDGPU architecture supports a number of memory address spaces. The address 318space names use the OpenCL standard names, with some additions. 319 320The AMDGPU address spaces correspond to architecture-specific LLVM address 321space numbers used in LLVM IR. 322 323The AMDGPU address spaces are described in 324:ref:`amdgpu-address-spaces-table`. Only 64-bit process address spaces are 325supported for the ``amdgcn`` target. 326 327 .. table:: AMDGPU Address Spaces 328 :name: amdgpu-address-spaces-table 329 330 ================================= =============== =========== ================ ======= ============================ 331 .. 64-Bit Process Address Space 332 --------------------------------- --------------- ----------- ---------------- ------------------------------------ 333 Address Space Name LLVM IR Address HSA Segment Hardware Address NULL Value 334 Space Number Name Name Size 335 ================================= =============== =========== ================ ======= ============================ 336 Generic 0 flat flat 64 0x0000000000000000 337 Global 1 global global 64 0x0000000000000000 338 Region 2 N/A GDS 32 *not implemented for AMDHSA* 339 Local 3 group LDS 32 0xFFFFFFFF 340 Constant 4 constant *same as global* 64 0x0000000000000000 341 Private 5 private scratch 32 0x00000000 342 Constant 32-bit 6 *TODO* 343 Buffer Fat Pointer (experimental) 7 *TODO* 344 ================================= =============== =========== ================ ======= ============================ 345 346**Generic** 347 The generic address space uses the hardware flat address support available in 348 GFX7-GFX10. This uses two fixed ranges of virtual addresses (the private and 349 local apertures), that are outside the range of addressable global memory, to 350 map from a flat address to a private or local address. 351 352 FLAT instructions can take a flat address and access global, private 353 (scratch), and group (LDS) memory depending on if the address is within one 354 of the aperture ranges. Flat access to scratch requires hardware aperture 355 setup and setup in the kernel prologue (see 356 :ref:`amdgpu-amdhsa-flat-scratch`). Flat access to LDS requires hardware 357 aperture setup and M0 (GFX7-GFX8) register setup (see 358 :ref:`amdgpu-amdhsa-m0`). 359 360 To convert between a private or group address space address (termed a segment 361 address) and a flat address the base address of the corresponding aperture 362 can be used. For GFX7-GFX8 these are available in the 363 :ref:`amdgpu-amdhsa-hsa-aql-queue` the address of which can be obtained with 364 Queue Ptr SGPR (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). For 365 GFX9-GFX10 the aperture base addresses are directly available as inline 366 constant registers ``SRC_SHARED_BASE/LIMIT`` and ``SRC_PRIVATE_BASE/LIMIT``. 367 In 64-bit address mode the aperture sizes are 2^32 bytes and the base is 368 aligned to 2^32 which makes it easier to convert from flat to segment or 369 segment to flat. 370 371 A global address space address has the same value when used as a flat address 372 so no conversion is needed. 373 374**Global and Constant** 375 The global and constant address spaces both use global virtual addresses, 376 which are the same virtual address space used by the CPU. However, some 377 virtual addresses may only be accessible to the CPU, some only accessible 378 by the GPU, and some by both. 379 380 Using the constant address space indicates that the data will not change 381 during the execution of the kernel. This allows scalar read instructions to 382 be used. The vector and scalar L1 caches are invalidated of volatile data 383 before each kernel dispatch execution to allow constant memory to change 384 values between kernel dispatches. 385 386**Region** 387 The region address space uses the hardware Global Data Store (GDS). All 388 wavefronts executing on the same device will access the same memory for any 389 given region address. However, the same region address accessed by wavefronts 390 executing on different devices will access different memory. It is higher 391 performance than global memory. It is allocated by the runtime. The data 392 store (DS) instructions can be used to access it. 393 394**Local** 395 The local address space uses the hardware Local Data Store (LDS) which is 396 automatically allocated when the hardware creates the wavefronts of a 397 work-group, and freed when all the wavefronts of a work-group have 398 terminated. All wavefronts belonging to the same work-group will access the 399 same memory for any given local address. However, the same local address 400 accessed by wavefronts belonging to different work-groups will access 401 different memory. It is higher performance than global memory. The data store 402 (DS) instructions can be used to access it. 403 404**Private** 405 The private address space uses the hardware scratch memory support which 406 automatically allocates memory when it creates a wavefront, and frees it when 407 a wavefronts terminates. The memory accessed by a lane of a wavefront for any 408 given private address will be different to the memory accessed by another lane 409 of the same or different wavefront for the same private address. 410 411 If a kernel dispatch uses scratch, then the hardware allocates memory from a 412 pool of backing memory allocated by the runtime for each wavefront. The lanes 413 of the wavefront access this using dword (4 byte) interleaving. The mapping 414 used from private address to backing memory address is: 415 416 ``wavefront-scratch-base + 417 ((private-address / 4) * wavefront-size * 4) + 418 (wavefront-lane-id * 4) + (private-address % 4)`` 419 420 If each lane of a wavefront accesses the same private address, the 421 interleaving results in adjacent dwords being accessed and hence requires 422 fewer cache lines to be fetched. 423 424 There are different ways that the wavefront scratch base address is 425 determined by a wavefront (see 426 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). 427 428 Scratch memory can be accessed in an interleaved manner using buffer 429 instructions with the scratch buffer descriptor and per wavefront scratch 430 offset, by the scratch instructions, or by flat instructions. Multi-dword 431 access is not supported except by flat and scratch instructions in 432 GFX9-GFX10. 433 434**Constant 32-bit** 435 *TODO* 436 437**Buffer Fat Pointer** 438 The buffer fat pointer is an experimental address space that is currently 439 unsupported in the backend. It exposes a non-integral pointer that is in 440 the future intended to support the modelling of 128-bit buffer descriptors 441 plus a 32-bit offset into the buffer (in total encapsulating a 160-bit 442 *pointer*), allowing normal LLVM load/store/atomic operations to be used to 443 model the buffer descriptors used heavily in graphics workloads targeting 444 the backend. 445 446.. _amdgpu-memory-scopes: 447 448Memory Scopes 449------------- 450 451This section provides LLVM memory synchronization scopes supported by the AMDGPU 452backend memory model when the target triple OS is ``amdhsa`` (see 453:ref:`amdgpu-amdhsa-memory-model` and :ref:`amdgpu-target-triples`). 454 455The memory model supported is based on the HSA memory model [HSA]_ which is 456based in turn on HRF-indirect with scope inclusion [HRF]_. The happens-before 457relation is transitive over the synchronizes-with relation independent of scope, 458and synchronizes-with allows the memory scope instances to be inclusive (see 459table :ref:`amdgpu-amdhsa-llvm-sync-scopes-table`). 460 461This is different to the OpenCL [OpenCL]_ memory model which does not have scope 462inclusion and requires the memory scopes to exactly match. However, this 463is conservatively correct for OpenCL. 464 465 .. table:: AMDHSA LLVM Sync Scopes 466 :name: amdgpu-amdhsa-llvm-sync-scopes-table 467 468 ======================= =================================================== 469 LLVM Sync Scope Description 470 ======================= =================================================== 471 *none* The default: ``system``. 472 473 Synchronizes with, and participates in modification 474 and seq_cst total orderings with, other operations 475 (except image operations) for all address spaces 476 (except private, or generic that accesses private) 477 provided the other operation's sync scope is: 478 479 - ``system``. 480 - ``agent`` and executed by a thread on the same 481 agent. 482 - ``workgroup`` and executed by a thread in the 483 same workgroup. 484 - ``wavefront`` and executed by a thread in the 485 same wavefront. 486 487 ``agent`` Synchronizes with, and participates in modification 488 and seq_cst total orderings with, other operations 489 (except image operations) for all address spaces 490 (except private, or generic that accesses private) 491 provided the other operation's sync scope is: 492 493 - ``system`` or ``agent`` and executed by a thread 494 on the same agent. 495 - ``workgroup`` and executed by a thread in the 496 same workgroup. 497 - ``wavefront`` and executed by a thread in the 498 same wavefront. 499 500 ``workgroup`` Synchronizes with, and participates in modification 501 and seq_cst total orderings with, other operations 502 (except image operations) for all address spaces 503 (except private, or generic that accesses private) 504 provided the other operation's sync scope is: 505 506 - ``system``, ``agent`` or ``workgroup`` and 507 executed by a thread in the same workgroup. 508 - ``wavefront`` and executed by a thread in the 509 same wavefront. 510 511 ``wavefront`` Synchronizes with, and participates in modification 512 and seq_cst total orderings with, other operations 513 (except image operations) for all address spaces 514 (except private, or generic that accesses private) 515 provided the other operation's sync scope is: 516 517 - ``system``, ``agent``, ``workgroup`` or 518 ``wavefront`` and executed by a thread in the 519 same wavefront. 520 521 ``singlethread`` Only synchronizes with, and participates in 522 modification and seq_cst total orderings with, 523 other operations (except image operations) running 524 in the same thread for all address spaces (for 525 example, in signal handlers). 526 527 ``one-as`` Same as ``system`` but only synchronizes with other 528 operations within the same address space. 529 530 ``agent-one-as`` Same as ``agent`` but only synchronizes with other 531 operations within the same address space. 532 533 ``workgroup-one-as`` Same as ``workgroup`` but only synchronizes with 534 other operations within the same address space. 535 536 ``wavefront-one-as`` Same as ``wavefront`` but only synchronizes with 537 other operations within the same address space. 538 539 ``singlethread-one-as`` Same as ``singlethread`` but only synchronizes with 540 other operations within the same address space. 541 ======================= =================================================== 542 543AMDGPU Intrinsics 544----------------- 545 546The AMDGPU backend implements the following LLVM IR intrinsics. 547 548*This section is WIP.* 549 550.. TODO:: 551 552 List AMDGPU intrinsics. 553 554AMDGPU Attributes 555----------------- 556 557The AMDGPU backend supports the following LLVM IR attributes. 558 559 .. table:: AMDGPU LLVM IR Attributes 560 :name: amdgpu-llvm-ir-attributes-table 561 562 ======================================= ========================================================== 563 LLVM Attribute Description 564 ======================================= ========================================================== 565 "amdgpu-flat-work-group-size"="min,max" Specify the minimum and maximum flat work group sizes that 566 will be specified when the kernel is dispatched. Generated 567 by the ``amdgpu_flat_work_group_size`` CLANG attribute [CLANG-ATTR]_. 568 "amdgpu-implicitarg-num-bytes"="n" Number of kernel argument bytes to add to the kernel 569 argument block size for the implicit arguments. This 570 varies by OS and language (for OpenCL see 571 :ref:`opencl-kernel-implicit-arguments-appended-for-amdhsa-os-table`). 572 "amdgpu-num-sgpr"="n" Specifies the number of SGPRs to use. Generated by 573 the ``amdgpu_num_sgpr`` CLANG attribute [CLANG-ATTR]_. 574 "amdgpu-num-vgpr"="n" Specifies the number of VGPRs to use. Generated by the 575 ``amdgpu_num_vgpr`` CLANG attribute [CLANG-ATTR]_. 576 "amdgpu-waves-per-eu"="m,n" Specify the minimum and maximum number of waves per 577 execution unit. Generated by the ``amdgpu_waves_per_eu`` 578 CLANG attribute [CLANG-ATTR]_. 579 "amdgpu-ieee" true/false. Specify whether the function expects the IEEE field of the 580 mode register to be set on entry. Overrides the default for 581 the calling convention. 582 "amdgpu-dx10-clamp" true/false. Specify whether the function expects the DX10_CLAMP field of 583 the mode register to be set on entry. Overrides the default 584 for the calling convention. 585 ======================================= ========================================================== 586 587Code Object 588=========== 589 590The AMDGPU backend generates a standard ELF [ELF]_ relocatable code object that 591can be linked by ``lld`` to produce a standard ELF shared code object which can 592be loaded and executed on an AMDGPU target. 593 594Header 595------ 596 597The AMDGPU backend uses the following ELF header: 598 599 .. table:: AMDGPU ELF Header 600 :name: amdgpu-elf-header-table 601 602 ========================== =============================== 603 Field Value 604 ========================== =============================== 605 ``e_ident[EI_CLASS]`` ``ELFCLASS64`` 606 ``e_ident[EI_DATA]`` ``ELFDATA2LSB`` 607 ``e_ident[EI_OSABI]`` - ``ELFOSABI_NONE`` 608 - ``ELFOSABI_AMDGPU_HSA`` 609 - ``ELFOSABI_AMDGPU_PAL`` 610 - ``ELFOSABI_AMDGPU_MESA3D`` 611 ``e_ident[EI_ABIVERSION]`` - ``ELFABIVERSION_AMDGPU_HSA`` 612 - ``ELFABIVERSION_AMDGPU_PAL`` 613 - ``ELFABIVERSION_AMDGPU_MESA3D`` 614 ``e_type`` - ``ET_REL`` 615 - ``ET_DYN`` 616 ``e_machine`` ``EM_AMDGPU`` 617 ``e_entry`` 0 618 ``e_flags`` See :ref:`amdgpu-elf-header-e_flags-table` 619 ========================== =============================== 620 621.. 622 623 .. table:: AMDGPU ELF Header Enumeration Values 624 :name: amdgpu-elf-header-enumeration-values-table 625 626 =============================== ===== 627 Name Value 628 =============================== ===== 629 ``EM_AMDGPU`` 224 630 ``ELFOSABI_NONE`` 0 631 ``ELFOSABI_AMDGPU_HSA`` 64 632 ``ELFOSABI_AMDGPU_PAL`` 65 633 ``ELFOSABI_AMDGPU_MESA3D`` 66 634 ``ELFABIVERSION_AMDGPU_HSA`` 1 635 ``ELFABIVERSION_AMDGPU_PAL`` 0 636 ``ELFABIVERSION_AMDGPU_MESA3D`` 0 637 =============================== ===== 638 639``e_ident[EI_CLASS]`` 640 The ELF class is: 641 642 * ``ELFCLASS32`` for ``r600`` architecture. 643 644 * ``ELFCLASS64`` for ``amdgcn`` architecture which only supports 64-bit 645 process address space applications. 646 647``e_ident[EI_DATA]`` 648 All AMDGPU targets use ``ELFDATA2LSB`` for little-endian byte ordering. 649 650``e_ident[EI_OSABI]`` 651 One of the following AMDGPU architecture specific OS ABIs 652 (see :ref:`amdgpu-os-table`): 653 654 * ``ELFOSABI_NONE`` for *unknown* OS. 655 656 * ``ELFOSABI_AMDGPU_HSA`` for ``amdhsa`` OS. 657 658 * ``ELFOSABI_AMDGPU_PAL`` for ``amdpal`` OS. 659 660 * ``ELFOSABI_AMDGPU_MESA3D`` for ``mesa3D`` OS. 661 662``e_ident[EI_ABIVERSION]`` 663 The ABI version of the AMDGPU architecture specific OS ABI to which the code 664 object conforms: 665 666 * ``ELFABIVERSION_AMDGPU_HSA`` is used to specify the version of AMD HSA 667 runtime ABI. 668 669 * ``ELFABIVERSION_AMDGPU_PAL`` is used to specify the version of AMD PAL 670 runtime ABI. 671 672 * ``ELFABIVERSION_AMDGPU_MESA3D`` is used to specify the version of AMD MESA 673 3D runtime ABI. 674 675``e_type`` 676 Can be one of the following values: 677 678 679 ``ET_REL`` 680 The type produced by the AMDGPU backend compiler as it is relocatable code 681 object. 682 683 ``ET_DYN`` 684 The type produced by the linker as it is a shared code object. 685 686 The AMD HSA runtime loader requires a ``ET_DYN`` code object. 687 688``e_machine`` 689 The value ``EM_AMDGPU`` is used for the machine for all processors supported 690 by the ``r600`` and ``amdgcn`` architectures (see 691 :ref:`amdgpu-processor-table`). The specific processor is specified in the 692 ``EF_AMDGPU_MACH`` bit field of the ``e_flags`` (see 693 :ref:`amdgpu-elf-header-e_flags-table`). 694 695``e_entry`` 696 The entry point is 0 as the entry points for individual kernels must be 697 selected in order to invoke them through AQL packets. 698 699``e_flags`` 700 The AMDGPU backend uses the following ELF header flags: 701 702 .. table:: AMDGPU ELF Header ``e_flags`` 703 :name: amdgpu-elf-header-e_flags-table 704 705 ================================= ========== ============================= 706 Name Value Description 707 ================================= ========== ============================= 708 **AMDGPU Processor Flag** See :ref:`amdgpu-processor-table`. 709 -------------------------------------------- ----------------------------- 710 ``EF_AMDGPU_MACH`` 0x000000ff AMDGPU processor selection 711 mask for 712 ``EF_AMDGPU_MACH_xxx`` values 713 defined in 714 :ref:`amdgpu-ef-amdgpu-mach-table`. 715 ``EF_AMDGPU_XNACK`` 0x00000100 Indicates if the ``xnack`` 716 target feature is 717 enabled for all code 718 contained in the code object. 719 If the processor 720 does not support the 721 ``xnack`` target 722 feature then must 723 be 0. 724 See 725 :ref:`amdgpu-target-features`. 726 ``EF_AMDGPU_SRAM_ECC`` 0x00000200 Indicates if the ``sram-ecc`` 727 target feature is 728 enabled for all code 729 contained in the code object. 730 If the processor 731 does not support the 732 ``sram-ecc`` target 733 feature then must 734 be 0. 735 See 736 :ref:`amdgpu-target-features`. 737 ================================= ========== ============================= 738 739 .. table:: AMDGPU ``EF_AMDGPU_MACH`` Values 740 :name: amdgpu-ef-amdgpu-mach-table 741 742 ================================= ========== ============================= 743 Name Value Description (see 744 :ref:`amdgpu-processor-table`) 745 ================================= ========== ============================= 746 ``EF_AMDGPU_MACH_NONE`` 0x000 *not specified* 747 ``EF_AMDGPU_MACH_R600_R600`` 0x001 ``r600`` 748 ``EF_AMDGPU_MACH_R600_R630`` 0x002 ``r630`` 749 ``EF_AMDGPU_MACH_R600_RS880`` 0x003 ``rs880`` 750 ``EF_AMDGPU_MACH_R600_RV670`` 0x004 ``rv670`` 751 ``EF_AMDGPU_MACH_R600_RV710`` 0x005 ``rv710`` 752 ``EF_AMDGPU_MACH_R600_RV730`` 0x006 ``rv730`` 753 ``EF_AMDGPU_MACH_R600_RV770`` 0x007 ``rv770`` 754 ``EF_AMDGPU_MACH_R600_CEDAR`` 0x008 ``cedar`` 755 ``EF_AMDGPU_MACH_R600_CYPRESS`` 0x009 ``cypress`` 756 ``EF_AMDGPU_MACH_R600_JUNIPER`` 0x00a ``juniper`` 757 ``EF_AMDGPU_MACH_R600_REDWOOD`` 0x00b ``redwood`` 758 ``EF_AMDGPU_MACH_R600_SUMO`` 0x00c ``sumo`` 759 ``EF_AMDGPU_MACH_R600_BARTS`` 0x00d ``barts`` 760 ``EF_AMDGPU_MACH_R600_CAICOS`` 0x00e ``caicos`` 761 ``EF_AMDGPU_MACH_R600_CAYMAN`` 0x00f ``cayman`` 762 ``EF_AMDGPU_MACH_R600_TURKS`` 0x010 ``turks`` 763 *reserved* 0x011 - Reserved for ``r600`` 764 0x01f architecture processors. 765 ``EF_AMDGPU_MACH_AMDGCN_GFX600`` 0x020 ``gfx600`` 766 ``EF_AMDGPU_MACH_AMDGCN_GFX601`` 0x021 ``gfx601`` 767 ``EF_AMDGPU_MACH_AMDGCN_GFX700`` 0x022 ``gfx700`` 768 ``EF_AMDGPU_MACH_AMDGCN_GFX701`` 0x023 ``gfx701`` 769 ``EF_AMDGPU_MACH_AMDGCN_GFX702`` 0x024 ``gfx702`` 770 ``EF_AMDGPU_MACH_AMDGCN_GFX703`` 0x025 ``gfx703`` 771 ``EF_AMDGPU_MACH_AMDGCN_GFX704`` 0x026 ``gfx704`` 772 *reserved* 0x027 Reserved. 773 ``EF_AMDGPU_MACH_AMDGCN_GFX801`` 0x028 ``gfx801`` 774 ``EF_AMDGPU_MACH_AMDGCN_GFX802`` 0x029 ``gfx802`` 775 ``EF_AMDGPU_MACH_AMDGCN_GFX803`` 0x02a ``gfx803`` 776 ``EF_AMDGPU_MACH_AMDGCN_GFX810`` 0x02b ``gfx810`` 777 ``EF_AMDGPU_MACH_AMDGCN_GFX900`` 0x02c ``gfx900`` 778 ``EF_AMDGPU_MACH_AMDGCN_GFX902`` 0x02d ``gfx902`` 779 ``EF_AMDGPU_MACH_AMDGCN_GFX904`` 0x02e ``gfx904`` 780 ``EF_AMDGPU_MACH_AMDGCN_GFX906`` 0x02f ``gfx906`` 781 ``EF_AMDGPU_MACH_AMDGCN_GFX908`` 0x030 ``gfx908`` 782 ``EF_AMDGPU_MACH_AMDGCN_GFX909`` 0x031 ``gfx909`` 783 *reserved* 0x032 Reserved. 784 ``EF_AMDGPU_MACH_AMDGCN_GFX1010`` 0x033 ``gfx1010`` 785 ``EF_AMDGPU_MACH_AMDGCN_GFX1011`` 0x034 ``gfx1011`` 786 ``EF_AMDGPU_MACH_AMDGCN_GFX1012`` 0x035 ``gfx1012`` 787 ================================= ========== ============================= 788 789Sections 790-------- 791 792An AMDGPU target ELF code object has the standard ELF sections which include: 793 794 .. table:: AMDGPU ELF Sections 795 :name: amdgpu-elf-sections-table 796 797 ================== ================ ================================= 798 Name Type Attributes 799 ================== ================ ================================= 800 ``.bss`` ``SHT_NOBITS`` ``SHF_ALLOC`` + ``SHF_WRITE`` 801 ``.data`` ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_WRITE`` 802 ``.debug_``\ *\** ``SHT_PROGBITS`` *none* 803 ``.dynamic`` ``SHT_DYNAMIC`` ``SHF_ALLOC`` 804 ``.dynstr`` ``SHT_PROGBITS`` ``SHF_ALLOC`` 805 ``.dynsym`` ``SHT_PROGBITS`` ``SHF_ALLOC`` 806 ``.got`` ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_WRITE`` 807 ``.hash`` ``SHT_HASH`` ``SHF_ALLOC`` 808 ``.note`` ``SHT_NOTE`` *none* 809 ``.rela``\ *name* ``SHT_RELA`` *none* 810 ``.rela.dyn`` ``SHT_RELA`` *none* 811 ``.rodata`` ``SHT_PROGBITS`` ``SHF_ALLOC`` 812 ``.shstrtab`` ``SHT_STRTAB`` *none* 813 ``.strtab`` ``SHT_STRTAB`` *none* 814 ``.symtab`` ``SHT_SYMTAB`` *none* 815 ``.text`` ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_EXECINSTR`` 816 ================== ================ ================================= 817 818These sections have their standard meanings (see [ELF]_) and are only generated 819if needed. 820 821``.debug``\ *\** 822 The standard DWARF sections. See :ref:`amdgpu-dwarf` for information on the 823 DWARF produced by the AMDGPU backend. 824 825``.dynamic``, ``.dynstr``, ``.dynsym``, ``.hash`` 826 The standard sections used by a dynamic loader. 827 828``.note`` 829 See :ref:`amdgpu-note-records` for the note records supported by the AMDGPU 830 backend. 831 832``.rela``\ *name*, ``.rela.dyn`` 833 For relocatable code objects, *name* is the name of the section that the 834 relocation records apply. For example, ``.rela.text`` is the section name for 835 relocation records associated with the ``.text`` section. 836 837 For linked shared code objects, ``.rela.dyn`` contains all the relocation 838 records from each of the relocatable code object's ``.rela``\ *name* sections. 839 840 See :ref:`amdgpu-relocation-records` for the relocation records supported by 841 the AMDGPU backend. 842 843``.text`` 844 The executable machine code for the kernels and functions they call. Generated 845 as position independent code. See :ref:`amdgpu-code-conventions` for 846 information on conventions used in the isa generation. 847 848.. _amdgpu-note-records: 849 850Note Records 851------------ 852 853The AMDGPU backend code object contains ELF note records in the ``.note`` 854section. The set of generated notes and their semantics depend on the code 855object version; see :ref:`amdgpu-note-records-v2` and 856:ref:`amdgpu-note-records-v3`. 857 858As required by ``ELFCLASS32`` and ``ELFCLASS64``, minimal zero byte padding 859must be generated after the ``name`` field to ensure the ``desc`` field is 4 860byte aligned. In addition, minimal zero byte padding must be generated to 861ensure the ``desc`` field size is a multiple of 4 bytes. The ``sh_addralign`` 862field of the ``.note`` section must be at least 4 to indicate at least 8 byte 863alignment. 864 865.. _amdgpu-note-records-v2: 866 867Code Object V2 Note Records (-mattr=-code-object-v3) 868~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 869 870.. warning:: Code Object V2 is not the default code object version emitted by 871 this version of LLVM. For a description of the notes generated with the 872 default configuration (Code Object V3) see :ref:`amdgpu-note-records-v3`. 873 874The AMDGPU backend code object uses the following ELF note record in the 875``.note`` section when compiling for Code Object V2 (-mattr=-code-object-v3). 876 877Additional note records may be present, but any which are not documented here 878are deprecated and should not be used. 879 880 .. table:: AMDGPU Code Object V2 ELF Note Records 881 :name: amdgpu-elf-note-records-table-v2 882 883 ===== ============================== ====================================== 884 Name Type Description 885 ===== ============================== ====================================== 886 "AMD" ``NT_AMD_AMDGPU_HSA_METADATA`` <metadata null terminated string> 887 ===== ============================== ====================================== 888 889.. 890 891 .. table:: AMDGPU Code Object V2 ELF Note Record Enumeration Values 892 :name: amdgpu-elf-note-record-enumeration-values-table-v2 893 894 ============================== ===== 895 Name Value 896 ============================== ===== 897 *reserved* 0-9 898 ``NT_AMD_AMDGPU_HSA_METADATA`` 10 899 *reserved* 11 900 ============================== ===== 901 902``NT_AMD_AMDGPU_HSA_METADATA`` 903 Specifies extensible metadata associated with the code objects executed on HSA 904 [HSA]_ compatible runtimes such as AMD's ROCm [AMD-ROCm]_. It is required when 905 the target triple OS is ``amdhsa`` (see :ref:`amdgpu-target-triples`). See 906 :ref:`amdgpu-amdhsa-code-object-metadata-v2` for the syntax of the code 907 object metadata string. 908 909.. _amdgpu-note-records-v3: 910 911Code Object V3 Note Records (-mattr=+code-object-v3) 912~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 913 914The AMDGPU backend code object uses the following ELF note record in the 915``.note`` section when compiling for Code Object V3 (-mattr=+code-object-v3). 916 917Additional note records may be present, but any which are not documented here 918are deprecated and should not be used. 919 920 .. table:: AMDGPU Code Object V3 ELF Note Records 921 :name: amdgpu-elf-note-records-table-v3 922 923 ======== ============================== ====================================== 924 Name Type Description 925 ======== ============================== ====================================== 926 "AMDGPU" ``NT_AMDGPU_METADATA`` Metadata in Message Pack [MsgPack]_ 927 binary format. 928 ======== ============================== ====================================== 929 930.. 931 932 .. table:: AMDGPU Code Object V3 ELF Note Record Enumeration Values 933 :name: amdgpu-elf-note-record-enumeration-values-table-v3 934 935 ============================== ===== 936 Name Value 937 ============================== ===== 938 *reserved* 0-31 939 ``NT_AMDGPU_METADATA`` 32 940 ============================== ===== 941 942``NT_AMDGPU_METADATA`` 943 Specifies extensible metadata associated with an AMDGPU code 944 object. It is encoded as a map in the Message Pack [MsgPack]_ binary 945 data format. See :ref:`amdgpu-amdhsa-code-object-metadata-v3` for the 946 map keys defined for the ``amdhsa`` OS. 947 948.. _amdgpu-symbols: 949 950Symbols 951------- 952 953Symbols include the following: 954 955 .. table:: AMDGPU ELF Symbols 956 :name: amdgpu-elf-symbols-table 957 958 ===================== ================== ================ ================== 959 Name Type Section Description 960 ===================== ================== ================ ================== 961 *link-name* ``STT_OBJECT`` - ``.data`` Global variable 962 - ``.rodata`` 963 - ``.bss`` 964 *link-name*\ ``.kd`` ``STT_OBJECT`` - ``.rodata`` Kernel descriptor 965 *link-name* ``STT_FUNC`` - ``.text`` Kernel entry point 966 *link-name* ``STT_OBJECT`` - SHN_AMDGPU_LDS Global variable in LDS 967 ===================== ================== ================ ================== 968 969Global variable 970 Global variables both used and defined by the compilation unit. 971 972 If the symbol is defined in the compilation unit then it is allocated in the 973 appropriate section according to if it has initialized data or is readonly. 974 975 If the symbol is external then its section is ``STN_UNDEF`` and the loader 976 will resolve relocations using the definition provided by another code object 977 or explicitly defined by the runtime. 978 979 If the symbol resides in local/group memory (LDS) then its section is the 980 special processor-specific section name ``SHN_AMDGPU_LDS``, and the 981 ``st_value`` field describes alignment requirements as it does for common 982 symbols. 983 984 .. TODO:: 985 986 Add description of linked shared object symbols. Seems undefined symbols 987 are marked as STT_NOTYPE. 988 989Kernel descriptor 990 Every HSA kernel has an associated kernel descriptor. It is the address of the 991 kernel descriptor that is used in the AQL dispatch packet used to invoke the 992 kernel, not the kernel entry point. The layout of the HSA kernel descriptor is 993 defined in :ref:`amdgpu-amdhsa-kernel-descriptor`. 994 995Kernel entry point 996 Every HSA kernel also has a symbol for its machine code entry point. 997 998.. _amdgpu-relocation-records: 999 1000Relocation Records 1001------------------ 1002 1003AMDGPU backend generates ``Elf64_Rela`` relocation records. Supported 1004relocatable fields are: 1005 1006``word32`` 1007 This specifies a 32-bit field occupying 4 bytes with arbitrary byte 1008 alignment. These values use the same byte order as other word values in the 1009 AMDGPU architecture. 1010 1011``word64`` 1012 This specifies a 64-bit field occupying 8 bytes with arbitrary byte 1013 alignment. These values use the same byte order as other word values in the 1014 AMDGPU architecture. 1015 1016Following notations are used for specifying relocation calculations: 1017 1018**A** 1019 Represents the addend used to compute the value of the relocatable field. 1020 1021**G** 1022 Represents the offset into the global offset table at which the relocation 1023 entry's symbol will reside during execution. 1024 1025**GOT** 1026 Represents the address of the global offset table. 1027 1028**P** 1029 Represents the place (section offset for ``et_rel`` or address for ``et_dyn``) 1030 of the storage unit being relocated (computed using ``r_offset``). 1031 1032**S** 1033 Represents the value of the symbol whose index resides in the relocation 1034 entry. Relocations not using this must specify a symbol index of 1035 ``STN_UNDEF``. 1036 1037**B** 1038 Represents the base address of a loaded executable or shared object which is 1039 the difference between the ELF address and the actual load address. 1040 Relocations using this are only valid in executable or shared objects. 1041 1042The following relocation types are supported: 1043 1044 .. table:: AMDGPU ELF Relocation Records 1045 :name: amdgpu-elf-relocation-records-table 1046 1047 ========================== ======= ===== ========== ============================== 1048 Relocation Type Kind Value Field Calculation 1049 ========================== ======= ===== ========== ============================== 1050 ``R_AMDGPU_NONE`` 0 *none* *none* 1051 ``R_AMDGPU_ABS32_LO`` Static, 1 ``word32`` (S + A) & 0xFFFFFFFF 1052 Dynamic 1053 ``R_AMDGPU_ABS32_HI`` Static, 2 ``word32`` (S + A) >> 32 1054 Dynamic 1055 ``R_AMDGPU_ABS64`` Static, 3 ``word64`` S + A 1056 Dynamic 1057 ``R_AMDGPU_REL32`` Static 4 ``word32`` S + A - P 1058 ``R_AMDGPU_REL64`` Static 5 ``word64`` S + A - P 1059 ``R_AMDGPU_ABS32`` Static, 6 ``word32`` S + A 1060 Dynamic 1061 ``R_AMDGPU_GOTPCREL`` Static 7 ``word32`` G + GOT + A - P 1062 ``R_AMDGPU_GOTPCREL32_LO`` Static 8 ``word32`` (G + GOT + A - P) & 0xFFFFFFFF 1063 ``R_AMDGPU_GOTPCREL32_HI`` Static 9 ``word32`` (G + GOT + A - P) >> 32 1064 ``R_AMDGPU_REL32_LO`` Static 10 ``word32`` (S + A - P) & 0xFFFFFFFF 1065 ``R_AMDGPU_REL32_HI`` Static 11 ``word32`` (S + A - P) >> 32 1066 *reserved* 12 1067 ``R_AMDGPU_RELATIVE64`` Dynamic 13 ``word64`` B + A 1068 ========================== ======= ===== ========== ============================== 1069 1070``R_AMDGPU_ABS32_LO`` and ``R_AMDGPU_ABS32_HI`` are only supported by 1071the ``mesa3d`` OS, which does not support ``R_AMDGPU_ABS64``. 1072 1073There is no current OS loader support for 32-bit programs and so 1074``R_AMDGPU_ABS32`` is not used. 1075 1076.. _amdgpu-dwarf: 1077 1078DWARF 1079----- 1080 1081Standard DWARF [DWARF]_ Version 5 sections can be generated. These contain 1082information that maps the code object executable code and data to the source 1083language constructs. It can be used by tools such as debuggers and profilers. 1084 1085Address Space Mapping 1086~~~~~~~~~~~~~~~~~~~~~ 1087 1088The following address space mapping is used: 1089 1090 .. table:: AMDGPU DWARF Address Space Mapping 1091 :name: amdgpu-dwarf-address-space-mapping-table 1092 1093 =================== ================= 1094 DWARF Address Space Memory Space 1095 =================== ================= 1096 1 Private (Scratch) 1097 2 Local (group/LDS) 1098 *omitted* Global 1099 *omitted* Constant 1100 *omitted* Generic (Flat) 1101 *not supported* Region (GDS) 1102 =================== ================= 1103 1104See :ref:`amdgpu-address-spaces` for information on the address space 1105terminology used in the table. 1106 1107An ``address_class`` attribute is generated on pointer type DIEs to specify the 1108DWARF address space of the value of the pointer when it is in the *private* or 1109*local* address space. Otherwise the attribute is omitted. 1110 1111An ``DW_OP_xderef`` operation is generated in location list expressions for 1112variables that are allocated in the *private* and *local* address space. 1113Otherwise, ``DW_OP_xderef`` is omitted. 1114 1115Register Mapping 1116~~~~~~~~~~~~~~~~ 1117 1118*This section is WIP.* 1119 1120.. TODO:: 1121 Define DWARF register enumeration. 1122 1123 If want to present a wavefront state then should expose vector registers as 1124 64 dword wide (rather than per work-item view that LLVM uses). Either as 1125 separate registers, or a 64x4 byte single register. In either case use a new 1126 ``DW_OP_lane`` op (akin to ``DW_OP_xderef``) to select the current lane usage 1127 in a location expression. This would also allow scalar register spilling to 1128 vector register lanes to be expressed (currently no debug information is 1129 being generated for spilling). If choose a wide single register approach then 1130 use ``DW_OP_lane`` in conjunction with ``DW_OP_piece`` operation to select 1131 the dword part of the register for the current lane. If the separate register 1132 approach then use ``DW_OP_lane`` to select the register. 1133 1134Source Text 1135~~~~~~~~~~~ 1136 1137Source text for online-compiled programs (e.g. those compiled by the OpenCL 1138runtime) may be embedded into the DWARF v5 line table using the ``clang 1139-gembed-source`` option, described in table :ref:`amdgpu-debug-options`. 1140 1141For example: 1142 1143``-gembed-source`` 1144 Enable the embedded source DWARF v5 extension. 1145``-gno-embed-source`` 1146 Disable the embedded source DWARF v5 extension. 1147 1148 .. table:: AMDGPU Debug Options 1149 :name: amdgpu-debug-options 1150 1151 ==================== ================================================== 1152 Debug Flag Description 1153 ==================== ================================================== 1154 -g[no-]embed-source Enable/disable embedding source text in DWARF 1155 debug sections. Useful for environments where 1156 source cannot be written to disk, such as 1157 when performing online compilation. 1158 ==================== ================================================== 1159 1160This option enables one extended content types in the DWARF v5 Line Number 1161Program Header, which is used to encode embedded source. 1162 1163 .. table:: AMDGPU DWARF Line Number Program Header Extended Content Types 1164 :name: amdgpu-dwarf-extended-content-types 1165 1166 ============================ ====================== 1167 Content Type Form 1168 ============================ ====================== 1169 ``DW_LNCT_LLVM_source`` ``DW_FORM_line_strp`` 1170 ============================ ====================== 1171 1172The source field will contain the UTF-8 encoded, null-terminated source text 1173with ``'\n'`` line endings. When the source field is present, consumers can use 1174the embedded source instead of attempting to discover the source on disk. When 1175the source field is absent, consumers can access the file to get the source 1176text. 1177 1178The above content type appears in the ``file_name_entry_format`` field of the 1179line table prologue, and its corresponding value appear in the ``file_names`` 1180field. The current encoding of the content type is documented in table 1181:ref:`amdgpu-dwarf-extended-content-types-encoding` 1182 1183 .. table:: AMDGPU DWARF Line Number Program Header Extended Content Types Encoding 1184 :name: amdgpu-dwarf-extended-content-types-encoding 1185 1186 ============================ ==================== 1187 Content Type Value 1188 ============================ ==================== 1189 ``DW_LNCT_LLVM_source`` 0x2001 1190 ============================ ==================== 1191 1192.. _amdgpu-code-conventions: 1193 1194Code Conventions 1195================ 1196 1197This section provides code conventions used for each supported target triple OS 1198(see :ref:`amdgpu-target-triples`). 1199 1200AMDHSA 1201------ 1202 1203This section provides code conventions used when the target triple OS is 1204``amdhsa`` (see :ref:`amdgpu-target-triples`). 1205 1206.. _amdgpu-amdhsa-code-object-target-identification: 1207 1208Code Object Target Identification 1209~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 1210 1211The AMDHSA OS uses the following syntax to specify the code object 1212target as a single string: 1213 1214 ``<Architecture>-<Vendor>-<OS>-<Environment>-<Processor><Target Features>`` 1215 1216Where: 1217 1218 - ``<Architecture>``, ``<Vendor>``, ``<OS>`` and ``<Environment>`` 1219 are the same as the *Target Triple* (see 1220 :ref:`amdgpu-target-triples`). 1221 1222 - ``<Processor>`` is the same as the *Processor* (see 1223 :ref:`amdgpu-processors`). 1224 1225 - ``<Target Features>`` is a list of the enabled *Target Features* 1226 (see :ref:`amdgpu-target-features`), each prefixed by a plus, that 1227 apply to *Processor*. The list must be in the same order as listed 1228 in the table :ref:`amdgpu-target-feature-table`. Note that *Target 1229 Features* must be included in the list if they are enabled even if 1230 that is the default for *Processor*. 1231 1232For example: 1233 1234 ``"amdgcn-amd-amdhsa--gfx902+xnack"`` 1235 1236.. _amdgpu-amdhsa-code-object-metadata: 1237 1238Code Object Metadata 1239~~~~~~~~~~~~~~~~~~~~ 1240 1241The code object metadata specifies extensible metadata associated with the code 1242objects executed on HSA [HSA]_ compatible runtimes such as AMD's ROCm 1243[AMD-ROCm]_. The encoding and semantics of this metadata depends on the code 1244object version; see :ref:`amdgpu-amdhsa-code-object-metadata-v2` and 1245:ref:`amdgpu-amdhsa-code-object-metadata-v3`. 1246 1247Code object metadata is specified in a note record (see 1248:ref:`amdgpu-note-records`) and is required when the target triple OS is 1249``amdhsa`` (see :ref:`amdgpu-target-triples`). It must contain the minimum 1250information necessary to support the ROCM kernel queries. For example, the 1251segment sizes needed in a dispatch packet. In addition, a high level language 1252runtime may require other information to be included. For example, the AMD 1253OpenCL runtime records kernel argument information. 1254 1255.. _amdgpu-amdhsa-code-object-metadata-v2: 1256 1257Code Object V2 Metadata (-mattr=-code-object-v3) 1258++++++++++++++++++++++++++++++++++++++++++++++++ 1259 1260.. warning:: Code Object V2 is not the default code object version emitted by 1261 this version of LLVM. For a description of the metadata generated with the 1262 default configuration (Code Object V3) see 1263 :ref:`amdgpu-amdhsa-code-object-metadata-v3`. 1264 1265Code object V2 metadata is specified by the ``NT_AMD_AMDGPU_METADATA`` note 1266record (see :ref:`amdgpu-note-records-v2`). 1267 1268The metadata is specified as a YAML formatted string (see [YAML]_ and 1269:doc:`YamlIO`). 1270 1271.. TODO:: 1272 1273 Is the string null terminated? It probably should not if YAML allows it to 1274 contain null characters, otherwise it should be. 1275 1276The metadata is represented as a single YAML document comprised of the mapping 1277defined in table :ref:`amdgpu-amdhsa-code-object-metadata-map-table-v2` and 1278referenced tables. 1279 1280For boolean values, the string values of ``false`` and ``true`` are used for 1281false and true respectively. 1282 1283Additional information can be added to the mappings. To avoid conflicts, any 1284non-AMD key names should be prefixed by "*vendor-name*.". 1285 1286 .. table:: AMDHSA Code Object V2 Metadata Map 1287 :name: amdgpu-amdhsa-code-object-metadata-map-table-v2 1288 1289 ========== ============== ========= ======================================= 1290 String Key Value Type Required? Description 1291 ========== ============== ========= ======================================= 1292 "Version" sequence of Required - The first integer is the major 1293 2 integers version. Currently 1. 1294 - The second integer is the minor 1295 version. Currently 0. 1296 "Printf" sequence of Each string is encoded information 1297 strings about a printf function call. The 1298 encoded information is organized as 1299 fields separated by colon (':'): 1300 1301 ``ID:N:S[0]:S[1]:...:S[N-1]:FormatString`` 1302 1303 where: 1304 1305 ``ID`` 1306 A 32-bit integer as a unique id for 1307 each printf function call 1308 1309 ``N`` 1310 A 32-bit integer equal to the number 1311 of arguments of printf function call 1312 minus 1 1313 1314 ``S[i]`` (where i = 0, 1, ... , N-1) 1315 32-bit integers for the size in bytes 1316 of the i-th FormatString argument of 1317 the printf function call 1318 1319 FormatString 1320 The format string passed to the 1321 printf function call. 1322 "Kernels" sequence of Required Sequence of the mappings for each 1323 mapping kernel in the code object. See 1324 :ref:`amdgpu-amdhsa-code-object-kernel-metadata-map-table-v2` 1325 for the definition of the mapping. 1326 ========== ============== ========= ======================================= 1327 1328.. 1329 1330 .. table:: AMDHSA Code Object V2 Kernel Metadata Map 1331 :name: amdgpu-amdhsa-code-object-kernel-metadata-map-table-v2 1332 1333 ================= ============== ========= ================================ 1334 String Key Value Type Required? Description 1335 ================= ============== ========= ================================ 1336 "Name" string Required Source name of the kernel. 1337 "SymbolName" string Required Name of the kernel 1338 descriptor ELF symbol. 1339 "Language" string Source language of the kernel. 1340 Values include: 1341 1342 - "OpenCL C" 1343 - "OpenCL C++" 1344 - "HCC" 1345 - "OpenMP" 1346 1347 "LanguageVersion" sequence of - The first integer is the major 1348 2 integers version. 1349 - The second integer is the 1350 minor version. 1351 "Attrs" mapping Mapping of kernel attributes. 1352 See 1353 :ref:`amdgpu-amdhsa-code-object-kernel-attribute-metadata-map-table-v2` 1354 for the mapping definition. 1355 "Args" sequence of Sequence of mappings of the 1356 mapping kernel arguments. See 1357 :ref:`amdgpu-amdhsa-code-object-kernel-argument-metadata-map-table-v2` 1358 for the definition of the mapping. 1359 "CodeProps" mapping Mapping of properties related to 1360 the kernel code. See 1361 :ref:`amdgpu-amdhsa-code-object-kernel-code-properties-metadata-map-table-v2` 1362 for the mapping definition. 1363 ================= ============== ========= ================================ 1364 1365.. 1366 1367 .. table:: AMDHSA Code Object V2 Kernel Attribute Metadata Map 1368 :name: amdgpu-amdhsa-code-object-kernel-attribute-metadata-map-table-v2 1369 1370 =================== ============== ========= ============================== 1371 String Key Value Type Required? Description 1372 =================== ============== ========= ============================== 1373 "ReqdWorkGroupSize" sequence of If not 0, 0, 0 then all values 1374 3 integers must be >=1 and the dispatch 1375 work-group size X, Y, Z must 1376 correspond to the specified 1377 values. Defaults to 0, 0, 0. 1378 1379 Corresponds to the OpenCL 1380 ``reqd_work_group_size`` 1381 attribute. 1382 "WorkGroupSizeHint" sequence of The dispatch work-group size 1383 3 integers X, Y, Z is likely to be the 1384 specified values. 1385 1386 Corresponds to the OpenCL 1387 ``work_group_size_hint`` 1388 attribute. 1389 "VecTypeHint" string The name of a scalar or vector 1390 type. 1391 1392 Corresponds to the OpenCL 1393 ``vec_type_hint`` attribute. 1394 1395 "RuntimeHandle" string The external symbol name 1396 associated with a kernel. 1397 OpenCL runtime allocates a 1398 global buffer for the symbol 1399 and saves the kernel's address 1400 to it, which is used for 1401 device side enqueueing. Only 1402 available for device side 1403 enqueued kernels. 1404 =================== ============== ========= ============================== 1405 1406.. 1407 1408 .. table:: AMDHSA Code Object V2 Kernel Argument Metadata Map 1409 :name: amdgpu-amdhsa-code-object-kernel-argument-metadata-map-table-v2 1410 1411 ================= ============== ========= ================================ 1412 String Key Value Type Required? Description 1413 ================= ============== ========= ================================ 1414 "Name" string Kernel argument name. 1415 "TypeName" string Kernel argument type name. 1416 "Size" integer Required Kernel argument size in bytes. 1417 "Align" integer Required Kernel argument alignment in 1418 bytes. Must be a power of two. 1419 "ValueKind" string Required Kernel argument kind that 1420 specifies how to set up the 1421 corresponding argument. 1422 Values include: 1423 1424 "ByValue" 1425 The argument is copied 1426 directly into the kernarg. 1427 1428 "GlobalBuffer" 1429 A global address space pointer 1430 to the buffer data is passed 1431 in the kernarg. 1432 1433 "DynamicSharedPointer" 1434 A group address space pointer 1435 to dynamically allocated LDS 1436 is passed in the kernarg. 1437 1438 "Sampler" 1439 A global address space 1440 pointer to a S# is passed in 1441 the kernarg. 1442 1443 "Image" 1444 A global address space 1445 pointer to a T# is passed in 1446 the kernarg. 1447 1448 "Pipe" 1449 A global address space pointer 1450 to an OpenCL pipe is passed in 1451 the kernarg. 1452 1453 "Queue" 1454 A global address space pointer 1455 to an OpenCL device enqueue 1456 queue is passed in the 1457 kernarg. 1458 1459 "HiddenGlobalOffsetX" 1460 The OpenCL grid dispatch 1461 global offset for the X 1462 dimension is passed in the 1463 kernarg. 1464 1465 "HiddenGlobalOffsetY" 1466 The OpenCL grid dispatch 1467 global offset for the Y 1468 dimension is passed in the 1469 kernarg. 1470 1471 "HiddenGlobalOffsetZ" 1472 The OpenCL grid dispatch 1473 global offset for the Z 1474 dimension is passed in the 1475 kernarg. 1476 1477 "HiddenNone" 1478 An argument that is not used 1479 by the kernel. Space needs to 1480 be left for it, but it does 1481 not need to be set up. 1482 1483 "HiddenPrintfBuffer" 1484 A global address space pointer 1485 to the runtime printf buffer 1486 is passed in kernarg. 1487 1488 "HiddenHostcallBuffer" 1489 A global address space pointer 1490 to the runtime hostcall buffer 1491 is passed in kernarg. 1492 1493 "HiddenDefaultQueue" 1494 A global address space pointer 1495 to the OpenCL device enqueue 1496 queue that should be used by 1497 the kernel by default is 1498 passed in the kernarg. 1499 1500 "HiddenCompletionAction" 1501 A global address space pointer 1502 to help link enqueued kernels into 1503 the ancestor tree for determining 1504 when the parent kernel has finished. 1505 1506 "HiddenMultiGridSyncArg" 1507 A global address space pointer for 1508 multi-grid synchronization is 1509 passed in the kernarg. 1510 1511 "ValueType" string Required Kernel argument value type. Only 1512 present if "ValueKind" is 1513 "ByValue". For vector data 1514 types, the value is for the 1515 element type. Values include: 1516 1517 - "Struct" 1518 - "I8" 1519 - "U8" 1520 - "I16" 1521 - "U16" 1522 - "F16" 1523 - "I32" 1524 - "U32" 1525 - "F32" 1526 - "I64" 1527 - "U64" 1528 - "F64" 1529 1530 .. TODO:: 1531 How can it be determined if a 1532 vector type, and what size 1533 vector? 1534 "PointeeAlign" integer Alignment in bytes of pointee 1535 type for pointer type kernel 1536 argument. Must be a power 1537 of 2. Only present if 1538 "ValueKind" is 1539 "DynamicSharedPointer". 1540 "AddrSpaceQual" string Kernel argument address space 1541 qualifier. Only present if 1542 "ValueKind" is "GlobalBuffer" or 1543 "DynamicSharedPointer". Values 1544 are: 1545 1546 - "Private" 1547 - "Global" 1548 - "Constant" 1549 - "Local" 1550 - "Generic" 1551 - "Region" 1552 1553 .. TODO:: 1554 Is GlobalBuffer only Global 1555 or Constant? Is 1556 DynamicSharedPointer always 1557 Local? Can HCC allow Generic? 1558 How can Private or Region 1559 ever happen? 1560 "AccQual" string Kernel argument access 1561 qualifier. Only present if 1562 "ValueKind" is "Image" or 1563 "Pipe". Values 1564 are: 1565 1566 - "ReadOnly" 1567 - "WriteOnly" 1568 - "ReadWrite" 1569 1570 .. TODO:: 1571 Does this apply to 1572 GlobalBuffer? 1573 "ActualAccQual" string The actual memory accesses 1574 performed by the kernel on the 1575 kernel argument. Only present if 1576 "ValueKind" is "GlobalBuffer", 1577 "Image", or "Pipe". This may be 1578 more restrictive than indicated 1579 by "AccQual" to reflect what the 1580 kernel actual does. If not 1581 present then the runtime must 1582 assume what is implied by 1583 "AccQual" and "IsConst". Values 1584 are: 1585 1586 - "ReadOnly" 1587 - "WriteOnly" 1588 - "ReadWrite" 1589 1590 "IsConst" boolean Indicates if the kernel argument 1591 is const qualified. Only present 1592 if "ValueKind" is 1593 "GlobalBuffer". 1594 1595 "IsRestrict" boolean Indicates if the kernel argument 1596 is restrict qualified. Only 1597 present if "ValueKind" is 1598 "GlobalBuffer". 1599 1600 "IsVolatile" boolean Indicates if the kernel argument 1601 is volatile qualified. Only 1602 present if "ValueKind" is 1603 "GlobalBuffer". 1604 1605 "IsPipe" boolean Indicates if the kernel argument 1606 is pipe qualified. Only present 1607 if "ValueKind" is "Pipe". 1608 1609 .. TODO:: 1610 Can GlobalBuffer be pipe 1611 qualified? 1612 ================= ============== ========= ================================ 1613 1614.. 1615 1616 .. table:: AMDHSA Code Object V2 Kernel Code Properties Metadata Map 1617 :name: amdgpu-amdhsa-code-object-kernel-code-properties-metadata-map-table-v2 1618 1619 ============================ ============== ========= ===================== 1620 String Key Value Type Required? Description 1621 ============================ ============== ========= ===================== 1622 "KernargSegmentSize" integer Required The size in bytes of 1623 the kernarg segment 1624 that holds the values 1625 of the arguments to 1626 the kernel. 1627 "GroupSegmentFixedSize" integer Required The amount of group 1628 segment memory 1629 required by a 1630 work-group in 1631 bytes. This does not 1632 include any 1633 dynamically allocated 1634 group segment memory 1635 that may be added 1636 when the kernel is 1637 dispatched. 1638 "PrivateSegmentFixedSize" integer Required The amount of fixed 1639 private address space 1640 memory required for a 1641 work-item in 1642 bytes. If the kernel 1643 uses a dynamic call 1644 stack then additional 1645 space must be added 1646 to this value for the 1647 call stack. 1648 "KernargSegmentAlign" integer Required The maximum byte 1649 alignment of 1650 arguments in the 1651 kernarg segment. Must 1652 be a power of 2. 1653 "WavefrontSize" integer Required Wavefront size. Must 1654 be a power of 2. 1655 "NumSGPRs" integer Required Number of scalar 1656 registers used by a 1657 wavefront for 1658 GFX6-GFX10. This 1659 includes the special 1660 SGPRs for VCC, Flat 1661 Scratch (GFX7-GFX10) 1662 and XNACK (for 1663 GFX8-GFX10). It does 1664 not include the 16 1665 SGPR added if a trap 1666 handler is 1667 enabled. It is not 1668 rounded up to the 1669 allocation 1670 granularity. 1671 "NumVGPRs" integer Required Number of vector 1672 registers used by 1673 each work-item for 1674 GFX6-GFX10 1675 "MaxFlatWorkGroupSize" integer Required Maximum flat 1676 work-group size 1677 supported by the 1678 kernel in work-items. 1679 Must be >=1 and 1680 consistent with 1681 ReqdWorkGroupSize if 1682 not 0, 0, 0. 1683 "NumSpilledSGPRs" integer Number of stores from 1684 a scalar register to 1685 a register allocator 1686 created spill 1687 location. 1688 "NumSpilledVGPRs" integer Number of stores from 1689 a vector register to 1690 a register allocator 1691 created spill 1692 location. 1693 ============================ ============== ========= ===================== 1694 1695.. _amdgpu-amdhsa-code-object-metadata-v3: 1696 1697Code Object V3 Metadata (-mattr=+code-object-v3) 1698++++++++++++++++++++++++++++++++++++++++++++++++ 1699 1700Code object V3 metadata is specified by the ``NT_AMDGPU_METADATA`` note record 1701(see :ref:`amdgpu-note-records-v3`). 1702 1703The metadata is represented as Message Pack formatted binary data (see 1704[MsgPack]_). The top level is a Message Pack map that includes the 1705keys defined in table 1706:ref:`amdgpu-amdhsa-code-object-metadata-map-table-v3` and referenced 1707tables. 1708 1709Additional information can be added to the maps. To avoid conflicts, 1710any key names should be prefixed by "*vendor-name*." where 1711``vendor-name`` can be the name of the vendor and specific vendor 1712tool that generates the information. The prefix is abbreviated to 1713simply "." when it appears within a map that has been added by the 1714same *vendor-name*. 1715 1716 .. table:: AMDHSA Code Object V3 Metadata Map 1717 :name: amdgpu-amdhsa-code-object-metadata-map-table-v3 1718 1719 ================= ============== ========= ======================================= 1720 String Key Value Type Required? Description 1721 ================= ============== ========= ======================================= 1722 "amdhsa.version" sequence of Required - The first integer is the major 1723 2 integers version. Currently 1. 1724 - The second integer is the minor 1725 version. Currently 0. 1726 "amdhsa.printf" sequence of Each string is encoded information 1727 strings about a printf function call. The 1728 encoded information is organized as 1729 fields separated by colon (':'): 1730 1731 ``ID:N:S[0]:S[1]:...:S[N-1]:FormatString`` 1732 1733 where: 1734 1735 ``ID`` 1736 A 32-bit integer as a unique id for 1737 each printf function call 1738 1739 ``N`` 1740 A 32-bit integer equal to the number 1741 of arguments of printf function call 1742 minus 1 1743 1744 ``S[i]`` (where i = 0, 1, ... , N-1) 1745 32-bit integers for the size in bytes 1746 of the i-th FormatString argument of 1747 the printf function call 1748 1749 FormatString 1750 The format string passed to the 1751 printf function call. 1752 "amdhsa.kernels" sequence of Required Sequence of the maps for each 1753 map kernel in the code object. See 1754 :ref:`amdgpu-amdhsa-code-object-kernel-metadata-map-table-v3` 1755 for the definition of the keys included 1756 in that map. 1757 ================= ============== ========= ======================================= 1758 1759.. 1760 1761 .. table:: AMDHSA Code Object V3 Kernel Metadata Map 1762 :name: amdgpu-amdhsa-code-object-kernel-metadata-map-table-v3 1763 1764 =================================== ============== ========= ================================ 1765 String Key Value Type Required? Description 1766 =================================== ============== ========= ================================ 1767 ".name" string Required Source name of the kernel. 1768 ".symbol" string Required Name of the kernel 1769 descriptor ELF symbol. 1770 ".language" string Source language of the kernel. 1771 Values include: 1772 1773 - "OpenCL C" 1774 - "OpenCL C++" 1775 - "HCC" 1776 - "HIP" 1777 - "OpenMP" 1778 - "Assembler" 1779 1780 ".language_version" sequence of - The first integer is the major 1781 2 integers version. 1782 - The second integer is the 1783 minor version. 1784 ".args" sequence of Sequence of maps of the 1785 map kernel arguments. See 1786 :ref:`amdgpu-amdhsa-code-object-kernel-argument-metadata-map-table-v3` 1787 for the definition of the keys 1788 included in that map. 1789 ".reqd_workgroup_size" sequence of If not 0, 0, 0 then all values 1790 3 integers must be >=1 and the dispatch 1791 work-group size X, Y, Z must 1792 correspond to the specified 1793 values. Defaults to 0, 0, 0. 1794 1795 Corresponds to the OpenCL 1796 ``reqd_work_group_size`` 1797 attribute. 1798 ".workgroup_size_hint" sequence of The dispatch work-group size 1799 3 integers X, Y, Z is likely to be the 1800 specified values. 1801 1802 Corresponds to the OpenCL 1803 ``work_group_size_hint`` 1804 attribute. 1805 ".vec_type_hint" string The name of a scalar or vector 1806 type. 1807 1808 Corresponds to the OpenCL 1809 ``vec_type_hint`` attribute. 1810 1811 ".device_enqueue_symbol" string The external symbol name 1812 associated with a kernel. 1813 OpenCL runtime allocates a 1814 global buffer for the symbol 1815 and saves the kernel's address 1816 to it, which is used for 1817 device side enqueueing. Only 1818 available for device side 1819 enqueued kernels. 1820 ".kernarg_segment_size" integer Required The size in bytes of 1821 the kernarg segment 1822 that holds the values 1823 of the arguments to 1824 the kernel. 1825 ".group_segment_fixed_size" integer Required The amount of group 1826 segment memory 1827 required by a 1828 work-group in 1829 bytes. This does not 1830 include any 1831 dynamically allocated 1832 group segment memory 1833 that may be added 1834 when the kernel is 1835 dispatched. 1836 ".private_segment_fixed_size" integer Required The amount of fixed 1837 private address space 1838 memory required for a 1839 work-item in 1840 bytes. If the kernel 1841 uses a dynamic call 1842 stack then additional 1843 space must be added 1844 to this value for the 1845 call stack. 1846 ".kernarg_segment_align" integer Required The maximum byte 1847 alignment of 1848 arguments in the 1849 kernarg segment. Must 1850 be a power of 2. 1851 ".wavefront_size" integer Required Wavefront size. Must 1852 be a power of 2. 1853 ".sgpr_count" integer Required Number of scalar 1854 registers required by a 1855 wavefront for 1856 GFX6-GFX9. A register 1857 is required if it is 1858 used explicitly, or 1859 if a higher numbered 1860 register is used 1861 explicitly. This 1862 includes the special 1863 SGPRs for VCC, Flat 1864 Scratch (GFX7-GFX9) 1865 and XNACK (for 1866 GFX8-GFX9). It does 1867 not include the 16 1868 SGPR added if a trap 1869 handler is 1870 enabled. It is not 1871 rounded up to the 1872 allocation 1873 granularity. 1874 ".vgpr_count" integer Required Number of vector 1875 registers required by 1876 each work-item for 1877 GFX6-GFX9. A register 1878 is required if it is 1879 used explicitly, or 1880 if a higher numbered 1881 register is used 1882 explicitly. 1883 ".max_flat_workgroup_size" integer Required Maximum flat 1884 work-group size 1885 supported by the 1886 kernel in work-items. 1887 Must be >=1 and 1888 consistent with 1889 ReqdWorkGroupSize if 1890 not 0, 0, 0. 1891 ".sgpr_spill_count" integer Number of stores from 1892 a scalar register to 1893 a register allocator 1894 created spill 1895 location. 1896 ".vgpr_spill_count" integer Number of stores from 1897 a vector register to 1898 a register allocator 1899 created spill 1900 location. 1901 =================================== ============== ========= ================================ 1902 1903.. 1904 1905 .. table:: AMDHSA Code Object V3 Kernel Argument Metadata Map 1906 :name: amdgpu-amdhsa-code-object-kernel-argument-metadata-map-table-v3 1907 1908 ====================== ============== ========= ================================ 1909 String Key Value Type Required? Description 1910 ====================== ============== ========= ================================ 1911 ".name" string Kernel argument name. 1912 ".type_name" string Kernel argument type name. 1913 ".size" integer Required Kernel argument size in bytes. 1914 ".offset" integer Required Kernel argument offset in 1915 bytes. The offset must be a 1916 multiple of the alignment 1917 required by the argument. 1918 ".value_kind" string Required Kernel argument kind that 1919 specifies how to set up the 1920 corresponding argument. 1921 Values include: 1922 1923 "by_value" 1924 The argument is copied 1925 directly into the kernarg. 1926 1927 "global_buffer" 1928 A global address space pointer 1929 to the buffer data is passed 1930 in the kernarg. 1931 1932 "dynamic_shared_pointer" 1933 A group address space pointer 1934 to dynamically allocated LDS 1935 is passed in the kernarg. 1936 1937 "sampler" 1938 A global address space 1939 pointer to a S# is passed in 1940 the kernarg. 1941 1942 "image" 1943 A global address space 1944 pointer to a T# is passed in 1945 the kernarg. 1946 1947 "pipe" 1948 A global address space pointer 1949 to an OpenCL pipe is passed in 1950 the kernarg. 1951 1952 "queue" 1953 A global address space pointer 1954 to an OpenCL device enqueue 1955 queue is passed in the 1956 kernarg. 1957 1958 "hidden_global_offset_x" 1959 The OpenCL grid dispatch 1960 global offset for the X 1961 dimension is passed in the 1962 kernarg. 1963 1964 "hidden_global_offset_y" 1965 The OpenCL grid dispatch 1966 global offset for the Y 1967 dimension is passed in the 1968 kernarg. 1969 1970 "hidden_global_offset_z" 1971 The OpenCL grid dispatch 1972 global offset for the Z 1973 dimension is passed in the 1974 kernarg. 1975 1976 "hidden_none" 1977 An argument that is not used 1978 by the kernel. Space needs to 1979 be left for it, but it does 1980 not need to be set up. 1981 1982 "hidden_printf_buffer" 1983 A global address space pointer 1984 to the runtime printf buffer 1985 is passed in kernarg. 1986 1987 "hidden_hostcall_buffer" 1988 A global address space pointer 1989 to the runtime hostcall buffer 1990 is passed in kernarg. 1991 1992 "hidden_default_queue" 1993 A global address space pointer 1994 to the OpenCL device enqueue 1995 queue that should be used by 1996 the kernel by default is 1997 passed in the kernarg. 1998 1999 "hidden_completion_action" 2000 A global address space pointer 2001 to help link enqueued kernels into 2002 the ancestor tree for determining 2003 when the parent kernel has finished. 2004 2005 "hidden_multigrid_sync_arg" 2006 A global address space pointer for 2007 multi-grid synchronization is 2008 passed in the kernarg. 2009 2010 ".value_type" string Required Kernel argument value type. Only 2011 present if ".value_kind" is 2012 "by_value". For vector data 2013 types, the value is for the 2014 element type. Values include: 2015 2016 - "struct" 2017 - "i8" 2018 - "u8" 2019 - "i16" 2020 - "u16" 2021 - "f16" 2022 - "i32" 2023 - "u32" 2024 - "f32" 2025 - "i64" 2026 - "u64" 2027 - "f64" 2028 2029 .. TODO:: 2030 How can it be determined if a 2031 vector type, and what size 2032 vector? 2033 ".pointee_align" integer Alignment in bytes of pointee 2034 type for pointer type kernel 2035 argument. Must be a power 2036 of 2. Only present if 2037 ".value_kind" is 2038 "dynamic_shared_pointer". 2039 ".address_space" string Kernel argument address space 2040 qualifier. Only present if 2041 ".value_kind" is "global_buffer" or 2042 "dynamic_shared_pointer". Values 2043 are: 2044 2045 - "private" 2046 - "global" 2047 - "constant" 2048 - "local" 2049 - "generic" 2050 - "region" 2051 2052 .. TODO:: 2053 Is "global_buffer" only "global" 2054 or "constant"? Is 2055 "dynamic_shared_pointer" always 2056 "local"? Can HCC allow "generic"? 2057 How can "private" or "region" 2058 ever happen? 2059 ".access" string Kernel argument access 2060 qualifier. Only present if 2061 ".value_kind" is "image" or 2062 "pipe". Values 2063 are: 2064 2065 - "read_only" 2066 - "write_only" 2067 - "read_write" 2068 2069 .. TODO:: 2070 Does this apply to 2071 "global_buffer"? 2072 ".actual_access" string The actual memory accesses 2073 performed by the kernel on the 2074 kernel argument. Only present if 2075 ".value_kind" is "global_buffer", 2076 "image", or "pipe". This may be 2077 more restrictive than indicated 2078 by ".access" to reflect what the 2079 kernel actual does. If not 2080 present then the runtime must 2081 assume what is implied by 2082 ".access" and ".is_const" . Values 2083 are: 2084 2085 - "read_only" 2086 - "write_only" 2087 - "read_write" 2088 2089 ".is_const" boolean Indicates if the kernel argument 2090 is const qualified. Only present 2091 if ".value_kind" is 2092 "global_buffer". 2093 2094 ".is_restrict" boolean Indicates if the kernel argument 2095 is restrict qualified. Only 2096 present if ".value_kind" is 2097 "global_buffer". 2098 2099 ".is_volatile" boolean Indicates if the kernel argument 2100 is volatile qualified. Only 2101 present if ".value_kind" is 2102 "global_buffer". 2103 2104 ".is_pipe" boolean Indicates if the kernel argument 2105 is pipe qualified. Only present 2106 if ".value_kind" is "pipe". 2107 2108 .. TODO:: 2109 Can "global_buffer" be pipe 2110 qualified? 2111 ====================== ============== ========= ================================ 2112 2113.. 2114 2115Kernel Dispatch 2116~~~~~~~~~~~~~~~ 2117 2118The HSA architected queuing language (AQL) defines a user space memory 2119interface that can be used to control the dispatch of kernels, in an agent 2120independent way. An agent can have zero or more AQL queues created for it using 2121the ROCm runtime, in which AQL packets (all of which are 64 bytes) can be 2122placed. See the *HSA Platform System Architecture Specification* [HSA]_ for the 2123AQL queue mechanics and packet layouts. 2124 2125The packet processor of a kernel agent is responsible for detecting and 2126dispatching HSA kernels from the AQL queues associated with it. For AMD GPUs the 2127packet processor is implemented by the hardware command processor (CP), 2128asynchronous dispatch controller (ADC) and shader processor input controller 2129(SPI). 2130 2131The ROCm runtime can be used to allocate an AQL queue object. It uses the kernel 2132mode driver to initialize and register the AQL queue with CP. 2133 2134To dispatch a kernel the following actions are performed. This can occur in the 2135CPU host program, or from an HSA kernel executing on a GPU. 2136 21371. A pointer to an AQL queue for the kernel agent on which the kernel is to be 2138 executed is obtained. 21392. A pointer to the kernel descriptor (see 2140 :ref:`amdgpu-amdhsa-kernel-descriptor`) of the kernel to execute is obtained. 2141 It must be for a kernel that is contained in a code object that that was 2142 loaded by the ROCm runtime on the kernel agent with which the AQL queue is 2143 associated. 21443. Space is allocated for the kernel arguments using the ROCm runtime allocator 2145 for a memory region with the kernarg property for the kernel agent that will 2146 execute the kernel. It must be at least 16 byte aligned. 21474. Kernel argument values are assigned to the kernel argument memory 2148 allocation. The layout is defined in the *HSA Programmer's Language 2149 Reference* [HSA]_. For AMDGPU the kernel execution directly accesses the 2150 kernel argument memory in the same way constant memory is accessed. (Note 2151 that the HSA specification allows an implementation to copy the kernel 2152 argument contents to another location that is accessed by the kernel.) 21535. An AQL kernel dispatch packet is created on the AQL queue. The ROCm runtime 2154 api uses 64-bit atomic operations to reserve space in the AQL queue for the 2155 packet. The packet must be set up, and the final write must use an atomic 2156 store release to set the packet kind to ensure the packet contents are 2157 visible to the kernel agent. AQL defines a doorbell signal mechanism to 2158 notify the kernel agent that the AQL queue has been updated. These rules, and 2159 the layout of the AQL queue and kernel dispatch packet is defined in the *HSA 2160 System Architecture Specification* [HSA]_. 21616. A kernel dispatch packet includes information about the actual dispatch, 2162 such as grid and work-group size, together with information from the code 2163 object about the kernel, such as segment sizes. The ROCm runtime queries on 2164 the kernel symbol can be used to obtain the code object values which are 2165 recorded in the :ref:`amdgpu-amdhsa-code-object-metadata`. 21667. CP executes micro-code and is responsible for detecting and setting up the 2167 GPU to execute the wavefronts of a kernel dispatch. 21688. CP ensures that when the a wavefront starts executing the kernel machine 2169 code, the scalar general purpose registers (SGPR) and vector general purpose 2170 registers (VGPR) are set up as required by the machine code. The required 2171 setup is defined in the :ref:`amdgpu-amdhsa-kernel-descriptor`. The initial 2172 register state is defined in 2173 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`. 21749. The prolog of the kernel machine code (see 2175 :ref:`amdgpu-amdhsa-kernel-prolog`) sets up the machine state as necessary 2176 before continuing executing the machine code that corresponds to the kernel. 217710. When the kernel dispatch has completed execution, CP signals the completion 2178 signal specified in the kernel dispatch packet if not 0. 2179 2180Image and Samplers 2181~~~~~~~~~~~~~~~~~~ 2182 2183Image and sample handles created by the ROCm runtime are 64-bit addresses of a 2184hardware 32 byte V# and 48 byte S# object respectively. In order to support the 2185HSA ``query_sampler`` operations two extra dwords are used to store the HSA BRIG 2186enumeration values for the queries that are not trivially deducible from the S# 2187representation. 2188 2189HSA Signals 2190~~~~~~~~~~~ 2191 2192HSA signal handles created by the ROCm runtime are 64-bit addresses of a 2193structure allocated in memory accessible from both the CPU and GPU. The 2194structure is defined by the ROCm runtime and subject to change between releases 2195(see [AMD-ROCm-github]_). 2196 2197.. _amdgpu-amdhsa-hsa-aql-queue: 2198 2199HSA AQL Queue 2200~~~~~~~~~~~~~ 2201 2202The HSA AQL queue structure is defined by the ROCm runtime and subject to change 2203between releases (see [AMD-ROCm-github]_). For some processors it contains 2204fields needed to implement certain language features such as the flat address 2205aperture bases. It also contains fields used by CP such as managing the 2206allocation of scratch memory. 2207 2208.. _amdgpu-amdhsa-kernel-descriptor: 2209 2210Kernel Descriptor 2211~~~~~~~~~~~~~~~~~ 2212 2213A kernel descriptor consists of the information needed by CP to initiate the 2214execution of a kernel, including the entry point address of the machine code 2215that implements the kernel. 2216 2217Kernel Descriptor for GFX6-GFX10 2218++++++++++++++++++++++++++++++++ 2219 2220CP microcode requires the Kernel descriptor to be allocated on 64 byte 2221alignment. 2222 2223 .. table:: Kernel Descriptor for GFX6-GFX10 2224 :name: amdgpu-amdhsa-kernel-descriptor-gfx6-gfx10-table 2225 2226 ======= ======= =============================== ============================ 2227 Bits Size Field Name Description 2228 ======= ======= =============================== ============================ 2229 31:0 4 bytes GROUP_SEGMENT_FIXED_SIZE The amount of fixed local 2230 address space memory 2231 required for a work-group 2232 in bytes. This does not 2233 include any dynamically 2234 allocated local address 2235 space memory that may be 2236 added when the kernel is 2237 dispatched. 2238 63:32 4 bytes PRIVATE_SEGMENT_FIXED_SIZE The amount of fixed 2239 private address space 2240 memory required for a 2241 work-item in bytes. If 2242 is_dynamic_callstack is 1 2243 then additional space must 2244 be added to this value for 2245 the call stack. 2246 127:64 8 bytes Reserved, must be 0. 2247 191:128 8 bytes KERNEL_CODE_ENTRY_BYTE_OFFSET Byte offset (possibly 2248 negative) from base 2249 address of kernel 2250 descriptor to kernel's 2251 entry point instruction 2252 which must be 256 byte 2253 aligned. 2254 351:272 20 Reserved, must be 0. 2255 bytes 2256 383:352 4 bytes COMPUTE_PGM_RSRC3 GFX6-9 2257 Reserved, must be 0. 2258 GFX10 2259 Compute Shader (CS) 2260 program settings used by 2261 CP to set up 2262 ``COMPUTE_PGM_RSRC3`` 2263 configuration 2264 register. See 2265 :ref:`amdgpu-amdhsa-compute_pgm_rsrc3-gfx10-table`. 2266 415:384 4 bytes COMPUTE_PGM_RSRC1 Compute Shader (CS) 2267 program settings used by 2268 CP to set up 2269 ``COMPUTE_PGM_RSRC1`` 2270 configuration 2271 register. See 2272 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`. 2273 447:416 4 bytes COMPUTE_PGM_RSRC2 Compute Shader (CS) 2274 program settings used by 2275 CP to set up 2276 ``COMPUTE_PGM_RSRC2`` 2277 configuration 2278 register. See 2279 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`. 2280 448 1 bit ENABLE_SGPR_PRIVATE_SEGMENT Enable the setup of the 2281 _BUFFER SGPR user data registers 2282 (see 2283 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). 2284 2285 The total number of SGPR 2286 user data registers 2287 requested must not exceed 2288 16 and match value in 2289 ``compute_pgm_rsrc2.user_sgpr.user_sgpr_count``. 2290 Any requests beyond 16 2291 will be ignored. 2292 449 1 bit ENABLE_SGPR_DISPATCH_PTR *see above* 2293 450 1 bit ENABLE_SGPR_QUEUE_PTR *see above* 2294 451 1 bit ENABLE_SGPR_KERNARG_SEGMENT_PTR *see above* 2295 452 1 bit ENABLE_SGPR_DISPATCH_ID *see above* 2296 453 1 bit ENABLE_SGPR_FLAT_SCRATCH_INIT *see above* 2297 454 1 bit ENABLE_SGPR_PRIVATE_SEGMENT *see above* 2298 _SIZE 2299 457:455 3 bits Reserved, must be 0. 2300 458 1 bit ENABLE_WAVEFRONT_SIZE32 GFX6-9 2301 Reserved, must be 0. 2302 GFX10 2303 - If 0 execute in 2304 wavefront size 64 mode. 2305 - If 1 execute in 2306 native wavefront size 2307 32 mode. 2308 463:459 5 bits Reserved, must be 0. 2309 511:464 6 bytes Reserved, must be 0. 2310 512 **Total size 64 bytes.** 2311 ======= ==================================================================== 2312 2313.. 2314 2315 .. table:: compute_pgm_rsrc1 for GFX6-GFX10 2316 :name: amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table 2317 2318 ======= ======= =============================== =========================================================================== 2319 Bits Size Field Name Description 2320 ======= ======= =============================== =========================================================================== 2321 5:0 6 bits GRANULATED_WORKITEM_VGPR_COUNT Number of vector register 2322 blocks used by each work-item; 2323 granularity is device 2324 specific: 2325 2326 GFX6-GFX9 2327 - vgprs_used 0..256 2328 - max(0, ceil(vgprs_used / 4) - 1) 2329 GFX10 (wavefront size 64) 2330 - max_vgpr 1..256 2331 - max(0, ceil(vgprs_used / 4) - 1) 2332 GFX10 (wavefront size 32) 2333 - max_vgpr 1..256 2334 - max(0, ceil(vgprs_used / 8) - 1) 2335 2336 Where vgprs_used is defined 2337 as the highest VGPR number 2338 explicitly referenced plus 2339 one. 2340 2341 Used by CP to set up 2342 ``COMPUTE_PGM_RSRC1.VGPRS``. 2343 2344 The 2345 :ref:`amdgpu-assembler` 2346 calculates this 2347 automatically for the 2348 selected processor from 2349 values provided to the 2350 `.amdhsa_kernel` directive 2351 by the 2352 `.amdhsa_next_free_vgpr` 2353 nested directive (see 2354 :ref:`amdhsa-kernel-directives-table`). 2355 9:6 4 bits GRANULATED_WAVEFRONT_SGPR_COUNT Number of scalar register 2356 blocks used by a wavefront; 2357 granularity is device 2358 specific: 2359 2360 GFX6-GFX8 2361 - sgprs_used 0..112 2362 - max(0, ceil(sgprs_used / 8) - 1) 2363 GFX9 2364 - sgprs_used 0..112 2365 - 2 * max(0, ceil(sgprs_used / 16) - 1) 2366 GFX10 2367 Reserved, must be 0. 2368 (128 SGPRs always 2369 allocated.) 2370 2371 Where sgprs_used is 2372 defined as the highest 2373 SGPR number explicitly 2374 referenced plus one, plus 2375 a target-specific number 2376 of additional special 2377 SGPRs for VCC, 2378 FLAT_SCRATCH (GFX7+) and 2379 XNACK_MASK (GFX8+), and 2380 any additional 2381 target-specific 2382 limitations. It does not 2383 include the 16 SGPRs added 2384 if a trap handler is 2385 enabled. 2386 2387 The target-specific 2388 limitations and special 2389 SGPR layout are defined in 2390 the hardware 2391 documentation, which can 2392 be found in the 2393 :ref:`amdgpu-processors` 2394 table. 2395 2396 Used by CP to set up 2397 ``COMPUTE_PGM_RSRC1.SGPRS``. 2398 2399 The 2400 :ref:`amdgpu-assembler` 2401 calculates this 2402 automatically for the 2403 selected processor from 2404 values provided to the 2405 `.amdhsa_kernel` directive 2406 by the 2407 `.amdhsa_next_free_sgpr` 2408 and `.amdhsa_reserve_*` 2409 nested directives (see 2410 :ref:`amdhsa-kernel-directives-table`). 2411 11:10 2 bits PRIORITY Must be 0. 2412 2413 Start executing wavefront 2414 at the specified priority. 2415 2416 CP is responsible for 2417 filling in 2418 ``COMPUTE_PGM_RSRC1.PRIORITY``. 2419 13:12 2 bits FLOAT_ROUND_MODE_32 Wavefront starts execution 2420 with specified rounding 2421 mode for single (32 2422 bit) floating point 2423 precision floating point 2424 operations. 2425 2426 Floating point rounding 2427 mode values are defined in 2428 :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`. 2429 2430 Used by CP to set up 2431 ``COMPUTE_PGM_RSRC1.FLOAT_MODE``. 2432 15:14 2 bits FLOAT_ROUND_MODE_16_64 Wavefront starts execution 2433 with specified rounding 2434 denorm mode for half/double (16 2435 and 64-bit) floating point 2436 precision floating point 2437 operations. 2438 2439 Floating point rounding 2440 mode values are defined in 2441 :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`. 2442 2443 Used by CP to set up 2444 ``COMPUTE_PGM_RSRC1.FLOAT_MODE``. 2445 17:16 2 bits FLOAT_DENORM_MODE_32 Wavefront starts execution 2446 with specified denorm mode 2447 for single (32 2448 bit) floating point 2449 precision floating point 2450 operations. 2451 2452 Floating point denorm mode 2453 values are defined in 2454 :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`. 2455 2456 Used by CP to set up 2457 ``COMPUTE_PGM_RSRC1.FLOAT_MODE``. 2458 19:18 2 bits FLOAT_DENORM_MODE_16_64 Wavefront starts execution 2459 with specified denorm mode 2460 for half/double (16 2461 and 64-bit) floating point 2462 precision floating point 2463 operations. 2464 2465 Floating point denorm mode 2466 values are defined in 2467 :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`. 2468 2469 Used by CP to set up 2470 ``COMPUTE_PGM_RSRC1.FLOAT_MODE``. 2471 20 1 bit PRIV Must be 0. 2472 2473 Start executing wavefront 2474 in privilege trap handler 2475 mode. 2476 2477 CP is responsible for 2478 filling in 2479 ``COMPUTE_PGM_RSRC1.PRIV``. 2480 21 1 bit ENABLE_DX10_CLAMP Wavefront starts execution 2481 with DX10 clamp mode 2482 enabled. Used by the vector 2483 ALU to force DX10 style 2484 treatment of NaN's (when 2485 set, clamp NaN to zero, 2486 otherwise pass NaN 2487 through). 2488 2489 Used by CP to set up 2490 ``COMPUTE_PGM_RSRC1.DX10_CLAMP``. 2491 22 1 bit DEBUG_MODE Must be 0. 2492 2493 Start executing wavefront 2494 in single step mode. 2495 2496 CP is responsible for 2497 filling in 2498 ``COMPUTE_PGM_RSRC1.DEBUG_MODE``. 2499 23 1 bit ENABLE_IEEE_MODE Wavefront starts execution 2500 with IEEE mode 2501 enabled. Floating point 2502 opcodes that support 2503 exception flag gathering 2504 will quiet and propagate 2505 signaling-NaN inputs per 2506 IEEE 754-2008. Min_dx10 and 2507 max_dx10 become IEEE 2508 754-2008 compliant due to 2509 signaling-NaN propagation 2510 and quieting. 2511 2512 Used by CP to set up 2513 ``COMPUTE_PGM_RSRC1.IEEE_MODE``. 2514 24 1 bit BULKY Must be 0. 2515 2516 Only one work-group allowed 2517 to execute on a compute 2518 unit. 2519 2520 CP is responsible for 2521 filling in 2522 ``COMPUTE_PGM_RSRC1.BULKY``. 2523 25 1 bit CDBG_USER Must be 0. 2524 2525 Flag that can be used to 2526 control debugging code. 2527 2528 CP is responsible for 2529 filling in 2530 ``COMPUTE_PGM_RSRC1.CDBG_USER``. 2531 26 1 bit FP16_OVFL GFX6-GFX8 2532 Reserved, must be 0. 2533 GFX9-GFX10 2534 Wavefront starts execution 2535 with specified fp16 overflow 2536 mode. 2537 2538 - If 0, fp16 overflow generates 2539 +/-INF values. 2540 - If 1, fp16 overflow that is the 2541 result of an +/-INF input value 2542 or divide by 0 produces a +/-INF, 2543 otherwise clamps computed 2544 overflow to +/-MAX_FP16 as 2545 appropriate. 2546 2547 Used by CP to set up 2548 ``COMPUTE_PGM_RSRC1.FP16_OVFL``. 2549 28:27 2 bits Reserved, must be 0. 2550 29 1 bit WGP_MODE GFX6-GFX9 2551 Reserved, must be 0. 2552 GFX10 2553 - If 0 execute work-groups in 2554 CU wavefront execution mode. 2555 - If 1 execute work-groups on 2556 in WGP wavefront execution mode. 2557 2558 See :ref:`amdgpu-amdhsa-memory-model`. 2559 2560 Used by CP to set up 2561 ``COMPUTE_PGM_RSRC1.WGP_MODE``. 2562 30 1 bit MEM_ORDERED GFX6-9 2563 Reserved, must be 0. 2564 GFX10 2565 Controls the behavior of the 2566 waitcnt's vmcnt and vscnt 2567 counters. 2568 2569 - If 0 vmcnt reports completion 2570 of load and atomic with return 2571 out of order with sample 2572 instructions, and the vscnt 2573 reports the completion of 2574 store and atomic without 2575 return in order. 2576 - If 1 vmcnt reports completion 2577 of load, atomic with return 2578 and sample instructions in 2579 order, and the vscnt reports 2580 the completion of store and 2581 atomic without return in order. 2582 2583 Used by CP to set up 2584 ``COMPUTE_PGM_RSRC1.MEM_ORDERED``. 2585 31 1 bit FWD_PROGRESS GFX6-9 2586 Reserved, must be 0. 2587 GFX10 2588 - If 0 execute SIMD wavefronts 2589 using oldest first policy. 2590 - If 1 execute SIMD wavefronts to 2591 ensure wavefronts will make some 2592 forward progress. 2593 2594 Used by CP to set up 2595 ``COMPUTE_PGM_RSRC1.FWD_PROGRESS``. 2596 32 **Total size 4 bytes** 2597 ======= =================================================================================================================== 2598 2599.. 2600 2601 .. table:: compute_pgm_rsrc2 for GFX6-GFX10 2602 :name: amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table 2603 2604 ======= ======= =============================== =========================================================================== 2605 Bits Size Field Name Description 2606 ======= ======= =============================== =========================================================================== 2607 0 1 bit ENABLE_SGPR_PRIVATE_SEGMENT Enable the setup of the 2608 _WAVEFRONT_OFFSET SGPR wavefront scratch offset 2609 system register (see 2610 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). 2611 2612 Used by CP to set up 2613 ``COMPUTE_PGM_RSRC2.SCRATCH_EN``. 2614 5:1 5 bits USER_SGPR_COUNT The total number of SGPR 2615 user data registers 2616 requested. This number must 2617 match the number of user 2618 data registers enabled. 2619 2620 Used by CP to set up 2621 ``COMPUTE_PGM_RSRC2.USER_SGPR``. 2622 6 1 bit ENABLE_TRAP_HANDLER Must be 0. 2623 2624 This bit represents 2625 ``COMPUTE_PGM_RSRC2.TRAP_PRESENT``, 2626 which is set by the CP if 2627 the runtime has installed a 2628 trap handler. 2629 7 1 bit ENABLE_SGPR_WORKGROUP_ID_X Enable the setup of the 2630 system SGPR register for 2631 the work-group id in the X 2632 dimension (see 2633 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). 2634 2635 Used by CP to set up 2636 ``COMPUTE_PGM_RSRC2.TGID_X_EN``. 2637 8 1 bit ENABLE_SGPR_WORKGROUP_ID_Y Enable the setup of the 2638 system SGPR register for 2639 the work-group id in the Y 2640 dimension (see 2641 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). 2642 2643 Used by CP to set up 2644 ``COMPUTE_PGM_RSRC2.TGID_Y_EN``. 2645 9 1 bit ENABLE_SGPR_WORKGROUP_ID_Z Enable the setup of the 2646 system SGPR register for 2647 the work-group id in the Z 2648 dimension (see 2649 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). 2650 2651 Used by CP to set up 2652 ``COMPUTE_PGM_RSRC2.TGID_Z_EN``. 2653 10 1 bit ENABLE_SGPR_WORKGROUP_INFO Enable the setup of the 2654 system SGPR register for 2655 work-group information (see 2656 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). 2657 2658 Used by CP to set up 2659 ``COMPUTE_PGM_RSRC2.TGID_SIZE_EN``. 2660 12:11 2 bits ENABLE_VGPR_WORKITEM_ID Enable the setup of the 2661 VGPR system registers used 2662 for the work-item ID. 2663 :ref:`amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table` 2664 defines the values. 2665 2666 Used by CP to set up 2667 ``COMPUTE_PGM_RSRC2.TIDIG_CMP_CNT``. 2668 13 1 bit ENABLE_EXCEPTION_ADDRESS_WATCH Must be 0. 2669 2670 Wavefront starts execution 2671 with address watch 2672 exceptions enabled which 2673 are generated when L1 has 2674 witnessed a thread access 2675 an *address of 2676 interest*. 2677 2678 CP is responsible for 2679 filling in the address 2680 watch bit in 2681 ``COMPUTE_PGM_RSRC2.EXCP_EN_MSB`` 2682 according to what the 2683 runtime requests. 2684 14 1 bit ENABLE_EXCEPTION_MEMORY Must be 0. 2685 2686 Wavefront starts execution 2687 with memory violation 2688 exceptions exceptions 2689 enabled which are generated 2690 when a memory violation has 2691 occurred for this wavefront from 2692 L1 or LDS 2693 (write-to-read-only-memory, 2694 mis-aligned atomic, LDS 2695 address out of range, 2696 illegal address, etc.). 2697 2698 CP sets the memory 2699 violation bit in 2700 ``COMPUTE_PGM_RSRC2.EXCP_EN_MSB`` 2701 according to what the 2702 runtime requests. 2703 23:15 9 bits GRANULATED_LDS_SIZE Must be 0. 2704 2705 CP uses the rounded value 2706 from the dispatch packet, 2707 not this value, as the 2708 dispatch may contain 2709 dynamically allocated group 2710 segment memory. CP writes 2711 directly to 2712 ``COMPUTE_PGM_RSRC2.LDS_SIZE``. 2713 2714 Amount of group segment 2715 (LDS) to allocate for each 2716 work-group. Granularity is 2717 device specific: 2718 2719 GFX6: 2720 roundup(lds-size / (64 * 4)) 2721 GFX7-GFX10: 2722 roundup(lds-size / (128 * 4)) 2723 2724 24 1 bit ENABLE_EXCEPTION_IEEE_754_FP Wavefront starts execution 2725 _INVALID_OPERATION with specified exceptions 2726 enabled. 2727 2728 Used by CP to set up 2729 ``COMPUTE_PGM_RSRC2.EXCP_EN`` 2730 (set from bits 0..6). 2731 2732 IEEE 754 FP Invalid 2733 Operation 2734 25 1 bit ENABLE_EXCEPTION_FP_DENORMAL FP Denormal one or more 2735 _SOURCE input operands is a 2736 denormal number 2737 26 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP Division by 2738 _DIVISION_BY_ZERO Zero 2739 27 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP FP Overflow 2740 _OVERFLOW 2741 28 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP Underflow 2742 _UNDERFLOW 2743 29 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP Inexact 2744 _INEXACT 2745 30 1 bit ENABLE_EXCEPTION_INT_DIVIDE_BY Integer Division by Zero 2746 _ZERO (rcp_iflag_f32 instruction 2747 only) 2748 31 1 bit Reserved, must be 0. 2749 32 **Total size 4 bytes.** 2750 ======= =================================================================================================================== 2751 2752.. 2753 2754 .. table:: compute_pgm_rsrc3 for GFX10 2755 :name: amdgpu-amdhsa-compute_pgm_rsrc3-gfx10-table 2756 2757 ======= ======= =============================== =========================================================================== 2758 Bits Size Field Name Description 2759 ======= ======= =============================== =========================================================================== 2760 3:0 4 bits SHARED_VGPR_COUNT Number of shared VGPRs for wavefront size 64. Granularity 8. Value 0-120. 2761 compute_pgm_rsrc1.vgprs + shared_vgpr_cnt cannot exceed 64. 2762 31:4 28 Reserved, must be 0. 2763 bits 2764 32 **Total size 4 bytes.** 2765 ======= =================================================================================================================== 2766 2767.. 2768 2769 .. table:: Floating Point Rounding Mode Enumeration Values 2770 :name: amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table 2771 2772 ====================================== ===== ============================== 2773 Enumeration Name Value Description 2774 ====================================== ===== ============================== 2775 FLOAT_ROUND_MODE_NEAR_EVEN 0 Round Ties To Even 2776 FLOAT_ROUND_MODE_PLUS_INFINITY 1 Round Toward +infinity 2777 FLOAT_ROUND_MODE_MINUS_INFINITY 2 Round Toward -infinity 2778 FLOAT_ROUND_MODE_ZERO 3 Round Toward 0 2779 ====================================== ===== ============================== 2780 2781.. 2782 2783 .. table:: Floating Point Denorm Mode Enumeration Values 2784 :name: amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table 2785 2786 ====================================== ===== ============================== 2787 Enumeration Name Value Description 2788 ====================================== ===== ============================== 2789 FLOAT_DENORM_MODE_FLUSH_SRC_DST 0 Flush Source and Destination 2790 Denorms 2791 FLOAT_DENORM_MODE_FLUSH_DST 1 Flush Output Denorms 2792 FLOAT_DENORM_MODE_FLUSH_SRC 2 Flush Source Denorms 2793 FLOAT_DENORM_MODE_FLUSH_NONE 3 No Flush 2794 ====================================== ===== ============================== 2795 2796.. 2797 2798 .. table:: System VGPR Work-Item ID Enumeration Values 2799 :name: amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table 2800 2801 ======================================== ===== ============================ 2802 Enumeration Name Value Description 2803 ======================================== ===== ============================ 2804 SYSTEM_VGPR_WORKITEM_ID_X 0 Set work-item X dimension 2805 ID. 2806 SYSTEM_VGPR_WORKITEM_ID_X_Y 1 Set work-item X and Y 2807 dimensions ID. 2808 SYSTEM_VGPR_WORKITEM_ID_X_Y_Z 2 Set work-item X, Y and Z 2809 dimensions ID. 2810 SYSTEM_VGPR_WORKITEM_ID_UNDEFINED 3 Undefined. 2811 ======================================== ===== ============================ 2812 2813.. _amdgpu-amdhsa-initial-kernel-execution-state: 2814 2815Initial Kernel Execution State 2816~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 2817 2818This section defines the register state that will be set up by the packet 2819processor prior to the start of execution of every wavefront. This is limited by 2820the constraints of the hardware controllers of CP/ADC/SPI. 2821 2822The order of the SGPR registers is defined, but the compiler can specify which 2823ones are actually setup in the kernel descriptor using the ``enable_sgpr_*`` bit 2824fields (see :ref:`amdgpu-amdhsa-kernel-descriptor`). The register numbers used 2825for enabled registers are dense starting at SGPR0: the first enabled register is 2826SGPR0, the next enabled register is SGPR1 etc.; disabled registers do not have 2827an SGPR number. 2828 2829The initial SGPRs comprise up to 16 User SRGPs that are set by CP and apply to 2830all wavefronts of the grid. It is possible to specify more than 16 User SGPRs 2831using the ``enable_sgpr_*`` bit fields, in which case only the first 16 are 2832actually initialized. These are then immediately followed by the System SGPRs 2833that are set up by ADC/SPI and can have different values for each wavefront of 2834the grid dispatch. 2835 2836SGPR register initial state is defined in 2837:ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`. 2838 2839 .. table:: SGPR Register Set Up Order 2840 :name: amdgpu-amdhsa-sgpr-register-set-up-order-table 2841 2842 ========== ========================== ====== ============================== 2843 SGPR Order Name Number Description 2844 (kernel descriptor enable of 2845 field) SGPRs 2846 ========== ========================== ====== ============================== 2847 First Private Segment Buffer 4 V# that can be used, together 2848 (enable_sgpr_private with Scratch Wavefront Offset 2849 _segment_buffer) as an offset, to access the 2850 private address space using a 2851 segment address. 2852 2853 CP uses the value provided by 2854 the runtime. 2855 then Dispatch Ptr 2 64-bit address of AQL dispatch 2856 (enable_sgpr_dispatch_ptr) packet for kernel dispatch 2857 actually executing. 2858 then Queue Ptr 2 64-bit address of amd_queue_t 2859 (enable_sgpr_queue_ptr) object for AQL queue on which 2860 the dispatch packet was 2861 queued. 2862 then Kernarg Segment Ptr 2 64-bit address of Kernarg 2863 (enable_sgpr_kernarg segment. This is directly 2864 _segment_ptr) copied from the 2865 kernarg_address in the kernel 2866 dispatch packet. 2867 2868 Having CP load it once avoids 2869 loading it at the beginning of 2870 every wavefront. 2871 then Dispatch Id 2 64-bit Dispatch ID of the 2872 (enable_sgpr_dispatch_id) dispatch packet being 2873 executed. 2874 then Flat Scratch Init 2 This is 2 SGPRs: 2875 (enable_sgpr_flat_scratch 2876 _init) GFX6 2877 Not supported. 2878 GFX7-GFX8 2879 The first SGPR is a 32-bit 2880 byte offset from 2881 ``SH_HIDDEN_PRIVATE_BASE_VIMID`` 2882 to per SPI base of memory 2883 for scratch for the queue 2884 executing the kernel 2885 dispatch. CP obtains this 2886 from the runtime. (The 2887 Scratch Segment Buffer base 2888 address is 2889 ``SH_HIDDEN_PRIVATE_BASE_VIMID`` 2890 plus this offset.) The value 2891 of Scratch Wavefront Offset must 2892 be added to this offset by 2893 the kernel machine code, 2894 right shifted by 8, and 2895 moved to the FLAT_SCRATCH_HI 2896 SGPR register. 2897 FLAT_SCRATCH_HI corresponds 2898 to SGPRn-4 on GFX7, and 2899 SGPRn-6 on GFX8 (where SGPRn 2900 is the highest numbered SGPR 2901 allocated to the wavefront). 2902 FLAT_SCRATCH_HI is 2903 multiplied by 256 (as it is 2904 in units of 256 bytes) and 2905 added to 2906 ``SH_HIDDEN_PRIVATE_BASE_VIMID`` 2907 to calculate the per wavefront 2908 FLAT SCRATCH BASE in flat 2909 memory instructions that 2910 access the scratch 2911 aperture. 2912 2913 The second SGPR is 32-bit 2914 byte size of a single 2915 work-item's scratch memory 2916 usage. CP obtains this from 2917 the runtime, and it is 2918 always a multiple of DWORD. 2919 CP checks that the value in 2920 the kernel dispatch packet 2921 Private Segment Byte Size is 2922 not larger, and requests the 2923 runtime to increase the 2924 queue's scratch size if 2925 necessary. The kernel code 2926 must move it to 2927 FLAT_SCRATCH_LO which is 2928 SGPRn-3 on GFX7 and SGPRn-5 2929 on GFX8. FLAT_SCRATCH_LO is 2930 used as the FLAT SCRATCH 2931 SIZE in flat memory 2932 instructions. Having CP load 2933 it once avoids loading it at 2934 the beginning of every 2935 wavefront. 2936 GFX9-GFX10 2937 This is the 2938 64-bit base address of the 2939 per SPI scratch backing 2940 memory managed by SPI for 2941 the queue executing the 2942 kernel dispatch. CP obtains 2943 this from the runtime (and 2944 divides it if there are 2945 multiple Shader Arrays each 2946 with its own SPI). The value 2947 of Scratch Wavefront Offset must 2948 be added by the kernel 2949 machine code and the result 2950 moved to the FLAT_SCRATCH 2951 SGPR which is SGPRn-6 and 2952 SGPRn-5. It is used as the 2953 FLAT SCRATCH BASE in flat 2954 memory instructions. 2955 then Private Segment Size 1 The 32-bit byte size of a 2956 (enable_sgpr_private single 2957 work-item's 2958 scratch_segment_size) memory 2959 allocation. This is the 2960 value from the kernel 2961 dispatch packet Private 2962 Segment Byte Size rounded up 2963 by CP to a multiple of 2964 DWORD. 2965 2966 Having CP load it once avoids 2967 loading it at the beginning of 2968 every wavefront. 2969 2970 This is not used for 2971 GFX7-GFX8 since it is the same 2972 value as the second SGPR of 2973 Flat Scratch Init. However, it 2974 may be needed for GFX9-GFX10 which 2975 changes the meaning of the 2976 Flat Scratch Init value. 2977 then Grid Work-Group Count X 1 32-bit count of the number of 2978 (enable_sgpr_grid work-groups in the X dimension 2979 _workgroup_count_X) for the grid being 2980 executed. Computed from the 2981 fields in the kernel dispatch 2982 packet as ((grid_size.x + 2983 workgroup_size.x - 1) / 2984 workgroup_size.x). 2985 then Grid Work-Group Count Y 1 32-bit count of the number of 2986 (enable_sgpr_grid work-groups in the Y dimension 2987 _workgroup_count_Y && for the grid being 2988 less than 16 previous executed. Computed from the 2989 SGPRs) fields in the kernel dispatch 2990 packet as ((grid_size.y + 2991 workgroup_size.y - 1) / 2992 workgroupSize.y). 2993 2994 Only initialized if <16 2995 previous SGPRs initialized. 2996 then Grid Work-Group Count Z 1 32-bit count of the number of 2997 (enable_sgpr_grid work-groups in the Z dimension 2998 _workgroup_count_Z && for the grid being 2999 less than 16 previous executed. Computed from the 3000 SGPRs) fields in the kernel dispatch 3001 packet as ((grid_size.z + 3002 workgroup_size.z - 1) / 3003 workgroupSize.z). 3004 3005 Only initialized if <16 3006 previous SGPRs initialized. 3007 then Work-Group Id X 1 32-bit work-group id in X 3008 (enable_sgpr_workgroup_id dimension of grid for 3009 _X) wavefront. 3010 then Work-Group Id Y 1 32-bit work-group id in Y 3011 (enable_sgpr_workgroup_id dimension of grid for 3012 _Y) wavefront. 3013 then Work-Group Id Z 1 32-bit work-group id in Z 3014 (enable_sgpr_workgroup_id dimension of grid for 3015 _Z) wavefront. 3016 then Work-Group Info 1 {first_wavefront, 14'b0000, 3017 (enable_sgpr_workgroup ordered_append_term[10:0], 3018 _info) threadgroup_size_in_wavefronts[5:0]} 3019 then Scratch Wavefront Offset 1 32-bit byte offset from base 3020 (enable_sgpr_private of scratch base of queue 3021 _segment_wavefront_offset) executing the kernel 3022 dispatch. Must be used as an 3023 offset with Private 3024 segment address when using 3025 Scratch Segment Buffer. It 3026 must be used to set up FLAT 3027 SCRATCH for flat addressing 3028 (see 3029 :ref:`amdgpu-amdhsa-flat-scratch`). 3030 ========== ========================== ====== ============================== 3031 3032The order of the VGPR registers is defined, but the compiler can specify which 3033ones are actually setup in the kernel descriptor using the ``enable_vgpr*`` bit 3034fields (see :ref:`amdgpu-amdhsa-kernel-descriptor`). The register numbers used 3035for enabled registers are dense starting at VGPR0: the first enabled register is 3036VGPR0, the next enabled register is VGPR1 etc.; disabled registers do not have a 3037VGPR number. 3038 3039VGPR register initial state is defined in 3040:ref:`amdgpu-amdhsa-vgpr-register-set-up-order-table`. 3041 3042 .. table:: VGPR Register Set Up Order 3043 :name: amdgpu-amdhsa-vgpr-register-set-up-order-table 3044 3045 ========== ========================== ====== ============================== 3046 VGPR Order Name Number Description 3047 (kernel descriptor enable of 3048 field) VGPRs 3049 ========== ========================== ====== ============================== 3050 First Work-Item Id X 1 32-bit work item id in X 3051 (Always initialized) dimension of work-group for 3052 wavefront lane. 3053 then Work-Item Id Y 1 32-bit work item id in Y 3054 (enable_vgpr_workitem_id dimension of work-group for 3055 > 0) wavefront lane. 3056 then Work-Item Id Z 1 32-bit work item id in Z 3057 (enable_vgpr_workitem_id dimension of work-group for 3058 > 1) wavefront lane. 3059 ========== ========================== ====== ============================== 3060 3061The setting of registers is done by GPU CP/ADC/SPI hardware as follows: 3062 30631. SGPRs before the Work-Group Ids are set by CP using the 16 User Data 3064 registers. 30652. Work-group Id registers X, Y, Z are set by ADC which supports any 3066 combination including none. 30673. Scratch Wavefront Offset is set by SPI in a per wavefront basis which is why 3068 its value cannot included with the flat scratch init value which is per 3069 queue. 30704. The VGPRs are set by SPI which only supports specifying either (X), (X, Y) 3071 or (X, Y, Z). 3072 3073Flat Scratch register pair are adjacent SGRRs so they can be moved as a 64-bit 3074value to the hardware required SGPRn-3 and SGPRn-4 respectively. 3075 3076The global segment can be accessed either using buffer instructions (GFX6 which 3077has V# 64-bit address support), flat instructions (GFX7-GFX10), or global 3078instructions (GFX9-GFX10). 3079 3080If buffer operations are used then the compiler can generate a V# with the 3081following properties: 3082 3083* base address of 0 3084* no swizzle 3085* ATC: 1 if IOMMU present (such as APU) 3086* ptr64: 1 3087* MTYPE set to support memory coherence that matches the runtime (such as CC for 3088 APU and NC for dGPU). 3089 3090.. _amdgpu-amdhsa-kernel-prolog: 3091 3092Kernel Prolog 3093~~~~~~~~~~~~~ 3094 3095.. _amdgpu-amdhsa-m0: 3096 3097M0 3098++ 3099 3100GFX6-GFX8 3101 The M0 register must be initialized with a value at least the total LDS size 3102 if the kernel may access LDS via DS or flat operations. Total LDS size is 3103 available in dispatch packet. For M0, it is also possible to use maximum 3104 possible value of LDS for given target (0x7FFF for GFX6 and 0xFFFF for 3105 GFX7-GFX8). 3106GFX9-GFX10 3107 The M0 register is not used for range checking LDS accesses and so does not 3108 need to be initialized in the prolog. 3109 3110.. _amdgpu-amdhsa-flat-scratch: 3111 3112Flat Scratch 3113++++++++++++ 3114 3115If the kernel may use flat operations to access scratch memory, the prolog code 3116must set up FLAT_SCRATCH register pair (FLAT_SCRATCH_LO/FLAT_SCRATCH_HI which 3117are in SGPRn-4/SGPRn-3). Initialization uses Flat Scratch Init and Scratch 3118Wavefront Offset SGPR registers (see 3119:ref:`amdgpu-amdhsa-initial-kernel-execution-state`): 3120 3121GFX6 3122 Flat scratch is not supported. 3123 3124GFX7-GFX8 3125 3126 1. The low word of Flat Scratch Init is 32-bit byte offset from 3127 ``SH_HIDDEN_PRIVATE_BASE_VIMID`` to the base of scratch backing memory 3128 being managed by SPI for the queue executing the kernel dispatch. This is 3129 the same value used in the Scratch Segment Buffer V# base address. The 3130 prolog must add the value of Scratch Wavefront Offset to get the 3131 wavefront's byte scratch backing memory offset from 3132 ``SH_HIDDEN_PRIVATE_BASE_VIMID``. Since FLAT_SCRATCH_LO is in units of 256 3133 bytes, the offset must be right shifted by 8 before moving into 3134 FLAT_SCRATCH_LO. 3135 2. The second word of Flat Scratch Init is 32-bit byte size of a single 3136 work-items scratch memory usage. This is directly loaded from the kernel 3137 dispatch packet Private Segment Byte Size and rounded up to a multiple of 3138 DWORD. Having CP load it once avoids loading it at the beginning of every 3139 wavefront. The prolog must move it to FLAT_SCRATCH_LO for use as FLAT 3140 SCRATCH SIZE. 3141 3142GFX9-GFX10 3143 The Flat Scratch Init is the 64-bit address of the base of scratch backing 3144 memory being managed by SPI for the queue executing the kernel dispatch. The 3145 prolog must add the value of Scratch Wavefront Offset and moved to the 3146 FLAT_SCRATCH pair for use as the flat scratch base in flat memory 3147 instructions. 3148 3149.. _amdgpu-amdhsa-memory-model: 3150 3151Memory Model 3152~~~~~~~~~~~~ 3153 3154This section describes the mapping of LLVM memory model onto AMDGPU machine code 3155(see :ref:`memmodel`). 3156 3157The AMDGPU backend supports the memory synchronization scopes specified in 3158:ref:`amdgpu-memory-scopes`. 3159 3160The code sequences used to implement the memory model are defined in table 3161:ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx10-table`. 3162 3163The sequences specify the order of instructions that a single thread must 3164execute. The ``s_waitcnt`` and ``buffer_wbinvl1_vol`` are defined with respect 3165to other memory instructions executed by the same thread. This allows them to be 3166moved earlier or later which can allow them to be combined with other instances 3167of the same instruction, or hoisted/sunk out of loops to improve 3168performance. Only the instructions related to the memory model are given; 3169additional ``s_waitcnt`` instructions are required to ensure registers are 3170defined before being used. These may be able to be combined with the memory 3171model ``s_waitcnt`` instructions as described above. 3172 3173The AMDGPU backend supports the following memory models: 3174 3175 HSA Memory Model [HSA]_ 3176 The HSA memory model uses a single happens-before relation for all address 3177 spaces (see :ref:`amdgpu-address-spaces`). 3178 OpenCL Memory Model [OpenCL]_ 3179 The OpenCL memory model which has separate happens-before relations for the 3180 global and local address spaces. Only a fence specifying both global and 3181 local address space, and seq_cst instructions join the relationships. Since 3182 the LLVM ``memfence`` instruction does not allow an address space to be 3183 specified the OpenCL fence has to conservatively assume both local and 3184 global address space was specified. However, optimizations can often be 3185 done to eliminate the additional ``s_waitcnt`` instructions when there are 3186 no intervening memory instructions which access the corresponding address 3187 space. The code sequences in the table indicate what can be omitted for the 3188 OpenCL memory. The target triple environment is used to determine if the 3189 source language is OpenCL (see :ref:`amdgpu-opencl`). 3190 3191``ds/flat_load/store/atomic`` instructions to local memory are termed LDS 3192operations. 3193 3194``buffer/global/flat_load/store/atomic`` instructions to global memory are 3195termed vector memory operations. 3196 3197For GFX6-GFX9: 3198 3199* Each agent has multiple shader arrays (SA). 3200* Each SA has multiple compute units (CU). 3201* Each CU has multiple SIMDs that execute wavefronts. 3202* The wavefronts for a single work-group are executed in the same CU but may be 3203 executed by different SIMDs. 3204* Each CU has a single LDS memory shared by the wavefronts of the work-groups 3205 executing on it. 3206* All LDS operations of a CU are performed as wavefront wide operations in a 3207 global order and involve no caching. Completion is reported to a wavefront in 3208 execution order. 3209* The LDS memory has multiple request queues shared by the SIMDs of a 3210 CU. Therefore, the LDS operations performed by different wavefronts of a 3211 work-group can be reordered relative to each other, which can result in 3212 reordering the visibility of vector memory operations with respect to LDS 3213 operations of other wavefronts in the same work-group. A ``s_waitcnt 3214 lgkmcnt(0)`` is required to ensure synchronization between LDS operations and 3215 vector memory operations between wavefronts of a work-group, but not between 3216 operations performed by the same wavefront. 3217* The vector memory operations are performed as wavefront wide operations and 3218 completion is reported to a wavefront in execution order. The exception is 3219 that for GFX7-GFX9 ``flat_load/store/atomic`` instructions can report out of 3220 vector memory order if they access LDS memory, and out of LDS operation order 3221 if they access global memory. 3222* The vector memory operations access a single vector L1 cache shared by all 3223 SIMDs a CU. Therefore, no special action is required for coherence between the 3224 lanes of a single wavefront, or for coherence between wavefronts in the same 3225 work-group. A ``buffer_wbinvl1_vol`` is required for coherence between 3226 wavefronts executing in different work-groups as they may be executing on 3227 different CUs. 3228* The scalar memory operations access a scalar L1 cache shared by all wavefronts 3229 on a group of CUs. The scalar and vector L1 caches are not coherent. However, 3230 scalar operations are used in a restricted way so do not impact the memory 3231 model. See :ref:`amdgpu-address-spaces`. 3232* The vector and scalar memory operations use an L2 cache shared by all CUs on 3233 the same agent. 3234* The L2 cache has independent channels to service disjoint ranges of virtual 3235 addresses. 3236* Each CU has a separate request queue per channel. Therefore, the vector and 3237 scalar memory operations performed by wavefronts executing in different 3238 work-groups (which may be executing on different CUs) of an agent can be 3239 reordered relative to each other. A ``s_waitcnt vmcnt(0)`` is required to 3240 ensure synchronization between vector memory operations of different CUs. It 3241 ensures a previous vector memory operation has completed before executing a 3242 subsequent vector memory or LDS operation and so can be used to meet the 3243 requirements of acquire and release. 3244* The L2 cache can be kept coherent with other agents on some targets, or ranges 3245 of virtual addresses can be set up to bypass it to ensure system coherence. 3246 3247For GFX10: 3248 3249* Each agent has multiple shader arrays (SA). 3250* Each SA has multiple work-group processors (WGP). 3251* Each WGP has multiple compute units (CU). 3252* Each CU has multiple SIMDs that execute wavefronts. 3253* The wavefronts for a single work-group are executed in the same 3254 WGP. In CU wavefront execution mode the wavefronts may be executed by 3255 different SIMDs in the same CU. In WGP wavefront execution mode the 3256 wavefronts may be executed by different SIMDs in different CUs in the same 3257 WGP. 3258* Each WGP has a single LDS memory shared by the wavefronts of the work-groups 3259 executing on it. 3260* All LDS operations of a WGP are performed as wavefront wide operations in a 3261 global order and involve no caching. Completion is reported to a wavefront in 3262 execution order. 3263* The LDS memory has multiple request queues shared by the SIMDs of a 3264 WGP. Therefore, the LDS operations performed by different wavefronts of a 3265 work-group can be reordered relative to each other, which can result in 3266 reordering the visibility of vector memory operations with respect to LDS 3267 operations of other wavefronts in the same work-group. A ``s_waitcnt 3268 lgkmcnt(0)`` is required to ensure synchronization between LDS operations and 3269 vector memory operations between wavefronts of a work-group, but not between 3270 operations performed by the same wavefront. 3271* The vector memory operations are performed as wavefront wide operations. 3272 Completion of load/store/sample operations are reported to a wavefront in 3273 execution order of other load/store/sample operations performed by that 3274 wavefront. 3275* The vector memory operations access a vector L0 cache. There is a single L0 3276 cache per CU. Each SIMD of a CU accesses the same L0 cache. Therefore, no 3277 special action is required for coherence between the lanes of a single 3278 wavefront. However, a ``BUFFER_GL0_INV`` is required for coherence between 3279 wavefronts executing in the same work-group as they may be executing on SIMDs 3280 of different CUs that access different L0s. A ``BUFFER_GL0_INV`` is also 3281 required for coherence between wavefronts executing in different work-groups 3282 as they may be executing on different WGPs. 3283* The scalar memory operations access a scalar L0 cache shared by all wavefronts 3284 on a WGP. The scalar and vector L0 caches are not coherent. However, scalar 3285 operations are used in a restricted way so do not impact the memory model. See 3286 :ref:`amdgpu-address-spaces`. 3287* The vector and scalar memory L0 caches use an L1 cache shared by all WGPs on 3288 the same SA. Therefore, no special action is required for coherence between 3289 the wavefronts of a single work-group. However, a ``BUFFER_GL1_INV`` is 3290 required for coherence between wavefronts executing in different work-groups 3291 as they may be executing on different SAs that access different L1s. 3292* The L1 caches have independent quadrants to service disjoint ranges of virtual 3293 addresses. 3294* Each L0 cache has a separate request queue per L1 quadrant. Therefore, the 3295 vector and scalar memory operations performed by different wavefronts, whether 3296 executing in the same or different work-groups (which may be executing on 3297 different CUs accessing different L0s), can be reordered relative to each 3298 other. A ``s_waitcnt vmcnt(0) & vscnt(0)`` is required to ensure 3299 synchronization between vector memory operations of different wavefronts. It 3300 ensures a previous vector memory operation has completed before executing a 3301 subsequent vector memory or LDS operation and so can be used to meet the 3302 requirements of acquire, release and sequential consistency. 3303* The L1 caches use an L2 cache shared by all SAs on the same agent. 3304* The L2 cache has independent channels to service disjoint ranges of virtual 3305 addresses. 3306* Each L1 quadrant of a single SA accesses a different L2 channel. Each L1 3307 quadrant has a separate request queue per L2 channel. Therefore, the vector 3308 and scalar memory operations performed by wavefronts executing in different 3309 work-groups (which may be executing on different SAs) of an agent can be 3310 reordered relative to each other. A ``s_waitcnt vmcnt(0) & vscnt(0)`` is 3311 required to ensure synchronization between vector memory operations of 3312 different SAs. It ensures a previous vector memory operation has completed 3313 before executing a subsequent vector memory and so can be used to meet the 3314 requirements of acquire, release and sequential consistency. 3315* The L2 cache can be kept coherent with other agents on some targets, or ranges 3316 of virtual addresses can be set up to bypass it to ensure system coherence. 3317 3318Private address space uses ``buffer_load/store`` using the scratch V# 3319(GFX6-GFX8), or ``scratch_load/store`` (GFX9-GFX10). Since only a single thread 3320is accessing the memory, atomic memory orderings are not meaningful and all 3321accesses are treated as non-atomic. 3322 3323Constant address space uses ``buffer/global_load`` instructions (or equivalent 3324scalar memory instructions). Since the constant address space contents do not 3325change during the execution of a kernel dispatch it is not legal to perform 3326stores, and atomic memory orderings are not meaningful and all access are 3327treated as non-atomic. 3328 3329A memory synchronization scope wider than work-group is not meaningful for the 3330group (LDS) address space and is treated as work-group. 3331 3332The memory model does not support the region address space which is treated as 3333non-atomic. 3334 3335Acquire memory ordering is not meaningful on store atomic instructions and is 3336treated as non-atomic. 3337 3338Release memory ordering is not meaningful on load atomic instructions and is 3339treated a non-atomic. 3340 3341Acquire-release memory ordering is not meaningful on load or store atomic 3342instructions and is treated as acquire and release respectively. 3343 3344AMDGPU backend only uses scalar memory operations to access memory that is 3345proven to not change during the execution of the kernel dispatch. This includes 3346constant address space and global address space for program scope const 3347variables. Therefore the kernel machine code does not have to maintain the 3348scalar L1 cache to ensure it is coherent with the vector L1 cache. The scalar 3349and vector L1 caches are invalidated between kernel dispatches by CP since 3350constant address space data may change between kernel dispatch executions. See 3351:ref:`amdgpu-address-spaces`. 3352 3353The one exception is if scalar writes are used to spill SGPR registers. In this 3354case the AMDGPU backend ensures the memory location used to spill is never 3355accessed by vector memory operations at the same time. If scalar writes are used 3356then a ``s_dcache_wb`` is inserted before the ``s_endpgm`` and before a function 3357return since the locations may be used for vector memory instructions by a 3358future wavefront that uses the same scratch area, or a function call that 3359creates a frame at the same address, respectively. There is no need for a 3360``s_dcache_inv`` as all scalar writes are write-before-read in the same thread. 3361 3362For GFX6-GFX9, scratch backing memory (which is used for the private address 3363space) is accessed with MTYPE NC_NV (non-coherent non-volatile). Since the 3364private address space is only accessed by a single thread, and is always 3365write-before-read, there is never a need to invalidate these entries from the L1 3366cache. Hence all cache invalidates are done as ``*_vol`` to only invalidate the 3367volatile cache lines. 3368 3369For GFX10, scratch backing memory (which is used for the private address space) 3370is accessed with MTYPE NC (non-coherent). Since the private address space is 3371only accessed by a single thread, and is always write-before-read, there is 3372never a need to invalidate these entries from the L0 or L1 caches. 3373 3374For GFX10, wavefronts are executed in native mode with in-order reporting of 3375loads and sample instructions. In this mode vmcnt reports completion of load, 3376atomic with return and sample instructions in order, and the vscnt reports the 3377completion of store and atomic without return in order. See ``MEM_ORDERED`` 3378field in :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`. 3379 3380In GFX10, wavefronts can be executed in WGP or CU wavefront execution mode: 3381 3382* In WGP wavefront execution mode the wavefronts of a work-group are executed 3383 on the SIMDs of both CUs of the WGP. Therefore, explicit management of the per 3384 CU L0 caches is required for work-group synchronization. Also accesses to L1 3385 at work-group scope need to be explicitly ordered as the accesses from 3386 different CUs are not ordered. 3387* In CU wavefront execution mode the wavefronts of a work-group are executed on 3388 the SIMDs of a single CU of the WGP. Therefore, all global memory access by 3389 the work-group access the same L0 which in turn ensures L1 accesses are 3390 ordered and so do not require explicit management of the caches for 3391 work-group synchronization. 3392 3393See ``WGP_MODE`` field in 3394:ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table` and 3395:ref:`amdgpu-target-features`. 3396 3397On dGPU the kernarg backing memory is accessed as UC (uncached) to avoid needing 3398to invalidate the L2 cache. For GFX6-GFX9, this also causes it to be treated as 3399non-volatile and so is not invalidated by ``*_vol``. On APU it is accessed as CC 3400(cache coherent) and so the L2 cache will be coherent with the CPU and other 3401agents. 3402 3403 .. table:: AMDHSA Memory Model Code Sequences GFX6-GFX10 3404 :name: amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx10-table 3405 3406 ============ ============ ============== ========== =============================== ================================== 3407 LLVM Instr LLVM Memory LLVM Memory AMDGPU AMDGPU Machine Code AMDGPU Machine Code 3408 Ordering Sync Scope Address GFX6-9 GFX10 3409 Space 3410 ============ ============ ============== ========== =============================== ================================== 3411 **Non-Atomic** 3412 ---------------------------------------------------------------------------------------------------------------------- 3413 load *none* *none* - global - !volatile & !nontemporal - !volatile & !nontemporal 3414 - generic 3415 - private 1. buffer/global/flat_load 1. buffer/global/flat_load 3416 - constant 3417 - volatile & !nontemporal - volatile & !nontemporal 3418 3419 1. buffer/global/flat_load 1. buffer/global/flat_load 3420 glc=1 glc=1 dlc=1 3421 3422 - nontemporal - nontemporal 3423 3424 1. buffer/global/flat_load 1. buffer/global/flat_load 3425 glc=1 slc=1 slc=1 3426 3427 load *none* *none* - local 1. ds_load 1. ds_load 3428 store *none* *none* - global - !nontemporal - !nontemporal 3429 - generic 3430 - private 1. buffer/global/flat_store 1. buffer/global/flat_store 3431 - constant 3432 - nontemporal - nontemporal 3433 3434 1. buffer/global/flat_store 1. buffer/global/flat_store 3435 glc=1 slc=1 slc=1 3436 3437 store *none* *none* - local 1. ds_store 1. ds_store 3438 **Unordered Atomic** 3439 ---------------------------------------------------------------------------------------------------------------------- 3440 load atomic unordered *any* *any* *Same as non-atomic*. *Same as non-atomic*. 3441 store atomic unordered *any* *any* *Same as non-atomic*. *Same as non-atomic*. 3442 atomicrmw unordered *any* *any* *Same as monotonic *Same as monotonic 3443 atomic*. atomic*. 3444 **Monotonic Atomic** 3445 ---------------------------------------------------------------------------------------------------------------------- 3446 load atomic monotonic - singlethread - global 1. buffer/global/flat_load 1. buffer/global/flat_load 3447 - wavefront - generic 3448 load atomic monotonic - workgroup - global 1. buffer/global/flat_load 1. buffer/global/flat_load 3449 - generic glc=1 3450 3451 - If CU wavefront execution mode, omit glc=1. 3452 3453 load atomic monotonic - singlethread - local 1. ds_load 1. ds_load 3454 - wavefront 3455 - workgroup 3456 load atomic monotonic - agent - global 1. buffer/global/flat_load 1. buffer/global/flat_load 3457 - system - generic glc=1 glc=1 dlc=1 3458 store atomic monotonic - singlethread - global 1. buffer/global/flat_store 1. buffer/global/flat_store 3459 - wavefront - generic 3460 - workgroup 3461 - agent 3462 - system 3463 store atomic monotonic - singlethread - local 1. ds_store 1. ds_store 3464 - wavefront 3465 - workgroup 3466 atomicrmw monotonic - singlethread - global 1. buffer/global/flat_atomic 1. buffer/global/flat_atomic 3467 - wavefront - generic 3468 - workgroup 3469 - agent 3470 - system 3471 atomicrmw monotonic - singlethread - local 1. ds_atomic 1. ds_atomic 3472 - wavefront 3473 - workgroup 3474 **Acquire Atomic** 3475 ---------------------------------------------------------------------------------------------------------------------- 3476 load atomic acquire - singlethread - global 1. buffer/global/ds/flat_load 1. buffer/global/ds/flat_load 3477 - wavefront - local 3478 - generic 3479 load atomic acquire - workgroup - global 1. buffer/global/flat_load 1. buffer/global_load glc=1 3480 3481 - If CU wavefront execution mode, omit glc=1. 3482 3483 2. s_waitcnt vmcnt(0) 3484 3485 - If CU wavefront execution mode, omit. 3486 - Must happen before 3487 the following buffer_gl0_inv 3488 and before any following 3489 global/generic 3490 load/load 3491 atomic/store/store 3492 atomic/atomicrmw. 3493 3494 3. buffer_gl0_inv 3495 3496 - If CU wavefront execution mode, omit. 3497 - Ensures that 3498 following 3499 loads will not see 3500 stale data. 3501 3502 load atomic acquire - workgroup - local 1. ds_load 1. ds_load 3503 2. s_waitcnt lgkmcnt(0) 2. s_waitcnt lgkmcnt(0) 3504 3505 - If OpenCL, omit. - If OpenCL, omit. 3506 - Must happen before - Must happen before 3507 any following the following buffer_gl0_inv 3508 global/generic and before any following 3509 load/load global/generic load/load 3510 atomic/store/store atomic/store/store 3511 atomic/atomicrmw. atomic/atomicrmw. 3512 - Ensures any - Ensures any 3513 following global following global 3514 data read is no data read is no 3515 older than the load older than the load 3516 atomic value being atomic value being 3517 acquired. acquired. 3518 3519 3. buffer_gl0_inv 3520 3521 - If CU wavefront execution mode, omit. 3522 - If OpenCL, omit. 3523 - Ensures that 3524 following 3525 loads will not see 3526 stale data. 3527 3528 load atomic acquire - workgroup - generic 1. flat_load 1. flat_load glc=1 3529 3530 - If CU wavefront execution mode, omit glc=1. 3531 3532 2. s_waitcnt lgkmcnt(0) 2. s_waitcnt lgkmcnt(0) & 3533 vmcnt(0) 3534 3535 - If CU wavefront execution mode, omit vmcnt. 3536 - If OpenCL, omit. - If OpenCL, omit 3537 lgkmcnt(0). 3538 - Must happen before - Must happen before 3539 any following the following 3540 global/generic buffer_gl0_inv and any 3541 load/load following global/generic 3542 atomic/store/store load/load 3543 atomic/atomicrmw. atomic/store/store 3544 atomic/atomicrmw. 3545 - Ensures any - Ensures any 3546 following global following global 3547 data read is no data read is no 3548 older than the load older than the load 3549 atomic value being atomic value being 3550 acquired. acquired. 3551 3552 3. buffer_gl0_inv 3553 3554 - If CU wavefront execution mode, omit. 3555 - Ensures that 3556 following 3557 loads will not see 3558 stale data. 3559 3560 load atomic acquire - agent - global 1. buffer/global/flat_load 1. buffer/global_load 3561 - system glc=1 glc=1 dlc=1 3562 2. s_waitcnt vmcnt(0) 2. s_waitcnt vmcnt(0) 3563 3564 - Must happen before - Must happen before 3565 following following 3566 buffer_wbinvl1_vol. buffer_gl*_inv. 3567 - Ensures the load - Ensures the load 3568 has completed has completed 3569 before invalidating before invalidating 3570 the cache. the caches. 3571 3572 3. buffer_wbinvl1_vol 3. buffer_gl0_inv; 3573 buffer_gl1_inv 3574 3575 - Must happen before - Must happen before 3576 any following any following 3577 global/generic global/generic 3578 load/load load/load 3579 atomic/atomicrmw. atomic/atomicrmw. 3580 - Ensures that - Ensures that 3581 following following 3582 loads will not see loads will not see 3583 stale global data. stale global data. 3584 3585 load atomic acquire - agent - generic 1. flat_load glc=1 1. flat_load glc=1 dlc=1 3586 - system 2. s_waitcnt vmcnt(0) & 2. s_waitcnt vmcnt(0) & 3587 lgkmcnt(0) lgkmcnt(0) 3588 3589 - If OpenCL omit - If OpenCL omit 3590 lgkmcnt(0). lgkmcnt(0). 3591 - Must happen before - Must happen before 3592 following following 3593 buffer_wbinvl1_vol. buffer_gl*_invl. 3594 - Ensures the flat_load - Ensures the flat_load 3595 has completed has completed 3596 before invalidating before invalidating 3597 the cache. the caches. 3598 3599 3. buffer_wbinvl1_vol 3. buffer_gl0_inv; 3600 buffer_gl1_inv 3601 3602 - Must happen before - Must happen before 3603 any following any following 3604 global/generic global/generic 3605 load/load load/load 3606 atomic/atomicrmw. atomic/atomicrmw. 3607 - Ensures that - Ensures that 3608 following loads following loads 3609 will not see stale will not see stale 3610 global data. global data. 3611 3612 atomicrmw acquire - singlethread - global 1. buffer/global/ds/flat_atomic 1. buffer/global/ds/flat_atomic 3613 - wavefront - local 3614 - generic 3615 atomicrmw acquire - workgroup - global 1. buffer/global/flat_atomic 1. buffer/global_atomic 3616 2. s_waitcnt vm/vscnt(0) 3617 3618 - If CU wavefront execution mode, omit. 3619 - Use vmcnt if atomic with 3620 return and vscnt if atomic 3621 with no-return. 3622 - Must happen before 3623 the following buffer_gl0_inv 3624 and before any following 3625 global/generic 3626 load/load 3627 atomic/store/store 3628 atomic/atomicrmw. 3629 3630 3. buffer_gl0_inv 3631 3632 - If CU wavefront execution mode, omit. 3633 - Ensures that 3634 following 3635 loads will not see 3636 stale data. 3637 3638 atomicrmw acquire - workgroup - local 1. ds_atomic 1. ds_atomic 3639 2. waitcnt lgkmcnt(0) 2. waitcnt lgkmcnt(0) 3640 3641 - If OpenCL, omit. - If OpenCL, omit. 3642 - Must happen before - Must happen before 3643 any following the following 3644 global/generic buffer_gl0_inv. 3645 load/load 3646 atomic/store/store 3647 atomic/atomicrmw. 3648 - Ensures any - Ensures any 3649 following global following global 3650 data read is no data read is no 3651 older than the older than the 3652 atomicrmw value atomicrmw value 3653 being acquired. being acquired. 3654 3655 3. buffer_gl0_inv 3656 3657 - If OpenCL omit. 3658 - Ensures that 3659 following 3660 loads will not see 3661 stale data. 3662 3663 atomicrmw acquire - workgroup - generic 1. flat_atomic 1. flat_atomic 3664 2. waitcnt lgkmcnt(0) 2. waitcnt lgkmcnt(0) & 3665 vm/vscnt(0) 3666 3667 - If CU wavefront execution mode, omit vm/vscnt. 3668 - If OpenCL, omit. - If OpenCL, omit 3669 waitcnt lgkmcnt(0).. 3670 - Use vmcnt if atomic with 3671 return and vscnt if atomic 3672 with no-return. 3673 waitcnt lgkmcnt(0). 3674 - Must happen before - Must happen before 3675 any following the following 3676 global/generic buffer_gl0_inv. 3677 load/load 3678 atomic/store/store 3679 atomic/atomicrmw. 3680 - Ensures any - Ensures any 3681 following global following global 3682 data read is no data read is no 3683 older than the older than the 3684 atomicrmw value atomicrmw value 3685 being acquired. being acquired. 3686 3687 3. buffer_gl0_inv 3688 3689 - If CU wavefront execution mode, omit. 3690 - Ensures that 3691 following 3692 loads will not see 3693 stale data. 3694 3695 atomicrmw acquire - agent - global 1. buffer/global/flat_atomic 1. buffer/global_atomic 3696 - system 2. s_waitcnt vmcnt(0) 2. s_waitcnt vm/vscnt(0) 3697 3698 - Use vmcnt if atomic with 3699 return and vscnt if atomic 3700 with no-return. 3701 waitcnt lgkmcnt(0). 3702 - Must happen before - Must happen before 3703 following following 3704 buffer_wbinvl1_vol. buffer_gl*_inv. 3705 - Ensures the - Ensures the 3706 atomicrmw has atomicrmw has 3707 completed before completed before 3708 invalidating the invalidating the 3709 cache. caches. 3710 3711 3. buffer_wbinvl1_vol 3. buffer_gl0_inv; 3712 buffer_gl1_inv 3713 3714 - Must happen before - Must happen before 3715 any following any following 3716 global/generic global/generic 3717 load/load load/load 3718 atomic/atomicrmw. atomic/atomicrmw. 3719 - Ensures that - Ensures that 3720 following loads following loads 3721 will not see stale will not see stale 3722 global data. global data. 3723 3724 atomicrmw acquire - agent - generic 1. flat_atomic 1. flat_atomic 3725 - system 2. s_waitcnt vmcnt(0) & 2. s_waitcnt vm/vscnt(0) & 3726 lgkmcnt(0) lgkmcnt(0) 3727 3728 - If OpenCL, omit - If OpenCL, omit 3729 lgkmcnt(0). lgkmcnt(0). 3730 - Use vmcnt if atomic with 3731 return and vscnt if atomic 3732 with no-return. 3733 - Must happen before - Must happen before 3734 following following 3735 buffer_wbinvl1_vol. buffer_gl*_inv. 3736 - Ensures the - Ensures the 3737 atomicrmw has atomicrmw has 3738 completed before completed before 3739 invalidating the invalidating the 3740 cache. caches. 3741 3742 3. buffer_wbinvl1_vol 3. buffer_gl0_inv; 3743 buffer_gl1_inv 3744 3745 - Must happen before - Must happen before 3746 any following any following 3747 global/generic global/generic 3748 load/load load/load 3749 atomic/atomicrmw. atomic/atomicrmw. 3750 - Ensures that - Ensures that 3751 following loads following loads 3752 will not see stale will not see stale 3753 global data. global data. 3754 3755 fence acquire - singlethread *none* *none* *none* 3756 - wavefront 3757 fence acquire - workgroup *none* 1. s_waitcnt lgkmcnt(0) 1. s_waitcnt lgkmcnt(0) & 3758 vmcnt(0) & vscnt(0) 3759 3760 - If CU wavefront execution mode, omit vmcnt and 3761 vscnt. 3762 - If OpenCL and - If OpenCL and 3763 address space is address space is 3764 not generic, omit. not generic, omit 3765 lgkmcnt(0). 3766 - If OpenCL and 3767 address space is 3768 local, omit 3769 vmcnt(0) and vscnt(0). 3770 - However, since LLVM - However, since LLVM 3771 currently has no currently has no 3772 address space on address space on 3773 the fence need to the fence need to 3774 conservatively conservatively 3775 always generate. If always generate. If 3776 fence had an fence had an 3777 address space then address space then 3778 set to address set to address 3779 space of OpenCL space of OpenCL 3780 fence flag, or to fence flag, or to 3781 generic if both generic if both 3782 local and global local and global 3783 flags are flags are 3784 specified. specified. 3785 - Must happen after 3786 any preceding 3787 local/generic load 3788 atomic/atomicrmw 3789 with an equal or 3790 wider sync scope 3791 and memory ordering 3792 stronger than 3793 unordered (this is 3794 termed the 3795 fence-paired-atomic). 3796 - Must happen before 3797 any following 3798 global/generic 3799 load/load 3800 atomic/store/store 3801 atomic/atomicrmw. 3802 - Ensures any 3803 following global 3804 data read is no 3805 older than the 3806 value read by the 3807 fence-paired-atomic. 3808 - Could be split into 3809 separate s_waitcnt 3810 vmcnt(0), s_waitcnt 3811 vscnt(0) and s_waitcnt 3812 lgkmcnt(0) to allow 3813 them to be 3814 independently moved 3815 according to the 3816 following rules. 3817 - s_waitcnt vmcnt(0) 3818 must happen after 3819 any preceding 3820 global/generic load 3821 atomic/ 3822 atomicrmw-with-return-value 3823 with an equal or 3824 wider sync scope 3825 and memory ordering 3826 stronger than 3827 unordered (this is 3828 termed the 3829 fence-paired-atomic). 3830 - s_waitcnt vscnt(0) 3831 must happen after 3832 any preceding 3833 global/generic 3834 atomicrmw-no-return-value 3835 with an equal or 3836 wider sync scope 3837 and memory ordering 3838 stronger than 3839 unordered (this is 3840 termed the 3841 fence-paired-atomic). 3842 - s_waitcnt lgkmcnt(0) 3843 must happen after 3844 any preceding 3845 local/generic load 3846 atomic/atomicrmw 3847 with an equal or 3848 wider sync scope 3849 and memory ordering 3850 stronger than 3851 unordered (this is 3852 termed the 3853 fence-paired-atomic). 3854 - Must happen before 3855 the following 3856 buffer_gl0_inv. 3857 - Ensures that the 3858 fence-paired atomic 3859 has completed 3860 before invalidating 3861 the 3862 cache. Therefore 3863 any following 3864 locations read must 3865 be no older than 3866 the value read by 3867 the 3868 fence-paired-atomic. 3869 3870 3. buffer_gl0_inv 3871 3872 - If CU wavefront execution mode, omit. 3873 - Ensures that 3874 following 3875 loads will not see 3876 stale data. 3877 3878 fence acquire - agent *none* 1. s_waitcnt lgkmcnt(0) & 1. s_waitcnt lgkmcnt(0) & 3879 - system vmcnt(0) vmcnt(0) & vscnt(0) 3880 3881 - If OpenCL and - If OpenCL and 3882 address space is address space is 3883 not generic, omit not generic, omit 3884 lgkmcnt(0). lgkmcnt(0). 3885 - If OpenCL and 3886 address space is 3887 local, omit 3888 vmcnt(0) and vscnt(0). 3889 - However, since LLVM - However, since LLVM 3890 currently has no currently has no 3891 address space on address space on 3892 the fence need to the fence need to 3893 conservatively conservatively 3894 always generate always generate 3895 (see comment for (see comment for 3896 previous fence). previous fence). 3897 - Could be split into 3898 separate s_waitcnt 3899 vmcnt(0) and 3900 s_waitcnt 3901 lgkmcnt(0) to allow 3902 them to be 3903 independently moved 3904 according to the 3905 following rules. 3906 - s_waitcnt vmcnt(0) 3907 must happen after 3908 any preceding 3909 global/generic load 3910 atomic/atomicrmw 3911 with an equal or 3912 wider sync scope 3913 and memory ordering 3914 stronger than 3915 unordered (this is 3916 termed the 3917 fence-paired-atomic). 3918 - s_waitcnt lgkmcnt(0) 3919 must happen after 3920 any preceding 3921 local/generic load 3922 atomic/atomicrmw 3923 with an equal or 3924 wider sync scope 3925 and memory ordering 3926 stronger than 3927 unordered (this is 3928 termed the 3929 fence-paired-atomic). 3930 - Must happen before 3931 the following 3932 buffer_wbinvl1_vol. 3933 - Ensures that the 3934 fence-paired atomic 3935 has completed 3936 before invalidating 3937 the 3938 cache. Therefore 3939 any following 3940 locations read must 3941 be no older than 3942 the value read by 3943 the 3944 fence-paired-atomic. 3945 - Could be split into 3946 separate s_waitcnt 3947 vmcnt(0), s_waitcnt 3948 vscnt(0) and s_waitcnt 3949 lgkmcnt(0) to allow 3950 them to be 3951 independently moved 3952 according to the 3953 following rules. 3954 - s_waitcnt vmcnt(0) 3955 must happen after 3956 any preceding 3957 global/generic load 3958 atomic/ 3959 atomicrmw-with-return-value 3960 with an equal or 3961 wider sync scope 3962 and memory ordering 3963 stronger than 3964 unordered (this is 3965 termed the 3966 fence-paired-atomic). 3967 - s_waitcnt vscnt(0) 3968 must happen after 3969 any preceding 3970 global/generic 3971 atomicrmw-no-return-value 3972 with an equal or 3973 wider sync scope 3974 and memory ordering 3975 stronger than 3976 unordered (this is 3977 termed the 3978 fence-paired-atomic). 3979 - s_waitcnt lgkmcnt(0) 3980 must happen after 3981 any preceding 3982 local/generic load 3983 atomic/atomicrmw 3984 with an equal or 3985 wider sync scope 3986 and memory ordering 3987 stronger than 3988 unordered (this is 3989 termed the 3990 fence-paired-atomic). 3991 - Must happen before 3992 the following 3993 buffer_gl*_inv. 3994 - Ensures that the 3995 fence-paired atomic 3996 has completed 3997 before invalidating 3998 the 3999 caches. Therefore 4000 any following 4001 locations read must 4002 be no older than 4003 the value read by 4004 the 4005 fence-paired-atomic. 4006 4007 2. buffer_wbinvl1_vol 2. buffer_gl0_inv; 4008 buffer_gl1_inv 4009 4010 - Must happen before any - Must happen before any 4011 following global/generic following global/generic 4012 load/load load/load 4013 atomic/store/store atomic/store/store 4014 atomic/atomicrmw. atomic/atomicrmw. 4015 - Ensures that - Ensures that 4016 following loads following loads 4017 will not see stale will not see stale 4018 global data. global data. 4019 4020 **Release Atomic** 4021 ---------------------------------------------------------------------------------------------------------------------- 4022 store atomic release - singlethread - global 1. buffer/global/ds/flat_store 1. buffer/global/ds/flat_store 4023 - wavefront - local 4024 - generic 4025 store atomic release - workgroup - global 1. s_waitcnt lgkmcnt(0) 1. s_waitcnt lgkmcnt(0) & 4026 vmcnt(0) & vscnt(0) 4027 4028 - If CU wavefront execution mode, omit vmcnt and 4029 vscnt. 4030 - If OpenCL, omit. - If OpenCL, omit 4031 lgkmcnt(0). 4032 - Must happen after 4033 any preceding 4034 local/generic 4035 load/store/load 4036 atomic/store 4037 atomic/atomicrmw. 4038 - Could be split into 4039 separate s_waitcnt 4040 vmcnt(0), s_waitcnt 4041 vscnt(0) and s_waitcnt 4042 lgkmcnt(0) to allow 4043 them to be 4044 independently moved 4045 according to the 4046 following rules. 4047 - s_waitcnt vmcnt(0) 4048 must happen after 4049 any preceding 4050 global/generic load/load 4051 atomic/ 4052 atomicrmw-with-return-value. 4053 - s_waitcnt vscnt(0) 4054 must happen after 4055 any preceding 4056 global/generic 4057 store/store 4058 atomic/ 4059 atomicrmw-no-return-value. 4060 - s_waitcnt lgkmcnt(0) 4061 must happen after 4062 any preceding 4063 local/generic 4064 load/store/load 4065 atomic/store 4066 atomic/atomicrmw. 4067 - Must happen before - Must happen before 4068 the following the following 4069 store. store. 4070 - Ensures that all - Ensures that all 4071 memory operations memory operations 4072 to local have have 4073 completed before completed before 4074 performing the performing the 4075 store that is being store that is being 4076 released. released. 4077 4078 2. buffer/global/flat_store 2. buffer/global_store 4079 store atomic release - workgroup - local 1. waitcnt vmcnt(0) & vscnt(0) 4080 4081 - If CU wavefront execution mode, omit. 4082 - If OpenCL, omit. 4083 - Could be split into 4084 separate s_waitcnt 4085 vmcnt(0) and s_waitcnt 4086 vscnt(0) to allow 4087 them to be 4088 independently moved 4089 according to the 4090 following rules. 4091 - s_waitcnt vmcnt(0) 4092 must happen after 4093 any preceding 4094 global/generic load/load 4095 atomic/ 4096 atomicrmw-with-return-value. 4097 - s_waitcnt vscnt(0) 4098 must happen after 4099 any preceding 4100 global/generic 4101 store/store atomic/ 4102 atomicrmw-no-return-value. 4103 - Must happen before 4104 the following 4105 store. 4106 - Ensures that all 4107 global memory 4108 operations have 4109 completed before 4110 performing the 4111 store that is being 4112 released. 4113 4114 1. ds_store 2. ds_store 4115 store atomic release - workgroup - generic 1. s_waitcnt lgkmcnt(0) 1. s_waitcnt lgkmcnt(0) & 4116 vmcnt(0) & vscnt(0) 4117 4118 - If CU wavefront execution mode, omit vmcnt and 4119 vscnt. 4120 - If OpenCL, omit. - If OpenCL, omit 4121 lgkmcnt(0). 4122 - Must happen after 4123 any preceding 4124 local/generic 4125 load/store/load 4126 atomic/store 4127 atomic/atomicrmw. 4128 - Could be split into 4129 separate s_waitcnt 4130 vmcnt(0), s_waitcnt 4131 vscnt(0) and s_waitcnt 4132 lgkmcnt(0) to allow 4133 them to be 4134 independently moved 4135 according to the 4136 following rules. 4137 - s_waitcnt vmcnt(0) 4138 must happen after 4139 any preceding 4140 global/generic load/load 4141 atomic/ 4142 atomicrmw-with-return-value. 4143 - s_waitcnt vscnt(0) 4144 must happen after 4145 any preceding 4146 global/generic 4147 store/store 4148 atomic/ 4149 atomicrmw-no-return-value. 4150 - s_waitcnt lgkmcnt(0) 4151 must happen after 4152 any preceding 4153 local/generic load/store/load 4154 atomic/store atomic/atomicrmw. 4155 - Must happen before - Must happen before 4156 the following the following 4157 store. store. 4158 - Ensures that all - Ensures that all 4159 memory operations memory operations 4160 to local have have 4161 completed before completed before 4162 performing the performing the 4163 store that is being store that is being 4164 released. released. 4165 4166 2. flat_store 2. flat_store 4167 store atomic release - agent - global 1. s_waitcnt lgkmcnt(0) & 1. s_waitcnt lgkmcnt(0) & 4168 - system - generic vmcnt(0) vmcnt(0) & vscnt(0) 4169 4170 - If OpenCL, omit - If OpenCL, omit 4171 lgkmcnt(0). lgkmcnt(0). 4172 - Could be split into - Could be split into 4173 separate s_waitcnt separate s_waitcnt 4174 vmcnt(0) and vmcnt(0), s_waitcnt vscnt(0) 4175 s_waitcnt and s_waitcnt 4176 lgkmcnt(0) to allow lgkmcnt(0) to allow 4177 them to be them to be 4178 independently moved independently moved 4179 according to the according to the 4180 following rules. following rules. 4181 - s_waitcnt vmcnt(0) - s_waitcnt vmcnt(0) 4182 must happen after must happen after 4183 any preceding any preceding 4184 global/generic global/generic 4185 load/store/load load/load 4186 atomic/store atomic/ 4187 atomic/atomicrmw. atomicrmw-with-return-value. 4188 - s_waitcnt vscnt(0) 4189 must happen after 4190 any preceding 4191 global/generic 4192 store/store atomic/ 4193 atomicrmw-no-return-value. 4194 - s_waitcnt lgkmcnt(0) - s_waitcnt lgkmcnt(0) 4195 must happen after must happen after 4196 any preceding any preceding 4197 local/generic local/generic 4198 load/store/load load/store/load 4199 atomic/store atomic/store 4200 atomic/atomicrmw. atomic/atomicrmw. 4201 - Must happen before - Must happen before 4202 the following the following 4203 store. store. 4204 - Ensures that all - Ensures that all 4205 memory operations memory operations 4206 to memory have to memory have 4207 completed before completed before 4208 performing the performing the 4209 store that is being store that is being 4210 released. released. 4211 4212 2. buffer/global/ds/flat_store 2. buffer/global/ds/flat_store 4213 atomicrmw release - singlethread - global 1. buffer/global/ds/flat_atomic 1. buffer/global/ds/flat_atomic 4214 - wavefront - local 4215 - generic 4216 atomicrmw release - workgroup - global 1. s_waitcnt lgkmcnt(0) 1. s_waitcnt lgkmcnt(0) & 4217 vmcnt(0) & vscnt(0) 4218 4219 - If CU wavefront execution mode, omit vmcnt and 4220 vscnt. 4221 - If OpenCL, omit. 4222 4223 - Must happen after 4224 any preceding 4225 local/generic 4226 load/store/load 4227 atomic/store 4228 atomic/atomicrmw. 4229 - Could be split into 4230 separate s_waitcnt 4231 vmcnt(0), s_waitcnt 4232 vscnt(0) and s_waitcnt 4233 lgkmcnt(0) to allow 4234 them to be 4235 independently moved 4236 according to the 4237 following rules. 4238 - s_waitcnt vmcnt(0) 4239 must happen after 4240 any preceding 4241 global/generic load/load 4242 atomic/ 4243 atomicrmw-with-return-value. 4244 - s_waitcnt vscnt(0) 4245 must happen after 4246 any preceding 4247 global/generic 4248 store/store 4249 atomic/ 4250 atomicrmw-no-return-value. 4251 - s_waitcnt lgkmcnt(0) 4252 must happen after 4253 any preceding 4254 local/generic 4255 load/store/load 4256 atomic/store 4257 atomic/atomicrmw. 4258 - Must happen before - Must happen before 4259 the following the following 4260 atomicrmw. atomicrmw. 4261 - Ensures that all - Ensures that all 4262 memory operations memory operations 4263 to local have have 4264 completed before completed before 4265 performing the performing the 4266 atomicrmw that is atomicrmw that is 4267 being released. being released. 4268 4269 2. buffer/global/flat_atomic 2. buffer/global_atomic 4270 atomicrmw release - workgroup - local 1. waitcnt vmcnt(0) & vscnt(0) 4271 4272 - If CU wavefront execution mode, omit. 4273 - If OpenCL, omit. 4274 - Could be split into 4275 separate s_waitcnt 4276 vmcnt(0) and s_waitcnt 4277 vscnt(0) to allow 4278 them to be 4279 independently moved 4280 according to the 4281 following rules. 4282 - s_waitcnt vmcnt(0) 4283 must happen after 4284 any preceding 4285 global/generic load/load 4286 atomic/ 4287 atomicrmw-with-return-value. 4288 - s_waitcnt vscnt(0) 4289 must happen after 4290 any preceding 4291 global/generic 4292 store/store atomic/ 4293 atomicrmw-no-return-value. 4294 - Must happen before 4295 the following 4296 store. 4297 - Ensures that all 4298 global memory 4299 operations have 4300 completed before 4301 performing the 4302 store that is being 4303 released. 4304 4305 1. ds_atomic 2. ds_atomic 4306 atomicrmw release - workgroup - generic 1. s_waitcnt lgkmcnt(0) 1. s_waitcnt lgkmcnt(0) & 4307 vmcnt(0) & vscnt(0) 4308 4309 - If CU wavefront execution mode, omit vmcnt and 4310 vscnt. 4311 - If OpenCL, omit. - If OpenCL, omit 4312 waitcnt lgkmcnt(0). 4313 - Must happen after 4314 any preceding 4315 local/generic 4316 load/store/load 4317 atomic/store 4318 atomic/atomicrmw. 4319 - Could be split into 4320 separate s_waitcnt 4321 vmcnt(0), s_waitcnt 4322 vscnt(0) and s_waitcnt 4323 lgkmcnt(0) to allow 4324 them to be 4325 independently moved 4326 according to the 4327 following rules. 4328 - s_waitcnt vmcnt(0) 4329 must happen after 4330 any preceding 4331 global/generic load/load 4332 atomic/ 4333 atomicrmw-with-return-value. 4334 - s_waitcnt vscnt(0) 4335 must happen after 4336 any preceding 4337 global/generic 4338 store/store 4339 atomic/ 4340 atomicrmw-no-return-value. 4341 - s_waitcnt lgkmcnt(0) 4342 must happen after 4343 any preceding 4344 local/generic load/store/load 4345 atomic/store atomic/atomicrmw. 4346 - Must happen before - Must happen before 4347 the following the following 4348 atomicrmw. atomicrmw. 4349 - Ensures that all - Ensures that all 4350 memory operations memory operations 4351 to local have have 4352 completed before completed before 4353 performing the performing the 4354 atomicrmw that is atomicrmw that is 4355 being released. being released. 4356 4357 2. flat_atomic 2. flat_atomic 4358 atomicrmw release - agent - global 1. s_waitcnt lgkmcnt(0) & 1. s_waitcnt lkkmcnt(0) & 4359 - system - generic vmcnt(0) vmcnt(0) & vscnt(0) 4360 4361 - If OpenCL, omit - If OpenCL, omit 4362 lgkmcnt(0). lgkmcnt(0). 4363 - Could be split into - Could be split into 4364 separate s_waitcnt separate s_waitcnt 4365 vmcnt(0) and vmcnt(0), s_waitcnt 4366 s_waitcnt vscnt(0) and s_waitcnt 4367 lgkmcnt(0) to allow lgkmcnt(0) to allow 4368 them to be them to be 4369 independently moved independently moved 4370 according to the according to the 4371 following rules. following rules. 4372 - s_waitcnt vmcnt(0) - s_waitcnt vmcnt(0) 4373 must happen after must happen after 4374 any preceding any preceding 4375 global/generic global/generic 4376 load/store/load load/load atomic/ 4377 atomic/store atomicrmw-with-return-value. 4378 atomic/atomicrmw. 4379 - s_waitcnt vscnt(0) 4380 must happen after 4381 any preceding 4382 global/generic 4383 store/store atomic/ 4384 atomicrmw-no-return-value. 4385 - s_waitcnt lgkmcnt(0) - s_waitcnt lgkmcnt(0) 4386 must happen after must happen after 4387 any preceding any preceding 4388 local/generic local/generic 4389 load/store/load load/store/load 4390 atomic/store atomic/store 4391 atomic/atomicrmw. atomic/atomicrmw. 4392 - Must happen before - Must happen before 4393 the following the following 4394 atomicrmw. atomicrmw. 4395 - Ensures that all - Ensures that all 4396 memory operations memory operations 4397 to global and local to global and local 4398 have completed have completed 4399 before performing before performing 4400 the atomicrmw that the atomicrmw that 4401 is being released. is being released. 4402 4403 2. buffer/global/ds/flat_atomic 2. buffer/global/ds/flat_atomic 4404 fence release - singlethread *none* *none* *none* 4405 - wavefront 4406 fence release - workgroup *none* 1. s_waitcnt lgkmcnt(0) 1. s_waitcnt lgkmcnt(0) & 4407 vmcnt(0) & vscnt(0) 4408 4409 - If CU wavefront execution mode, omit vmcnt and 4410 vscnt. 4411 - If OpenCL and - If OpenCL and 4412 address space is address space is 4413 not generic, omit. not generic, omit 4414 lgkmcnt(0). 4415 - If OpenCL and 4416 address space is 4417 local, omit 4418 vmcnt(0) and vscnt(0). 4419 - However, since LLVM - However, since LLVM 4420 currently has no currently has no 4421 address space on address space on 4422 the fence need to the fence need to 4423 conservatively conservatively 4424 always generate. If always generate. If 4425 fence had an fence had an 4426 address space then address space then 4427 set to address set to address 4428 space of OpenCL space of OpenCL 4429 fence flag, or to fence flag, or to 4430 generic if both generic if both 4431 local and global local and global 4432 flags are flags are 4433 specified. specified. 4434 - Must happen after 4435 any preceding 4436 local/generic 4437 load/load 4438 atomic/store/store 4439 atomic/atomicrmw. 4440 - Could be split into 4441 separate s_waitcnt 4442 vmcnt(0), s_waitcnt 4443 vscnt(0) and s_waitcnt 4444 lgkmcnt(0) to allow 4445 them to be 4446 independently moved 4447 according to the 4448 following rules. 4449 - s_waitcnt vmcnt(0) 4450 must happen after 4451 any preceding 4452 global/generic 4453 load/load 4454 atomic/ 4455 atomicrmw-with-return-value. 4456 - s_waitcnt vscnt(0) 4457 must happen after 4458 any preceding 4459 global/generic 4460 store/store atomic/ 4461 atomicrmw-no-return-value. 4462 - s_waitcnt lgkmcnt(0) 4463 must happen after 4464 any preceding 4465 local/generic 4466 load/store/load 4467 atomic/store atomic/ 4468 atomicrmw. 4469 - Must happen before - Must happen before 4470 any following store any following store 4471 atomic/atomicrmw atomic/atomicrmw 4472 with an equal or with an equal or 4473 wider sync scope wider sync scope 4474 and memory ordering and memory ordering 4475 stronger than stronger than 4476 unordered (this is unordered (this is 4477 termed the termed the 4478 fence-paired-atomic). fence-paired-atomic). 4479 - Ensures that all - Ensures that all 4480 memory operations memory operations 4481 to local have have 4482 completed before completed before 4483 performing the performing the 4484 following following 4485 fence-paired-atomic. fence-paired-atomic. 4486 4487 fence release - agent *none* 1. s_waitcnt lgkmcnt(0) & 1. s_waitcnt lgkmcnt(0) & 4488 - system vmcnt(0) vmcnt(0) & vscnt(0) 4489 4490 - If OpenCL and - If OpenCL and 4491 address space is address space is 4492 not generic, omit not generic, omit 4493 lgkmcnt(0). lgkmcnt(0). 4494 - If OpenCL and - If OpenCL and 4495 address space is address space is 4496 local, omit local, omit 4497 vmcnt(0). vmcnt(0) and vscnt(0). 4498 - However, since LLVM - However, since LLVM 4499 currently has no currently has no 4500 address space on address space on 4501 the fence need to the fence need to 4502 conservatively conservatively 4503 always generate. If always generate. If 4504 fence had an fence had an 4505 address space then address space then 4506 set to address set to address 4507 space of OpenCL space of OpenCL 4508 fence flag, or to fence flag, or to 4509 generic if both generic if both 4510 local and global local and global 4511 flags are flags are 4512 specified. specified. 4513 - Could be split into - Could be split into 4514 separate s_waitcnt separate s_waitcnt 4515 vmcnt(0) and vmcnt(0), s_waitcnt 4516 s_waitcnt vscnt(0) and s_waitcnt 4517 lgkmcnt(0) to allow lgkmcnt(0) to allow 4518 them to be them to be 4519 independently moved independently moved 4520 according to the according to the 4521 following rules. following rules. 4522 - s_waitcnt vmcnt(0) - s_waitcnt vmcnt(0) 4523 must happen after must happen after 4524 any preceding any preceding 4525 global/generic global/generic 4526 load/store/load load/load atomic/ 4527 atomic/store atomicrmw-with-return-value. 4528 atomic/atomicrmw. 4529 - s_waitcnt vscnt(0) 4530 must happen after 4531 any preceding 4532 global/generic 4533 store/store atomic/ 4534 atomicrmw-no-return-value. 4535 - s_waitcnt lgkmcnt(0) - s_waitcnt lgkmcnt(0) 4536 must happen after must happen after 4537 any preceding any preceding 4538 local/generic local/generic 4539 load/store/load load/store/load 4540 atomic/store atomic/store 4541 atomic/atomicrmw. atomic/atomicrmw. 4542 - Must happen before - Must happen before 4543 any following store any following store 4544 atomic/atomicrmw atomic/atomicrmw 4545 with an equal or with an equal or 4546 wider sync scope wider sync scope 4547 and memory ordering and memory ordering 4548 stronger than stronger than 4549 unordered (this is unordered (this is 4550 termed the termed the 4551 fence-paired-atomic). fence-paired-atomic). 4552 - Ensures that all - Ensures that all 4553 memory operations memory operations 4554 have have 4555 completed before completed before 4556 performing the performing the 4557 following following 4558 fence-paired-atomic. fence-paired-atomic. 4559 4560 **Acquire-Release Atomic** 4561 ---------------------------------------------------------------------------------------------------------------------- 4562 atomicrmw acq_rel - singlethread - global 1. buffer/global/ds/flat_atomic 1. buffer/global/ds/flat_atomic 4563 - wavefront - local 4564 - generic 4565 atomicrmw acq_rel - workgroup - global 1. s_waitcnt lgkmcnt(0) 1. s_waitcnt lgkmcnt(0) & 4566 vmcnt(0) & vscnt(0) 4567 4568 - If CU wavefront execution mode, omit vmcnt and 4569 vscnt. 4570 - If OpenCL, omit. - If OpenCL, omit 4571 s_waitcnt lgkmcnt(0). 4572 - Must happen after - Must happen after 4573 any preceding any preceding 4574 local/generic local/generic 4575 load/store/load load/store/load 4576 atomic/store atomic/store 4577 atomic/atomicrmw. atomic/atomicrmw. 4578 - Could be split into 4579 separate s_waitcnt 4580 vmcnt(0), s_waitcnt 4581 vscnt(0) and s_waitcnt 4582 lgkmcnt(0) to allow 4583 them to be 4584 independently moved 4585 according to the 4586 following rules. 4587 - s_waitcnt vmcnt(0) 4588 must happen after 4589 any preceding 4590 global/generic load/load 4591 atomic/ 4592 atomicrmw-with-return-value. 4593 - s_waitcnt vscnt(0) 4594 must happen after 4595 any preceding 4596 global/generic 4597 store/store 4598 atomic/ 4599 atomicrmw-no-return-value. 4600 - s_waitcnt lgkmcnt(0) 4601 must happen after 4602 any preceding 4603 local/generic load/store/load 4604 atomic/store atomic/atomicrmw. 4605 - Must happen before - Must happen before 4606 the following the following 4607 atomicrmw. atomicrmw. 4608 - Ensures that all - Ensures that all 4609 memory operations memory operations 4610 to local have have 4611 completed before completed before 4612 performing the performing the 4613 atomicrmw that is atomicrmw that is 4614 being released. being released. 4615 4616 2. buffer/global/flat_atomic 2. buffer/global_atomic 4617 3. s_waitcnt vm/vscnt(0) 4618 4619 - If CU wavefront execution mode, omit vm/vscnt. 4620 - Use vmcnt if atomic with 4621 return and vscnt if atomic 4622 with no-return. 4623 waitcnt lgkmcnt(0). 4624 - Must happen before 4625 the following 4626 buffer_gl0_inv. 4627 - Ensures any 4628 following global 4629 data read is no 4630 older than the 4631 atomicrmw value 4632 being acquired. 4633 4634 4. buffer_gl0_inv 4635 4636 - If CU wavefront execution mode, omit. 4637 - Ensures that 4638 following 4639 loads will not see 4640 stale data. 4641 4642 atomicrmw acq_rel - workgroup - local 1. waitcnt vmcnt(0) & vscnt(0) 4643 4644 - If CU wavefront execution mode, omit. 4645 - If OpenCL, omit. 4646 - Could be split into 4647 separate s_waitcnt 4648 vmcnt(0) and s_waitcnt 4649 vscnt(0) to allow 4650 them to be 4651 independently moved 4652 according to the 4653 following rules. 4654 - s_waitcnt vmcnt(0) 4655 must happen after 4656 any preceding 4657 global/generic load/load 4658 atomic/ 4659 atomicrmw-with-return-value. 4660 - s_waitcnt vscnt(0) 4661 must happen after 4662 any preceding 4663 global/generic 4664 store/store atomic/ 4665 atomicrmw-no-return-value. 4666 - Must happen before 4667 the following 4668 store. 4669 - Ensures that all 4670 global memory 4671 operations have 4672 completed before 4673 performing the 4674 store that is being 4675 released. 4676 4677 1. ds_atomic 2. ds_atomic 4678 2. s_waitcnt lgkmcnt(0) 3. s_waitcnt lgkmcnt(0) 4679 4680 - If OpenCL, omit. - If OpenCL, omit. 4681 - Must happen before - Must happen before 4682 any following the following 4683 global/generic buffer_gl0_inv. 4684 load/load 4685 atomic/store/store 4686 atomic/atomicrmw. 4687 - Ensures any - Ensures any 4688 following global following global 4689 data read is no data read is no 4690 older than the load older than the load 4691 atomic value being atomic value being 4692 acquired. acquired. 4693 4694 4. buffer_gl0_inv 4695 4696 - If CU wavefront execution mode, omit. 4697 - If OpenCL omit. 4698 - Ensures that 4699 following 4700 loads will not see 4701 stale data. 4702 4703 atomicrmw acq_rel - workgroup - generic 1. s_waitcnt lgkmcnt(0) 1. s_waitcnt lgkmcnt(0) & 4704 vmcnt(0) & vscnt(0) 4705 4706 - If CU wavefront execution mode, omit vmcnt and 4707 vscnt. 4708 - If OpenCL, omit. - If OpenCL, omit 4709 waitcnt lgkmcnt(0). 4710 - Must happen after 4711 any preceding 4712 local/generic 4713 load/store/load 4714 atomic/store 4715 atomic/atomicrmw. 4716 - Could be split into 4717 separate s_waitcnt 4718 vmcnt(0), s_waitcnt 4719 vscnt(0) and s_waitcnt 4720 lgkmcnt(0) to allow 4721 them to be 4722 independently moved 4723 according to the 4724 following rules. 4725 - s_waitcnt vmcnt(0) 4726 must happen after 4727 any preceding 4728 global/generic load/load 4729 atomic/ 4730 atomicrmw-with-return-value. 4731 - s_waitcnt vscnt(0) 4732 must happen after 4733 any preceding 4734 global/generic 4735 store/store 4736 atomic/ 4737 atomicrmw-no-return-value. 4738 - s_waitcnt lgkmcnt(0) 4739 must happen after 4740 any preceding 4741 local/generic load/store/load 4742 atomic/store atomic/atomicrmw. 4743 - Must happen before - Must happen before 4744 the following the following 4745 atomicrmw. atomicrmw. 4746 - Ensures that all - Ensures that all 4747 memory operations memory operations 4748 to local have have 4749 completed before completed before 4750 performing the performing the 4751 atomicrmw that is atomicrmw that is 4752 being released. being released. 4753 4754 2. flat_atomic 2. flat_atomic 4755 3. s_waitcnt lgkmcnt(0) 3. s_waitcnt lgkmcnt(0) & 4756 vm/vscnt(0) 4757 4758 - If CU wavefront execution mode, omit vm/vscnt. 4759 - If OpenCL, omit. - If OpenCL, omit 4760 waitcnt lgkmcnt(0). 4761 - Must happen before - Must happen before 4762 any following the following 4763 global/generic buffer_gl0_inv. 4764 load/load 4765 atomic/store/store 4766 atomic/atomicrmw. 4767 - Ensures any - Ensures any 4768 following global following global 4769 data read is no data read is no 4770 older than the load older than the load 4771 atomic value being atomic value being 4772 acquired. acquired. 4773 4774 3. buffer_gl0_inv 4775 4776 - If CU wavefront execution mode, omit. 4777 - Ensures that 4778 following 4779 loads will not see 4780 stale data. 4781 4782 atomicrmw acq_rel - agent - global 1. s_waitcnt lgkmcnt(0) & 1. s_waitcnt lgkmcnt(0) & 4783 - system vmcnt(0) vmcnt(0) & vscnt(0) 4784 4785 - If OpenCL, omit - If OpenCL, omit 4786 lgkmcnt(0). lgkmcnt(0). 4787 - Could be split into - Could be split into 4788 separate s_waitcnt separate s_waitcnt 4789 vmcnt(0) and vmcnt(0), s_waitcnt 4790 s_waitcnt vscnt(0) and s_waitcnt 4791 lgkmcnt(0) to allow lgkmcnt(0) to allow 4792 them to be them to be 4793 independently moved independently moved 4794 according to the according to the 4795 following rules. following rules. 4796 - s_waitcnt vmcnt(0) - s_waitcnt vmcnt(0) 4797 must happen after must happen after 4798 any preceding any preceding 4799 global/generic global/generic 4800 load/store/load load/load atomic/ 4801 atomic/store atomicrmw-with-return-value. 4802 atomic/atomicrmw. 4803 - s_waitcnt vscnt(0) 4804 must happen after 4805 any preceding 4806 global/generic 4807 store/store atomic/ 4808 atomicrmw-no-return-value. 4809 - s_waitcnt lgkmcnt(0) - s_waitcnt lgkmcnt(0) 4810 must happen after must happen after 4811 any preceding any preceding 4812 local/generic local/generic 4813 load/store/load load/store/load 4814 atomic/store atomic/store 4815 atomic/atomicrmw. atomic/atomicrmw. 4816 - Must happen before - Must happen before 4817 the following the following 4818 atomicrmw. atomicrmw. 4819 - Ensures that all - Ensures that all 4820 memory operations memory operations 4821 to global have to global have 4822 completed before completed before 4823 performing the performing the 4824 atomicrmw that is atomicrmw that is 4825 being released. being released. 4826 4827 2. buffer/global/flat_atomic 2. buffer/global_atomic 4828 3. s_waitcnt vmcnt(0) 3. s_waitcnt vm/vscnt(0) 4829 4830 - Use vmcnt if atomic with 4831 return and vscnt if atomic 4832 with no-return. 4833 waitcnt lgkmcnt(0). 4834 - Must happen before - Must happen before 4835 following following 4836 buffer_wbinvl1_vol. buffer_gl*_inv. 4837 - Ensures the - Ensures the 4838 atomicrmw has atomicrmw has 4839 completed before completed before 4840 invalidating the invalidating the 4841 cache. caches. 4842 4843 4. buffer_wbinvl1_vol 4. buffer_gl0_inv; 4844 buffer_gl1_inv 4845 4846 - Must happen before - Must happen before 4847 any following any following 4848 global/generic global/generic 4849 load/load load/load 4850 atomic/atomicrmw. atomic/atomicrmw. 4851 - Ensures that - Ensures that 4852 following loads following loads 4853 will not see stale will not see stale 4854 global data. global data. 4855 4856 atomicrmw acq_rel - agent - generic 1. s_waitcnt lgkmcnt(0) & 1. s_waitcnt lgkmcnt(0) & 4857 - system vmcnt(0) vmcnt(0) & vscnt(0) 4858 4859 - If OpenCL, omit - If OpenCL, omit 4860 lgkmcnt(0). lgkmcnt(0). 4861 - Could be split into - Could be split into 4862 separate s_waitcnt separate s_waitcnt 4863 vmcnt(0) and vmcnt(0), s_waitcnt 4864 s_waitcnt vscnt(0) and s_waitcnt 4865 lgkmcnt(0) to allow lgkmcnt(0) to allow 4866 them to be them to be 4867 independently moved independently moved 4868 according to the according to the 4869 following rules. following rules. 4870 - s_waitcnt vmcnt(0) - s_waitcnt vmcnt(0) 4871 must happen after must happen after 4872 any preceding any preceding 4873 global/generic global/generic 4874 load/store/load load/load atomic 4875 atomic/store atomicrmw-with-return-value. 4876 atomic/atomicrmw. 4877 - s_waitcnt vscnt(0) 4878 must happen after 4879 any preceding 4880 global/generic 4881 store/store atomic/ 4882 atomicrmw-no-return-value. 4883 - s_waitcnt lgkmcnt(0) - s_waitcnt lgkmcnt(0) 4884 must happen after must happen after 4885 any preceding any preceding 4886 local/generic local/generic 4887 load/store/load load/store/load 4888 atomic/store atomic/store 4889 atomic/atomicrmw. atomic/atomicrmw. 4890 - Must happen before - Must happen before 4891 the following the following 4892 atomicrmw. atomicrmw. 4893 - Ensures that all - Ensures that all 4894 memory operations memory operations 4895 to global have have 4896 completed before completed before 4897 performing the performing the 4898 atomicrmw that is atomicrmw that is 4899 being released. being released. 4900 4901 2. flat_atomic 2. flat_atomic 4902 3. s_waitcnt vmcnt(0) & 3. s_waitcnt vm/vscnt(0) & 4903 lgkmcnt(0) lgkmcnt(0) 4904 4905 - If OpenCL, omit - If OpenCL, omit 4906 lgkmcnt(0). lgkmcnt(0). 4907 - Use vmcnt if atomic with 4908 return and vscnt if atomic 4909 with no-return. 4910 - Must happen before - Must happen before 4911 following following 4912 buffer_wbinvl1_vol. buffer_gl*_inv. 4913 - Ensures the - Ensures the 4914 atomicrmw has atomicrmw has 4915 completed before completed before 4916 invalidating the invalidating the 4917 cache. caches. 4918 4919 4. buffer_wbinvl1_vol 4. buffer_gl0_inv; 4920 buffer_gl1_inv 4921 4922 - Must happen before - Must happen before 4923 any following any following 4924 global/generic global/generic 4925 load/load load/load 4926 atomic/atomicrmw. atomic/atomicrmw. 4927 - Ensures that - Ensures that 4928 following loads following loads 4929 will not see stale will not see stale 4930 global data. global data. 4931 4932 fence acq_rel - singlethread *none* *none* *none* 4933 - wavefront 4934 fence acq_rel - workgroup *none* 1. s_waitcnt lgkmcnt(0) 1. s_waitcnt lgkmcnt(0) & 4935 vmcnt(0) & vscnt(0) 4936 4937 - If CU wavefront execution mode, omit vmcnt and 4938 vscnt. 4939 - If OpenCL and - If OpenCL and 4940 address space is address space is 4941 not generic, omit. not generic, omit 4942 lgkmcnt(0). 4943 - If OpenCL and 4944 address space is 4945 local, omit 4946 vmcnt(0) and vscnt(0). 4947 - However, - However, 4948 since LLVM since LLVM 4949 currently has no currently has no 4950 address space on address space on 4951 the fence need to the fence need to 4952 conservatively conservatively 4953 always generate always generate 4954 (see comment for (see comment for 4955 previous fence). previous fence). 4956 - Must happen after 4957 any preceding 4958 local/generic 4959 load/load 4960 atomic/store/store 4961 atomic/atomicrmw. 4962 - Could be split into 4963 separate s_waitcnt 4964 vmcnt(0), s_waitcnt 4965 vscnt(0) and s_waitcnt 4966 lgkmcnt(0) to allow 4967 them to be 4968 independently moved 4969 according to the 4970 following rules. 4971 - s_waitcnt vmcnt(0) 4972 must happen after 4973 any preceding 4974 global/generic 4975 load/load 4976 atomic/ 4977 atomicrmw-with-return-value. 4978 - s_waitcnt vscnt(0) 4979 must happen after 4980 any preceding 4981 global/generic 4982 store/store atomic/ 4983 atomicrmw-no-return-value. 4984 - s_waitcnt lgkmcnt(0) 4985 must happen after 4986 any preceding 4987 local/generic 4988 load/store/load 4989 atomic/store atomic/ 4990 atomicrmw. 4991 - Must happen before - Must happen before 4992 any following any following 4993 global/generic global/generic 4994 load/load load/load 4995 atomic/store/store atomic/store/store 4996 atomic/atomicrmw. atomic/atomicrmw. 4997 - Ensures that all - Ensures that all 4998 memory operations memory operations 4999 to local have have 5000 completed before completed before 5001 performing any performing any 5002 following global following global 5003 memory operations. memory operations. 5004 - Ensures that the - Ensures that the 5005 preceding preceding 5006 local/generic load local/generic load 5007 atomic/atomicrmw atomic/atomicrmw 5008 with an equal or with an equal or 5009 wider sync scope wider sync scope 5010 and memory ordering and memory ordering 5011 stronger than stronger than 5012 unordered (this is unordered (this is 5013 termed the termed the 5014 acquire-fence-paired-atomic acquire-fence-paired-atomic 5015 ) has completed ) has completed 5016 before following before following 5017 global memory global memory 5018 operations. This operations. This 5019 satisfies the satisfies the 5020 requirements of requirements of 5021 acquire. acquire. 5022 - Ensures that all - Ensures that all 5023 previous memory previous memory 5024 operations have operations have 5025 completed before a completed before a 5026 following following 5027 local/generic store local/generic store 5028 atomic/atomicrmw atomic/atomicrmw 5029 with an equal or with an equal or 5030 wider sync scope wider sync scope 5031 and memory ordering and memory ordering 5032 stronger than stronger than 5033 unordered (this is unordered (this is 5034 termed the termed the 5035 release-fence-paired-atomic release-fence-paired-atomic 5036 ). This satisfies the ). This satisfies the 5037 requirements of requirements of 5038 release. release. 5039 - Must happen before 5040 the following 5041 buffer_gl0_inv. 5042 - Ensures that the 5043 acquire-fence-paired 5044 atomic has completed 5045 before invalidating 5046 the 5047 cache. Therefore 5048 any following 5049 locations read must 5050 be no older than 5051 the value read by 5052 the 5053 acquire-fence-paired-atomic. 5054 5055 3. buffer_gl0_inv 5056 5057 - If CU wavefront execution mode, omit. 5058 - Ensures that 5059 following 5060 loads will not see 5061 stale data. 5062 5063 fence acq_rel - agent *none* 1. s_waitcnt lgkmcnt(0) & 1. s_waitcnt lgkmcnt(0) & 5064 - system vmcnt(0) vmcnt(0) & vscnt(0) 5065 5066 - If OpenCL and - If OpenCL and 5067 address space is address space is 5068 not generic, omit not generic, omit 5069 lgkmcnt(0). lgkmcnt(0). 5070 - If OpenCL and 5071 address space is 5072 local, omit 5073 vmcnt(0) and vscnt(0). 5074 - However, since LLVM - However, since LLVM 5075 currently has no currently has no 5076 address space on address space on 5077 the fence need to the fence need to 5078 conservatively conservatively 5079 always generate always generate 5080 (see comment for (see comment for 5081 previous fence). previous fence). 5082 - Could be split into - Could be split into 5083 separate s_waitcnt separate s_waitcnt 5084 vmcnt(0) and vmcnt(0), s_waitcnt 5085 s_waitcnt vscnt(0) and s_waitcnt 5086 lgkmcnt(0) to allow lgkmcnt(0) to allow 5087 them to be them to be 5088 independently moved independently moved 5089 according to the according to the 5090 following rules. following rules. 5091 - s_waitcnt vmcnt(0) - s_waitcnt vmcnt(0) 5092 must happen after must happen after 5093 any preceding any preceding 5094 global/generic global/generic 5095 load/store/load load/load 5096 atomic/store atomic/ 5097 atomic/atomicrmw. atomicrmw-with-return-value. 5098 - s_waitcnt vscnt(0) 5099 must happen after 5100 any preceding 5101 global/generic 5102 store/store atomic/ 5103 atomicrmw-no-return-value. 5104 - s_waitcnt lgkmcnt(0) - s_waitcnt lgkmcnt(0) 5105 must happen after must happen after 5106 any preceding any preceding 5107 local/generic local/generic 5108 load/store/load load/store/load 5109 atomic/store atomic/store 5110 atomic/atomicrmw. atomic/atomicrmw. 5111 - Must happen before - Must happen before 5112 the following the following 5113 buffer_wbinvl1_vol. buffer_gl*_inv. 5114 - Ensures that the - Ensures that the 5115 preceding preceding 5116 global/local/generic global/local/generic 5117 load load 5118 atomic/atomicrmw atomic/atomicrmw 5119 with an equal or with an equal or 5120 wider sync scope wider sync scope 5121 and memory ordering and memory ordering 5122 stronger than stronger than 5123 unordered (this is unordered (this is 5124 termed the termed the 5125 acquire-fence-paired-atomic acquire-fence-paired-atomic 5126 ) has completed ) has completed 5127 before invalidating before invalidating 5128 the cache. This the caches. This 5129 satisfies the satisfies the 5130 requirements of requirements of 5131 acquire. acquire. 5132 - Ensures that all - Ensures that all 5133 previous memory previous memory 5134 operations have operations have 5135 completed before a completed before a 5136 following following 5137 global/local/generic global/local/generic 5138 store store 5139 atomic/atomicrmw atomic/atomicrmw 5140 with an equal or with an equal or 5141 wider sync scope wider sync scope 5142 and memory ordering and memory ordering 5143 stronger than stronger than 5144 unordered (this is unordered (this is 5145 termed the termed the 5146 release-fence-paired-atomic release-fence-paired-atomic 5147 ). This satisfies the ). This satisfies the 5148 requirements of requirements of 5149 release. release. 5150 5151 2. buffer_wbinvl1_vol 2. buffer_gl0_inv; 5152 buffer_gl1_inv 5153 5154 - Must happen before - Must happen before 5155 any following any following 5156 global/generic global/generic 5157 load/load load/load 5158 atomic/store/store atomic/store/store 5159 atomic/atomicrmw. atomic/atomicrmw. 5160 - Ensures that - Ensures that 5161 following loads following loads 5162 will not see stale will not see stale 5163 global data. This global data. This 5164 satisfies the satisfies the 5165 requirements of requirements of 5166 acquire. acquire. 5167 5168 **Sequential Consistent Atomic** 5169 ---------------------------------------------------------------------------------------------------------------------- 5170 load atomic seq_cst - singlethread - global *Same as corresponding *Same as corresponding 5171 - wavefront - local load atomic acquire, load atomic acquire, 5172 - generic except must generated except must generated 5173 all instructions even all instructions even 5174 for OpenCL.* for OpenCL.* 5175 load atomic seq_cst - workgroup - global 1. s_waitcnt lgkmcnt(0) 1. s_waitcnt lgkmcnt(0) & 5176 - generic vmcnt(0) & vscnt(0) 5177 5178 - If CU wavefront execution mode, omit vmcnt and 5179 vscnt. 5180 - Could be split into 5181 separate s_waitcnt 5182 vmcnt(0), s_waitcnt 5183 vscnt(0) and s_waitcnt 5184 lgkmcnt(0) to allow 5185 them to be 5186 independently moved 5187 according to the 5188 following rules. 5189 - Must - waitcnt lgkmcnt(0) must 5190 happen after happen after 5191 preceding preceding 5192 global/generic load local load 5193 atomic/store atomic/store 5194 atomic/atomicrmw atomic/atomicrmw 5195 with memory with memory 5196 ordering of seq_cst ordering of seq_cst 5197 and with equal or and with equal or 5198 wider sync scope. wider sync scope. 5199 (Note that seq_cst (Note that seq_cst 5200 fences have their fences have their 5201 own s_waitcnt own s_waitcnt 5202 lgkmcnt(0) and so do lgkmcnt(0) and so do 5203 not need to be not need to be 5204 considered.) considered.) 5205 - waitcnt vmcnt(0) 5206 Must happen after 5207 preceding 5208 global/generic load 5209 atomic/ 5210 atomicrmw-with-return-value 5211 with memory 5212 ordering of seq_cst 5213 and with equal or 5214 wider sync scope. 5215 (Note that seq_cst 5216 fences have their 5217 own s_waitcnt 5218 vmcnt(0) and so do 5219 not need to be 5220 considered.) 5221 - waitcnt vscnt(0) 5222 Must happen after 5223 preceding 5224 global/generic store 5225 atomic/ 5226 atomicrmw-no-return-value 5227 with memory 5228 ordering of seq_cst 5229 and with equal or 5230 wider sync scope. 5231 (Note that seq_cst 5232 fences have their 5233 own s_waitcnt 5234 vscnt(0) and so do 5235 not need to be 5236 considered.) 5237 - Ensures any - Ensures any 5238 preceding preceding 5239 sequential sequential 5240 consistent local consistent global/local 5241 memory instructions memory instructions 5242 have completed have completed 5243 before executing before executing 5244 this sequentially this sequentially 5245 consistent consistent 5246 instruction. This instruction. This 5247 prevents reordering prevents reordering 5248 a seq_cst store a seq_cst store 5249 followed by a followed by a 5250 seq_cst load. (Note seq_cst load. (Note 5251 that seq_cst is that seq_cst is 5252 stronger than stronger than 5253 acquire/release as acquire/release as 5254 the reordering of the reordering of 5255 load acquire load acquire 5256 followed by a store followed by a store 5257 release is release is 5258 prevented by the prevented by the 5259 waitcnt of waitcnt of 5260 the release, but the release, but 5261 there is nothing there is nothing 5262 preventing a store preventing a store 5263 release followed by release followed by 5264 load acquire from load acquire from 5265 competing out of competing out of 5266 order.) order.) 5267 5268 2. *Following 2. *Following 5269 instructions same as instructions same as 5270 corresponding load corresponding load 5271 atomic acquire, atomic acquire, 5272 except must generated except must generated 5273 all instructions even all instructions even 5274 for OpenCL.* for OpenCL.* 5275 load atomic seq_cst - workgroup - local *Same as corresponding 5276 load atomic acquire, 5277 except must generated 5278 all instructions even 5279 for OpenCL.* 5280 5281 1. s_waitcnt vmcnt(0) & vscnt(0) 5282 5283 - If CU wavefront execution mode, omit. 5284 - Could be split into 5285 separate s_waitcnt 5286 vmcnt(0) and s_waitcnt 5287 vscnt(0) to allow 5288 them to be 5289 independently moved 5290 according to the 5291 following rules. 5292 - waitcnt vmcnt(0) 5293 Must happen after 5294 preceding 5295 global/generic load 5296 atomic/ 5297 atomicrmw-with-return-value 5298 with memory 5299 ordering of seq_cst 5300 and with equal or 5301 wider sync scope. 5302 (Note that seq_cst 5303 fences have their 5304 own s_waitcnt 5305 vmcnt(0) and so do 5306 not need to be 5307 considered.) 5308 - waitcnt vscnt(0) 5309 Must happen after 5310 preceding 5311 global/generic store 5312 atomic/ 5313 atomicrmw-no-return-value 5314 with memory 5315 ordering of seq_cst 5316 and with equal or 5317 wider sync scope. 5318 (Note that seq_cst 5319 fences have their 5320 own s_waitcnt 5321 vscnt(0) and so do 5322 not need to be 5323 considered.) 5324 - Ensures any 5325 preceding 5326 sequential 5327 consistent global 5328 memory instructions 5329 have completed 5330 before executing 5331 this sequentially 5332 consistent 5333 instruction. This 5334 prevents reordering 5335 a seq_cst store 5336 followed by a 5337 seq_cst load. (Note 5338 that seq_cst is 5339 stronger than 5340 acquire/release as 5341 the reordering of 5342 load acquire 5343 followed by a store 5344 release is 5345 prevented by the 5346 waitcnt of 5347 the release, but 5348 there is nothing 5349 preventing a store 5350 release followed by 5351 load acquire from 5352 competing out of 5353 order.) 5354 5355 2. *Following 5356 instructions same as 5357 corresponding load 5358 atomic acquire, 5359 except must generated 5360 all instructions even 5361 for OpenCL.* 5362 5363 load atomic seq_cst - agent - global 1. s_waitcnt lgkmcnt(0) & 1. s_waitcnt lgkmcnt(0) & 5364 - system - generic vmcnt(0) vmcnt(0) & vscnt(0) 5365 5366 - Could be split into - Could be split into 5367 separate s_waitcnt separate s_waitcnt 5368 vmcnt(0) vmcnt(0), s_waitcnt 5369 and s_waitcnt vscnt(0) and s_waitcnt 5370 lgkmcnt(0) to allow lgkmcnt(0) to allow 5371 them to be them to be 5372 independently moved independently moved 5373 according to the according to the 5374 following rules. following rules. 5375 - waitcnt lgkmcnt(0) - waitcnt lgkmcnt(0) 5376 must happen after must happen after 5377 preceding preceding 5378 global/generic load local load 5379 atomic/store atomic/store 5380 atomic/atomicrmw atomic/atomicrmw 5381 with memory with memory 5382 ordering of seq_cst ordering of seq_cst 5383 and with equal or and with equal or 5384 wider sync scope. wider sync scope. 5385 (Note that seq_cst (Note that seq_cst 5386 fences have their fences have their 5387 own s_waitcnt own s_waitcnt 5388 lgkmcnt(0) and so do lgkmcnt(0) and so do 5389 not need to be not need to be 5390 considered.) considered.) 5391 - waitcnt vmcnt(0) - waitcnt vmcnt(0) 5392 must happen after must happen after 5393 preceding preceding 5394 global/generic load global/generic load 5395 atomic/store atomic/ 5396 atomic/atomicrmw atomicrmw-with-return-value 5397 with memory with memory 5398 ordering of seq_cst ordering of seq_cst 5399 and with equal or and with equal or 5400 wider sync scope. wider sync scope. 5401 (Note that seq_cst (Note that seq_cst 5402 fences have their fences have their 5403 own s_waitcnt own s_waitcnt 5404 vmcnt(0) and so do vmcnt(0) and so do 5405 not need to be not need to be 5406 considered.) considered.) 5407 - waitcnt vscnt(0) 5408 Must happen after 5409 preceding 5410 global/generic store 5411 atomic/ 5412 atomicrmw-no-return-value 5413 with memory 5414 ordering of seq_cst 5415 and with equal or 5416 wider sync scope. 5417 (Note that seq_cst 5418 fences have their 5419 own s_waitcnt 5420 vscnt(0) and so do 5421 not need to be 5422 considered.) 5423 - Ensures any - Ensures any 5424 preceding preceding 5425 sequential sequential 5426 consistent global consistent global 5427 memory instructions memory instructions 5428 have completed have completed 5429 before executing before executing 5430 this sequentially this sequentially 5431 consistent consistent 5432 instruction. This instruction. This 5433 prevents reordering prevents reordering 5434 a seq_cst store a seq_cst store 5435 followed by a followed by a 5436 seq_cst load. (Note seq_cst load. (Note 5437 that seq_cst is that seq_cst is 5438 stronger than stronger than 5439 acquire/release as acquire/release as 5440 the reordering of the reordering of 5441 load acquire load acquire 5442 followed by a store followed by a store 5443 release is release is 5444 prevented by the prevented by the 5445 waitcnt of waitcnt of 5446 the release, but the release, but 5447 there is nothing there is nothing 5448 preventing a store preventing a store 5449 release followed by release followed by 5450 load acquire from load acquire from 5451 competing out of competing out of 5452 order.) order.) 5453 5454 2. *Following 2. *Following 5455 instructions same as instructions same as 5456 corresponding load corresponding load 5457 atomic acquire, atomic acquire, 5458 except must generated except must generated 5459 all instructions even all instructions even 5460 for OpenCL.* for OpenCL.* 5461 store atomic seq_cst - singlethread - global *Same as corresponding *Same as corresponding 5462 - wavefront - local store atomic release, store atomic release, 5463 - workgroup - generic except must generated except must generated 5464 all instructions even all instructions even 5465 for OpenCL.* for OpenCL.* 5466 store atomic seq_cst - agent - global *Same as corresponding *Same as corresponding 5467 - system - generic store atomic release, store atomic release, 5468 except must generated except must generated 5469 all instructions even all instructions even 5470 for OpenCL.* for OpenCL.* 5471 atomicrmw seq_cst - singlethread - global *Same as corresponding *Same as corresponding 5472 - wavefront - local atomicrmw acq_rel, atomicrmw acq_rel, 5473 - workgroup - generic except must generated except must generated 5474 all instructions even all instructions even 5475 for OpenCL.* for OpenCL.* 5476 atomicrmw seq_cst - agent - global *Same as corresponding *Same as corresponding 5477 - system - generic atomicrmw acq_rel, atomicrmw acq_rel, 5478 except must generated except must generated 5479 all instructions even all instructions even 5480 for OpenCL.* for OpenCL.* 5481 fence seq_cst - singlethread *none* *Same as corresponding *Same as corresponding 5482 - wavefront fence acq_rel, fence acq_rel, 5483 - workgroup except must generated except must generated 5484 - agent all instructions even all instructions even 5485 - system for OpenCL.* for OpenCL.* 5486 ============ ============ ============== ========== =============================== ================================== 5487 5488The memory order also adds the single thread optimization constrains defined in 5489table 5490:ref:`amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-gfx6-gfx10-table`. 5491 5492 .. table:: AMDHSA Memory Model Single Thread Optimization Constraints GFX6-GFX10 5493 :name: amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-gfx6-gfx10-table 5494 5495 ============ ============================================================== 5496 LLVM Memory Optimization Constraints 5497 Ordering 5498 ============ ============================================================== 5499 unordered *none* 5500 monotonic *none* 5501 acquire - If a load atomic/atomicrmw then no following load/load 5502 atomic/store/ store atomic/atomicrmw/fence instruction can 5503 be moved before the acquire. 5504 - If a fence then same as load atomic, plus no preceding 5505 associated fence-paired-atomic can be moved after the fence. 5506 release - If a store atomic/atomicrmw then no preceding load/load 5507 atomic/store/ store atomic/atomicrmw/fence instruction can 5508 be moved after the release. 5509 - If a fence then same as store atomic, plus no following 5510 associated fence-paired-atomic can be moved before the 5511 fence. 5512 acq_rel Same constraints as both acquire and release. 5513 seq_cst - If a load atomic then same constraints as acquire, plus no 5514 preceding sequentially consistent load atomic/store 5515 atomic/atomicrmw/fence instruction can be moved after the 5516 seq_cst. 5517 - If a store atomic then the same constraints as release, plus 5518 no following sequentially consistent load atomic/store 5519 atomic/atomicrmw/fence instruction can be moved before the 5520 seq_cst. 5521 - If an atomicrmw/fence then same constraints as acq_rel. 5522 ============ ============================================================== 5523 5524Trap Handler ABI 5525~~~~~~~~~~~~~~~~ 5526 5527For code objects generated by AMDGPU backend for HSA [HSA]_ compatible runtimes 5528(such as ROCm [AMD-ROCm]_), the runtime installs a trap handler that supports 5529the ``s_trap`` instruction with the following usage: 5530 5531 .. table:: AMDGPU Trap Handler for AMDHSA OS 5532 :name: amdgpu-trap-handler-for-amdhsa-os-table 5533 5534 =================== =============== =============== ======================= 5535 Usage Code Sequence Trap Handler Description 5536 Inputs 5537 =================== =============== =============== ======================= 5538 reserved ``s_trap 0x00`` Reserved by hardware. 5539 ``debugtrap(arg)`` ``s_trap 0x01`` ``SGPR0-1``: Reserved for HSA 5540 ``queue_ptr`` ``debugtrap`` 5541 ``VGPR0``: intrinsic (not 5542 ``arg`` implemented). 5543 ``llvm.trap`` ``s_trap 0x02`` ``SGPR0-1``: Causes dispatch to be 5544 ``queue_ptr`` terminated and its 5545 associated queue put 5546 into the error state. 5547 ``llvm.debugtrap`` ``s_trap 0x03`` - If debugger not 5548 installed then 5549 behaves as a 5550 no-operation. The 5551 trap handler is 5552 entered and 5553 immediately returns 5554 to continue 5555 execution of the 5556 wavefront. 5557 - If the debugger is 5558 installed, causes 5559 the debug trap to be 5560 reported by the 5561 debugger and the 5562 wavefront is put in 5563 the halt state until 5564 resumed by the 5565 debugger. 5566 reserved ``s_trap 0x04`` Reserved. 5567 reserved ``s_trap 0x05`` Reserved. 5568 reserved ``s_trap 0x06`` Reserved. 5569 debugger breakpoint ``s_trap 0x07`` Reserved for debugger 5570 breakpoints. 5571 reserved ``s_trap 0x08`` Reserved. 5572 reserved ``s_trap 0xfe`` Reserved. 5573 reserved ``s_trap 0xff`` Reserved. 5574 =================== =============== =============== ======================= 5575 5576AMDPAL 5577------ 5578 5579This section provides code conventions used when the target triple OS is 5580``amdpal`` (see :ref:`amdgpu-target-triples`) for passing runtime parameters 5581from the application/runtime to each invocation of a hardware shader. These 5582parameters include both generic, application-controlled parameters called 5583*user data* as well as system-generated parameters that are a product of the 5584draw or dispatch execution. 5585 5586User Data 5587~~~~~~~~~ 5588 5589Each hardware stage has a set of 32-bit *user data registers* which can be 5590written from a command buffer and then loaded into SGPRs when waves are launched 5591via a subsequent dispatch or draw operation. This is the way most arguments are 5592passed from the application/runtime to a hardware shader. 5593 5594Compute User Data 5595~~~~~~~~~~~~~~~~~ 5596 5597Compute shader user data mappings are simpler than graphics shaders, and have a 5598fixed mapping. 5599 5600Note that there are always 10 available *user data entries* in registers - 5601entries beyond that limit must be fetched from memory (via the spill table 5602pointer) by the shader. 5603 5604 .. table:: PAL Compute Shader User Data Registers 5605 :name: pal-compute-user-data-registers 5606 5607 ============= ================================ 5608 User Register Description 5609 ============= ================================ 5610 0 Global Internal Table (32-bit pointer) 5611 1 Per-Shader Internal Table (32-bit pointer) 5612 2 - 11 Application-Controlled User Data (10 32-bit values) 5613 12 Spill Table (32-bit pointer) 5614 13 - 14 Thread Group Count (64-bit pointer) 5615 15 GDS Range 5616 ============= ================================ 5617 5618Graphics User Data 5619~~~~~~~~~~~~~~~~~~ 5620 5621Graphics pipelines support a much more flexible user data mapping: 5622 5623 .. table:: PAL Graphics Shader User Data Registers 5624 :name: pal-graphics-user-data-registers 5625 5626 ============= ================================ 5627 User Register Description 5628 ============= ================================ 5629 0 Global Internal Table (32-bit pointer) 5630 + Per-Shader Internal Table (32-bit pointer) 5631 + 1-15 Application Controlled User Data 5632 (1-15 Contiguous 32-bit Values in Registers) 5633 + Spill Table (32-bit pointer) 5634 + Draw Index (First Stage Only) 5635 + Vertex Offset (First Stage Only) 5636 + Instance Offset (First Stage Only) 5637 ============= ================================ 5638 5639 The placement of the global internal table remains fixed in the first *user 5640 data SGPR register*. Otherwise all parameters are optional, and can be mapped 5641 to any desired *user data SGPR register*, with the following restrictions: 5642 5643 * Draw Index, Vertex Offset, and Instance Offset can only be used by the first 5644 active hardware stage in a graphics pipeline (i.e. where the API vertex 5645 shader runs). 5646 5647 * Application-controlled user data must be mapped into a contiguous range of 5648 user data registers. 5649 5650 * The application-controlled user data range supports compaction remapping, so 5651 only *entries* that are actually consumed by the shader must be assigned to 5652 corresponding *registers*. Note that in order to support an efficient runtime 5653 implementation, the remapping must pack *registers* in the same order as 5654 *entries*, with unused *entries* removed. 5655 5656.. _pal_global_internal_table: 5657 5658Global Internal Table 5659~~~~~~~~~~~~~~~~~~~~~ 5660 5661The global internal table is a table of *shader resource descriptors* (SRDs) 5662that define how certain engine-wide, runtime-managed resources should be 5663accessed from a shader. The majority of these resources have HW-defined formats, 5664and it is up to the compiler to write/read data as required by the target 5665hardware. 5666 5667The following table illustrates the required format: 5668 5669 .. table:: PAL Global Internal Table 5670 :name: pal-git-table 5671 5672 ============= ================================ 5673 Offset Description 5674 ============= ================================ 5675 0-3 Graphics Scratch SRD 5676 4-7 Compute Scratch SRD 5677 8-11 ES/GS Ring Output SRD 5678 12-15 ES/GS Ring Input SRD 5679 16-19 GS/VS Ring Output #0 5680 20-23 GS/VS Ring Output #1 5681 24-27 GS/VS Ring Output #2 5682 28-31 GS/VS Ring Output #3 5683 32-35 GS/VS Ring Input SRD 5684 36-39 Tessellation Factor Buffer SRD 5685 40-43 Off-Chip LDS Buffer SRD 5686 44-47 Off-Chip Param Cache Buffer SRD 5687 48-51 Sample Position Buffer SRD 5688 52 vaRange::ShadowDescriptorTable High Bits 5689 ============= ================================ 5690 5691 The pointer to the global internal table passed to the shader as user data 5692 is a 32-bit pointer. The top 32 bits should be assumed to be the same as 5693 the top 32 bits of the pipeline, so the shader may use the program 5694 counter's top 32 bits. 5695 5696Unspecified OS 5697-------------- 5698 5699This section provides code conventions used when the target triple OS is 5700empty (see :ref:`amdgpu-target-triples`). 5701 5702Trap Handler ABI 5703~~~~~~~~~~~~~~~~ 5704 5705For code objects generated by AMDGPU backend for non-amdhsa OS, the runtime does 5706not install a trap handler. The ``llvm.trap`` and ``llvm.debugtrap`` 5707instructions are handled as follows: 5708 5709 .. table:: AMDGPU Trap Handler for Non-AMDHSA OS 5710 :name: amdgpu-trap-handler-for-non-amdhsa-os-table 5711 5712 =============== =============== =========================================== 5713 Usage Code Sequence Description 5714 =============== =============== =========================================== 5715 llvm.trap s_endpgm Causes wavefront to be terminated. 5716 llvm.debugtrap *none* Compiler warning given that there is no 5717 trap handler installed. 5718 =============== =============== =========================================== 5719 5720Source Languages 5721================ 5722 5723.. _amdgpu-opencl: 5724 5725OpenCL 5726------ 5727 5728When the language is OpenCL the following differences occur: 5729 57301. The OpenCL memory model is used (see :ref:`amdgpu-amdhsa-memory-model`). 57312. The AMDGPU backend appends additional arguments to the kernel's explicit 5732 arguments for the AMDHSA OS (see 5733 :ref:`opencl-kernel-implicit-arguments-appended-for-amdhsa-os-table`). 57343. Additional metadata is generated 5735 (see :ref:`amdgpu-amdhsa-code-object-metadata`). 5736 5737 .. table:: OpenCL kernel implicit arguments appended for AMDHSA OS 5738 :name: opencl-kernel-implicit-arguments-appended-for-amdhsa-os-table 5739 5740 ======== ==== ========= =========================================== 5741 Position Byte Byte Description 5742 Size Alignment 5743 ======== ==== ========= =========================================== 5744 1 8 8 OpenCL Global Offset X 5745 2 8 8 OpenCL Global Offset Y 5746 3 8 8 OpenCL Global Offset Z 5747 4 8 8 OpenCL address of printf buffer 5748 5 8 8 OpenCL address of virtual queue used by 5749 enqueue_kernel. 5750 6 8 8 OpenCL address of AqlWrap struct used by 5751 enqueue_kernel. 5752 7 8 8 Pointer argument used for Multi-gird 5753 synchronization. 5754 ======== ==== ========= =========================================== 5755 5756.. _amdgpu-hcc: 5757 5758HCC 5759--- 5760 5761When the language is HCC the following differences occur: 5762 57631. The HSA memory model is used (see :ref:`amdgpu-amdhsa-memory-model`). 5764 5765.. _amdgpu-assembler: 5766 5767Assembler 5768--------- 5769 5770AMDGPU backend has LLVM-MC based assembler which is currently in development. 5771It supports AMDGCN GFX6-GFX10. 5772 5773This section describes general syntax for instructions and operands. 5774 5775Instructions 5776~~~~~~~~~~~~ 5777 5778.. toctree:: 5779 :hidden: 5780 5781 AMDGPU/AMDGPUAsmGFX7 5782 AMDGPU/AMDGPUAsmGFX8 5783 AMDGPU/AMDGPUAsmGFX9 5784 AMDGPU/AMDGPUAsmGFX900 5785 AMDGPU/AMDGPUAsmGFX904 5786 AMDGPU/AMDGPUAsmGFX906 5787 AMDGPU/AMDGPUAsmGFX908 5788 AMDGPU/AMDGPUAsmGFX10 5789 AMDGPUModifierSyntax 5790 AMDGPUOperandSyntax 5791 AMDGPUInstructionSyntax 5792 AMDGPUInstructionNotation 5793 5794An instruction has the following :doc:`syntax<AMDGPUInstructionSyntax>`: 5795 5796 | ``<``\ *opcode*\ ``> <``\ *operand0*\ ``>, <``\ *operand1*\ ``>,... 5797 <``\ *modifier0*\ ``> <``\ *modifier1*\ ``>...`` 5798 5799:doc:`Operands<AMDGPUOperandSyntax>` are comma-separated while 5800:doc:`modifiers<AMDGPUModifierSyntax>` are space-separated. 5801 5802The order of operands and modifiers is fixed. 5803Most modifiers are optional and may be omitted. 5804 5805Links to detailed instruction syntax description may be found in the following 5806table. Note that features under development are not included 5807in this description. 5808 5809 ==================================== ====================================== 5810 Core ISA ISA Extensions 5811 ==================================== ====================================== 5812 :doc:`GFX7<AMDGPU/AMDGPUAsmGFX7>` \- 5813 :doc:`GFX8<AMDGPU/AMDGPUAsmGFX8>` \- 5814 :doc:`GFX9<AMDGPU/AMDGPUAsmGFX9>` :doc:`gfx900<AMDGPU/AMDGPUAsmGFX900>` 5815 5816 :doc:`gfx902<AMDGPU/AMDGPUAsmGFX900>` 5817 5818 :doc:`gfx904<AMDGPU/AMDGPUAsmGFX904>` 5819 5820 :doc:`gfx906<AMDGPU/AMDGPUAsmGFX906>` 5821 5822 :doc:`gfx908<AMDGPU/AMDGPUAsmGFX908>` 5823 5824 :doc:`gfx909<AMDGPU/AMDGPUAsmGFX900>` 5825 5826 :doc:`GFX10<AMDGPU/AMDGPUAsmGFX10>` gfx1011 5827 5828 gfx1012 5829 ==================================== ====================================== 5830 5831For more information about instructions, their semantics and supported 5832combinations of operands, refer to one of instruction set architecture manuals 5833[AMD-GCN-GFX6]_, [AMD-GCN-GFX7]_, [AMD-GCN-GFX8]_, [AMD-GCN-GFX9]_ and 5834[AMD-GCN-GFX10]_. 5835 5836Operands 5837~~~~~~~~ 5838 5839Detailed description of operands may be found :doc:`here<AMDGPUOperandSyntax>`. 5840 5841Modifiers 5842~~~~~~~~~ 5843 5844Detailed description of modifiers may be found 5845:doc:`here<AMDGPUModifierSyntax>`. 5846 5847Instruction Examples 5848~~~~~~~~~~~~~~~~~~~~ 5849 5850DS 5851++ 5852 5853.. code-block:: nasm 5854 5855 ds_add_u32 v2, v4 offset:16 5856 ds_write_src2_b64 v2 offset0:4 offset1:8 5857 ds_cmpst_f32 v2, v4, v6 5858 ds_min_rtn_f64 v[8:9], v2, v[4:5] 5859 5860For full list of supported instructions, refer to "LDS/GDS instructions" in ISA 5861Manual. 5862 5863FLAT 5864++++ 5865 5866.. code-block:: nasm 5867 5868 flat_load_dword v1, v[3:4] 5869 flat_store_dwordx3 v[3:4], v[5:7] 5870 flat_atomic_swap v1, v[3:4], v5 glc 5871 flat_atomic_cmpswap v1, v[3:4], v[5:6] glc slc 5872 flat_atomic_fmax_x2 v[1:2], v[3:4], v[5:6] glc 5873 5874For full list of supported instructions, refer to "FLAT instructions" in ISA 5875Manual. 5876 5877MUBUF 5878+++++ 5879 5880.. code-block:: nasm 5881 5882 buffer_load_dword v1, off, s[4:7], s1 5883 buffer_store_dwordx4 v[1:4], v2, ttmp[4:7], s1 offen offset:4 glc tfe 5884 buffer_store_format_xy v[1:2], off, s[4:7], s1 5885 buffer_wbinvl1 5886 buffer_atomic_inc v1, v2, s[8:11], s4 idxen offset:4 slc 5887 5888For full list of supported instructions, refer to "MUBUF Instructions" in ISA 5889Manual. 5890 5891SMRD/SMEM 5892+++++++++ 5893 5894.. code-block:: nasm 5895 5896 s_load_dword s1, s[2:3], 0xfc 5897 s_load_dwordx8 s[8:15], s[2:3], s4 5898 s_load_dwordx16 s[88:103], s[2:3], s4 5899 s_dcache_inv_vol 5900 s_memtime s[4:5] 5901 5902For full list of supported instructions, refer to "Scalar Memory Operations" in 5903ISA Manual. 5904 5905SOP1 5906++++ 5907 5908.. code-block:: nasm 5909 5910 s_mov_b32 s1, s2 5911 s_mov_b64 s[0:1], 0x80000000 5912 s_cmov_b32 s1, 200 5913 s_wqm_b64 s[2:3], s[4:5] 5914 s_bcnt0_i32_b64 s1, s[2:3] 5915 s_swappc_b64 s[2:3], s[4:5] 5916 s_cbranch_join s[4:5] 5917 5918For full list of supported instructions, refer to "SOP1 Instructions" in ISA 5919Manual. 5920 5921SOP2 5922++++ 5923 5924.. code-block:: nasm 5925 5926 s_add_u32 s1, s2, s3 5927 s_and_b64 s[2:3], s[4:5], s[6:7] 5928 s_cselect_b32 s1, s2, s3 5929 s_andn2_b32 s2, s4, s6 5930 s_lshr_b64 s[2:3], s[4:5], s6 5931 s_ashr_i32 s2, s4, s6 5932 s_bfm_b64 s[2:3], s4, s6 5933 s_bfe_i64 s[2:3], s[4:5], s6 5934 s_cbranch_g_fork s[4:5], s[6:7] 5935 5936For full list of supported instructions, refer to "SOP2 Instructions" in ISA 5937Manual. 5938 5939SOPC 5940++++ 5941 5942.. code-block:: nasm 5943 5944 s_cmp_eq_i32 s1, s2 5945 s_bitcmp1_b32 s1, s2 5946 s_bitcmp0_b64 s[2:3], s4 5947 s_setvskip s3, s5 5948 5949For full list of supported instructions, refer to "SOPC Instructions" in ISA 5950Manual. 5951 5952SOPP 5953++++ 5954 5955.. code-block:: nasm 5956 5957 s_barrier 5958 s_nop 2 5959 s_endpgm 5960 s_waitcnt 0 ; Wait for all counters to be 0 5961 s_waitcnt vmcnt(0) & expcnt(0) & lgkmcnt(0) ; Equivalent to above 5962 s_waitcnt vmcnt(1) ; Wait for vmcnt counter to be 1. 5963 s_sethalt 9 5964 s_sleep 10 5965 s_sendmsg 0x1 5966 s_sendmsg sendmsg(MSG_INTERRUPT) 5967 s_trap 1 5968 5969For full list of supported instructions, refer to "SOPP Instructions" in ISA 5970Manual. 5971 5972Unless otherwise mentioned, little verification is performed on the operands 5973of SOPP Instructions, so it is up to the programmer to be familiar with the 5974range or acceptable values. 5975 5976VALU 5977++++ 5978 5979For vector ALU instruction opcodes (VOP1, VOP2, VOP3, VOPC, VOP_DPP, VOP_SDWA), 5980the assembler will automatically use optimal encoding based on its operands. To 5981force specific encoding, one can add a suffix to the opcode of the instruction: 5982 5983* _e32 for 32-bit VOP1/VOP2/VOPC 5984* _e64 for 64-bit VOP3 5985* _dpp for VOP_DPP 5986* _sdwa for VOP_SDWA 5987 5988VOP1/VOP2/VOP3/VOPC examples: 5989 5990.. code-block:: nasm 5991 5992 v_mov_b32 v1, v2 5993 v_mov_b32_e32 v1, v2 5994 v_nop 5995 v_cvt_f64_i32_e32 v[1:2], v2 5996 v_floor_f32_e32 v1, v2 5997 v_bfrev_b32_e32 v1, v2 5998 v_add_f32_e32 v1, v2, v3 5999 v_mul_i32_i24_e64 v1, v2, 3 6000 v_mul_i32_i24_e32 v1, -3, v3 6001 v_mul_i32_i24_e32 v1, -100, v3 6002 v_addc_u32 v1, s[0:1], v2, v3, s[2:3] 6003 v_max_f16_e32 v1, v2, v3 6004 6005VOP_DPP examples: 6006 6007.. code-block:: nasm 6008 6009 v_mov_b32 v0, v0 quad_perm:[0,2,1,1] 6010 v_sin_f32 v0, v0 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0 6011 v_mov_b32 v0, v0 wave_shl:1 6012 v_mov_b32 v0, v0 row_mirror 6013 v_mov_b32 v0, v0 row_bcast:31 6014 v_mov_b32 v0, v0 quad_perm:[1,3,0,1] row_mask:0xa bank_mask:0x1 bound_ctrl:0 6015 v_add_f32 v0, v0, |v0| row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0 6016 v_max_f16 v1, v2, v3 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0 6017 6018VOP_SDWA examples: 6019 6020.. code-block:: nasm 6021 6022 v_mov_b32 v1, v2 dst_sel:BYTE_0 dst_unused:UNUSED_PRESERVE src0_sel:DWORD 6023 v_min_u32 v200, v200, v1 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_1 src1_sel:DWORD 6024 v_sin_f32 v0, v0 dst_unused:UNUSED_PAD src0_sel:WORD_1 6025 v_fract_f32 v0, |v0| dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1 6026 v_cmpx_le_u32 vcc, v1, v2 src0_sel:BYTE_2 src1_sel:WORD_0 6027 6028For full list of supported instructions, refer to "Vector ALU instructions". 6029 6030.. TODO:: 6031 6032 Remove once we switch to code object v3 by default. 6033 6034.. _amdgpu-amdhsa-assembler-predefined-symbols-v2: 6035 6036Code Object V2 Predefined Symbols (-mattr=-code-object-v3) 6037~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 6038 6039.. warning:: Code Object V2 is not the default code object version emitted by 6040 this version of LLVM. For a description of the predefined symbols available 6041 with the default configuration (Code Object V3) see 6042 :ref:`amdgpu-amdhsa-assembler-predefined-symbols-v3`. 6043 6044The AMDGPU assembler defines and updates some symbols automatically. These 6045symbols do not affect code generation. 6046 6047.option.machine_version_major 6048+++++++++++++++++++++++++++++ 6049 6050Set to the GFX major generation number of the target being assembled for. For 6051example, when assembling for a "GFX9" target this will be set to the integer 6052value "9". The possible GFX major generation numbers are presented in 6053:ref:`amdgpu-processors`. 6054 6055.option.machine_version_minor 6056+++++++++++++++++++++++++++++ 6057 6058Set to the GFX minor generation number of the target being assembled for. For 6059example, when assembling for a "GFX810" target this will be set to the integer 6060value "1". The possible GFX minor generation numbers are presented in 6061:ref:`amdgpu-processors`. 6062 6063.option.machine_version_stepping 6064++++++++++++++++++++++++++++++++ 6065 6066Set to the GFX stepping generation number of the target being assembled for. 6067For example, when assembling for a "GFX704" target this will be set to the 6068integer value "4". The possible GFX stepping generation numbers are presented 6069in :ref:`amdgpu-processors`. 6070 6071.kernel.vgpr_count 6072++++++++++++++++++ 6073 6074Set to zero each time a 6075:ref:`amdgpu-amdhsa-assembler-directive-amdgpu_hsa_kernel` directive is 6076encountered. At each instruction, if the current value of this symbol is less 6077than or equal to the maximum VPGR number explicitly referenced within that 6078instruction then the symbol value is updated to equal that VGPR number plus 6079one. 6080 6081.kernel.sgpr_count 6082++++++++++++++++++ 6083 6084Set to zero each time a 6085:ref:`amdgpu-amdhsa-assembler-directive-amdgpu_hsa_kernel` directive is 6086encountered. At each instruction, if the current value of this symbol is less 6087than or equal to the maximum VPGR number explicitly referenced within that 6088instruction then the symbol value is updated to equal that SGPR number plus 6089one. 6090 6091.. _amdgpu-amdhsa-assembler-directives-v2: 6092 6093Code Object V2 Directives (-mattr=-code-object-v3) 6094~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 6095 6096.. warning:: Code Object V2 is not the default code object version emitted by 6097 this version of LLVM. For a description of the directives supported with 6098 the default configuration (Code Object V3) see 6099 :ref:`amdgpu-amdhsa-assembler-directives-v3`. 6100 6101AMDGPU ABI defines auxiliary data in output code object. In assembly source, 6102one can specify them with assembler directives. 6103 6104.hsa_code_object_version major, minor 6105+++++++++++++++++++++++++++++++++++++ 6106 6107*major* and *minor* are integers that specify the version of the HSA code 6108object that will be generated by the assembler. 6109 6110.hsa_code_object_isa [major, minor, stepping, vendor, arch] 6111+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 6112 6113 6114*major*, *minor*, and *stepping* are all integers that describe the instruction 6115set architecture (ISA) version of the assembly program. 6116 6117*vendor* and *arch* are quoted strings. *vendor* should always be equal to 6118"AMD" and *arch* should always be equal to "AMDGPU". 6119 6120By default, the assembler will derive the ISA version, *vendor*, and *arch* 6121from the value of the -mcpu option that is passed to the assembler. 6122 6123.. _amdgpu-amdhsa-assembler-directive-amdgpu_hsa_kernel: 6124 6125.amdgpu_hsa_kernel (name) 6126+++++++++++++++++++++++++ 6127 6128This directives specifies that the symbol with given name is a kernel entry 6129point (label) and the object should contain corresponding symbol of type 6130STT_AMDGPU_HSA_KERNEL. 6131 6132.amd_kernel_code_t 6133++++++++++++++++++ 6134 6135This directive marks the beginning of a list of key / value pairs that are used 6136to specify the amd_kernel_code_t object that will be emitted by the assembler. 6137The list must be terminated by the *.end_amd_kernel_code_t* directive. For any 6138amd_kernel_code_t values that are unspecified a default value will be used. The 6139default value for all keys is 0, with the following exceptions: 6140 6141- *amd_code_version_major* defaults to 1. 6142- *amd_kernel_code_version_minor* defaults to 2. 6143- *amd_machine_kind* defaults to 1. 6144- *amd_machine_version_major*, *machine_version_minor*, and 6145 *amd_machine_version_stepping* are derived from the value of the -mcpu option 6146 that is passed to the assembler. 6147- *kernel_code_entry_byte_offset* defaults to 256. 6148- *wavefront_size* defaults 6 for all targets before GFX10. For GFX10 onwards 6149 defaults to 6 if target feature ``wavefrontsize64`` is enabled, otherwise 5. 6150 Note that wavefront size is specified as a power of two, so a value of **n** 6151 means a size of 2^ **n**. 6152- *call_convention* defaults to -1. 6153- *kernarg_segment_alignment*, *group_segment_alignment*, and 6154 *private_segment_alignment* default to 4. Note that alignments are specified 6155 as a power of 2, so a value of **n** means an alignment of 2^ **n**. 6156- *enable_wgp_mode* defaults to 1 if target feature ``cumode`` is disabled for 6157 GFX10 onwards. 6158- *enable_mem_ordered* defaults to 1 for GFX10 onwards. 6159 6160The *.amd_kernel_code_t* directive must be placed immediately after the 6161function label and before any instructions. 6162 6163For a full list of amd_kernel_code_t keys, refer to AMDGPU ABI document, 6164comments in lib/Target/AMDGPU/AmdKernelCodeT.h and test/CodeGen/AMDGPU/hsa.s. 6165 6166.. _amdgpu-amdhsa-assembler-example-v2: 6167 6168Code Object V2 Example Source Code (-mattr=-code-object-v3) 6169~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 6170 6171.. warning:: Code Object V2 is not the default code object version emitted by 6172 this version of LLVM. For a description of the directives supported with 6173 the default configuration (Code Object V3) see 6174 :ref:`amdgpu-amdhsa-assembler-example-v3`. 6175 6176Here is an example of a minimal assembly source file, defining one HSA kernel: 6177 6178.. code:: 6179 :number-lines: 6180 6181 .hsa_code_object_version 1,0 6182 .hsa_code_object_isa 6183 6184 .hsatext 6185 .globl hello_world 6186 .p2align 8 6187 .amdgpu_hsa_kernel hello_world 6188 6189 hello_world: 6190 6191 .amd_kernel_code_t 6192 enable_sgpr_kernarg_segment_ptr = 1 6193 is_ptr64 = 1 6194 compute_pgm_rsrc1_vgprs = 0 6195 compute_pgm_rsrc1_sgprs = 0 6196 compute_pgm_rsrc2_user_sgpr = 2 6197 compute_pgm_rsrc1_wgp_mode = 0 6198 compute_pgm_rsrc1_mem_ordered = 0 6199 compute_pgm_rsrc1_fwd_progress = 1 6200 .end_amd_kernel_code_t 6201 6202 s_load_dwordx2 s[0:1], s[0:1] 0x0 6203 v_mov_b32 v0, 3.14159 6204 s_waitcnt lgkmcnt(0) 6205 v_mov_b32 v1, s0 6206 v_mov_b32 v2, s1 6207 flat_store_dword v[1:2], v0 6208 s_endpgm 6209 .Lfunc_end0: 6210 .size hello_world, .Lfunc_end0-hello_world 6211 6212.. _amdgpu-amdhsa-assembler-predefined-symbols-v3: 6213 6214Code Object V3 Predefined Symbols (-mattr=+code-object-v3) 6215~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 6216 6217The AMDGPU assembler defines and updates some symbols automatically. These 6218symbols do not affect code generation. 6219 6220.amdgcn.gfx_generation_number 6221+++++++++++++++++++++++++++++ 6222 6223Set to the GFX major generation number of the target being assembled for. For 6224example, when assembling for a "GFX9" target this will be set to the integer 6225value "9". The possible GFX major generation numbers are presented in 6226:ref:`amdgpu-processors`. 6227 6228.amdgcn.gfx_generation_minor 6229++++++++++++++++++++++++++++ 6230 6231Set to the GFX minor generation number of the target being assembled for. For 6232example, when assembling for a "GFX810" target this will be set to the integer 6233value "1". The possible GFX minor generation numbers are presented in 6234:ref:`amdgpu-processors`. 6235 6236.amdgcn.gfx_generation_stepping 6237+++++++++++++++++++++++++++++++ 6238 6239Set to the GFX stepping generation number of the target being assembled for. 6240For example, when assembling for a "GFX704" target this will be set to the 6241integer value "4". The possible GFX stepping generation numbers are presented 6242in :ref:`amdgpu-processors`. 6243 6244.. _amdgpu-amdhsa-assembler-symbol-next_free_vgpr: 6245 6246.amdgcn.next_free_vgpr 6247++++++++++++++++++++++ 6248 6249Set to zero before assembly begins. At each instruction, if the current value 6250of this symbol is less than or equal to the maximum VGPR number explicitly 6251referenced within that instruction then the symbol value is updated to equal 6252that VGPR number plus one. 6253 6254May be used to set the `.amdhsa_next_free_vpgr` directive in 6255:ref:`amdhsa-kernel-directives-table`. 6256 6257May be set at any time, e.g. manually set to zero at the start of each kernel. 6258 6259.. _amdgpu-amdhsa-assembler-symbol-next_free_sgpr: 6260 6261.amdgcn.next_free_sgpr 6262++++++++++++++++++++++ 6263 6264Set to zero before assembly begins. At each instruction, if the current value 6265of this symbol is less than or equal the maximum SGPR number explicitly 6266referenced within that instruction then the symbol value is updated to equal 6267that SGPR number plus one. 6268 6269May be used to set the `.amdhsa_next_free_spgr` directive in 6270:ref:`amdhsa-kernel-directives-table`. 6271 6272May be set at any time, e.g. manually set to zero at the start of each kernel. 6273 6274.. _amdgpu-amdhsa-assembler-directives-v3: 6275 6276Code Object V3 Directives (-mattr=+code-object-v3) 6277~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 6278 6279Directives which begin with ``.amdgcn`` are valid for all ``amdgcn`` 6280architecture processors, and are not OS-specific. Directives which begin with 6281``.amdhsa`` are specific to ``amdgcn`` architecture processors when the 6282``amdhsa`` OS is specified. See :ref:`amdgpu-target-triples` and 6283:ref:`amdgpu-processors`. 6284 6285.amdgcn_target <target> 6286+++++++++++++++++++++++ 6287 6288Optional directive which declares the target supported by the containing 6289assembler source file. Valid values are described in 6290:ref:`amdgpu-amdhsa-code-object-target-identification`. Used by the assembler 6291to validate command-line options such as ``-triple``, ``-mcpu``, and those 6292which specify target features. 6293 6294.amdhsa_kernel <name> 6295+++++++++++++++++++++ 6296 6297Creates a correctly aligned AMDHSA kernel descriptor and a symbol, 6298``<name>.kd``, in the current location of the current section. Only valid when 6299the OS is ``amdhsa``. ``<name>`` must be a symbol that labels the first 6300instruction to execute, and does not need to be previously defined. 6301 6302Marks the beginning of a list of directives used to generate the bytes of a 6303kernel descriptor, as described in :ref:`amdgpu-amdhsa-kernel-descriptor`. 6304Directives which may appear in this list are described in 6305:ref:`amdhsa-kernel-directives-table`. Directives may appear in any order, must 6306be valid for the target being assembled for, and cannot be repeated. Directives 6307support the range of values specified by the field they reference in 6308:ref:`amdgpu-amdhsa-kernel-descriptor`. If a directive is not specified, it is 6309assumed to have its default value, unless it is marked as "Required", in which 6310case it is an error to omit the directive. This list of directives is 6311terminated by an ``.end_amdhsa_kernel`` directive. 6312 6313 .. table:: AMDHSA Kernel Assembler Directives 6314 :name: amdhsa-kernel-directives-table 6315 6316 ======================================================== =================== ============ =================== 6317 Directive Default Supported On Description 6318 ======================================================== =================== ============ =================== 6319 ``.amdhsa_group_segment_fixed_size`` 0 GFX6-GFX10 Controls GROUP_SEGMENT_FIXED_SIZE in 6320 :ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx10-table`. 6321 ``.amdhsa_private_segment_fixed_size`` 0 GFX6-GFX10 Controls PRIVATE_SEGMENT_FIXED_SIZE in 6322 :ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx10-table`. 6323 ``.amdhsa_user_sgpr_private_segment_buffer`` 0 GFX6-GFX10 Controls ENABLE_SGPR_PRIVATE_SEGMENT_BUFFER in 6324 :ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx10-table`. 6325 ``.amdhsa_user_sgpr_dispatch_ptr`` 0 GFX6-GFX10 Controls ENABLE_SGPR_DISPATCH_PTR in 6326 :ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx10-table`. 6327 ``.amdhsa_user_sgpr_queue_ptr`` 0 GFX6-GFX10 Controls ENABLE_SGPR_QUEUE_PTR in 6328 :ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx10-table`. 6329 ``.amdhsa_user_sgpr_kernarg_segment_ptr`` 0 GFX6-GFX10 Controls ENABLE_SGPR_KERNARG_SEGMENT_PTR in 6330 :ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx10-table`. 6331 ``.amdhsa_user_sgpr_dispatch_id`` 0 GFX6-GFX10 Controls ENABLE_SGPR_DISPATCH_ID in 6332 :ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx10-table`. 6333 ``.amdhsa_user_sgpr_flat_scratch_init`` 0 GFX6-GFX10 Controls ENABLE_SGPR_FLAT_SCRATCH_INIT in 6334 :ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx10-table`. 6335 ``.amdhsa_user_sgpr_private_segment_size`` 0 GFX6-GFX10 Controls ENABLE_SGPR_PRIVATE_SEGMENT_SIZE in 6336 :ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx10-table`. 6337 ``.amdhsa_wavefront_size32`` Target GFX10 Controls ENABLE_WAVEFRONT_SIZE32 in 6338 Feature :ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx10-table`. 6339 Specific 6340 (-wavefrontsize64) 6341 ``.amdhsa_system_sgpr_private_segment_wavefront_offset`` 0 GFX6-GFX10 Controls ENABLE_SGPR_PRIVATE_SEGMENT_WAVEFRONT_OFFSET in 6342 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`. 6343 ``.amdhsa_system_sgpr_workgroup_id_x`` 1 GFX6-GFX10 Controls ENABLE_SGPR_WORKGROUP_ID_X in 6344 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`. 6345 ``.amdhsa_system_sgpr_workgroup_id_y`` 0 GFX6-GFX10 Controls ENABLE_SGPR_WORKGROUP_ID_Y in 6346 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`. 6347 ``.amdhsa_system_sgpr_workgroup_id_z`` 0 GFX6-GFX10 Controls ENABLE_SGPR_WORKGROUP_ID_Z in 6348 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`. 6349 ``.amdhsa_system_sgpr_workgroup_info`` 0 GFX6-GFX10 Controls ENABLE_SGPR_WORKGROUP_INFO in 6350 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`. 6351 ``.amdhsa_system_vgpr_workitem_id`` 0 GFX6-GFX10 Controls ENABLE_VGPR_WORKITEM_ID in 6352 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`. 6353 Possible values are defined in 6354 :ref:`amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table`. 6355 ``.amdhsa_next_free_vgpr`` Required GFX6-GFX10 Maximum VGPR number explicitly referenced, plus one. 6356 Used to calculate GRANULATED_WORKITEM_VGPR_COUNT in 6357 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`. 6358 ``.amdhsa_next_free_sgpr`` Required GFX6-GFX10 Maximum SGPR number explicitly referenced, plus one. 6359 Used to calculate GRANULATED_WAVEFRONT_SGPR_COUNT in 6360 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`. 6361 ``.amdhsa_reserve_vcc`` 1 GFX6-GFX10 Whether the kernel may use the special VCC SGPR. 6362 Used to calculate GRANULATED_WAVEFRONT_SGPR_COUNT in 6363 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`. 6364 ``.amdhsa_reserve_flat_scratch`` 1 GFX7-GFX10 Whether the kernel may use flat instructions to access 6365 scratch memory. Used to calculate 6366 GRANULATED_WAVEFRONT_SGPR_COUNT in 6367 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`. 6368 ``.amdhsa_reserve_xnack_mask`` Target GFX8-GFX10 Whether the kernel may trigger XNACK replay. 6369 Feature Used to calculate GRANULATED_WAVEFRONT_SGPR_COUNT in 6370 Specific :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`. 6371 (+xnack) 6372 ``.amdhsa_float_round_mode_32`` 0 GFX6-GFX10 Controls FLOAT_ROUND_MODE_32 in 6373 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`. 6374 Possible values are defined in 6375 :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`. 6376 ``.amdhsa_float_round_mode_16_64`` 0 GFX6-GFX10 Controls FLOAT_ROUND_MODE_16_64 in 6377 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`. 6378 Possible values are defined in 6379 :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`. 6380 ``.amdhsa_float_denorm_mode_32`` 0 GFX6-GFX10 Controls FLOAT_DENORM_MODE_32 in 6381 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`. 6382 Possible values are defined in 6383 :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`. 6384 ``.amdhsa_float_denorm_mode_16_64`` 3 GFX6-GFX10 Controls FLOAT_DENORM_MODE_16_64 in 6385 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`. 6386 Possible values are defined in 6387 :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`. 6388 ``.amdhsa_dx10_clamp`` 1 GFX6-GFX10 Controls ENABLE_DX10_CLAMP in 6389 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`. 6390 ``.amdhsa_ieee_mode`` 1 GFX6-GFX10 Controls ENABLE_IEEE_MODE in 6391 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`. 6392 ``.amdhsa_fp16_overflow`` 0 GFX9-GFX10 Controls FP16_OVFL in 6393 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`. 6394 ``.amdhsa_workgroup_processor_mode`` Target GFX10 Controls ENABLE_WGP_MODE in 6395 Feature :ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx10-table`. 6396 Specific 6397 (-cumode) 6398 ``.amdhsa_memory_ordered`` 1 GFX10 Controls MEM_ORDERED in 6399 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`. 6400 ``.amdhsa_forward_progress`` 0 GFX10 Controls FWD_PROGRESS in 6401 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`. 6402 ``.amdhsa_exception_fp_ieee_invalid_op`` 0 GFX6-GFX10 Controls ENABLE_EXCEPTION_IEEE_754_FP_INVALID_OPERATION in 6403 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`. 6404 ``.amdhsa_exception_fp_denorm_src`` 0 GFX6-GFX10 Controls ENABLE_EXCEPTION_FP_DENORMAL_SOURCE in 6405 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`. 6406 ``.amdhsa_exception_fp_ieee_div_zero`` 0 GFX6-GFX10 Controls ENABLE_EXCEPTION_IEEE_754_FP_DIVISION_BY_ZERO in 6407 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`. 6408 ``.amdhsa_exception_fp_ieee_overflow`` 0 GFX6-GFX10 Controls ENABLE_EXCEPTION_IEEE_754_FP_OVERFLOW in 6409 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`. 6410 ``.amdhsa_exception_fp_ieee_underflow`` 0 GFX6-GFX10 Controls ENABLE_EXCEPTION_IEEE_754_FP_UNDERFLOW in 6411 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`. 6412 ``.amdhsa_exception_fp_ieee_inexact`` 0 GFX6-GFX10 Controls ENABLE_EXCEPTION_IEEE_754_FP_INEXACT in 6413 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`. 6414 ``.amdhsa_exception_int_div_zero`` 0 GFX6-GFX10 Controls ENABLE_EXCEPTION_INT_DIVIDE_BY_ZERO in 6415 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`. 6416 ======================================================== =================== ============ =================== 6417 6418.amdgpu_metadata 6419++++++++++++++++ 6420 6421Optional directive which declares the contents of the ``NT_AMDGPU_METADATA`` 6422note record (see :ref:`amdgpu-elf-note-records-table-v3`). 6423 6424The contents must be in the [YAML]_ markup format, with the same structure and 6425semantics described in :ref:`amdgpu-amdhsa-code-object-metadata-v3`. 6426 6427This directive is terminated by an ``.end_amdgpu_metadata`` directive. 6428 6429.. _amdgpu-amdhsa-assembler-example-v3: 6430 6431Code Object V3 Example Source Code (-mattr=+code-object-v3) 6432~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 6433 6434Here is an example of a minimal assembly source file, defining one HSA kernel: 6435 6436.. code:: 6437 :number-lines: 6438 6439 .amdgcn_target "amdgcn-amd-amdhsa--gfx900+xnack" // optional 6440 6441 .text 6442 .globl hello_world 6443 .p2align 8 6444 .type hello_world,@function 6445 hello_world: 6446 s_load_dwordx2 s[0:1], s[0:1] 0x0 6447 v_mov_b32 v0, 3.14159 6448 s_waitcnt lgkmcnt(0) 6449 v_mov_b32 v1, s0 6450 v_mov_b32 v2, s1 6451 flat_store_dword v[1:2], v0 6452 s_endpgm 6453 .Lfunc_end0: 6454 .size hello_world, .Lfunc_end0-hello_world 6455 6456 .rodata 6457 .p2align 6 6458 .amdhsa_kernel hello_world 6459 .amdhsa_user_sgpr_kernarg_segment_ptr 1 6460 .amdhsa_next_free_vgpr .amdgcn.next_free_vgpr 6461 .amdhsa_next_free_sgpr .amdgcn.next_free_sgpr 6462 .end_amdhsa_kernel 6463 6464 .amdgpu_metadata 6465 --- 6466 amdhsa.version: 6467 - 1 6468 - 0 6469 amdhsa.kernels: 6470 - .name: hello_world 6471 .symbol: hello_world.kd 6472 .kernarg_segment_size: 48 6473 .group_segment_fixed_size: 0 6474 .private_segment_fixed_size: 0 6475 .kernarg_segment_align: 4 6476 .wavefront_size: 64 6477 .sgpr_count: 2 6478 .vgpr_count: 3 6479 .max_flat_workgroup_size: 256 6480 ... 6481 .end_amdgpu_metadata 6482 6483If an assembly source file contains multiple kernels and/or functions, the 6484:ref:`amdgpu-amdhsa-assembler-symbol-next_free_vgpr` and 6485:ref:`amdgpu-amdhsa-assembler-symbol-next_free_sgpr` symbols may be reset using 6486the ``.set <symbol>, <expression>`` directive. For example, in the case of two 6487kernels, where ``function1`` is only called from ``kernel1`` it is sufficient 6488to group the function with the kernel that calls it and reset the symbols 6489between the two connected components: 6490 6491.. code:: 6492 :number-lines: 6493 6494 .amdgcn_target "amdgcn-amd-amdhsa--gfx900+xnack" // optional 6495 6496 // gpr tracking symbols are implicitly set to zero 6497 6498 .text 6499 .globl kern0 6500 .p2align 8 6501 .type kern0,@function 6502 kern0: 6503 // ... 6504 s_endpgm 6505 .Lkern0_end: 6506 .size kern0, .Lkern0_end-kern0 6507 6508 .rodata 6509 .p2align 6 6510 .amdhsa_kernel kern0 6511 // ... 6512 .amdhsa_next_free_vgpr .amdgcn.next_free_vgpr 6513 .amdhsa_next_free_sgpr .amdgcn.next_free_sgpr 6514 .end_amdhsa_kernel 6515 6516 // reset symbols to begin tracking usage in func1 and kern1 6517 .set .amdgcn.next_free_vgpr, 0 6518 .set .amdgcn.next_free_sgpr, 0 6519 6520 .text 6521 .hidden func1 6522 .global func1 6523 .p2align 2 6524 .type func1,@function 6525 func1: 6526 // ... 6527 s_setpc_b64 s[30:31] 6528 .Lfunc1_end: 6529 .size func1, .Lfunc1_end-func1 6530 6531 .globl kern1 6532 .p2align 8 6533 .type kern1,@function 6534 kern1: 6535 // ... 6536 s_getpc_b64 s[4:5] 6537 s_add_u32 s4, s4, func1@rel32@lo+4 6538 s_addc_u32 s5, s5, func1@rel32@lo+4 6539 s_swappc_b64 s[30:31], s[4:5] 6540 // ... 6541 s_endpgm 6542 .Lkern1_end: 6543 .size kern1, .Lkern1_end-kern1 6544 6545 .rodata 6546 .p2align 6 6547 .amdhsa_kernel kern1 6548 // ... 6549 .amdhsa_next_free_vgpr .amdgcn.next_free_vgpr 6550 .amdhsa_next_free_sgpr .amdgcn.next_free_sgpr 6551 .end_amdhsa_kernel 6552 6553These symbols cannot identify connected components in order to automatically 6554track the usage for each kernel. However, in some cases careful organization of 6555the kernels and functions in the source file means there is minimal additional 6556effort required to accurately calculate GPR usage. 6557 6558Additional Documentation 6559======================== 6560 6561.. [AMD-RADEON-HD-2000-3000] `AMD R6xx shader ISA <http://developer.amd.com/wordpress/media/2012/10/R600_Instruction_Set_Architecture.pdf>`__ 6562.. [AMD-RADEON-HD-4000] `AMD R7xx shader ISA <http://developer.amd.com/wordpress/media/2012/10/R700-Family_Instruction_Set_Architecture.pdf>`__ 6563.. [AMD-RADEON-HD-5000] `AMD Evergreen shader ISA <http://developer.amd.com/wordpress/media/2012/10/AMD_Evergreen-Family_Instruction_Set_Architecture.pdf>`__ 6564.. [AMD-RADEON-HD-6000] `AMD Cayman/Trinity shader ISA <http://developer.amd.com/wordpress/media/2012/10/AMD_HD_6900_Series_Instruction_Set_Architecture.pdf>`__ 6565.. [AMD-GCN-GFX6] `AMD Southern Islands Series ISA <http://developer.amd.com/wordpress/media/2012/12/AMD_Southern_Islands_Instruction_Set_Architecture.pdf>`__ 6566.. [AMD-GCN-GFX7] `AMD Sea Islands Series ISA <http://developer.amd.com/wordpress/media/2013/07/AMD_Sea_Islands_Instruction_Set_Architecture.pdf>`_ 6567.. [AMD-GCN-GFX8] `AMD GCN3 Instruction Set Architecture <http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2013/12/AMD_GCN3_Instruction_Set_Architecture_rev1.1.pdf>`__ 6568.. [AMD-GCN-GFX9] `AMD "Vega" Instruction Set Architecture <http://developer.amd.com/wordpress/media/2013/12/Vega_Shader_ISA_28July2017.pdf>`__ 6569.. [AMD-GCN-GFX10] `AMD "RDNA 1.0" Instruction Set Architecture <https://gpuopen.com/wp-content/uploads/2019/08/RDNA_Shader_ISA_5August2019.pdf>`__ 6570.. [AMD-ROCm] `ROCm: Open Platform for Development, Discovery and Education Around GPU Computing <http://gpuopen.com/compute-product/rocm/>`__ 6571.. [AMD-ROCm-github] `ROCm github <http://github.com/RadeonOpenCompute>`__ 6572.. [HSA] `Heterogeneous System Architecture (HSA) Foundation <http://www.hsafoundation.com/>`__ 6573.. [ELF] `Executable and Linkable Format (ELF) <http://www.sco.com/developers/gabi/>`__ 6574.. [DWARF] `DWARF Debugging Information Format <http://dwarfstd.org/>`__ 6575.. [YAML] `YAML Ain't Markup Language (YAML™) Version 1.2 <http://www.yaml.org/spec/1.2/spec.html>`__ 6576.. [MsgPack] `Message Pack <http://www.msgpack.org/>`__ 6577.. [OpenCL] `The OpenCL Specification Version 2.0 <http://www.khronos.org/registry/cl/specs/opencl-2.0.pdf>`__ 6578.. [HRF] `Heterogeneous-race-free Memory Models <http://benedictgaster.org/wp-content/uploads/2014/01/asplos269-FINAL.pdf>`__ 6579.. [CLANG-ATTR] `Attributes in Clang <http://clang.llvm.org/docs/AttributeReference.html>`__ 6580