1============================= 2User Guide for AMDGPU Backend 3============================= 4 5.. contents:: 6 :local: 7 8Introduction 9============ 10 11The AMDGPU backend provides ISA code generation for AMD GPUs, starting with the 12R600 family up until the current GCN families. It lives in the 13``lib/Target/AMDGPU`` directory. 14 15LLVM 16==== 17 18.. _amdgpu-target-triples: 19 20Target Triples 21-------------- 22 23Use the ``clang -target <Architecture>-<Vendor>-<OS>-<Environment>`` option to 24specify the target triple: 25 26 .. table:: AMDGPU Architectures 27 :name: amdgpu-architecture-table 28 29 ============ ============================================================== 30 Architecture Description 31 ============ ============================================================== 32 ``r600`` AMD GPUs HD2XXX-HD6XXX for graphics and compute shaders. 33 ``amdgcn`` AMD GPUs GCN GFX6 onwards for graphics and compute shaders. 34 ============ ============================================================== 35 36 .. table:: AMDGPU Vendors 37 :name: amdgpu-vendor-table 38 39 ============ ============================================================== 40 Vendor Description 41 ============ ============================================================== 42 ``amd`` Can be used for all AMD GPU usage. 43 ``mesa3d`` Can be used if the OS is ``mesa3d``. 44 ============ ============================================================== 45 46 .. table:: AMDGPU Operating Systems 47 :name: amdgpu-os-table 48 49 ============== ============================================================ 50 OS Description 51 ============== ============================================================ 52 *<empty>* Defaults to the *unknown* OS. 53 ``amdhsa`` Compute kernels executed on HSA [HSA]_ compatible runtimes 54 such as AMD's ROCm [AMD-ROCm]_. 55 ``amdpal`` Graphic shaders and compute kernels executed on AMD PAL 56 runtime. 57 ``mesa3d`` Graphic shaders and compute kernels executed on Mesa 3D 58 runtime. 59 ============== ============================================================ 60 61 .. table:: AMDGPU Environments 62 :name: amdgpu-environment-table 63 64 ============ ============================================================== 65 Environment Description 66 ============ ============================================================== 67 *<empty>* Default. 68 ============ ============================================================== 69 70.. _amdgpu-processors: 71 72Processors 73---------- 74 75Use the ``clang -mcpu <Processor>`` option to specify the AMD GPU processor. The 76names from both the *Processor* and *Alternative Processor* can be used. 77 78 .. table:: AMDGPU Processors 79 :name: amdgpu-processor-table 80 81 =========== =============== ============ ===== ================= ======= ====================== 82 Processor Alternative Target dGPU/ Target ROCm Example 83 Processor Triple APU Features Support Products 84 Architecture Supported 85 [Default] 86 =========== =============== ============ ===== ================= ======= ====================== 87 **Radeon HD 2000/3000 Series (R600)** [AMD-RADEON-HD-2000-3000]_ 88 ----------------------------------------------------------------------------------------------- 89 ``r600`` ``r600`` dGPU 90 ``r630`` ``r600`` dGPU 91 ``rs880`` ``r600`` dGPU 92 ``rv670`` ``r600`` dGPU 93 **Radeon HD 4000 Series (R700)** [AMD-RADEON-HD-4000]_ 94 ----------------------------------------------------------------------------------------------- 95 ``rv710`` ``r600`` dGPU 96 ``rv730`` ``r600`` dGPU 97 ``rv770`` ``r600`` dGPU 98 **Radeon HD 5000 Series (Evergreen)** [AMD-RADEON-HD-5000]_ 99 ----------------------------------------------------------------------------------------------- 100 ``cedar`` ``r600`` dGPU 101 ``cypress`` ``r600`` dGPU 102 ``juniper`` ``r600`` dGPU 103 ``redwood`` ``r600`` dGPU 104 ``sumo`` ``r600`` dGPU 105 **Radeon HD 6000 Series (Northern Islands)** [AMD-RADEON-HD-6000]_ 106 ----------------------------------------------------------------------------------------------- 107 ``barts`` ``r600`` dGPU 108 ``caicos`` ``r600`` dGPU 109 ``cayman`` ``r600`` dGPU 110 ``turks`` ``r600`` dGPU 111 **GCN GFX6 (Southern Islands (SI))** [AMD-GCN-GFX6]_ 112 ----------------------------------------------------------------------------------------------- 113 ``gfx600`` - ``tahiti`` ``amdgcn`` dGPU 114 ``gfx601`` - ``hainan`` ``amdgcn`` dGPU 115 - ``oland`` 116 - ``pitcairn`` 117 - ``verde`` 118 **GCN GFX7 (Sea Islands (CI))** [AMD-GCN-GFX7]_ 119 ----------------------------------------------------------------------------------------------- 120 ``gfx700`` - ``kaveri`` ``amdgcn`` APU - A6-7000 121 - A6 Pro-7050B 122 - A8-7100 123 - A8 Pro-7150B 124 - A10-7300 125 - A10 Pro-7350B 126 - FX-7500 127 - A8-7200P 128 - A10-7400P 129 - FX-7600P 130 ``gfx701`` - ``hawaii`` ``amdgcn`` dGPU ROCm - FirePro W8100 131 - FirePro W9100 132 - FirePro S9150 133 - FirePro S9170 134 ``gfx702`` ``amdgcn`` dGPU ROCm - Radeon R9 290 135 - Radeon R9 290x 136 - Radeon R390 137 - Radeon R390x 138 ``gfx703`` - ``kabini`` ``amdgcn`` APU - E1-2100 139 - ``mullins`` - E1-2200 140 - E1-2500 141 - E2-3000 142 - E2-3800 143 - A4-5000 144 - A4-5100 145 - A6-5200 146 - A4 Pro-3340B 147 ``gfx704`` - ``bonaire`` ``amdgcn`` dGPU - Radeon HD 7790 148 - Radeon HD 8770 149 - R7 260 150 - R7 260X 151 **GCN GFX8 (Volcanic Islands (VI))** [AMD-GCN-GFX8]_ 152 ----------------------------------------------------------------------------------------------- 153 ``gfx801`` - ``carrizo`` ``amdgcn`` APU - xnack - A6-8500P 154 [on] - Pro A6-8500B 155 - A8-8600P 156 - Pro A8-8600B 157 - FX-8800P 158 - Pro A12-8800B 159 \ ``amdgcn`` APU - xnack ROCm - A10-8700P 160 [on] - Pro A10-8700B 161 - A10-8780P 162 \ ``amdgcn`` APU - xnack - A10-9600P 163 [on] - A10-9630P 164 - A12-9700P 165 - A12-9730P 166 - FX-9800P 167 - FX-9830P 168 \ ``amdgcn`` APU - xnack - E2-9010 169 [on] - A6-9210 170 - A9-9410 171 ``gfx802`` - ``iceland`` ``amdgcn`` dGPU - xnack ROCm - FirePro S7150 172 - ``tonga`` [off] - FirePro S7100 173 - FirePro W7100 174 - Radeon R285 175 - Radeon R9 380 176 - Radeon R9 385 177 - Mobile FirePro 178 M7170 179 ``gfx803`` - ``fiji`` ``amdgcn`` dGPU - xnack ROCm - Radeon R9 Nano 180 [off] - Radeon R9 Fury 181 - Radeon R9 FuryX 182 - Radeon Pro Duo 183 - FirePro S9300x2 184 - Radeon Instinct MI8 185 \ - ``polaris10`` ``amdgcn`` dGPU - xnack ROCm - Radeon RX 470 186 [off] - Radeon RX 480 187 - Radeon Instinct MI6 188 \ - ``polaris11`` ``amdgcn`` dGPU - xnack ROCm - Radeon RX 460 189 [off] 190 ``gfx810`` - ``stoney`` ``amdgcn`` APU - xnack 191 [on] 192 **GCN GFX9** [AMD-GCN-GFX9]_ 193 ----------------------------------------------------------------------------------------------- 194 ``gfx900`` ``amdgcn`` dGPU - xnack ROCm - Radeon Vega 195 [off] Frontier Edition 196 - Radeon RX Vega 56 197 - Radeon RX Vega 64 198 - Radeon RX Vega 64 199 Liquid 200 - Radeon Instinct MI25 201 ``gfx902`` ``amdgcn`` APU - xnack - Ryzen 3 2200G 202 [on] - Ryzen 5 2400G 203 ``gfx904`` ``amdgcn`` dGPU - xnack *TBA* 204 [off] 205 .. TODO 206 Add product 207 names. 208 ``gfx906`` ``amdgcn`` dGPU - xnack - Radeon Instinct MI50 209 [off] - Radeon Instinct MI60 210 ``gfx908`` ``amdgcn`` dGPU - xnack *TBA* 211 [off] 212 sram-ecc 213 [on] 214 ``gfx909`` ``amdgcn`` APU - xnack *TBA* (Raven Ridge 2) 215 [on] 216 .. TODO 217 Add product 218 names. 219 **GCN GFX10** [AMD-GCN-GFX10]_ 220 ----------------------------------------------------------------------------------------------- 221 ``gfx1010`` ``amdgcn`` dGPU - xnack *TBA* 222 [off] 223 - wavefrontsize64 224 [off] 225 - cumode 226 [off] 227 .. TODO 228 Add product 229 names. 230 ``gfx1011`` ``amdgcn`` dGPU - xnack *TBA* 231 [off] 232 - wavefrontsize64 233 [off] 234 - cumode 235 [off] 236 .. TODO 237 Add product 238 names. 239 ``gfx1012`` ``amdgcn`` dGPU - xnack *TBA* 240 [off] 241 - wavefrontsize64 242 [off] 243 - cumode 244 [off] 245 .. TODO 246 Add product 247 names. 248 =========== =============== ============ ===== ================= ======= ====================== 249 250.. _amdgpu-target-features: 251 252Target Features 253--------------- 254 255Target features control how code is generated to support certain 256processor specific features. Not all target features are supported by 257all processors. The runtime must ensure that the features supported by 258the device used to execute the code match the features enabled when 259generating the code. A mismatch of features may result in incorrect 260execution, or a reduction in performance. 261 262The target features supported by each processor, and the default value 263used if not specified explicitly, is listed in 264:ref:`amdgpu-processor-table`. 265 266Use the ``clang -m[no-]<TargetFeature>`` option to specify the AMD GPU 267target features. 268 269For example: 270 271``-mxnack`` 272 Enable the ``xnack`` feature. 273``-mno-xnack`` 274 Disable the ``xnack`` feature. 275 276 .. table:: AMDGPU Target Features 277 :name: amdgpu-target-feature-table 278 279 ====================== ================================================== 280 Target Feature Description 281 ====================== ================================================== 282 -m[no-]xnack Enable/disable generating code that has 283 memory clauses that are compatible with 284 having XNACK replay enabled. 285 286 This is used for demand paging and page 287 migration. If XNACK replay is enabled in 288 the device, then if a page fault occurs 289 the code may execute incorrectly if the 290 ``xnack`` feature is not enabled. Executing 291 code that has the feature enabled on a 292 device that does not have XNACK replay 293 enabled will execute correctly, but may 294 be less performant than code with the 295 feature disabled. 296 297 -m[no-]sram-ecc Enable/disable generating code that assumes SRAM 298 ECC is enabled/disabled. 299 300 -m[no-]wavefrontsize64 Control the default wavefront size used when 301 generating code for kernels. When disabled 302 native wavefront size 32 is used, when enabled 303 wavefront size 64 is used. 304 305 -m[no-]cumode Control the default wavefront execution mode used 306 when generating code for kernels. When disabled 307 native WGP wavefront execution mode is used, 308 when enabled CU wavefront execution mode is used 309 (see :ref:`amdgpu-amdhsa-memory-model`). 310 ====================== ================================================== 311 312.. _amdgpu-address-spaces: 313 314Address Spaces 315-------------- 316 317The AMDGPU backend uses the following address space mappings. 318 319The memory space names used in the table, aside from the region memory space, is 320from the OpenCL standard. 321 322LLVM Address Space number is used throughout LLVM (for example, in LLVM IR). 323 324 .. table:: Address Space Mapping 325 :name: amdgpu-address-space-mapping-table 326 327 ================== ================================= 328 LLVM Address Space Memory Space 329 ================== ================================= 330 0 Generic (Flat) 331 1 Global 332 2 Region (GDS) 333 3 Local (group/LDS) 334 4 Constant 335 5 Private (Scratch) 336 6 Constant 32-bit 337 7 Buffer Fat Pointer (experimental) 338 ================== ================================= 339 340The buffer fat pointer is an experimental address space that is currently 341unsupported in the backend. It exposes a non-integral pointer that is in future 342intended to support the modelling of 128-bit buffer descriptors + a 32-bit 343offset into the buffer descriptor (in total encapsulating a 160-bit 'pointer'), 344allowing us to use normal LLVM load/store/atomic operations to model the buffer 345descriptors used heavily in graphics workloads targeting the backend. 346 347.. _amdgpu-memory-scopes: 348 349Memory Scopes 350------------- 351 352This section provides LLVM memory synchronization scopes supported by the AMDGPU 353backend memory model when the target triple OS is ``amdhsa`` (see 354:ref:`amdgpu-amdhsa-memory-model` and :ref:`amdgpu-target-triples`). 355 356The memory model supported is based on the HSA memory model [HSA]_ which is 357based in turn on HRF-indirect with scope inclusion [HRF]_. The happens-before 358relation is transitive over the synchonizes-with relation independent of scope, 359and synchonizes-with allows the memory scope instances to be inclusive (see 360table :ref:`amdgpu-amdhsa-llvm-sync-scopes-table`). 361 362This is different to the OpenCL [OpenCL]_ memory model which does not have scope 363inclusion and requires the memory scopes to exactly match. However, this 364is conservatively correct for OpenCL. 365 366 .. table:: AMDHSA LLVM Sync Scopes 367 :name: amdgpu-amdhsa-llvm-sync-scopes-table 368 369 ======================= =================================================== 370 LLVM Sync Scope Description 371 ======================= =================================================== 372 *none* The default: ``system``. 373 374 Synchronizes with, and participates in modification 375 and seq_cst total orderings with, other operations 376 (except image operations) for all address spaces 377 (except private, or generic that accesses private) 378 provided the other operation's sync scope is: 379 380 - ``system``. 381 - ``agent`` and executed by a thread on the same 382 agent. 383 - ``workgroup`` and executed by a thread in the 384 same workgroup. 385 - ``wavefront`` and executed by a thread in the 386 same wavefront. 387 388 ``agent`` Synchronizes with, and participates in modification 389 and seq_cst total orderings with, other operations 390 (except image operations) for all address spaces 391 (except private, or generic that accesses private) 392 provided the other operation's sync scope is: 393 394 - ``system`` or ``agent`` and executed by a thread 395 on the same agent. 396 - ``workgroup`` and executed by a thread in the 397 same workgroup. 398 - ``wavefront`` and executed by a thread in the 399 same wavefront. 400 401 ``workgroup`` Synchronizes with, and participates in modification 402 and seq_cst total orderings with, other operations 403 (except image operations) for all address spaces 404 (except private, or generic that accesses private) 405 provided the other operation's sync scope is: 406 407 - ``system``, ``agent`` or ``workgroup`` and 408 executed by a thread in the same workgroup. 409 - ``wavefront`` and executed by a thread in the 410 same wavefront. 411 412 ``wavefront`` Synchronizes with, and participates in modification 413 and seq_cst total orderings with, other operations 414 (except image operations) for all address spaces 415 (except private, or generic that accesses private) 416 provided the other operation's sync scope is: 417 418 - ``system``, ``agent``, ``workgroup`` or 419 ``wavefront`` and executed by a thread in the 420 same wavefront. 421 422 ``singlethread`` Only synchronizes with, and participates in 423 modification and seq_cst total orderings with, 424 other operations (except image operations) running 425 in the same thread for all address spaces (for 426 example, in signal handlers). 427 428 ``one-as`` Same as ``system`` but only synchronizes with other 429 operations within the same address space. 430 431 ``agent-one-as`` Same as ``agent`` but only synchronizes with other 432 operations within the same address space. 433 434 ``workgroup-one-as`` Same as ``workgroup`` but only synchronizes with 435 other operations within the same address space. 436 437 ``wavefront-one-as`` Same as ``wavefront`` but only synchronizes with 438 other operations within the same address space. 439 440 ``singlethread-one-as`` Same as ``singlethread`` but only synchronizes with 441 other operations within the same address space. 442 ======================= =================================================== 443 444AMDGPU Intrinsics 445----------------- 446 447The AMDGPU backend implements the following LLVM IR intrinsics. 448 449*This section is WIP.* 450 451.. TODO 452 List AMDGPU intrinsics 453 454AMDGPU Attributes 455----------------- 456 457The AMDGPU backend supports the following LLVM IR attributes. 458 459 .. table:: AMDGPU LLVM IR Attributes 460 :name: amdgpu-llvm-ir-attributes-table 461 462 ======================================= ========================================================== 463 LLVM Attribute Description 464 ======================================= ========================================================== 465 "amdgpu-flat-work-group-size"="min,max" Specify the minimum and maximum flat work group sizes that 466 will be specified when the kernel is dispatched. Generated 467 by the ``amdgpu_flat_work_group_size`` CLANG attribute [CLANG-ATTR]_. 468 "amdgpu-implicitarg-num-bytes"="n" Number of kernel argument bytes to add to the kernel 469 argument block size for the implicit arguments. This 470 varies by OS and language (for OpenCL see 471 :ref:`opencl-kernel-implicit-arguments-appended-for-amdhsa-os-table`). 472 "amdgpu-num-sgpr"="n" Specifies the number of SGPRs to use. Generated by 473 the ``amdgpu_num_sgpr`` CLANG attribute [CLANG-ATTR]_. 474 "amdgpu-num-vgpr"="n" Specifies the number of VGPRs to use. Generated by the 475 ``amdgpu_num_vgpr`` CLANG attribute [CLANG-ATTR]_. 476 "amdgpu-waves-per-eu"="m,n" Specify the minimum and maximum number of waves per 477 execution unit. Generated by the ``amdgpu_waves_per_eu`` 478 CLANG attribute [CLANG-ATTR]_. 479 "amdgpu-ieee" true/false. Specify whether the function expects the IEEE field of the 480 mode register to be set on entry. Overrides the default for 481 the calling convention. 482 "amdgpu-dx10-clamp" true/false. Specify whether the function expects the DX10_CLAMP field of 483 the mode register to be set on entry. Overrides the default 484 for the calling convention. 485 ======================================= ========================================================== 486 487Code Object 488=========== 489 490The AMDGPU backend generates a standard ELF [ELF]_ relocatable code object that 491can be linked by ``lld`` to produce a standard ELF shared code object which can 492be loaded and executed on an AMDGPU target. 493 494Header 495------ 496 497The AMDGPU backend uses the following ELF header: 498 499 .. table:: AMDGPU ELF Header 500 :name: amdgpu-elf-header-table 501 502 ========================== =============================== 503 Field Value 504 ========================== =============================== 505 ``e_ident[EI_CLASS]`` ``ELFCLASS64`` 506 ``e_ident[EI_DATA]`` ``ELFDATA2LSB`` 507 ``e_ident[EI_OSABI]`` - ``ELFOSABI_NONE`` 508 - ``ELFOSABI_AMDGPU_HSA`` 509 - ``ELFOSABI_AMDGPU_PAL`` 510 - ``ELFOSABI_AMDGPU_MESA3D`` 511 ``e_ident[EI_ABIVERSION]`` - ``ELFABIVERSION_AMDGPU_HSA`` 512 - ``ELFABIVERSION_AMDGPU_PAL`` 513 - ``ELFABIVERSION_AMDGPU_MESA3D`` 514 ``e_type`` - ``ET_REL`` 515 - ``ET_DYN`` 516 ``e_machine`` ``EM_AMDGPU`` 517 ``e_entry`` 0 518 ``e_flags`` See :ref:`amdgpu-elf-header-e_flags-table` 519 ========================== =============================== 520 521.. 522 523 .. table:: AMDGPU ELF Header Enumeration Values 524 :name: amdgpu-elf-header-enumeration-values-table 525 526 =============================== ===== 527 Name Value 528 =============================== ===== 529 ``EM_AMDGPU`` 224 530 ``ELFOSABI_NONE`` 0 531 ``ELFOSABI_AMDGPU_HSA`` 64 532 ``ELFOSABI_AMDGPU_PAL`` 65 533 ``ELFOSABI_AMDGPU_MESA3D`` 66 534 ``ELFABIVERSION_AMDGPU_HSA`` 1 535 ``ELFABIVERSION_AMDGPU_PAL`` 0 536 ``ELFABIVERSION_AMDGPU_MESA3D`` 0 537 =============================== ===== 538 539``e_ident[EI_CLASS]`` 540 The ELF class is: 541 542 * ``ELFCLASS32`` for ``r600`` architecture. 543 544 * ``ELFCLASS64`` for ``amdgcn`` architecture which only supports 64 545 bit applications. 546 547``e_ident[EI_DATA]`` 548 All AMDGPU targets use ``ELFDATA2LSB`` for little-endian byte ordering. 549 550``e_ident[EI_OSABI]`` 551 One of the following AMD GPU architecture specific OS ABIs 552 (see :ref:`amdgpu-os-table`): 553 554 * ``ELFOSABI_NONE`` for *unknown* OS. 555 556 * ``ELFOSABI_AMDGPU_HSA`` for ``amdhsa`` OS. 557 558 * ``ELFOSABI_AMDGPU_PAL`` for ``amdpal`` OS. 559 560 * ``ELFOSABI_AMDGPU_MESA3D`` for ``mesa3D`` OS. 561 562``e_ident[EI_ABIVERSION]`` 563 The ABI version of the AMD GPU architecture specific OS ABI to which the code 564 object conforms: 565 566 * ``ELFABIVERSION_AMDGPU_HSA`` is used to specify the version of AMD HSA 567 runtime ABI. 568 569 * ``ELFABIVERSION_AMDGPU_PAL`` is used to specify the version of AMD PAL 570 runtime ABI. 571 572 * ``ELFABIVERSION_AMDGPU_MESA3D`` is used to specify the version of AMD MESA 573 3D runtime ABI. 574 575``e_type`` 576 Can be one of the following values: 577 578 579 ``ET_REL`` 580 The type produced by the AMD GPU backend compiler as it is relocatable code 581 object. 582 583 ``ET_DYN`` 584 The type produced by the linker as it is a shared code object. 585 586 The AMD HSA runtime loader requires a ``ET_DYN`` code object. 587 588``e_machine`` 589 The value ``EM_AMDGPU`` is used for the machine for all processors supported 590 by the ``r600`` and ``amdgcn`` architectures (see 591 :ref:`amdgpu-processor-table`). The specific processor is specified in the 592 ``EF_AMDGPU_MACH`` bit field of the ``e_flags`` (see 593 :ref:`amdgpu-elf-header-e_flags-table`). 594 595``e_entry`` 596 The entry point is 0 as the entry points for individual kernels must be 597 selected in order to invoke them through AQL packets. 598 599``e_flags`` 600 The AMDGPU backend uses the following ELF header flags: 601 602 .. table:: AMDGPU ELF Header ``e_flags`` 603 :name: amdgpu-elf-header-e_flags-table 604 605 ================================= ========== ============================= 606 Name Value Description 607 ================================= ========== ============================= 608 **AMDGPU Processor Flag** See :ref:`amdgpu-processor-table`. 609 -------------------------------------------- ----------------------------- 610 ``EF_AMDGPU_MACH`` 0x000000ff AMDGPU processor selection 611 mask for 612 ``EF_AMDGPU_MACH_xxx`` values 613 defined in 614 :ref:`amdgpu-ef-amdgpu-mach-table`. 615 ``EF_AMDGPU_XNACK`` 0x00000100 Indicates if the ``xnack`` 616 target feature is 617 enabled for all code 618 contained in the code object. 619 If the processor 620 does not support the 621 ``xnack`` target 622 feature then must 623 be 0. 624 See 625 :ref:`amdgpu-target-features`. 626 ``EF_AMDGPU_SRAM_ECC`` 0x00000200 Indicates if the ``sram-ecc`` 627 target feature is 628 enabled for all code 629 contained in the code object. 630 If the processor 631 does not support the 632 ``sram-ecc`` target 633 feature then must 634 be 0. 635 See 636 :ref:`amdgpu-target-features`. 637 ================================= ========== ============================= 638 639 .. table:: AMDGPU ``EF_AMDGPU_MACH`` Values 640 :name: amdgpu-ef-amdgpu-mach-table 641 642 ================================= ========== ============================= 643 Name Value Description (see 644 :ref:`amdgpu-processor-table`) 645 ================================= ========== ============================= 646 ``EF_AMDGPU_MACH_NONE`` 0x000 *not specified* 647 ``EF_AMDGPU_MACH_R600_R600`` 0x001 ``r600`` 648 ``EF_AMDGPU_MACH_R600_R630`` 0x002 ``r630`` 649 ``EF_AMDGPU_MACH_R600_RS880`` 0x003 ``rs880`` 650 ``EF_AMDGPU_MACH_R600_RV670`` 0x004 ``rv670`` 651 ``EF_AMDGPU_MACH_R600_RV710`` 0x005 ``rv710`` 652 ``EF_AMDGPU_MACH_R600_RV730`` 0x006 ``rv730`` 653 ``EF_AMDGPU_MACH_R600_RV770`` 0x007 ``rv770`` 654 ``EF_AMDGPU_MACH_R600_CEDAR`` 0x008 ``cedar`` 655 ``EF_AMDGPU_MACH_R600_CYPRESS`` 0x009 ``cypress`` 656 ``EF_AMDGPU_MACH_R600_JUNIPER`` 0x00a ``juniper`` 657 ``EF_AMDGPU_MACH_R600_REDWOOD`` 0x00b ``redwood`` 658 ``EF_AMDGPU_MACH_R600_SUMO`` 0x00c ``sumo`` 659 ``EF_AMDGPU_MACH_R600_BARTS`` 0x00d ``barts`` 660 ``EF_AMDGPU_MACH_R600_CAICOS`` 0x00e ``caicos`` 661 ``EF_AMDGPU_MACH_R600_CAYMAN`` 0x00f ``cayman`` 662 ``EF_AMDGPU_MACH_R600_TURKS`` 0x010 ``turks`` 663 *reserved* 0x011 - Reserved for ``r600`` 664 0x01f architecture processors. 665 ``EF_AMDGPU_MACH_AMDGCN_GFX600`` 0x020 ``gfx600`` 666 ``EF_AMDGPU_MACH_AMDGCN_GFX601`` 0x021 ``gfx601`` 667 ``EF_AMDGPU_MACH_AMDGCN_GFX700`` 0x022 ``gfx700`` 668 ``EF_AMDGPU_MACH_AMDGCN_GFX701`` 0x023 ``gfx701`` 669 ``EF_AMDGPU_MACH_AMDGCN_GFX702`` 0x024 ``gfx702`` 670 ``EF_AMDGPU_MACH_AMDGCN_GFX703`` 0x025 ``gfx703`` 671 ``EF_AMDGPU_MACH_AMDGCN_GFX704`` 0x026 ``gfx704`` 672 *reserved* 0x027 Reserved. 673 ``EF_AMDGPU_MACH_AMDGCN_GFX801`` 0x028 ``gfx801`` 674 ``EF_AMDGPU_MACH_AMDGCN_GFX802`` 0x029 ``gfx802`` 675 ``EF_AMDGPU_MACH_AMDGCN_GFX803`` 0x02a ``gfx803`` 676 ``EF_AMDGPU_MACH_AMDGCN_GFX810`` 0x02b ``gfx810`` 677 ``EF_AMDGPU_MACH_AMDGCN_GFX900`` 0x02c ``gfx900`` 678 ``EF_AMDGPU_MACH_AMDGCN_GFX902`` 0x02d ``gfx902`` 679 ``EF_AMDGPU_MACH_AMDGCN_GFX904`` 0x02e ``gfx904`` 680 ``EF_AMDGPU_MACH_AMDGCN_GFX906`` 0x02f ``gfx906`` 681 ``EF_AMDGPU_MACH_AMDGCN_GFX908`` 0x030 ``gfx908`` 682 ``EF_AMDGPU_MACH_AMDGCN_GFX909`` 0x031 ``gfx909`` 683 *reserved* 0x032 Reserved. 684 ``EF_AMDGPU_MACH_AMDGCN_GFX1010`` 0x033 ``gfx1010`` 685 ``EF_AMDGPU_MACH_AMDGCN_GFX1011`` 0x034 ``gfx1011`` 686 ``EF_AMDGPU_MACH_AMDGCN_GFX1012`` 0x035 ``gfx1012`` 687 ================================= ========== ============================= 688 689Sections 690-------- 691 692An AMDGPU target ELF code object has the standard ELF sections which include: 693 694 .. table:: AMDGPU ELF Sections 695 :name: amdgpu-elf-sections-table 696 697 ================== ================ ================================= 698 Name Type Attributes 699 ================== ================ ================================= 700 ``.bss`` ``SHT_NOBITS`` ``SHF_ALLOC`` + ``SHF_WRITE`` 701 ``.data`` ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_WRITE`` 702 ``.debug_``\ *\** ``SHT_PROGBITS`` *none* 703 ``.dynamic`` ``SHT_DYNAMIC`` ``SHF_ALLOC`` 704 ``.dynstr`` ``SHT_PROGBITS`` ``SHF_ALLOC`` 705 ``.dynsym`` ``SHT_PROGBITS`` ``SHF_ALLOC`` 706 ``.got`` ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_WRITE`` 707 ``.hash`` ``SHT_HASH`` ``SHF_ALLOC`` 708 ``.note`` ``SHT_NOTE`` *none* 709 ``.rela``\ *name* ``SHT_RELA`` *none* 710 ``.rela.dyn`` ``SHT_RELA`` *none* 711 ``.rodata`` ``SHT_PROGBITS`` ``SHF_ALLOC`` 712 ``.shstrtab`` ``SHT_STRTAB`` *none* 713 ``.strtab`` ``SHT_STRTAB`` *none* 714 ``.symtab`` ``SHT_SYMTAB`` *none* 715 ``.text`` ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_EXECINSTR`` 716 ================== ================ ================================= 717 718These sections have their standard meanings (see [ELF]_) and are only generated 719if needed. 720 721``.debug``\ *\** 722 The standard DWARF sections. See :ref:`amdgpu-dwarf` for information on the 723 DWARF produced by the AMDGPU backend. 724 725``.dynamic``, ``.dynstr``, ``.dynsym``, ``.hash`` 726 The standard sections used by a dynamic loader. 727 728``.note`` 729 See :ref:`amdgpu-note-records` for the note records supported by the AMDGPU 730 backend. 731 732``.rela``\ *name*, ``.rela.dyn`` 733 For relocatable code objects, *name* is the name of the section that the 734 relocation records apply. For example, ``.rela.text`` is the section name for 735 relocation records associated with the ``.text`` section. 736 737 For linked shared code objects, ``.rela.dyn`` contains all the relocation 738 records from each of the relocatable code object's ``.rela``\ *name* sections. 739 740 See :ref:`amdgpu-relocation-records` for the relocation records supported by 741 the AMDGPU backend. 742 743``.text`` 744 The executable machine code for the kernels and functions they call. Generated 745 as position independent code. See :ref:`amdgpu-code-conventions` for 746 information on conventions used in the isa generation. 747 748.. _amdgpu-note-records: 749 750Note Records 751------------ 752 753The AMDGPU backend code object contains ELF note records in the ``.note`` 754section. The set of generated notes and their semantics depend on the code 755object version; see :ref:`amdgpu-note-records-v2` and 756:ref:`amdgpu-note-records-v3`. 757 758As required by ``ELFCLASS32`` and ``ELFCLASS64``, minimal zero byte padding 759must be generated after the ``name`` field to ensure the ``desc`` field is 4 760byte aligned. In addition, minimal zero byte padding must be generated to 761ensure the ``desc`` field size is a multiple of 4 bytes. The ``sh_addralign`` 762field of the ``.note`` section must be at least 4 to indicate at least 8 byte 763alignment. 764 765.. _amdgpu-note-records-v2: 766 767Code Object V2 Note Records (-mattr=-code-object-v3) 768~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 769 770.. warning:: Code Object V2 is not the default code object version emitted by 771 this version of LLVM. For a description of the notes generated with the 772 default configuration (Code Object V3) see :ref:`amdgpu-note-records-v3`. 773 774The AMDGPU backend code object uses the following ELF note record in the 775``.note`` section when compiling for Code Object V2 (-mattr=-code-object-v3). 776 777Additional note records may be present, but any which are not documented here 778are deprecated and should not be used. 779 780 .. table:: AMDGPU Code Object V2 ELF Note Records 781 :name: amdgpu-elf-note-records-table-v2 782 783 ===== ============================== ====================================== 784 Name Type Description 785 ===== ============================== ====================================== 786 "AMD" ``NT_AMD_AMDGPU_HSA_METADATA`` <metadata null terminated string> 787 ===== ============================== ====================================== 788 789.. 790 791 .. table:: AMDGPU Code Object V2 ELF Note Record Enumeration Values 792 :name: amdgpu-elf-note-record-enumeration-values-table-v2 793 794 ============================== ===== 795 Name Value 796 ============================== ===== 797 *reserved* 0-9 798 ``NT_AMD_AMDGPU_HSA_METADATA`` 10 799 *reserved* 11 800 ============================== ===== 801 802``NT_AMD_AMDGPU_HSA_METADATA`` 803 Specifies extensible metadata associated with the code objects executed on HSA 804 [HSA]_ compatible runtimes such as AMD's ROCm [AMD-ROCm]_. It is required when 805 the target triple OS is ``amdhsa`` (see :ref:`amdgpu-target-triples`). See 806 :ref:`amdgpu-amdhsa-code-object-metadata-v2` for the syntax of the code 807 object metadata string. 808 809.. _amdgpu-note-records-v3: 810 811Code Object V3 Note Records (-mattr=+code-object-v3) 812~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 813 814The AMDGPU backend code object uses the following ELF note record in the 815``.note`` section when compiling for Code Object V3 (-mattr=+code-object-v3). 816 817Additional note records may be present, but any which are not documented here 818are deprecated and should not be used. 819 820 .. table:: AMDGPU Code Object V3 ELF Note Records 821 :name: amdgpu-elf-note-records-table-v3 822 823 ======== ============================== ====================================== 824 Name Type Description 825 ======== ============================== ====================================== 826 "AMDGPU" ``NT_AMDGPU_METADATA`` Metadata in Message Pack [MsgPack]_ 827 binary format. 828 ======== ============================== ====================================== 829 830.. 831 832 .. table:: AMDGPU Code Object V3 ELF Note Record Enumeration Values 833 :name: amdgpu-elf-note-record-enumeration-values-table-v3 834 835 ============================== ===== 836 Name Value 837 ============================== ===== 838 *reserved* 0-31 839 ``NT_AMDGPU_METADATA`` 32 840 ============================== ===== 841 842``NT_AMDGPU_METADATA`` 843 Specifies extensible metadata associated with an AMDGPU code 844 object. It is encoded as a map in the Message Pack [MsgPack]_ binary 845 data format. See :ref:`amdgpu-amdhsa-code-object-metadata-v3` for the 846 map keys defined for the ``amdhsa`` OS. 847 848.. _amdgpu-symbols: 849 850Symbols 851------- 852 853Symbols include the following: 854 855 .. table:: AMDGPU ELF Symbols 856 :name: amdgpu-elf-symbols-table 857 858 ===================== ================== ================ ================== 859 Name Type Section Description 860 ===================== ================== ================ ================== 861 *link-name* ``STT_OBJECT`` - ``.data`` Global variable 862 - ``.rodata`` 863 - ``.bss`` 864 *link-name*\ ``.kd`` ``STT_OBJECT`` - ``.rodata`` Kernel descriptor 865 *link-name* ``STT_FUNC`` - ``.text`` Kernel entry point 866 *link-name* ``STT_OBJECT`` - SHN_AMDGPU_LDS Global variable in LDS 867 ===================== ================== ================ ================== 868 869Global variable 870 Global variables both used and defined by the compilation unit. 871 872 If the symbol is defined in the compilation unit then it is allocated in the 873 appropriate section according to if it has initialized data or is readonly. 874 875 If the symbol is external then its section is ``STN_UNDEF`` and the loader 876 will resolve relocations using the definition provided by another code object 877 or explicitly defined by the runtime. 878 879 If the symbol resides in local/group memory (LDS) then its section is the 880 special processor-specific section name ``SHN_AMDGPU_LDS``, and the 881 ``st_value`` field describes alignment requirements as it does for common 882 symbols. 883 884 .. TODO 885 Add description of linked shared object symbols. Seems undefined symbols 886 are marked as STT_NOTYPE. 887 888Kernel descriptor 889 Every HSA kernel has an associated kernel descriptor. It is the address of the 890 kernel descriptor that is used in the AQL dispatch packet used to invoke the 891 kernel, not the kernel entry point. The layout of the HSA kernel descriptor is 892 defined in :ref:`amdgpu-amdhsa-kernel-descriptor`. 893 894Kernel entry point 895 Every HSA kernel also has a symbol for its machine code entry point. 896 897.. _amdgpu-relocation-records: 898 899Relocation Records 900------------------ 901 902AMDGPU backend generates ``Elf64_Rela`` relocation records. Supported 903relocatable fields are: 904 905``word32`` 906 This specifies a 32-bit field occupying 4 bytes with arbitrary byte 907 alignment. These values use the same byte order as other word values in the 908 AMD GPU architecture. 909 910``word64`` 911 This specifies a 64-bit field occupying 8 bytes with arbitrary byte 912 alignment. These values use the same byte order as other word values in the 913 AMD GPU architecture. 914 915Following notations are used for specifying relocation calculations: 916 917**A** 918 Represents the addend used to compute the value of the relocatable field. 919 920**G** 921 Represents the offset into the global offset table at which the relocation 922 entry's symbol will reside during execution. 923 924**GOT** 925 Represents the address of the global offset table. 926 927**P** 928 Represents the place (section offset for ``et_rel`` or address for ``et_dyn``) 929 of the storage unit being relocated (computed using ``r_offset``). 930 931**S** 932 Represents the value of the symbol whose index resides in the relocation 933 entry. Relocations not using this must specify a symbol index of ``STN_UNDEF``. 934 935**B** 936 Represents the base address of a loaded executable or shared object which is 937 the difference between the ELF address and the actual load address. Relocations 938 using this are only valid in executable or shared objects. 939 940The following relocation types are supported: 941 942 .. table:: AMDGPU ELF Relocation Records 943 :name: amdgpu-elf-relocation-records-table 944 945 ========================== ======= ===== ========== ============================== 946 Relocation Type Kind Value Field Calculation 947 ========================== ======= ===== ========== ============================== 948 ``R_AMDGPU_NONE`` 0 *none* *none* 949 ``R_AMDGPU_ABS32_LO`` Static, 1 ``word32`` (S + A) & 0xFFFFFFFF 950 Dynamic 951 ``R_AMDGPU_ABS32_HI`` Static, 2 ``word32`` (S + A) >> 32 952 Dynamic 953 ``R_AMDGPU_ABS64`` Static, 3 ``word64`` S + A 954 Dynamic 955 ``R_AMDGPU_REL32`` Static 4 ``word32`` S + A - P 956 ``R_AMDGPU_REL64`` Static 5 ``word64`` S + A - P 957 ``R_AMDGPU_ABS32`` Static, 6 ``word32`` S + A 958 Dynamic 959 ``R_AMDGPU_GOTPCREL`` Static 7 ``word32`` G + GOT + A - P 960 ``R_AMDGPU_GOTPCREL32_LO`` Static 8 ``word32`` (G + GOT + A - P) & 0xFFFFFFFF 961 ``R_AMDGPU_GOTPCREL32_HI`` Static 9 ``word32`` (G + GOT + A - P) >> 32 962 ``R_AMDGPU_REL32_LO`` Static 10 ``word32`` (S + A - P) & 0xFFFFFFFF 963 ``R_AMDGPU_REL32_HI`` Static 11 ``word32`` (S + A - P) >> 32 964 *reserved* 12 965 ``R_AMDGPU_RELATIVE64`` Dynamic 13 ``word64`` B + A 966 ========================== ======= ===== ========== ============================== 967 968``R_AMDGPU_ABS32_LO`` and ``R_AMDGPU_ABS32_HI`` are only supported by 969the ``mesa3d`` OS, which does not support ``R_AMDGPU_ABS64``. 970 971There is no current OS loader support for 32 bit programs and so 972``R_AMDGPU_ABS32`` is not used. 973 974.. _amdgpu-dwarf: 975 976DWARF 977----- 978 979Standard DWARF [DWARF]_ Version 5 sections can be generated. These contain 980information that maps the code object executable code and data to the source 981language constructs. It can be used by tools such as debuggers and profilers. 982 983Address Space Mapping 984~~~~~~~~~~~~~~~~~~~~~ 985 986The following address space mapping is used: 987 988 .. table:: AMDGPU DWARF Address Space Mapping 989 :name: amdgpu-dwarf-address-space-mapping-table 990 991 =================== ================= 992 DWARF Address Space Memory Space 993 =================== ================= 994 1 Private (Scratch) 995 2 Local (group/LDS) 996 *omitted* Global 997 *omitted* Constant 998 *omitted* Generic (Flat) 999 *not supported* Region (GDS) 1000 =================== ================= 1001 1002See :ref:`amdgpu-address-spaces` for information on the memory space terminology 1003used in the table. 1004 1005An ``address_class`` attribute is generated on pointer type DIEs to specify the 1006DWARF address space of the value of the pointer when it is in the *private* or 1007*local* address space. Otherwise the attribute is omitted. 1008 1009An ``XDEREF`` operation is generated in location list expressions for variables 1010that are allocated in the *private* and *local* address space. Otherwise no 1011``XDREF`` is omitted. 1012 1013Register Mapping 1014~~~~~~~~~~~~~~~~ 1015 1016*This section is WIP.* 1017 1018.. TODO 1019 Define DWARF register enumeration. 1020 1021 If want to present a wavefront state then should expose vector registers as 1022 64 wide (rather than per work-item view that LLVM uses). Either as separate 1023 registers, or a 64x4 byte single register. In either case use a new LANE op 1024 (akin to XDREF) to select the current lane usage in a location 1025 expression. This would also allow scalar register spilling to vector register 1026 lanes to be expressed (currently no debug information is being generated for 1027 spilling). If choose a wide single register approach then use LANE in 1028 conjunction with PIECE operation to select the dword part of the register for 1029 the current lane. If the separate register approach then use LANE to select 1030 the register. 1031 1032Source Text 1033~~~~~~~~~~~ 1034 1035Source text for online-compiled programs (e.g. those compiled by the OpenCL 1036runtime) may be embedded into the DWARF v5 line table using the ``clang 1037-gembed-source`` option, described in table :ref:`amdgpu-debug-options`. 1038 1039For example: 1040 1041``-gembed-source`` 1042 Enable the embedded source DWARF v5 extension. 1043``-gno-embed-source`` 1044 Disable the embedded source DWARF v5 extension. 1045 1046 .. table:: AMDGPU Debug Options 1047 :name: amdgpu-debug-options 1048 1049 ==================== ================================================== 1050 Debug Flag Description 1051 ==================== ================================================== 1052 -g[no-]embed-source Enable/disable embedding source text in DWARF 1053 debug sections. Useful for environments where 1054 source cannot be written to disk, such as 1055 when performing online compilation. 1056 ==================== ================================================== 1057 1058This option enables one extended content types in the DWARF v5 Line Number 1059Program Header, which is used to encode embedded source. 1060 1061 .. table:: AMDGPU DWARF Line Number Program Header Extended Content Types 1062 :name: amdgpu-dwarf-extended-content-types 1063 1064 ============================ ====================== 1065 Content Type Form 1066 ============================ ====================== 1067 ``DW_LNCT_LLVM_source`` ``DW_FORM_line_strp`` 1068 ============================ ====================== 1069 1070The source field will contain the UTF-8 encoded, null-terminated source text 1071with ``'\n'`` line endings. When the source field is present, consumers can use 1072the embedded source instead of attempting to discover the source on disk. When 1073the source field is absent, consumers can access the file to get the source 1074text. 1075 1076The above content type appears in the ``file_name_entry_format`` field of the 1077line table prologue, and its corresponding value appear in the ``file_names`` 1078field. The current encoding of the content type is documented in table 1079:ref:`amdgpu-dwarf-extended-content-types-encoding` 1080 1081 .. table:: AMDGPU DWARF Line Number Program Header Extended Content Types Encoding 1082 :name: amdgpu-dwarf-extended-content-types-encoding 1083 1084 ============================ ==================== 1085 Content Type Value 1086 ============================ ==================== 1087 ``DW_LNCT_LLVM_source`` 0x2001 1088 ============================ ==================== 1089 1090.. _amdgpu-code-conventions: 1091 1092Code Conventions 1093================ 1094 1095This section provides code conventions used for each supported target triple OS 1096(see :ref:`amdgpu-target-triples`). 1097 1098AMDHSA 1099------ 1100 1101This section provides code conventions used when the target triple OS is 1102``amdhsa`` (see :ref:`amdgpu-target-triples`). 1103 1104.. _amdgpu-amdhsa-code-object-target-identification: 1105 1106Code Object Target Identification 1107~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 1108 1109The AMDHSA OS uses the following syntax to specify the code object 1110target as a single string: 1111 1112 ``<Architecture>-<Vendor>-<OS>-<Environment>-<Processor><Target Features>`` 1113 1114Where: 1115 1116 - ``<Architecture>``, ``<Vendor>``, ``<OS>`` and ``<Environment>`` 1117 are the same as the *Target Triple* (see 1118 :ref:`amdgpu-target-triples`). 1119 1120 - ``<Processor>`` is the same as the *Processor* (see 1121 :ref:`amdgpu-processors`). 1122 1123 - ``<Target Features>`` is a list of the enabled *Target Features* 1124 (see :ref:`amdgpu-target-features`), each prefixed by a plus, that 1125 apply to *Processor*. The list must be in the same order as listed 1126 in the table :ref:`amdgpu-target-feature-table`. Note that *Target 1127 Features* must be included in the list if they are enabled even if 1128 that is the default for *Processor*. 1129 1130For example: 1131 1132 ``"amdgcn-amd-amdhsa--gfx902+xnack"`` 1133 1134.. _amdgpu-amdhsa-code-object-metadata: 1135 1136Code Object Metadata 1137~~~~~~~~~~~~~~~~~~~~ 1138 1139The code object metadata specifies extensible metadata associated with the code 1140objects executed on HSA [HSA]_ compatible runtimes such as AMD's ROCm 1141[AMD-ROCm]_. The encoding and semantics of this metadata depends on the code 1142object version; see :ref:`amdgpu-amdhsa-code-object-metadata-v2` and 1143:ref:`amdgpu-amdhsa-code-object-metadata-v3`. 1144 1145Code object metadata is specified in a note record (see 1146:ref:`amdgpu-note-records`) and is required when the target triple OS is 1147``amdhsa`` (see :ref:`amdgpu-target-triples`). It must contain the minimum 1148information necessary to support the ROCM kernel queries. For example, the 1149segment sizes needed in a dispatch packet. In addition, a high level language 1150runtime may require other information to be included. For example, the AMD 1151OpenCL runtime records kernel argument information. 1152 1153.. _amdgpu-amdhsa-code-object-metadata-v2: 1154 1155Code Object V2 Metadata (-mattr=-code-object-v3) 1156++++++++++++++++++++++++++++++++++++++++++++++++ 1157 1158.. warning:: Code Object V2 is not the default code object version emitted by 1159 this version of LLVM. For a description of the metadata generated with the 1160 default configuration (Code Object V3) see 1161 :ref:`amdgpu-amdhsa-code-object-metadata-v3`. 1162 1163Code object V2 metadata is specified by the ``NT_AMD_AMDGPU_METADATA`` note 1164record (see :ref:`amdgpu-note-records-v2`). 1165 1166The metadata is specified as a YAML formatted string (see [YAML]_ and 1167:doc:`YamlIO`). 1168 1169.. TODO 1170 Is the string null terminated? It probably should not if YAML allows it to 1171 contain null characters, otherwise it should be. 1172 1173The metadata is represented as a single YAML document comprised of the mapping 1174defined in table :ref:`amdgpu-amdhsa-code-object-metadata-map-table-v2` and 1175referenced tables. 1176 1177For boolean values, the string values of ``false`` and ``true`` are used for 1178false and true respectively. 1179 1180Additional information can be added to the mappings. To avoid conflicts, any 1181non-AMD key names should be prefixed by "*vendor-name*.". 1182 1183 .. table:: AMDHSA Code Object V2 Metadata Map 1184 :name: amdgpu-amdhsa-code-object-metadata-map-table-v2 1185 1186 ========== ============== ========= ======================================= 1187 String Key Value Type Required? Description 1188 ========== ============== ========= ======================================= 1189 "Version" sequence of Required - The first integer is the major 1190 2 integers version. Currently 1. 1191 - The second integer is the minor 1192 version. Currently 0. 1193 "Printf" sequence of Each string is encoded information 1194 strings about a printf function call. The 1195 encoded information is organized as 1196 fields separated by colon (':'): 1197 1198 ``ID:N:S[0]:S[1]:...:S[N-1]:FormatString`` 1199 1200 where: 1201 1202 ``ID`` 1203 A 32 bit integer as a unique id for 1204 each printf function call 1205 1206 ``N`` 1207 A 32 bit integer equal to the number 1208 of arguments of printf function call 1209 minus 1 1210 1211 ``S[i]`` (where i = 0, 1, ... , N-1) 1212 32 bit integers for the size in bytes 1213 of the i-th FormatString argument of 1214 the printf function call 1215 1216 FormatString 1217 The format string passed to the 1218 printf function call. 1219 "Kernels" sequence of Required Sequence of the mappings for each 1220 mapping kernel in the code object. See 1221 :ref:`amdgpu-amdhsa-code-object-kernel-metadata-map-table-v2` 1222 for the definition of the mapping. 1223 ========== ============== ========= ======================================= 1224 1225.. 1226 1227 .. table:: AMDHSA Code Object V2 Kernel Metadata Map 1228 :name: amdgpu-amdhsa-code-object-kernel-metadata-map-table-v2 1229 1230 ================= ============== ========= ================================ 1231 String Key Value Type Required? Description 1232 ================= ============== ========= ================================ 1233 "Name" string Required Source name of the kernel. 1234 "SymbolName" string Required Name of the kernel 1235 descriptor ELF symbol. 1236 "Language" string Source language of the kernel. 1237 Values include: 1238 1239 - "OpenCL C" 1240 - "OpenCL C++" 1241 - "HCC" 1242 - "OpenMP" 1243 1244 "LanguageVersion" sequence of - The first integer is the major 1245 2 integers version. 1246 - The second integer is the 1247 minor version. 1248 "Attrs" mapping Mapping of kernel attributes. 1249 See 1250 :ref:`amdgpu-amdhsa-code-object-kernel-attribute-metadata-map-table-v2` 1251 for the mapping definition. 1252 "Args" sequence of Sequence of mappings of the 1253 mapping kernel arguments. See 1254 :ref:`amdgpu-amdhsa-code-object-kernel-argument-metadata-map-table-v2` 1255 for the definition of the mapping. 1256 "CodeProps" mapping Mapping of properties related to 1257 the kernel code. See 1258 :ref:`amdgpu-amdhsa-code-object-kernel-code-properties-metadata-map-table-v2` 1259 for the mapping definition. 1260 ================= ============== ========= ================================ 1261 1262.. 1263 1264 .. table:: AMDHSA Code Object V2 Kernel Attribute Metadata Map 1265 :name: amdgpu-amdhsa-code-object-kernel-attribute-metadata-map-table-v2 1266 1267 =================== ============== ========= ============================== 1268 String Key Value Type Required? Description 1269 =================== ============== ========= ============================== 1270 "ReqdWorkGroupSize" sequence of If not 0, 0, 0 then all values 1271 3 integers must be >=1 and the dispatch 1272 work-group size X, Y, Z must 1273 correspond to the specified 1274 values. Defaults to 0, 0, 0. 1275 1276 Corresponds to the OpenCL 1277 ``reqd_work_group_size`` 1278 attribute. 1279 "WorkGroupSizeHint" sequence of The dispatch work-group size 1280 3 integers X, Y, Z is likely to be the 1281 specified values. 1282 1283 Corresponds to the OpenCL 1284 ``work_group_size_hint`` 1285 attribute. 1286 "VecTypeHint" string The name of a scalar or vector 1287 type. 1288 1289 Corresponds to the OpenCL 1290 ``vec_type_hint`` attribute. 1291 1292 "RuntimeHandle" string The external symbol name 1293 associated with a kernel. 1294 OpenCL runtime allocates a 1295 global buffer for the symbol 1296 and saves the kernel's address 1297 to it, which is used for 1298 device side enqueueing. Only 1299 available for device side 1300 enqueued kernels. 1301 =================== ============== ========= ============================== 1302 1303.. 1304 1305 .. table:: AMDHSA Code Object V2 Kernel Argument Metadata Map 1306 :name: amdgpu-amdhsa-code-object-kernel-argument-metadata-map-table-v2 1307 1308 ================= ============== ========= ================================ 1309 String Key Value Type Required? Description 1310 ================= ============== ========= ================================ 1311 "Name" string Kernel argument name. 1312 "TypeName" string Kernel argument type name. 1313 "Size" integer Required Kernel argument size in bytes. 1314 "Align" integer Required Kernel argument alignment in 1315 bytes. Must be a power of two. 1316 "ValueKind" string Required Kernel argument kind that 1317 specifies how to set up the 1318 corresponding argument. 1319 Values include: 1320 1321 "ByValue" 1322 The argument is copied 1323 directly into the kernarg. 1324 1325 "GlobalBuffer" 1326 A global address space pointer 1327 to the buffer data is passed 1328 in the kernarg. 1329 1330 "DynamicSharedPointer" 1331 A group address space pointer 1332 to dynamically allocated LDS 1333 is passed in the kernarg. 1334 1335 "Sampler" 1336 A global address space 1337 pointer to a S# is passed in 1338 the kernarg. 1339 1340 "Image" 1341 A global address space 1342 pointer to a T# is passed in 1343 the kernarg. 1344 1345 "Pipe" 1346 A global address space pointer 1347 to an OpenCL pipe is passed in 1348 the kernarg. 1349 1350 "Queue" 1351 A global address space pointer 1352 to an OpenCL device enqueue 1353 queue is passed in the 1354 kernarg. 1355 1356 "HiddenGlobalOffsetX" 1357 The OpenCL grid dispatch 1358 global offset for the X 1359 dimension is passed in the 1360 kernarg. 1361 1362 "HiddenGlobalOffsetY" 1363 The OpenCL grid dispatch 1364 global offset for the Y 1365 dimension is passed in the 1366 kernarg. 1367 1368 "HiddenGlobalOffsetZ" 1369 The OpenCL grid dispatch 1370 global offset for the Z 1371 dimension is passed in the 1372 kernarg. 1373 1374 "HiddenNone" 1375 An argument that is not used 1376 by the kernel. Space needs to 1377 be left for it, but it does 1378 not need to be set up. 1379 1380 "HiddenPrintfBuffer" 1381 A global address space pointer 1382 to the runtime printf buffer 1383 is passed in kernarg. 1384 1385 "HiddenDefaultQueue" 1386 A global address space pointer 1387 to the OpenCL device enqueue 1388 queue that should be used by 1389 the kernel by default is 1390 passed in the kernarg. 1391 1392 "HiddenCompletionAction" 1393 A global address space pointer 1394 to help link enqueued kernels into 1395 the ancestor tree for determining 1396 when the parent kernel has finished. 1397 1398 "HiddenMultiGridSyncArg" 1399 A global address space pointer for 1400 multi-grid synchronization is 1401 passed in the kernarg. 1402 1403 "ValueType" string Required Kernel argument value type. Only 1404 present if "ValueKind" is 1405 "ByValue". For vector data 1406 types, the value is for the 1407 element type. Values include: 1408 1409 - "Struct" 1410 - "I8" 1411 - "U8" 1412 - "I16" 1413 - "U16" 1414 - "F16" 1415 - "I32" 1416 - "U32" 1417 - "F32" 1418 - "I64" 1419 - "U64" 1420 - "F64" 1421 1422 .. TODO 1423 How can it be determined if a 1424 vector type, and what size 1425 vector? 1426 "PointeeAlign" integer Alignment in bytes of pointee 1427 type for pointer type kernel 1428 argument. Must be a power 1429 of 2. Only present if 1430 "ValueKind" is 1431 "DynamicSharedPointer". 1432 "AddrSpaceQual" string Kernel argument address space 1433 qualifier. Only present if 1434 "ValueKind" is "GlobalBuffer" or 1435 "DynamicSharedPointer". Values 1436 are: 1437 1438 - "Private" 1439 - "Global" 1440 - "Constant" 1441 - "Local" 1442 - "Generic" 1443 - "Region" 1444 1445 .. TODO 1446 Is GlobalBuffer only Global 1447 or Constant? Is 1448 DynamicSharedPointer always 1449 Local? Can HCC allow Generic? 1450 How can Private or Region 1451 ever happen? 1452 "AccQual" string Kernel argument access 1453 qualifier. Only present if 1454 "ValueKind" is "Image" or 1455 "Pipe". Values 1456 are: 1457 1458 - "ReadOnly" 1459 - "WriteOnly" 1460 - "ReadWrite" 1461 1462 .. TODO 1463 Does this apply to 1464 GlobalBuffer? 1465 "ActualAccQual" string The actual memory accesses 1466 performed by the kernel on the 1467 kernel argument. Only present if 1468 "ValueKind" is "GlobalBuffer", 1469 "Image", or "Pipe". This may be 1470 more restrictive than indicated 1471 by "AccQual" to reflect what the 1472 kernel actual does. If not 1473 present then the runtime must 1474 assume what is implied by 1475 "AccQual" and "IsConst". Values 1476 are: 1477 1478 - "ReadOnly" 1479 - "WriteOnly" 1480 - "ReadWrite" 1481 1482 "IsConst" boolean Indicates if the kernel argument 1483 is const qualified. Only present 1484 if "ValueKind" is 1485 "GlobalBuffer". 1486 1487 "IsRestrict" boolean Indicates if the kernel argument 1488 is restrict qualified. Only 1489 present if "ValueKind" is 1490 "GlobalBuffer". 1491 1492 "IsVolatile" boolean Indicates if the kernel argument 1493 is volatile qualified. Only 1494 present if "ValueKind" is 1495 "GlobalBuffer". 1496 1497 "IsPipe" boolean Indicates if the kernel argument 1498 is pipe qualified. Only present 1499 if "ValueKind" is "Pipe". 1500 1501 .. TODO 1502 Can GlobalBuffer be pipe 1503 qualified? 1504 ================= ============== ========= ================================ 1505 1506.. 1507 1508 .. table:: AMDHSA Code Object V2 Kernel Code Properties Metadata Map 1509 :name: amdgpu-amdhsa-code-object-kernel-code-properties-metadata-map-table-v2 1510 1511 ============================ ============== ========= ===================== 1512 String Key Value Type Required? Description 1513 ============================ ============== ========= ===================== 1514 "KernargSegmentSize" integer Required The size in bytes of 1515 the kernarg segment 1516 that holds the values 1517 of the arguments to 1518 the kernel. 1519 "GroupSegmentFixedSize" integer Required The amount of group 1520 segment memory 1521 required by a 1522 work-group in 1523 bytes. This does not 1524 include any 1525 dynamically allocated 1526 group segment memory 1527 that may be added 1528 when the kernel is 1529 dispatched. 1530 "PrivateSegmentFixedSize" integer Required The amount of fixed 1531 private address space 1532 memory required for a 1533 work-item in 1534 bytes. If the kernel 1535 uses a dynamic call 1536 stack then additional 1537 space must be added 1538 to this value for the 1539 call stack. 1540 "KernargSegmentAlign" integer Required The maximum byte 1541 alignment of 1542 arguments in the 1543 kernarg segment. Must 1544 be a power of 2. 1545 "WavefrontSize" integer Required Wavefront size. Must 1546 be a power of 2. 1547 "NumSGPRs" integer Required Number of scalar 1548 registers used by a 1549 wavefront for 1550 GFX6-GFX10. This 1551 includes the special 1552 SGPRs for VCC, Flat 1553 Scratch (GFX7-GFX10) 1554 and XNACK (for 1555 GFX8-GFX10). It does 1556 not include the 16 1557 SGPR added if a trap 1558 handler is 1559 enabled. It is not 1560 rounded up to the 1561 allocation 1562 granularity. 1563 "NumVGPRs" integer Required Number of vector 1564 registers used by 1565 each work-item for 1566 GFX6-GFX10 1567 "MaxFlatWorkGroupSize" integer Required Maximum flat 1568 work-group size 1569 supported by the 1570 kernel in work-items. 1571 Must be >=1 and 1572 consistent with 1573 ReqdWorkGroupSize if 1574 not 0, 0, 0. 1575 "NumSpilledSGPRs" integer Number of stores from 1576 a scalar register to 1577 a register allocator 1578 created spill 1579 location. 1580 "NumSpilledVGPRs" integer Number of stores from 1581 a vector register to 1582 a register allocator 1583 created spill 1584 location. 1585 ============================ ============== ========= ===================== 1586 1587.. _amdgpu-amdhsa-code-object-metadata-v3: 1588 1589Code Object V3 Metadata (-mattr=+code-object-v3) 1590++++++++++++++++++++++++++++++++++++++++++++++++ 1591 1592Code object V3 metadata is specified by the ``NT_AMDGPU_METADATA`` note record 1593(see :ref:`amdgpu-note-records-v3`). 1594 1595The metadata is represented as Message Pack formatted binary data (see 1596[MsgPack]_). The top level is a Message Pack map that includes the 1597keys defined in table 1598:ref:`amdgpu-amdhsa-code-object-metadata-map-table-v3` and referenced 1599tables. 1600 1601Additional information can be added to the maps. To avoid conflicts, 1602any key names should be prefixed by "*vendor-name*." where 1603``vendor-name`` can be the the name of the vendor and specific vendor 1604tool that generates the information. The prefix is abbreviated to 1605simply "." when it appears within a map that has been added by the 1606same *vendor-name*. 1607 1608 .. table:: AMDHSA Code Object V3 Metadata Map 1609 :name: amdgpu-amdhsa-code-object-metadata-map-table-v3 1610 1611 ================= ============== ========= ======================================= 1612 String Key Value Type Required? Description 1613 ================= ============== ========= ======================================= 1614 "amdhsa.version" sequence of Required - The first integer is the major 1615 2 integers version. Currently 1. 1616 - The second integer is the minor 1617 version. Currently 0. 1618 "amdhsa.printf" sequence of Each string is encoded information 1619 strings about a printf function call. The 1620 encoded information is organized as 1621 fields separated by colon (':'): 1622 1623 ``ID:N:S[0]:S[1]:...:S[N-1]:FormatString`` 1624 1625 where: 1626 1627 ``ID`` 1628 A 32 bit integer as a unique id for 1629 each printf function call 1630 1631 ``N`` 1632 A 32 bit integer equal to the number 1633 of arguments of printf function call 1634 minus 1 1635 1636 ``S[i]`` (where i = 0, 1, ... , N-1) 1637 32 bit integers for the size in bytes 1638 of the i-th FormatString argument of 1639 the printf function call 1640 1641 FormatString 1642 The format string passed to the 1643 printf function call. 1644 "amdhsa.kernels" sequence of Required Sequence of the maps for each 1645 map kernel in the code object. See 1646 :ref:`amdgpu-amdhsa-code-object-kernel-metadata-map-table-v3` 1647 for the definition of the keys included 1648 in that map. 1649 ================= ============== ========= ======================================= 1650 1651.. 1652 1653 .. table:: AMDHSA Code Object V3 Kernel Metadata Map 1654 :name: amdgpu-amdhsa-code-object-kernel-metadata-map-table-v3 1655 1656 =================================== ============== ========= ================================ 1657 String Key Value Type Required? Description 1658 =================================== ============== ========= ================================ 1659 ".name" string Required Source name of the kernel. 1660 ".symbol" string Required Name of the kernel 1661 descriptor ELF symbol. 1662 ".language" string Source language of the kernel. 1663 Values include: 1664 1665 - "OpenCL C" 1666 - "OpenCL C++" 1667 - "HCC" 1668 - "HIP" 1669 - "OpenMP" 1670 - "Assembler" 1671 1672 ".language_version" sequence of - The first integer is the major 1673 2 integers version. 1674 - The second integer is the 1675 minor version. 1676 ".args" sequence of Sequence of maps of the 1677 map kernel arguments. See 1678 :ref:`amdgpu-amdhsa-code-object-kernel-argument-metadata-map-table-v3` 1679 for the definition of the keys 1680 included in that map. 1681 ".reqd_workgroup_size" sequence of If not 0, 0, 0 then all values 1682 3 integers must be >=1 and the dispatch 1683 work-group size X, Y, Z must 1684 correspond to the specified 1685 values. Defaults to 0, 0, 0. 1686 1687 Corresponds to the OpenCL 1688 ``reqd_work_group_size`` 1689 attribute. 1690 ".workgroup_size_hint" sequence of The dispatch work-group size 1691 3 integers X, Y, Z is likely to be the 1692 specified values. 1693 1694 Corresponds to the OpenCL 1695 ``work_group_size_hint`` 1696 attribute. 1697 ".vec_type_hint" string The name of a scalar or vector 1698 type. 1699 1700 Corresponds to the OpenCL 1701 ``vec_type_hint`` attribute. 1702 1703 ".device_enqueue_symbol" string The external symbol name 1704 associated with a kernel. 1705 OpenCL runtime allocates a 1706 global buffer for the symbol 1707 and saves the kernel's address 1708 to it, which is used for 1709 device side enqueueing. Only 1710 available for device side 1711 enqueued kernels. 1712 ".kernarg_segment_size" integer Required The size in bytes of 1713 the kernarg segment 1714 that holds the values 1715 of the arguments to 1716 the kernel. 1717 ".group_segment_fixed_size" integer Required The amount of group 1718 segment memory 1719 required by a 1720 work-group in 1721 bytes. This does not 1722 include any 1723 dynamically allocated 1724 group segment memory 1725 that may be added 1726 when the kernel is 1727 dispatched. 1728 ".private_segment_fixed_size" integer Required The amount of fixed 1729 private address space 1730 memory required for a 1731 work-item in 1732 bytes. If the kernel 1733 uses a dynamic call 1734 stack then additional 1735 space must be added 1736 to this value for the 1737 call stack. 1738 ".kernarg_segment_align" integer Required The maximum byte 1739 alignment of 1740 arguments in the 1741 kernarg segment. Must 1742 be a power of 2. 1743 ".wavefront_size" integer Required Wavefront size. Must 1744 be a power of 2. 1745 ".sgpr_count" integer Required Number of scalar 1746 registers required by a 1747 wavefront for 1748 GFX6-GFX9. A register 1749 is required if it is 1750 used explicitly, or 1751 if a higher numbered 1752 register is used 1753 explicitly. This 1754 includes the special 1755 SGPRs for VCC, Flat 1756 Scratch (GFX7-GFX9) 1757 and XNACK (for 1758 GFX8-GFX9). It does 1759 not include the 16 1760 SGPR added if a trap 1761 handler is 1762 enabled. It is not 1763 rounded up to the 1764 allocation 1765 granularity. 1766 ".vgpr_count" integer Required Number of vector 1767 registers required by 1768 each work-item for 1769 GFX6-GFX9. A register 1770 is required if it is 1771 used explicitly, or 1772 if a higher numbered 1773 register is used 1774 explicitly. 1775 ".max_flat_workgroup_size" integer Required Maximum flat 1776 work-group size 1777 supported by the 1778 kernel in work-items. 1779 Must be >=1 and 1780 consistent with 1781 ReqdWorkGroupSize if 1782 not 0, 0, 0. 1783 ".sgpr_spill_count" integer Number of stores from 1784 a scalar register to 1785 a register allocator 1786 created spill 1787 location. 1788 ".vgpr_spill_count" integer Number of stores from 1789 a vector register to 1790 a register allocator 1791 created spill 1792 location. 1793 =================================== ============== ========= ================================ 1794 1795.. 1796 1797 .. table:: AMDHSA Code Object V3 Kernel Argument Metadata Map 1798 :name: amdgpu-amdhsa-code-object-kernel-argument-metadata-map-table-v3 1799 1800 ====================== ============== ========= ================================ 1801 String Key Value Type Required? Description 1802 ====================== ============== ========= ================================ 1803 ".name" string Kernel argument name. 1804 ".type_name" string Kernel argument type name. 1805 ".size" integer Required Kernel argument size in bytes. 1806 ".offset" integer Required Kernel argument offset in 1807 bytes. The offset must be a 1808 multiple of the alignment 1809 required by the argument. 1810 ".value_kind" string Required Kernel argument kind that 1811 specifies how to set up the 1812 corresponding argument. 1813 Values include: 1814 1815 "by_value" 1816 The argument is copied 1817 directly into the kernarg. 1818 1819 "global_buffer" 1820 A global address space pointer 1821 to the buffer data is passed 1822 in the kernarg. 1823 1824 "dynamic_shared_pointer" 1825 A group address space pointer 1826 to dynamically allocated LDS 1827 is passed in the kernarg. 1828 1829 "sampler" 1830 A global address space 1831 pointer to a S# is passed in 1832 the kernarg. 1833 1834 "image" 1835 A global address space 1836 pointer to a T# is passed in 1837 the kernarg. 1838 1839 "pipe" 1840 A global address space pointer 1841 to an OpenCL pipe is passed in 1842 the kernarg. 1843 1844 "queue" 1845 A global address space pointer 1846 to an OpenCL device enqueue 1847 queue is passed in the 1848 kernarg. 1849 1850 "hidden_global_offset_x" 1851 The OpenCL grid dispatch 1852 global offset for the X 1853 dimension is passed in the 1854 kernarg. 1855 1856 "hidden_global_offset_y" 1857 The OpenCL grid dispatch 1858 global offset for the Y 1859 dimension is passed in the 1860 kernarg. 1861 1862 "hidden_global_offset_z" 1863 The OpenCL grid dispatch 1864 global offset for the Z 1865 dimension is passed in the 1866 kernarg. 1867 1868 "hidden_none" 1869 An argument that is not used 1870 by the kernel. Space needs to 1871 be left for it, but it does 1872 not need to be set up. 1873 1874 "hidden_printf_buffer" 1875 A global address space pointer 1876 to the runtime printf buffer 1877 is passed in kernarg. 1878 1879 "hidden_default_queue" 1880 A global address space pointer 1881 to the OpenCL device enqueue 1882 queue that should be used by 1883 the kernel by default is 1884 passed in the kernarg. 1885 1886 "hidden_completion_action" 1887 A global address space pointer 1888 to help link enqueued kernels into 1889 the ancestor tree for determining 1890 when the parent kernel has finished. 1891 1892 "hidden_multigrid_sync_arg" 1893 A global address space pointer for 1894 multi-grid synchronization is 1895 passed in the kernarg. 1896 1897 ".value_type" string Required Kernel argument value type. Only 1898 present if ".value_kind" is 1899 "by_value". For vector data 1900 types, the value is for the 1901 element type. Values include: 1902 1903 - "struct" 1904 - "i8" 1905 - "u8" 1906 - "i16" 1907 - "u16" 1908 - "f16" 1909 - "i32" 1910 - "u32" 1911 - "f32" 1912 - "i64" 1913 - "u64" 1914 - "f64" 1915 1916 .. TODO 1917 How can it be determined if a 1918 vector type, and what size 1919 vector? 1920 ".pointee_align" integer Alignment in bytes of pointee 1921 type for pointer type kernel 1922 argument. Must be a power 1923 of 2. Only present if 1924 ".value_kind" is 1925 "dynamic_shared_pointer". 1926 ".address_space" string Kernel argument address space 1927 qualifier. Only present if 1928 ".value_kind" is "global_buffer" or 1929 "dynamic_shared_pointer". Values 1930 are: 1931 1932 - "private" 1933 - "global" 1934 - "constant" 1935 - "local" 1936 - "generic" 1937 - "region" 1938 1939 .. TODO 1940 Is "global_buffer" only "global" 1941 or "constant"? Is 1942 "dynamic_shared_pointer" always 1943 "local"? Can HCC allow "generic"? 1944 How can "private" or "region" 1945 ever happen? 1946 ".access" string Kernel argument access 1947 qualifier. Only present if 1948 ".value_kind" is "image" or 1949 "pipe". Values 1950 are: 1951 1952 - "read_only" 1953 - "write_only" 1954 - "read_write" 1955 1956 .. TODO 1957 Does this apply to 1958 "global_buffer"? 1959 ".actual_access" string The actual memory accesses 1960 performed by the kernel on the 1961 kernel argument. Only present if 1962 ".value_kind" is "global_buffer", 1963 "image", or "pipe". This may be 1964 more restrictive than indicated 1965 by ".access" to reflect what the 1966 kernel actual does. If not 1967 present then the runtime must 1968 assume what is implied by 1969 ".access" and ".is_const" . Values 1970 are: 1971 1972 - "read_only" 1973 - "write_only" 1974 - "read_write" 1975 1976 ".is_const" boolean Indicates if the kernel argument 1977 is const qualified. Only present 1978 if ".value_kind" is 1979 "global_buffer". 1980 1981 ".is_restrict" boolean Indicates if the kernel argument 1982 is restrict qualified. Only 1983 present if ".value_kind" is 1984 "global_buffer". 1985 1986 ".is_volatile" boolean Indicates if the kernel argument 1987 is volatile qualified. Only 1988 present if ".value_kind" is 1989 "global_buffer". 1990 1991 ".is_pipe" boolean Indicates if the kernel argument 1992 is pipe qualified. Only present 1993 if ".value_kind" is "pipe". 1994 1995 .. TODO 1996 Can "global_buffer" be pipe 1997 qualified? 1998 ====================== ============== ========= ================================ 1999 2000.. 2001 2002Kernel Dispatch 2003~~~~~~~~~~~~~~~ 2004 2005The HSA architected queuing language (AQL) defines a user space memory interface 2006that can be used to control the dispatch of kernels, in an agent independent 2007way. An agent can have zero or more AQL queues created for it using the ROCm 2008runtime, in which AQL packets (all of which are 64 bytes) can be placed. See the 2009*HSA Platform System Architecture Specification* [HSA]_ for the AQL queue 2010mechanics and packet layouts. 2011 2012The packet processor of a kernel agent is responsible for detecting and 2013dispatching HSA kernels from the AQL queues associated with it. For AMD GPUs the 2014packet processor is implemented by the hardware command processor (CP), 2015asynchronous dispatch controller (ADC) and shader processor input controller 2016(SPI). 2017 2018The ROCm runtime can be used to allocate an AQL queue object. It uses the kernel 2019mode driver to initialize and register the AQL queue with CP. 2020 2021To dispatch a kernel the following actions are performed. This can occur in the 2022CPU host program, or from an HSA kernel executing on a GPU. 2023 20241. A pointer to an AQL queue for the kernel agent on which the kernel is to be 2025 executed is obtained. 20262. A pointer to the kernel descriptor (see 2027 :ref:`amdgpu-amdhsa-kernel-descriptor`) of the kernel to execute is 2028 obtained. It must be for a kernel that is contained in a code object that that 2029 was loaded by the ROCm runtime on the kernel agent with which the AQL queue is 2030 associated. 20313. Space is allocated for the kernel arguments using the ROCm runtime allocator 2032 for a memory region with the kernarg property for the kernel agent that will 2033 execute the kernel. It must be at least 16 byte aligned. 20344. Kernel argument values are assigned to the kernel argument memory 2035 allocation. The layout is defined in the *HSA Programmer's Language Reference* 2036 [HSA]_. For AMDGPU the kernel execution directly accesses the kernel argument 2037 memory in the same way constant memory is accessed. (Note that the HSA 2038 specification allows an implementation to copy the kernel argument contents to 2039 another location that is accessed by the kernel.) 20405. An AQL kernel dispatch packet is created on the AQL queue. The ROCm runtime 2041 api uses 64 bit atomic operations to reserve space in the AQL queue for the 2042 packet. The packet must be set up, and the final write must use an atomic 2043 store release to set the packet kind to ensure the packet contents are 2044 visible to the kernel agent. AQL defines a doorbell signal mechanism to 2045 notify the kernel agent that the AQL queue has been updated. These rules, and 2046 the layout of the AQL queue and kernel dispatch packet is defined in the *HSA 2047 System Architecture Specification* [HSA]_. 20486. A kernel dispatch packet includes information about the actual dispatch, 2049 such as grid and work-group size, together with information from the code 2050 object about the kernel, such as segment sizes. The ROCm runtime queries on 2051 the kernel symbol can be used to obtain the code object values which are 2052 recorded in the :ref:`amdgpu-amdhsa-code-object-metadata`. 20537. CP executes micro-code and is responsible for detecting and setting up the 2054 GPU to execute the wavefronts of a kernel dispatch. 20558. CP ensures that when the a wavefront starts executing the kernel machine 2056 code, the scalar general purpose registers (SGPR) and vector general purpose 2057 registers (VGPR) are set up as required by the machine code. The required 2058 setup is defined in the :ref:`amdgpu-amdhsa-kernel-descriptor`. The initial 2059 register state is defined in 2060 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`. 20619. The prolog of the kernel machine code (see 2062 :ref:`amdgpu-amdhsa-kernel-prolog`) sets up the machine state as necessary 2063 before continuing executing the machine code that corresponds to the kernel. 206410. When the kernel dispatch has completed execution, CP signals the completion 2065 signal specified in the kernel dispatch packet if not 0. 2066 2067.. _amdgpu-amdhsa-memory-spaces: 2068 2069Memory Spaces 2070~~~~~~~~~~~~~ 2071 2072The memory space properties are: 2073 2074 .. table:: AMDHSA Memory Spaces 2075 :name: amdgpu-amdhsa-memory-spaces-table 2076 2077 ================= =========== ======== ======= ================== 2078 Memory Space Name HSA Segment Hardware Address NULL Value 2079 Name Name Size 2080 ================= =========== ======== ======= ================== 2081 Private private scratch 32 0x00000000 2082 Local group LDS 32 0xFFFFFFFF 2083 Global global global 64 0x0000000000000000 2084 Constant constant *same as 64 0x0000000000000000 2085 global* 2086 Generic flat flat 64 0x0000000000000000 2087 Region N/A GDS 32 *not implemented 2088 for AMDHSA* 2089 ================= =========== ======== ======= ================== 2090 2091The global and constant memory spaces both use global virtual addresses, which 2092are the same virtual address space used by the CPU. However, some virtual 2093addresses may only be accessible to the CPU, some only accessible by the GPU, 2094and some by both. 2095 2096Using the constant memory space indicates that the data will not change during 2097the execution of the kernel. This allows scalar read instructions to be 2098used. The vector and scalar L1 caches are invalidated of volatile data before 2099each kernel dispatch execution to allow constant memory to change values between 2100kernel dispatches. 2101 2102The local memory space uses the hardware Local Data Store (LDS) which is 2103automatically allocated when the hardware creates work-groups of wavefronts, and 2104freed when all the wavefronts of a work-group have terminated. The data store 2105(DS) instructions can be used to access it. 2106 2107The private memory space uses the hardware scratch memory support. If the kernel 2108uses scratch, then the hardware allocates memory that is accessed using 2109wavefront lane dword (4 byte) interleaving. The mapping used from private 2110address to physical address is: 2111 2112 ``wavefront-scratch-base + 2113 (private-address * wavefront-size * 4) + 2114 (wavefront-lane-id * 4)`` 2115 2116There are different ways that the wavefront scratch base address is determined 2117by a wavefront (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). This 2118memory can be accessed in an interleaved manner using buffer instruction with 2119the scratch buffer descriptor and per wavefront scratch offset, by the scratch 2120instructions, or by flat instructions. If each lane of a wavefront accesses the 2121same private address, the interleaving results in adjacent dwords being accessed 2122and hence requires fewer cache lines to be fetched. Multi-dword access is not 2123supported except by flat and scratch instructions in GFX9-GFX10. 2124 2125The generic address space uses the hardware flat address support available in 2126GFX7-GFX10. This uses two fixed ranges of virtual addresses (the private and 2127local appertures), that are outside the range of addressible global memory, to 2128map from a flat address to a private or local address. 2129 2130FLAT instructions can take a flat address and access global, private (scratch) 2131and group (LDS) memory depending in if the address is within one of the 2132apperture ranges. Flat access to scratch requires hardware aperture setup and 2133setup in the kernel prologue (see :ref:`amdgpu-amdhsa-flat-scratch`). Flat 2134access to LDS requires hardware aperture setup and M0 (GFX7-GFX8) register setup 2135(see :ref:`amdgpu-amdhsa-m0`). 2136 2137To convert between a segment address and a flat address the base address of the 2138appertures address can be used. For GFX7-GFX8 these are available in the 2139:ref:`amdgpu-amdhsa-hsa-aql-queue` the address of which can be obtained with 2140Queue Ptr SGPR (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). For 2141GFX9-GFX10 the appature base addresses are directly available as inline constant 2142registers ``SRC_SHARED_BASE/LIMIT`` and ``SRC_PRIVATE_BASE/LIMIT``. In 64 bit 2143address mode the apperture sizes are 2^32 bytes and the base is aligned to 2^32 2144which makes it easier to convert from flat to segment or segment to flat. 2145 2146Image and Samplers 2147~~~~~~~~~~~~~~~~~~ 2148 2149Image and sample handles created by the ROCm runtime are 64 bit addresses of a 2150hardware 32 byte V# and 48 byte S# object respectively. In order to support the 2151HSA ``query_sampler`` operations two extra dwords are used to store the HSA BRIG 2152enumeration values for the queries that are not trivially deducible from the S# 2153representation. 2154 2155HSA Signals 2156~~~~~~~~~~~ 2157 2158HSA signal handles created by the ROCm runtime are 64 bit addresses of a 2159structure allocated in memory accessible from both the CPU and GPU. The 2160structure is defined by the ROCm runtime and subject to change between releases 2161(see [AMD-ROCm-github]_). 2162 2163.. _amdgpu-amdhsa-hsa-aql-queue: 2164 2165HSA AQL Queue 2166~~~~~~~~~~~~~ 2167 2168The HSA AQL queue structure is defined by the ROCm runtime and subject to change 2169between releases (see [AMD-ROCm-github]_). For some processors it contains 2170fields needed to implement certain language features such as the flat address 2171aperture bases. It also contains fields used by CP such as managing the 2172allocation of scratch memory. 2173 2174.. _amdgpu-amdhsa-kernel-descriptor: 2175 2176Kernel Descriptor 2177~~~~~~~~~~~~~~~~~ 2178 2179A kernel descriptor consists of the information needed by CP to initiate the 2180execution of a kernel, including the entry point address of the machine code 2181that implements the kernel. 2182 2183Kernel Descriptor for GFX6-GFX10 2184++++++++++++++++++++++++++++++++ 2185 2186CP microcode requires the Kernel descriptor to be allocated on 64 byte 2187alignment. 2188 2189 .. table:: Kernel Descriptor for GFX6-GFX10 2190 :name: amdgpu-amdhsa-kernel-descriptor-gfx6-gfx10-table 2191 2192 ======= ======= =============================== ============================ 2193 Bits Size Field Name Description 2194 ======= ======= =============================== ============================ 2195 31:0 4 bytes GROUP_SEGMENT_FIXED_SIZE The amount of fixed local 2196 address space memory 2197 required for a work-group 2198 in bytes. This does not 2199 include any dynamically 2200 allocated local address 2201 space memory that may be 2202 added when the kernel is 2203 dispatched. 2204 63:32 4 bytes PRIVATE_SEGMENT_FIXED_SIZE The amount of fixed 2205 private address space 2206 memory required for a 2207 work-item in bytes. If 2208 is_dynamic_callstack is 1 2209 then additional space must 2210 be added to this value for 2211 the call stack. 2212 127:64 8 bytes Reserved, must be 0. 2213 191:128 8 bytes KERNEL_CODE_ENTRY_BYTE_OFFSET Byte offset (possibly 2214 negative) from base 2215 address of kernel 2216 descriptor to kernel's 2217 entry point instruction 2218 which must be 256 byte 2219 aligned. 2220 351:272 20 Reserved, must be 0. 2221 bytes 2222 383:352 4 bytes COMPUTE_PGM_RSRC3 GFX6-9 2223 Reserved, must be 0. 2224 GFX10 2225 Compute Shader (CS) 2226 program settings used by 2227 CP to set up 2228 ``COMPUTE_PGM_RSRC3`` 2229 configuration 2230 register. See 2231 :ref:`amdgpu-amdhsa-compute_pgm_rsrc3-gfx10-table`. 2232 415:384 4 bytes COMPUTE_PGM_RSRC1 Compute Shader (CS) 2233 program settings used by 2234 CP to set up 2235 ``COMPUTE_PGM_RSRC1`` 2236 configuration 2237 register. See 2238 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`. 2239 447:416 4 bytes COMPUTE_PGM_RSRC2 Compute Shader (CS) 2240 program settings used by 2241 CP to set up 2242 ``COMPUTE_PGM_RSRC2`` 2243 configuration 2244 register. See 2245 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`. 2246 448 1 bit ENABLE_SGPR_PRIVATE_SEGMENT Enable the setup of the 2247 _BUFFER SGPR user data registers 2248 (see 2249 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). 2250 2251 The total number of SGPR 2252 user data registers 2253 requested must not exceed 2254 16 and match value in 2255 ``compute_pgm_rsrc2.user_sgpr.user_sgpr_count``. 2256 Any requests beyond 16 2257 will be ignored. 2258 449 1 bit ENABLE_SGPR_DISPATCH_PTR *see above* 2259 450 1 bit ENABLE_SGPR_QUEUE_PTR *see above* 2260 451 1 bit ENABLE_SGPR_KERNARG_SEGMENT_PTR *see above* 2261 452 1 bit ENABLE_SGPR_DISPATCH_ID *see above* 2262 453 1 bit ENABLE_SGPR_FLAT_SCRATCH_INIT *see above* 2263 454 1 bit ENABLE_SGPR_PRIVATE_SEGMENT *see above* 2264 _SIZE 2265 457:455 3 bits Reserved, must be 0. 2266 458 1 bit ENABLE_WAVEFRONT_SIZE32 GFX6-9 2267 Reserved, must be 0. 2268 GFX10 2269 - If 0 execute in 2270 wavefront size 64 mode. 2271 - If 1 execute in 2272 native wavefront size 2273 32 mode. 2274 463:459 5 bits Reserved, must be 0. 2275 511:464 6 bytes Reserved, must be 0. 2276 512 **Total size 64 bytes.** 2277 ======= ==================================================================== 2278 2279.. 2280 2281 .. table:: compute_pgm_rsrc1 for GFX6-GFX10 2282 :name: amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table 2283 2284 ======= ======= =============================== =========================================================================== 2285 Bits Size Field Name Description 2286 ======= ======= =============================== =========================================================================== 2287 5:0 6 bits GRANULATED_WORKITEM_VGPR_COUNT Number of vector register 2288 blocks used by each work-item; 2289 granularity is device 2290 specific: 2291 2292 GFX6-GFX9 2293 - vgprs_used 0..256 2294 - max(0, ceil(vgprs_used / 4) - 1) 2295 GFX10 (wavefront size 64) 2296 - max_vgpr 1..256 2297 - max(0, ceil(vgprs_used / 4) - 1) 2298 GFX10 (wavefront size 32) 2299 - max_vgpr 1..256 2300 - max(0, ceil(vgprs_used / 8) - 1) 2301 2302 Where vgprs_used is defined 2303 as the highest VGPR number 2304 explicitly referenced plus 2305 one. 2306 2307 Used by CP to set up 2308 ``COMPUTE_PGM_RSRC1.VGPRS``. 2309 2310 The 2311 :ref:`amdgpu-assembler` 2312 calculates this 2313 automatically for the 2314 selected processor from 2315 values provided to the 2316 `.amdhsa_kernel` directive 2317 by the 2318 `.amdhsa_next_free_vgpr` 2319 nested directive (see 2320 :ref:`amdhsa-kernel-directives-table`). 2321 9:6 4 bits GRANULATED_WAVEFRONT_SGPR_COUNT Number of scalar register 2322 blocks used by a wavefront; 2323 granularity is device 2324 specific: 2325 2326 GFX6-GFX8 2327 - sgprs_used 0..112 2328 - max(0, ceil(sgprs_used / 8) - 1) 2329 GFX9 2330 - sgprs_used 0..112 2331 - 2 * max(0, ceil(sgprs_used / 16) - 1) 2332 GFX10 2333 Reserved, must be 0. 2334 (128 SGPRs always 2335 allocated.) 2336 2337 Where sgprs_used is 2338 defined as the highest 2339 SGPR number explicitly 2340 referenced plus one, plus 2341 a target-specific number 2342 of additional special 2343 SGPRs for VCC, 2344 FLAT_SCRATCH (GFX7+) and 2345 XNACK_MASK (GFX8+), and 2346 any additional 2347 target-specific 2348 limitations. It does not 2349 include the 16 SGPRs added 2350 if a trap handler is 2351 enabled. 2352 2353 The target-specific 2354 limitations and special 2355 SGPR layout are defined in 2356 the hardware 2357 documentation, which can 2358 be found in the 2359 :ref:`amdgpu-processors` 2360 table. 2361 2362 Used by CP to set up 2363 ``COMPUTE_PGM_RSRC1.SGPRS``. 2364 2365 The 2366 :ref:`amdgpu-assembler` 2367 calculates this 2368 automatically for the 2369 selected processor from 2370 values provided to the 2371 `.amdhsa_kernel` directive 2372 by the 2373 `.amdhsa_next_free_sgpr` 2374 and `.amdhsa_reserve_*` 2375 nested directives (see 2376 :ref:`amdhsa-kernel-directives-table`). 2377 11:10 2 bits PRIORITY Must be 0. 2378 2379 Start executing wavefront 2380 at the specified priority. 2381 2382 CP is responsible for 2383 filling in 2384 ``COMPUTE_PGM_RSRC1.PRIORITY``. 2385 13:12 2 bits FLOAT_ROUND_MODE_32 Wavefront starts execution 2386 with specified rounding 2387 mode for single (32 2388 bit) floating point 2389 precision floating point 2390 operations. 2391 2392 Floating point rounding 2393 mode values are defined in 2394 :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`. 2395 2396 Used by CP to set up 2397 ``COMPUTE_PGM_RSRC1.FLOAT_MODE``. 2398 15:14 2 bits FLOAT_ROUND_MODE_16_64 Wavefront starts execution 2399 with specified rounding 2400 denorm mode for half/double (16 2401 and 64 bit) floating point 2402 precision floating point 2403 operations. 2404 2405 Floating point rounding 2406 mode values are defined in 2407 :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`. 2408 2409 Used by CP to set up 2410 ``COMPUTE_PGM_RSRC1.FLOAT_MODE``. 2411 17:16 2 bits FLOAT_DENORM_MODE_32 Wavefront starts execution 2412 with specified denorm mode 2413 for single (32 2414 bit) floating point 2415 precision floating point 2416 operations. 2417 2418 Floating point denorm mode 2419 values are defined in 2420 :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`. 2421 2422 Used by CP to set up 2423 ``COMPUTE_PGM_RSRC1.FLOAT_MODE``. 2424 19:18 2 bits FLOAT_DENORM_MODE_16_64 Wavefront starts execution 2425 with specified denorm mode 2426 for half/double (16 2427 and 64 bit) floating point 2428 precision floating point 2429 operations. 2430 2431 Floating point denorm mode 2432 values are defined in 2433 :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`. 2434 2435 Used by CP to set up 2436 ``COMPUTE_PGM_RSRC1.FLOAT_MODE``. 2437 20 1 bit PRIV Must be 0. 2438 2439 Start executing wavefront 2440 in privilege trap handler 2441 mode. 2442 2443 CP is responsible for 2444 filling in 2445 ``COMPUTE_PGM_RSRC1.PRIV``. 2446 21 1 bit ENABLE_DX10_CLAMP Wavefront starts execution 2447 with DX10 clamp mode 2448 enabled. Used by the vector 2449 ALU to force DX10 style 2450 treatment of NaN's (when 2451 set, clamp NaN to zero, 2452 otherwise pass NaN 2453 through). 2454 2455 Used by CP to set up 2456 ``COMPUTE_PGM_RSRC1.DX10_CLAMP``. 2457 22 1 bit DEBUG_MODE Must be 0. 2458 2459 Start executing wavefront 2460 in single step mode. 2461 2462 CP is responsible for 2463 filling in 2464 ``COMPUTE_PGM_RSRC1.DEBUG_MODE``. 2465 23 1 bit ENABLE_IEEE_MODE Wavefront starts execution 2466 with IEEE mode 2467 enabled. Floating point 2468 opcodes that support 2469 exception flag gathering 2470 will quiet and propagate 2471 signaling-NaN inputs per 2472 IEEE 754-2008. Min_dx10 and 2473 max_dx10 become IEEE 2474 754-2008 compliant due to 2475 signaling-NaN propagation 2476 and quieting. 2477 2478 Used by CP to set up 2479 ``COMPUTE_PGM_RSRC1.IEEE_MODE``. 2480 24 1 bit BULKY Must be 0. 2481 2482 Only one work-group allowed 2483 to execute on a compute 2484 unit. 2485 2486 CP is responsible for 2487 filling in 2488 ``COMPUTE_PGM_RSRC1.BULKY``. 2489 25 1 bit CDBG_USER Must be 0. 2490 2491 Flag that can be used to 2492 control debugging code. 2493 2494 CP is responsible for 2495 filling in 2496 ``COMPUTE_PGM_RSRC1.CDBG_USER``. 2497 26 1 bit FP16_OVFL GFX6-GFX8 2498 Reserved, must be 0. 2499 GFX9-GFX10 2500 Wavefront starts execution 2501 with specified fp16 overflow 2502 mode. 2503 2504 - If 0, fp16 overflow generates 2505 +/-INF values. 2506 - If 1, fp16 overflow that is the 2507 result of an +/-INF input value 2508 or divide by 0 produces a +/-INF, 2509 otherwise clamps computed 2510 overflow to +/-MAX_FP16 as 2511 appropriate. 2512 2513 Used by CP to set up 2514 ``COMPUTE_PGM_RSRC1.FP16_OVFL``. 2515 28:27 2 bits Reserved, must be 0. 2516 29 1 bit WGP_MODE GFX6-GFX9 2517 Reserved, must be 0. 2518 GFX10 2519 - If 0 execute work-groups in 2520 CU wavefront execution mode. 2521 - If 1 execute work-groups on 2522 in WGP wavefront execution mode. 2523 2524 See :ref:`amdgpu-amdhsa-memory-model`. 2525 2526 Used by CP to set up 2527 ``COMPUTE_PGM_RSRC1.WGP_MODE``. 2528 30 1 bit MEM_ORDERED GFX6-9 2529 Reserved, must be 0. 2530 GFX10 2531 Controls the behavior of the 2532 waitcnt's vmcnt and vscnt 2533 counters. 2534 2535 - If 0 vmcnt reports completion 2536 of load and atomic with return 2537 out of order with sample 2538 instructions, and the vscnt 2539 reports the completion of 2540 store and atomic without 2541 return in order. 2542 - If 1 vmcnt reports completion 2543 of load, atomic with return 2544 and sample instructions in 2545 order, and the vscnt reports 2546 the completion of store and 2547 atomic without return in order. 2548 2549 Used by CP to set up 2550 ``COMPUTE_PGM_RSRC1.MEM_ORDERED``. 2551 31 1 bit FWD_PROGRESS GFX6-9 2552 Reserved, must be 0. 2553 GFX10 2554 - If 0 execute SIMD wavefronts 2555 using oldest first policy. 2556 - If 1 execute SIMD wavefronts to 2557 ensure wavefronts will make some 2558 forward progress. 2559 2560 Used by CP to set up 2561 ``COMPUTE_PGM_RSRC1.FWD_PROGRESS``. 2562 32 **Total size 4 bytes** 2563 ======= =================================================================================================================== 2564 2565.. 2566 2567 .. table:: compute_pgm_rsrc2 for GFX6-GFX10 2568 :name: amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table 2569 2570 ======= ======= =============================== =========================================================================== 2571 Bits Size Field Name Description 2572 ======= ======= =============================== =========================================================================== 2573 0 1 bit ENABLE_SGPR_PRIVATE_SEGMENT Enable the setup of the 2574 _WAVEFRONT_OFFSET SGPR wavefront scratch offset 2575 system register (see 2576 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). 2577 2578 Used by CP to set up 2579 ``COMPUTE_PGM_RSRC2.SCRATCH_EN``. 2580 5:1 5 bits USER_SGPR_COUNT The total number of SGPR 2581 user data registers 2582 requested. This number must 2583 match the number of user 2584 data registers enabled. 2585 2586 Used by CP to set up 2587 ``COMPUTE_PGM_RSRC2.USER_SGPR``. 2588 6 1 bit ENABLE_TRAP_HANDLER Must be 0. 2589 2590 This bit represents 2591 ``COMPUTE_PGM_RSRC2.TRAP_PRESENT``, 2592 which is set by the CP if 2593 the runtime has installed a 2594 trap handler. 2595 7 1 bit ENABLE_SGPR_WORKGROUP_ID_X Enable the setup of the 2596 system SGPR register for 2597 the work-group id in the X 2598 dimension (see 2599 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). 2600 2601 Used by CP to set up 2602 ``COMPUTE_PGM_RSRC2.TGID_X_EN``. 2603 8 1 bit ENABLE_SGPR_WORKGROUP_ID_Y Enable the setup of the 2604 system SGPR register for 2605 the work-group id in the Y 2606 dimension (see 2607 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). 2608 2609 Used by CP to set up 2610 ``COMPUTE_PGM_RSRC2.TGID_Y_EN``. 2611 9 1 bit ENABLE_SGPR_WORKGROUP_ID_Z Enable the setup of the 2612 system SGPR register for 2613 the work-group id in the Z 2614 dimension (see 2615 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). 2616 2617 Used by CP to set up 2618 ``COMPUTE_PGM_RSRC2.TGID_Z_EN``. 2619 10 1 bit ENABLE_SGPR_WORKGROUP_INFO Enable the setup of the 2620 system SGPR register for 2621 work-group information (see 2622 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). 2623 2624 Used by CP to set up 2625 ``COMPUTE_PGM_RSRC2.TGID_SIZE_EN``. 2626 12:11 2 bits ENABLE_VGPR_WORKITEM_ID Enable the setup of the 2627 VGPR system registers used 2628 for the work-item ID. 2629 :ref:`amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table` 2630 defines the values. 2631 2632 Used by CP to set up 2633 ``COMPUTE_PGM_RSRC2.TIDIG_CMP_CNT``. 2634 13 1 bit ENABLE_EXCEPTION_ADDRESS_WATCH Must be 0. 2635 2636 Wavefront starts execution 2637 with address watch 2638 exceptions enabled which 2639 are generated when L1 has 2640 witnessed a thread access 2641 an *address of 2642 interest*. 2643 2644 CP is responsible for 2645 filling in the address 2646 watch bit in 2647 ``COMPUTE_PGM_RSRC2.EXCP_EN_MSB`` 2648 according to what the 2649 runtime requests. 2650 14 1 bit ENABLE_EXCEPTION_MEMORY Must be 0. 2651 2652 Wavefront starts execution 2653 with memory violation 2654 exceptions exceptions 2655 enabled which are generated 2656 when a memory violation has 2657 occurred for this wavefront from 2658 L1 or LDS 2659 (write-to-read-only-memory, 2660 mis-aligned atomic, LDS 2661 address out of range, 2662 illegal address, etc.). 2663 2664 CP sets the memory 2665 violation bit in 2666 ``COMPUTE_PGM_RSRC2.EXCP_EN_MSB`` 2667 according to what the 2668 runtime requests. 2669 23:15 9 bits GRANULATED_LDS_SIZE Must be 0. 2670 2671 CP uses the rounded value 2672 from the dispatch packet, 2673 not this value, as the 2674 dispatch may contain 2675 dynamically allocated group 2676 segment memory. CP writes 2677 directly to 2678 ``COMPUTE_PGM_RSRC2.LDS_SIZE``. 2679 2680 Amount of group segment 2681 (LDS) to allocate for each 2682 work-group. Granularity is 2683 device specific: 2684 2685 GFX6: 2686 roundup(lds-size / (64 * 4)) 2687 GFX7-GFX10: 2688 roundup(lds-size / (128 * 4)) 2689 2690 24 1 bit ENABLE_EXCEPTION_IEEE_754_FP Wavefront starts execution 2691 _INVALID_OPERATION with specified exceptions 2692 enabled. 2693 2694 Used by CP to set up 2695 ``COMPUTE_PGM_RSRC2.EXCP_EN`` 2696 (set from bits 0..6). 2697 2698 IEEE 754 FP Invalid 2699 Operation 2700 25 1 bit ENABLE_EXCEPTION_FP_DENORMAL FP Denormal one or more 2701 _SOURCE input operands is a 2702 denormal number 2703 26 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP Division by 2704 _DIVISION_BY_ZERO Zero 2705 27 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP FP Overflow 2706 _OVERFLOW 2707 28 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP Underflow 2708 _UNDERFLOW 2709 29 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP Inexact 2710 _INEXACT 2711 30 1 bit ENABLE_EXCEPTION_INT_DIVIDE_BY Integer Division by Zero 2712 _ZERO (rcp_iflag_f32 instruction 2713 only) 2714 31 1 bit Reserved, must be 0. 2715 32 **Total size 4 bytes.** 2716 ======= =================================================================================================================== 2717 2718.. 2719 2720 .. table:: compute_pgm_rsrc3 for GFX10 2721 :name: amdgpu-amdhsa-compute_pgm_rsrc3-gfx10-table 2722 2723 ======= ======= =============================== =========================================================================== 2724 Bits Size Field Name Description 2725 ======= ======= =============================== =========================================================================== 2726 3:0 4 bits SHARED_VGPR_COUNT Number of shared VGPRs for wavefront size 64. Granularity 8. Value 0-120. 2727 compute_pgm_rsrc1.vgprs + shared_vgpr_cnt cannot exceed 64. 2728 31:4 28 Reserved, must be 0. 2729 bits 2730 32 **Total size 4 bytes.** 2731 ======= =================================================================================================================== 2732 2733.. 2734 2735 .. table:: Floating Point Rounding Mode Enumeration Values 2736 :name: amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table 2737 2738 ====================================== ===== ============================== 2739 Enumeration Name Value Description 2740 ====================================== ===== ============================== 2741 FLOAT_ROUND_MODE_NEAR_EVEN 0 Round Ties To Even 2742 FLOAT_ROUND_MODE_PLUS_INFINITY 1 Round Toward +infinity 2743 FLOAT_ROUND_MODE_MINUS_INFINITY 2 Round Toward -infinity 2744 FLOAT_ROUND_MODE_ZERO 3 Round Toward 0 2745 ====================================== ===== ============================== 2746 2747.. 2748 2749 .. table:: Floating Point Denorm Mode Enumeration Values 2750 :name: amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table 2751 2752 ====================================== ===== ============================== 2753 Enumeration Name Value Description 2754 ====================================== ===== ============================== 2755 FLOAT_DENORM_MODE_FLUSH_SRC_DST 0 Flush Source and Destination 2756 Denorms 2757 FLOAT_DENORM_MODE_FLUSH_DST 1 Flush Output Denorms 2758 FLOAT_DENORM_MODE_FLUSH_SRC 2 Flush Source Denorms 2759 FLOAT_DENORM_MODE_FLUSH_NONE 3 No Flush 2760 ====================================== ===== ============================== 2761 2762.. 2763 2764 .. table:: System VGPR Work-Item ID Enumeration Values 2765 :name: amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table 2766 2767 ======================================== ===== ============================ 2768 Enumeration Name Value Description 2769 ======================================== ===== ============================ 2770 SYSTEM_VGPR_WORKITEM_ID_X 0 Set work-item X dimension 2771 ID. 2772 SYSTEM_VGPR_WORKITEM_ID_X_Y 1 Set work-item X and Y 2773 dimensions ID. 2774 SYSTEM_VGPR_WORKITEM_ID_X_Y_Z 2 Set work-item X, Y and Z 2775 dimensions ID. 2776 SYSTEM_VGPR_WORKITEM_ID_UNDEFINED 3 Undefined. 2777 ======================================== ===== ============================ 2778 2779.. _amdgpu-amdhsa-initial-kernel-execution-state: 2780 2781Initial Kernel Execution State 2782~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 2783 2784This section defines the register state that will be set up by the packet 2785processor prior to the start of execution of every wavefront. This is limited by 2786the constraints of the hardware controllers of CP/ADC/SPI. 2787 2788The order of the SGPR registers is defined, but the compiler can specify which 2789ones are actually setup in the kernel descriptor using the ``enable_sgpr_*`` bit 2790fields (see :ref:`amdgpu-amdhsa-kernel-descriptor`). The register numbers used 2791for enabled registers are dense starting at SGPR0: the first enabled register is 2792SGPR0, the next enabled register is SGPR1 etc.; disabled registers do not have 2793an SGPR number. 2794 2795The initial SGPRs comprise up to 16 User SRGPs that are set by CP and apply to 2796all wavefronts of the grid. It is possible to specify more than 16 User SGPRs using 2797the ``enable_sgpr_*`` bit fields, in which case only the first 16 are actually 2798initialized. These are then immediately followed by the System SGPRs that are 2799set up by ADC/SPI and can have different values for each wavefront of the grid 2800dispatch. 2801 2802SGPR register initial state is defined in 2803:ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`. 2804 2805 .. table:: SGPR Register Set Up Order 2806 :name: amdgpu-amdhsa-sgpr-register-set-up-order-table 2807 2808 ========== ========================== ====== ============================== 2809 SGPR Order Name Number Description 2810 (kernel descriptor enable of 2811 field) SGPRs 2812 ========== ========================== ====== ============================== 2813 First Private Segment Buffer 4 V# that can be used, together 2814 (enable_sgpr_private with Scratch Wavefront Offset 2815 _segment_buffer) as an offset, to access the 2816 private memory space using a 2817 segment address. 2818 2819 CP uses the value provided by 2820 the runtime. 2821 then Dispatch Ptr 2 64 bit address of AQL dispatch 2822 (enable_sgpr_dispatch_ptr) packet for kernel dispatch 2823 actually executing. 2824 then Queue Ptr 2 64 bit address of amd_queue_t 2825 (enable_sgpr_queue_ptr) object for AQL queue on which 2826 the dispatch packet was 2827 queued. 2828 then Kernarg Segment Ptr 2 64 bit address of Kernarg 2829 (enable_sgpr_kernarg segment. This is directly 2830 _segment_ptr) copied from the 2831 kernarg_address in the kernel 2832 dispatch packet. 2833 2834 Having CP load it once avoids 2835 loading it at the beginning of 2836 every wavefront. 2837 then Dispatch Id 2 64 bit Dispatch ID of the 2838 (enable_sgpr_dispatch_id) dispatch packet being 2839 executed. 2840 then Flat Scratch Init 2 This is 2 SGPRs: 2841 (enable_sgpr_flat_scratch 2842 _init) GFX6 2843 Not supported. 2844 GFX7-GFX8 2845 The first SGPR is a 32 bit 2846 byte offset from 2847 ``SH_HIDDEN_PRIVATE_BASE_VIMID`` 2848 to per SPI base of memory 2849 for scratch for the queue 2850 executing the kernel 2851 dispatch. CP obtains this 2852 from the runtime. (The 2853 Scratch Segment Buffer base 2854 address is 2855 ``SH_HIDDEN_PRIVATE_BASE_VIMID`` 2856 plus this offset.) The value 2857 of Scratch Wavefront Offset must 2858 be added to this offset by 2859 the kernel machine code, 2860 right shifted by 8, and 2861 moved to the FLAT_SCRATCH_HI 2862 SGPR register. 2863 FLAT_SCRATCH_HI corresponds 2864 to SGPRn-4 on GFX7, and 2865 SGPRn-6 on GFX8 (where SGPRn 2866 is the highest numbered SGPR 2867 allocated to the wavefront). 2868 FLAT_SCRATCH_HI is 2869 multiplied by 256 (as it is 2870 in units of 256 bytes) and 2871 added to 2872 ``SH_HIDDEN_PRIVATE_BASE_VIMID`` 2873 to calculate the per wavefront 2874 FLAT SCRATCH BASE in flat 2875 memory instructions that 2876 access the scratch 2877 apperture. 2878 2879 The second SGPR is 32 bit 2880 byte size of a single 2881 work-item's scratch memory 2882 usage. CP obtains this from 2883 the runtime, and it is 2884 always a multiple of DWORD. 2885 CP checks that the value in 2886 the kernel dispatch packet 2887 Private Segment Byte Size is 2888 not larger, and requests the 2889 runtime to increase the 2890 queue's scratch size if 2891 necessary. The kernel code 2892 must move it to 2893 FLAT_SCRATCH_LO which is 2894 SGPRn-3 on GFX7 and SGPRn-5 2895 on GFX8. FLAT_SCRATCH_LO is 2896 used as the FLAT SCRATCH 2897 SIZE in flat memory 2898 instructions. Having CP load 2899 it once avoids loading it at 2900 the beginning of every 2901 wavefront. 2902 GFX9-GFX10 2903 This is the 2904 64 bit base address of the 2905 per SPI scratch backing 2906 memory managed by SPI for 2907 the queue executing the 2908 kernel dispatch. CP obtains 2909 this from the runtime (and 2910 divides it if there are 2911 multiple Shader Arrays each 2912 with its own SPI). The value 2913 of Scratch Wavefront Offset must 2914 be added by the kernel 2915 machine code and the result 2916 moved to the FLAT_SCRATCH 2917 SGPR which is SGPRn-6 and 2918 SGPRn-5. It is used as the 2919 FLAT SCRATCH BASE in flat 2920 memory instructions. 2921 then Private Segment Size 1 The 32 bit byte size of a 2922 (enable_sgpr_private single 2923 work-item's 2924 scratch_segment_size) memory 2925 allocation. This is the 2926 value from the kernel 2927 dispatch packet Private 2928 Segment Byte Size rounded up 2929 by CP to a multiple of 2930 DWORD. 2931 2932 Having CP load it once avoids 2933 loading it at the beginning of 2934 every wavefront. 2935 2936 This is not used for 2937 GFX7-GFX8 since it is the same 2938 value as the second SGPR of 2939 Flat Scratch Init. However, it 2940 may be needed for GFX9-GFX10 which 2941 changes the meaning of the 2942 Flat Scratch Init value. 2943 then Grid Work-Group Count X 1 32 bit count of the number of 2944 (enable_sgpr_grid work-groups in the X dimension 2945 _workgroup_count_X) for the grid being 2946 executed. Computed from the 2947 fields in the kernel dispatch 2948 packet as ((grid_size.x + 2949 workgroup_size.x - 1) / 2950 workgroup_size.x). 2951 then Grid Work-Group Count Y 1 32 bit count of the number of 2952 (enable_sgpr_grid work-groups in the Y dimension 2953 _workgroup_count_Y && for the grid being 2954 less than 16 previous executed. Computed from the 2955 SGPRs) fields in the kernel dispatch 2956 packet as ((grid_size.y + 2957 workgroup_size.y - 1) / 2958 workgroupSize.y). 2959 2960 Only initialized if <16 2961 previous SGPRs initialized. 2962 then Grid Work-Group Count Z 1 32 bit count of the number of 2963 (enable_sgpr_grid work-groups in the Z dimension 2964 _workgroup_count_Z && for the grid being 2965 less than 16 previous executed. Computed from the 2966 SGPRs) fields in the kernel dispatch 2967 packet as ((grid_size.z + 2968 workgroup_size.z - 1) / 2969 workgroupSize.z). 2970 2971 Only initialized if <16 2972 previous SGPRs initialized. 2973 then Work-Group Id X 1 32 bit work-group id in X 2974 (enable_sgpr_workgroup_id dimension of grid for 2975 _X) wavefront. 2976 then Work-Group Id Y 1 32 bit work-group id in Y 2977 (enable_sgpr_workgroup_id dimension of grid for 2978 _Y) wavefront. 2979 then Work-Group Id Z 1 32 bit work-group id in Z 2980 (enable_sgpr_workgroup_id dimension of grid for 2981 _Z) wavefront. 2982 then Work-Group Info 1 {first_wavefront, 14'b0000, 2983 (enable_sgpr_workgroup ordered_append_term[10:0], 2984 _info) threadgroup_size_in_wavefronts[5:0]} 2985 then Scratch Wavefront Offset 1 32 bit byte offset from base 2986 (enable_sgpr_private of scratch base of queue 2987 _segment_wavefront_offset) executing the kernel 2988 dispatch. Must be used as an 2989 offset with Private 2990 segment address when using 2991 Scratch Segment Buffer. It 2992 must be used to set up FLAT 2993 SCRATCH for flat addressing 2994 (see 2995 :ref:`amdgpu-amdhsa-flat-scratch`). 2996 ========== ========================== ====== ============================== 2997 2998The order of the VGPR registers is defined, but the compiler can specify which 2999ones are actually setup in the kernel descriptor using the ``enable_vgpr*`` bit 3000fields (see :ref:`amdgpu-amdhsa-kernel-descriptor`). The register numbers used 3001for enabled registers are dense starting at VGPR0: the first enabled register is 3002VGPR0, the next enabled register is VGPR1 etc.; disabled registers do not have a 3003VGPR number. 3004 3005VGPR register initial state is defined in 3006:ref:`amdgpu-amdhsa-vgpr-register-set-up-order-table`. 3007 3008 .. table:: VGPR Register Set Up Order 3009 :name: amdgpu-amdhsa-vgpr-register-set-up-order-table 3010 3011 ========== ========================== ====== ============================== 3012 VGPR Order Name Number Description 3013 (kernel descriptor enable of 3014 field) VGPRs 3015 ========== ========================== ====== ============================== 3016 First Work-Item Id X 1 32 bit work item id in X 3017 (Always initialized) dimension of work-group for 3018 wavefront lane. 3019 then Work-Item Id Y 1 32 bit work item id in Y 3020 (enable_vgpr_workitem_id dimension of work-group for 3021 > 0) wavefront lane. 3022 then Work-Item Id Z 1 32 bit work item id in Z 3023 (enable_vgpr_workitem_id dimension of work-group for 3024 > 1) wavefront lane. 3025 ========== ========================== ====== ============================== 3026 3027The setting of registers is done by GPU CP/ADC/SPI hardware as follows: 3028 30291. SGPRs before the Work-Group Ids are set by CP using the 16 User Data 3030 registers. 30312. Work-group Id registers X, Y, Z are set by ADC which supports any 3032 combination including none. 30333. Scratch Wavefront Offset is set by SPI in a per wavefront basis which is why 3034 its value cannot included with the flat scratch init value which is per queue. 30354. The VGPRs are set by SPI which only supports specifying either (X), (X, Y) 3036 or (X, Y, Z). 3037 3038Flat Scratch register pair are adjacent SGRRs so they can be moved as a 64 bit 3039value to the hardware required SGPRn-3 and SGPRn-4 respectively. 3040 3041The global segment can be accessed either using buffer instructions (GFX6 which 3042has V# 64 bit address support), flat instructions (GFX7-GFX10), or global 3043instructions (GFX9-GFX10). 3044 3045If buffer operations are used then the compiler can generate a V# with the 3046following properties: 3047 3048* base address of 0 3049* no swizzle 3050* ATC: 1 if IOMMU present (such as APU) 3051* ptr64: 1 3052* MTYPE set to support memory coherence that matches the runtime (such as CC for 3053 APU and NC for dGPU). 3054 3055.. _amdgpu-amdhsa-kernel-prolog: 3056 3057Kernel Prolog 3058~~~~~~~~~~~~~ 3059 3060.. _amdgpu-amdhsa-m0: 3061 3062M0 3063++ 3064 3065GFX6-GFX8 3066 The M0 register must be initialized with a value at least the total LDS size 3067 if the kernel may access LDS via DS or flat operations. Total LDS size is 3068 available in dispatch packet. For M0, it is also possible to use maximum 3069 possible value of LDS for given target (0x7FFF for GFX6 and 0xFFFF for 3070 GFX7-GFX8). 3071GFX9-GFX10 3072 The M0 register is not used for range checking LDS accesses and so does not 3073 need to be initialized in the prolog. 3074 3075.. _amdgpu-amdhsa-flat-scratch: 3076 3077Flat Scratch 3078++++++++++++ 3079 3080If the kernel may use flat operations to access scratch memory, the prolog code 3081must set up FLAT_SCRATCH register pair (FLAT_SCRATCH_LO/FLAT_SCRATCH_HI which 3082are in SGPRn-4/SGPRn-3). Initialization uses Flat Scratch Init and Scratch Wavefront 3083Offset SGPR registers (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`): 3084 3085GFX6 3086 Flat scratch is not supported. 3087 3088GFX7-GFX8 3089 1. The low word of Flat Scratch Init is 32 bit byte offset from 3090 ``SH_HIDDEN_PRIVATE_BASE_VIMID`` to the base of scratch backing memory 3091 being managed by SPI for the queue executing the kernel dispatch. This is 3092 the same value used in the Scratch Segment Buffer V# base address. The 3093 prolog must add the value of Scratch Wavefront Offset to get the wavefront's byte 3094 scratch backing memory offset from ``SH_HIDDEN_PRIVATE_BASE_VIMID``. Since 3095 FLAT_SCRATCH_LO is in units of 256 bytes, the offset must be right shifted 3096 by 8 before moving into FLAT_SCRATCH_LO. 3097 2. The second word of Flat Scratch Init is 32 bit byte size of a single 3098 work-items scratch memory usage. This is directly loaded from the kernel 3099 dispatch packet Private Segment Byte Size and rounded up to a multiple of 3100 DWORD. Having CP load it once avoids loading it at the beginning of every 3101 wavefront. The prolog must move it to FLAT_SCRATCH_LO for use as FLAT SCRATCH 3102 SIZE. 3103 3104GFX9-GFX10 3105 The Flat Scratch Init is the 64 bit address of the base of scratch backing 3106 memory being managed by SPI for the queue executing the kernel dispatch. The 3107 prolog must add the value of Scratch Wavefront Offset and moved to the FLAT_SCRATCH 3108 pair for use as the flat scratch base in flat memory instructions. 3109 3110.. _amdgpu-amdhsa-memory-model: 3111 3112Memory Model 3113~~~~~~~~~~~~ 3114 3115This section describes the mapping of LLVM memory model onto AMDGPU machine code 3116(see :ref:`memmodel`). *The implementation is WIP.* 3117 3118.. TODO 3119 Update when implementation complete. 3120 3121The AMDGPU backend supports the memory synchronization scopes specified in 3122:ref:`amdgpu-memory-scopes`. 3123 3124The code sequences used to implement the memory model are defined in table 3125:ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx10-table`. 3126 3127The sequences specify the order of instructions that a single thread must 3128execute. The ``s_waitcnt`` and ``buffer_wbinvl1_vol`` are defined with respect 3129to other memory instructions executed by the same thread. This allows them to be 3130moved earlier or later which can allow them to be combined with other instances 3131of the same instruction, or hoisted/sunk out of loops to improve 3132performance. Only the instructions related to the memory model are given; 3133additional ``s_waitcnt`` instructions are required to ensure registers are 3134defined before being used. These may be able to be combined with the memory 3135model ``s_waitcnt`` instructions as described above. 3136 3137The AMDGPU backend supports the following memory models: 3138 3139 HSA Memory Model [HSA]_ 3140 The HSA memory model uses a single happens-before relation for all address 3141 spaces (see :ref:`amdgpu-address-spaces`). 3142 OpenCL Memory Model [OpenCL]_ 3143 The OpenCL memory model which has separate happens-before relations for the 3144 global and local address spaces. Only a fence specifying both global and 3145 local address space, and seq_cst instructions join the relationships. Since 3146 the LLVM ``memfence`` instruction does not allow an address space to be 3147 specified the OpenCL fence has to convervatively assume both local and 3148 global address space was specified. However, optimizations can often be 3149 done to eliminate the additional ``s_waitcnt`` instructions when there are 3150 no intervening memory instructions which access the corresponding address 3151 space. The code sequences in the table indicate what can be omitted for the 3152 OpenCL memory. The target triple environment is used to determine if the 3153 source language is OpenCL (see :ref:`amdgpu-opencl`). 3154 3155``ds/flat_load/store/atomic`` instructions to local memory are termed LDS 3156operations. 3157 3158``buffer/global/flat_load/store/atomic`` instructions to global memory are 3159termed vector memory operations. 3160 3161For GFX6-GFX9: 3162 3163* Each agent has multiple shader arrays (SA). 3164* Each SA has multiple compute units (CU). 3165* Each CU has multiple SIMDs that execute wavefronts. 3166* The wavefronts for a single work-group are executed in the same CU but may be 3167 executed by different SIMDs. 3168* Each CU has a single LDS memory shared by the wavefronts of the work-groups 3169 executing on it. 3170* All LDS operations of a CU are performed as wavefront wide operations in a 3171 global order and involve no caching. Completion is reported to a wavefront in 3172 execution order. 3173* The LDS memory has multiple request queues shared by the SIMDs of a 3174 CU. Therefore, the LDS operations performed by different wavefronts of a work-group 3175 can be reordered relative to each other, which can result in reordering the 3176 visibility of vector memory operations with respect to LDS operations of other 3177 wavefronts in the same work-group. A ``s_waitcnt lgkmcnt(0)`` is required to 3178 ensure synchronization between LDS operations and vector memory operations 3179 between wavefronts of a work-group, but not between operations performed by the 3180 same wavefront. 3181* The vector memory operations are performed as wavefront wide operations and 3182 completion is reported to a wavefront in execution order. The exception is 3183 that for GFX7-GFX9 ``flat_load/store/atomic`` instructions can report out of 3184 vector memory order if they access LDS memory, and out of LDS operation order 3185 if they access global memory. 3186* The vector memory operations access a single vector L1 cache shared by all 3187 SIMDs a CU. Therefore, no special action is required for coherence between the 3188 lanes of a single wavefront, or for coherence between wavefronts in the same 3189 work-group. A ``buffer_wbinvl1_vol`` is required for coherence between wavefronts 3190 executing in different work-groups as they may be executing on different CUs. 3191* The scalar memory operations access a scalar L1 cache shared by all wavefronts 3192 on a group of CUs. The scalar and vector L1 caches are not coherent. However, 3193 scalar operations are used in a restricted way so do not impact the memory 3194 model. See :ref:`amdgpu-amdhsa-memory-spaces`. 3195* The vector and scalar memory operations use an L2 cache shared by all CUs on 3196 the same agent. 3197* The L2 cache has independent channels to service disjoint ranges of virtual 3198 addresses. 3199* Each CU has a separate request queue per channel. Therefore, the vector and 3200 scalar memory operations performed by wavefronts executing in different work-groups 3201 (which may be executing on different CUs) of an agent can be reordered 3202 relative to each other. A ``s_waitcnt vmcnt(0)`` is required to ensure 3203 synchronization between vector memory operations of different CUs. It ensures a 3204 previous vector memory operation has completed before executing a subsequent 3205 vector memory or LDS operation and so can be used to meet the requirements of 3206 acquire and release. 3207* The L2 cache can be kept coherent with other agents on some targets, or ranges 3208 of virtual addresses can be set up to bypass it to ensure system coherence. 3209 3210For GFX10: 3211 3212* Each agent has multiple shader arrays (SA). 3213* Each SA has multiple work-group processors (WGP). 3214* Each WGP has multiple compute units (CU). 3215* Each CU has multiple SIMDs that execute wavefronts. 3216* The wavefronts for a single work-group are executed in the same 3217 WGP. In CU wavefront execution mode the wavefronts may be executed by 3218 different SIMDs in the same CU. In WGP wavefront execution mode the 3219 wavefronts may be executed by different SIMDs in different CUs in the same 3220 WGP. 3221* Each WGP has a single LDS memory shared by the wavefronts of the work-groups 3222 executing on it. 3223* All LDS operations of a WGP are performed as wavefront wide operations in a 3224 global order and involve no caching. Completion is reported to a wavefront in 3225 execution order. 3226* The LDS memory has multiple request queues shared by the SIMDs of a 3227 WGP. Therefore, the LDS operations performed by different wavefronts of a work-group 3228 can be reordered relative to each other, which can result in reordering the 3229 visibility of vector memory operations with respect to LDS operations of other 3230 wavefronts in the same work-group. A ``s_waitcnt lgkmcnt(0)`` is required to 3231 ensure synchronization between LDS operations and vector memory operations 3232 between wavefronts of a work-group, but not between operations performed by the 3233 same wavefront. 3234* The vector memory operations are performed as wavefront wide operations. 3235 Completion of load/store/sample operations are reported to a wavefront in 3236 execution order of other load/store/sample operations performed by that 3237 wavefront. 3238* The vector memory operations access a vector L0 cache. There is a single L0 3239 cache per CU. Each SIMD of a CU accesses the same L0 cache. 3240 Therefore, no special action is required for coherence between the lanes of a 3241 single wavefront. However, a ``BUFFER_GL0_INV`` is required for coherence 3242 between wavefronts executing in the same work-group as they may be executing on 3243 SIMDs of different CUs that access different L0s. A ``BUFFER_GL0_INV`` is also 3244 required for coherence between wavefronts executing in different work-groups as 3245 they may be executing on different WGPs. 3246* The scalar memory operations access a scalar L0 cache shared by all wavefronts 3247 on a WGP. The scalar and vector L0 caches are not coherent. However, scalar 3248 operations are used in a restricted way so do not impact the memory model. See 3249 :ref:`amdgpu-amdhsa-memory-spaces`. 3250* The vector and scalar memory L0 caches use an L1 cache shared by all WGPs on 3251 the same SA. Therefore, no special action is required for coherence between 3252 the wavefronts of a single work-group. However, a ``BUFFER_GL1_INV`` is 3253 required for coherence between wavefronts executing in different work-groups as 3254 they may be executing on different SAs that access different L1s. 3255* The L1 caches have independent quadrants to service disjoint ranges of virtual 3256 addresses. 3257* Each L0 cache has a separate request queue per L1 quadrant. Therefore, the 3258 vector and scalar memory operations performed by different wavefronts, whether 3259 executing in the same or different work-groups (which may be executing on 3260 different CUs accessing different L0s), can be reordered relative to each 3261 other. A ``s_waitcnt vmcnt(0) & vscnt(0)`` is required to ensure synchronization 3262 between vector memory operations of different wavefronts. It ensures a previous 3263 vector memory operation has completed before executing a subsequent vector 3264 memory or LDS operation and so can be used to meet the requirements of acquire, 3265 release and sequential consistency. 3266* The L1 caches use an L2 cache shared by all SAs on the same agent. 3267* The L2 cache has independent channels to service disjoint ranges of virtual 3268 addresses. 3269* Each L1 quadrant of a single SA accesses a different L2 channel. Each L1 3270 quadrant has a separate request queue per L2 channel. Therefore, the vector 3271 and scalar memory operations performed by wavefronts executing in different 3272 work-groups (which may be executing on different SAs) of an agent can be 3273 reordered relative to each other. A ``s_waitcnt vmcnt(0) & vscnt(0)`` is 3274 required to ensure synchronization between vector memory operations of 3275 different SAs. It ensures a previous vector memory operation has completed 3276 before executing a subsequent vector memory and so can be used to meet the 3277 requirements of acquire, release and sequential consistency. 3278* The L2 cache can be kept coherent with other agents on some targets, or ranges 3279 of virtual addresses can be set up to bypass it to ensure system coherence. 3280 3281Private address space uses ``buffer_load/store`` using the scratch V# (GFX6-GFX8), 3282or ``scratch_load/store`` (GFX9-GFX10). Since only a single thread is accessing the 3283memory, atomic memory orderings are not meaningful and all accesses are treated 3284as non-atomic. 3285 3286Constant address space uses ``buffer/global_load`` instructions (or equivalent 3287scalar memory instructions). Since the constant address space contents do not 3288change during the execution of a kernel dispatch it is not legal to perform 3289stores, and atomic memory orderings are not meaningful and all access are 3290treated as non-atomic. 3291 3292A memory synchronization scope wider than work-group is not meaningful for the 3293group (LDS) address space and is treated as work-group. 3294 3295The memory model does not support the region address space which is treated as 3296non-atomic. 3297 3298Acquire memory ordering is not meaningful on store atomic instructions and is 3299treated as non-atomic. 3300 3301Release memory ordering is not meaningful on load atomic instructions and is 3302treated a non-atomic. 3303 3304Acquire-release memory ordering is not meaningful on load or store atomic 3305instructions and is treated as acquire and release respectively. 3306 3307AMDGPU backend only uses scalar memory operations to access memory that is 3308proven to not change during the execution of the kernel dispatch. This includes 3309constant address space and global address space for program scope const 3310variables. Therefore the kernel machine code does not have to maintain the 3311scalar L1 cache to ensure it is coherent with the vector L1 cache. The scalar 3312and vector L1 caches are invalidated between kernel dispatches by CP since 3313constant address space data may change between kernel dispatch executions. See 3314:ref:`amdgpu-amdhsa-memory-spaces`. 3315 3316The one execption is if scalar writes are used to spill SGPR registers. In this 3317case the AMDGPU backend ensures the memory location used to spill is never 3318accessed by vector memory operations at the same time. If scalar writes are used 3319then a ``s_dcache_wb`` is inserted before the ``s_endpgm`` and before a function 3320return since the locations may be used for vector memory instructions by a 3321future wavefront that uses the same scratch area, or a function call that creates a 3322frame at the same address, respectively. There is no need for a ``s_dcache_inv`` 3323as all scalar writes are write-before-read in the same thread. 3324 3325For GFX6-GFX9, scratch backing memory (which is used for the private address space) 3326is accessed with MTYPE NC_NV (non-coherenent non-volatile). Since the private 3327address space is only accessed by a single thread, and is always 3328write-before-read, there is never a need to invalidate these entries from the L1 3329cache. Hence all cache invalidates are done as ``*_vol`` to only invalidate the 3330volatile cache lines. 3331 3332For GFX10, scratch backing memory (which is used for the private address space) 3333is accessed with MTYPE NC (non-coherenent). Since the private address space is 3334only accessed by a single thread, and is always write-before-read, there is 3335never a need to invalidate these entries from the L0 or L1 caches. 3336 3337For GFX10, wavefronts are executed in native mode with in-order reporting of loads 3338and sample instructions. In this mode vmcnt reports completion of load, atomic 3339with return and sample instructions in order, and the vscnt reports the 3340completion of store and atomic without return in order. See ``MEM_ORDERED`` field 3341in :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`. 3342 3343In GFX10, wavefronts can be executed in WGP or CU wavefront execution mode: 3344 3345* In WGP wavefront execution mode the wavefronts of a work-group are executed 3346 on the SIMDs of both CUs of the WGP. Therefore, explicit management of the per 3347 CU L0 caches is required for work-group synchronization. Also accesses to L1 at 3348 work-group scope need to be expicitly ordered as the accesses from different 3349 CUs are not ordered. 3350* In CU wavefront execution mode the wavefronts of a work-group are executed on 3351 the SIMDs of a single CU of the WGP. Therefore, all global memory access by 3352 the work-group access the same L0 which in turn ensures L1 accesses are 3353 ordered and so do not require explicit management of the caches for 3354 work-group synchronization. 3355 3356See ``WGP_MODE`` field in :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table` 3357and :ref:`amdgpu-target-features`. 3358 3359On dGPU the kernarg backing memory is accessed as UC (uncached) to avoid needing 3360to invalidate the L2 cache. For GFX6-GFX9, this also causes it to be treated as 3361non-volatile and so is not invalidated by ``*_vol``. On APU it is accessed as CC 3362(cache coherent) and so the L2 cache will be coherent with the CPU and other 3363agents. 3364 3365 .. table:: AMDHSA Memory Model Code Sequences GFX6-GFX10 3366 :name: amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx10-table 3367 3368 ============ ============ ============== ========== =============================== ================================== 3369 LLVM Instr LLVM Memory LLVM Memory AMDGPU AMDGPU Machine Code AMDGPU Machine Code 3370 Ordering Sync Scope Address GFX6-9 GFX10 3371 Space 3372 ============ ============ ============== ========== =============================== ================================== 3373 **Non-Atomic** 3374 ---------------------------------------------------------------------------------------------------------------------- 3375 load *none* *none* - global - !volatile & !nontemporal - !volatile & !nontemporal 3376 - generic 3377 - private 1. buffer/global/flat_load 1. buffer/global/flat_load 3378 - constant 3379 - volatile & !nontemporal - volatile & !nontemporal 3380 3381 1. buffer/global/flat_load 1. buffer/global/flat_load 3382 glc=1 glc=1 dlc=1 3383 3384 - nontemporal - nontemporal 3385 3386 1. buffer/global/flat_load 1. buffer/global/flat_load 3387 glc=1 slc=1 slc=1 3388 3389 load *none* *none* - local 1. ds_load 1. ds_load 3390 store *none* *none* - global - !nontemporal - !nontemporal 3391 - generic 3392 - private 1. buffer/global/flat_store 1. buffer/global/flat_store 3393 - constant 3394 - nontemporal - nontemporal 3395 3396 1. buffer/global/flat_stote 1. buffer/global/flat_store 3397 glc=1 slc=1 slc=1 3398 3399 store *none* *none* - local 1. ds_store 1. ds_store 3400 **Unordered Atomic** 3401 ---------------------------------------------------------------------------------------------------------------------- 3402 load atomic unordered *any* *any* *Same as non-atomic*. *Same as non-atomic*. 3403 store atomic unordered *any* *any* *Same as non-atomic*. *Same as non-atomic*. 3404 atomicrmw unordered *any* *any* *Same as monotonic *Same as monotonic 3405 atomic*. atomic*. 3406 **Monotonic Atomic** 3407 ---------------------------------------------------------------------------------------------------------------------- 3408 load atomic monotonic - singlethread - global 1. buffer/global/flat_load 1. buffer/global/flat_load 3409 - wavefront - generic 3410 load atomic monotonic - workgroup - global 1. buffer/global/flat_load 1. buffer/global/flat_load 3411 - generic glc=1 3412 3413 - If CU wavefront execution mode, omit glc=1. 3414 3415 load atomic monotonic - singlethread - local 1. ds_load 1. ds_load 3416 - wavefront 3417 - workgroup 3418 load atomic monotonic - agent - global 1. buffer/global/flat_load 1. buffer/global/flat_load 3419 - system - generic glc=1 glc=1 dlc=1 3420 store atomic monotonic - singlethread - global 1. buffer/global/flat_store 1. buffer/global/flat_store 3421 - wavefront - generic 3422 - workgroup 3423 - agent 3424 - system 3425 store atomic monotonic - singlethread - local 1. ds_store 1. ds_store 3426 - wavefront 3427 - workgroup 3428 atomicrmw monotonic - singlethread - global 1. buffer/global/flat_atomic 1. buffer/global/flat_atomic 3429 - wavefront - generic 3430 - workgroup 3431 - agent 3432 - system 3433 atomicrmw monotonic - singlethread - local 1. ds_atomic 1. ds_atomic 3434 - wavefront 3435 - workgroup 3436 **Acquire Atomic** 3437 ---------------------------------------------------------------------------------------------------------------------- 3438 load atomic acquire - singlethread - global 1. buffer/global/ds/flat_load 1. buffer/global/ds/flat_load 3439 - wavefront - local 3440 - generic 3441 load atomic acquire - workgroup - global 1. buffer/global/flat_load 1. buffer/global_load glc=1 3442 3443 - If CU wavefront execution mode, omit glc=1. 3444 3445 2. s_waitcnt vmcnt(0) 3446 3447 - If CU wavefront execution mode, omit. 3448 - Must happen before 3449 the following buffer_gl0_inv 3450 and before any following 3451 global/generic 3452 load/load 3453 atomic/stote/store 3454 atomic/atomicrmw. 3455 3456 3. buffer_gl0_inv 3457 3458 - If CU wavefront execution mode, omit. 3459 - Ensures that 3460 following 3461 loads will not see 3462 stale data. 3463 3464 load atomic acquire - workgroup - local 1. ds_load 1. ds_load 3465 2. s_waitcnt lgkmcnt(0) 2. s_waitcnt lgkmcnt(0) 3466 3467 - If OpenCL, omit. - If OpenCL, omit. 3468 - Must happen before - Must happen before 3469 any following the following buffer_gl0_inv 3470 global/generic and before any following 3471 load/load global/generic load/load 3472 atomic/store/store atomic/store/store 3473 atomic/atomicrmw. atomic/atomicrmw. 3474 - Ensures any - Ensures any 3475 following global following global 3476 data read is no data read is no 3477 older than the load older than the load 3478 atomic value being atomic value being 3479 acquired. acquired. 3480 3481 3. buffer_gl0_inv 3482 3483 - If CU wavefront execution mode, omit. 3484 - If OpenCL, omit. 3485 - Ensures that 3486 following 3487 loads will not see 3488 stale data. 3489 3490 load atomic acquire - workgroup - generic 1. flat_load 1. flat_load glc=1 3491 3492 - If CU wavefront execution mode, omit glc=1. 3493 3494 2. s_waitcnt lgkmcnt(0) 2. s_waitcnt lgkmcnt(0) & 3495 vmcnt(0) 3496 3497 - If CU wavefront execution mode, omit vmcnt. 3498 - If OpenCL, omit. - If OpenCL, omit 3499 lgkmcnt(0). 3500 - Must happen before - Must happen before 3501 any following the following 3502 global/generic buffer_gl0_inv and any 3503 load/load following global/generic 3504 atomic/store/store load/load 3505 atomic/atomicrmw. atomic/store/store 3506 atomic/atomicrmw. 3507 - Ensures any - Ensures any 3508 following global following global 3509 data read is no data read is no 3510 older than the load older than the load 3511 atomic value being atomic value being 3512 acquired. acquired. 3513 3514 3. buffer_gl0_inv 3515 3516 - If CU wavefront execution mode, omit. 3517 - Ensures that 3518 following 3519 loads will not see 3520 stale data. 3521 3522 load atomic acquire - agent - global 1. buffer/global/flat_load 1. buffer/global_load 3523 - system glc=1 glc=1 dlc=1 3524 2. s_waitcnt vmcnt(0) 2. s_waitcnt vmcnt(0) 3525 3526 - Must happen before - Must happen before 3527 following following 3528 buffer_wbinvl1_vol. buffer_gl*_inv. 3529 - Ensures the load - Ensures the load 3530 has completed has completed 3531 before invalidating before invalidating 3532 the cache. the caches. 3533 3534 3. buffer_wbinvl1_vol 3. buffer_gl0_inv; 3535 buffer_gl1_inv 3536 3537 - Must happen before - Must happen before 3538 any following any following 3539 global/generic global/generic 3540 load/load load/load 3541 atomic/atomicrmw. atomic/atomicrmw. 3542 - Ensures that - Ensures that 3543 following following 3544 loads will not see loads will not see 3545 stale global data. stale global data. 3546 3547 load atomic acquire - agent - generic 1. flat_load glc=1 1. flat_load glc=1 dlc=1 3548 - system 2. s_waitcnt vmcnt(0) & 2. s_waitcnt vmcnt(0) & 3549 lgkmcnt(0) lgkmcnt(0) 3550 3551 - If OpenCL omit - If OpenCL omit 3552 lgkmcnt(0). lgkmcnt(0). 3553 - Must happen before - Must happen before 3554 following following 3555 buffer_wbinvl1_vol. buffer_gl*_invl. 3556 - Ensures the flat_load - Ensures the flat_load 3557 has completed has completed 3558 before invalidating before invalidating 3559 the cache. the caches. 3560 3561 3. buffer_wbinvl1_vol 3. buffer_gl0_inv; 3562 buffer_gl1_inv 3563 3564 - Must happen before - Must happen before 3565 any following any following 3566 global/generic global/generic 3567 load/load load/load 3568 atomic/atomicrmw. atomic/atomicrmw. 3569 - Ensures that - Ensures that 3570 following loads following loads 3571 will not see stale will not see stale 3572 global data. global data. 3573 3574 atomicrmw acquire - singlethread - global 1. buffer/global/ds/flat_atomic 1. buffer/global/ds/flat_atomic 3575 - wavefront - local 3576 - generic 3577 atomicrmw acquire - workgroup - global 1. buffer/global/flat_atomic 1. buffer/global_atomic 3578 2. s_waitcnt vm/vscnt(0) 3579 3580 - If CU wavefront execution mode, omit. 3581 - Use vmcnt if atomic with 3582 return and vscnt if atomic 3583 with no-return. 3584 - Must happen before 3585 the following buffer_gl0_inv 3586 and before any following 3587 global/generic 3588 load/load 3589 atomic/stote/store 3590 atomic/atomicrmw. 3591 3592 3. buffer_gl0_inv 3593 3594 - If CU wavefront execution mode, omit. 3595 - Ensures that 3596 following 3597 loads will not see 3598 stale data. 3599 3600 atomicrmw acquire - workgroup - local 1. ds_atomic 1. ds_atomic 3601 2. waitcnt lgkmcnt(0) 2. waitcnt lgkmcnt(0) 3602 3603 - If OpenCL, omit. - If OpenCL, omit. 3604 - Must happen before - Must happen before 3605 any following the following 3606 global/generic buffer_gl0_inv. 3607 load/load 3608 atomic/store/store 3609 atomic/atomicrmw. 3610 - Ensures any - Ensures any 3611 following global following global 3612 data read is no data read is no 3613 older than the older than the 3614 atomicrmw value atomicrmw value 3615 being acquired. being acquired. 3616 3617 3. buffer_gl0_inv 3618 3619 - If OpenCL omit. 3620 - Ensures that 3621 following 3622 loads will not see 3623 stale data. 3624 3625 atomicrmw acquire - workgroup - generic 1. flat_atomic 1. flat_atomic 3626 2. waitcnt lgkmcnt(0) 2. waitcnt lgkmcnt(0) & 3627 vm/vscnt(0) 3628 3629 - If CU wavefront execution mode, omit vm/vscnt. 3630 - If OpenCL, omit. - If OpenCL, omit 3631 waitcnt lgkmcnt(0).. 3632 - Use vmcnt if atomic with 3633 return and vscnt if atomic 3634 with no-return. 3635 waitcnt lgkmcnt(0). 3636 - Must happen before - Must happen before 3637 any following the following 3638 global/generic buffer_gl0_inv. 3639 load/load 3640 atomic/store/store 3641 atomic/atomicrmw. 3642 - Ensures any - Ensures any 3643 following global following global 3644 data read is no data read is no 3645 older than the older than the 3646 atomicrmw value atomicrmw value 3647 being acquired. being acquired. 3648 3649 3. buffer_gl0_inv 3650 3651 - If CU wavefront execution mode, omit. 3652 - Ensures that 3653 following 3654 loads will not see 3655 stale data. 3656 3657 atomicrmw acquire - agent - global 1. buffer/global/flat_atomic 1. buffer/global_atomic 3658 - system 2. s_waitcnt vmcnt(0) 2. s_waitcnt vm/vscnt(0) 3659 3660 - Use vmcnt if atomic with 3661 return and vscnt if atomic 3662 with no-return. 3663 waitcnt lgkmcnt(0). 3664 - Must happen before - Must happen before 3665 following following 3666 buffer_wbinvl1_vol. buffer_gl*_inv. 3667 - Ensures the - Ensures the 3668 atomicrmw has atomicrmw has 3669 completed before completed before 3670 invalidating the invalidating the 3671 cache. caches. 3672 3673 3. buffer_wbinvl1_vol 3. buffer_gl0_inv; 3674 buffer_gl1_inv 3675 3676 - Must happen before - Must happen before 3677 any following any following 3678 global/generic global/generic 3679 load/load load/load 3680 atomic/atomicrmw. atomic/atomicrmw. 3681 - Ensures that - Ensures that 3682 following loads following loads 3683 will not see stale will not see stale 3684 global data. global data. 3685 3686 atomicrmw acquire - agent - generic 1. flat_atomic 1. flat_atomic 3687 - system 2. s_waitcnt vmcnt(0) & 2. s_waitcnt vm/vscnt(0) & 3688 lgkmcnt(0) lgkmcnt(0) 3689 3690 - If OpenCL, omit - If OpenCL, omit 3691 lgkmcnt(0). lgkmcnt(0). 3692 - Use vmcnt if atomic with 3693 return and vscnt if atomic 3694 with no-return. 3695 - Must happen before - Must happen before 3696 following following 3697 buffer_wbinvl1_vol. buffer_gl*_inv. 3698 - Ensures the - Ensures the 3699 atomicrmw has atomicrmw has 3700 completed before completed before 3701 invalidating the invalidating the 3702 cache. caches. 3703 3704 3. buffer_wbinvl1_vol 3. buffer_gl0_inv; 3705 buffer_gl1_inv 3706 3707 - Must happen before - Must happen before 3708 any following any following 3709 global/generic global/generic 3710 load/load load/load 3711 atomic/atomicrmw. atomic/atomicrmw. 3712 - Ensures that - Ensures that 3713 following loads following loads 3714 will not see stale will not see stale 3715 global data. global data. 3716 3717 fence acquire - singlethread *none* *none* *none* 3718 - wavefront 3719 fence acquire - workgroup *none* 1. s_waitcnt lgkmcnt(0) 1. s_waitcnt lgkmcnt(0) & 3720 vmcnt(0) & vscnt(0) 3721 3722 - If CU wavefront execution mode, omit vmcnt and 3723 vscnt. 3724 - If OpenCL and - If OpenCL and 3725 address space is address space is 3726 not generic, omit. not generic, omit 3727 lgkmcnt(0). 3728 - If OpenCL and 3729 address space is 3730 local, omit 3731 vmcnt(0) and vscnt(0). 3732 - However, since LLVM - However, since LLVM 3733 currently has no currently has no 3734 address space on address space on 3735 the fence need to the fence need to 3736 conservatively conservatively 3737 always generate. If always generate. If 3738 fence had an fence had an 3739 address space then address space then 3740 set to address set to address 3741 space of OpenCL space of OpenCL 3742 fence flag, or to fence flag, or to 3743 generic if both generic if both 3744 local and global local and global 3745 flags are flags are 3746 specified. specified. 3747 - Must happen after 3748 any preceding 3749 local/generic load 3750 atomic/atomicrmw 3751 with an equal or 3752 wider sync scope 3753 and memory ordering 3754 stronger than 3755 unordered (this is 3756 termed the 3757 fence-paired-atomic). 3758 - Must happen before 3759 any following 3760 global/generic 3761 load/load 3762 atomic/store/store 3763 atomic/atomicrmw. 3764 - Ensures any 3765 following global 3766 data read is no 3767 older than the 3768 value read by the 3769 fence-paired-atomic. 3770 - Could be split into 3771 separate s_waitcnt 3772 vmcnt(0), s_waitcnt 3773 vscnt(0) and s_waitcnt 3774 lgkmcnt(0) to allow 3775 them to be 3776 independently moved 3777 according to the 3778 following rules. 3779 - s_waitcnt vmcnt(0) 3780 must happen after 3781 any preceding 3782 global/generic load 3783 atomic/ 3784 atomicrmw-with-return-value 3785 with an equal or 3786 wider sync scope 3787 and memory ordering 3788 stronger than 3789 unordered (this is 3790 termed the 3791 fence-paired-atomic). 3792 - s_waitcnt vscnt(0) 3793 must happen after 3794 any preceding 3795 global/generic 3796 atomicrmw-no-return-value 3797 with an equal or 3798 wider sync scope 3799 and memory ordering 3800 stronger than 3801 unordered (this is 3802 termed the 3803 fence-paired-atomic). 3804 - s_waitcnt lgkmcnt(0) 3805 must happen after 3806 any preceding 3807 local/generic load 3808 atomic/atomicrmw 3809 with an equal or 3810 wider sync scope 3811 and memory ordering 3812 stronger than 3813 unordered (this is 3814 termed the 3815 fence-paired-atomic). 3816 - Must happen before 3817 the following 3818 buffer_gl0_inv. 3819 - Ensures that the 3820 fence-paired atomic 3821 has completed 3822 before invalidating 3823 the 3824 cache. Therefore 3825 any following 3826 locations read must 3827 be no older than 3828 the value read by 3829 the 3830 fence-paired-atomic. 3831 3832 3. buffer_gl0_inv 3833 3834 - If CU wavefront execution mode, omit. 3835 - Ensures that 3836 following 3837 loads will not see 3838 stale data. 3839 3840 fence acquire - agent *none* 1. s_waitcnt lgkmcnt(0) & 1. s_waitcnt lgkmcnt(0) & 3841 - system vmcnt(0) vmcnt(0) & vscnt(0) 3842 3843 - If OpenCL and - If OpenCL and 3844 address space is address space is 3845 not generic, omit not generic, omit 3846 lgkmcnt(0). lgkmcnt(0). 3847 - If OpenCL and 3848 address space is 3849 local, omit 3850 vmcnt(0) and vscnt(0). 3851 - However, since LLVM - However, since LLVM 3852 currently has no currently has no 3853 address space on address space on 3854 the fence need to the fence need to 3855 conservatively conservatively 3856 always generate always generate 3857 (see comment for (see comment for 3858 previous fence). previous fence). 3859 - Could be split into 3860 separate s_waitcnt 3861 vmcnt(0) and 3862 s_waitcnt 3863 lgkmcnt(0) to allow 3864 them to be 3865 independently moved 3866 according to the 3867 following rules. 3868 - s_waitcnt vmcnt(0) 3869 must happen after 3870 any preceding 3871 global/generic load 3872 atomic/atomicrmw 3873 with an equal or 3874 wider sync scope 3875 and memory ordering 3876 stronger than 3877 unordered (this is 3878 termed the 3879 fence-paired-atomic). 3880 - s_waitcnt lgkmcnt(0) 3881 must happen after 3882 any preceding 3883 local/generic load 3884 atomic/atomicrmw 3885 with an equal or 3886 wider sync scope 3887 and memory ordering 3888 stronger than 3889 unordered (this is 3890 termed the 3891 fence-paired-atomic). 3892 - Must happen before 3893 the following 3894 buffer_wbinvl1_vol. 3895 - Ensures that the 3896 fence-paired atomic 3897 has completed 3898 before invalidating 3899 the 3900 cache. Therefore 3901 any following 3902 locations read must 3903 be no older than 3904 the value read by 3905 the 3906 fence-paired-atomic. 3907 - Could be split into 3908 separate s_waitcnt 3909 vmcnt(0), s_waitcnt 3910 vscnt(0) and s_waitcnt 3911 lgkmcnt(0) to allow 3912 them to be 3913 independently moved 3914 according to the 3915 following rules. 3916 - s_waitcnt vmcnt(0) 3917 must happen after 3918 any preceding 3919 global/generic load 3920 atomic/ 3921 atomicrmw-with-return-value 3922 with an equal or 3923 wider sync scope 3924 and memory ordering 3925 stronger than 3926 unordered (this is 3927 termed the 3928 fence-paired-atomic). 3929 - s_waitcnt vscnt(0) 3930 must happen after 3931 any preceding 3932 global/generic 3933 atomicrmw-no-return-value 3934 with an equal or 3935 wider sync scope 3936 and memory ordering 3937 stronger than 3938 unordered (this is 3939 termed the 3940 fence-paired-atomic). 3941 - s_waitcnt lgkmcnt(0) 3942 must happen after 3943 any preceding 3944 local/generic load 3945 atomic/atomicrmw 3946 with an equal or 3947 wider sync scope 3948 and memory ordering 3949 stronger than 3950 unordered (this is 3951 termed the 3952 fence-paired-atomic). 3953 - Must happen before 3954 the following 3955 buffer_gl*_inv. 3956 - Ensures that the 3957 fence-paired atomic 3958 has completed 3959 before invalidating 3960 the 3961 caches. Therefore 3962 any following 3963 locations read must 3964 be no older than 3965 the value read by 3966 the 3967 fence-paired-atomic. 3968 3969 2. buffer_wbinvl1_vol 2. buffer_gl0_inv; 3970 buffer_gl1_inv 3971 3972 - Must happen before any - Must happen before any 3973 following global/generic following global/generic 3974 load/load load/load 3975 atomic/store/store atomic/store/store 3976 atomic/atomicrmw. atomic/atomicrmw. 3977 - Ensures that - Ensures that 3978 following loads following loads 3979 will not see stale will not see stale 3980 global data. global data. 3981 3982 **Release Atomic** 3983 ---------------------------------------------------------------------------------------------------------------------- 3984 store atomic release - singlethread - global 1. buffer/global/ds/flat_store 1. buffer/global/ds/flat_store 3985 - wavefront - local 3986 - generic 3987 store atomic release - workgroup - global 1. s_waitcnt lgkmcnt(0) 1. s_waitcnt lgkmcnt(0) & 3988 vmcnt(0) & vscnt(0) 3989 3990 - If CU wavefront execution mode, omit vmcnt and 3991 vscnt. 3992 - If OpenCL, omit. - If OpenCL, omit 3993 lgkmcnt(0). 3994 - Must happen after 3995 any preceding 3996 local/generic 3997 load/store/load 3998 atomic/store 3999 atomic/atomicrmw. 4000 - Could be split into 4001 separate s_waitcnt 4002 vmcnt(0), s_waitcnt 4003 vscnt(0) and s_waitcnt 4004 lgkmcnt(0) to allow 4005 them to be 4006 independently moved 4007 according to the 4008 following rules. 4009 - s_waitcnt vmcnt(0) 4010 must happen after 4011 any preceding 4012 global/generic load/load 4013 atomic/ 4014 atomicrmw-with-return-value. 4015 - s_waitcnt vscnt(0) 4016 must happen after 4017 any preceding 4018 global/generic 4019 store/store 4020 atomic/ 4021 atomicrmw-no-return-value. 4022 - s_waitcnt lgkmcnt(0) 4023 must happen after 4024 any preceding 4025 local/generic 4026 load/store/load 4027 atomic/store 4028 atomic/atomicrmw. 4029 - Must happen before - Must happen before 4030 the following the following 4031 store. store. 4032 - Ensures that all - Ensures that all 4033 memory operations memory operations 4034 to local have have 4035 completed before completed before 4036 performing the performing the 4037 store that is being store that is being 4038 released. released. 4039 4040 2. buffer/global/flat_store 2. buffer/global_store 4041 store atomic release - workgroup - local 1. waitcnt vmcnt(0) & vscnt(0) 4042 4043 - If CU wavefront execution mode, omit. 4044 - If OpenCL, omit. 4045 - Could be split into 4046 separate s_waitcnt 4047 vmcnt(0) and s_waitcnt 4048 vscnt(0) to allow 4049 them to be 4050 independently moved 4051 according to the 4052 following rules. 4053 - s_waitcnt vmcnt(0) 4054 must happen after 4055 any preceding 4056 global/generic load/load 4057 atomic/ 4058 atomicrmw-with-return-value. 4059 - s_waitcnt vscnt(0) 4060 must happen after 4061 any preceding 4062 global/generic 4063 store/store atomic/ 4064 atomicrmw-no-return-value. 4065 - Must happen before 4066 the following 4067 store. 4068 - Ensures that all 4069 global memory 4070 operations have 4071 completed before 4072 performing the 4073 store that is being 4074 released. 4075 4076 1. ds_store 2. ds_store 4077 store atomic release - workgroup - generic 1. s_waitcnt lgkmcnt(0) 1. s_waitcnt lgkmcnt(0) & 4078 vmcnt(0) & vscnt(0) 4079 4080 - If CU wavefront execution mode, omit vmcnt and 4081 vscnt. 4082 - If OpenCL, omit. - If OpenCL, omit 4083 lgkmcnt(0). 4084 - Must happen after 4085 any preceding 4086 local/generic 4087 load/store/load 4088 atomic/store 4089 atomic/atomicrmw. 4090 - Could be split into 4091 separate s_waitcnt 4092 vmcnt(0), s_waitcnt 4093 vscnt(0) and s_waitcnt 4094 lgkmcnt(0) to allow 4095 them to be 4096 independently moved 4097 according to the 4098 following rules. 4099 - s_waitcnt vmcnt(0) 4100 must happen after 4101 any preceding 4102 global/generic load/load 4103 atomic/ 4104 atomicrmw-with-return-value. 4105 - s_waitcnt vscnt(0) 4106 must happen after 4107 any preceding 4108 global/generic 4109 store/store 4110 atomic/ 4111 atomicrmw-no-return-value. 4112 - s_waitcnt lgkmcnt(0) 4113 must happen after 4114 any preceding 4115 local/generic load/store/load 4116 atomic/store atomic/atomicrmw. 4117 - Must happen before - Must happen before 4118 the following the following 4119 store. store. 4120 - Ensures that all - Ensures that all 4121 memory operations memory operations 4122 to local have have 4123 completed before completed before 4124 performing the performing the 4125 store that is being store that is being 4126 released. released. 4127 4128 2. flat_store 2. flat_store 4129 store atomic release - agent - global 1. s_waitcnt lgkmcnt(0) & 1. s_waitcnt lgkmcnt(0) & 4130 - system - generic vmcnt(0) vmcnt(0) & vscnt(0) 4131 4132 - If OpenCL, omit - If OpenCL, omit 4133 lgkmcnt(0). lgkmcnt(0). 4134 - Could be split into - Could be split into 4135 separate s_waitcnt separate s_waitcnt 4136 vmcnt(0) and vmcnt(0), s_waitcnt vscnt(0) 4137 s_waitcnt and s_waitcnt 4138 lgkmcnt(0) to allow lgkmcnt(0) to allow 4139 them to be them to be 4140 independently moved independently moved 4141 according to the according to the 4142 following rules. following rules. 4143 - s_waitcnt vmcnt(0) - s_waitcnt vmcnt(0) 4144 must happen after must happen after 4145 any preceding any preceding 4146 global/generic global/generic 4147 load/store/load load/load 4148 atomic/store atomic/ 4149 atomic/atomicrmw. atomicrmw-with-return-value. 4150 - s_waitcnt vscnt(0) 4151 must happen after 4152 any preceding 4153 global/generic 4154 store/store atomic/ 4155 atomicrmw-no-return-value. 4156 - s_waitcnt lgkmcnt(0) - s_waitcnt lgkmcnt(0) 4157 must happen after must happen after 4158 any preceding any preceding 4159 local/generic local/generic 4160 load/store/load load/store/load 4161 atomic/store atomic/store 4162 atomic/atomicrmw. atomic/atomicrmw. 4163 - Must happen before - Must happen before 4164 the following the following 4165 store. store. 4166 - Ensures that all - Ensures that all 4167 memory operations memory operations 4168 to memory have to memory have 4169 completed before completed before 4170 performing the performing the 4171 store that is being store that is being 4172 released. released. 4173 4174 2. buffer/global/ds/flat_store 2. buffer/global/ds/flat_store 4175 atomicrmw release - singlethread - global 1. buffer/global/ds/flat_atomic 1. buffer/global/ds/flat_atomic 4176 - wavefront - local 4177 - generic 4178 atomicrmw release - workgroup - global 1. s_waitcnt lgkmcnt(0) 1. s_waitcnt lgkmcnt(0) & 4179 vmcnt(0) & vscnt(0) 4180 4181 - If CU wavefront execution mode, omit vmcnt and 4182 vscnt. 4183 - If OpenCL, omit. 4184 4185 - Must happen after 4186 any preceding 4187 local/generic 4188 load/store/load 4189 atomic/store 4190 atomic/atomicrmw. 4191 - Could be split into 4192 separate s_waitcnt 4193 vmcnt(0), s_waitcnt 4194 vscnt(0) and s_waitcnt 4195 lgkmcnt(0) to allow 4196 them to be 4197 independently moved 4198 according to the 4199 following rules. 4200 - s_waitcnt vmcnt(0) 4201 must happen after 4202 any preceding 4203 global/generic load/load 4204 atomic/ 4205 atomicrmw-with-return-value. 4206 - s_waitcnt vscnt(0) 4207 must happen after 4208 any preceding 4209 global/generic 4210 store/store 4211 atomic/ 4212 atomicrmw-no-return-value. 4213 - s_waitcnt lgkmcnt(0) 4214 must happen after 4215 any preceding 4216 local/generic 4217 load/store/load 4218 atomic/store 4219 atomic/atomicrmw. 4220 - Must happen before - Must happen before 4221 the following the following 4222 atomicrmw. atomicrmw. 4223 - Ensures that all - Ensures that all 4224 memory operations memory operations 4225 to local have have 4226 completed before completed before 4227 performing the performing the 4228 atomicrmw that is atomicrmw that is 4229 being released. being released. 4230 4231 2. buffer/global/flat_atomic 2. buffer/global_atomic 4232 atomicrmw release - workgroup - local 1. waitcnt vmcnt(0) & vscnt(0) 4233 4234 - If CU wavefront execution mode, omit. 4235 - If OpenCL, omit. 4236 - Could be split into 4237 separate s_waitcnt 4238 vmcnt(0) and s_waitcnt 4239 vscnt(0) to allow 4240 them to be 4241 independently moved 4242 according to the 4243 following rules. 4244 - s_waitcnt vmcnt(0) 4245 must happen after 4246 any preceding 4247 global/generic load/load 4248 atomic/ 4249 atomicrmw-with-return-value. 4250 - s_waitcnt vscnt(0) 4251 must happen after 4252 any preceding 4253 global/generic 4254 store/store atomic/ 4255 atomicrmw-no-return-value. 4256 - Must happen before 4257 the following 4258 store. 4259 - Ensures that all 4260 global memory 4261 operations have 4262 completed before 4263 performing the 4264 store that is being 4265 released. 4266 4267 1. ds_atomic 2. ds_atomic 4268 atomicrmw release - workgroup - generic 1. s_waitcnt lgkmcnt(0) 1. s_waitcnt lgkmcnt(0) & 4269 vmcnt(0) & vscnt(0) 4270 4271 - If CU wavefront execution mode, omit vmcnt and 4272 vscnt. 4273 - If OpenCL, omit. - If OpenCL, omit 4274 waitcnt lgkmcnt(0). 4275 - Must happen after 4276 any preceding 4277 local/generic 4278 load/store/load 4279 atomic/store 4280 atomic/atomicrmw. 4281 - Could be split into 4282 separate s_waitcnt 4283 vmcnt(0), s_waitcnt 4284 vscnt(0) and s_waitcnt 4285 lgkmcnt(0) to allow 4286 them to be 4287 independently moved 4288 according to the 4289 following rules. 4290 - s_waitcnt vmcnt(0) 4291 must happen after 4292 any preceding 4293 global/generic load/load 4294 atomic/ 4295 atomicrmw-with-return-value. 4296 - s_waitcnt vscnt(0) 4297 must happen after 4298 any preceding 4299 global/generic 4300 store/store 4301 atomic/ 4302 atomicrmw-no-return-value. 4303 - s_waitcnt lgkmcnt(0) 4304 must happen after 4305 any preceding 4306 local/generic load/store/load 4307 atomic/store atomic/atomicrmw. 4308 - Must happen before - Must happen before 4309 the following the following 4310 atomicrmw. atomicrmw. 4311 - Ensures that all - Ensures that all 4312 memory operations memory operations 4313 to local have have 4314 completed before completed before 4315 performing the performing the 4316 atomicrmw that is atomicrmw that is 4317 being released. being released. 4318 4319 2. flat_atomic 2. flat_atomic 4320 atomicrmw release - agent - global 1. s_waitcnt lgkmcnt(0) & 1. s_waitcnt lkkmcnt(0) & 4321 - system - generic vmcnt(0) vmcnt(0) & vscnt(0) 4322 4323 - If OpenCL, omit - If OpenCL, omit 4324 lgkmcnt(0). lgkmcnt(0). 4325 - Could be split into - Could be split into 4326 separate s_waitcnt separate s_waitcnt 4327 vmcnt(0) and vmcnt(0), s_waitcnt 4328 s_waitcnt vscnt(0) and s_waitcnt 4329 lgkmcnt(0) to allow lgkmcnt(0) to allow 4330 them to be them to be 4331 independently moved independently moved 4332 according to the according to the 4333 following rules. following rules. 4334 - s_waitcnt vmcnt(0) - s_waitcnt vmcnt(0) 4335 must happen after must happen after 4336 any preceding any preceding 4337 global/generic global/generic 4338 load/store/load load/load atomic/ 4339 atomic/store atomicrmw-with-return-value. 4340 atomic/atomicrmw. 4341 - s_waitcnt vscnt(0) 4342 must happen after 4343 any preceding 4344 global/generic 4345 store/store atomic/ 4346 atomicrmw-no-return-value. 4347 - s_waitcnt lgkmcnt(0) - s_waitcnt lgkmcnt(0) 4348 must happen after must happen after 4349 any preceding any preceding 4350 local/generic local/generic 4351 load/store/load load/store/load 4352 atomic/store atomic/store 4353 atomic/atomicrmw. atomic/atomicrmw. 4354 - Must happen before - Must happen before 4355 the following the following 4356 atomicrmw. atomicrmw. 4357 - Ensures that all - Ensures that all 4358 memory operations memory operations 4359 to global and local to global and local 4360 have completed have completed 4361 before performing before performing 4362 the atomicrmw that the atomicrmw that 4363 is being released. is being released. 4364 4365 2. buffer/global/ds/flat_atomic 2. buffer/global/ds/flat_atomic 4366 fence release - singlethread *none* *none* *none* 4367 - wavefront 4368 fence release - workgroup *none* 1. s_waitcnt lgkmcnt(0) 1. s_waitcnt lgkmcnt(0) & 4369 vmcnt(0) & vscnt(0) 4370 4371 - If CU wavefront execution mode, omit vmcnt and 4372 vscnt. 4373 - If OpenCL and - If OpenCL and 4374 address space is address space is 4375 not generic, omit. not generic, omit 4376 lgkmcnt(0). 4377 - If OpenCL and 4378 address space is 4379 local, omit 4380 vmcnt(0) and vscnt(0). 4381 - However, since LLVM - However, since LLVM 4382 currently has no currently has no 4383 address space on address space on 4384 the fence need to the fence need to 4385 conservatively conservatively 4386 always generate. If always generate. If 4387 fence had an fence had an 4388 address space then address space then 4389 set to address set to address 4390 space of OpenCL space of OpenCL 4391 fence flag, or to fence flag, or to 4392 generic if both generic if both 4393 local and global local and global 4394 flags are flags are 4395 specified. specified. 4396 - Must happen after 4397 any preceding 4398 local/generic 4399 load/load 4400 atomic/store/store 4401 atomic/atomicrmw. 4402 - Could be split into 4403 separate s_waitcnt 4404 vmcnt(0), s_waitcnt 4405 vscnt(0) and s_waitcnt 4406 lgkmcnt(0) to allow 4407 them to be 4408 independently moved 4409 according to the 4410 following rules. 4411 - s_waitcnt vmcnt(0) 4412 must happen after 4413 any preceding 4414 global/generic 4415 load/load 4416 atomic/ 4417 atomicrmw-with-return-value. 4418 - s_waitcnt vscnt(0) 4419 must happen after 4420 any preceding 4421 global/generic 4422 store/store atomic/ 4423 atomicrmw-no-return-value. 4424 - s_waitcnt lgkmcnt(0) 4425 must happen after 4426 any preceding 4427 local/generic 4428 load/store/load 4429 atomic/store atomic/ 4430 atomicrmw. 4431 - Must happen before - Must happen before 4432 any following store any following store 4433 atomic/atomicrmw atomic/atomicrmw 4434 with an equal or with an equal or 4435 wider sync scope wider sync scope 4436 and memory ordering and memory ordering 4437 stronger than stronger than 4438 unordered (this is unordered (this is 4439 termed the termed the 4440 fence-paired-atomic). fence-paired-atomic). 4441 - Ensures that all - Ensures that all 4442 memory operations memory operations 4443 to local have have 4444 completed before completed before 4445 performing the performing the 4446 following following 4447 fence-paired-atomic. fence-paired-atomic. 4448 4449 fence release - agent *none* 1. s_waitcnt lgkmcnt(0) & 1. s_waitcnt lgkmcnt(0) & 4450 - system vmcnt(0) vmcnt(0) & vscnt(0) 4451 4452 - If OpenCL and - If OpenCL and 4453 address space is address space is 4454 not generic, omit not generic, omit 4455 lgkmcnt(0). lgkmcnt(0). 4456 - If OpenCL and - If OpenCL and 4457 address space is address space is 4458 local, omit local, omit 4459 vmcnt(0). vmcnt(0) and vscnt(0). 4460 - However, since LLVM - However, since LLVM 4461 currently has no currently has no 4462 address space on address space on 4463 the fence need to the fence need to 4464 conservatively conservatively 4465 always generate. If always generate. If 4466 fence had an fence had an 4467 address space then address space then 4468 set to address set to address 4469 space of OpenCL space of OpenCL 4470 fence flag, or to fence flag, or to 4471 generic if both generic if both 4472 local and global local and global 4473 flags are flags are 4474 specified. specified. 4475 - Could be split into - Could be split into 4476 separate s_waitcnt separate s_waitcnt 4477 vmcnt(0) and vmcnt(0), s_waitcnt 4478 s_waitcnt vscnt(0) and s_waitcnt 4479 lgkmcnt(0) to allow lgkmcnt(0) to allow 4480 them to be them to be 4481 independently moved independently moved 4482 according to the according to the 4483 following rules. following rules. 4484 - s_waitcnt vmcnt(0) - s_waitcnt vmcnt(0) 4485 must happen after must happen after 4486 any preceding any preceding 4487 global/generic global/generic 4488 load/store/load load/load atomic/ 4489 atomic/store atomicrmw-with-return-value. 4490 atomic/atomicrmw. 4491 - s_waitcnt vscnt(0) 4492 must happen after 4493 any preceding 4494 global/generic 4495 store/store atomic/ 4496 atomicrmw-no-return-value. 4497 - s_waitcnt lgkmcnt(0) - s_waitcnt lgkmcnt(0) 4498 must happen after must happen after 4499 any preceding any preceding 4500 local/generic local/generic 4501 load/store/load load/store/load 4502 atomic/store atomic/store 4503 atomic/atomicrmw. atomic/atomicrmw. 4504 - Must happen before - Must happen before 4505 any following store any following store 4506 atomic/atomicrmw atomic/atomicrmw 4507 with an equal or with an equal or 4508 wider sync scope wider sync scope 4509 and memory ordering and memory ordering 4510 stronger than stronger than 4511 unordered (this is unordered (this is 4512 termed the termed the 4513 fence-paired-atomic). fence-paired-atomic). 4514 - Ensures that all - Ensures that all 4515 memory operations memory operations 4516 have have 4517 completed before completed before 4518 performing the performing the 4519 following following 4520 fence-paired-atomic. fence-paired-atomic. 4521 4522 **Acquire-Release Atomic** 4523 ---------------------------------------------------------------------------------------------------------------------- 4524 atomicrmw acq_rel - singlethread - global 1. buffer/global/ds/flat_atomic 1. buffer/global/ds/flat_atomic 4525 - wavefront - local 4526 - generic 4527 atomicrmw acq_rel - workgroup - global 1. s_waitcnt lgkmcnt(0) 1. s_waitcnt lgkmcnt(0) & 4528 vmcnt(0) & vscnt(0) 4529 4530 - If CU wavefront execution mode, omit vmcnt and 4531 vscnt. 4532 - If OpenCL, omit. - If OpenCL, omit 4533 s_waitcnt lgkmcnt(0). 4534 - Must happen after - Must happen after 4535 any preceding any preceding 4536 local/generic local/generic 4537 load/store/load load/store/load 4538 atomic/store atomic/store 4539 atomic/atomicrmw. atomic/atomicrmw. 4540 - Could be split into 4541 separate s_waitcnt 4542 vmcnt(0), s_waitcnt 4543 vscnt(0) and s_waitcnt 4544 lgkmcnt(0) to allow 4545 them to be 4546 independently moved 4547 according to the 4548 following rules. 4549 - s_waitcnt vmcnt(0) 4550 must happen after 4551 any preceding 4552 global/generic load/load 4553 atomic/ 4554 atomicrmw-with-return-value. 4555 - s_waitcnt vscnt(0) 4556 must happen after 4557 any preceding 4558 global/generic 4559 store/store 4560 atomic/ 4561 atomicrmw-no-return-value. 4562 - s_waitcnt lgkmcnt(0) 4563 must happen after 4564 any preceding 4565 local/generic load/store/load 4566 atomic/store atomic/atomicrmw. 4567 - Must happen before - Must happen before 4568 the following the following 4569 atomicrmw. atomicrmw. 4570 - Ensures that all - Ensures that all 4571 memory operations memory operations 4572 to local have have 4573 completed before completed before 4574 performing the performing the 4575 atomicrmw that is atomicrmw that is 4576 being released. being released. 4577 4578 2. buffer/global/flat_atomic 2. buffer/global_atomic 4579 3. s_waitcnt vm/vscnt(0) 4580 4581 - If CU wavefront execution mode, omit vm/vscnt. 4582 - Use vmcnt if atomic with 4583 return and vscnt if atomic 4584 with no-return. 4585 waitcnt lgkmcnt(0). 4586 - Must happen before 4587 the following 4588 buffer_gl0_inv. 4589 - Ensures any 4590 following global 4591 data read is no 4592 older than the 4593 atomicrmw value 4594 being acquired. 4595 4596 4. buffer_gl0_inv 4597 4598 - If CU wavefront execution mode, omit. 4599 - Ensures that 4600 following 4601 loads will not see 4602 stale data. 4603 4604 atomicrmw acq_rel - workgroup - local 1. waitcnt vmcnt(0) & vscnt(0) 4605 4606 - If CU wavefront execution mode, omit. 4607 - If OpenCL, omit. 4608 - Could be split into 4609 separate s_waitcnt 4610 vmcnt(0) and s_waitcnt 4611 vscnt(0) to allow 4612 them to be 4613 independently moved 4614 according to the 4615 following rules. 4616 - s_waitcnt vmcnt(0) 4617 must happen after 4618 any preceding 4619 global/generic load/load 4620 atomic/ 4621 atomicrmw-with-return-value. 4622 - s_waitcnt vscnt(0) 4623 must happen after 4624 any preceding 4625 global/generic 4626 store/store atomic/ 4627 atomicrmw-no-return-value. 4628 - Must happen before 4629 the following 4630 store. 4631 - Ensures that all 4632 global memory 4633 operations have 4634 completed before 4635 performing the 4636 store that is being 4637 released. 4638 4639 1. ds_atomic 2. ds_atomic 4640 2. s_waitcnt lgkmcnt(0) 3. s_waitcnt lgkmcnt(0) 4641 4642 - If OpenCL, omit. - If OpenCL, omit. 4643 - Must happen before - Must happen before 4644 any following the following 4645 global/generic buffer_gl0_inv. 4646 load/load 4647 atomic/store/store 4648 atomic/atomicrmw. 4649 - Ensures any - Ensures any 4650 following global following global 4651 data read is no data read is no 4652 older than the load older than the load 4653 atomic value being atomic value being 4654 acquired. acquired. 4655 4656 4. buffer_gl0_inv 4657 4658 - If CU wavefront execution mode, omit. 4659 - If OpenCL omit. 4660 - Ensures that 4661 following 4662 loads will not see 4663 stale data. 4664 4665 atomicrmw acq_rel - workgroup - generic 1. s_waitcnt lgkmcnt(0) 1. s_waitcnt lgkmcnt(0) & 4666 vmcnt(0) & vscnt(0) 4667 4668 - If CU wavefront execution mode, omit vmcnt and 4669 vscnt. 4670 - If OpenCL, omit. - If OpenCL, omit 4671 waitcnt lgkmcnt(0). 4672 - Must happen after 4673 any preceding 4674 local/generic 4675 load/store/load 4676 atomic/store 4677 atomic/atomicrmw. 4678 - Could be split into 4679 separate s_waitcnt 4680 vmcnt(0), s_waitcnt 4681 vscnt(0) and s_waitcnt 4682 lgkmcnt(0) to allow 4683 them to be 4684 independently moved 4685 according to the 4686 following rules. 4687 - s_waitcnt vmcnt(0) 4688 must happen after 4689 any preceding 4690 global/generic load/load 4691 atomic/ 4692 atomicrmw-with-return-value. 4693 - s_waitcnt vscnt(0) 4694 must happen after 4695 any preceding 4696 global/generic 4697 store/store 4698 atomic/ 4699 atomicrmw-no-return-value. 4700 - s_waitcnt lgkmcnt(0) 4701 must happen after 4702 any preceding 4703 local/generic load/store/load 4704 atomic/store atomic/atomicrmw. 4705 - Must happen before - Must happen before 4706 the following the following 4707 atomicrmw. atomicrmw. 4708 - Ensures that all - Ensures that all 4709 memory operations memory operations 4710 to local have have 4711 completed before completed before 4712 performing the performing the 4713 atomicrmw that is atomicrmw that is 4714 being released. being released. 4715 4716 2. flat_atomic 2. flat_atomic 4717 3. s_waitcnt lgkmcnt(0) 3. s_waitcnt lgkmcnt(0) & 4718 vm/vscnt(0) 4719 4720 - If CU wavefront execution mode, omit vm/vscnt. 4721 - If OpenCL, omit. - If OpenCL, omit 4722 waitcnt lgkmcnt(0). 4723 - Must happen before - Must happen before 4724 any following the following 4725 global/generic buffer_gl0_inv. 4726 load/load 4727 atomic/store/store 4728 atomic/atomicrmw. 4729 - Ensures any - Ensures any 4730 following global following global 4731 data read is no data read is no 4732 older than the load older than the load 4733 atomic value being atomic value being 4734 acquired. acquired. 4735 4736 3. buffer_gl0_inv 4737 4738 - If CU wavefront execution mode, omit. 4739 - Ensures that 4740 following 4741 loads will not see 4742 stale data. 4743 4744 atomicrmw acq_rel - agent - global 1. s_waitcnt lgkmcnt(0) & 1. s_waitcnt lgkmcnt(0) & 4745 - system vmcnt(0) vmcnt(0) & vscnt(0) 4746 4747 - If OpenCL, omit - If OpenCL, omit 4748 lgkmcnt(0). lgkmcnt(0). 4749 - Could be split into - Could be split into 4750 separate s_waitcnt separate s_waitcnt 4751 vmcnt(0) and vmcnt(0), s_waitcnt 4752 s_waitcnt vscnt(0) and s_waitcnt 4753 lgkmcnt(0) to allow lgkmcnt(0) to allow 4754 them to be them to be 4755 independently moved independently moved 4756 according to the according to the 4757 following rules. following rules. 4758 - s_waitcnt vmcnt(0) - s_waitcnt vmcnt(0) 4759 must happen after must happen after 4760 any preceding any preceding 4761 global/generic global/generic 4762 load/store/load load/load atomic/ 4763 atomic/store atomicrmw-with-return-value. 4764 atomic/atomicrmw. 4765 - s_waitcnt vscnt(0) 4766 must happen after 4767 any preceding 4768 global/generic 4769 store/store atomic/ 4770 atomicrmw-no-return-value. 4771 - s_waitcnt lgkmcnt(0) - s_waitcnt lgkmcnt(0) 4772 must happen after must happen after 4773 any preceding any preceding 4774 local/generic local/generic 4775 load/store/load load/store/load 4776 atomic/store atomic/store 4777 atomic/atomicrmw. atomic/atomicrmw. 4778 - Must happen before - Must happen before 4779 the following the following 4780 atomicrmw. atomicrmw. 4781 - Ensures that all - Ensures that all 4782 memory operations memory operations 4783 to global have to global have 4784 completed before completed before 4785 performing the performing the 4786 atomicrmw that is atomicrmw that is 4787 being released. being released. 4788 4789 2. buffer/global/flat_atomic 2. buffer/global_atomic 4790 3. s_waitcnt vmcnt(0) 3. s_waitcnt vm/vscnt(0) 4791 4792 - Use vmcnt if atomic with 4793 return and vscnt if atomic 4794 with no-return. 4795 waitcnt lgkmcnt(0). 4796 - Must happen before - Must happen before 4797 following following 4798 buffer_wbinvl1_vol. buffer_gl*_inv. 4799 - Ensures the - Ensures the 4800 atomicrmw has atomicrmw has 4801 completed before completed before 4802 invalidating the invalidating the 4803 cache. caches. 4804 4805 4. buffer_wbinvl1_vol 4. buffer_gl0_inv; 4806 buffer_gl1_inv 4807 4808 - Must happen before - Must happen before 4809 any following any following 4810 global/generic global/generic 4811 load/load load/load 4812 atomic/atomicrmw. atomic/atomicrmw. 4813 - Ensures that - Ensures that 4814 following loads following loads 4815 will not see stale will not see stale 4816 global data. global data. 4817 4818 atomicrmw acq_rel - agent - generic 1. s_waitcnt lgkmcnt(0) & 1. s_waitcnt lgkmcnt(0) & 4819 - system vmcnt(0) vmcnt(0) & vscnt(0) 4820 4821 - If OpenCL, omit - If OpenCL, omit 4822 lgkmcnt(0). lgkmcnt(0). 4823 - Could be split into - Could be split into 4824 separate s_waitcnt separate s_waitcnt 4825 vmcnt(0) and vmcnt(0), s_waitcnt 4826 s_waitcnt vscnt(0) and s_waitcnt 4827 lgkmcnt(0) to allow lgkmcnt(0) to allow 4828 them to be them to be 4829 independently moved independently moved 4830 according to the according to the 4831 following rules. following rules. 4832 - s_waitcnt vmcnt(0) - s_waitcnt vmcnt(0) 4833 must happen after must happen after 4834 any preceding any preceding 4835 global/generic global/generic 4836 load/store/load load/load atomic 4837 atomic/store atomicrmw-with-return-value. 4838 atomic/atomicrmw. 4839 - s_waitcnt vscnt(0) 4840 must happen after 4841 any preceding 4842 global/generic 4843 store/store atomic/ 4844 atomicrmw-no-return-value. 4845 - s_waitcnt lgkmcnt(0) - s_waitcnt lgkmcnt(0) 4846 must happen after must happen after 4847 any preceding any preceding 4848 local/generic local/generic 4849 load/store/load load/store/load 4850 atomic/store atomic/store 4851 atomic/atomicrmw. atomic/atomicrmw. 4852 - Must happen before - Must happen before 4853 the following the following 4854 atomicrmw. atomicrmw. 4855 - Ensures that all - Ensures that all 4856 memory operations memory operations 4857 to global have have 4858 completed before completed before 4859 performing the performing the 4860 atomicrmw that is atomicrmw that is 4861 being released. being released. 4862 4863 2. flat_atomic 2. flat_atomic 4864 3. s_waitcnt vmcnt(0) & 3. s_waitcnt vm/vscnt(0) & 4865 lgkmcnt(0) lgkmcnt(0) 4866 4867 - If OpenCL, omit - If OpenCL, omit 4868 lgkmcnt(0). lgkmcnt(0). 4869 - Use vmcnt if atomic with 4870 return and vscnt if atomic 4871 with no-return. 4872 - Must happen before - Must happen before 4873 following following 4874 buffer_wbinvl1_vol. buffer_gl*_inv. 4875 - Ensures the - Ensures the 4876 atomicrmw has atomicrmw has 4877 completed before completed before 4878 invalidating the invalidating the 4879 cache. caches. 4880 4881 4. buffer_wbinvl1_vol 4. buffer_gl0_inv; 4882 buffer_gl1_inv 4883 4884 - Must happen before - Must happen before 4885 any following any following 4886 global/generic global/generic 4887 load/load load/load 4888 atomic/atomicrmw. atomic/atomicrmw. 4889 - Ensures that - Ensures that 4890 following loads following loads 4891 will not see stale will not see stale 4892 global data. global data. 4893 4894 fence acq_rel - singlethread *none* *none* *none* 4895 - wavefront 4896 fence acq_rel - workgroup *none* 1. s_waitcnt lgkmcnt(0) 1. s_waitcnt lgkmcnt(0) & 4897 vmcnt(0) & vscnt(0) 4898 4899 - If CU wavefront execution mode, omit vmcnt and 4900 vscnt. 4901 - If OpenCL and - If OpenCL and 4902 address space is address space is 4903 not generic, omit. not generic, omit 4904 lgkmcnt(0). 4905 - If OpenCL and 4906 address space is 4907 local, omit 4908 vmcnt(0) and vscnt(0). 4909 - However, - However, 4910 since LLVM since LLVM 4911 currently has no currently has no 4912 address space on address space on 4913 the fence need to the fence need to 4914 conservatively conservatively 4915 always generate always generate 4916 (see comment for (see comment for 4917 previous fence). previous fence). 4918 - Must happen after 4919 any preceding 4920 local/generic 4921 load/load 4922 atomic/store/store 4923 atomic/atomicrmw. 4924 - Could be split into 4925 separate s_waitcnt 4926 vmcnt(0), s_waitcnt 4927 vscnt(0) and s_waitcnt 4928 lgkmcnt(0) to allow 4929 them to be 4930 independently moved 4931 according to the 4932 following rules. 4933 - s_waitcnt vmcnt(0) 4934 must happen after 4935 any preceding 4936 global/generic 4937 load/load 4938 atomic/ 4939 atomicrmw-with-return-value. 4940 - s_waitcnt vscnt(0) 4941 must happen after 4942 any preceding 4943 global/generic 4944 store/store atomic/ 4945 atomicrmw-no-return-value. 4946 - s_waitcnt lgkmcnt(0) 4947 must happen after 4948 any preceding 4949 local/generic 4950 load/store/load 4951 atomic/store atomic/ 4952 atomicrmw. 4953 - Must happen before - Must happen before 4954 any following any following 4955 global/generic global/generic 4956 load/load load/load 4957 atomic/store/store atomic/store/store 4958 atomic/atomicrmw. atomic/atomicrmw. 4959 - Ensures that all - Ensures that all 4960 memory operations memory operations 4961 to local have have 4962 completed before completed before 4963 performing any performing any 4964 following global following global 4965 memory operations. memory operations. 4966 - Ensures that the - Ensures that the 4967 preceding preceding 4968 local/generic load local/generic load 4969 atomic/atomicrmw atomic/atomicrmw 4970 with an equal or with an equal or 4971 wider sync scope wider sync scope 4972 and memory ordering and memory ordering 4973 stronger than stronger than 4974 unordered (this is unordered (this is 4975 termed the termed the 4976 acquire-fence-paired-atomic acquire-fence-paired-atomic 4977 ) has completed ) has completed 4978 before following before following 4979 global memory global memory 4980 operations. This operations. This 4981 satisfies the satisfies the 4982 requirements of requirements of 4983 acquire. acquire. 4984 - Ensures that all - Ensures that all 4985 previous memory previous memory 4986 operations have operations have 4987 completed before a completed before a 4988 following following 4989 local/generic store local/generic store 4990 atomic/atomicrmw atomic/atomicrmw 4991 with an equal or with an equal or 4992 wider sync scope wider sync scope 4993 and memory ordering and memory ordering 4994 stronger than stronger than 4995 unordered (this is unordered (this is 4996 termed the termed the 4997 release-fence-paired-atomic release-fence-paired-atomic 4998 ). This satisfies the ). This satisfies the 4999 requirements of requirements of 5000 release. release. 5001 - Must happen before 5002 the following 5003 buffer_gl0_inv. 5004 - Ensures that the 5005 acquire-fence-paired 5006 atomic has completed 5007 before invalidating 5008 the 5009 cache. Therefore 5010 any following 5011 locations read must 5012 be no older than 5013 the value read by 5014 the 5015 acquire-fence-paired-atomic. 5016 5017 3. buffer_gl0_inv 5018 5019 - If CU wavefront execution mode, omit. 5020 - Ensures that 5021 following 5022 loads will not see 5023 stale data. 5024 5025 fence acq_rel - agent *none* 1. s_waitcnt lgkmcnt(0) & 1. s_waitcnt lgkmcnt(0) & 5026 - system vmcnt(0) vmcnt(0) & vscnt(0) 5027 5028 - If OpenCL and - If OpenCL and 5029 address space is address space is 5030 not generic, omit not generic, omit 5031 lgkmcnt(0). lgkmcnt(0). 5032 - If OpenCL and 5033 address space is 5034 local, omit 5035 vmcnt(0) and vscnt(0). 5036 - However, since LLVM - However, since LLVM 5037 currently has no currently has no 5038 address space on address space on 5039 the fence need to the fence need to 5040 conservatively conservatively 5041 always generate always generate 5042 (see comment for (see comment for 5043 previous fence). previous fence). 5044 - Could be split into - Could be split into 5045 separate s_waitcnt separate s_waitcnt 5046 vmcnt(0) and vmcnt(0), s_waitcnt 5047 s_waitcnt vscnt(0) and s_waitcnt 5048 lgkmcnt(0) to allow lgkmcnt(0) to allow 5049 them to be them to be 5050 independently moved independently moved 5051 according to the according to the 5052 following rules. following rules. 5053 - s_waitcnt vmcnt(0) - s_waitcnt vmcnt(0) 5054 must happen after must happen after 5055 any preceding any preceding 5056 global/generic global/generic 5057 load/store/load load/load 5058 atomic/store atomic/ 5059 atomic/atomicrmw. atomicrmw-with-return-value. 5060 - s_waitcnt vscnt(0) 5061 must happen after 5062 any preceding 5063 global/generic 5064 store/store atomic/ 5065 atomicrmw-no-return-value. 5066 - s_waitcnt lgkmcnt(0) - s_waitcnt lgkmcnt(0) 5067 must happen after must happen after 5068 any preceding any preceding 5069 local/generic local/generic 5070 load/store/load load/store/load 5071 atomic/store atomic/store 5072 atomic/atomicrmw. atomic/atomicrmw. 5073 - Must happen before - Must happen before 5074 the following the following 5075 buffer_wbinvl1_vol. buffer_gl*_inv. 5076 - Ensures that the - Ensures that the 5077 preceding preceding 5078 global/local/generic global/local/generic 5079 load load 5080 atomic/atomicrmw atomic/atomicrmw 5081 with an equal or with an equal or 5082 wider sync scope wider sync scope 5083 and memory ordering and memory ordering 5084 stronger than stronger than 5085 unordered (this is unordered (this is 5086 termed the termed the 5087 acquire-fence-paired-atomic acquire-fence-paired-atomic 5088 ) has completed ) has completed 5089 before invalidating before invalidating 5090 the cache. This the caches. This 5091 satisfies the satisfies the 5092 requirements of requirements of 5093 acquire. acquire. 5094 - Ensures that all - Ensures that all 5095 previous memory previous memory 5096 operations have operations have 5097 completed before a completed before a 5098 following following 5099 global/local/generic global/local/generic 5100 store store 5101 atomic/atomicrmw atomic/atomicrmw 5102 with an equal or with an equal or 5103 wider sync scope wider sync scope 5104 and memory ordering and memory ordering 5105 stronger than stronger than 5106 unordered (this is unordered (this is 5107 termed the termed the 5108 release-fence-paired-atomic release-fence-paired-atomic 5109 ). This satisfies the ). This satisfies the 5110 requirements of requirements of 5111 release. release. 5112 5113 2. buffer_wbinvl1_vol 2. buffer_gl0_inv; 5114 buffer_gl1_inv 5115 5116 - Must happen before - Must happen before 5117 any following any following 5118 global/generic global/generic 5119 load/load load/load 5120 atomic/store/store atomic/store/store 5121 atomic/atomicrmw. atomic/atomicrmw. 5122 - Ensures that - Ensures that 5123 following loads following loads 5124 will not see stale will not see stale 5125 global data. This global data. This 5126 satisfies the satisfies the 5127 requirements of requirements of 5128 acquire. acquire. 5129 5130 **Sequential Consistent Atomic** 5131 ---------------------------------------------------------------------------------------------------------------------- 5132 load atomic seq_cst - singlethread - global *Same as corresponding *Same as corresponding 5133 - wavefront - local load atomic acquire, load atomic acquire, 5134 - generic except must generated except must generated 5135 all instructions even all instructions even 5136 for OpenCL.* for OpenCL.* 5137 load atomic seq_cst - workgroup - global 1. s_waitcnt lgkmcnt(0) 1. s_waitcnt lgkmcnt(0) & 5138 - generic vmcnt(0) & vscnt(0) 5139 5140 - If CU wavefront execution mode, omit vmcnt and 5141 vscnt. 5142 - Could be split into 5143 separate s_waitcnt 5144 vmcnt(0), s_waitcnt 5145 vscnt(0) and s_waitcnt 5146 lgkmcnt(0) to allow 5147 them to be 5148 independently moved 5149 according to the 5150 following rules. 5151 - Must - waitcnt lgkmcnt(0) must 5152 happen after happen after 5153 preceding preceding 5154 global/generic load local load 5155 atomic/store atomic/store 5156 atomic/atomicrmw atomic/atomicrmw 5157 with memory with memory 5158 ordering of seq_cst ordering of seq_cst 5159 and with equal or and with equal or 5160 wider sync scope. wider sync scope. 5161 (Note that seq_cst (Note that seq_cst 5162 fences have their fences have their 5163 own s_waitcnt own s_waitcnt 5164 lgkmcnt(0) and so do lgkmcnt(0) and so do 5165 not need to be not need to be 5166 considered.) considered.) 5167 - waitcnt vmcnt(0) 5168 Must happen after 5169 preceding 5170 global/generic load 5171 atomic/ 5172 atomicrmw-with-return-value 5173 with memory 5174 ordering of seq_cst 5175 and with equal or 5176 wider sync scope. 5177 (Note that seq_cst 5178 fences have their 5179 own s_waitcnt 5180 vmcnt(0) and so do 5181 not need to be 5182 considered.) 5183 - waitcnt vscnt(0) 5184 Must happen after 5185 preceding 5186 global/generic store 5187 atomic/ 5188 atomicrmw-no-return-value 5189 with memory 5190 ordering of seq_cst 5191 and with equal or 5192 wider sync scope. 5193 (Note that seq_cst 5194 fences have their 5195 own s_waitcnt 5196 vscnt(0) and so do 5197 not need to be 5198 considered.) 5199 - Ensures any - Ensures any 5200 preceding preceding 5201 sequential sequential 5202 consistent local consistent global/local 5203 memory instructions memory instructions 5204 have completed have completed 5205 before executing before executing 5206 this sequentially this sequentially 5207 consistent consistent 5208 instruction. This instruction. This 5209 prevents reordering prevents reordering 5210 a seq_cst store a seq_cst store 5211 followed by a followed by a 5212 seq_cst load. (Note seq_cst load. (Note 5213 that seq_cst is that seq_cst is 5214 stronger than stronger than 5215 acquire/release as acquire/release as 5216 the reordering of the reordering of 5217 load acquire load acquire 5218 followed by a store followed by a store 5219 release is release is 5220 prevented by the prevented by the 5221 waitcnt of waitcnt of 5222 the release, but the release, but 5223 there is nothing there is nothing 5224 preventing a store preventing a store 5225 release followed by release followed by 5226 load acquire from load acquire from 5227 competing out of competing out of 5228 order.) order.) 5229 5230 2. *Following 2. *Following 5231 instructions same as instructions same as 5232 corresponding load corresponding load 5233 atomic acquire, atomic acquire, 5234 except must generated except must generated 5235 all instructions even all instructions even 5236 for OpenCL.* for OpenCL.* 5237 load atomic seq_cst - workgroup - local *Same as corresponding 5238 load atomic acquire, 5239 except must generated 5240 all instructions even 5241 for OpenCL.* 5242 5243 1. s_waitcnt vmcnt(0) & vscnt(0) 5244 5245 - If CU wavefront execution mode, omit. 5246 - Could be split into 5247 separate s_waitcnt 5248 vmcnt(0) and s_waitcnt 5249 vscnt(0) to allow 5250 them to be 5251 independently moved 5252 according to the 5253 following rules. 5254 - waitcnt vmcnt(0) 5255 Must happen after 5256 preceding 5257 global/generic load 5258 atomic/ 5259 atomicrmw-with-return-value 5260 with memory 5261 ordering of seq_cst 5262 and with equal or 5263 wider sync scope. 5264 (Note that seq_cst 5265 fences have their 5266 own s_waitcnt 5267 vmcnt(0) and so do 5268 not need to be 5269 considered.) 5270 - waitcnt vscnt(0) 5271 Must happen after 5272 preceding 5273 global/generic store 5274 atomic/ 5275 atomicrmw-no-return-value 5276 with memory 5277 ordering of seq_cst 5278 and with equal or 5279 wider sync scope. 5280 (Note that seq_cst 5281 fences have their 5282 own s_waitcnt 5283 vscnt(0) and so do 5284 not need to be 5285 considered.) 5286 - Ensures any 5287 preceding 5288 sequential 5289 consistent global 5290 memory instructions 5291 have completed 5292 before executing 5293 this sequentially 5294 consistent 5295 instruction. This 5296 prevents reordering 5297 a seq_cst store 5298 followed by a 5299 seq_cst load. (Note 5300 that seq_cst is 5301 stronger than 5302 acquire/release as 5303 the reordering of 5304 load acquire 5305 followed by a store 5306 release is 5307 prevented by the 5308 waitcnt of 5309 the release, but 5310 there is nothing 5311 preventing a store 5312 release followed by 5313 load acquire from 5314 competing out of 5315 order.) 5316 5317 2. *Following 5318 instructions same as 5319 corresponding load 5320 atomic acquire, 5321 except must generated 5322 all instructions even 5323 for OpenCL.* 5324 5325 load atomic seq_cst - agent - global 1. s_waitcnt lgkmcnt(0) & 1. s_waitcnt lgkmcnt(0) & 5326 - system - generic vmcnt(0) vmcnt(0) & vscnt(0) 5327 5328 - Could be split into - Could be split into 5329 separate s_waitcnt separate s_waitcnt 5330 vmcnt(0) vmcnt(0), s_waitcnt 5331 and s_waitcnt vscnt(0) and s_waitcnt 5332 lgkmcnt(0) to allow lgkmcnt(0) to allow 5333 them to be them to be 5334 independently moved independently moved 5335 according to the according to the 5336 following rules. following rules. 5337 - waitcnt lgkmcnt(0) - waitcnt lgkmcnt(0) 5338 must happen after must happen after 5339 preceding preceding 5340 global/generic load local load 5341 atomic/store atomic/store 5342 atomic/atomicrmw atomic/atomicrmw 5343 with memory with memory 5344 ordering of seq_cst ordering of seq_cst 5345 and with equal or and with equal or 5346 wider sync scope. wider sync scope. 5347 (Note that seq_cst (Note that seq_cst 5348 fences have their fences have their 5349 own s_waitcnt own s_waitcnt 5350 lgkmcnt(0) and so do lgkmcnt(0) and so do 5351 not need to be not need to be 5352 considered.) considered.) 5353 - waitcnt vmcnt(0) - waitcnt vmcnt(0) 5354 must happen after must happen after 5355 preceding preceding 5356 global/generic load global/generic load 5357 atomic/store atomic/ 5358 atomic/atomicrmw atomicrmw-with-return-value 5359 with memory with memory 5360 ordering of seq_cst ordering of seq_cst 5361 and with equal or and with equal or 5362 wider sync scope. wider sync scope. 5363 (Note that seq_cst (Note that seq_cst 5364 fences have their fences have their 5365 own s_waitcnt own s_waitcnt 5366 vmcnt(0) and so do vmcnt(0) and so do 5367 not need to be not need to be 5368 considered.) considered.) 5369 - waitcnt vscnt(0) 5370 Must happen after 5371 preceding 5372 global/generic store 5373 atomic/ 5374 atomicrmw-no-return-value 5375 with memory 5376 ordering of seq_cst 5377 and with equal or 5378 wider sync scope. 5379 (Note that seq_cst 5380 fences have their 5381 own s_waitcnt 5382 vscnt(0) and so do 5383 not need to be 5384 considered.) 5385 - Ensures any - Ensures any 5386 preceding preceding 5387 sequential sequential 5388 consistent global consistent global 5389 memory instructions memory instructions 5390 have completed have completed 5391 before executing before executing 5392 this sequentially this sequentially 5393 consistent consistent 5394 instruction. This instruction. This 5395 prevents reordering prevents reordering 5396 a seq_cst store a seq_cst store 5397 followed by a followed by a 5398 seq_cst load. (Note seq_cst load. (Note 5399 that seq_cst is that seq_cst is 5400 stronger than stronger than 5401 acquire/release as acquire/release as 5402 the reordering of the reordering of 5403 load acquire load acquire 5404 followed by a store followed by a store 5405 release is release is 5406 prevented by the prevented by the 5407 waitcnt of waitcnt of 5408 the release, but the release, but 5409 there is nothing there is nothing 5410 preventing a store preventing a store 5411 release followed by release followed by 5412 load acquire from load acquire from 5413 competing out of competing out of 5414 order.) order.) 5415 5416 2. *Following 2. *Following 5417 instructions same as instructions same as 5418 corresponding load corresponding load 5419 atomic acquire, atomic acquire, 5420 except must generated except must generated 5421 all instructions even all instructions even 5422 for OpenCL.* for OpenCL.* 5423 store atomic seq_cst - singlethread - global *Same as corresponding *Same as corresponding 5424 - wavefront - local store atomic release, store atomic release, 5425 - workgroup - generic except must generated except must generated 5426 all instructions even all instructions even 5427 for OpenCL.* for OpenCL.* 5428 store atomic seq_cst - agent - global *Same as corresponding *Same as corresponding 5429 - system - generic store atomic release, store atomic release, 5430 except must generated except must generated 5431 all instructions even all instructions even 5432 for OpenCL.* for OpenCL.* 5433 atomicrmw seq_cst - singlethread - global *Same as corresponding *Same as corresponding 5434 - wavefront - local atomicrmw acq_rel, atomicrmw acq_rel, 5435 - workgroup - generic except must generated except must generated 5436 all instructions even all instructions even 5437 for OpenCL.* for OpenCL.* 5438 atomicrmw seq_cst - agent - global *Same as corresponding *Same as corresponding 5439 - system - generic atomicrmw acq_rel, atomicrmw acq_rel, 5440 except must generated except must generated 5441 all instructions even all instructions even 5442 for OpenCL.* for OpenCL.* 5443 fence seq_cst - singlethread *none* *Same as corresponding *Same as corresponding 5444 - wavefront fence acq_rel, fence acq_rel, 5445 - workgroup except must generated except must generated 5446 - agent all instructions even all instructions even 5447 - system for OpenCL.* for OpenCL.* 5448 ============ ============ ============== ========== =============================== ================================== 5449 5450The memory order also adds the single thread optimization constrains defined in 5451table 5452:ref:`amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-gfx6-gfx10-table`. 5453 5454 .. table:: AMDHSA Memory Model Single Thread Optimization Constraints GFX6-GFX10 5455 :name: amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-gfx6-gfx10-table 5456 5457 ============ ============================================================== 5458 LLVM Memory Optimization Constraints 5459 Ordering 5460 ============ ============================================================== 5461 unordered *none* 5462 monotonic *none* 5463 acquire - If a load atomic/atomicrmw then no following load/load 5464 atomic/store/ store atomic/atomicrmw/fence instruction can 5465 be moved before the acquire. 5466 - If a fence then same as load atomic, plus no preceding 5467 associated fence-paired-atomic can be moved after the fence. 5468 release - If a store atomic/atomicrmw then no preceding load/load 5469 atomic/store/ store atomic/atomicrmw/fence instruction can 5470 be moved after the release. 5471 - If a fence then same as store atomic, plus no following 5472 associated fence-paired-atomic can be moved before the 5473 fence. 5474 acq_rel Same constraints as both acquire and release. 5475 seq_cst - If a load atomic then same constraints as acquire, plus no 5476 preceding sequentially consistent load atomic/store 5477 atomic/atomicrmw/fence instruction can be moved after the 5478 seq_cst. 5479 - If a store atomic then the same constraints as release, plus 5480 no following sequentially consistent load atomic/store 5481 atomic/atomicrmw/fence instruction can be moved before the 5482 seq_cst. 5483 - If an atomicrmw/fence then same constraints as acq_rel. 5484 ============ ============================================================== 5485 5486Trap Handler ABI 5487~~~~~~~~~~~~~~~~ 5488 5489For code objects generated by AMDGPU backend for HSA [HSA]_ compatible runtimes 5490(such as ROCm [AMD-ROCm]_), the runtime installs a trap handler that supports 5491the ``s_trap`` instruction with the following usage: 5492 5493 .. table:: AMDGPU Trap Handler for AMDHSA OS 5494 :name: amdgpu-trap-handler-for-amdhsa-os-table 5495 5496 =================== =============== =============== ======================= 5497 Usage Code Sequence Trap Handler Description 5498 Inputs 5499 =================== =============== =============== ======================= 5500 reserved ``s_trap 0x00`` Reserved by hardware. 5501 ``debugtrap(arg)`` ``s_trap 0x01`` ``SGPR0-1``: Reserved for HSA 5502 ``queue_ptr`` ``debugtrap`` 5503 ``VGPR0``: intrinsic (not 5504 ``arg`` implemented). 5505 ``llvm.trap`` ``s_trap 0x02`` ``SGPR0-1``: Causes dispatch to be 5506 ``queue_ptr`` terminated and its 5507 associated queue put 5508 into the error state. 5509 ``llvm.debugtrap`` ``s_trap 0x03`` - If debugger not 5510 installed then 5511 behaves as a 5512 no-operation. The 5513 trap handler is 5514 entered and 5515 immediately returns 5516 to continue 5517 execution of the 5518 wavefront. 5519 - If the debugger is 5520 installed, causes 5521 the debug trap to be 5522 reported by the 5523 debugger and the 5524 wavefront is put in 5525 the halt state until 5526 resumed by the 5527 debugger. 5528 reserved ``s_trap 0x04`` Reserved. 5529 reserved ``s_trap 0x05`` Reserved. 5530 reserved ``s_trap 0x06`` Reserved. 5531 debugger breakpoint ``s_trap 0x07`` Reserved for debugger 5532 breakpoints. 5533 reserved ``s_trap 0x08`` Reserved. 5534 reserved ``s_trap 0xfe`` Reserved. 5535 reserved ``s_trap 0xff`` Reserved. 5536 =================== =============== =============== ======================= 5537 5538AMDPAL 5539------ 5540 5541This section provides code conventions used when the target triple OS is 5542``amdpal`` (see :ref:`amdgpu-target-triples`) for passing runtime parameters 5543from the application/runtime to each invocation of a hardware shader. These 5544parameters include both generic, application-controlled parameters called 5545*user data* as well as system-generated parameters that are a product of the 5546draw or dispatch execution. 5547 5548User Data 5549~~~~~~~~~ 5550 5551Each hardware stage has a set of 32-bit *user data registers* which can be 5552written from a command buffer and then loaded into SGPRs when waves are launched 5553via a subsequent dispatch or draw operation. This is the way most arguments are 5554passed from the application/runtime to a hardware shader. 5555 5556Compute User Data 5557~~~~~~~~~~~~~~~~~ 5558 5559Compute shader user data mappings are simpler than graphics shaders, and have a 5560fixed mapping. 5561 5562Note that there are always 10 available *user data entries* in registers - 5563entries beyond that limit must be fetched from memory (via the spill table 5564pointer) by the shader. 5565 5566 .. table:: PAL Compute Shader User Data Registers 5567 :name: pal-compute-user-data-registers 5568 5569 ============= ================================ 5570 User Register Description 5571 ============= ================================ 5572 0 Global Internal Table (32-bit pointer) 5573 1 Per-Shader Internal Table (32-bit pointer) 5574 2 - 11 Application-Controlled User Data (10 32-bit values) 5575 12 Spill Table (32-bit pointer) 5576 13 - 14 Thread Group Count (64-bit pointer) 5577 15 GDS Range 5578 ============= ================================ 5579 5580Graphics User Data 5581~~~~~~~~~~~~~~~~~~ 5582 5583Graphics pipelines support a much more flexible user data mapping: 5584 5585 .. table:: PAL Graphics Shader User Data Registers 5586 :name: pal-graphics-user-data-registers 5587 5588 ============= ================================ 5589 User Register Description 5590 ============= ================================ 5591 0 Global Internal Table (32-bit pointer) 5592 + Per-Shader Internal Table (32-bit pointer) 5593 + 1-15 Application Controlled User Data 5594 (1-15 Contiguous 32-bit Values in Registers) 5595 + Spill Table (32-bit pointer) 5596 + Draw Index (First Stage Only) 5597 + Vertex Offset (First Stage Only) 5598 + Instance Offset (First Stage Only) 5599 ============= ================================ 5600 5601 The placement of the global internal table remains fixed in the first *user 5602 data SGPR register*. Otherwise all parameters are optional, and can be mapped 5603 to any desired *user data SGPR register*, with the following regstrictions: 5604 5605 * Draw Index, Vertex Offset, and Instance Offset can only be used by the first 5606 activehardware stage in a graphics pipeline (i.e. where the API vertex 5607 shader runs). 5608 5609 * Application-controlled user data must be mapped into a contiguous range of 5610 user data registers. 5611 5612 * The application-controlled user data range supports compaction remapping, so 5613 only *entries* that are actually consumed by the shader must be assigned to 5614 corresponding *registers*. Note that in order to support an efficient runtime 5615 implementation, the remapping must pack *registers* in the same order as 5616 *entries*, with unused *entries* removed. 5617 5618.. _pal_global_internal_table: 5619 5620Global Internal Table 5621~~~~~~~~~~~~~~~~~~~~~ 5622 5623The global internal table is a table of *shader resource descriptors* (SRDs) that 5624define how certain engine-wide, runtime-managed resources should be accessed 5625from a shader. The majority of these resources have HW-defined formats, and it 5626is up to the compiler to write/read data as required by the target hardware. 5627 5628The following table illustrates the required format: 5629 5630 .. table:: PAL Global Internal Table 5631 :name: pal-git-table 5632 5633 ============= ================================ 5634 Offset Description 5635 ============= ================================ 5636 0-3 Graphics Scratch SRD 5637 4-7 Compute Scratch SRD 5638 8-11 ES/GS Ring Output SRD 5639 12-15 ES/GS Ring Input SRD 5640 16-19 GS/VS Ring Output #0 5641 20-23 GS/VS Ring Output #1 5642 24-27 GS/VS Ring Output #2 5643 28-31 GS/VS Ring Output #3 5644 32-35 GS/VS Ring Input SRD 5645 36-39 Tessellation Factor Buffer SRD 5646 40-43 Off-Chip LDS Buffer SRD 5647 44-47 Off-Chip Param Cache Buffer SRD 5648 48-51 Sample Position Buffer SRD 5649 52 vaRange::ShadowDescriptorTable High Bits 5650 ============= ================================ 5651 5652 The pointer to the global internal table passed to the shader as user data 5653 is a 32-bit pointer. The top 32 bits should be assumed to be the same as 5654 the top 32 bits of the pipeline, so the shader may use the program 5655 counter's top 32 bits. 5656 5657Unspecified OS 5658-------------- 5659 5660This section provides code conventions used when the target triple OS is 5661empty (see :ref:`amdgpu-target-triples`). 5662 5663Trap Handler ABI 5664~~~~~~~~~~~~~~~~ 5665 5666For code objects generated by AMDGPU backend for non-amdhsa OS, the runtime does 5667not install a trap handler. The ``llvm.trap`` and ``llvm.debugtrap`` 5668instructions are handled as follows: 5669 5670 .. table:: AMDGPU Trap Handler for Non-AMDHSA OS 5671 :name: amdgpu-trap-handler-for-non-amdhsa-os-table 5672 5673 =============== =============== =========================================== 5674 Usage Code Sequence Description 5675 =============== =============== =========================================== 5676 llvm.trap s_endpgm Causes wavefront to be terminated. 5677 llvm.debugtrap *none* Compiler warning given that there is no 5678 trap handler installed. 5679 =============== =============== =========================================== 5680 5681Source Languages 5682================ 5683 5684.. _amdgpu-opencl: 5685 5686OpenCL 5687------ 5688 5689When the language is OpenCL the following differences occur: 5690 56911. The OpenCL memory model is used (see :ref:`amdgpu-amdhsa-memory-model`). 56922. The AMDGPU backend appends additional arguments to the kernel's explicit 5693 arguments for the AMDHSA OS (see 5694 :ref:`opencl-kernel-implicit-arguments-appended-for-amdhsa-os-table`). 56953. Additional metadata is generated 5696 (see :ref:`amdgpu-amdhsa-code-object-metadata`). 5697 5698 .. table:: OpenCL kernel implicit arguments appended for AMDHSA OS 5699 :name: opencl-kernel-implicit-arguments-appended-for-amdhsa-os-table 5700 5701 ======== ==== ========= =========================================== 5702 Position Byte Byte Description 5703 Size Alignment 5704 ======== ==== ========= =========================================== 5705 1 8 8 OpenCL Global Offset X 5706 2 8 8 OpenCL Global Offset Y 5707 3 8 8 OpenCL Global Offset Z 5708 4 8 8 OpenCL address of printf buffer 5709 5 8 8 OpenCL address of virtual queue used by 5710 enqueue_kernel. 5711 6 8 8 OpenCL address of AqlWrap struct used by 5712 enqueue_kernel. 5713 7 8 8 Pointer argument used for Multi-gird 5714 synchronization. 5715 ======== ==== ========= =========================================== 5716 5717.. _amdgpu-hcc: 5718 5719HCC 5720--- 5721 5722When the language is HCC the following differences occur: 5723 57241. The HSA memory model is used (see :ref:`amdgpu-amdhsa-memory-model`). 5725 5726.. _amdgpu-assembler: 5727 5728Assembler 5729--------- 5730 5731AMDGPU backend has LLVM-MC based assembler which is currently in development. 5732It supports AMDGCN GFX6-GFX10. 5733 5734This section describes general syntax for instructions and operands. 5735 5736Instructions 5737~~~~~~~~~~~~ 5738 5739.. toctree:: 5740 :hidden: 5741 5742 AMDGPU/AMDGPUAsmGFX7 5743 AMDGPU/AMDGPUAsmGFX8 5744 AMDGPU/AMDGPUAsmGFX9 5745 AMDGPU/AMDGPUAsmGFX10 5746 AMDGPUModifierSyntax 5747 AMDGPUOperandSyntax 5748 AMDGPUInstructionSyntax 5749 AMDGPUInstructionNotation 5750 5751An instruction has the following :doc:`syntax<AMDGPUInstructionSyntax>`: 5752 5753 ``<``\ *opcode*\ ``> <``\ *operand0*\ ``>, <``\ *operand1*\ ``>,... <``\ *modifier0*\ ``> <``\ *modifier1*\ ``>...`` 5754 5755:doc:`Operands<AMDGPUOperandSyntax>` are normally comma-separated while 5756:doc:`modifiers<AMDGPUModifierSyntax>` are space-separated. 5757 5758The order of *operands* and *modifiers* is fixed. 5759Most *modifiers* are optional and may be omitted. 5760 5761See detailed instruction syntax description for :doc:`GFX7<AMDGPU/AMDGPUAsmGFX7>`, 5762:doc:`GFX8<AMDGPU/AMDGPUAsmGFX8>`, :doc:`GFX9<AMDGPU/AMDGPUAsmGFX9>` 5763and :doc:`GFX10<AMDGPU/AMDGPUAsmGFX10>`. 5764 5765Note that features under development are not included in this description. 5766 5767For more information about instructions, their semantics and supported combinations of 5768operands, refer to one of instruction set architecture manuals 5769[AMD-GCN-GFX6]_, [AMD-GCN-GFX7]_, [AMD-GCN-GFX8]_, [AMD-GCN-GFX9]_ and 5770[AMD-GCN-GFX10]_. 5771 5772Operands 5773~~~~~~~~ 5774 5775Detailed description of operands may be found :doc:`here<AMDGPUOperandSyntax>`. 5776 5777Modifiers 5778~~~~~~~~~ 5779 5780Detailed description of modifiers may be found :doc:`here<AMDGPUModifierSyntax>`. 5781 5782Instruction Examples 5783~~~~~~~~~~~~~~~~~~~~ 5784 5785DS 5786++ 5787 5788.. code-block:: nasm 5789 5790 ds_add_u32 v2, v4 offset:16 5791 ds_write_src2_b64 v2 offset0:4 offset1:8 5792 ds_cmpst_f32 v2, v4, v6 5793 ds_min_rtn_f64 v[8:9], v2, v[4:5] 5794 5795 5796For full list of supported instructions, refer to "LDS/GDS instructions" in ISA Manual. 5797 5798FLAT 5799++++ 5800 5801.. code-block:: nasm 5802 5803 flat_load_dword v1, v[3:4] 5804 flat_store_dwordx3 v[3:4], v[5:7] 5805 flat_atomic_swap v1, v[3:4], v5 glc 5806 flat_atomic_cmpswap v1, v[3:4], v[5:6] glc slc 5807 flat_atomic_fmax_x2 v[1:2], v[3:4], v[5:6] glc 5808 5809For full list of supported instructions, refer to "FLAT instructions" in ISA Manual. 5810 5811MUBUF 5812+++++ 5813 5814.. code-block:: nasm 5815 5816 buffer_load_dword v1, off, s[4:7], s1 5817 buffer_store_dwordx4 v[1:4], v2, ttmp[4:7], s1 offen offset:4 glc tfe 5818 buffer_store_format_xy v[1:2], off, s[4:7], s1 5819 buffer_wbinvl1 5820 buffer_atomic_inc v1, v2, s[8:11], s4 idxen offset:4 slc 5821 5822For full list of supported instructions, refer to "MUBUF Instructions" in ISA Manual. 5823 5824SMRD/SMEM 5825+++++++++ 5826 5827.. code-block:: nasm 5828 5829 s_load_dword s1, s[2:3], 0xfc 5830 s_load_dwordx8 s[8:15], s[2:3], s4 5831 s_load_dwordx16 s[88:103], s[2:3], s4 5832 s_dcache_inv_vol 5833 s_memtime s[4:5] 5834 5835For full list of supported instructions, refer to "Scalar Memory Operations" in ISA Manual. 5836 5837SOP1 5838++++ 5839 5840.. code-block:: nasm 5841 5842 s_mov_b32 s1, s2 5843 s_mov_b64 s[0:1], 0x80000000 5844 s_cmov_b32 s1, 200 5845 s_wqm_b64 s[2:3], s[4:5] 5846 s_bcnt0_i32_b64 s1, s[2:3] 5847 s_swappc_b64 s[2:3], s[4:5] 5848 s_cbranch_join s[4:5] 5849 5850For full list of supported instructions, refer to "SOP1 Instructions" in ISA Manual. 5851 5852SOP2 5853++++ 5854 5855.. code-block:: nasm 5856 5857 s_add_u32 s1, s2, s3 5858 s_and_b64 s[2:3], s[4:5], s[6:7] 5859 s_cselect_b32 s1, s2, s3 5860 s_andn2_b32 s2, s4, s6 5861 s_lshr_b64 s[2:3], s[4:5], s6 5862 s_ashr_i32 s2, s4, s6 5863 s_bfm_b64 s[2:3], s4, s6 5864 s_bfe_i64 s[2:3], s[4:5], s6 5865 s_cbranch_g_fork s[4:5], s[6:7] 5866 5867For full list of supported instructions, refer to "SOP2 Instructions" in ISA Manual. 5868 5869SOPC 5870++++ 5871 5872.. code-block:: nasm 5873 5874 s_cmp_eq_i32 s1, s2 5875 s_bitcmp1_b32 s1, s2 5876 s_bitcmp0_b64 s[2:3], s4 5877 s_setvskip s3, s5 5878 5879For full list of supported instructions, refer to "SOPC Instructions" in ISA Manual. 5880 5881SOPP 5882++++ 5883 5884.. code-block:: nasm 5885 5886 s_barrier 5887 s_nop 2 5888 s_endpgm 5889 s_waitcnt 0 ; Wait for all counters to be 0 5890 s_waitcnt vmcnt(0) & expcnt(0) & lgkmcnt(0) ; Equivalent to above 5891 s_waitcnt vmcnt(1) ; Wait for vmcnt counter to be 1. 5892 s_sethalt 9 5893 s_sleep 10 5894 s_sendmsg 0x1 5895 s_sendmsg sendmsg(MSG_INTERRUPT) 5896 s_trap 1 5897 5898For full list of supported instructions, refer to "SOPP Instructions" in ISA Manual. 5899 5900Unless otherwise mentioned, little verification is performed on the operands 5901of SOPP Instructions, so it is up to the programmer to be familiar with the 5902range or acceptable values. 5903 5904VALU 5905++++ 5906 5907For vector ALU instruction opcodes (VOP1, VOP2, VOP3, VOPC, VOP_DPP, VOP_SDWA), 5908the assembler will automatically use optimal encoding based on its operands. 5909To force specific encoding, one can add a suffix to the opcode of the instruction: 5910 5911* _e32 for 32-bit VOP1/VOP2/VOPC 5912* _e64 for 64-bit VOP3 5913* _dpp for VOP_DPP 5914* _sdwa for VOP_SDWA 5915 5916VOP1/VOP2/VOP3/VOPC examples: 5917 5918.. code-block:: nasm 5919 5920 v_mov_b32 v1, v2 5921 v_mov_b32_e32 v1, v2 5922 v_nop 5923 v_cvt_f64_i32_e32 v[1:2], v2 5924 v_floor_f32_e32 v1, v2 5925 v_bfrev_b32_e32 v1, v2 5926 v_add_f32_e32 v1, v2, v3 5927 v_mul_i32_i24_e64 v1, v2, 3 5928 v_mul_i32_i24_e32 v1, -3, v3 5929 v_mul_i32_i24_e32 v1, -100, v3 5930 v_addc_u32 v1, s[0:1], v2, v3, s[2:3] 5931 v_max_f16_e32 v1, v2, v3 5932 5933VOP_DPP examples: 5934 5935.. code-block:: nasm 5936 5937 v_mov_b32 v0, v0 quad_perm:[0,2,1,1] 5938 v_sin_f32 v0, v0 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0 5939 v_mov_b32 v0, v0 wave_shl:1 5940 v_mov_b32 v0, v0 row_mirror 5941 v_mov_b32 v0, v0 row_bcast:31 5942 v_mov_b32 v0, v0 quad_perm:[1,3,0,1] row_mask:0xa bank_mask:0x1 bound_ctrl:0 5943 v_add_f32 v0, v0, |v0| row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0 5944 v_max_f16 v1, v2, v3 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0 5945 5946VOP_SDWA examples: 5947 5948.. code-block:: nasm 5949 5950 v_mov_b32 v1, v2 dst_sel:BYTE_0 dst_unused:UNUSED_PRESERVE src0_sel:DWORD 5951 v_min_u32 v200, v200, v1 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_1 src1_sel:DWORD 5952 v_sin_f32 v0, v0 dst_unused:UNUSED_PAD src0_sel:WORD_1 5953 v_fract_f32 v0, |v0| dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1 5954 v_cmpx_le_u32 vcc, v1, v2 src0_sel:BYTE_2 src1_sel:WORD_0 5955 5956For full list of supported instructions, refer to "Vector ALU instructions". 5957 5958.. TODO 5959 Remove once we switch to code object v3 by default. 5960 5961.. _amdgpu-amdhsa-assembler-predefined-symbols-v2: 5962 5963Code Object V2 Predefined Symbols (-mattr=-code-object-v3) 5964~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 5965 5966.. warning:: Code Object V2 is not the default code object version emitted by 5967 this version of LLVM. For a description of the predefined symbols available 5968 with the default configuration (Code Object V3) see 5969 :ref:`amdgpu-amdhsa-assembler-predefined-symbols-v3`. 5970 5971The AMDGPU assembler defines and updates some symbols automatically. These 5972symbols do not affect code generation. 5973 5974.option.machine_version_major 5975+++++++++++++++++++++++++++++ 5976 5977Set to the GFX major generation number of the target being assembled for. For 5978example, when assembling for a "GFX9" target this will be set to the integer 5979value "9". The possible GFX major generation numbers are presented in 5980:ref:`amdgpu-processors`. 5981 5982.option.machine_version_minor 5983+++++++++++++++++++++++++++++ 5984 5985Set to the GFX minor generation number of the target being assembled for. For 5986example, when assembling for a "GFX810" target this will be set to the integer 5987value "1". The possible GFX minor generation numbers are presented in 5988:ref:`amdgpu-processors`. 5989 5990.option.machine_version_stepping 5991++++++++++++++++++++++++++++++++ 5992 5993Set to the GFX stepping generation number of the target being assembled for. 5994For example, when assembling for a "GFX704" target this will be set to the 5995integer value "4". The possible GFX stepping generation numbers are presented 5996in :ref:`amdgpu-processors`. 5997 5998.kernel.vgpr_count 5999++++++++++++++++++ 6000 6001Set to zero each time a 6002:ref:`amdgpu-amdhsa-assembler-directive-amdgpu_hsa_kernel` directive is 6003encountered. At each instruction, if the current value of this symbol is less 6004than or equal to the maximum VPGR number explicitly referenced within that 6005instruction then the symbol value is updated to equal that VGPR number plus 6006one. 6007 6008.kernel.sgpr_count 6009++++++++++++++++++ 6010 6011Set to zero each time a 6012:ref:`amdgpu-amdhsa-assembler-directive-amdgpu_hsa_kernel` directive is 6013encountered. At each instruction, if the current value of this symbol is less 6014than or equal to the maximum VPGR number explicitly referenced within that 6015instruction then the symbol value is updated to equal that SGPR number plus 6016one. 6017 6018.. _amdgpu-amdhsa-assembler-directives-v2: 6019 6020Code Object V2 Directives (-mattr=-code-object-v3) 6021~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 6022 6023.. warning:: Code Object V2 is not the default code object version emitted by 6024 this version of LLVM. For a description of the directives supported with 6025 the default configuration (Code Object V3) see 6026 :ref:`amdgpu-amdhsa-assembler-directives-v3`. 6027 6028AMDGPU ABI defines auxiliary data in output code object. In assembly source, 6029one can specify them with assembler directives. 6030 6031.hsa_code_object_version major, minor 6032+++++++++++++++++++++++++++++++++++++ 6033 6034*major* and *minor* are integers that specify the version of the HSA code 6035object that will be generated by the assembler. 6036 6037.hsa_code_object_isa [major, minor, stepping, vendor, arch] 6038+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 6039 6040 6041*major*, *minor*, and *stepping* are all integers that describe the instruction 6042set architecture (ISA) version of the assembly program. 6043 6044*vendor* and *arch* are quoted strings. *vendor* should always be equal to 6045"AMD" and *arch* should always be equal to "AMDGPU". 6046 6047By default, the assembler will derive the ISA version, *vendor*, and *arch* 6048from the value of the -mcpu option that is passed to the assembler. 6049 6050.. _amdgpu-amdhsa-assembler-directive-amdgpu_hsa_kernel: 6051 6052.amdgpu_hsa_kernel (name) 6053+++++++++++++++++++++++++ 6054 6055This directives specifies that the symbol with given name is a kernel entry point 6056(label) and the object should contain corresponding symbol of type STT_AMDGPU_HSA_KERNEL. 6057 6058.amd_kernel_code_t 6059++++++++++++++++++ 6060 6061This directive marks the beginning of a list of key / value pairs that are used 6062to specify the amd_kernel_code_t object that will be emitted by the assembler. 6063The list must be terminated by the *.end_amd_kernel_code_t* directive. For 6064any amd_kernel_code_t values that are unspecified a default value will be 6065used. The default value for all keys is 0, with the following exceptions: 6066 6067- *amd_code_version_major* defaults to 1. 6068- *amd_kernel_code_version_minor* defaults to 2. 6069- *amd_machine_kind* defaults to 1. 6070- *amd_machine_version_major*, *machine_version_minor*, and 6071 *amd_machine_version_stepping* are derived from the value of the -mcpu option 6072 that is passed to the assembler. 6073- *kernel_code_entry_byte_offset* defaults to 256. 6074- *wavefront_size* defaults 6 for all targets before GFX10. For GFX10 onwards 6075 defaults to 6 if target feature ``wavefrontsize64`` is enabled, otherwise 5. 6076 Note that wavefront size is specified as a power of two, so a value of **n** 6077 means a size of 2^ **n**. 6078- *call_convention* defaults to -1. 6079- *kernarg_segment_alignment*, *group_segment_alignment*, and 6080 *private_segment_alignment* default to 4. Note that alignments are specified 6081 as a power of 2, so a value of **n** means an alignment of 2^ **n**. 6082- *enable_wgp_mode* defaults to 1 if target feature ``cumode`` is disabled for 6083 GFX10 onwards. 6084- *enable_mem_ordered* defaults to 1 for GFX10 onwards. 6085 6086The *.amd_kernel_code_t* directive must be placed immediately after the 6087function label and before any instructions. 6088 6089For a full list of amd_kernel_code_t keys, refer to AMDGPU ABI document, 6090comments in lib/Target/AMDGPU/AmdKernelCodeT.h and test/CodeGen/AMDGPU/hsa.s. 6091 6092.. _amdgpu-amdhsa-assembler-example-v2: 6093 6094Code Object V2 Example Source Code (-mattr=-code-object-v3) 6095~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 6096 6097.. warning:: Code Object V2 is not the default code object version emitted by 6098 this version of LLVM. For a description of the directives supported with 6099 the default configuration (Code Object V3) see 6100 :ref:`amdgpu-amdhsa-assembler-example-v3`. 6101 6102Here is an example of a minimal assembly source file, defining one HSA kernel: 6103 6104.. code-block:: none 6105 6106 .hsa_code_object_version 1,0 6107 .hsa_code_object_isa 6108 6109 .hsatext 6110 .globl hello_world 6111 .p2align 8 6112 .amdgpu_hsa_kernel hello_world 6113 6114 hello_world: 6115 6116 .amd_kernel_code_t 6117 enable_sgpr_kernarg_segment_ptr = 1 6118 is_ptr64 = 1 6119 compute_pgm_rsrc1_vgprs = 0 6120 compute_pgm_rsrc1_sgprs = 0 6121 compute_pgm_rsrc2_user_sgpr = 2 6122 compute_pgm_rsrc1_wgp_mode = 0 6123 compute_pgm_rsrc1_mem_ordered = 0 6124 compute_pgm_rsrc1_fwd_progress = 1 6125 .end_amd_kernel_code_t 6126 6127 s_load_dwordx2 s[0:1], s[0:1] 0x0 6128 v_mov_b32 v0, 3.14159 6129 s_waitcnt lgkmcnt(0) 6130 v_mov_b32 v1, s0 6131 v_mov_b32 v2, s1 6132 flat_store_dword v[1:2], v0 6133 s_endpgm 6134 .Lfunc_end0: 6135 .size hello_world, .Lfunc_end0-hello_world 6136 6137.. _amdgpu-amdhsa-assembler-predefined-symbols-v3: 6138 6139Code Object V3 Predefined Symbols (-mattr=+code-object-v3) 6140~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 6141 6142The AMDGPU assembler defines and updates some symbols automatically. These 6143symbols do not affect code generation. 6144 6145.amdgcn.gfx_generation_number 6146+++++++++++++++++++++++++++++ 6147 6148Set to the GFX major generation number of the target being assembled for. For 6149example, when assembling for a "GFX9" target this will be set to the integer 6150value "9". The possible GFX major generation numbers are presented in 6151:ref:`amdgpu-processors`. 6152 6153.amdgcn.gfx_generation_minor 6154++++++++++++++++++++++++++++ 6155 6156Set to the GFX minor generation number of the target being assembled for. For 6157example, when assembling for a "GFX810" target this will be set to the integer 6158value "1". The possible GFX minor generation numbers are presented in 6159:ref:`amdgpu-processors`. 6160 6161.amdgcn.gfx_generation_stepping 6162+++++++++++++++++++++++++++++++ 6163 6164Set to the GFX stepping generation number of the target being assembled for. 6165For example, when assembling for a "GFX704" target this will be set to the 6166integer value "4". The possible GFX stepping generation numbers are presented 6167in :ref:`amdgpu-processors`. 6168 6169.. _amdgpu-amdhsa-assembler-symbol-next_free_vgpr: 6170 6171.amdgcn.next_free_vgpr 6172++++++++++++++++++++++ 6173 6174Set to zero before assembly begins. At each instruction, if the current value 6175of this symbol is less than or equal to the maximum VGPR number explicitly 6176referenced within that instruction then the symbol value is updated to equal 6177that VGPR number plus one. 6178 6179May be used to set the `.amdhsa_next_free_vpgr` directive in 6180:ref:`amdhsa-kernel-directives-table`. 6181 6182May be set at any time, e.g. manually set to zero at the start of each kernel. 6183 6184.. _amdgpu-amdhsa-assembler-symbol-next_free_sgpr: 6185 6186.amdgcn.next_free_sgpr 6187++++++++++++++++++++++ 6188 6189Set to zero before assembly begins. At each instruction, if the current value 6190of this symbol is less than or equal the maximum SGPR number explicitly 6191referenced within that instruction then the symbol value is updated to equal 6192that SGPR number plus one. 6193 6194May be used to set the `.amdhsa_next_free_spgr` directive in 6195:ref:`amdhsa-kernel-directives-table`. 6196 6197May be set at any time, e.g. manually set to zero at the start of each kernel. 6198 6199.. _amdgpu-amdhsa-assembler-directives-v3: 6200 6201Code Object V3 Directives (-mattr=+code-object-v3) 6202~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 6203 6204Directives which begin with ``.amdgcn`` are valid for all ``amdgcn`` 6205architecture processors, and are not OS-specific. Directives which begin with 6206``.amdhsa`` are specific to ``amdgcn`` architecture processors when the 6207``amdhsa`` OS is specified. See :ref:`amdgpu-target-triples` and 6208:ref:`amdgpu-processors`. 6209 6210.amdgcn_target <target> 6211+++++++++++++++++++++++ 6212 6213Optional directive which declares the target supported by the containing 6214assembler source file. Valid values are described in 6215:ref:`amdgpu-amdhsa-code-object-target-identification`. Used by the assembler 6216to validate command-line options such as ``-triple``, ``-mcpu``, and those 6217which specify target features. 6218 6219.amdhsa_kernel <name> 6220+++++++++++++++++++++ 6221 6222Creates a correctly aligned AMDHSA kernel descriptor and a symbol, 6223``<name>.kd``, in the current location of the current section. Only valid when 6224the OS is ``amdhsa``. ``<name>`` must be a symbol that labels the first 6225instruction to execute, and does not need to be previously defined. 6226 6227Marks the beginning of a list of directives used to generate the bytes of a 6228kernel descriptor, as described in :ref:`amdgpu-amdhsa-kernel-descriptor`. 6229Directives which may appear in this list are described in 6230:ref:`amdhsa-kernel-directives-table`. Directives may appear in any order, must 6231be valid for the target being assembled for, and cannot be repeated. Directives 6232support the range of values specified by the field they reference in 6233:ref:`amdgpu-amdhsa-kernel-descriptor`. If a directive is not specified, it is 6234assumed to have its default value, unless it is marked as "Required", in which 6235case it is an error to omit the directive. This list of directives is 6236terminated by an ``.end_amdhsa_kernel`` directive. 6237 6238 .. table:: AMDHSA Kernel Assembler Directives 6239 :name: amdhsa-kernel-directives-table 6240 6241 ======================================================== =================== ============ =================== 6242 Directive Default Supported On Description 6243 ======================================================== =================== ============ =================== 6244 ``.amdhsa_group_segment_fixed_size`` 0 GFX6-GFX10 Controls GROUP_SEGMENT_FIXED_SIZE in 6245 :ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx10-table`. 6246 ``.amdhsa_private_segment_fixed_size`` 0 GFX6-GFX10 Controls PRIVATE_SEGMENT_FIXED_SIZE in 6247 :ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx10-table`. 6248 ``.amdhsa_user_sgpr_private_segment_buffer`` 0 GFX6-GFX10 Controls ENABLE_SGPR_PRIVATE_SEGMENT_BUFFER in 6249 :ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx10-table`. 6250 ``.amdhsa_user_sgpr_dispatch_ptr`` 0 GFX6-GFX10 Controls ENABLE_SGPR_DISPATCH_PTR in 6251 :ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx10-table`. 6252 ``.amdhsa_user_sgpr_queue_ptr`` 0 GFX6-GFX10 Controls ENABLE_SGPR_QUEUE_PTR in 6253 :ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx10-table`. 6254 ``.amdhsa_user_sgpr_kernarg_segment_ptr`` 0 GFX6-GFX10 Controls ENABLE_SGPR_KERNARG_SEGMENT_PTR in 6255 :ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx10-table`. 6256 ``.amdhsa_user_sgpr_dispatch_id`` 0 GFX6-GFX10 Controls ENABLE_SGPR_DISPATCH_ID in 6257 :ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx10-table`. 6258 ``.amdhsa_user_sgpr_flat_scratch_init`` 0 GFX6-GFX10 Controls ENABLE_SGPR_FLAT_SCRATCH_INIT in 6259 :ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx10-table`. 6260 ``.amdhsa_user_sgpr_private_segment_size`` 0 GFX6-GFX10 Controls ENABLE_SGPR_PRIVATE_SEGMENT_SIZE in 6261 :ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx10-table`. 6262 ``.amdhsa_wavefront_size32`` Target GFX10 Controls ENABLE_WAVEFRONT_SIZE32 in 6263 Feature :ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx10-table`. 6264 Specific 6265 (-wavefrontsize64) 6266 ``.amdhsa_system_sgpr_private_segment_wavefront_offset`` 0 GFX6-GFX10 Controls ENABLE_SGPR_PRIVATE_SEGMENT_WAVEFRONT_OFFSET in 6267 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`. 6268 ``.amdhsa_system_sgpr_workgroup_id_x`` 1 GFX6-GFX10 Controls ENABLE_SGPR_WORKGROUP_ID_X in 6269 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`. 6270 ``.amdhsa_system_sgpr_workgroup_id_y`` 0 GFX6-GFX10 Controls ENABLE_SGPR_WORKGROUP_ID_Y in 6271 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`. 6272 ``.amdhsa_system_sgpr_workgroup_id_z`` 0 GFX6-GFX10 Controls ENABLE_SGPR_WORKGROUP_ID_Z in 6273 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`. 6274 ``.amdhsa_system_sgpr_workgroup_info`` 0 GFX6-GFX10 Controls ENABLE_SGPR_WORKGROUP_INFO in 6275 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`. 6276 ``.amdhsa_system_vgpr_workitem_id`` 0 GFX6-GFX10 Controls ENABLE_VGPR_WORKITEM_ID in 6277 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`. 6278 Possible values are defined in 6279 :ref:`amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table`. 6280 ``.amdhsa_next_free_vgpr`` Required GFX6-GFX10 Maximum VGPR number explicitly referenced, plus one. 6281 Used to calculate GRANULATED_WORKITEM_VGPR_COUNT in 6282 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`. 6283 ``.amdhsa_next_free_sgpr`` Required GFX6-GFX10 Maximum SGPR number explicitly referenced, plus one. 6284 Used to calculate GRANULATED_WAVEFRONT_SGPR_COUNT in 6285 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`. 6286 ``.amdhsa_reserve_vcc`` 1 GFX6-GFX10 Whether the kernel may use the special VCC SGPR. 6287 Used to calculate GRANULATED_WAVEFRONT_SGPR_COUNT in 6288 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`. 6289 ``.amdhsa_reserve_flat_scratch`` 1 GFX7-GFX10 Whether the kernel may use flat instructions to access 6290 scratch memory. Used to calculate 6291 GRANULATED_WAVEFRONT_SGPR_COUNT in 6292 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`. 6293 ``.amdhsa_reserve_xnack_mask`` Target GFX8-GFX10 Whether the kernel may trigger XNACK replay. 6294 Feature Used to calculate GRANULATED_WAVEFRONT_SGPR_COUNT in 6295 Specific :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`. 6296 (+xnack) 6297 ``.amdhsa_float_round_mode_32`` 0 GFX6-GFX10 Controls FLOAT_ROUND_MODE_32 in 6298 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`. 6299 Possible values are defined in 6300 :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`. 6301 ``.amdhsa_float_round_mode_16_64`` 0 GFX6-GFX10 Controls FLOAT_ROUND_MODE_16_64 in 6302 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`. 6303 Possible values are defined in 6304 :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`. 6305 ``.amdhsa_float_denorm_mode_32`` 0 GFX6-GFX10 Controls FLOAT_DENORM_MODE_32 in 6306 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`. 6307 Possible values are defined in 6308 :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`. 6309 ``.amdhsa_float_denorm_mode_16_64`` 3 GFX6-GFX10 Controls FLOAT_DENORM_MODE_16_64 in 6310 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`. 6311 Possible values are defined in 6312 :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`. 6313 ``.amdhsa_dx10_clamp`` 1 GFX6-GFX10 Controls ENABLE_DX10_CLAMP in 6314 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`. 6315 ``.amdhsa_ieee_mode`` 1 GFX6-GFX10 Controls ENABLE_IEEE_MODE in 6316 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`. 6317 ``.amdhsa_fp16_overflow`` 0 GFX9-GFX10 Controls FP16_OVFL in 6318 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`. 6319 ``.amdhsa_workgroup_processor_mode`` Target GFX10 Controls ENABLE_WGP_MODE in 6320 Feature :ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx10-table`. 6321 Specific 6322 (-cumode) 6323 ``.amdhsa_memory_ordered`` 1 GFX10 Controls MEM_ORDERED in 6324 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`. 6325 ``.amdhsa_forward_progress`` 0 GFX10 Controls FWD_PROGRESS in 6326 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`. 6327 ``.amdhsa_exception_fp_ieee_invalid_op`` 0 GFX6-GFX10 Controls ENABLE_EXCEPTION_IEEE_754_FP_INVALID_OPERATION in 6328 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`. 6329 ``.amdhsa_exception_fp_denorm_src`` 0 GFX6-GFX10 Controls ENABLE_EXCEPTION_FP_DENORMAL_SOURCE in 6330 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`. 6331 ``.amdhsa_exception_fp_ieee_div_zero`` 0 GFX6-GFX10 Controls ENABLE_EXCEPTION_IEEE_754_FP_DIVISION_BY_ZERO in 6332 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`. 6333 ``.amdhsa_exception_fp_ieee_overflow`` 0 GFX6-GFX10 Controls ENABLE_EXCEPTION_IEEE_754_FP_OVERFLOW in 6334 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`. 6335 ``.amdhsa_exception_fp_ieee_underflow`` 0 GFX6-GFX10 Controls ENABLE_EXCEPTION_IEEE_754_FP_UNDERFLOW in 6336 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`. 6337 ``.amdhsa_exception_fp_ieee_inexact`` 0 GFX6-GFX10 Controls ENABLE_EXCEPTION_IEEE_754_FP_INEXACT in 6338 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`. 6339 ``.amdhsa_exception_int_div_zero`` 0 GFX6-GFX10 Controls ENABLE_EXCEPTION_INT_DIVIDE_BY_ZERO in 6340 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`. 6341 ======================================================== =================== ============ =================== 6342 6343.amdgpu_metadata 6344++++++++++++++++ 6345 6346Optional directive which declares the contents of the ``NT_AMDGPU_METADATA`` 6347note record (see :ref:`amdgpu-elf-note-records-table-v3`). 6348 6349The contents must be in the [YAML]_ markup format, with the same structure and 6350semantics described in :ref:`amdgpu-amdhsa-code-object-metadata-v3`. 6351 6352This directive is terminated by an ``.end_amdgpu_metadata`` directive. 6353 6354.. _amdgpu-amdhsa-assembler-example-v3: 6355 6356Code Object V3 Example Source Code (-mattr=+code-object-v3) 6357~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 6358 6359Here is an example of a minimal assembly source file, defining one HSA kernel: 6360 6361.. code-block:: none 6362 6363 .amdgcn_target "amdgcn-amd-amdhsa--gfx900+xnack" // optional 6364 6365 .text 6366 .globl hello_world 6367 .p2align 8 6368 .type hello_world,@function 6369 hello_world: 6370 s_load_dwordx2 s[0:1], s[0:1] 0x0 6371 v_mov_b32 v0, 3.14159 6372 s_waitcnt lgkmcnt(0) 6373 v_mov_b32 v1, s0 6374 v_mov_b32 v2, s1 6375 flat_store_dword v[1:2], v0 6376 s_endpgm 6377 .Lfunc_end0: 6378 .size hello_world, .Lfunc_end0-hello_world 6379 6380 .rodata 6381 .p2align 6 6382 .amdhsa_kernel hello_world 6383 .amdhsa_user_sgpr_kernarg_segment_ptr 1 6384 .amdhsa_next_free_vgpr .amdgcn.next_free_vgpr 6385 .amdhsa_next_free_sgpr .amdgcn.next_free_sgpr 6386 .end_amdhsa_kernel 6387 6388 .amdgpu_metadata 6389 --- 6390 amdhsa.version: 6391 - 1 6392 - 0 6393 amdhsa.kernels: 6394 - .name: hello_world 6395 .symbol: hello_world.kd 6396 .kernarg_segment_size: 48 6397 .group_segment_fixed_size: 0 6398 .private_segment_fixed_size: 0 6399 .kernarg_segment_align: 4 6400 .wavefront_size: 64 6401 .sgpr_count: 2 6402 .vgpr_count: 3 6403 .max_flat_workgroup_size: 256 6404 ... 6405 .end_amdgpu_metadata 6406 6407If an assembly source file contains multiple kernels and/or functions, the 6408:ref:`amdgpu-amdhsa-assembler-symbol-next_free_vgpr` and 6409:ref:`amdgpu-amdhsa-assembler-symbol-next_free_sgpr` symbols may be reset using 6410the ``.set <symbol>, <expression>`` directive. For example, in the case of two 6411kernels, where ``function1`` is only called from ``kernel1`` it is sufficient 6412to group the function with the kernel that calls it and reset the symbols 6413between the two connected components: 6414 6415.. code-block:: none 6416 6417 .amdgcn_target "amdgcn-amd-amdhsa--gfx900+xnack" // optional 6418 6419 // gpr tracking symbols are implicitly set to zero 6420 6421 .text 6422 .globl kern0 6423 .p2align 8 6424 .type kern0,@function 6425 kern0: 6426 // ... 6427 s_endpgm 6428 .Lkern0_end: 6429 .size kern0, .Lkern0_end-kern0 6430 6431 .rodata 6432 .p2align 6 6433 .amdhsa_kernel kern0 6434 // ... 6435 .amdhsa_next_free_vgpr .amdgcn.next_free_vgpr 6436 .amdhsa_next_free_sgpr .amdgcn.next_free_sgpr 6437 .end_amdhsa_kernel 6438 6439 // reset symbols to begin tracking usage in func1 and kern1 6440 .set .amdgcn.next_free_vgpr, 0 6441 .set .amdgcn.next_free_sgpr, 0 6442 6443 .text 6444 .hidden func1 6445 .global func1 6446 .p2align 2 6447 .type func1,@function 6448 func1: 6449 // ... 6450 s_setpc_b64 s[30:31] 6451 .Lfunc1_end: 6452 .size func1, .Lfunc1_end-func1 6453 6454 .globl kern1 6455 .p2align 8 6456 .type kern1,@function 6457 kern1: 6458 // ... 6459 s_getpc_b64 s[4:5] 6460 s_add_u32 s4, s4, func1@rel32@lo+4 6461 s_addc_u32 s5, s5, func1@rel32@lo+4 6462 s_swappc_b64 s[30:31], s[4:5] 6463 // ... 6464 s_endpgm 6465 .Lkern1_end: 6466 .size kern1, .Lkern1_end-kern1 6467 6468 .rodata 6469 .p2align 6 6470 .amdhsa_kernel kern1 6471 // ... 6472 .amdhsa_next_free_vgpr .amdgcn.next_free_vgpr 6473 .amdhsa_next_free_sgpr .amdgcn.next_free_sgpr 6474 .end_amdhsa_kernel 6475 6476These symbols cannot identify connected components in order to automatically 6477track the usage for each kernel. However, in some cases careful organization of 6478the kernels and functions in the source file means there is minimal additional 6479effort required to accurately calculate GPR usage. 6480 6481Additional Documentation 6482======================== 6483 6484.. [AMD-RADEON-HD-2000-3000] `AMD R6xx shader ISA <http://developer.amd.com/wordpress/media/2012/10/R600_Instruction_Set_Architecture.pdf>`__ 6485.. [AMD-RADEON-HD-4000] `AMD R7xx shader ISA <http://developer.amd.com/wordpress/media/2012/10/R700-Family_Instruction_Set_Architecture.pdf>`__ 6486.. [AMD-RADEON-HD-5000] `AMD Evergreen shader ISA <http://developer.amd.com/wordpress/media/2012/10/AMD_Evergreen-Family_Instruction_Set_Architecture.pdf>`__ 6487.. [AMD-RADEON-HD-6000] `AMD Cayman/Trinity shader ISA <http://developer.amd.com/wordpress/media/2012/10/AMD_HD_6900_Series_Instruction_Set_Architecture.pdf>`__ 6488.. [AMD-GCN-GFX6] `AMD Southern Islands Series ISA <http://developer.amd.com/wordpress/media/2012/12/AMD_Southern_Islands_Instruction_Set_Architecture.pdf>`__ 6489.. [AMD-GCN-GFX7] `AMD Sea Islands Series ISA <http://developer.amd.com/wordpress/media/2013/07/AMD_Sea_Islands_Instruction_Set_Architecture.pdf>`_ 6490.. [AMD-GCN-GFX8] `AMD GCN3 Instruction Set Architecture <http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2013/12/AMD_GCN3_Instruction_Set_Architecture_rev1.1.pdf>`__ 6491.. [AMD-GCN-GFX9] `AMD "Vega" Instruction Set Architecture <http://developer.amd.com/wordpress/media/2013/12/Vega_Shader_ISA_28July2017.pdf>`__ 6492.. [AMD-GCN-GFX10] AMD "Navi" Instruction Set Architecture *TBA* 6493.. TODO 6494 ttye Add link when made public. 6495.. [AMD-ROCm] `ROCm: Open Platform for Development, Discovery and Education Around GPU Computing <http://gpuopen.com/compute-product/rocm/>`__ 6496.. [AMD-ROCm-github] `ROCm github <http://github.com/RadeonOpenCompute>`__ 6497.. [HSA] `Heterogeneous System Architecture (HSA) Foundation <http://www.hsafoundation.com/>`__ 6498.. [ELF] `Executable and Linkable Format (ELF) <http://www.sco.com/developers/gabi/>`__ 6499.. [DWARF] `DWARF Debugging Information Format <http://dwarfstd.org/>`__ 6500.. [YAML] `YAML Ain't Markup Language (YAML™) Version 1.2 <http://www.yaml.org/spec/1.2/spec.html>`__ 6501.. [MsgPack] `Message Pack <http://www.msgpack.org/>`__ 6502.. [OpenCL] `The OpenCL Specification Version 2.0 <http://www.khronos.org/registry/cl/specs/opencl-2.0.pdf>`__ 6503.. [HRF] `Heterogeneous-race-free Memory Models <http://benedictgaster.org/wp-content/uploads/2014/01/asplos269-FINAL.pdf>`__ 6504.. [CLANG-ATTR] `Attributes in Clang <http://clang.llvm.org/docs/AttributeReference.html>`__ 6505