.. _skiboot-6.3:

skiboot-6.3
===========

skiboot v6.3 was released on Friday May 3rd 2019. It is the first
release of skiboot 6.3, which becomes the new stable release
of skiboot following the 6.2 release, first released December 14th 2018.

Skiboot 6.3 will mark the basis for op-build v2.3.

skiboot v6.3 contains all bug fixes as of :ref:`skiboot-6.0.20`,
and :ref:`skiboot-6.2.3` (the currently maintained
stable releases).

For details on how the skiboot stable releases work, see :ref:`stable-rules`.

Over skiboot 6.2, we have the following changes:

.. _skiboot-6.3-new-features:

New Features
------------

- hw/imc: Enable opal calls to init/start/stop IMC Trace mode

  New OPAL APIs for the In-Memory Collection Counter infrastructure (IMC),
  including a new device type called OPAL_IMC_COUNTERS_TRACE.
- xive: Add calls to save/restore the queues and VPs HW state

  To be able to support migration of guests using the XIVE native
  exploitation mode (where the queue is effectively owned by the
  guest), KVM needs to be able to save and restore the HW-modified
  fields of the queue, such as the current queue producer pointer and
  generation bit, and to retrieve the modified thread context registers
  of the VP from the NVT structure: the VP interrupt pending bits.

  However, there is no need to set back the NVT structure on P9. P10
  should be the same.
- witherspoon: Add nvlink2 interconnect information

  GPUs on Redbud and Sequoia platforms are interconnected in groups of
  2 or 3 GPUs. The problem with that is if the user decides to pass a single
  GPU from a group to the userspace, we need to ensure that links between
  GPUs do not get enabled.

  A V100 GPU provides a way to disable selected links. In order to only
  disable links to peer GPUs, we need a topology map.

  This adds an "ibm,nvlink-peers" property to a GPU DT node with phandles
  of peer GPUs and NVLink2 bridges.
  The index in the property is a GPU link number.
- platforms/romulus: Also support talos

  The two are similar enough and I'd like to have a slot table for our
  Talos.
- OpenCAPI support! (see :ref:`skiboot-6.3-OpenCAPI` section)
- opal/hmi: set a flag to inform OS that TOD/TB has failed.

  Set a flag to inform the OS about TOD/TB failure as part of the new
  opal_handle_hmi2 handler. The OS can then use this flag to make sure
  functions depending on the TB value (e.g. udelay()) are aware that the
  TB is not ticking.
- astbmc: Enable IPMI HIOMAP for AMI platforms

  Required for Habanero, Palmetto and Romulus.
- power-mgmt : occ : Add 'freq-domain-mask' DT property

  Add a new device-tree property freq-domain-mask to define the group of
  CPUs which share the same frequency. This property has been added under
  the power-mgmt node. It is a bitmask.

  A bitwise AND is taken between this bitmask value and the PIR of each CPU.
  All the CPUs lying in the same frequency domain will have the same result
  for the AND.

  For example, on POWER9, 0xFFF0 indicates a quad-wide frequency domain.
  ANDing it with the PIR of each CPU yields a quad-wise frequency domain
  distribution, as the last 4 bits, which represent the cores, have been
  masked.

  Similarly, 0xFFF8 represents a core-wide frequency domain on P8.

  Also, add a new device-tree property domain-runs-at which denotes the
  strategy the OCC uses to change the frequency of a frequency domain. There
  are two strategies: FREQ_MOST_RECENTLY_SET and FREQ_MAX_IN_DOMAIN.

  FREQ_MOST_RECENTLY_SET : the OCC sets the frequency of the quad to the most
  recent frequency value requested by the CPUs in the quad.

  FREQ_MAX_IN_DOMAIN : the OCC sets the frequency of the CPUs in
  the Quad to the maximum of the latest frequency requested by each of
  the component cores.
- powercap: occ: Fix the powercapping range allowed for user

  OCC provides two limits for minimum powercap.
  One is the hard powercap
  minimum, which is guaranteed by the OCC; the other is a soft
  powercap minimum, which is lower than hard-min and may or may not be
  asserted due to various power-thermal reasons. So to allow the users
  to access the entire powercap range, this patch exports the soft powercap
  minimum as the "powercap-min" DT property. It also adds a new
  DT property called "powercap-hard-min" to export the hard-min powercap
  limit.
- Add NVDIMM support

  NVDIMMs are memory modules that use a battery backup system to allow the
  contents of RAM to be saved to non-volatile storage if system power goes
  away unexpectedly. This allows them to be used as a high-performance
  storage device, suitable for serving as a cache for SSDs and the like.

  Configuration of NVDIMMs is handled by hostboot and communicated to OPAL
  via the HDAT. We need to parse out the NVDIMM memory ranges and create
  memory regions with the "pmem-region" compatible label to make them
  available to the host.
- core/exceptions: implement support for MCE interrupts in powersave

  The ISA specifies that MCE interrupts in power saving modes will enter
  at 0x200 with powersave bits in SRR1 set. This is not currently
  supported properly: the MCE will just happen like a normal interrupt,
  but GPRs could be lost, which would lead to crashes (e.g., r1, r2, r13
  etc).

  So check the power save bits similarly to the sreset vector, and
  handle this properly.
- core/exceptions: allow recoverable sreset exceptions

  This requires implementing the MSR[RI] bit. Then just allow all
  non-fatal sreset exceptions to recover.
- core/exceptions: implement an exception handler for non-powersave sresets

  Detect non-powersave sresets and send them to the normal exception
  handler which prints registers and stack.
- Add PVR_TYPE_P9P

  Enable a new PVR to get us running on another p9 variant.
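
The frequency-domain masking described in the power-mgmt entry above can be
sketched in C as follows. This is only an illustration: the two mask values
come from the text, while the macro and function names are hypothetical and
are not taken from the skiboot or Linux sources.

.. code-block:: c

   #include <stdbool.h>
   #include <stdint.h>

   /* Mask values quoted in the release note; the macro names are made up. */
   #define P9_QUAD_FREQ_DOMAIN_MASK 0xFFF0u /* quad-wide domains on POWER9 */
   #define P8_CORE_FREQ_DOMAIN_MASK 0xFFF8u /* core-wide domains on POWER8 */

   /* A CPU's frequency domain is its PIR ANDed with the mask. */
   static inline uint32_t freq_domain_id(uint32_t pir, uint32_t mask)
   {
       return pir & mask;
   }

   /* Two CPUs share a frequency domain when their masked PIRs are equal. */
   static inline bool same_freq_domain(uint32_t pir_a, uint32_t pir_b,
                                       uint32_t mask)
   {
       return freq_domain_id(pir_a, mask) == freq_domain_id(pir_b, mask);
   }

With the POWER9 mask, PIRs 0x10 and 0x1F fall into the same domain (both
reduce to 0x10), while 0x0F and 0x10 do not.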

Since v6.3-rc2:

- Expose PNOR Flash partitions to host MTD driver via devicetree

  This makes it possible for the host to directly address each
  partition without requiring each application to directly parse
  the FFS headers. This has been in use for some time already to
  allow BOOTKERNFW partition updates from the host.

  All partitions except BOOTKERNFW are marked readonly.

  The BOOTKERNFW partition is currently exclusively used by the Talos II platform.

- Write boot progress to LPC port 80h

  This is an adaptation of what we currently do for op_display() on FSP
  machines, inventing an encoding for what we can write into the single
  byte at LPC port 80h.

  Port 80h is often used on x86 systems to indicate boot progress/status
  and dates back a decent amount of time. Since a byte isn't exactly very
  expressive for everything that can go on (and wrong) during boot, it's
  all about compromise.

  Some systems (such as Zaius/Barreleye G2) have a physical dual 7 segment
  display that displays these codes. So far, this has only been driven by
  hostboot (see hostboot commit 90ec2e65314c).

- Write boot progress to LPC ports 81 and 82

  There's a thought to write more extensive boot progress codes to LPC
  ports 81 and 82 to supplement/replace any reliance on port 80.

  We want to still emit port 80 for platforms like Zaius and Barreleye
  that have the physical display. Ports 81 and 82 can be monitored by a
  BMC though.

- Add Talos II platform

  Talos II has some hardware differences from Romulus, therefore
  we cannot guarantee Talos II == Romulus in skiboot. Copy and
  slightly modify the Romulus files for Talos II.

Since v6.3-rc1:

- cpufeatures: Add tm-suspend-hypervisor-assist and tm-suspend-xer-so-bug node

  tm-suspend-hypervisor-assist for P9 >=DD2.2
  And a tm-suspend-xer-so-bug node for P9 DD2.2 only.

  I also treat P9P as P9 DD2.3 and add a unit test for the cpufeatures
  infrastructure.

  Fixes: https://github.com/open-power/skiboot/issues/233


Deprecated/Removed Features
---------------------------

- opal: Deprecate reading the PHB status

  The OPAL_PCI_EEH_FREEZE_STATUS call takes a bunch of parameters, one of
  them is @phb_status. It is defined as __be64* and is always NULL in
  the current Linux upstream, but if anyone ever decides to read that status,
  then the PHB3's handler will assume it is struct OpalIoPhb3ErrorData*
  (which is a lot bigger than 8 bytes) and zero it, corrupting the
  stack; p7ioc-phb has the same issue.

  This removes @phb_status from all eeh_freeze_status() hooks and moves
  the error message from PHB4 to the affected OPAL handlers.

  As far as we can tell, nobody has ever used this and thus it's safe to remove.
- Remove POWER9N DD1 support

  This is not a shipping product and is no longer supported by Linux
  or other firmware components.

Since v6.3-rc3:

- Disable fast-reset for POWER8

  There is a bug with fast-reset when CPU cores are busy, which can be
  reproduced by running ``stress`` and then trying ``reboot -ff`` (this is
  what the op-test test cases FastRebootHostStress and
  FastRebootHostStressTorture do). What happens is the cores lock up,
  which isn't the best thing in the world when you want them to start
  executing instructions again.

  A workaround is to use instruction ramming, which while greatly
  increasing the reliability of fast-reset on p8, doesn't make it perfect.

  Instruction ramming is what pdbg was modified to do in order to have the
  sreset functionality work reliably on p8.
  pdbg patches: https://patchwork.ozlabs.org/project/pdbg/list/?series=96593&state=*

  Fixes: https://github.com/open-power/skiboot/issues/185

General
-------

- core/i2c: Various bits of refactoring
- refactor backtrace generation infrastructure
- astbmc: Handle failure to initialise raw flash

  Initialising raw flash led to a dead assignment to rc. Check the return
  code and take the failure path as necessary. Both before and after the
  fix we see output along the lines of the following when flash_init()
  fails: ::

    [ 53.283182881,7] IRQ: Registering 0800..0ff7 ops @0x300d4b98 (data 0x3052b9d8)
    [ 53.283184335,7] IRQ: Registering 0ff8..0fff ops @0x300d4bc8 (data 0x3052b9d8)
    [ 53.283185513,7] PHB#0000: Initializing PHB...
    [ 53.288260827,4] FLASH: Can't load resource id:0. No system flash found
    [ 53.288354442,4] FLASH: Can't load resource id:1. No system flash found
    [ 53.342933439,3] CAPP: Error loading ucode lid. index=200ea
    [ 53.462749486,2] NVRAM: Failed to load
    [ 53.462819095,2] NVRAM: Failed to load
    [ 53.462894236,2] NVRAM: Failed to load
    [ 53.462967071,2] NVRAM: Failed to load
    [ 53.463033077,2] NVRAM: Failed to load
    [ 53.463144847,2] NVRAM: Failed to load

  Eventually followed by: ::

    [ 57.216942479,5] INIT: platform wait for kernel load failed
    [ 57.217051132,5] INIT: Assuming kernel at 0x20000000
    [ 57.217127508,3] INIT: ELF header not found. Assuming raw binary.
    [ 57.217249886,2] NVRAM: Failed to load
    [ 57.221294487,0] FATAL: Kernel is zeros, can't execute!
    [ 57.221397429,0] Assert fail: core/init.c:615:0
    [ 57.221471414,0] Aborting!
    CPU 0028 Backtrace:
    S: 0000000031d43c60 R: 000000003001b274 ._abort+0x4c
    S: 0000000031d43ce0 R: 000000003001b2f0 .assert_fail+0x34
    S: 0000000031d43d60 R: 0000000030014814 .load_and_boot_kernel+0xae4
    S: 0000000031d43e30 R: 0000000030015164 .main_cpu_entry+0x680
    S: 0000000031d43f00 R: 0000000030002718 boot_entry+0x1c0
    --- OPAL boot ---

  Analysis of the execution paths suggests we'll always "safely" end this
  way due to the setup sequence for the blocklevel callbacks in flash_init()
  and error handling in blocklevel_get_info(), and there's no current risk
  of executing from unexpected memory locations. As such the issue is
  reduced to a fix for poor error hygiene in the original change
  and a resolution for a Coverity warning (famous last words etc).
- core/flash: Retry requests as necessary in flash_load_resource()

  We would like to successfully boot if we have a dependency on the BMC
  for flash even if the BMC is not currently ready to service flash
  requests. On the assumption that it will become ready, retry for several
  minutes to cover a BMC reboot cycle and *eventually* rather than
  *immediately* crash out with: ::

    [ 269.549748] reboot: Restarting system
    [ 390.297462587,5] OPAL: Reboot request...
    [ 390.297737995,5] RESET: Initiating fast reboot 1...
    [ 391.074707590,5] Clearing unused memory:
    [ 391.075198880,5] PCI: Clearing all devices...
    [ 391.075201618,7] Clearing region 201ffe000000-201fff800000
    [ 391.086235699,5] PCI: Resetting PHBs and training links...
    [ 391.254089525,3] FFS: Error 17 reading flash header
    [ 391.254159668,3] FLASH: Can't open ffs handle: 17
    [ 392.307245135,5] PCI: Probing slots...
    [ 392.363723191,5] PCI Summary:
    ...
    [ 393.423255262,5] OCC: All Chip Rdy after 0 ms
    [ 393.453092828,5] INIT: Starting kernel at 0x20000000, fdt at
    0x30800a88 390645 bytes
    [ 393.453202605,0] FATAL: Kernel is zeros, can't execute!
    [ 393.453247064,0] Assert fail: core/init.c:593:0
    [ 393.453289682,0] Aborting!
    CPU 0040 Backtrace:
    S: 0000000031e03ca0 R: 000000003001af60 ._abort+0x4c
    S: 0000000031e03d20 R: 000000003001afdc .assert_fail+0x34
    S: 0000000031e03da0 R: 00000000300146d8 .load_and_boot_kernel+0xb30
    S: 0000000031e03e70 R: 0000000030026cf0 .fast_reboot_entry+0x39c
    S: 0000000031e03f00 R: 0000000030002a4c fast_reset_entry+0x2c
    --- OPAL boot ---

  The OPAL flash API hooks directly into the blocklevel layer, so there's
  no delay for e.g. the host kernel, just for asynchronously loaded
  resources during boot.
- fast-reboot: occ: Call occ_pstates_init() on fast-reset on all machines

  Commit 815417dcda2e ("init, occ: Initialise OCC earlier on BMC systems")
  conditionally invoked occ_pstates_init() only on FSP based systems in
  load_and_boot_kernel(). As a result, the pstate table is re-parsed on FSP
  systems but skipped on BMC systems during fast-reboot. Fix this by
  invoking occ_pstates_init() on all machines during fast-reboot.
- opal/hmi: Don't retry TOD recovery if it is already in failed state.

  On TOD failure, all cores/threads receive an HMI; the very first thread
  that gets the interrupt fixes the TOD, whereas the others just reset their
  respective HMER error bit and return. But when the TOD is unrecoverable,
  all the threads try to do TOD recovery one by one, causing threads to spend
  more time inside OPAL. Set a global flag when the TOD is unrecoverable so
  that the rest of the threads go back to Linux immediately, avoiding lockups
  in the system reboot/panic path.
- hw/bt: Do not disable ipmi message retry during OPAL boot

  Currently OPAL doesn't know whether BMC is functioning or not.
  If the BMC is
  down (e.g. during a BMC reboot), we keep retrying to send messages to the
  BMC. So in some corner cases we may hit a hard lockup in the kernel.

  Ideally we should avoid using the synchronous path as much as possible. But
  for now commit 01f977c3 added an option to disable message retry in the
  synchronous path. That fix is not required during boot, so do not disable
  IPMI message retry during OPAL boot.
- hdata/memory: Fix warning message

  Even though the memory was added to the device tree, we got the warning below. ::

    [ 57.136949696,3] Unable to use memory range 0 from MSAREA 0
    [ 57.137049753,3] Unable to use memory range 0 from MSAREA 1
    [ 57.137152335,3] Unable to use memory range 0 from MSAREA 2
    [ 57.137251218,3] Unable to use memory range 0 from MSAREA 3
- hw/bt: Add backend interface to disable ipmi message retry option

  During boot OPAL makes an IPMI_GET_BT_CAPS call to the BMC to get the BT
  interface capabilities, which include the IPMI message max resend count,
  message timeout, etc. Most of the time OPAL gets a response from the BMC
  within the specified timeout. In some corner cases (like an mboxd daemon
  reset in the BMC, a BMC reboot, etc) OPAL may not get a response within the
  timeout period. In such scenarios, OPAL resends the message until the max
  resend count is reached.

  OPAL uses synchronous IPMI messages (ipmi_queue_msg_sync()) for a few
  operations like flash read, write, etc. The thread will wait in OPAL until
  it gets a response from the BMC. In some corner cases like a BMC reboot, the
  thread may wait in OPAL for a long time (more than 20 seconds), which
  results in a kernel hard lockup.

  This patch introduces a new interface to disable the message resend option.
  We will disable the message resend option for synchronous messages. This
  greatly reduces kernel hard lockup issues.

  This is a short term fix. The long term solution is to convert all
  synchronous messages to asynchronous ones.
- ipmi/power: Fix system reboot issue

  The kernel makes a reboot/shutdown OPAL call for reboot/shutdown. Once the
  kernel gets a response from OPAL it runs opal_poll_events() until firmware
  handles the request.

  On BMC based systems, OPAL makes an IPMI call (IPMI_CHASSIS_CONTROL) to
  initiate system reboot/shutdown. At present OPAL queues the IPMI messages
  and returns SUCCESS to the host. If the BMC is not ready to accept commands
  (e.g. during a BMC reboot), then these messages will fail. We have to
  manually reboot/shutdown the system using the BMC interface.

  This patch adds logic to validate the message return value. If the message
  failed, then it will resend the message. At some stage the BMC will be
  ready to accept the message and handle it.
- firmware-versions: Add test case for parsing VERSION

  Also make it possible to use with afl-lop/afl-fuzz just to help make
  *sure* we're all good.

  Additionally, if we hit an entry in VERSION that is larger than our
  buffer size, we skip over it gracefully rather than overwriting the
  stack. This is only a problem if VERSION isn't trusted, which as of
  4b8cc05a94513816d43fb8bd6178896b430af08f it is verified as part of
  Secure Boot.
- core/fast-reboot: improve NMI handling during fast reset

  Improve sreset and MCE handling in fast reboot. Switch the HILE bit
  off before copying OPAL's exception vectors, so NMIs can be handled
  properly. Also disable MSR[ME] while the vectors are being overwritten.
- core/cpu: HID update race

  If the per-core HID register is updated concurrently by multiple
  threads, updates can get lost. This has been observed during fast
  reboot where the HILE bit does not get cleared on all cores, which
  can cause machine check exception interrupts to crash.

  Fix this by only updating HID on thread0.
- SLW: Print verbose info on errors only

  Change print level from debug to warning for reporting
  bad EC_PPM_SPECIAL_WKUP_* scom values. To reduce clutter
  in the log, print only on error.

Since v6.3-rc2:

- hw/xscom: add missing P9P chip name
- asm/head: balance branches to avoid link stack predictor mispredicts

  The Linux wrapper for OPAL call and return is arranged like this: ::

    __opal_call:
        mflr   r0
        std    r0,PPC_STK_LROFF(r1)
        LOAD_REG_ADDR(r11, opal_return)
        mtlr   r11
        hrfid  -> OPAL

    opal_return:
        ld     r0,PPC_STK_LROFF(r1)
        mtlr   r0
        blr

  When skiboot returns to Linux, it branches to LR (i.e., opal_return)
  with a blr. This unbalances the link stack predictor and will cause
  mispredicts back up the return stack.
- external/mambo: also invoke readline for the non-autorun case
- asm/head.S: set POWER9 radix HID bit at entry

  When running in virtual memory mode, the radix MMU hid bit should not
  be changed, so set this in the initial boot SPR setup.

  As a side effect, fast reboot also has HID0:RADIX bit set by the
  shared spr init, so no need for an explicit call.
- build: link with --orphan-handling=warn

  The linker can warn when the linker script does not explicitly place
  all sections. These orphan sections are placed according to
  heuristics, which may not always be desirable. Enable this warning.
- build: -fno-asynchronous-unwind-tables

  skiboot does not use unwind tables; this option saves about 100kB,
  mostly from .text.
- opal/hmi: Initialize the hmi event with old value of TFMR.

  Do this before we fix TFAC errors. Otherwise the event at host console
  shows no thread error reported in TFMR register.

  Without this patch the console event shows TFMR with no thread error:
  (DEC parity error TFMR[59] injection) ::

    [ 53.737572] Severe Hypervisor Maintenance interrupt [Recovered]
    [ 53.737596] Error detail: Timer facility experienced an error
    [ 53.737611] HMER: 0840000000000000
    [ 53.737621] TFMR: 3212000870e04000

  After this patch it shows old TFMR value on host console: ::

    [ 2302.267271] Severe Hypervisor Maintenance interrupt [Recovered]
    [ 2302.267305] Error detail: Timer facility experienced an error
    [ 2302.267320] HMER: 0840000000000000
    [ 2302.267330] TFMR: 3212000870e14010


IBM FSP based platforms
-----------------------

- platforms/firenze: Rework I2C controller fixups
- platforms/zz: Re-enable LXVPD slot information parsing

  From memory this was disabled in the distant past since we were waiting
  for an update to the LXVPD format. It looks like that never happened
  so re-enable it for the ZZ platform so that we can get PCI slot location
  codes on ZZ.

HIOMAP
------

- astbmc: Try IPMI HIOMAP for P8

  The HIOMAP protocol was developed after the release of P8 in preparation
  for P9. As a consequence P9 always uses it, but it has rarely been
  enabled for P8. P8DTU has recently added IPMI HIOMAP support to its BMC
  firmware, so enable its use in skiboot with P8 machines. Doing so
  requires some rework to ensure fallback works correctly as in the past
  the fallback was to mbox, which will only work for P9.
- libflash/ipmi-hiomap: Enforce message size for empty response

  The protocol defines the response to the associated messages as empty
  except for the command ID and sequence fields. If the BMC is returning
  extra data, consider the message malformed.
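
  A minimal sketch of such a size check is shown below. It assumes the
  command ID and the sequence number each occupy one byte, in that order;
  the constant and function names are hypothetical and do not come from
  libflash.

  .. code-block:: c

     #include <stdbool.h>
     #include <stddef.h>
     #include <stdint.h>

     /* Assumed layout: byte 0 = command ID, byte 1 = sequence number. */
     #define EMPTY_RESP_SIZE 2u /* hypothetical name */

     /*
      * Accept an "empty" response only if it carries exactly the command ID
      * and sequence fields and both match the request. Anything shorter or
      * longer is treated as malformed.
      */
     static inline bool empty_resp_ok(const uint8_t *resp, size_t resp_size,
                                      uint8_t cmd, uint8_t seq)
     {
         if (resp_size != EMPTY_RESP_SIZE)
             return false;
         return resp[0] == cmd && resp[1] == seq;
     }

  Rejecting on an exact size mismatch, rather than only on a short response,
  is what catches a BMC that appends unexpected data.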
- libflash/ipmi-hiomap: Remove unused close handling

  Issuing a HIOMAP_C_CLOSE is not required by the protocol specification,
  rather a close can be implicit in a subsequent
  CREATE_{READ,WRITE}_WINDOW request. The implicit close provides an
  opportunity to reduce LPC traffic and the implementation takes up that
  optimisation, so remove the case from the IPMI callback handler.
- libflash/ipmi-hiomap: Overhaul event handling

  Reworking the event handling was inspired by a bug report by Vasant
  where the host would get wedged on multiple flash access attempts in the
  face of a persistent error state on the BMC-side. The cause of this bug
  was the early-exit based on ctx->update, which erroneously assumed that
  all events had been completely handled in prior calls to
  ipmi_hiomap_handle_events(). This is not true if e.g.
  HIOMAP_E_DAEMON_READY is clear in the prior calls.

  Regardless, there were other correctness and efficiency problems with
  the handling strategy:

  * Ack-able event state was not restored in the face of errors in the
    process of re-establishing protocol state
  * It forced needless window restoration with respect to the context in
    which ipmi_hiomap_handle_events() was called.
  * Tests for HIOMAP_E_DAEMON_READY and HIOMAP_E_FLASH_LOST were redundant
    with the overhauled error handling introduced in the previous patch

  Fix all of the above issues and add comments to explain the event
  handling flow.
- libflash/ipmi-hiomap: Overhaul error handling

  The aim is to improve the robustness with respect to absence of the
  BMC-side daemon. The current error handling roughly mirrors what was
  done for the mailbox implementation, but there's room for improvement.

  Errors are split into two classes, those that affect the transport state
  and those that affect the window validity.
  From here, we push the
  transport state error checks right to the bottom of the stack, to ensure
  the link is known to be in a good state before any message is sent.
  Window validity tests remain as they were in the hiomap_window_move()
  and ipmi_hiomap_read() functions. Validity tests are not necessary in
  the write and erase paths as we will receive an error response from the
  BMC when performing a dirty or flush on an invalid window.

  Recovery also remains as it was, done on entry to the blocklevel
  callbacks. If an error state is encountered in the middle of an
  operation no attempt is made to recover it on the spot, instead the
  error is returned up the stack and the caller can choose how it wishes
  to respond.
- libflash/ipmi-hiomap: Fix leak of msg in callback

Since v6.3-rc1:

- libflash/ipmi-hiomap: Fix blocks count issue

  We convert the data size to a block count and pass the block count to the
  BMC. If the data size is not block aligned then we end up sending a block
  count smaller than the actual data. The BMC will then write only partial
  data to flash memory.

  Sample log ::

    [ 594.388458416,7] HIOMAP: Marked flash dirty at 0x42010 for 8
    [ 594.398756487,7] HIOMAP: Flushed writes
    [ 594.409596439,7] HIOMAP: Marked flash dirty at 0x42018 for 3970
    [ 594.419897507,7] HIOMAP: Flushed writes

  In this case HIOMAP sent data with block count=0 and hence BMC didn't
  flush data to flash.



POWER8
------

- hw/phb3/naples: Disable D-states

  Putting "Mellanox Technologies MT27700 Family [ConnectX-4] [15b3:1013]"
  (more precisely, the second of its 2 PCI functions, no matter in what
  order) into the D3 state causes EEH with the "PCT timeout" error.
  This has been noticed on garrison machines only and firestones do not
  seem to have this issue.

  This disables D-state changes for devices on root buses on Naples by
  installing a config space access filter (copied from PHB4).
- cpufeatures: Always advertise POWER8NVL as DD2

  Despite the major version of PVR being 1 (0x004c0100) for POWER8NVL,
  these chips are functionally equivalent to P8/P8E DD2 levels.

  This advertises POWER8NVL as DD2. As a result, skiboot adds
  ibm,powerpc-cpu-features/processor-control-facility for such CPUs and
  the Linux kernel can use hypervisor doorbell messages to wake secondary
  threads; otherwise "KVM: CPU %d seems to be stuck" would appear because
  of missing LPCR_PECEDH.

p8dtu Platform
^^^^^^^^^^^^^^

- p8dtu: Configure BMC graphics

  We can no longer read the values from the BMC in the way we have in the
  past. Values were provided by Eric Chen of SMC.
- p8dtu: Enable HIOMAP support

Vesnin Platform
^^^^^^^^^^^^^^^

- platforms/vesnin: Disable PCIe port bifurcation

  PCIe ports connected to CPU1 and CPU3 now work as x16 instead of x8x8.

- Fix hang in pnv_platform_error_reboot path due to TOD failure.

  On TOD failure, with the TB stuck, when Linux heads down the
  pnv_platform_error_reboot() path due to an unrecoverable HMI event, the
  panic cpu gets stuck in OPAL inside ipmi_queue_msg_sync(). At this time,
  all the other cpus are in smp_handle_nmi_ipi() waiting for the panic cpu to
  proceed. But with the panic cpu stuck inside OPAL, Linux never
  recovers/reboots.
  ::

    p0 c1 t0
    NIA : 0x000000003001dd3c <.time_wait+0x64>
    CFAR : 0x000000003001dce4 <.time_wait+0xc>
    MSR : 0x9000000002803002
    LR : 0x000000003002ecf8 <.ipmi_queue_msg_sync+0xec>

    STACK: SP NIA
    0x0000000031c236e0 0x0000000031c23760 (big-endian)
    0x0000000031c23760 0x000000003002ecf8 <.ipmi_queue_msg_sync+0xec>
    0x0000000031c237f0 0x00000000300aa5f8 <.hiomap_queue_msg_sync+0x7c>
    0x0000000031c23880 0x00000000300aaadc <.hiomap_window_move+0x150>
    0x0000000031c23950 0x00000000300ab1d8 <.ipmi_hiomap_write+0xcc>
    0x0000000031c23a90 0x00000000300a7b18 <.blocklevel_raw_write+0xbc>
    0x0000000031c23b30 0x00000000300a7c34 <.blocklevel_write+0xfc>
    0x0000000031c23bf0 0x0000000030030be0 <.flash_nvram_write+0xd4>
    0x0000000031c23c90 0x000000003002c128 <.opal_write_nvram+0xd0>
    0x0000000031c23d20 0x00000000300051e4 <opal_entry+0x134>
    0xc000001fea6e7870 0xc0000000000a9060 <opal_nvram_write+0x80>
    0xc000001fea6e78c0 0xc000000000030b84 <nvram_write_os_partition+0x94>
    0xc000001fea6e7960 0xc0000000000310b0 <nvram_pstore_write+0xb0>
    0xc000001fea6e7990 0xc0000000004792d4 <pstore_dump+0x1d4>
    0xc000001fea6e7ad0 0xc00000000018a570 <kmsg_dump+0x140>
    0xc000001fea6e7b40 0xc000000000028e5c <panic_flush_kmsg_end+0x2c>
    0xc000001fea6e7b60 0xc0000000000a7168 <pnv_platform_error_reboot+0x68>
    0xc000001fea6e7bd0 0xc0000000000ac9b8 <hmi_event_handler+0x1d8>
    0xc000001fea6e7c80 0xc00000000012d6c8 <process_one_work+0x1b8>
    0xc000001fea6e7d20 0xc00000000012da28 <worker_thread+0x88>
    0xc000001fea6e7db0 0xc0000000001366f4 <kthread+0x164>
    0xc000001fea6e7e20 0xc00000000000b65c <ret_from_kernel_thread+0x5c>

  This is because, there is a while loop towards the end of
  ipmi_queue_msg_sync() which keeps looping until "sync_msg" does not match
  with "msg". It loops over time_wait_ms() until exit condition is met.
  In the
  normal scenario time_wait_ms() runs the pollers so that the ipmi backend
  gets a chance to check the ipmi response and set sync_msg to NULL. ::

    while (sync_msg == msg)
            time_wait_ms(10);

  But when the TB is in a failed state, time_wait_ms()->time_wait_poll()
  returns immediately without calling the pollers, and hence we end up
  looping forever. This patch fixes this hang by calling opal_run_pollers()
  in the TB failed state as well.


.. _skiboot-6.3-power9:

POWER9
------

- Retry link training at PCIe GEN1 if presence detected but training repeatedly failed

  Certain older PCIe 1.0 devices will not train unless the training process starts at GEN1 speeds.
  As a last resort when a device will not train, fall back to GEN1 speed for the last training attempt.

  This is verified to fix devices based on the Conexant CX23888 on the Talos II platform.
- hw/phb4: Drop FRESET_DEASSERT_DELAY state

  The delay between the ASSERT_DELAY and DEASSERT_DELAY states is set to
  one timebase tick. This state seems to have been a hold over from PHB3
  where it was used to add a 1s delay between de-asserting PERST and
  polling the link for the CAPI FPGA. There's no requirement for that here
  since the link polling on PHB4 is a bit smarter so we should be fine.
- hw/phb4: Factor out PERST control

  Some time ago Mikey added some code to work around a bug we found where a
  certain RAID card wouldn't come back again after a fast-reboot. The
  workaround is to set the Link Disable bit before asserting PERST and
  clear it after de-asserting PERST.

  Currently we do this in the FRESET path, but not in the CRESET path.
  This patch moves the PERST control into its own function to reduce
  duplication and so that the workaround is applied in all circumstances.
- hw/phb4: Remove FRESET presence check

  When we do an freset the first step is to check if a card is present in
  the slot. However, this only occurs when we enter phb4_freset() with the
  slot state set to SLOT_NORMAL. This occurs in:

  a) The creset path, and
  b) When the OS manually requests an FRESET via an OPAL call.

  (a) is problematic because in the boot path the generic code will put the
  slot into FRESET_START manually before calling into phb4_freset(). This
  can result in a situation where a device is detected on boot, but not
  after a CRESET.

  I've noticed this occurring on systems where the PHB's slot presence
  detect signal is not wired to an adapter. In this situation we can rely
  on the in-band presence mechanism, but the presence check will make
  us exit before that has a chance to work.

  Additionally, if we enter from the CRESET path this early exit leaves
  the slot's PERST signal asserted. This isn't currently an issue,
  but if we want to support hotplug of devices into the root port it will
  be.
- hw/phb4: Skip FRESET PERST when coming from CRESET

  PERST is asserted at the beginning of the CRESET process to prevent
  the downstream device from interacting with the host while the PHB logic
  is being reset and re-initialised. There is at least a 100ms wait during
  the CRESET processing so it's not necessary to wait this time again
  in the FRESET handler.

  This patch extends the delay after resetting the PHB logic to the
  250ms PERST wait period that we typically use and sets the
  skip_perst flag so that we don't wait this time again in the FRESET
  handler.
- hw/phb4: Look for the hub-id in the PBCQ node

  The hub-id is stored in the PBCQ node rather than the stack node so we
  never add it to the PHB node.
  This breaks the lxvpd slot lookup code
  since the hub-id is encoded in the VPD record that we need to find the
  slot information.
- hdata/iohub: Look for IOVPD on P9

  P8 and P9 use the same IO VPD setup, so we need to load the IOHUB VPD on
  P9 systems too.

Since v6.3-rc2:

- hw/phb4: Squash the IO bridge window

  The PCI-PCI bridge spec says that bridges that do not implement an IO
  window should hardcode the IO base and limit registers to zero.
  Unfortunately, these registers only define the upper bits of the IO
  window and the low bits are assumed to be 0 for the base and 1 for the
  limit address. As a result, setting both to zero can be mis-interpreted
  as a 4K IO window.

  This patch fixes the problem the same way PHB3 does. It sets the IO base
  and limit values to 0xf000 and 0x1000 respectively which most software
  interprets as a disabled window.

  lspci before patch: ::

      0000:00:00.0 PCI bridge: IBM Device 04c1 (prog-if 00 [Normal decode])
              I/O behind bridge: 00000000-00000fff

  lspci after patch: ::

      0000:00:00.0 PCI bridge: IBM Device 04c1 (prog-if 00 [Normal decode])
              I/O behind bridge: None

- hw/xscom: Enable sw xstop by default on p9

  This was disabled at some point during bringup to make life easier for
  the lab folks trying to debug NVLink issues. This hack really should
  have never made it out into the wild though, so we now have the
  following situation occurring in the field:

  1) Something bad happens.
  2) The host kernel receives an unrecoverable HMI and calls into OPAL to
     request a platform reboot.
  3) OPAL rejects the reboot attempt and returns to the kernel with
     OPAL_PARAMETER.
  4) The kernel panics and attempts to kexec into a kdump kernel.

  A side effect of the HMI seems to be CPUs becoming stuck which results
  in the initialisation of the kdump kernel taking an extremely long time
  (6+ hours).
  It's also been observed that after performing a dump the
  kdump kernel then crashes itself because OPAL has ended up in a bad
  state as a side effect of the HMI.

  All up, it's not very good so re-enable the software checkstop by
  default. If people still want to turn it off they can using the nvram
  override.


CAPI2
^^^^^
- capp/phb4: Prevent HMI from getting triggered when disabling CAPP

  While disabling CAPP an HMI gets triggered as soon as ETU is put in
  reset mode. This happens because, before we can disable CAPP, it detects
  the PHB link going down and triggers an HMI requesting Opal to perform
  CAPP recovery. This has the unintended side effect of spamming the
  Opal logs with malfunction alert messages and may also confuse the
  user.

  To prevent this we mask the CAPP FIR error 'PHB Link Down' Bit(31)
  when we are disabling CAPP just before we put ETU in reset in
  phb4_creset(). Since bringing down the PHB link now won't
  trigger an HMI and CAPP recovery, we manually set the
  PHB4_CAPP_RECOVERY flag on the phb to force recovery during creset.

- phb4/capp: Implement sequence to disable CAPP and enable fast-reset

  We implement the h/w sequence to disable CAPP in disable_capi_mode() and
  with it also enable fast-reset for CAPI mode in phb4_set_capi_mode().

  The sequence to disable CAPP is executed in three phases. The first two
  phases are implemented in disable_capi_mode() where we reset the CAPP
  registers followed by PEC registers to their init values. The third and
  final phase is to reset the PHB CAPI Compare/Mask Register and
  is done in phb4_init_ioda3(). The reason to move the PHB reset to
  phb4_init_ioda3() is because by the time the Opal PCI reset state machine
  reaches this function the PHB is already un-fenced and its
  configuration registers are accessible via mmio.
- capp/phb4: Force CAPP to PCIe mode during kernel shutdown

  This patch introduces a new opal syncer for PHB4 named
  phb4_host_sync_reset(). We register this opal syncer when CAPP is
  activated successfully in phb4_set_capi_mode() so that it will be
  called at kernel shutdown during fast-reset.

  During kernel shutdown the function will then repeatedly call
  phb->ops->set_capi_mode() to switch CAPP to PCIe mode. If
  set_capi_mode() returns OPAL_BUSY, which indicates that CAPP is
  still transitioning to the new state, it calls slot->ops.run_sm() to
  ensure that the Opal slot reset state machine makes forward progress.


Witherspoon Platform
^^^^^^^^^^^^^^^^^^^^
- platforms/witherspoon: Make PCIe shared slot error message more informative

  If we're missing chips for some reason, we print a warning when configuring
  the PCIe shared slot.

  The warning doesn't really make it clear what "shared slot" is, and if it's
  printed, it'll come right after a bunch of messages about NPU setup, so
  let's clarify the message to explicitly mention PCI.
- witherspoon: Add nvlink2 interconnect information

  See :ref:`skiboot-6.3-new-features` for details.

Zaius Platform
^^^^^^^^^^^^^^

- zaius: Add BMC description

  Frederic reported that Zaius was failing with a NULL dereference when
  trying to initialise IPMI HIOMAP. It turns out that the BMC wasn't
  described at all, so add a description.

p9dsu platform
^^^^^^^^^^^^^^
- p9dsu: Fix p9dsu default variant

  Add the default when no riser_id is returned from the ipmi query.

  Allow a little more time for the BMC reply and clean up some label strings.


PCIe
----

See :ref:`skiboot-6.3-power9` for POWER9 specific PCIe changes.

- core/pcie-slot: Don't bail early in the power on case

  Exiting early in the power off case makes sense since we can't disable
  slot power (or assert PERST) for surprise hotplug slots. However, we
  should not exit early in the power-on case since it's possible slot
  power may have been disabled (or just not enabled at boot time).
- firenze-pci: Always init slot info from LXVPD

  We can get slot information from the LXVPD without having power control
  information about that slot. This patch changes the init path so that
  we always override the add_properties() call rather than only when we
  have power control information about the slot.
- fsp/lxvpd: Print more LXVPD slot information

  Useful to know since it changes the behaviour of the slot core.
- core/pcie-slot: Set power state from the PWRCTL flag

  For some reason we look at the power control indicator and use that to
  determine if the slot is "off" rather than the power control flag that
  is used to power down the slot.

  While we're here change the default behaviour so that the slot is
  assumed to be powered on if there's no slot capability, or if there's
  no power control available.
- core/pci: Increase the max slot string size

  The maximum string length for the slot label / device location code in
  the PCI summary is currently 32 characters. This results in some IBM
  location codes being truncated due to their length, e.g. ::

      PHB#0001:02:11.0 [SWDN]  SLOT=C11  x8
      PHB#0001:13:00.0 [EP  ]  *snip* LOC_CODE=U78D3.ND1.WZS004A-P1-C
      PHB#0001:13:00.1 [EP  ]  *snip* LOC_CODE=U78D3.ND1.WZS004A-P1-C
      PHB#0001:13:00.2 [EP  ]  *snip* LOC_CODE=U78D3.ND1.WZS004A-P1-C
      PHB#0001:13:00.3 [EP  ]  *snip* LOC_CODE=U78D3.ND1.WZS004A-P1-C

  This obscures the actual location of the card, and it looks bad.
  This patch increases the maximum length of the label string to 80 characters
  since that's the maximum length for a location code.


Since v6.3-rc3:

- pci: Try harder to add meaningful ibm,loc-code

  We keep the existing logic of looking to the parent for the slot-label or
  slot-location-code but, if all that fails, we now look
  directly for the slot-location-code (as this should give us the correct
  loc code for things directly under the PHB), and otherwise we just look
  for a loc-code.

  The applicable bit of PAPR here is:

    R1–12.1–1. Each instance of a hardware entity (FRU) has a platform
    unique location code and any node in the OF device tree that
    describes a part of a hardware entity must include the
    “ibm,loc-code” property with a value that represents the location
    code for that hardware entity.

  which we weren't really fully obeying at any recent (ever?) point in
  time. Now we should do okay, at least for PCI.

Since v6.3-rc2:

- core/pci: Use PHB io-base-location by default for PHB slots

  On witherspoon only the GPU slots and the three pluggable PCI slots
  (SLOT0, 1, 2) have platform defined slot names. For builtin devices such
  as the SATA controller or the PLX switch that fans out to the GPU slots
  we have no location codes which some people consider an issue.

  This patch addresses the problem by making the ibm,slot-location-code for
  the root port device default to the ibm,io-base-location-code which is
  typically the location code for the system itself.

  e.g.
  ::

      pciex@600c3c0100000/ibm,loc-code
                       "UOPWR.0000000-Node0-Proc0"

      pciex@600c3c0100000/pci@0/ibm,loc-code
                       "UOPWR.0000000-Node0-Proc0"

      pciex@600c3c0100000/pci@0/usb-xhci@0/ibm,loc-code
                       "UOPWR.0000000-Node0"

  The PHB node and the root complex nodes have a loc code of the
  processor they are attached to, while the usb-xhci device under the
  root port has a location code of the system itself.

- hw/phb4: Read ibm,loc-code from PBCQ node

  On P9 the PBCQs are subdivided by stacks which implement the PCI Express
  logic. When phb4 was forked from phb3 most of the properties that were
  in the pbcq node moved into the stack node, but ibm,loc-code was not one
  of them. This patch fixes the phb4 init sequence to read the base
  location code from the PBCQ node (parent of the stack node) rather than
  the stack node itself.


.. _skiboot-6.3-OpenCAPI:

OpenCAPI
--------
- npu2/hw-procedures: Fix parallel zcal for opencapi

  For opencapi, we currently do impedance calibration when initializing
  the PHY for the device, which could run in parallel if we have
  multiple opencapi devices. But if 2 devices are on the same
  obus, the 2 calibration sequences could overlap, which likely yields
  bad results and is useless anyway since it only needs to be done once
  per obus.

  This patch splits the opencapi PHY reset in 2 parts:

  - an 'init' part called serially at boot. That's when zcal is done. If
    we have 2 devices on the same socket, the zcal won't be redone,
    since we're called serially and we'll see it has already been done for
    the obus
  - a 'reset' part called during fundamental reset as a prereq for link
    training. It does the PHY setup for a set of lanes and the dccal.

  The PHY team confirmed there's no dependency between zcal and the
  other reset steps and it can be moved earlier.
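
  The "once per obus" behaviour of the 'init' part can be sketched as
  below. This is a hedged illustration rather than the skiboot code:
  ``NR_OBUS``, ``zcal_done`` and ``obus_zcal_once()`` are hypothetical
  names, and the calibration itself is reduced to a comment. Because the
  'init' part is called serially, the sketch uses no locking.

  .. code-block:: c

     #include <stdbool.h>

     #define NR_OBUS 4   /* assumption: number of obus units to track */

     /* One flag per obus, set once calibration has been performed. */
     static bool zcal_done[NR_OBUS];

     /* Returns true if this call ran the calibration, false if it was skipped. */
     bool obus_zcal_once(unsigned int obus)
     {
         if (obus >= NR_OBUS || zcal_done[obus])
             return false;

         /* Real code would run the impedance calibration (zcal) here. */
         zcal_done[obus] = true;
         return true;
     }

  A second device on the same obus calls ``obus_zcal_once()`` with the same
  index and skips the calibration.
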
- npu2-hw-procedures: Fix zcal in mixed opencapi and nvlink mode

  The zcal procedure needs to be run once per obus. We keep track of
  which obus is already calibrated in an array indexed by the obus
  number. However, the obus number is inferred from the brick index,
  which works well for nvlink but not for opencapi.

  Create an obus_index() function, which, from a device, returns the
  correct obus index, irrespective of the device type.
- npu2-opencapi: Fix adapter reset when using 2 adapters

  If two opencapi adapters are on the same obus, we may try to train the
  two links in parallel at boot time, when all the PCI links are being
  trained. Both links use the same i2c controller to handle the reset
  signal, so some care is needed to make sure resetting one doesn't
  interfere with the reset of the other. We need to keep track of the
  current state of the i2c controller (and use locking).

  This went mostly unnoticed as you need to have 2 opencapi cards on the
  same socket and links tended to train anyway because of the retries.
- npu2-opencapi: Extend delay after releasing reset on adapter

  Give more time to the FPGA to process the reset signal. The previous
  delay, 5ms, is too short for newer adapters with bigger FPGAs. Extend
  it to 250ms.
  Ultimately, that delay will likely end up being added to the opencapi
  specification, but we are not there yet.
- npu2-opencapi: ODL should be in reset when enabled

  We haven't hit any problem so far, but from the ODL designer, the ODL
  should be in reset when it is enabled.

  The ODL remains in reset until we start a fundamental reset to
  initiate link training. We still assert and deassert the ODL reset
  signal as part of the normal procedure just before training the
  link.
  Asserting is therefore useless at boot, since the ODL is already
  in reset, but we keep it as it's only a scom write and it's needed
  when we reset/retrain from the OS.
- npu2-opencapi: Keep ODL and adapter in reset at the same time

  Split the function to assert and deassert the reset signal on the ODL,
  so that we can keep the ODL in reset while we reset the adapter,
  therefore having a window where both sides are in reset.

  It is actually not required with our current DLx at boot time, but I
  need to split the ODL reset function for the following patch and it
  will become useful/required later when we introduce resetting an
  opencapi link from the OS.
- npu2-opencapi: Setup perf counters to detect CRC errors

  It's possible to set up performance counters for the PLL to detect
  various conditions for the links in nvlink or opencapi mode. Since
  those counters are currently unused, let's configure them when an obus
  is in opencapi mode to detect CRC errors on the link. Each link has
  two counters:

  - CRC error detected by the host
  - CRC error detected by the DLx (NAK received by the host)

  We also dump the counters shortly after the link trains, but they can
  be read multiple times through cronus, pdbg or linux. The counters are
  configured to be reset after each read.

Since v6.3-rc1:

- opal/hmi: Never trust a cow!

  With opencapi, it's fairly common to trigger HMIs during AFU
  development on the FPGA, by not replying in time to an NPU command,
  for example. So shift the blame reported by that cow to avoid crowding
  my mailbox.
- hw/npu2: Dump (more) npu2 registers on link error and HMIs

  We were already logging some NPU registers during an HMI. This patch
  cleans up a bit how it is done and separates what is global from what
  is specific to nvlink or opencapi.

  Since we can now receive an error interrupt when an opencapi link goes
  down unexpectedly, we also dump the NPU state but we limit it to the
  registers of the brick which hit the error.

  The list of registers to dump was worked out with the hw team to
  allow for proper debugging. For each register, we print the name as
  found in the NPU workbook, the scom address and the register value.
- hw/npu2: Report errors to the OS if an OpenCAPI brick is fenced

  Now that the NPU may report interrupts due to the link going down
  unexpectedly, report those errors to the OS when queried by the
  'next_error' PHB callback.

  The hardware doesn't support recovery of the link when it goes down
  unexpectedly. So we report the PHB as dead, so that the OS can log the
  proper message, notify the drivers and take the devices down.
- hw/npu2: Fix OpenCAPI PE assignment

  When we support mixing NVLink and OpenCAPI devices on the same NPU, we're
  going to have to share the same range of 16 PE numbers between NVLink and
  OpenCAPI PHBs.

  For OpenCAPI devices, PE assignment is only significant for determining
  which System Interrupt Log register is used for a particular brick - unlike
  NVLink, it doesn't play any role in determining how links are fenced.

  Split the PE range into a lower half which is used for NVLink, and an upper
  half that is used for OpenCAPI, with a fixed PE number assigned per brick.

  As the PE assignment for OpenCAPI devices is fixed, set the PE once
  during device init and then ignore calls to the set_pe() operation.

- opal-api: Reserve 2 OPAL API calls for future OpenCAPI LPC use

  OpenCAPI Lowest Point of Coherency (LPC) memory is going to require
  some extra OPAL calls to set up NPU BARs.
  These calls will most likely be
  called OPAL_NPU_LPC_ALLOC and OPAL_NPU_LPC_RELEASE, we're not quite ready
  to upstream that code yet though.



NVLINK2
-------
- npu2: Allow ATSD for LPAR other than 0

  Each XTS MMIO ATSD# register is accompanied by another register -
  XTS MMIO ATSD0 LPARID# - which controls LPID filtering for ATSD
  transactions.

  When a host system passes a GPU through to a guest, we need to enable
  some ATSD for an LPAR. At the moment the host assigns one ATSD to
  an NVLink bridge and this maps it to an LPAR when the GPU is assigned to
  the LPAR. The link number is used for an ATSD index.

  ATSD6&7 stay mapped to the host (LPAR=0) all the time which seems to be
  an acceptable price for the simplicity.
- npu2: Add XTS_BDF_MAP wildcard refcount

  Currently the PID wildcard is programmed into the NPU once and never
  cleared up. This works for bare metal as MSR does not change while the
  host OS is running.

  However with device virtualization, we need to keep track of wildcard
  entry use and clear the entries up before switching a GPU from a host to
  a guest or vice versa.

  This adds a refcount to an NPU2, one counter per wildcard entry. The index
  is a short lparid (4 bits long) which is allocated in opal_npu_map_lpar()
  and should be smaller than NPU2_XTS_BDF_MAP_SIZE (defined as 16).

Since v6.3-rc2:

- npu2: Disable Probe-to-Invalid-Return-Modified-or-Owned snarfing by default

  V100 GPUs are known to violate the NVLink2 protocol in some cases (one is
  when memory was accessed by the CPU and then by the GPU using so called
  block linear mapping) and issue double probes to the NPU, which can cope
  with this problem only if CONFIG_ENABLE_SNARF_CPM ("disable/enable Probe.I.MO
  snarfing a cp_m") is not set in the CQ_SM Misc Config register #0.
  If the bit is set (which is the case today), NPU issues the machine
  check stop.

  The snarfing feature is designed to detect 2 probes in flight and combine
  them into one.

  This adds a new "opal-npu2-snarf-cpm" nvram variable which controls
  CONFIG_ENABLE_SNARF_CPM for all NVLinks to prevent the machine check
  stop from happening.

  This disables snarfing by default as otherwise a broken GPU driver can
  crash the entire box even when a GPU is passed through to a guest.
  This provides a dial to allow regression tests (might be useful for
  a bare metal). To enable snarfing, the user needs to run: ::

      sudo nvram -p ibm,skiboot --update-config opal-npu2-snarf-cpm=enable

  and reboot the host system.

- hw/npu2: Show name of opencapi error interrupts


Debugging and simulation
------------------------

- external/mambo: Error out if kernel is too large

  If you're trying to boot a gigantic kernel in mambo (which you can
  reproduce by building a kernel with CONFIG_MODULES=n) you'll get
  misleading errors like: ::

      WARNING: 0: (0): [0:0]: Invalid/unsupported instr 0x00000000[INVALID]
      WARNING: 0: (0):  PC(EA): 0x0000000030000010 PC(RA):0x0000000030000010 MSR: 0x9000000000000000 LR: 0x0000000000000000
      WARNING: 0: (0):  numInstructions = 0
      WARNING: 1: (1): [0:0]: Invalid/unsupported instr 0x00000000[INVALID]
      WARNING: 1: (1):  PC(EA): 0x0000000000000E40 PC(RA):0x0000000000000E40 MSR: 0x9000000000000000 LR: 0x0000000000000000
      WARNING: 1: (1):  numInstructions = 1
      WARNING: 1: (1): Interrupt to 0x0000000000000E40 from 0x0000000000000E40
      INFO: 1: (2): ** Execution stopped: Continuous Interrupt, Instruction caused exception,  **

  So add an error to skiboot.tcl to warn the user before this happens.
  Making PAYLOAD_ADDR further back is one way to do this but if there's a
  less gross way to generally work around this very niche problem, I can
  suggest that instead.
- external/mambo: Populate kernel-base-address in the DT

  skiboot.tcl defines PAYLOAD_ADDR as 0x20000000, which is also the default
  in skiboot unless kernel-base-address is set in the device tree.

  If you change PAYLOAD_ADDR to something else for mambo, skiboot won't
  see it because it doesn't set that DT property, so fix it so that it does.
- external/mambo: allow CPU targeting for most debug utils

  Debug util functions target CPU 0:0:0 by default. Some can be
  overridden explicitly per invocation, and others can't at all.
  Even for those that can be overridden, it is a pain to type
  them out when you're debugging a particular thread.

  Provide a new 'target' function that allows the default CPU
  target to be changed. Wire up that default to all other utils.
  Provide a new 'S' step command which only steps the target CPU.
- qemu: bt device isn't always hanging off /

  Just use the normal for_each_compatible instead.

  Otherwise in the qemu model as executed by op-test,
  we wouldn't go down the astbmc_init() path, thus not having flash.
- devicetree: Add p9-simics.dts

  Add a p9-based devicetree that's suitable for use with Simics.
- devicetree: Move power9-phb4.dts

  Clean up the formatting of power9-phb4.dts and move it to
  external/devicetree/p9.dts. This sets us up to include it as the basis
  for other trees.
- devicetree: Add nx node to power9-phb4.dts

  A (non-qemu) p9 without an nx node will assert in p9_darn_init(): ::

      dt_for_each_compatible(dt_root, nx, "ibm,power9-nx")
              break;
      if (!nx) {
              if (!dt_node_is_compatible(dt_root, "qemu,powernv"))
                      assert(nx);
              return;
      }

  Since NX is this essential, add it to the device tree.
- devicetree: Fix typo in power9-phb4.dts

  Change "impi" to "ipmi".
- devicetree: Fix syntax error in power9-phb4.dts

  Remove the extra space causing this: ::

      Error: power9-phb4.dts:156.15-16 syntax error
      FATAL ERROR: Unable to parse input tree
- core/init: enable machine check on secondaries

  Secondary CPUs currently run with MSR[ME]=0 during boot, which means
  if they take a machine check, the system will checkstop.

  Enable ME where possible and allow them to print registers.

Utilities
---------
- pflash: Don't try update RO ToC

  In the future it's likely the ToC will be marked as read-only. Don't
  error out by assuming it's writable.
- pflash: Support encoding/decoding ECC'd partitions

  With the new --ecc option, pflash can add/remove ECC when
  reading/writing flash partitions protected by ECC.

  This is *not* flawless with current PNORs out in the wild though, as
  they do not typically fill the whole partition with valid ECC data, so
  you have to know how big the valid ECC'd data is and specify the size
  manually. Note that for some partitions this is practically impossible
  without knowing the details of the content of the partition.

  A future patch is likely to introduce an option to "stop reading data
  when ECC starts failing and assume everything is okay rather than error
  out" to support reading the "valid" data from existing PNOR images.
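
  When working out the size to specify by hand, it can help to convert
  between payload bytes and bytes occupied on flash. The sketch below
  assumes a layout of one ECC byte after every 8 data bytes; that ratio is
  an assumption of this example and is not stated in these notes, and the
  function names are hypothetical, not part of pflash or libflash.

  .. code-block:: c

     #include <stdint.h>

     #define DATA_BYTES_PER_ECC 8   /* assumption: 8 data bytes per ECC byte */

     /* Flash bytes needed for data_len payload bytes, rounded up to whole words. */
     uint64_t ecc_flash_size(uint64_t data_len)
     {
         uint64_t words = (data_len + DATA_BYTES_PER_ECC - 1) / DATA_BYTES_PER_ECC;

         return words * (DATA_BYTES_PER_ECC + 1);
     }

     /* Payload bytes held in flash_len bytes of ECC'd flash (whole words only). */
     uint64_t ecc_payload_size(uint64_t flash_len)
     {
         return (flash_len / (DATA_BYTES_PER_ECC + 1)) * DATA_BYTES_PER_ECC;
     }

  Under that assumption, 4096 bytes of payload occupy 4608 bytes of flash,
  and a trailing partial word on flash carries no usable payload.
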

Since v6.3-rc2:

- opal-prd: Fix memory leak in is-fsp-system check
- opal-prd: Check malloc return value