1.. _skiboot-6.3-rc1: 2 3skiboot-6.3-rc1 4=============== 5 6skiboot v6.3-rc1 was released on Friday March 29th 2019. It is the first 7release candidate of skiboot 6.3, which will become the new stable release 8of skiboot following the 6.2 release, first released December 14th 2018. 9 10Skiboot 6.3 will mark the basis for op-build v2.3. I expect to tag the final 11skiboot 6.3 in the next week. 12 13skiboot v6.3-rc1 contains all bug fixes as of :ref:`skiboot-6.0.19`, 14and :ref:`skiboot-6.2.3` (the currently maintained 15stable releases). 16 17For how the skiboot stable releases work, see :ref:`stable-rules` for details. 18 19This release has been a longer cycle than typical for a variety of reasons. It 20also contains a lot of cleanup work and minor bug fixes (much like skiboot 6.2 21did). 22 23Over skiboot 6.2, we have the following changes: 24 25.. _skiboot-6.3-rc1-new-features: 26 27New Features 28------------ 29 30- hw/imc: Enable opal calls to init/start/stop IMC Trace mode 31 32 New OPAL APIs for In-Memory Collection Counter infrastructure(IMC), 33 including a new device type called OPAL_IMC_COUNTERS_TRACE. 34- xive: Add calls to save/restore the queues and VPs HW state 35 36 To be able to support migration of guests using the XIVE native 37 exploitation mode, (where the queue is effectively owned by the 38 guest), KVM needs to be able to save and restore the HW-modified 39 fields of the queue, such as the current queue producer pointer and 40 generation bit, and to retrieve the modified thread context registers 41 of the VP from the NVT structure : the VP interrupt pending bits. 42 43 However, there is no need to set back the NVT structure on P9. P10 44 should be the same. 45- witherspoon: Add nvlink2 interconnect information 46 47 GPUs on Redbud and Sequoia platforms are interconnected in groups of 48 2 or 3 GPUs. The problem with that is if the user decides to pass a single 49 GPU from a group to the userspace, we need to ensure that links between 50 GPUs do not get enabled. 51 52 A V100 GPU provides a way to disable selected links. In order to only 53 disable links to peer GPUs, we need a topology map. 54 55 This adds an "ibm,nvlink-peers" property to a GPU DT node with phandles 56 of peer GPUs and NVLink2 bridges. The index in the property is a GPU link 57 number. 58- platforms/romulus: Also support talos 59 60 The two are similar enough and I'd like to have a slot table for our 61 Talos. 62- OpenCAPI support! (see :ref:`skiboot-6.3-rc1-OpenCAPI` section) 63- opal/hmi: set a flag to inform OS that TOD/TB has failed. 64 65 Set a flag to indicate OS about TOD/TB failure as part of new 66 opal_handle_hmi2 handler. This flag then can be used by OS to make sure 67 functions depending on TB value (e.g. udelay()) are aware of TB not 68 ticking. 69- astbmc: Enable IPMI HIOMAP for AMI platforms 70 71 Required for Habanero, Palmetto and Romulus. 72- power-mgmt : occ : Add 'freq-domain-mask' DT property 73 74 Add a new device-tree property freq-domain-indicator to define group of 75 CPUs which would share same frequency. This property has been added under 76 power-mgmt node. It is a bitmask. 77 78 Bitwise AND is taken between this bitmask value and PIR of cpu. All the 79 CPUs lying in the same frequency domain will have same result for AND. 80 81 For example, For POWER9, 0xFFF0 indicates quad wide frequency domain. 82 Taking AND with the PIR of CPUs will yield us frequency domain which is 83 quad wise distribution as last 4 bits have been masked which represent the 84 cores. 85 86 Similarly, 0xFFF8 will represent core wide frequency domain for P8. 87 88 Also, Add a new device-tree property domain-runs-at which will denote the 89 strategy OCC is using to change the frequency of a frequency-domain. There 90 can be two strategy - FREQ_MOST_RECENTLY_SET and FREQ_MAX_IN_DOMAIN. 91 92 FREQ_MOST_RECENTLY_SET : the OCC sets the frequency of the quad to the most 93 recent frequency value requested by the CPUs in the quad. 94 95 FREQ_MAX_IN_DOMAIN : the OCC sets the frequency of the CPUs in 96 the Quad to the maximum of the latest frequency requested by each of 97 the component cores. 98- powercap: occ: Fix the powercapping range allowed for user 99 100 OCC provides two limits for minimum powercap. One being hard powercap 101 minimum which is guaranteed by OCC and the other one is a soft 102 powercap minimum which is lesser than hard-min and may or may not be 103 asserted due to various power-thermal reasons. So to allow the users 104 to access the entire powercap range, this patch exports soft powercap 105 minimum as the "powercap-min" DT property. And it also adds a new 106 DT property called "powercap-hard-min" to export the hard-min powercap 107 limit. 108- Add NVDIMM support 109 110 NVDIMMs are memory modules that use a battery backup system to allow the 111 contents RAM to be saved to non-volatile storage if system power goes 112 away unexpectedly. This allows them to be used a high-performance 113 storage device, suitable for serving as a cache for SSDs and the like. 114 115 Configuration of NVDIMMs is handled by hostboot and communicated to OPAL 116 via the HDAT. We need to parse out the NVDIMM memory ranges and create 117 memory regions with the "pmem-region" compatible label to make them 118 available to the host. 119- core/exceptions: implement support for MCE interrupts in powersave 120 121 The ISA specifies that MCE interrupts in power saving modes will enter 122 at 0x200 with powersave bits in SRR1 set. This is not currently 123 supported properly, the MCE will just happen like a normal interrupt, 124 but GPRs could be lost, which would lead to crashes (e.g., r1, r2, r13 125 etc). 126 127 So check the power save bits similarly to the sreset vector, and 128 handle this properly. 129- core/exceptions: allow recoverable sreset exceptions 130 131 This requires implementing the MSR[RI] bit. Then just allow all 132 non-fatal sreset exceptions to recover. 133- core/exceptions: implement an exception handler for non-powersave sresets 134 135 Detect non-powersave sresets and send them to the normal exception 136 handler which prints registers and stack. 137- Add PVR_TYPE_P9P 138 139 Enable a new PVR to get us running on another p9 variant. 140 141Deprecated/Removed Features 142--------------------------- 143 144- opal: Deprecate reading the PHB status 145 146 The OPAL_PCI_EEH_FREEZE_STATUS call takes a bunch of parameters, one of 147 them is @phb_status. It is defined as __be64* and always NULL in 148 the current Linux upstream but if anyone ever decides to read that status, 149 then the PHB3's handler will assume it is struct OpalIoPhb3ErrorData* 150 (which is a lot bigger than 8 bytes) and zero it causing the stack 151 corruption; p7ioc-phb has the same issue. 152 153 This removes @phb_status from all eeh_freeze_status() hooks and moves 154 the error message from PHB4 to the affected OPAL handlers. 155 156 As far as we can tell, nobody has ever used this and thus it's safe to remove. 157- Remove POWER9N DD1 support 158 159 This is not a shipping product and is no longer supported by Linux 160 or other firmware components. 161 162General 163------- 164 165- core/i2c: Various bits of refactoring 166- refactor backtrace generation infrastructure 167- astbmc: Handle failure to initialise raw flash 168 169 Initialising raw flash lead to a dead assignment to rc. Check the return 170 code and take the failure path as necessary. Both before and after the 171 fix we see output along the lines of the following when flash_init() 172 fails: :: 173 174 [ 53.283182881,7] IRQ: Registering 0800..0ff7 ops @0x300d4b98 (data 0x3052b9d8) 175 [ 53.283184335,7] IRQ: Registering 0ff8..0fff ops @0x300d4bc8 (data 0x3052b9d8) 176 [ 53.283185513,7] PHB#0000: Initializing PHB... 177 [ 53.288260827,4] FLASH: Can't load resource id:0. No system flash found 178 [ 53.288354442,4] FLASH: Can't load resource id:1. No system flash found 179 [ 53.342933439,3] CAPP: Error loading ucode lid. index=200ea 180 [ 53.462749486,2] NVRAM: Failed to load 181 [ 53.462819095,2] NVRAM: Failed to load 182 [ 53.462894236,2] NVRAM: Failed to load 183 [ 53.462967071,2] NVRAM: Failed to load 184 [ 53.463033077,2] NVRAM: Failed to load 185 [ 53.463144847,2] NVRAM: Failed to load 186 187 Eventually followed by: :: 188 189 [ 57.216942479,5] INIT: platform wait for kernel load failed 190 [ 57.217051132,5] INIT: Assuming kernel at 0x20000000 191 [ 57.217127508,3] INIT: ELF header not found. Assuming raw binary. 192 [ 57.217249886,2] NVRAM: Failed to load 193 [ 57.221294487,0] FATAL: Kernel is zeros, can't execute! 194 [ 57.221397429,0] Assert fail: core/init.c:615:0 195 [ 57.221471414,0] Aborting! 196 CPU 0028 Backtrace: 197 S: 0000000031d43c60 R: 000000003001b274 ._abort+0x4c 198 S: 0000000031d43ce0 R: 000000003001b2f0 .assert_fail+0x34 199 S: 0000000031d43d60 R: 0000000030014814 .load_and_boot_kernel+0xae4 200 S: 0000000031d43e30 R: 0000000030015164 .main_cpu_entry+0x680 201 S: 0000000031d43f00 R: 0000000030002718 boot_entry+0x1c0 202 --- OPAL boot --- 203 204 Analysis of the execution paths suggests we'll always "safely" end this 205 way due the setup sequence for the blocklevel callbacks in flash_init() 206 and error handling in blocklevel_get_info(), and there's no current risk 207 of executing from unexpected memory locations. As such the issue is 208 reduced to down to a fix for poor error hygene in the original change 209 and a resolution for a Coverity warning (famous last words etc). 210- core/flash: Retry requests as necessary in flash_load_resource() 211 212 We would like to successfully boot if we have a dependency on the BMC 213 for flash even if the BMC is not current ready to service flash 214 requests. On the assumption that it will become ready, retry for several 215 minutes to cover a BMC reboot cycle and *eventually* rather than 216 *immediately* crash out with: :: 217 218 [ 269.549748] reboot: Restarting system 219 [ 390.297462587,5] OPAL: Reboot request... 220 [ 390.297737995,5] RESET: Initiating fast reboot 1... 221 [ 391.074707590,5] Clearing unused memory: 222 [ 391.075198880,5] PCI: Clearing all devices... 223 [ 391.075201618,7] Clearing region 201ffe000000-201fff800000 224 [ 391.086235699,5] PCI: Resetting PHBs and training links... 225 [ 391.254089525,3] FFS: Error 17 reading flash header 226 [ 391.254159668,3] FLASH: Can't open ffs handle: 17 227 [ 392.307245135,5] PCI: Probing slots... 228 [ 392.363723191,5] PCI Summary: 229 ... 230 [ 393.423255262,5] OCC: All Chip Rdy after 0 ms 231 [ 393.453092828,5] INIT: Starting kernel at 0x20000000, fdt at 232 0x30800a88 390645 bytes 233 [ 393.453202605,0] FATAL: Kernel is zeros, can't execute! 234 [ 393.453247064,0] Assert fail: core/init.c:593:0 235 [ 393.453289682,0] Aborting! 236 CPU 0040 Backtrace: 237 S: 0000000031e03ca0 R: 000000003001af60 ._abort+0x4c 238 S: 0000000031e03d20 R: 000000003001afdc .assert_fail+0x34 239 S: 0000000031e03da0 R: 00000000300146d8 .load_and_boot_kernel+0xb30 240 S: 0000000031e03e70 R: 0000000030026cf0 .fast_reboot_entry+0x39c 241 S: 0000000031e03f00 R: 0000000030002a4c fast_reset_entry+0x2c 242 --- OPAL boot --- 243 244 The OPAL flash API hooks directly into the blocklevel layer, so there's 245 no delay for e.g. the host kernel, just for asynchronously loaded 246 resources during boot. 247- fast-reboot: occ: Call occ_pstates_init() on fast-reset on all machines 248 249 Commit 815417dcda2e ("init, occ: Initialise OCC earlier on BMC systems") 250 conditionally invoked occ_pstates_init() only on FSP based systems in 251 load_and_boot_kernel(). Due to this pstate table is re-parsed on FSP 252 system and skipped on BMC system during fast-reboot. So this patch fixes 253 this by invoking occ_pstates_init() on all boxes during fast-reboot. 254- opal/hmi: Don't retry TOD recovery if it is already in failed state. 255 256 On TOD failure, all cores/thread receives HMI and very first thread that 257 gets interrupt fixes the TOD where as others just resets the respective 258 HMER error bit and return. But when TOD is unrecoverable, all the threads 259 try to do TOD recovery one by one causing threads to spend more time inside 260 opal. Set a global flag when TOD is unrecoverable so that rest of the 261 threads go back to linux immediately avoiding lock ups in system 262 reboot/panic path. 263- hw/bt: Do not disable ipmi message retry during OPAL boot 264 265 Currently OPAL doesn't know whether BMC is functioning or not. If BMC is 266 down (like BMC reboot), then we keep on retry sending message to BMC. So 267 in some corner cases we may hit hard lockup issue in kernel. 268 269 Ideally we should avoid using synchronous path as much as possible. But 270 for now commit 01f977c3 added option to disable message retry in synchronous. 271 But this fix is not required during boot. Hence lets disable IPMI message 272 retry during OPAL boot. 273- hdata/memory: Fix warning message 274 275 Even though we added memory to device tree, we are getting below warning. :: 276 277 [ 57.136949696,3] Unable to use memory range 0 from MSAREA 0 278 [ 57.137049753,3] Unable to use memory range 0 from MSAREA 1 279 [ 57.137152335,3] Unable to use memory range 0 from MSAREA 2 280 [ 57.137251218,3] Unable to use memory range 0 from MSAREA 3 281- hw/bt: Add backend interface to disable ipmi message retry option 282 283 During boot OPAL makes IPMI_GET_BT_CAPS call to BMC to get BT interface 284 capabilities which includes IPMI message max resend count, message 285 timeout, etc,. Most of the time OPAL gets response from BMC within 286 specified timeout. In some corner cases (like mboxd daemon reset in BMC, 287 BMC reboot, etc) OPAL may not get response within timeout period. In 288 such scenarios, OPAL resends message until max resend count reaches. 289 290 OPAL uses synchronous IPMI message (ipmi_queue_msg_sync()) for few 291 operations like flash read, write, etc. Thread will wait in OPAL until 292 it gets response from BMC. In some corner cases like BMC reboot, thread 293 may wait in OPAL for long time (more than 20 seconds) and results in 294 kernel hardlockup. 295 296 This patch introduces new interface to disable message resend option. We 297 will disable message resend option for synchrous message. This will 298 greatly reduces kernel hardlock up issues. 299 300 This is short term fix. Long term solution is to convert all synchronous 301 messages to asynhrounous one. 302- ipmi/power: Fix system reboot issue 303 304 Kernel makes reboot/shudown OPAL call for reboot/shutdown. Once kernel 305 gets response from OPAL it runs opal_poll_events() until firmware 306 handles the request. 307 308 On BMC based system, OPAL makes IPMI call (IPMI_CHASSIS_CONTROL) to 309 initiate system reboot/shutdown. At present OPAL queues IPMI messages 310 and return SUCESS to Host. If BMC is not ready to accept command (like 311 BMC reboot), then these message will fail. We have to manually 312 reboot/shutdown the system using BMC interface. 313 314 This patch adds logic to validate message return value. If message failed, 315 then it will resend the message. At some stage BMC will be ready to accept 316 message and handles IPMI message. 317- firmware-versions: Add test case for parsing VERSION 318 319 Also make it possible to use with afl-lop/afl-fuzz just to help make 320 *sure* we're all good. 321 322 Additionally, if we hit a entry in VERSION that is larger than our 323 buffer size, we skip over it gracefully rather than overwriting the 324 stack. This is only a problem if VERSION isn't trusted, which as of 325 4b8cc05a94513816d43fb8bd6178896b430af08f it is verified as part of 326 Secure Boot. 327- core/fast-reboot: improve NMI handling during fast reset 328 329 Improve sreset and MCE handling in fast reboot. Switch the HILE bit 330 off before copying OPAL's exception vectors, so NMIs can be handled 331 properly. Also disable MSR[ME] while the vectors are being overwritten 332- core/cpu: HID update race 333 334 If the per-core HID register is updated concurrently by multiple 335 threads, updates can get lost. This has been observed during fast 336 reboot where the HILE bit does not get cleared on all cores, which 337 can cause machine check exception interrupts to crash. 338 339 Fix this by only updating HID on thread0. 340- SLW: Print verbose info on errors only 341 342 Change print level from debug to warning for reporting 343 bad EC_PPM_SPECIAL_WKUP_* scom values. To reduce cluttering 344 in the log print only on error. 345 346IBM FSP based platforms 347----------------------- 348 349- platforms/firenze: Rework I2C controller fixups 350- platforms/zz: Re-enable LXVPD slot information parsing 351 352 From memory this was disabled in the distant past since we were waiting 353 for an updates to the LXPVD format. It looks like that never happened 354 so re-enable it for the ZZ platform so that we can get PCI slot location 355 codes on ZZ. 356 357HIOMAP 358------ 359- astbmc: Try IPMI HIOMAP for P8 360 361 The HIOMAP protocol was developed after the release of P8 in preparation 362 for P9. As a consequence P9 always uses it, but it has rarely been 363 enabled for P8. P8DTU has recently added IPMI HIOMAP support to its BMC 364 firmware, so enable its use in skiboot with P8 machines. Doing so 365 requires some rework to ensure fallback works correctly as in the past 366 the fallback was to mbox, which will only work for P9. 367- libflash/ipmi-hiomap: Enforce message size for empty response 368 369 The protocol defines the response to the associated messages as empty 370 except for the command ID and sequence fields. If the BMC is returning 371 extra data consider the message malformed. 372- libflash/ipmi-hiomap: Remove unused close handling 373 374 Issuing a HIOMAP_C_CLOSE is not required by the protocol specification, 375 rather a close can be implicit in a subsequent 376 CREATE_{READ,WRITE}_WINDOW request. The implicit close provides an 377 opportunity to reduce LPC traffic and the implementation takes up that 378 optimisation, so remove the case from the IPMI callback handler. 379- libflash/ipmi-hiomap: Overhaul event handling 380 381 Reworking the event handling was inspired by a bug report by Vasant 382 where the host would get wedged on multiple flash access attempts in the 383 face of a persistent error state on the BMC-side. The cause of this bug 384 was the early-exit based on ctx->update, which erronously assumed that 385 all events had been completely handled in prior calls to 386 ipmi_hiomap_handle_events(). This is not true if e.g. 387 HIOMAP_E_DAEMON_READY is clear in the prior calls. 388 389 Regardless, there were other correctness and efficiency problems with 390 the handling strategy: 391 392 * Ack-able event state was not restored in the face of errors in the 393 process of re-establishing protocol state 394 * It forced needless window restoration with respect to the context in 395 which ipmi_hiomap_handle_events() was called. 396 * Tests for HIOMAP_E_DAEMON_READY and HIOMAP_E_FLASH_LOST were redundant 397 with the overhauled error handling introduced in the previous patch 398 399 Fix all of the above issues and add comments to explain the event 400 handling flow. 401- libflash/ipmi-hiomap: Overhaul error handling 402 403 The aim is to improve the robustness with respect to absence of the 404 BMC-side daemon. The current error handling roughly mirrors what was 405 done for the mailbox implementation, but there's room for improvement. 406 407 Errors are split into two classes, those that affect the transport state 408 and those that affect the window validity. From here, we push the 409 transport state error checks right to the bottom of the stack, to ensure 410 the link is known to be in a good state before any message is sent. 411 Window validity tests remain as they were in the hiomap_window_move() 412 and ipmi_hiomap_read() functions. Validity tests are not necessary in 413 the write and erase paths as we will receive an error response from the 414 BMC when performing a dirty or flush on an invalid window. 415 416 Recovery also remains as it was, done on entry to the blocklevel 417 callbacks. If an error state is encountered in the middle of an 418 operation no attempt is made to recover it on the spot, instead the 419 error is returned up the stack and the caller can choose how it wishes 420 to respond. 421- libflash/ipmi-hiomap: Fix leak of msg in callback 422 423POWER8 424------ 425- hw/phb3/naples: Disable D-states 426 427 Putting "Mellanox Technologies MT27700 Family [ConnectX-4] [15b3:1013]" 428 (more precisely, the second of 2 its PCI functions, no matter in what 429 order) into the D3 state causes EEH with the "PCT timeout" error. 430 This has been noticed on garrison machines only and firestones do not 431 seem to have this issue. 432 433 This disables D-states changing for devices on root buses on Naples by 434 installing a config space access filter (copied from PHB4). 435- cpufeatures: Always advertise POWER8NVL as DD2 436 437 Despite the major version of PVR being 1 (0x004c0100) for POWER8NVL, 438 these chips are functionally equalent to P8/P8E DD2 levels. 439 440 This advertises POWER8NVL as DD2. As the result, skiboot adds 441 ibm,powerpc-cpu-features/processor-control-facility for such CPUs and 442 the linux kernel can use hypervisor doorbell messages to wake secondary 443 threads; otherwise "KVM: CPU %d seems to be stuck" would appear because 444 of missing LPCR_PECEDH. 445 446p8dtu Platform 447^^^^^^^^^^^^^^ 448- p8dtu: Configure BMC graphics 449 450 We can no-longer read the values from the BMC in the way we have in the 451 past. Values were provided by Eric Chen of SMC. 452- p8dtu: Enable HIOMAP support 453 454Vesnin Platform 455^^^^^^^^^^^^^^^ 456- platforms/vesnin: Disable PCIe port bifurcation 457 458 PCIe ports connected to CPU1 and CPU3 now work as x16 instead of x8x8. 459 460- Fix hang in pnv_platform_error_reboot path due to TOD failure. 461 462 On TOD failure, with TB stuck, when linux heads down to 463 pnv_platform_error_reboot() path due to unrecoverable hmi event, the panic 464 cpu gets stuck in OPAL inside ipmi_queue_msg_sync(). At this time, rest 465 all other cpus are in smp_handle_nmi_ipi() waiting for panic cpu to proceed. 466 But with panic cpu stuck inside OPAL, linux never recovers/reboot. :: 467 468 p0 c1 t0 469 NIA : 0x000000003001dd3c <.time_wait+0x64> 470 CFAR : 0x000000003001dce4 <.time_wait+0xc> 471 MSR : 0x9000000002803002 472 LR : 0x000000003002ecf8 <.ipmi_queue_msg_sync+0xec> 473 474 STACK: SP NIA 475 0x0000000031c236e0 0x0000000031c23760 (big-endian) 476 0x0000000031c23760 0x000000003002ecf8 <.ipmi_queue_msg_sync+0xec> 477 0x0000000031c237f0 0x00000000300aa5f8 <.hiomap_queue_msg_sync+0x7c> 478 0x0000000031c23880 0x00000000300aaadc <.hiomap_window_move+0x150> 479 0x0000000031c23950 0x00000000300ab1d8 <.ipmi_hiomap_write+0xcc> 480 0x0000000031c23a90 0x00000000300a7b18 <.blocklevel_raw_write+0xbc> 481 0x0000000031c23b30 0x00000000300a7c34 <.blocklevel_write+0xfc> 482 0x0000000031c23bf0 0x0000000030030be0 <.flash_nvram_write+0xd4> 483 0x0000000031c23c90 0x000000003002c128 <.opal_write_nvram+0xd0> 484 0x0000000031c23d20 0x00000000300051e4 <opal_entry+0x134> 485 0xc000001fea6e7870 0xc0000000000a9060 <opal_nvram_write+0x80> 486 0xc000001fea6e78c0 0xc000000000030b84 <nvram_write_os_partition+0x94> 487 0xc000001fea6e7960 0xc0000000000310b0 <nvram_pstore_write+0xb0> 488 0xc000001fea6e7990 0xc0000000004792d4 <pstore_dump+0x1d4> 489 0xc000001fea6e7ad0 0xc00000000018a570 <kmsg_dump+0x140> 490 0xc000001fea6e7b40 0xc000000000028e5c <panic_flush_kmsg_end+0x2c> 491 0xc000001fea6e7b60 0xc0000000000a7168 <pnv_platform_error_reboot+0x68> 492 0xc000001fea6e7bd0 0xc0000000000ac9b8 <hmi_event_handler+0x1d8> 493 0xc000001fea6e7c80 0xc00000000012d6c8 <process_one_work+0x1b8> 494 0xc000001fea6e7d20 0xc00000000012da28 <worker_thread+0x88> 495 0xc000001fea6e7db0 0xc0000000001366f4 <kthread+0x164> 496 0xc000001fea6e7e20 0xc00000000000b65c <ret_from_kernel_thread+0x5c> 497 498 This is because, there is a while loop towards the end of 499 ipmi_queue_msg_sync() which keeps looping until "sync_msg" does not match 500 with "msg". It loops over time_wait_ms() until exit condition is met. In 501 normal scenario time_wait_ms() calls run pollers so that ipmi backend gets 502 a chance to check ipmi response and set sync_msg to NULL. :: 503 504 while (sync_msg == msg) 505 time_wait_ms(10); 506 507 But in the event when TB is in failed state time_wait_ms()->time_wait_poll() 508 returns immediately without calling pollers and hence we end up looping 509 forever. This patch fixes this hang by calling opal_run_pollers() in TB 510 failed state as well. 511 512 513.. _skiboot-6.3-rc1-power9: 514 515POWER9 516------ 517 518- Retry link training at PCIe GEN1 if presence detected but training repeatedly failed 519 520 Certain older PCIe 1.0 devices will not train unless the training process starts at GEN1 speeds. 521 As a last resort when a device will not train, fall back to GEN1 speed for the last training attempt. 522 523 This is verified to fix devices based on the Conexant CX23888 on the Talos II platform. 524- hw/phb4: Drop FRESET_DEASSERT_DELAY state 525 526 The delay between the ASSERT_DELAY and DEASSERT_DELAY states is set to 527 one timebase tick. This state seems to have been a hold over from PHB3 528 where it was used to add a 1s delay between de-asserting PERST and 529 polling the link for the CAPI FPGA. There's no requirement for that here 530 since the link polling on PHB4 is a bit smarter so we should be fine. 531- hw/phb4: Factor out PERST control 532 533 Some time ago Mikey added some code work around a bug we found where a 534 certain RAID card wouldn't come back again after a fast-reboot. The 535 workaround is setting the Link Disable bit before asserting PERST and 536 clear it after de-asserting PERST. 537 538 Currently we do this in the FRESET path, but not in the CRESET path. 539 This patch moves the PERST control into its own function to reduce 540 duplication and to the workaround is applied in all circumstances. 541- hw/phb4: Remove FRESET presence check 542 543 When we do an freset the first step is to check if a card is present in 544 the slot. However, this only occurs when we enter phb4_freset() with the 545 slot state set to SLOT_NORMAL. This occurs in: 546 547 a) The creset path, and 548 b) When the OS manually requests an FRESET via an OPAL call. 549 550 (a) is problematic because in the boot path the generic code will put the 551 slot into FRESET_START manually before calling into phb4_freset(). This 552 can result in a situation where a device is detected on boot, but not 553 after a CRESET. 554 555 I've noticed this occurring on systems where the PHB's slot presence 556 detect signal is not wired to an adapter. In this situation we can rely 557 on the in-band presence mechanism, but the presence check will make 558 us exit before that has a chance to work. 559 560 Additionally, if we enter from the CRESET path this early exit leaves 561 the slot's PERST signal being left asserted. This isn't currently an issue, 562 but if we want to support hotplug of devices into the root port it will 563 be. 564- hw/phb4: Skip FRESET PERST when coming from CRESET 565 566 PERST is asserted at the beginning of the CRESET process to prevent 567 the downstream device from interacting with the host while the PHB logic 568 is being reset and re-initialised. There is at least a 100ms wait during 569 the CRESET processing so it's not necessary to wait this time again 570 in the FRESET handler. 571 572 This patch extends the delay after re-setting the PHB logic to extend 573 to the 250ms PERST wait period that we typically use and sets the 574 skip_perst flag so that we don't wait this time again in the FRESET 575 handler. 576- hw/phb4: Look for the hub-id from in the PBCQ node 577 578 The hub-id is stored in the PBCQ node rather than the stack node so we 579 never add it to the PHB node. This breaks the lxvpd slot lookup code 580 since the hub-id is encoded in the VPD record that we need to find the 581 slot information. 582- hdata/iohub: Look for IOVPD on P9 583 584 P8 and P9 use the same IO VPD setup, so we need to load the IOHUB VPD on 585 P9 systems too. 586 587CAPI2 588^^^^^ 589- capp/phb4: Prevent HMI from getting triggered when disabling CAPP 590 591 While disabling CAPP an HMI gets triggered as soon as ETU is put in 592 reset mode. This is caused as before we can disabled CAPP, it detects 593 PHB link going down and triggers an HMI requesting Opal to perform 594 CAPP recovery. This has an un-intended side effect of spamming the 595 Opal logs with malfunction alert messages and may also confuse the 596 user. 597 598 To prevent this we mask the CAPP FIR error 'PHB Link Down' Bit(31) 599 when we are disabling CAPP just before we put ETU in reset in 600 phb4_creset(). Also now since bringing down the PHB link now wont 601 trigger an HMI and CAPP recovery, hence we manually set the 602 PHB4_CAPP_RECOVERY flag on the phb to force recovery during creset. 603 604- phb4/capp: Implement sequence to disable CAPP and enable fast-reset 605 606 We implement h/w sequence to disable CAPP in disable_capi_mode() and 607 with it also enable fast-reset for CAPI mode in phb4_set_capi_mode(). 608 609 Sequence to disable CAPP is executed in three phases. The first two 610 phase is implemented in disable_capi_mode() where we reset the CAPP 611 registers followed by PEC registers to their init values. The final 612 third final phase is to reset the PHB CAPI Compare/Mask Register and 613 is done in phb4_init_ioda3(). The reason to move the PHB reset to 614 phb4_init_ioda3() is because by the time Opal PCI reset state machine 615 reaches this function the PHB is already un-fenced and its 616 configuration registers accessible via mmio. 617- capp/phb4: Force CAPP to PCIe mode during kernel shutdown 618 619 This patch introduces a new opal syncer for PHB4 named 620 phb4_host_sync_reset(). We register this opal syncer when CAPP is 621 activated successfully in phb4_set_capi_mode() so that it will be 622 called at kernel shutdown during fast-reset. 623 624 During kernel shutdown the function will then repeatedly call 625 phb->ops->set_capi_mode() to switch switch CAPP to PCIe mode. In case 626 set_capi_mode() indicates its OPAL_BUSY, which indicates that CAPP is 627 still transitioning to new state; it calls slot->ops.run_sm() to 628 ensure that Opal slot reset state machine makes forward progress. 629 630 631Witherspoon Platform 632^^^^^^^^^^^^^^^^^^^^ 633- platforms/witherspoon: Make PCIe shared slot error message more informative 634 635 If we're missing chips for some reason, we print a warning when configuring 636 the PCIe shared slot. 637 638 The warning doesn't really make it clear what "shared slot" is, and if it's 639 printed, it'll come right after a bunch of messages about NPU setup, so 640 let's clarify the message to explicitly mention PCI. 641- witherspoon: Add nvlink2 interconnect information 642 643 See :ref:`skiboot-6.3-rc1-new-features` for details. 644 645Zaius Platform 646^^^^^^^^^^^^^^ 647 648- zaius: Add BMC description 649 650 Frederic reported that Zaius was failing with a NULL dereference when 651 trying to initialise IPMI HIOMAP. It turns out that the BMC wasn't 652 described at all, so add a description. 653 654p9dsu platform 655^^^^^^^^^^^^^^ 656- p9dsu: Fix p9dsu default variant 657 658 Add the default when no riser_id is returned from the ipmi query. 659 660 Allow a little more time for BMC reply and cleanup some label strings. 661 662 663PCIe 664---- 665 666See :ref:`skiboot-6.3-rc1-power9` for POWER9 specific PCIe changes. 667 668- core/pcie-slot: Don't bail early in the power on case 669 670 Exiting early in the power off case makes sense since we can't disable 671 slot power (or assert PERST) for suprise hotplug slots. However, we 672 should not exit early in the power-on case since it's possible slot 673 power may have been disabled (or just not enabled at boot time). 674- firenze-pci: Always init slot info from LXVPD 675 676 We can slot information from the LXVPD without having power control 677 information about that slot. This patch changes the init path so that 678 we always override the add_properties() call rather than only when we 679 have power control information about the slot. 680- fsp/lxvpd: Print more LXVPD slot information 681 682 Useful to know since it changes the behaviour of the slot core. 683- core/pcie-slot: Set power state from the PWRCTL flag 684 685 For some reason we look at the power control indicator and use that to 686 determine if the slot is "off" rather than the power control flag that 687 is used to power down the slot. 688 689 While we're here change the default behaviour so that the slot is 690 assumed to be powered on if there's no slot capability, or if there's 691 no power control available. 692- core/pci: Increase the max slot string size 693 694 The maximum string length for the slot label / device location code in 695 the PCI summary is currently 32 characters. This results in some IBM 696 location codes being truncated due to their length, e.g. :: 697 698 PHB#0001:02:11.0 [SWDN] SLOT=C11 x8 699 PHB#0001:13:00.0 [EP ] *snip* LOC_CODE=U78D3.ND1.WZS004A-P1-C 700 PHB#0001:13:00.1 [EP ] *snip* LOC_CODE=U78D3.ND1.WZS004A-P1-C 701 PHB#0001:13:00.2 [EP ] *snip* LOC_CODE=U78D3.ND1.WZS004A-P1-C 702 PHB#0001:13:00.3 [EP ] *snip* LOC_CODE=U78D3.ND1.WZS004A-P1-C 703 704 Which obscure the actual location of the card, and it looks bad. This 705 patch increases the maximum length of the label string to 80 characters 706 since that's the maximum length for a location code. 707 708 709 710.. _skiboot-6.3-rc1-OpenCAPI: 711 712OpenCAPI 713-------- 714- npu2/hw-procedures: Fix parallel zcal for opencapi 715 716 For opencapi, we currently do impedance calibration when initializing 717 the PHY for the device, which could run in parallel if we have 718 multiple opencapi devices. But if 2 devices are on the same 719 obus, the 2 calibration sequences could overlap, which likely yields 720 bad results and is useless anyway since it only needs to be done once 721 per obus. 722 723 This patch splits the opencapi PHY reset in 2 parts: 724 725 - a 'init' part called serially at boot. That's when zcal is done. If 726 we have 2 devices on the same socket, the zcal won't be redone, 727 since we're called serially and we'll see it has already be done for 728 the obus 729 - a 'reset' part called during fundamental reset as a prereq for link 730 training. It does the PHY setup for a set of lanes and the dccal. 731 732 The PHY team confirmed there's no dependency between zcal and the 733 other reset steps and it can be moved earlier. 734- npu2-hw-procedures: Fix zcal in mixed opencapi and nvlink mode 735 736 The zcal procedure needs to be run once per obus. We keep track of 737 which obus is already calibrated in an array indexed by the obus 738 number. However, the obus number is inferred from the brick index, 739 which works well for nvlink but not for opencapi. 740 741 Create an obus_index() function, which, from a device, returns the 742 correct obus index, irrespective of the device type. 743- npu2-opencapi: Fix adapter reset when using 2 adapters 744 745 If two opencapi adapters are on the same obus, we may try to train the 746 two links in parallel at boot time, when all the PCI links are being 747 trained. Both links use the same i2c controller to handle the reset 748 signal, so some care is needed to make sure resetting one doesn't 749 interfere with the reset of the other. We need to keep track of the 750 current state of the i2c controller (and use locking). 751 752 This went mostly unnoticed as you need to have 2 opencapi cards on the 753 same socket and links tended to train anyway because of the retries. 754- npu2-opencapi: Extend delay after releasing reset on adapter 755 756 Give more time to the FPGA to process the reset signal. The previous 757 delay, 5ms, is too short for newer adapters with bigger FPGAs. Extend 758 it to 250ms. 759 Ultimately, that delay will likely end up being added to the opencapi 760 specification, but we are not there yet. 761- npu2-opencapi: ODL should be in reset when enabled 762 763 We haven't hit any problem so far, but from the ODL designer, the ODL 764 should be in reset when it is enabled. 765 766 The ODL remains in reset until we start a fundamental reset to 767 initiate link training. We still assert and deassert the ODL reset 768 signal as part of the normal procedure just before training the 769 link. Asserting is therefore useless at boot, since the ODL is already 770 in reset, but we keep it as it's only a scom write and it's needed 771 when we reset/retrain from the OS. 772- npu2-opencapi: Keep ODL and adapter in reset at the same time 773 774 Split the function to assert and deassert the reset signal on the ODL, 775 so that we can keep the ODL in reset while we reset the adapter, 776 therefore having a window where both sides are in reset. 777 778 It is actually not required with our current DLx at boot time, but I 779 need to split the ODL reset function for the following patch and it 780 will become useful/required later when we introduce resetting an 781 opencapi link from the OS. 782- npu2-opencapi: Setup perf counters to detect CRC errors 783 784 It's possible to set up performance counters for the PLL to detect 785 various conditions for the links in nvlink or opencapi mode. Since 786 those counters are currently unused, let's configure them when an obus 787 is in opencapi mode to detect CRC errors on the link. Each link has 788 two counters: 789 - CRC error detected by the host 790 - CRC error detected by the DLx (NAK received by the host) 791 792 We also dump the counters shortly after the link trains, but they can 793 be read multiple times through cronus, pdbg or linux. The counters are 794 configured to be reset after each read. 795 796NVLINK2 797------- 798- npu2: Allow ATSD for LPAR other than 0 799 800 Each XTS MMIO ATSD# register is accompanied by another register - 801 XTS MMIO ATSD0 LPARID# - which controls LPID filtering for ATSD 802 transactions. 803 804 When a host system passes a GPU through to a guest, we need to enable 805 some ATSD for an LPAR. At the moment the host assigns one ATSD to 806 a NVLink bridge and this maps it to an LPAR when GPU is assigned to 807 the LPAR. The link number is used for an ATSD index. 808 809 ATSD6&7 stay mapped to the host (LPAR=0) all the time which seems to be 810 acceptable price for the simplicity. 811- npu2: Add XTS_BDF_MAP wildcard refcount 812 813 Currently PID wildcard is programmed into the NPU once and never cleared 814 up. This works for the bare metal as MSR does not change while the host 815 OS is running. 816 817 However with the device virtualization, we need to keep track of wildcard 818 entries use and clear them up before switching a GPU from a host to 819 a guest or vice versa. 820 821 This adds refcount to a NPU2, one counter per wildcard entry. The index 822 is a short lparid (4 bits long) which is allocated in opal_npu_map_lpar() 823 and should be smaller than NPU2_XTS_BDF_MAP_SIZE (defined as 16). 824 825 826 827Debugging and simulation 828------------------------ 829 830- external/mambo: Error out if kernel is too large 831 832 If you're trying to boot a gigantic kernel in mambo (which you can 833 reproduce by building a kernel with CONFIG_MODULES=n) you'll get 834 misleading errors like: :: 835 836 WARNING: 0: (0): [0:0]: Invalid/unsupported instr 0x00000000[INVALID] 837 WARNING: 0: (0): PC(EA): 0x0000000030000010 PC(RA):0x0000000030000010 MSR: 0x9000000000000000 LR: 0x0000000000000000 838 WARNING: 0: (0): numInstructions = 0 839 WARNING: 1: (1): [0:0]: Invalid/unsupported instr 0x00000000[INVALID] 840 WARNING: 1: (1): PC(EA): 0x0000000000000E40 PC(RA):0x0000000000000E40 MSR: 0x9000000000000000 LR: 0x0000000000000000 841 WARNING: 1: (1): numInstructions = 1 842 WARNING: 1: (1): Interrupt to 0x0000000000000E40 from 0x0000000000000E40 843 INFO: 1: (2): ** Execution stopped: Continuous Interrupt, Instruction caused exception, ** 844 845 So add an error to skiboot.tcl to warn the user before this happens. 846 Making PAYLOAD_ADDR further back is one way to do this but if there's a 847 less gross way to generally work around this very niche problem, I can 848 suggest that instead. 849- external/mambo: Populate kernel-base-address in the DT 850 851 skiboot.tcl defines PAYLOAD_ADDR as 0x20000000, which is the default in 852 skiboot. This is also the default in skiboot unless kernel-base-address 853 is set in the device tree. 854 855 If you change PAYLOAD_ADDR to something else for mambo, skiboot won't 856 see it because it doesn't set that DT property, so fix it so that it does. 857- external/mambo: allow CPU targeting for most debug utils 858 859 Debug util functions target CPU 0:0:0 by default Some can be 860 overidden explicitly per invocation, and others can't at all. 861 Even for those that can be overidden, it is a pain to type 862 them out when you're debugging a particular thread. 863 864 Provide a new 'target' function that allows the default CPU 865 target to be changed. Wire that up that default to all other utils. 866 Provide a new 'S' step command which only steps the target CPU. 867- qemu: bt device isn't always hanging off / 868 869 Just use the normal for_each_compatible instead. 870 871 Otherwise in the qemu model as executed by op-test, 872 we wouldn't go down the astbmc_init() path, thus not having flash. 873- devicetree: Add p9-simics.dts 874 875 Add a p9-based devicetree that's suitable for use with Simics. 876- devicetree: Move power9-phb4.dts 877 878 Clean up the formatting of power9-phb4.dts and move it to 879 external/devicetree/p9.dts. This sets us up to include it as the basis 880 for other trees. 881- devicetree: Add nx node to power9-phb4.dts 882 883 A (non-qemu) p9 without an nx node will assert in p9_darn_init(): :: 884 885 dt_for_each_compatible(dt_root, nx, "ibm,power9-nx") 886 break; 887 if (!nx) { 888 if (!dt_node_is_compatible(dt_root, "qemu,powernv")) 889 assert(nx); 890 return; 891 } 892 893 Since NX is this essential, add it to the device tree. 894- devicetree: Fix typo in power9-phb4.dts 895 896 Change "impi" to "ipmi". 897- devicetree: Fix syntax error in power9-phb4.dts 898 899 Remove the extra space causing this: :: 900 901 Error: power9-phb4.dts:156.15-16 syntax error 902 FATAL ERROR: Unable to parse input tree 903- core/init: enable machine check on secondaries 904 905 Secondary CPUs currently run with MSR[ME]=0 during boot, whih means 906 if they take a machine check, the system will checkstop. 907 908 Enable ME where possible and allow them to print registers. 909 910Utilities 911--------- 912- pflash: Don't try update RO ToC 913 914 In the future it's likely the ToC will be marked as read-only. Don't 915 error out by assuming its writable. 916- pflash: Support encoding/decoding ECC'd partitions 917 918 With the new --ecc option, pflash can add/remove ECC when 919 reading/writing flash partitions protected by ECC. 920 921 This is *not* flawless with current PNORs out in the wild though, as 922 they do not typically fill the whole partition with valid ECC data, so 923 you have to know how big the valid ECC'd data is and specify the size 924 manually. Note that for some partitions this is pratically impossible 925 without knowing the details of the content of the partition. 926 927 A future patch is likely to introduce an option to "stop reading data 928 when ECC starts failing and assume everything is okay rather than error 929 out" to support reading the "valid" data from existing PNOR images. 930 931