.. _skiboot-6.0.20:

==============
skiboot-6.0.20
==============

skiboot 6.0.20 was released on Thursday May 9th, 2019. It replaces
:ref:`skiboot-6.0.19` as the current stable release in the 6.0.x series.

It is recommended that 6.0.20 be used instead of any previous 6.0.x version
due to the bug fixes it contains.

Bug fixes included in this release are:

- core/flash: Retry requests as necessary in flash_load_resource()

  We would like to successfully boot if we have a dependency on the BMC
  for flash even if the BMC is not currently ready to service flash
  requests. On the assumption that it will become ready, retry for several
  minutes to cover a BMC reboot cycle and *eventually* rather than
  *immediately* crash out with: ::

      [ 269.549748] reboot: Restarting system
      [ 390.297462587,5] OPAL: Reboot request...
      [ 390.297737995,5] RESET: Initiating fast reboot 1...
      [ 391.074707590,5] Clearing unused memory:
      [ 391.075198880,5] PCI: Clearing all devices...
      [ 391.075201618,7] Clearing region 201ffe000000-201fff800000
      [ 391.086235699,5] PCI: Resetting PHBs and training links...
      [ 391.254089525,3] FFS: Error 17 reading flash header
      [ 391.254159668,3] FLASH: Can't open ffs handle: 17
      [ 392.307245135,5] PCI: Probing slots...
      [ 392.363723191,5] PCI Summary:
      ...
      [ 393.423255262,5] OCC: All Chip Rdy after 0 ms
      [ 393.453092828,5] INIT: Starting kernel at 0x20000000, fdt at
      0x30800a88 390645 bytes
      [ 393.453202605,0] FATAL: Kernel is zeros, can't execute!
      [ 393.453247064,0] Assert fail: core/init.c:593:0
      [ 393.453289682,0] Aborting!
      CPU 0040 Backtrace:
       S: 0000000031e03ca0 R: 000000003001af60 ._abort+0x4c
       S: 0000000031e03d20 R: 000000003001afdc .assert_fail+0x34
       S: 0000000031e03da0 R: 00000000300146d8 .load_and_boot_kernel+0xb30
       S: 0000000031e03e70 R: 0000000030026cf0 .fast_reboot_entry+0x39c
       S: 0000000031e03f00 R: 0000000030002a4c fast_reset_entry+0x2c
      --- OPAL boot ---

  The OPAL flash API hooks directly into the blocklevel layer, so there's
  no delay for e.g. the host kernel, just for asynchronously loaded
  resources during boot.

- pci/iov: Remove skiboot VF tracking

  This feature was added a few years ago in response to a request to make
  the MaxPayloadSize (MPS) field of a Virtual Function match the MPS of the
  Physical Function that hosts it.

  The SR-IOV specification states that the MPS field of the VF is "ResvP".
  This indicates the VF will use whatever MPS is configured on the PF and
  that the field should be treated as a reserved field in the config space
  of the VF. In other words, an SR-IOV spec compliant VF should always return
  zero in the MPS field. Adding hacks in OPAL to make it non-zero is...
  misguided at best.

  Additionally, there is a bug in the way pci_device structures are handled
  by VFs that results in a crash on fast-reboot that occurs if VFs are
  enabled and then disabled prior to rebooting. This patch fixes the bug by
  removing the code entirely. This patch has no impact on SR-IOV support on
  the host operating system.

- hw/xscom: Enable sw xstop by default on p9

  This was disabled at some point during bringup to make life easier for
  the lab folks trying to debug NVLink issues. This hack really should
  have never made it out into the wild though, so we now have the
  following situation occurring in the field:

  1) Something bad happens.
  2) The host kernel receives an unrecoverable HMI and calls into OPAL to
     request a platform reboot.
  3) OPAL rejects the reboot attempt and returns to the kernel with
     OPAL_PARAMETER.
  4) Kernel panics and attempts to kexec into a kdump kernel.

  A side effect of the HMI seems to be CPUs becoming stuck, which results
  in the initialisation of the kdump kernel taking an extremely long time
  (6+ hours). It's also been observed that after performing a dump the
  kdump kernel then crashes itself because OPAL has ended up in a bad
  state as a side effect of the HMI.

  All up, it's not very good, so re-enable the software checkstop by
  default. If people still want to turn it off they can do so using the
  nvram override.

- opal/hmi: Initialize the hmi event with old value of TFMR.

  Do this before we fix TFAC errors. Otherwise the event at host console
  shows no thread error reported in TFMR register.

  Without this patch the console event shows TFMR with no thread error:
  (DEC parity error TFMR[59] injection) ::

      [ 53.737572] Severe Hypervisor Maintenance interrupt [Recovered]
      [ 53.737596] Error detail: Timer facility experienced an error
      [ 53.737611] HMER: 0840000000000000
      [ 53.737621] TFMR: 3212000870e04000

  After this patch it shows the old TFMR value on host console: ::

      [ 2302.267271] Severe Hypervisor Maintenance interrupt [Recovered]
      [ 2302.267305] Error detail: Timer facility experienced an error
      [ 2302.267320] HMER: 0840000000000000
      [ 2302.267330] TFMR: 3212000870e14010

- libflash/ipmi-hiomap: Fix blocks count issue

  We convert the data size to a block count and pass the block count to the
  BMC. If the data size is not block aligned then we end up sending a block
  count that covers less than the actual data, and the BMC will write only
  partial data to flash memory.

  Sample log ::

      [ 594.388458416,7] HIOMAP: Marked flash dirty at 0x42010 for 8
      [ 594.398756487,7] HIOMAP: Flushed writes
      [ 594.409596439,7] HIOMAP: Marked flash dirty at 0x42018 for 3970
      [ 594.419897507,7] HIOMAP: Flushed writes

  In this case HIOMAP sent data with block count=0 and hence BMC didn't
  flush data to flash.

  Let's fix this issue by adjusting the block count before sending it to
  the BMC.

- Fix hang in pnv_platform_error_reboot path due to TOD failure.

  On TOD failure, with TB stuck, when Linux heads down the
  pnv_platform_error_reboot() path due to an unrecoverable HMI event, the
  panic CPU gets stuck in OPAL inside ipmi_queue_msg_sync(). At this time,
  all other CPUs are in smp_handle_nmi_ipi() waiting for the panic CPU to
  proceed. But with the panic CPU stuck inside OPAL, Linux never recovers
  or reboots. ::

      p0 c1 t0
      NIA  : 0x000000003001dd3c <.time_wait+0x64>
      CFAR : 0x000000003001dce4 <.time_wait+0xc>
      MSR  : 0x9000000002803002
      LR   : 0x000000003002ecf8 <.ipmi_queue_msg_sync+0xec>

      STACK: SP          NIA
      0x0000000031c236e0 0x0000000031c23760 (big-endian)
      0x0000000031c23760 0x000000003002ecf8 <.ipmi_queue_msg_sync+0xec>
      0x0000000031c237f0 0x00000000300aa5f8 <.hiomap_queue_msg_sync+0x7c>
      0x0000000031c23880 0x00000000300aaadc <.hiomap_window_move+0x150>
      0x0000000031c23950 0x00000000300ab1d8 <.ipmi_hiomap_write+0xcc>
      0x0000000031c23a90 0x00000000300a7b18 <.blocklevel_raw_write+0xbc>
      0x0000000031c23b30 0x00000000300a7c34 <.blocklevel_write+0xfc>
      0x0000000031c23bf0 0x0000000030030be0 <.flash_nvram_write+0xd4>
      0x0000000031c23c90 0x000000003002c128 <.opal_write_nvram+0xd0>
      0x0000000031c23d20 0x00000000300051e4 <opal_entry+0x134>
      0xc000001fea6e7870 0xc0000000000a9060 <opal_nvram_write+0x80>
      0xc000001fea6e78c0 0xc000000000030b84 <nvram_write_os_partition+0x94>
      0xc000001fea6e7960 0xc0000000000310b0 <nvram_pstore_write+0xb0>
      0xc000001fea6e7990 0xc0000000004792d4 <pstore_dump+0x1d4>
      0xc000001fea6e7ad0 0xc00000000018a570 <kmsg_dump+0x140>
      0xc000001fea6e7b40 0xc000000000028e5c <panic_flush_kmsg_end+0x2c>
      0xc000001fea6e7b60 0xc0000000000a7168 <pnv_platform_error_reboot+0x68>
      0xc000001fea6e7bd0 0xc0000000000ac9b8 <hmi_event_handler+0x1d8>
      0xc000001fea6e7c80 0xc00000000012d6c8 <process_one_work+0x1b8>
      0xc000001fea6e7d20 0xc00000000012da28 <worker_thread+0x88>
      0xc000001fea6e7db0 0xc0000000001366f4 <kthread+0x164>
      0xc000001fea6e7e20 0xc00000000000b65c <ret_from_kernel_thread+0x5c>

  This is because there is a while loop towards the end of
  ipmi_queue_msg_sync() which keeps looping until "sync_msg" no longer
  matches "msg". It loops over time_wait_ms() until the exit condition is
  met. In the normal scenario time_wait_ms() runs the pollers so that the
  ipmi backend gets a chance to check the ipmi response and set sync_msg to
  NULL.

  .. code-block:: c

     while (sync_msg == msg)
             time_wait_ms(10);

  But when TB is in a failed state, time_wait_ms()->time_wait_poll()
  returns immediately without calling the pollers and hence we end up
  looping forever. This patch fixes this hang by calling opal_run_pollers()
  in the TB failed state as well.

- core/ipmi: Print correct netfn value

- core/lock: don't set bust_locks on lock error

  bust_locks is a big hammer that guarantees a mess if it's set while
  all other threads are not stopped.

  I propose removing this in the lock error paths. In debugging the
  previous deadlock false positive, none of the error messages printed,
  and the in-memory console was totally garbled due to lack of locking.

  I think it's generally better for debugging and system integrity to
  keep locks held when lock errors occur. Lock busting should be used
  carefully, just to allow messages to be printed out or the machine to be
  restarted, probably when the whole system is single-threaded.

  Skiboot is slowly working toward that being feasible with co-operative
  debug APIs between firmware and host, but for the time being it is
  better that difficult lock crashes do not corrupt everything by
  busting locks.
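
As an illustration of the libflash/ipmi-hiomap block count fix listed above,
the following is a minimal sketch and not the actual skiboot change. The
helper names and the 4 KiB block size (a shift of 12) are assumptions made
for this example. It contrasts a truncating byte-to-block conversion with one
that rounds up and accounts for where the data starts within its first block:

.. code-block:: c

   #include <stdint.h>

   /*
    * Hypothetical helper: truncating conversion. Any length shorter than
    * one block becomes a block count of zero.
    */
   static inline uint64_t bytes_to_blocks_trunc(uint64_t len,
                                                unsigned int shift)
   {
           return len >> shift;
   }

   /*
    * Hypothetical helper: number of blocks touched by len bytes starting
    * at byte offset pos, rounded up to whole blocks.
    */
   static inline uint64_t bytes_to_blocks_align_up(uint64_t pos,
                                                   uint64_t len,
                                                   unsigned int shift)
   {
           uint64_t block_size = (uint64_t)1 << shift;
           /* Offset of pos within its block */
           uint64_t delta = pos & (block_size - 1);

           return (delta + len + block_size - 1) >> shift;
   }

Under the assumed 4 KiB block size, the 8-byte write at 0x42010 from the
sample log truncates to a block count of zero, while the rounding variant
reports one block.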