.. _skiboot-6.2.4:

=============
skiboot-6.2.4
=============

skiboot 6.2.4 was released on Thursday May 9th, 2019. It replaces
:ref:`skiboot-6.2.3` as the current stable release in the 6.2.x series.

It is recommended that 6.2.4 be used instead of any previous 6.2.x version
due to the bug fixes it contains.

Bug fixes included in this release are:

- core/flash: Retry requests as necessary in flash_load_resource()

  We would like to successfully boot if we have a dependency on the BMC
  for flash even if the BMC is not currently ready to service flash
  requests. On the assumption that it will become ready, retry for several
  minutes to cover a BMC reboot cycle and *eventually* rather than
  *immediately* crash out with: ::

      [ 269.549748] reboot: Restarting system
      [ 390.297462587,5] OPAL: Reboot request...
      [ 390.297737995,5] RESET: Initiating fast reboot 1...
      [ 391.074707590,5] Clearing unused memory:
      [ 391.075198880,5] PCI: Clearing all devices...
      [ 391.075201618,7] Clearing region 201ffe000000-201fff800000
      [ 391.086235699,5] PCI: Resetting PHBs and training links...
      [ 391.254089525,3] FFS: Error 17 reading flash header
      [ 391.254159668,3] FLASH: Can't open ffs handle: 17
      [ 392.307245135,5] PCI: Probing slots...
      [ 392.363723191,5] PCI Summary:
      ...
      [ 393.423255262,5] OCC: All Chip Rdy after 0 ms
      [ 393.453092828,5] INIT: Starting kernel at 0x20000000, fdt at 0x30800a88 390645 bytes
      [ 393.453202605,0] FATAL: Kernel is zeros, can't execute!
      [ 393.453247064,0] Assert fail: core/init.c:593:0
      [ 393.453289682,0] Aborting!
      CPU 0040 Backtrace:
       S: 0000000031e03ca0 R: 000000003001af60 ._abort+0x4c
       S: 0000000031e03d20 R: 000000003001afdc .assert_fail+0x34
       S: 0000000031e03da0 R: 00000000300146d8 .load_and_boot_kernel+0xb30
       S: 0000000031e03e70 R: 0000000030026cf0 .fast_reboot_entry+0x39c
       S: 0000000031e03f00 R: 0000000030002a4c fast_reset_entry+0x2c
       --- OPAL boot ---

  The OPAL flash API hooks directly into the blocklevel layer, so there's
  no delay for e.g. the host kernel, just for asynchronously loaded
  resources during boot.

- pci/iov: Remove skiboot VF tracking

  This feature was added a few years ago in response to a request to make
  the MaxPayloadSize (MPS) field of a Virtual Function match the MPS of the
  Physical Function that hosts it.

  The SR-IOV specification states that the MPS field of the VF is "ResvP".
  This indicates the VF will use whatever MPS is configured on the PF and
  that the field should be treated as a reserved field in the config space
  of the VF. In other words, an SR-IOV spec compliant VF should always return
  zero in the MPS field. Adding hacks in OPAL to make it non-zero is...
  misguided at best.

  Additionally, there is a bug in the way pci_device structures are handled
  by VFs that results in a crash on fast-reboot if VFs are enabled and then
  disabled prior to rebooting. This patch fixes the bug by removing the code
  entirely. This patch has no impact on SR-IOV support on the host operating
  system.

- astbmc: Handle failure to initialise raw flash

  Initialising raw flash led to a dead assignment to rc. Check the return
  code and take the failure path as necessary. Both before and after the
  fix we see output along the lines of the following when flash_init()
  fails: ::

      [ 53.283182881,7] IRQ: Registering 0800..0ff7 ops @0x300d4b98 (data 0x3052b9d8)
      [ 53.283184335,7] IRQ: Registering 0ff8..0fff ops @0x300d4bc8 (data 0x3052b9d8)
      [ 53.283185513,7] PHB#0000: Initializing PHB...
      [ 53.288260827,4] FLASH: Can't load resource id:0. No system flash found
      [ 53.288354442,4] FLASH: Can't load resource id:1. No system flash found
      [ 53.342933439,3] CAPP: Error loading ucode lid. index=200ea
      [ 53.462749486,2] NVRAM: Failed to load
      [ 53.462819095,2] NVRAM: Failed to load
      [ 53.462894236,2] NVRAM: Failed to load
      [ 53.462967071,2] NVRAM: Failed to load
      [ 53.463033077,2] NVRAM: Failed to load
      [ 53.463144847,2] NVRAM: Failed to load

  Eventually followed by: ::

      [ 57.216942479,5] INIT: platform wait for kernel load failed
      [ 57.217051132,5] INIT: Assuming kernel at 0x20000000
      [ 57.217127508,3] INIT: ELF header not found. Assuming raw binary.
      [ 57.217249886,2] NVRAM: Failed to load
      [ 57.221294487,0] FATAL: Kernel is zeros, can't execute!
      [ 57.221397429,0] Assert fail: core/init.c:615:0
      [ 57.221471414,0] Aborting!
      CPU 0028 Backtrace:
       S: 0000000031d43c60 R: 000000003001b274 ._abort+0x4c
       S: 0000000031d43ce0 R: 000000003001b2f0 .assert_fail+0x34
       S: 0000000031d43d60 R: 0000000030014814 .load_and_boot_kernel+0xae4
       S: 0000000031d43e30 R: 0000000030015164 .main_cpu_entry+0x680
       S: 0000000031d43f00 R: 0000000030002718 boot_entry+0x1c0
       --- OPAL boot ---

  Analysis of the execution paths suggests we'll always "safely" end this
  way due to the setup sequence for the blocklevel callbacks in flash_init()
  and error handling in blocklevel_get_info(), and there's no current risk
  of executing from unexpected memory locations. As such the issue is
  reduced down to a fix for poor error hygiene in the original change
  and a resolution for a Coverity warning (famous last words etc).

- hw/xscom: Enable sw xstop by default on p9

  This was disabled at some point during bringup to make life easier for
  the lab folks trying to debug NVLink issues. This hack really should
  never have made it out into the wild though, so we now have the
  following situation occurring in the field:

  1) Something bad happens
  2) The host kernel receives an unrecoverable HMI and calls into OPAL to
     request a platform reboot.
  3) OPAL rejects the reboot attempt and returns to the kernel with
     OPAL_PARAMETER.
  4) Kernel panics and attempts to kexec into a kdump kernel.

  A side effect of the HMI seems to be CPUs becoming stuck, which results
  in the initialisation of the kdump kernel taking an extremely long time
  (6+ hours). It's also been observed that after performing a dump the
  kdump kernel then crashes itself because OPAL has ended up in a bad
  state as a side effect of the HMI.

  All up, it's not very good, so re-enable the software checkstop by
  default. If people still want to turn it off they can do so using the
  nvram override.

- opal/hmi: Initialize the hmi event with old value of TFMR.

  Do this before we fix TFAC errors. Otherwise the event at the host console
  shows no thread error reported in the TFMR register.

  Without this patch the console event shows TFMR with no thread error:
  (DEC parity error TFMR[59] injection) ::

      [ 53.737572] Severe Hypervisor Maintenance interrupt [Recovered]
      [ 53.737596] Error detail: Timer facility experienced an error
      [ 53.737611] HMER: 0840000000000000
      [ 53.737621] TFMR: 3212000870e04000

  After this patch it shows the old TFMR value on the host console: ::

      [ 2302.267271] Severe Hypervisor Maintenance interrupt [Recovered]
      [ 2302.267305] Error detail: Timer facility experienced an error
      [ 2302.267320] HMER: 0840000000000000
      [ 2302.267330] TFMR: 3212000870e14010

- libflash/ipmi-hiomap: Fix blocks count issue

  We convert the data size to a block count and pass the block count to the
  BMC. If the data size is not block aligned then we end up sending a block
  count that covers less than the actual data, and the BMC writes only
  partial data to flash memory.

  Sample log ::

      [ 594.388458416,7] HIOMAP: Marked flash dirty at 0x42010 for 8
      [ 594.398756487,7] HIOMAP: Flushed writes
      [ 594.409596439,7] HIOMAP: Marked flash dirty at 0x42018 for 3970
      [ 594.419897507,7] HIOMAP: Flushed writes

  In this case HIOMAP sent data with block count=0 and hence the BMC didn't
  flush the data to flash.

  Let's fix this issue by adjusting the block count before sending it to
  the BMC.

- Fix hang in pnv_platform_error_reboot path due to TOD failure.

  On TOD failure, with the TB stuck, when Linux heads down the
  pnv_platform_error_reboot() path due to an unrecoverable HMI event, the
  panic CPU gets stuck in OPAL inside ipmi_queue_msg_sync(). At this point,
  all the other CPUs are in smp_handle_nmi_ipi() waiting for the panic CPU
  to proceed. But with the panic CPU stuck inside OPAL, Linux never recovers
  or reboots. ::

      p0 c1 t0
      NIA : 0x000000003001dd3c <.time_wait+0x64>
      CFAR : 0x000000003001dce4 <.time_wait+0xc>
      MSR : 0x9000000002803002
      LR : 0x000000003002ecf8 <.ipmi_queue_msg_sync+0xec>

      STACK: SP NIA
      0x0000000031c236e0 0x0000000031c23760 (big-endian)
      0x0000000031c23760 0x000000003002ecf8 <.ipmi_queue_msg_sync+0xec>
      0x0000000031c237f0 0x00000000300aa5f8 <.hiomap_queue_msg_sync+0x7c>
      0x0000000031c23880 0x00000000300aaadc <.hiomap_window_move+0x150>
      0x0000000031c23950 0x00000000300ab1d8 <.ipmi_hiomap_write+0xcc>
      0x0000000031c23a90 0x00000000300a7b18 <.blocklevel_raw_write+0xbc>
      0x0000000031c23b30 0x00000000300a7c34 <.blocklevel_write+0xfc>
      0x0000000031c23bf0 0x0000000030030be0 <.flash_nvram_write+0xd4>
      0x0000000031c23c90 0x000000003002c128 <.opal_write_nvram+0xd0>
      0x0000000031c23d20 0x00000000300051e4 <opal_entry+0x134>
      0xc000001fea6e7870 0xc0000000000a9060 <opal_nvram_write+0x80>
      0xc000001fea6e78c0 0xc000000000030b84 <nvram_write_os_partition+0x94>
      0xc000001fea6e7960 0xc0000000000310b0 <nvram_pstore_write+0xb0>
      0xc000001fea6e7990 0xc0000000004792d4 <pstore_dump+0x1d4>
      0xc000001fea6e7ad0 0xc00000000018a570 <kmsg_dump+0x140>
      0xc000001fea6e7b40 0xc000000000028e5c <panic_flush_kmsg_end+0x2c>
      0xc000001fea6e7b60 0xc0000000000a7168 <pnv_platform_error_reboot+0x68>
      0xc000001fea6e7bd0 0xc0000000000ac9b8 <hmi_event_handler+0x1d8>
      0xc000001fea6e7c80 0xc00000000012d6c8 <process_one_work+0x1b8>
      0xc000001fea6e7d20 0xc00000000012da28 <worker_thread+0x88>
      0xc000001fea6e7db0 0xc0000000001366f4 <kthread+0x164>
      0xc000001fea6e7e20 0xc00000000000b65c <ret_from_kernel_thread+0x5c>

  This is because there is a while loop towards the end of
  ipmi_queue_msg_sync() which keeps looping for as long as "sync_msg" still
  matches "msg". It loops over time_wait_ms() until the exit condition is
  met. In the normal scenario time_wait_ms() runs the pollers so that the
  ipmi backend gets a chance to check the ipmi response and set sync_msg to
  NULL.

  .. code-block:: c

     while (sync_msg == msg)
             time_wait_ms(10);

  But when the TB is in a failed state, time_wait_ms()->time_wait_poll()
  returns immediately without calling the pollers and hence we end up looping
  forever. This patch fixes this hang by calling opal_run_pollers() in the TB
  failed state as well.

- core/ipmi: Print correct netfn value

- libffs: Fix string truncation gcc warning.

  Use memcpy as other libffs functions do.
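
The arithmetic behind the libflash/ipmi-hiomap block count issue above can be
illustrated with a minimal sketch. This is not the skiboot code: the helper
names ``blocks_truncated()`` and ``blocks_covering()`` are hypothetical, and
the 4096-byte block size (a shift of 12) is an assumption made only for the
example.

.. code-block:: c

   #include <stdint.h>

   /* Assumed block size for illustration only: 1 << 12 = 4096 bytes. */
   #define EXAMPLE_BLOCK_SHIFT 12

   /*
    * Hypothetical helper: convert a byte length to a block count by
    * shifting alone. Any length shorter than one block becomes 0.
    */
   static inline uint32_t blocks_truncated(uint64_t len, unsigned int shift)
   {
           return (uint32_t)(len >> shift);
   }

   /*
    * Hypothetical helper: count every block touched by the byte range
    * [pos, pos + len), including partially covered first and last blocks.
    */
   static inline uint32_t blocks_covering(uint64_t pos, uint64_t len,
                                          unsigned int shift)
   {
           uint64_t mask = (1ULL << shift) - 1;

           return (uint32_t)(((pos & mask) + len + mask) >> shift);
   }

Under these assumptions, the two writes in the sample log (8 bytes at 0x42010
and 3970 bytes at 0x42018) both give a count of 0 with the truncating
conversion, whereas counting every touched block gives 1 for each.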