.. _skiboot-5.4.8:

=============
skiboot-5.4.8
=============

skiboot-5.4.8 was released on Wednesday October 11th, 2017. It replaces
:ref:`skiboot-5.4.7` as the current stable release in the 5.4.x series.

Over :ref:`skiboot-5.4.7`, we have a few bug fixes for FSP platforms:

- libflash/file: Handle short read()s and write()s correctly

  Previously, after a short read() or write() we neither moved the buffer
  along nor requested only the remaining amount.

- FSP/NVRAM: Handle "get vNVRAM statistics" command

  FSP sends MBOX command (cmd : 0xEB, subcmd : 0x05, mod : 0x00) to get vNVRAM
  statistics. OPAL doesn't maintain any such statistics. Hence return
  FSP_STATUS_INVALID_SUBCMD.

  Sample OPAL log: ::

    [16944.384670488,3] FSP: Unhandled message eb0500
    [16944.474110465,3] FSP: Unhandled message eb0500
    [16945.111280784,3] FSP: Unhandled message eb0500
    [16945.293393485,3] FSP: Unhandled message eb0500

- FSP/CONSOLE: Limit number of error logging

  Commit c8a7535f (FSP/CONSOLE: Workaround for unresponsive ipmi daemon, added
  in skiboot 5.4.6 and 5.7-rc1) added error logging when the buffer is full. In
  some corner cases the kernel may call this function multiple times, and we may
  end up logging the error again and again.

  This patch fixes it by generating the error log only once.

- FSP/CONSOLE: Fix fsp_console_write_buffer_space() call

  The kernel calls fsp_console_write_buffer_space() to check console buffer
  space availability. If there is enough buffer space to write data, the kernel
  then calls fsp_console_write() to write the actual data.

  In some extreme corner cases (like the one explained in commit c8a7535f) the
  console becomes full and this function returns 0 to the kernel (or the space
  available in the console buffer < next incoming data size). The kernel will
  continue retrying until it gets enough space, so we will start seeing RCU
  stalls.

  This patch keeps track of the previously available space. If the previous
  space is the same as the current space, there is not enough space in the
  console buffer to write the incoming data. This may be due to a very high
  rate of console writes combined with a slow response from the FSP, or because
  the FSP has stopped processing data (e.g. because the ipmi daemon died). At
  this point we start a timer with a timeout of SER_BUFFER_OUT_TIMEOUT
  (10 seconds). If the situation has not improved within 10 seconds, something
  has gone wrong, so we return OPAL_RESOURCE so that the kernel can drop the
  console write and continue.

- FSP/CONSOLE: Close SOL session during R/R

  Presently we are not closing SOL and FW console sessions during R/R. The host
  will continue to write to the SOL buffer during FSP R/R. If there is a heavy
  console write operation happening during FSP R/R (like running the ``top``
  command inside the console), then at some point the console buffer becomes
  full. fsp_console_write_buffer_space() returns 0 (or less than the space
  required to write the data) to the host. While one thread is busy writing to
  the console, if some other thread tries to write data to the console we may
  see RCU stalls (like below) in the kernel.

  kernel call trace: ::

    [ 2082.828363] INFO: rcu_sched detected stalls on CPUs/tasks: { 32} (detected by 16, t=6002 jiffies, g=23154, c=23153, q=254769)
    [ 2082.828365] Task dump for CPU 32:
    [ 2082.828368] kworker/32:3 R running task 0 4637 2 0x00000884
    [ 2082.828375] Workqueue: events dump_work_fn
    [ 2082.828376] Call Trace:
    [ 2082.828382] [c000000f1633fa00] [c00000000013b6b0] console_unlock+0x570/0x600 (unreliable)
    [ 2082.828384] [c000000f1633fae0] [c00000000013ba34] vprintk_emit+0x2f4/0x5c0
    [ 2082.828389] [c000000f1633fb60] [c00000000099e644] printk+0x84/0x98
    [ 2082.828391] [c000000f1633fb90] [c0000000000851a8] dump_work_fn+0x238/0x250
    [ 2082.828394] [c000000f1633fc60] [c0000000000ecb98] process_one_work+0x198/0x4b0
    [ 2082.828396] [c000000f1633fcf0] [c0000000000ed3dc] worker_thread+0x18c/0x5a0
    [ 2082.828399] [c000000f1633fd80] [c0000000000f4650] kthread+0x110/0x130
    [ 2082.828403] [c000000f1633fe30] [c000000000009674] ret_from_kernel_thread+0x5c/0x68

  Hence let's close SOL (and FW console) during FSP R/R.

- FSP/CONSOLE: Do not associate unavailable console

  Presently OPAL sends the associate/unassociate MBOX command for all
  FSP serial consoles (like the OPAL message below). We have to check
  whether the console is available before sending this message.

  OPAL log: ::

    [ 5013.227994012,7] FSP: Reassociating HVSI console 1
    [ 5013.227997540,7] FSP: Reassociating HVSI console 2

- FSP: Disable PSI link whenever FSP tells OPAL about impending Reset/Reload

  Commit 42d5d047 fixed the scenario where DPO has been initiated, but the FSP
  went into reset before the CEC power down came in. But this is a generic
  issue that can happen in the normal shutdown path as well.

  Hence disable the PSI link as soon as we detect an impending FSP R/R.

- fsp: return OPAL_BUSY_EVENT on failure sending FSP_CMD_POWERDOWN_NORM
  Also, return OPAL_BUSY_EVENT on failure sending FSP_CMD_REBOOT / DEEP_REBOOT.

  We had a race condition between FSP Reset/Reload and powering down
  the system from the host:

  Roughly:

  == ======================== ==========================================================
  #  FSP                      Host
  == ======================== ==========================================================
  1  Power on
  2                           Power on
  3  (inject EPOW)
  4  (trigger FSP R/R)
  5                           Processes EPOW event, starts shutting down
  6                           calls OPAL_CEC_POWER_DOWN
  7  (is still in R/R)
  8                           gets OPAL_INTERNAL_ERROR, spins in opal_poll_events
  9  (FSP comes back)
  10                          spinning in opal_poll_events
  11 (thinks host is running)
  == ======================== ==========================================================

  The call to OPAL_CEC_POWER_DOWN is only made once as the reset/reload
  error path for fsp_sync_msg() is to return -1, which means we give
  the OS OPAL_INTERNAL_ERROR, which is fine, except that our own API
  docs give us the opportunity to return OPAL_BUSY when trying again
  later may be successful, and we're ambiguous as to whether you should
  retry on OPAL_INTERNAL_ERROR.

  For reference, the linux code looks like this: ::

    static void __noreturn pnv_power_off(void)
    {
            long rc = OPAL_BUSY;

            pnv_prepare_going_down();

            while (rc == OPAL_BUSY || rc == OPAL_BUSY_EVENT) {
                    rc = opal_cec_power_down(0);
                    if (rc == OPAL_BUSY_EVENT)
                            opal_poll_events(NULL);
                    else
                            mdelay(10);
            }
            for (;;)
                    opal_poll_events(NULL);
    }

  Which means that *practically* our only option is to return OPAL_BUSY
  or OPAL_BUSY_EVENT.

  We choose OPAL_BUSY_EVENT for FSP systems as we do want to ensure we're
  running pollers to communicate with the FSP and do the final bits of
  Reset/Reload handling before we power off the system.
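
As a rough illustration of the short read()/write() handling mentioned in the
libflash/file item above, the sketch below loops until the full request has
been satisfied, advancing the buffer and asking only for the remaining amount
each time. The helper names ``read_full`` and ``write_full`` are hypothetical
and are not the libflash functions; the only APIs assumed are POSIX read() and
write().

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (not the libflash function): keep calling read()
 * until len bytes have arrived, end of file is hit, or a real error occurs.
 * Returns the number of bytes read, or -1 on error.
 */
static ssize_t read_full(int fd, void *buf, size_t len)
{
	char *p = buf;
	size_t done = 0;

	assert(buf != NULL || len == 0);

	while (done < len) {
		/* Advance the buffer and ask only for what is still missing. */
		ssize_t rc = read(fd, p + done, len - done);

		if (rc < 0) {
			if (errno == EINTR)
				continue;	/* interrupted, retry */
			return -1;
		}
		if (rc == 0)
			break;			/* end of file */
		done += (size_t)rc;
	}
	return (ssize_t)done;
}

/*
 * Hypothetical counterpart for write(): returns the number of bytes
 * written (normally len), or -1 on error.
 */
static ssize_t write_full(int fd, const void *buf, size_t len)
{
	const char *p = buf;
	size_t done = 0;

	assert(buf != NULL || len == 0);

	while (done < len) {
		ssize_t rc = write(fd, p + done, len - done);

		if (rc < 0) {
			if (errno == EINTR)
				continue;	/* interrupted, retry */
			return -1;
		}
		if (rc == 0)
			break;			/* no progress, report a short count */
		done += (size_t)rc;
	}
	return (ssize_t)done;
}
```

A caller would compare the return value against the requested length to tell a
complete transfer from end of file or an error.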