.. _skiboot-5.4.8:

=============
skiboot-5.4.8
=============

skiboot-5.4.8 was released on Wednesday October 11th, 2017. It replaces
:ref:`skiboot-5.4.7` as the current stable release in the 5.4.x series.

Over :ref:`skiboot-5.4.7`, we have a few bug fixes for FSP platforms:

- libflash/file: Handle short read()s and write()s correctly

  Before this fix we did not move the buffer along after a short read() or
  write(), nor did we request only the remaining amount when retrying.
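
  As a rough illustration of the pattern (not the actual libflash code), a
  read loop that copes with short reads could look like the sketch below.
  The helper name read_fully() is hypothetical; only POSIX read() is
  assumed: ::

    #include <errno.h>
    #include <stddef.h>
    #include <unistd.h>

    /*
     * Hypothetical helper: read exactly len bytes unless EOF or an error
     * occurs. Returns the number of bytes read, or -1 on error.
     */
    static ssize_t read_fully(int fd, void *buf, size_t len)
    {
            char *p = buf;
            size_t done = 0;

            while (done < len) {
                    /* Ask only for what is still missing, at the advanced offset */
                    ssize_t rc = read(fd, p + done, len - done);

                    if (rc < 0) {
                            if (errno == EINTR)
                                    continue;       /* interrupted, just retry */
                            return -1;
                    }
                    if (rc == 0)
                            break;                  /* EOF: report what we got */
                    done += (size_t)rc;
            }
            return (ssize_t)done;
    }

  The same shape applies to write(): advance the pointer by the number of
  bytes actually written and ask only for the remainder on the next call.
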
- FSP/NVRAM: Handle "get vNVRAM statistics" command

  The FSP sends an MBOX command (cmd: 0xEB, subcmd: 0x05, mod: 0x00) to get
  vNVRAM statistics. OPAL doesn't maintain any such statistics, hence we
  return FSP_STATUS_INVALID_SUBCMD.

  Sample OPAL log: ::

    [16944.384670488,3] FSP: Unhandled message eb0500
    [16944.474110465,3] FSP: Unhandled message eb0500
    [16945.111280784,3] FSP: Unhandled message eb0500
    [16945.293393485,3] FSP: Unhandled message eb0500

- FSP/CONSOLE: Limit number of error logging

  Commit c8a7535f (FSP/CONSOLE: Workaround for unresponsive ipmi daemon, added
  in skiboot 5.4.6 and 5.7-rc1) added error logging when the buffer is full. In
  some corner cases the kernel may call this function multiple times, and we may
  end up logging the same error again and again.

  This patch fixes it by generating the error log only once.

- FSP/CONSOLE: Fix fsp_console_write_buffer_space() call

  The kernel calls fsp_console_write_buffer_space() to check console buffer
  space availability. If there is enough buffer space to write data, the kernel
  then calls fsp_console_write() to write the actual data.

  In some extreme corner cases (like the one explained in commit c8a7535f) the
  console becomes full and this function returns 0 to the kernel (or the space
  available in the console buffer is smaller than the next incoming data). The
  kernel keeps retrying until it gets enough space, so we start seeing RCU
  stalls.

  This patch keeps track of the previously available space. If the previous
  space is the same as the current space, there is not enough space in the
  console buffer to write the incoming data. This may be due to very heavy
  console write activity and slow response from the FSP, or because the FSP has
  stopped processing data (e.g. because the ipmi daemon died). At this point we
  start a timer with a timeout of SER_BUFFER_OUT_TIMEOUT (10 secs). If the
  situation has not improved within 10 seconds, something has gone wrong, so we
  return OPAL_RESOURCE so that the kernel can drop the console write and
  continue.
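
  The decision described above can be sketched roughly as follows. This is a
  simplified model, not the skiboot implementation: struct space_tracker and
  check_buffer_space() are hypothetical names, time is passed in as plain
  seconds, and the numeric value of OPAL_RESOURCE is an assumption about
  opal-api.h: ::

    #include <stdbool.h>
    #include <stdint.h>

    #define OPAL_RESOURCE           (-10)   /* value assumed from opal-api.h */
    #define SER_BUFFER_OUT_TIMEOUT  10      /* seconds, as described above */

    /* Hypothetical per-console state for this sketch */
    struct space_tracker {
            int64_t prev_space;     /* space reported on the previous call */
            bool timer_running;     /* true while we wait for progress */
            uint64_t deadline;      /* time (in seconds) at which we give up */
    };

    /*
     * Return the space to report to the kernel, or OPAL_RESOURCE once the
     * buffer has made no progress for SER_BUFFER_OUT_TIMEOUT seconds.
     */
    static int64_t check_buffer_space(struct space_tracker *t, int64_t space,
                                      uint64_t now)
    {
            if (space != t->prev_space) {
                    /* Space changed, so data is still moving */
                    t->prev_space = space;
                    t->timer_running = false;
                    return space;
            }
            if (!t->timer_running) {
                    /* No progress since the last call: start the timer */
                    t->timer_running = true;
                    t->deadline = now + SER_BUFFER_OUT_TIMEOUT;
                    return space;
            }
            if (now >= t->deadline)
                    return OPAL_RESOURCE;   /* let the kernel drop the write */
            return space;
    }

  The sketch only models the "no change for 10 seconds" rule; any change in
  the reported space stops the timer and starts the tracking afresh.
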
- FSP/CONSOLE: Close SOL session during R/R

  Presently we are not closing SOL and FW console sessions during R/R. The host
  continues to write to the SOL buffer during FSP R/R. If there is heavy console
  write activity during FSP R/R (like running the ``top`` command inside the
  console), then at some point the console buffer becomes full.
  fsp_console_write_buffer_space() returns 0 (or less than the space required
  to write data) to the host. While one thread is busy writing to the console,
  if other threads try to write data to the console we may see RCU stalls (like
  below) in the kernel.

  Kernel call trace: ::

    [ 2082.828363] INFO: rcu_sched detected stalls on CPUs/tasks: { 32} (detected by 16, t=6002 jiffies, g=23154, c=23153, q=254769)
    [ 2082.828365] Task dump for CPU 32:
    [ 2082.828368] kworker/32:3    R  running task        0  4637      2 0x00000884
    [ 2082.828375] Workqueue: events dump_work_fn
    [ 2082.828376] Call Trace:
    [ 2082.828382] [c000000f1633fa00] [c00000000013b6b0] console_unlock+0x570/0x600 (unreliable)
    [ 2082.828384] [c000000f1633fae0] [c00000000013ba34] vprintk_emit+0x2f4/0x5c0
    [ 2082.828389] [c000000f1633fb60] [c00000000099e644] printk+0x84/0x98
    [ 2082.828391] [c000000f1633fb90] [c0000000000851a8] dump_work_fn+0x238/0x250
    [ 2082.828394] [c000000f1633fc60] [c0000000000ecb98] process_one_work+0x198/0x4b0
    [ 2082.828396] [c000000f1633fcf0] [c0000000000ed3dc] worker_thread+0x18c/0x5a0
    [ 2082.828399] [c000000f1633fd80] [c0000000000f4650] kthread+0x110/0x130
    [ 2082.828403] [c000000f1633fe30] [c000000000009674] ret_from_kernel_thread+0x5c/0x68

  Hence let's close SOL (and FW console) during FSP R/R.

- FSP/CONSOLE: Do not associate unavailable console

  Presently OPAL sends the associate/unassociate MBOX command for all
  FSP serial consoles (see the OPAL messages below). We have to check
  whether the console is available before sending this message.

  OPAL log: ::

    [ 5013.227994012,7] FSP: Reassociating HVSI console 1
    [ 5013.227997540,7] FSP: Reassociating HVSI console 2

- FSP: Disable PSI link whenever FSP tells OPAL about impending Reset/Reload

  Commit 42d5d047 fixed the scenario where DPO has been initiated, but the FSP
  went into reset before the CEC power down came in. But this is a generic
  issue that can happen in the normal shutdown path as well.

  Hence disable the PSI link as soon as we detect an impending FSP R/R.

- fsp: return OPAL_BUSY_EVENT on failure sending FSP_CMD_POWERDOWN_NORM

  Also, return OPAL_BUSY_EVENT on failure sending FSP_CMD_REBOOT / DEEP_REBOOT.

  We had a race condition between FSP Reset/Reload and powering down
  the system from the host:

  Roughly:

  == ======================== ==========================================================
  #  FSP                      Host
  == ======================== ==========================================================
  1  Power on
  2                           Power on
  3  (inject EPOW)
  4  (trigger FSP R/R)
  5                           Processes EPOW event, starts shutting down
  6                           calls OPAL_CEC_POWER_DOWN
  7  (is still in R/R)
  8                           gets OPAL_INTERNAL_ERROR, spins in opal_poll_events
  9  (FSP comes back)
  10                          spinning in opal_poll_events
  11 (thinks host is running)
  == ======================== ==========================================================

  The call to OPAL_CEC_POWER_DOWN is only made once because the reset/reload
  error path for fsp_sync_msg() returns -1, which means we give the OS
  OPAL_INTERNAL_ERROR. That is fine, except that our own API docs give us the
  opportunity to return OPAL_BUSY when trying again later may be successful,
  and they are ambiguous as to whether you should retry on
  OPAL_INTERNAL_ERROR.

  For reference, the Linux code looks like this: ::

    static void __noreturn pnv_power_off(void)
    {
            long rc = OPAL_BUSY;

            pnv_prepare_going_down();

            while (rc == OPAL_BUSY || rc == OPAL_BUSY_EVENT) {
                    rc = opal_cec_power_down(0);
                    if (rc == OPAL_BUSY_EVENT)
                            opal_poll_events(NULL);
                    else
                            mdelay(10);
            }
            for (;;)
                    opal_poll_events(NULL);
    }

  Which means that *practically* our only option is to return OPAL_BUSY
  or OPAL_BUSY_EVENT.

  We choose OPAL_BUSY_EVENT for FSP systems as we do want to ensure we're
  running pollers to communicate with the FSP and do the final bits of
  Reset/Reload handling before we power off the system.
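
  To make the effect of the return code concrete, here is a small
  self-contained model of the retry loop. Everything in it is a stand-in
  invented for illustration: fake_cec_power_down(), fake_poll_events() and
  host_power_off() are hypothetical, the model assumes that R/R handling only
  progresses when pollers run, and the numeric return codes are assumed to
  match opal-api.h: ::

    #include <stdbool.h>
    #include <stdint.h>

    /* Return codes; values assumed to match opal-api.h */
    #define OPAL_SUCCESS            0
    #define OPAL_BUSY               (-2)
    #define OPAL_INTERNAL_ERROR     (-11)
    #define OPAL_BUSY_EVENT         (-12)

    /* Hypothetical FSP state: polls left until R/R handling completes */
    static int rr_polls_left;

    /* Stand-in for opal_poll_events(): each poll advances R/R handling */
    static void fake_poll_events(void)
    {
            if (rr_polls_left > 0)
                    rr_polls_left--;
    }

    /* Stand-in for OPAL_CEC_POWER_DOWN with a configurable failure code */
    static int64_t fake_cec_power_down(int64_t failure_rc)
    {
            return rr_polls_left > 0 ? failure_rc : OPAL_SUCCESS;
    }

    /* Same retry shape as pnv_power_off(), bounded so the model terminates */
    static bool host_power_off(int64_t failure_rc, int max_tries)
    {
            int64_t rc = OPAL_BUSY;

            while (max_tries-- > 0 &&
                   (rc == OPAL_BUSY || rc == OPAL_BUSY_EVENT)) {
                    rc = fake_cec_power_down(failure_rc);
                    if (rc == OPAL_BUSY_EVENT)
                            fake_poll_events();
            }
            return rc == OPAL_SUCCESS;
    }

  In this model, a failure code of OPAL_INTERNAL_ERROR makes the loop exit
  after a single attempt without ever polling, while OPAL_BUSY_EVENT keeps the
  host polling until the power down request finally succeeds.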