.. _skiboot-6.2.4:

=============
skiboot-6.2.4
=============

skiboot 6.2.4 was released on Thursday May 9th, 2019. It replaces
:ref:`skiboot-6.2.3` as the current stable release in the 6.2.x series.

It is recommended that 6.2.4 be used instead of any previous 6.2.x version
due to the bug fixes it contains.

Bug fixes included in this release are:

- core/flash: Retry requests as necessary in flash_load_resource()

  We would like to boot successfully even when we depend on the BMC for
  flash and the BMC is not currently ready to service flash requests. On
  the assumption that it will become ready, retry for several minutes to
  cover a BMC reboot cycle, and *eventually* rather than *immediately*
  crash out with: ::

      [  269.549748] reboot: Restarting system
      [  390.297462587,5] OPAL: Reboot request...
      [  390.297737995,5] RESET: Initiating fast reboot 1...
      [  391.074707590,5] Clearing unused memory:
      [  391.075198880,5] PCI: Clearing all devices...
      [  391.075201618,7] Clearing region 201ffe000000-201fff800000
      [  391.086235699,5] PCI: Resetting PHBs and training links...
      [  391.254089525,3] FFS: Error 17 reading flash header
      [  391.254159668,3] FLASH: Can't open ffs handle: 17
      [  392.307245135,5] PCI: Probing slots...
      [  392.363723191,5] PCI Summary:
      ...
      [  393.423255262,5] OCC: All Chip Rdy after 0 ms
      [  393.453092828,5] INIT: Starting kernel at 0x20000000, fdt at
      0x30800a88 390645 bytes
      [  393.453202605,0] FATAL: Kernel is zeros, can't execute!
      [  393.453247064,0] Assert fail: core/init.c:593:0
      [  393.453289682,0] Aborting!
      CPU 0040 Backtrace:
       S: 0000000031e03ca0 R: 000000003001af60   ._abort+0x4c
       S: 0000000031e03d20 R: 000000003001afdc   .assert_fail+0x34
       S: 0000000031e03da0 R: 00000000300146d8   .load_and_boot_kernel+0xb30
       S: 0000000031e03e70 R: 0000000030026cf0   .fast_reboot_entry+0x39c
       S: 0000000031e03f00 R: 0000000030002a4c   fast_reset_entry+0x2c
       --- OPAL boot ---

  The OPAL flash API hooks directly into the blocklevel layer, so the
  delay applies only to resources loaded asynchronously during boot, not
  to requests from e.g. the host kernel. A sketch of the retry idea is
  shown below.
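
  The following is a minimal sketch of the retry pattern described above,
  not the actual flash_load_resource() code; try_load_resource(),
  flash_error_is_transient() and the five minute budget are illustrative
  assumptions.

  .. code-block:: c

      static int load_resource_with_retry(uint32_t id, void *buf, size_t *len)
      {
          /* Budget enough time to cover a BMC reboot cycle. */
          unsigned long deadline = mftb() + msecs_to_tb(5 * 60 * 1000);
          int rc;

          for (;;) {
              rc = try_load_resource(id, buf, len);   /* hypothetical helper */
              if (rc == OPAL_SUCCESS || !flash_error_is_transient(rc))
                  break;
              if (tb_compare(mftb(), deadline) == TB_AAFTERB)
                  break;                  /* give up, BMC never came back */
              time_wait_ms(1000);         /* runs pollers while we wait */
          }
          return rc;
      }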

- pci/iov: Remove skiboot VF tracking

  This feature was added a few years ago in response to a request to make
  the MaxPayloadSize (MPS) field of a Virtual Function match the MPS of the
  Physical Function that hosts it.

  The SR-IOV specification states that the MPS field of the VF is "RsvdP".
  This indicates the VF will use whatever MPS is configured on the PF and
  that the field should be treated as a reserved field in the config space
  of the VF. In other words, an SR-IOV spec compliant VF should always
  return zero in the MPS field. Adding hacks in OPAL to make it non-zero
  is... misguided at best.

  Additionally, there is a bug in the way pci_device structures are handled
  for VFs that results in a crash on fast-reboot when VFs are enabled and
  then disabled prior to rebooting. This patch fixes the bug by removing
  the code entirely. This patch has no impact on SR-IOV support in the
  host operating system.

- astbmc: Handle failure to initialise raw flash

  Initialising raw flash led to a dead assignment to rc. Check the return
  code and take the failure path as necessary. Both before and after the
  fix we see output along the lines of the following when flash_init()
  fails: ::

    [   53.283182881,7] IRQ: Registering 0800..0ff7 ops @0x300d4b98 (data 0x3052b9d8)
    [   53.283184335,7] IRQ: Registering 0ff8..0fff ops @0x300d4bc8 (data 0x3052b9d8)
    [   53.283185513,7] PHB#0000: Initializing PHB...
    [   53.288260827,4] FLASH: Can't load resource id:0. No system flash found
    [   53.288354442,4] FLASH: Can't load resource id:1. No system flash found
    [   53.342933439,3] CAPP: Error loading ucode lid. index=200ea
    [   53.462749486,2] NVRAM: Failed to load
    [   53.462819095,2] NVRAM: Failed to load
    [   53.462894236,2] NVRAM: Failed to load
    [   53.462967071,2] NVRAM: Failed to load
    [   53.463033077,2] NVRAM: Failed to load
    [   53.463144847,2] NVRAM: Failed to load

  Eventually followed by: ::

    [   57.216942479,5] INIT: platform wait for kernel load failed
    [   57.217051132,5] INIT: Assuming kernel at 0x20000000
    [   57.217127508,3] INIT: ELF header not found. Assuming raw binary.
    [   57.217249886,2] NVRAM: Failed to load
    [   57.221294487,0] FATAL: Kernel is zeros, can't execute!
    [   57.221397429,0] Assert fail: core/init.c:615:0
    [   57.221471414,0] Aborting!
    CPU 0028 Backtrace:
     S: 0000000031d43c60 R: 000000003001b274   ._abort+0x4c
     S: 0000000031d43ce0 R: 000000003001b2f0   .assert_fail+0x34
     S: 0000000031d43d60 R: 0000000030014814   .load_and_boot_kernel+0xae4
     S: 0000000031d43e30 R: 0000000030015164   .main_cpu_entry+0x680
     S: 0000000031d43f00 R: 0000000030002718   boot_entry+0x1c0
     --- OPAL boot ---

  Analysis of the execution paths suggests we'll always "safely" end this
  way due to the setup sequence for the blocklevel callbacks in flash_init()
  and the error handling in blocklevel_get_info(), and there's no current
  risk of executing from unexpected memory locations. As such, the issue is
  reduced to a fix for poor error hygiene in the original change and a
  resolution for a Coverity warning (famous last words etc). A sketch of
  the pattern is shown below.
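
  The following is a hedged sketch of the pattern only; flash_init_raw()
  and platform_flash_setup() are hypothetical stand-ins for the platform's
  raw flash initialisation path, not the real skiboot functions.

  .. code-block:: c

      static int platform_flash_setup(void)
      {
          int rc;

          /* Check the return code rather than leaving a dead assignment. */
          rc = flash_init_raw();          /* hypothetical initialisation call */
          if (rc) {
              prerror("PLAT: Raw flash initialisation failed: %d\n", rc);
              return rc;                  /* take the failure path */
          }
          return OPAL_SUCCESS;
      }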

- hw/xscom: Enable sw xstop by default on p9

  This was disabled at some point during bringup to make life easier for
  the lab folks trying to debug NVLink issues. This hack really should
  never have made it out into the wild though, so we now have the
  following situation occurring in the field:

   1) Something bad happens.
   2) The host kernel receives an unrecoverable HMI and calls into OPAL to
      request a platform reboot.
   3) OPAL rejects the reboot attempt and returns to the kernel with
      OPAL_PARAMETER.
   4) The kernel panics and attempts to kexec into a kdump kernel.

  A side effect of the HMI seems to be CPUs becoming stuck, which results
  in the initialisation of the kdump kernel taking an extremely long time
  (6+ hours). It's also been observed that after performing a dump the
  kdump kernel then crashes itself because OPAL has ended up in a bad
  state as a side effect of the HMI.

  All up, it's not very good, so re-enable the software checkstop by
  default. If people still want to turn it off they can do so using the
  nvram override.

- opal/hmi: Initialize the hmi event with the old value of TFMR.

  Do this before we fix the TFAC errors. Otherwise the event reported on
  the host console shows a TFMR value with no thread error set.

  Without this patch the console event shows a TFMR value with no thread
  error (DEC parity error TFMR[59] injection): ::

    [   53.737572] Severe Hypervisor Maintenance interrupt [Recovered]
    [   53.737596]  Error detail: Timer facility experienced an error
    [   53.737611]  HMER: 0840000000000000
    [   53.737621]  TFMR: 3212000870e04000

  After this patch it shows the old TFMR value on the host console (a
  sketch of the capture-before-repair ordering follows the output): ::

    [ 2302.267271] Severe Hypervisor Maintenance interrupt [Recovered]
    [ 2302.267305]  Error detail: Timer facility experienced an error
    [ 2302.267320]  HMER: 0840000000000000
    [ 2302.267330]  TFMR: 3212000870e14010
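
  The ordering is the important part: capture TFMR into the event before
  any repair is attempted. This is a hedged sketch only;
  report_then_repair_tfmr() and handle_tfac_errors() are illustrative
  names, not the exact skiboot code.

  .. code-block:: c

      static void report_then_repair_tfmr(struct OpalHMIEvent *hmi_evt)
      {
          uint64_t tfmr = mfspr(SPR_TFMR);    /* read the faulty value first */

          hmi_evt->tfmr = cpu_to_be64(tfmr);  /* this is what the OS reports */
          handle_tfac_errors(tfmr);           /* only now clear the TFAC bits */
      }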

- libflash/ipmi-hiomap: Fix blocks count issue

  We convert the data size to a block count and pass the block count to
  the BMC. If the data size is not block aligned then we end up sending a
  block count that covers less than the actual data, and the BMC writes
  only partial data to the flash.

  Sample log: ::

    [  594.388458416,7] HIOMAP: Marked flash dirty at 0x42010 for 8
    [  594.398756487,7] HIOMAP: Flushed writes
    [  594.409596439,7] HIOMAP: Marked flash dirty at 0x42018 for 3970
    [  594.419897507,7] HIOMAP: Flushed writes

  In this case HIOMAP sent the data with a block count of 0 and hence the
  BMC didn't flush the data to flash.

  Fix this by rounding the block count up before sending it to the BMC,
  as sketched below.
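
  A hedged sketch of the rounding, assuming the usual power-of-two block
  size convention; blocks_covering() and its parameter names are
  illustrative, not the exact ipmi-hiomap code.

  .. code-block:: c

      /* Number of blocks needed to cover [pos, pos + len). */
      static uint16_t blocks_covering(uint32_t pos, uint32_t len,
                                      uint8_t block_size_shift)
      {
          uint32_t block_size = 1u << block_size_shift;
          uint32_t start = pos & ~(block_size - 1);        /* round down */
          uint32_t end = pos + len + block_size - 1;

          end &= ~(block_size - 1);                        /* round up */
          return (end - start) >> block_size_shift;
      }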

- Fix hang in pnv_platform_error_reboot path due to TOD failure.

  On a TOD failure with the timebase (TB) stuck, when Linux heads down the
  pnv_platform_error_reboot() path due to an unrecoverable HMI event, the
  panicking CPU gets stuck in OPAL inside ipmi_queue_msg_sync(). At this
  point all the other CPUs are in smp_handle_nmi_ipi() waiting for the
  panicking CPU to proceed, but with it stuck inside OPAL, Linux never
  recovers or reboots. ::

    p0 c1 t0
    NIA : 0x000000003001dd3c <.time_wait+0x64>
    CFAR : 0x000000003001dce4 <.time_wait+0xc>
    MSR : 0x9000000002803002
    LR : 0x000000003002ecf8 <.ipmi_queue_msg_sync+0xec>

    STACK: SP NIA
    0x0000000031c236e0 0x0000000031c23760 (big-endian)
    0x0000000031c23760 0x000000003002ecf8 <.ipmi_queue_msg_sync+0xec>
    0x0000000031c237f0 0x00000000300aa5f8 <.hiomap_queue_msg_sync+0x7c>
    0x0000000031c23880 0x00000000300aaadc <.hiomap_window_move+0x150>
    0x0000000031c23950 0x00000000300ab1d8 <.ipmi_hiomap_write+0xcc>
    0x0000000031c23a90 0x00000000300a7b18 <.blocklevel_raw_write+0xbc>
    0x0000000031c23b30 0x00000000300a7c34 <.blocklevel_write+0xfc>
    0x0000000031c23bf0 0x0000000030030be0 <.flash_nvram_write+0xd4>
    0x0000000031c23c90 0x000000003002c128 <.opal_write_nvram+0xd0>
    0x0000000031c23d20 0x00000000300051e4 <opal_entry+0x134>
    0xc000001fea6e7870 0xc0000000000a9060 <opal_nvram_write+0x80>
    0xc000001fea6e78c0 0xc000000000030b84 <nvram_write_os_partition+0x94>
    0xc000001fea6e7960 0xc0000000000310b0 <nvram_pstore_write+0xb0>
    0xc000001fea6e7990 0xc0000000004792d4 <pstore_dump+0x1d4>
    0xc000001fea6e7ad0 0xc00000000018a570 <kmsg_dump+0x140>
    0xc000001fea6e7b40 0xc000000000028e5c <panic_flush_kmsg_end+0x2c>
    0xc000001fea6e7b60 0xc0000000000a7168 <pnv_platform_error_reboot+0x68>
    0xc000001fea6e7bd0 0xc0000000000ac9b8 <hmi_event_handler+0x1d8>
    0xc000001fea6e7c80 0xc00000000012d6c8 <process_one_work+0x1b8>
    0xc000001fea6e7d20 0xc00000000012da28 <worker_thread+0x88>
    0xc000001fea6e7db0 0xc0000000001366f4 <kthread+0x164>
    0xc000001fea6e7e20 0xc00000000000b65c <ret_from_kernel_thread+0x5c>

  This is because there is a while loop towards the end of
  ipmi_queue_msg_sync() which keeps looping until "sync_msg" no longer
  matches "msg". It loops over time_wait_ms() until the exit condition is
  met. In the normal scenario the time_wait_ms() calls run the pollers, so
  the IPMI backend gets a chance to check the IPMI response and set
  sync_msg to NULL.

  .. code-block:: c

          while (sync_msg == msg)
                  time_wait_ms(10);

  But when the TB is in a failed state, time_wait_ms()->time_wait_poll()
  returns immediately without calling the pollers, and hence we end up
  looping forever. This patch fixes the hang by calling opal_run_pollers()
  in the TB-failed state as well, along the lines of the sketch below.
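
  A hedged sketch of the idea, assuming a tb_invalid flag on the CPU
  structure marks the dead timebase; this is not the exact skiboot change.

  .. code-block:: c

      static void time_wait_poll(unsigned long duration)
      {
          if (this_cpu()->tb_invalid) {
              /*
               * We can't measure the delay with a dead timebase, but we
               * must still run the pollers so that synchronous callers
               * such as ipmi_queue_msg_sync() can make progress instead
               * of spinning forever.
               */
              opal_run_pollers();
              cpu_relax();
              return;
          }

          /* ... normal timebase-based wait, running pollers as we go ... */
      }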

- core/ipmi: Print correct netfn value

- libffs: Fix string truncation gcc warning.

  Use memcpy, as other libffs functions do; a hedged sketch of the
  pattern follows.
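
  FFS partition names are fixed-width fields that are not required to be
  NUL terminated, so a bounded memcpy avoids both truncation bugs and
  gcc's -Wstringop-truncation warning. This is a sketch of the pattern
  only; ffs_entry_name_copy() and FFS_PART_NAME_MAX_LEN are illustrative
  names rather than the exact libffs code.

  .. code-block:: c

      #define FFS_PART_NAME_MAX_LEN 15    /* illustrative field width */

      /*
       * src is assumed to be a fixed-width name field of at least
       * FFS_PART_NAME_MAX_LEN bytes; dst must hold one byte more and
       * comes back NUL terminated.
       */
      static void ffs_entry_name_copy(char *dst, const char *src)
      {
          memset(dst, 0, FFS_PART_NAME_MAX_LEN + 1);
          memcpy(dst, src, FFS_PART_NAME_MAX_LEN);
      }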