.. _skiboot-6.0.20:

==============
skiboot-6.0.20
==============

skiboot 6.0.20 was released on Thursday May 9th, 2019. It replaces
:ref:`skiboot-6.0.19` as the current stable release in the 6.0.x series.

It is recommended that 6.0.20 be used instead of any previous 6.0.x version
due to the bug fixes it contains.

Bug fixes included in this release are:

- core/flash: Retry requests as necessary in flash_load_resource()

  We would like to successfully boot if we have a dependency on the BMC
  for flash even if the BMC is not currently ready to service flash
  requests. On the assumption that it will become ready, retry for several
  minutes to cover a BMC reboot cycle and *eventually* rather than
  *immediately* crash out with: ::

      [  269.549748] reboot: Restarting system
      [  390.297462587,5] OPAL: Reboot request...
      [  390.297737995,5] RESET: Initiating fast reboot 1...
      [  391.074707590,5] Clearing unused memory:
      [  391.075198880,5] PCI: Clearing all devices...
      [  391.075201618,7] Clearing region 201ffe000000-201fff800000
      [  391.086235699,5] PCI: Resetting PHBs and training links...
      [  391.254089525,3] FFS: Error 17 reading flash header
      [  391.254159668,3] FLASH: Can't open ffs handle: 17
      [  392.307245135,5] PCI: Probing slots...
      [  392.363723191,5] PCI Summary:
      ...
      [  393.423255262,5] OCC: All Chip Rdy after 0 ms
      [  393.453092828,5] INIT: Starting kernel at 0x20000000, fdt at 0x30800a88 390645 bytes
      [  393.453202605,0] FATAL: Kernel is zeros, can't execute!
      [  393.453247064,0] Assert fail: core/init.c:593:0
      [  393.453289682,0] Aborting!
      CPU 0040 Backtrace:
       S: 0000000031e03ca0 R: 000000003001af60   ._abort+0x4c
       S: 0000000031e03d20 R: 000000003001afdc   .assert_fail+0x34
       S: 0000000031e03da0 R: 00000000300146d8   .load_and_boot_kernel+0xb30
       S: 0000000031e03e70 R: 0000000030026cf0   .fast_reboot_entry+0x39c
       S: 0000000031e03f00 R: 0000000030002a4c   fast_reset_entry+0x2c
       --- OPAL boot ---

  The OPAL flash API hooks directly into the blocklevel layer, so there's
  no delay for e.g. the host kernel, just for asynchronously loaded
  resources during boot.
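
  The retry policy can be pictured with a minimal sketch. It is not the
  skiboot implementation: ``load_with_retry()``, ``SKETCH_NOT_READY``, the
  ``fake_bmc`` backend and the timing parameters are all hypothetical names
  chosen for illustration.

  .. code-block:: c

      /* Hypothetical status meaning "backend not ready yet, try again". */
      #define SKETCH_NOT_READY 1

      typedef int (*load_fn)(void *ctx);
      typedef void (*sleep_fn)(unsigned int ms);

      /*
       * Call try_load() until it returns something other than
       * SKETCH_NOT_READY, or until at least timeout_ms have been spent
       * sleeping between attempts.
       */
      static int load_with_retry(load_fn try_load, void *ctx, sleep_fn sleep_ms,
                                 unsigned int retry_ms, unsigned int timeout_ms)
      {
              unsigned int waited = 0;
              int rc;

              for (;;) {
                      rc = try_load(ctx);
                      if (rc != SKETCH_NOT_READY || waited >= timeout_ms)
                              return rc;
                      sleep_ms(retry_ms);
                      waited += retry_ms;
              }
      }

      /* Fake backend: refuses the first ready_after attempts, then succeeds. */
      struct fake_bmc {
              unsigned int attempts;
              unsigned int ready_after;
      };

      static int fake_load(void *ctx)
      {
              struct fake_bmc *bmc = ctx;

              return ++bmc->attempts > bmc->ready_after ? 0 : SKETCH_NOT_READY;
      }

      static void fake_sleep(unsigned int ms)
      {
              (void)ms; /* a real caller would actually wait here */
      }

  In this model a backend that becomes ready within the timeout is picked up
  on a later attempt, and only a backend that stays unavailable for the whole
  window returns the failure to the caller.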

- pci/iov: Remove skiboot VF tracking

  This feature was added a few years ago in response to a request to make
  the MaxPayloadSize (MPS) field of a Virtual Function match the MPS of the
  Physical Function that hosts it.

  The SR-IOV specification states that the MPS field of the VF is "ResvP".
  This indicates the VF will use whatever MPS is configured on the PF and
  that the field should be treated as a reserved field in the config space
  of the VF. In other words, an SR-IOV spec compliant VF should always return
  zero in the MPS field.  Adding hacks in OPAL to make it non-zero is...
  misguided at best.

  Additionally, there is a bug in the way pci_device structures are handled
  for VFs that results in a crash on fast-reboot if VFs are enabled and then
  disabled prior to rebooting. This patch fixes the bug by removing the code
  entirely. This patch has no impact on SR-IOV support on the host operating
  system.

- hw/xscom: Enable sw xstop by default on p9

  This was disabled at some point during bringup to make life easier for
  the lab folks trying to debug NVLink issues. This hack really should
  have never made it out into the wild though, so we now have the
  following situation occurring in the field:

   1) Something bad happens.
   2) The host kernel receives an unrecoverable HMI and calls into OPAL to
      request a platform reboot.
   3) OPAL rejects the reboot attempt and returns to the kernel with
      OPAL_PARAMETER.
   4) Kernel panics and attempts to kexec into a kdump kernel.

  A side effect of the HMI seems to be CPUs becoming stuck, which results
  in the initialisation of the kdump kernel taking an extremely long time
  (6+ hours). It's also been observed that after performing a dump the
  kdump kernel then crashes itself because OPAL has ended up in a bad
  state as a side effect of the HMI.

  All up, it's not very good, so re-enable the software checkstop by
  default. If people still want to turn it off they can do so using the
  nvram override.

- opal/hmi: Initialize the hmi event with old value of TFMR.

  Do this before we fix TFAC errors. Otherwise the event on the host
  console shows no thread error reported in the TFMR register.

  Without this patch the console event shows TFMR with no thread error
  (DEC parity error TFMR[59] injection): ::

    [   53.737572] Severe Hypervisor Maintenance interrupt [Recovered]
    [   53.737596]  Error detail: Timer facility experienced an error
    [   53.737611]  HMER: 0840000000000000
    [   53.737621]  TFMR: 3212000870e04000

  After this patch it shows the old TFMR value on the host console: ::

    [ 2302.267271] Severe Hypervisor Maintenance interrupt [Recovered]
    [ 2302.267305]  Error detail: Timer facility experienced an error
    [ 2302.267320]  HMER: 0840000000000000
    [ 2302.267330]  TFMR: 3212000870e14010
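
  The ordering matters because recovery clears the very bits the event is
  meant to report. The sketch below models that with a plain variable in
  place of the register; ``SKETCH_PPC_BIT()``, ``fake_tfmr``,
  ``fake_fix_errors()`` and ``handle_timer_hmi()`` are hypothetical names,
  not skiboot code. It assumes IBM bit numbering (bit 0 is the most
  significant bit), under which TFMR[59] is the mask ``0x10``.

  .. code-block:: c

      #include <stdint.h>

      /* IBM bit numbering: bit 0 is the most significant bit of 64. */
      #define SKETCH_PPC_BIT(n)       (UINT64_C(1) << (63 - (n)))
      #define SKETCH_TFMR_DEC_PARITY  SKETCH_PPC_BIT(59)

      struct hmi_event_sketch {
              uint64_t tfmr;
      };

      /* Stand-in for the hardware register. */
      static uint64_t fake_tfmr;

      /* Stand-in for the recovery step: clears the reported error bit. */
      static void fake_fix_errors(void)
      {
              fake_tfmr &= ~SKETCH_TFMR_DEC_PARITY;
      }

      static void handle_timer_hmi(struct hmi_event_sketch *evt)
      {
              /* Snapshot first, so the event still carries the error bit. */
              evt->tfmr = fake_tfmr;
              fake_fix_errors();
      }

  With the same mask, the value logged after the patch (``...e14010``) has
  the DEC parity bit set, while the value logged before the patch
  (``...e04000``) does not.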

- libflash/ipmi-hiomap: Fix blocks count issue

  We convert the data size to a block count and pass the block count to the
  BMC. If the data size is not block aligned then we end up sending a block
  count that covers less than the actual data, and the BMC writes only
  partial data to flash memory.

  Sample log ::

    [  594.388458416,7] HIOMAP: Marked flash dirty at 0x42010 for 8
    [  594.398756487,7] HIOMAP: Flushed writes
    [  594.409596439,7] HIOMAP: Marked flash dirty at 0x42018 for 3970
    [  594.419897507,7] HIOMAP: Flushed writes

  In this case HIOMAP sent data with block count=0 and hence the BMC didn't
  flush data to flash.

  Let's fix this issue by adjusting the block count before sending it to the
  BMC.
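
  The arithmetic behind the bug can be shown with a small sketch. It is not
  the code from the patch: ``blocks_truncated()`` and ``blocks_covering()``
  are hypothetical helpers, and the 4 KiB block size (shift of 12) is an
  assumption made for the example, since the log does not state the block
  size.

  .. code-block:: c

      #include <stdint.h>

      /* Truncating conversion: undercounts when len is not block aligned. */
      static uint64_t blocks_truncated(uint64_t len, unsigned int shift)
      {
              return len >> shift;
      }

      /* Count every block touched by the byte range [pos, pos + len). */
      static uint64_t blocks_covering(uint64_t pos, uint64_t len,
                                      unsigned int shift)
      {
              uint64_t first, last;

              if (len == 0)
                      return 0;

              first = pos >> shift;
              last = (pos + len - 1) >> shift;
              return last - first + 1;
      }

  Under that assumption, both writes in the sample log (8 bytes at 0x42010
  and 3970 bytes at 0x42018) truncate to a count of zero, while counting the
  blocks actually touched gives one block for each.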

- Fix hang in pnv_platform_error_reboot path due to TOD failure.

  On TOD failure, with the TB stuck, when Linux heads down the
  pnv_platform_error_reboot() path due to an unrecoverable HMI event, the
  panic CPU gets stuck in OPAL inside ipmi_queue_msg_sync(). At this time,
  all the other CPUs are in smp_handle_nmi_ipi() waiting for the panic CPU
  to proceed. But with the panic CPU stuck inside OPAL, Linux never recovers
  or reboots. ::

    p0 c1 t0
    NIA : 0x000000003001dd3c <.time_wait+0x64>
    CFAR : 0x000000003001dce4 <.time_wait+0xc>
    MSR : 0x9000000002803002
    LR : 0x000000003002ecf8 <.ipmi_queue_msg_sync+0xec>

    STACK: SP NIA
    0x0000000031c236e0 0x0000000031c23760 (big-endian)
    0x0000000031c23760 0x000000003002ecf8 <.ipmi_queue_msg_sync+0xec>
    0x0000000031c237f0 0x00000000300aa5f8 <.hiomap_queue_msg_sync+0x7c>
    0x0000000031c23880 0x00000000300aaadc <.hiomap_window_move+0x150>
    0x0000000031c23950 0x00000000300ab1d8 <.ipmi_hiomap_write+0xcc>
    0x0000000031c23a90 0x00000000300a7b18 <.blocklevel_raw_write+0xbc>
    0x0000000031c23b30 0x00000000300a7c34 <.blocklevel_write+0xfc>
    0x0000000031c23bf0 0x0000000030030be0 <.flash_nvram_write+0xd4>
    0x0000000031c23c90 0x000000003002c128 <.opal_write_nvram+0xd0>
    0x0000000031c23d20 0x00000000300051e4 <opal_entry+0x134>
    0xc000001fea6e7870 0xc0000000000a9060 <opal_nvram_write+0x80>
    0xc000001fea6e78c0 0xc000000000030b84 <nvram_write_os_partition+0x94>
    0xc000001fea6e7960 0xc0000000000310b0 <nvram_pstore_write+0xb0>
    0xc000001fea6e7990 0xc0000000004792d4 <pstore_dump+0x1d4>
    0xc000001fea6e7ad0 0xc00000000018a570 <kmsg_dump+0x140>
    0xc000001fea6e7b40 0xc000000000028e5c <panic_flush_kmsg_end+0x2c>
    0xc000001fea6e7b60 0xc0000000000a7168 <pnv_platform_error_reboot+0x68>
    0xc000001fea6e7bd0 0xc0000000000ac9b8 <hmi_event_handler+0x1d8>
    0xc000001fea6e7c80 0xc00000000012d6c8 <process_one_work+0x1b8>
    0xc000001fea6e7d20 0xc00000000012da28 <worker_thread+0x88>
    0xc000001fea6e7db0 0xc0000000001366f4 <kthread+0x164>
    0xc000001fea6e7e20 0xc00000000000b65c <ret_from_kernel_thread+0x5c>

  This is because there is a while loop towards the end of
  ipmi_queue_msg_sync() that keeps looping for as long as "sync_msg"
  matches "msg". It loops over time_wait_ms() until the exit condition is
  met. In the normal scenario time_wait_ms() runs the pollers so that the
  IPMI backend gets a chance to check the IPMI response and set sync_msg to
  NULL.

  .. code-block:: c

          while (sync_msg == msg)
                  time_wait_ms(10);

  But when the TB is in a failed state, time_wait_ms()->time_wait_poll()
  returns immediately without calling the pollers, and hence we end up
  looping forever. This patch fixes the hang by calling opal_run_pollers()
  in the TB failed state as well.
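
  The failure mode can be reproduced in a toy model. Everything below is a
  hypothetical stand-in rather than skiboot code: ``wait_ms()`` plays the
  role of the wait routine, ``run_pollers()`` the poller pass, and
  ``queue_msg_sync()`` the synchronous send, with an iteration cap added so
  that the "hang" shows up as a return value of -1 instead of an endless
  loop.

  .. code-block:: c

      #include <stdbool.h>
      #include <stddef.h>

      static bool tb_failed;         /* models "timebase is in failed state" */
      static int polls_until_reply;  /* poller runs until the fake reply */
      static const void *sync_msg;   /* message we are waiting on */

      /* One poller pass: after enough passes the reply "arrives". */
      static void run_pollers(void)
      {
              if (polls_until_reply > 0 && --polls_until_reply == 0)
                      sync_msg = NULL;
      }

      /* With a failed TB, only the fixed variant still runs the pollers. */
      static void wait_ms(unsigned int ms, bool poll_when_tb_failed)
      {
              (void)ms;
              if (tb_failed && !poll_when_tb_failed)
                      return;
              run_pollers();
      }

      /* Returns the number of waits needed, or -1 if the cap was hit. */
      static int queue_msg_sync(const void *msg, bool fixed, int max_iters)
      {
              int iters = 0;

              sync_msg = msg;
              while (sync_msg == msg) {
                      if (iters++ >= max_iters)
                              return -1;
                      wait_ms(10, fixed);
              }
              return iters;
      }

  With ``tb_failed`` set, the unfixed variant never sees ``sync_msg`` change
  and runs into the cap, while the variant that keeps polling completes once
  the fake reply arrives.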

- core/ipmi: Print correct netfn value

- core/lock: don't set bust_locks on lock error

  bust_locks is a big hammer that guarantees a mess if it's set while
  all other threads are not stopped.

  I propose removing this in the lock error paths. In debugging the
  previous deadlock false positive, none of the error messages printed,
  and the in-memory console was totally garbled due to lack of locking.

  I think it's generally better for debugging and system integrity to
  keep locks held when lock errors occur. Lock busting should be used
  carefully, just to allow messages to be printed out or machine to be
  restarted, probably when the whole system is single-threaded.

  Skiboot is slowly working toward that being feasible with co-operative
  debug APIs between firmware and host, but for the time being,
  difficult lock crashes are better not to corrupt everything by
  busting locks.