.. _skiboot-6.3-rc3:

skiboot-6.3-rc3
===============

skiboot v6.3-rc3 was released on Thursday May 2nd 2019. It is the third
release candidate of skiboot 6.3, which will become the new stable release
of skiboot following the 6.2 release, first released December 14th 2018.

Skiboot 6.3 will mark the basis for op-build v2.3. I expect to tag the final
skiboot 6.3 in the next week (I also predicted this last time, so take my
predictions with a large amount of sodium).

skiboot v6.3-rc3 contains all bug fixes as of :ref:`skiboot-6.0.19`,
and :ref:`skiboot-6.2.3` (the currently maintained
stable releases).

For how the skiboot stable releases work, see :ref:`stable-rules`.

Over :ref:`skiboot-6.3-rc2`, we have the following changes:


- Expose PNOR Flash partitions to host MTD driver via devicetree

  This makes it possible for the host to directly address each
  partition without requiring each application to directly parse
  the FFS headers.  This has been in use for some time already to
  allow BOOTKERNFW partition updates from the host.

  All partitions except BOOTKERNFW are marked readonly.

  The BOOTKERNFW partition is currently used exclusively by the Talos II
  platform.

- Write boot progress to LPC port 80h

  This is an adaptation of what we currently do for op_display() on FSP
  machines, inventing an encoding for what we can write into the single
  byte at LPC port 80h.

  Port 80h is often used on x86 systems to indicate boot progress/status
  and dates back a decent amount of time. Since a byte isn't exactly very
  expressive for everything that can go on (and wrong) during boot, it's
  all about compromise.

  Some systems (such as Zaius/Barreleye G2) have a physical dual 7-segment
  display that displays these codes. So far, this has only been driven by
  hostboot (see hostboot commit 90ec2e65314c).
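
  As a rough illustration of the compromise involved, the sketch below
  packs a boot stage and a severity into one byte. The field layout (two
  severity bits, six stage bits) and the names ``encode_port80`` and
  ``decode_port80`` are assumptions made up for this example; they are not
  the encoding skiboot actually uses: ::

    # Hypothetical layout: bits 7:6 = severity, bits 5:0 = boot stage.
    SEVERITIES = ("log", "warn", "error", "fatal")

    def encode_port80(stage, severity="log"):
        """Pack a 6-bit stage number and a 2-bit severity into one byte."""
        if not 0 <= stage < 64:
            raise ValueError("stage must fit in 6 bits")
        return (SEVERITIES.index(severity) << 6) | stage

    def decode_port80(byte):
        """Recover (stage, severity) from a byte made by encode_port80."""
        return byte & 0x3F, SEVERITIES[byte >> 6]

    code = encode_port80(0x12, "warn")
    print(f"{code:02x}")   # two hex digits, as a dual 7-segment display shows

  With only 64 stages and four severities available, anything finer
  grained has to be folded into a neighbouring code.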

- Write boot progress to LPC ports 81 and 82

  There's a thought to write more extensive boot progress codes to LPC
  ports 81 and 82 to supplement/replace any reliance on port 80.

  We want to still emit port 80 for platforms like Zaius and Barreleye
  that have the physical display. Ports 81 and 82 can be monitored by a
  BMC though.

- Copy and convert Romulus descriptors to Talos

  Talos II has some hardware differences from Romulus, therefore
  we cannot guarantee Talos II == Romulus in skiboot.  Copy and
  slightly modify the Romulus files for Talos II.

- npu2: Disable Probe-to-Invalid-Return-Modified-or-Owned snarfing by default

  V100 GPUs are known to violate the NVLink2 protocol in some cases (one is
  when memory was accessed by the CPU and then by the GPU using so-called
  block linear mapping) and issue double probes to the NPU, which can cope
  with this problem only if CONFIG_ENABLE_SNARF_CPM ("disable/enable
  Probe.I.MO snarfing a cp_m") is not set in the CQ_SM Misc Config
  register #0. If the bit is set (which is the case today), the NPU issues
  a machine check stop.

  The snarfing feature is designed to detect 2 probes in flight and combine
  them into one.

  This adds a new "opal-npu2-snarf-cpm" nvram variable which controls
  CONFIG_ENABLE_SNARF_CPM for all NVLinks to prevent the machine check
  stop from happening.

  This disables snarfing by default as otherwise a broken GPU driver can
  crash the entire box even when a GPU is passed through to a guest.
  This provides a dial to allow regression tests (might be useful on
  bare metal). To enable snarfing, the user needs to run: ::

    sudo nvram -p ibm,skiboot --update-config opal-npu2-snarf-cpm=enable

  and reboot the host system.
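
  The default-off behaviour can be pictured with the following sketch. It
  models the nvram config partition as a plain dictionary; the helper name
  ``snarf_cpm_enabled`` is hypothetical, and treating any value other than
  ``enable`` as disabled is an assumption of the example rather than a
  statement about how skiboot parses the variable: ::

    def snarf_cpm_enabled(nvram_config):
        """Return True only when the user explicitly opted in to snarfing."""
        # Assumption: a missing or unrecognised value leaves snarfing off.
        return nvram_config.get("opal-npu2-snarf-cpm") == "enable"

    print(snarf_cpm_enabled({}))                                 # unset
    print(snarf_cpm_enabled({"opal-npu2-snarf-cpm": "enable"}))  # opted in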

- hw/npu2: Show name of opencapi error interrupts
- core/pci: Use PHB io-base-location by default for PHB slots

  On witherspoon only the GPU slots and the three pluggable PCI slots
  (SLOT0, 1, 2) have platform defined slot names. For builtin devices such
  as the SATA controller or the PLX switch that fans out to the GPU slots
  we have no location codes which some people consider an issue.

  This patch addresses the problem by making the ibm,slot-location-code for
  the root port device default to the ibm,io-base-location-code which is
  typically the location code for the system itself.

  e.g. ::

    pciex@600c3c0100000/ibm,loc-code
                     "UOPWR.0000000-Node0-Proc0"

    pciex@600c3c0100000/pci@0/ibm,loc-code
                     "UOPWR.0000000-Node0-Proc0"

    pciex@600c3c0100000/pci@0/usb-xhci@0/ibm,loc-code
                     "UOPWR.0000000-Node0"

  The PHB node and the root complex nodes have a loc code of the
  processor they are attached to, while the usb-xhci device under the
  root port has a location code of the system itself.
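
  The fallback described above amounts to the following, sketched here
  with device-tree nodes modelled as dictionaries. The helper
  ``slot_location_code`` is a hypothetical name, not a skiboot function: ::

    def slot_location_code(root_port, system_root):
        """Prefer a platform-defined slot name, else the system location."""
        return root_port.get("ibm,slot-location-code",
                             system_root["ibm,io-base-location-code"])

    system = {"ibm,io-base-location-code": "UOPWR.0000000-Node0"}
    print(slot_location_code({}, system))
    print(slot_location_code({"ibm,slot-location-code": "SLOT0"}, system))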

- hw/phb4: Read ibm,loc-code from PBCQ node

  On P9 the PBCQs are subdivided by stacks which implement the PCI Express
  logic. When phb4 was forked from phb3 most of the properties that were
  in the pbcq node moved into the stack node, but ibm,loc-code was not one
  of them. This patch fixes the phb4 init sequence to read the base
  location code from the PBCQ node (parent of the stack node) rather than
  the stack node itself.
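
  A minimal sketch of the lookup, assuming a toy ``Node`` class with a
  ``parent`` link. The class, the helper ``base_loc_code`` and the node
  names are all invented for the example: ::

    class Node:
        """Toy device-tree node: a name, properties and a parent link."""

        def __init__(self, name, props=None, parent=None):
            self.name = name
            self.props = props or {}
            self.parent = parent

    def base_loc_code(stack_node):
        """Read ibm,loc-code from the PBCQ node, the parent of the stack."""
        return stack_node.parent.props["ibm,loc-code"]

    pbcq = Node("pbcq", {"ibm,loc-code": "UOPWR.0000000-Node0-Proc0"})
    stack = Node("stack", parent=pbcq)   # the stack carries no loc-code
    print(base_loc_code(stack))
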
- hw/xscom: add missing P9P chip name
- asm/head: balance branches to avoid link stack predictor mispredicts

  The Linux wrapper for OPAL call and return is arranged like this: ::

      __opal_call:
          mflr   r0
          std    r0,PPC_STK_LROFF(r1)
          LOAD_REG_ADDR(r11, opal_return)
          mtlr   r11
          hrfid  -> OPAL

      opal_return:
          ld     r0,PPC_STK_LROFF(r1)
          mtlr   r0
          blr

  When skiboot returns to Linux, it branches to LR (i.e., opal_return)
  with a blr. This unbalances the link stack predictor and will cause
  mispredicts back up the return stack.
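
  The effect can be seen in a toy model of a link stack (return address)
  predictor: each ``bl`` pushes a predicted return address and each ``blr``
  pops one. The model, the frame names and the way the balanced case adds
  a pairing ``bl`` are illustrative assumptions; real POWER9 predictor
  behaviour is more involved: ::

    class LinkStack:
        """Toy return-address predictor: bl pushes, blr pops and compares."""

        def __init__(self):
            self.stack = []
            self.mispredicts = 0

        def bl(self, return_addr):
            self.stack.append(return_addr)

        def blr(self, actual_target):
            predicted = self.stack.pop() if self.stack else None
            if predicted != actual_target:
                self.mispredicts += 1

    def run(balanced):
        ls = LinkStack()
        ls.bl("caller")           # some caller does bl into kernel_fn
        ls.bl("kernel_fn")        # kernel_fn does bl __opal_call
        # __opal_call enters OPAL with hrfid, which pushes nothing.
        if balanced:
            ls.bl("opal_return")  # assumed extra bl that pairs with the blr
        ls.blr("opal_return")     # skiboot returns to opal_return with blr
        ls.blr("kernel_fn")       # opal_return does blr back to kernel_fn
        ls.blr("caller")          # kernel_fn returns to its caller
        return ls.mispredicts

    print(run(balanced=False), run(balanced=True))

  In this model the unbalanced sequence mispredicts every return on the
  way back up, while the balanced one mispredicts none.
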
- external/mambo: also invoke readline for the non-autorun case
- asm/head.S: set POWER9 radix HID bit at entry

  When running in virtual memory mode, the radix MMU HID bit should not
  be changed, so set this in the initial boot SPR setup.

  As a side effect, fast reboot also has HID0:RADIX bit set by the
  shared spr init, so no need for an explicit call.
- opal-prd: Fix memory leak in is-fsp-system check
- opal-prd: Check malloc return value
- hw/phb4: Squash the IO bridge window

  The PCI-PCI bridge spec says that bridges that do not implement an IO
  window should hardcode the IO base and limit registers to zero.
  Unfortunately, these registers only define the upper bits of the IO
  window and the low bits are assumed to be 0 for the base and 1 for the
  limit address. As a result, setting both to zero can be misinterpreted
  as a 4K IO window.

  This patch fixes the problem the same way PHB3 does. It sets the IO base
  and limit values to 0xf000 and 0x1000 respectively which most software
  interprets as a disabled window.

  lspci before patch: ::

    0000:00:00.0 PCI bridge: IBM Device 04c1 (prog-if 00 [Normal decode])
            I/O behind bridge: 00000000-00000fff

  lspci after patch: ::

    0000:00:00.0 PCI bridge: IBM Device 04c1 (prog-if 00 [Normal decode])
            I/O behind bridge: None
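
  The misinterpretation follows from how the 8-bit IO base and limit
  registers are decoded. The sketch below assumes the standard PCI-to-PCI
  bridge layout (address bits 15:12 held in register bits 7:4, low 12 bits
  implied); the function name ``decode_io_window`` is made up for the
  example: ::

    def decode_io_window(base_reg, limit_reg):
        """Return (start, end) of the bridge IO window, or None if disabled."""
        start = (base_reg & 0xF0) << 8            # low 12 bits implied 0x000
        end = ((limit_reg & 0xF0) << 8) | 0xFFF   # low 12 bits implied 0xfff
        # A base above the limit is conventionally read as "no window".
        return (start, end) if start <= end else None

    print(decode_io_window(0x00, 0x00))   # both zero: looks like a 4K window
    print(decode_io_window(0xF0, 0x10))   # base 0xf000, limit 0x1000 setting

  The first call yields the 0x0000-0x0fff range that lspci reported before
  the patch; the second has its base above its limit and so decodes as
  disabled.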

- build: link with --orphan-handling=warn

  The linker can warn when the linker script does not explicitly place
  all sections. These orphan sections are placed according to
  heuristics, which may not always be desirable. Enable this warning.
- build: -fno-asynchronous-unwind-tables

  skiboot does not use unwind tables; this option saves about 100kB,
  mostly from .text.
- hw/xscom: Enable sw xstop by default on p9

  This was disabled at some point during bringup to make life easier for
  the lab folks trying to debug NVLink issues. This hack really should
  have never made it out into the wild though, so we now have the
  following situation occurring in the field:

  1) A bad thing happens.
  2) The host kernel receives an unrecoverable HMI and calls into OPAL to
     request a platform reboot.
  3) OPAL rejects the reboot attempt and returns to the kernel with
     OPAL_PARAMETER.
  4) Kernel panics and attempts to kexec into a kdump kernel.

  A side effect of the HMI seems to be CPUs becoming stuck which results
  in the initialisation of the kdump kernel taking an extremely long time
  (6+ hours). It's also been observed that after performing a dump the
  kdump kernel then crashes itself because OPAL has ended up in a bad
  state as a side effect of the HMI.

  All up, it's not very good so re-enable the software checkstop by
  default. If people still want to turn it off they can using the nvram
  override.
- opal/hmi: Initialize the hmi event with old value of TFMR.

  Do this before we fix TFAC errors. Otherwise the event at host console
  shows no thread error reported in TFMR register.

  Without this patch the console event shows TFMR with no thread error
  (DEC parity error TFMR[59] injection): ::

    [   53.737572] Severe Hypervisor Maintenance interrupt [Recovered]
    [   53.737596]  Error detail: Timer facility experienced an error
    [   53.737611]  HMER: 0840000000000000
    [   53.737621]  TFMR: 3212000870e04000

  After this patch it shows the old TFMR value on the host console: ::

    [ 2302.267271] Severe Hypervisor Maintenance interrupt [Recovered]
    [ 2302.267305]  Error detail: Timer facility experienced an error
    [ 2302.267320]  HMER: 0840000000000000
    [ 2302.267330]  TFMR: 3212000870e14010

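
  Comparing the two TFMR values shows which bits the old code was losing.
  The sketch below XORs them and reports the differing positions in IBM
  bit numbering (bit 0 is the most significant bit of the 64-bit
  register); the helper name ``ibm_bits_set`` is hypothetical: ::

    def ibm_bits_set(value, width=64):
        """List the set bits of value, numbered from the MSB (IBM style)."""
        return [bit for bit in range(width)
                if value & (1 << (width - 1 - bit))]

    before = 0x3212000870E04000   # TFMR reported without the patch
    after = 0x3212000870E14010    # TFMR reported with the patch
    print(ibm_bits_set(before ^ after))   # prints [47, 59]

  Bit 59 is the injected DEC parity error, which only appears in the
  second value.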