1.. _skiboot-6.3-rc1:
2
3skiboot-6.3-rc1
4===============
5
6skiboot v6.3-rc1 was released on Friday March 29th 2019. It is the first
7release candidate of skiboot 6.3, which will become the new stable release
8of skiboot following the 6.2 release, first released December 14th 2018.
9
10Skiboot 6.3 will mark the basis for op-build v2.3. I expect to tag the final
11skiboot 6.3 in the next week.
12
13skiboot v6.3-rc1 contains all bug fixes as of :ref:`skiboot-6.0.19`,
14and :ref:`skiboot-6.2.3` (the currently maintained
15stable releases).
16
For details on how the skiboot stable releases work, see :ref:`stable-rules`.
18
19This release has been a longer cycle than typical for a variety of reasons. It
20also contains a lot of cleanup work and minor bug fixes (much like skiboot 6.2
21did).
22
23Over skiboot 6.2, we have the following changes:
24
25.. _skiboot-6.3-rc1-new-features:
26
27New Features
28------------
29
30- hw/imc: Enable opal calls to init/start/stop IMC Trace mode
31
  New OPAL APIs for the In-Memory Collection Counter infrastructure (IMC),
  including a new device type called OPAL_IMC_COUNTERS_TRACE.
34- xive: Add calls to save/restore the queues and VPs HW state
35
  To be able to support migration of guests using the XIVE native
  exploitation mode (where the queue is effectively owned by the
  guest), KVM needs to be able to save and restore the HW-modified
  fields of the queue, such as the current queue producer pointer and
  generation bit, and to retrieve the modified thread context registers
  of the VP from the NVT structure: the VP interrupt pending bits.
42
43  However, there is no need to set back the NVT structure on P9. P10
44  should be the same.
45- witherspoon: Add nvlink2 interconnect information
46
47  GPUs on Redbud and Sequoia platforms are interconnected in groups of
48  2 or 3 GPUs. The problem with that is if the user decides to pass a single
49  GPU from a group to the userspace, we need to ensure that links between
50  GPUs do not get enabled.
51
52  A V100 GPU provides a way to disable selected links. In order to only
53  disable links to peer GPUs, we need a topology map.
54
55  This adds an "ibm,nvlink-peers" property to a GPU DT node with phandles
56  of peer GPUs and NVLink2 bridges. The index in the property is a GPU link
57  number.
58- platforms/romulus: Also support talos
59
60  The two are similar enough and I'd like to have a slot table for our
61  Talos.
62- OpenCAPI support! (see :ref:`skiboot-6.3-rc1-OpenCAPI` section)
63- opal/hmi: set a flag to inform OS that TOD/TB has failed.
64
  Set a flag to inform the OS about TOD/TB failure as part of the new
  opal_handle_hmi2 handler. This flag can then be used by the OS to make sure
  functions depending on the TB value (e.g. udelay()) are aware of the TB not
  ticking.
69- astbmc: Enable IPMI HIOMAP for AMI platforms
70
71  Required for Habanero, Palmetto and Romulus.
72- power-mgmt : occ : Add 'freq-domain-mask' DT property
73
  Add a new device-tree property, freq-domain-mask, to define groups of
  CPUs which share the same frequency. This property has been added under
  the power-mgmt node. It is a bitmask.

  A bitwise AND is taken between this bitmask value and the PIR of a CPU. All
  the CPUs lying in the same frequency domain will have the same result for
  the AND.

  For example, for POWER9, 0xFFF0 indicates a quad-wide frequency domain.
  Taking the AND with the PIR of the CPUs yields a frequency domain with a
  quad-wise distribution, as the last 4 bits, which represent the cores, have
  been masked.

  Similarly, 0xFFF8 will represent a core-wide frequency domain for P8.

  Also, add a new device-tree property, domain-runs-at, which denotes the
  strategy the OCC is using to change the frequency of a frequency domain.
  There can be two strategies: FREQ_MOST_RECENTLY_SET and FREQ_MAX_IN_DOMAIN.
91
92  FREQ_MOST_RECENTLY_SET : the OCC sets the frequency of the quad to the most
93  recent frequency value requested by the CPUs in the quad.
94
95  FREQ_MAX_IN_DOMAIN : the OCC sets the frequency of the CPUs in
96  the Quad to the maximum of the latest frequency requested by each of
97  the component cores.
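The mask arithmetic described in the frequency-domain entry above can be sketched in C as follows. This is only an illustration: the helper names ``freq_domain_id()`` and ``same_freq_domain()`` are hypothetical and are not taken from skiboot or Linux, and the PIR values in the usage note are made up.

```c
#include <stdint.h>

/*
 * Hypothetical helper: the frequency-domain id of a CPU is its PIR
 * ANDed with the bitmask exported in the device tree.
 */
static inline uint32_t freq_domain_id(uint32_t pir, uint32_t mask)
{
	return pir & mask;
}

/* Two CPUs share a frequency domain when their masked PIRs are equal. */
static inline int same_freq_domain(uint32_t pir_a, uint32_t pir_b,
				   uint32_t mask)
{
	return freq_domain_id(pir_a, mask) == freq_domain_id(pir_b, mask);
}
```

With the 0xFFF0 mask, PIRs 0x0804 and 0x080c fall into the same domain (both mask to 0x0800), while 0x0804 and 0x0814 do not.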
98- powercap: occ: Fix the powercapping range allowed for user
99
  OCC provides two limits for minimum powercap. One is the hard powercap
  minimum, which is guaranteed by OCC, and the other is a soft
  powercap minimum, which is lower than hard-min and may or may not be
  asserted due to various power-thermal reasons. To allow users
  to access the entire powercap range, this patch exports the soft powercap
  minimum as the "powercap-min" DT property. It also adds a new
  DT property called "powercap-hard-min" to export the hard-min powercap
  limit.
108- Add NVDIMM support
109
  NVDIMMs are memory modules that use a battery backup system to allow the
  contents of RAM to be saved to non-volatile storage if system power goes
  away unexpectedly. This allows them to be used as a high-performance
  storage device, suitable for serving as a cache for SSDs and the like.
114
115  Configuration of NVDIMMs is handled by hostboot and communicated to OPAL
116  via the HDAT. We need to parse out the NVDIMM memory ranges and create
117  memory regions with the "pmem-region" compatible label to make them
118  available to the host.
119- core/exceptions: implement support for MCE interrupts in powersave
120
  The ISA specifies that MCE interrupts in power saving modes will enter
  at 0x200 with powersave bits in SRR1 set. This is not currently
  supported properly: the MCE will just happen like a normal interrupt,
  but GPRs (e.g., r1, r2, r13) could be lost, which would lead to
  crashes.
126
127  So check the power save bits similarly to the sreset vector, and
128  handle this properly.
129- core/exceptions: allow recoverable sreset exceptions
130
131  This requires implementing the MSR[RI] bit. Then just allow all
132  non-fatal sreset exceptions to recover.
133- core/exceptions: implement an exception handler for non-powersave sresets
134
135  Detect non-powersave sresets and send them to the normal exception
136  handler which prints registers and stack.
137- Add PVR_TYPE_P9P
138
139  Enable a new PVR to get us running on another p9 variant.
140
141Deprecated/Removed Features
142---------------------------
143
144- opal: Deprecate reading the PHB status
145
146  The OPAL_PCI_EEH_FREEZE_STATUS call takes a bunch of parameters, one of
147  them is @phb_status. It is defined as __be64* and always NULL in
148  the current Linux upstream but if anyone ever decides to read that status,
149  then the PHB3's handler will assume it is struct OpalIoPhb3ErrorData*
150  (which is a lot bigger than 8 bytes) and zero it causing the stack
151  corruption; p7ioc-phb has the same issue.
152
153  This removes @phb_status from all eeh_freeze_status() hooks and moves
154  the error message from PHB4 to the affected OPAL handlers.
155
156  As far as we can tell, nobody has ever used this and thus it's safe to remove.
157- Remove POWER9N DD1 support
158
159  This is not a shipping product and is no longer supported by Linux
160  or other firmware components.
161
162General
163-------
164
165- core/i2c: Various bits of refactoring
166- refactor backtrace generation infrastructure
167- astbmc: Handle failure to initialise raw flash
168
  Initialising raw flash led to a dead assignment to rc. Check the return
170  code and take the failure path as necessary. Both before and after the
171  fix we see output along the lines of the following when flash_init()
172  fails: ::
173
174    [   53.283182881,7] IRQ: Registering 0800..0ff7 ops @0x300d4b98 (data 0x3052b9d8)
175    [   53.283184335,7] IRQ: Registering 0ff8..0fff ops @0x300d4bc8 (data 0x3052b9d8)
176    [   53.283185513,7] PHB#0000: Initializing PHB...
177    [   53.288260827,4] FLASH: Can't load resource id:0. No system flash found
178    [   53.288354442,4] FLASH: Can't load resource id:1. No system flash found
179    [   53.342933439,3] CAPP: Error loading ucode lid. index=200ea
180    [   53.462749486,2] NVRAM: Failed to load
181    [   53.462819095,2] NVRAM: Failed to load
182    [   53.462894236,2] NVRAM: Failed to load
183    [   53.462967071,2] NVRAM: Failed to load
184    [   53.463033077,2] NVRAM: Failed to load
185    [   53.463144847,2] NVRAM: Failed to load
186
187  Eventually followed by: ::
188
189    [   57.216942479,5] INIT: platform wait for kernel load failed
190    [   57.217051132,5] INIT: Assuming kernel at 0x20000000
191    [   57.217127508,3] INIT: ELF header not found. Assuming raw binary.
192    [   57.217249886,2] NVRAM: Failed to load
193    [   57.221294487,0] FATAL: Kernel is zeros, can't execute!
194    [   57.221397429,0] Assert fail: core/init.c:615:0
195    [   57.221471414,0] Aborting!
196    CPU 0028 Backtrace:
197     S: 0000000031d43c60 R: 000000003001b274   ._abort+0x4c
198     S: 0000000031d43ce0 R: 000000003001b2f0   .assert_fail+0x34
199     S: 0000000031d43d60 R: 0000000030014814   .load_and_boot_kernel+0xae4
200     S: 0000000031d43e30 R: 0000000030015164   .main_cpu_entry+0x680
201     S: 0000000031d43f00 R: 0000000030002718   boot_entry+0x1c0
202     --- OPAL boot ---
203
  Analysis of the execution paths suggests we'll always "safely" end this
  way due to the setup sequence for the blocklevel callbacks in flash_init()
  and error handling in blocklevel_get_info(), and there's no current risk
  of executing from unexpected memory locations. As such the issue is
  reduced to a fix for poor error hygiene in the original change
  and a resolution for a Coverity warning (famous last words etc).
210- core/flash: Retry requests as necessary in flash_load_resource()
211
212  We would like to successfully boot if we have a dependency on the BMC
  for flash even if the BMC is not currently ready to service flash
214  requests. On the assumption that it will become ready, retry for several
215  minutes to cover a BMC reboot cycle and *eventually* rather than
216  *immediately* crash out with: ::
217
218        [  269.549748] reboot: Restarting system
219        [  390.297462587,5] OPAL: Reboot request...
220        [  390.297737995,5] RESET: Initiating fast reboot 1...
221        [  391.074707590,5] Clearing unused memory:
222        [  391.075198880,5] PCI: Clearing all devices...
223        [  391.075201618,7] Clearing region 201ffe000000-201fff800000
224        [  391.086235699,5] PCI: Resetting PHBs and training links...
225        [  391.254089525,3] FFS: Error 17 reading flash header
226        [  391.254159668,3] FLASH: Can't open ffs handle: 17
227        [  392.307245135,5] PCI: Probing slots...
228        [  392.363723191,5] PCI Summary:
229        ...
230        [  393.423255262,5] OCC: All Chip Rdy after 0 ms
231        [  393.453092828,5] INIT: Starting kernel at 0x20000000, fdt at
232        0x30800a88 390645 bytes
233        [  393.453202605,0] FATAL: Kernel is zeros, can't execute!
234        [  393.453247064,0] Assert fail: core/init.c:593:0
235        [  393.453289682,0] Aborting!
236        CPU 0040 Backtrace:
237         S: 0000000031e03ca0 R: 000000003001af60   ._abort+0x4c
238         S: 0000000031e03d20 R: 000000003001afdc   .assert_fail+0x34
239         S: 0000000031e03da0 R: 00000000300146d8   .load_and_boot_kernel+0xb30
240         S: 0000000031e03e70 R: 0000000030026cf0   .fast_reboot_entry+0x39c
241         S: 0000000031e03f00 R: 0000000030002a4c   fast_reset_entry+0x2c
242         --- OPAL boot ---
243
244  The OPAL flash API hooks directly into the blocklevel layer, so there's
245  no delay for e.g. the host kernel, just for asynchronously loaded
246  resources during boot.
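The retry behaviour described in the flash_load_resource() entry above can be sketched as a bounded retry loop. This is a minimal sketch under stated assumptions: ``load_with_retry()``, the callback types, and the ``fake_bmc`` stand-in are all hypothetical names invented for illustration, and skiboot's real code is structured differently.

```c
/* Hypothetical callback types; skiboot's real interfaces differ. */
typedef int (*load_fn)(void *ctx);		/* returns 0 on success */
typedef void (*wait_fn)(unsigned int ms);

/*
 * Keep retrying the load until it succeeds or the time budget is spent.
 * Returns 0 on success, otherwise the last error from try_load().
 */
static int load_with_retry(load_fn try_load, void *ctx, wait_fn wait_ms,
			   unsigned int step_ms, unsigned int budget_ms)
{
	unsigned int waited = 0;
	int rc;

	if (step_ms == 0)
		step_ms = 1;	/* guarantee forward progress */

	while ((rc = try_load(ctx)) != 0 && waited < budget_ms) {
		wait_ms(step_ms);
		waited += step_ms;
	}
	return rc;
}

/* Stand-in for a BMC that becomes ready after a number of attempts. */
struct fake_bmc {
	int ready_after;
	int attempts;
};

static int fake_load(void *ctx)
{
	struct fake_bmc *bmc = ctx;

	return ++bmc->attempts >= bmc->ready_after ? 0 : -1;
}

/* A real implementation would sleep; the stand-in returns at once. */
static void fake_wait(unsigned int ms)
{
	(void)ms;
}
```

The point of the budget is that a BMC that never comes back still leads to the eventual failure shown in the log above, only later.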
247- fast-reboot: occ: Call occ_pstates_init() on fast-reset on all machines
248
  Commit 815417dcda2e ("init, occ: Initialise OCC earlier on BMC systems")
  conditionally invoked occ_pstates_init() only on FSP based systems in
  load_and_boot_kernel(). As a result, the pstate table is re-parsed on FSP
  systems and skipped on BMC systems during fast-reboot. This patch fixes
  this by invoking occ_pstates_init() on all machines during fast-reboot.
254- opal/hmi: Don't retry TOD recovery if it is already in failed state.
255
  On TOD failure, all cores/threads receive an HMI. The very first thread
  that gets the interrupt fixes the TOD, whereas the others just reset their
  respective HMER error bit and return. But when the TOD is unrecoverable,
  all the threads try to do TOD recovery one by one, causing threads to spend
  more time inside OPAL. Set a global flag when the TOD is unrecoverable so
  that the rest of the threads go back to Linux immediately, avoiding lockups
  in the system reboot/panic path.
263- hw/bt: Do not disable ipmi message retry during OPAL boot
264
  Currently OPAL doesn't know whether the BMC is functioning or not. If the
  BMC is down (e.g. during a BMC reboot), we keep retrying to send messages
  to the BMC. So in some corner cases we may hit a hard lockup in the kernel.

  Ideally we should avoid using the synchronous path as much as possible. For
  now, commit 01f977c3 added an option to disable message retry in the
  synchronous path. But this fix is not required during boot. Hence, do not
  disable IPMI message retry during OPAL boot.
273- hdata/memory: Fix warning message
274
  Even though we added memory to the device tree, we were getting the warning below. ::
276
277    [   57.136949696,3] Unable to use memory range 0 from MSAREA 0
278    [   57.137049753,3] Unable to use memory range 0 from MSAREA 1
279    [   57.137152335,3] Unable to use memory range 0 from MSAREA 2
280    [   57.137251218,3] Unable to use memory range 0 from MSAREA 3
281- hw/bt: Add backend interface to disable ipmi message retry option
282
  During boot OPAL makes an IPMI_GET_BT_CAPS call to the BMC to get the BT
  interface capabilities, which include the IPMI message max resend count,
  message timeout, etc. Most of the time OPAL gets a response from the BMC
  within the specified timeout. In some corner cases (like an mboxd daemon
  reset in the BMC, a BMC reboot, etc.) OPAL may not get a response within
  the timeout period. In such scenarios, OPAL resends the message until the
  max resend count is reached.

  OPAL uses synchronous IPMI messages (ipmi_queue_msg_sync()) for a few
  operations like flash read, write, etc. The thread will wait in OPAL until
  it gets a response from the BMC. In some corner cases like a BMC reboot,
  the thread may wait in OPAL for a long time (more than 20 seconds),
  resulting in a kernel hard lockup.

  This patch introduces a new interface to disable the message resend option.
  We will disable the message resend option for synchronous messages. This
  will greatly reduce kernel hard lockup issues.

  This is a short-term fix. The long-term solution is to convert all
  synchronous messages to asynchronous ones.
302- ipmi/power: Fix system reboot issue
303
  The kernel makes a reboot/shutdown OPAL call for reboot/shutdown. Once the
  kernel gets a response from OPAL it runs opal_poll_events() until firmware
  handles the request.

  On BMC based systems, OPAL makes an IPMI call (IPMI_CHASSIS_CONTROL) to
  initiate system reboot/shutdown. At present OPAL queues the IPMI message
  and returns success to the host. If the BMC is not ready to accept the
  command (e.g. during a BMC reboot), then the message will fail and we have
  to manually reboot/shut down the system using the BMC interface.

  This patch adds logic to validate the message return value. If the message
  failed, it is resent. At some stage the BMC will be ready to accept
  and handle the IPMI message.
317- firmware-versions: Add test case for parsing VERSION
318
319  Also make it possible to use with afl-lop/afl-fuzz just to help make
320  *sure* we're all good.
321
  Additionally, if we hit an entry in VERSION that is larger than our
323  buffer size, we skip over it gracefully rather than overwriting the
324  stack. This is only a problem if VERSION isn't trusted, which as of
325  4b8cc05a94513816d43fb8bd6178896b430af08f it is verified as part of
326  Secure Boot.
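The "skip oversized entries" behaviour from the firmware-versions entry above can be sketched as follows. This is a hedged illustration, not skiboot's parser: ``parse_entries()`` and ``ENTRY_BUF_SIZE`` are hypothetical, the buffer size is deliberately tiny, and the newline-separated layout of the input is an assumption made for the example.

```c
#include <stddef.h>
#include <string.h>

#define ENTRY_BUF_SIZE 16	/* illustrative; not skiboot's real size */

/*
 * Walk a newline-separated blob and count the entries that fit in a
 * fixed-size buffer. Oversized entries are skipped rather than copied,
 * so the buffer can never be overrun. `last` receives the most recent
 * accepted entry.
 */
static size_t parse_entries(const char *data, size_t len,
			    char last[ENTRY_BUF_SIZE])
{
	size_t accepted = 0, start = 0, i;

	last[0] = '\0';
	for (i = 0; i <= len; i++) {
		size_t n;

		if (i < len && data[i] != '\n')
			continue;

		n = i - start;
		if (n > 0 && n < ENTRY_BUF_SIZE) {
			/* fits, including room for the terminating NUL */
			memcpy(last, data + start, n);
			last[n] = '\0';
			accepted++;
		}
		/* otherwise: empty or oversized entry, skip without copying */
		start = i + 1;
	}
	return accepted;
}
```

The length check happens before any copy, which is what makes the skip "graceful": an untrusted entry can at worst be ignored.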
327- core/fast-reboot: improve NMI handling during fast reset
328
  Improve sreset and MCE handling in fast reboot. Switch the HILE bit
  off before copying OPAL's exception vectors, so NMIs can be handled
  properly. Also disable MSR[ME] while the vectors are being overwritten.
332- core/cpu: HID update race
333
334  If the per-core HID register is updated concurrently by multiple
335  threads, updates can get lost. This has been observed during fast
336  reboot where the HILE bit does not get cleared on all cores, which
337  can cause machine check exception interrupts to crash.
338
339  Fix this by only updating HID on thread0.
340- SLW: Print verbose info on errors only
341
  Change print level from debug to warning for reporting
  bad EC_PPM_SPECIAL_WKUP_* scom values. To reduce clutter
  in the log, print only on error.
345
346IBM FSP based platforms
347-----------------------
348
349- platforms/firenze: Rework I2C controller fixups
350- platforms/zz: Re-enable LXVPD slot information parsing
351
  From memory this was disabled in the distant past since we were waiting
  for an update to the LXVPD format. It looks like that never happened,
  so re-enable it for the ZZ platform so that we can get PCI slot location
  codes on ZZ.
356
357HIOMAP
358------
359- astbmc: Try IPMI HIOMAP for P8
360
361  The HIOMAP protocol was developed after the release of P8 in preparation
362  for P9. As a consequence P9 always uses it, but it has rarely been
363  enabled for P8. P8DTU has recently added IPMI HIOMAP support to its BMC
364  firmware, so enable its use in skiboot with P8 machines. Doing so
365  requires some rework to ensure fallback works correctly as in the past
366  the fallback was to mbox, which will only work for P9.
367- libflash/ipmi-hiomap: Enforce message size for empty response
368
369  The protocol defines the response to the associated messages as empty
370  except for the command ID and sequence fields. If the BMC is returning
371  extra data consider the message malformed.
372- libflash/ipmi-hiomap: Remove unused close handling
373
374  Issuing a HIOMAP_C_CLOSE is not required by the protocol specification,
375  rather a close can be implicit in a subsequent
376  CREATE_{READ,WRITE}_WINDOW request. The implicit close provides an
377  opportunity to reduce LPC traffic and the implementation takes up that
378  optimisation, so remove the case from the IPMI callback handler.
379- libflash/ipmi-hiomap: Overhaul event handling
380
381  Reworking the event handling was inspired by a bug report by Vasant
382  where the host would get wedged on multiple flash access attempts in the
383  face of a persistent error state on the BMC-side. The cause of this bug
  was the early-exit based on ctx->update, which erroneously assumed that
385  all events had been completely handled in prior calls to
386  ipmi_hiomap_handle_events(). This is not true if e.g.
387  HIOMAP_E_DAEMON_READY is clear in the prior calls.
388
389  Regardless, there were other correctness and efficiency problems with
390  the handling strategy:
391
392  * Ack-able event state was not restored in the face of errors in the
393    process of re-establishing protocol state
394  * It forced needless window restoration with respect to the context in
395    which ipmi_hiomap_handle_events() was called.
396  * Tests for HIOMAP_E_DAEMON_READY and HIOMAP_E_FLASH_LOST were redundant
397    with the overhauled error handling introduced in the previous patch
398
399  Fix all of the above issues and add comments to explain the event
400  handling flow.
401- libflash/ipmi-hiomap: Overhaul error handling
402
403  The aim is to improve the robustness with respect to absence of the
404  BMC-side daemon. The current error handling roughly mirrors what was
405  done for the mailbox implementation, but there's room for improvement.
406
407  Errors are split into two classes, those that affect the transport state
408  and those that affect the window validity. From here, we push the
409  transport state error checks right to the bottom of the stack, to ensure
410  the link is known to be in a good state before any message is sent.
411  Window validity tests remain as they were in the hiomap_window_move()
412  and ipmi_hiomap_read() functions. Validity tests are not necessary in
413  the write and erase paths as we will receive an error response from the
414  BMC when performing a dirty or flush on an invalid window.
415
416  Recovery also remains as it was, done on entry to the blocklevel
417  callbacks. If an error state is encountered in the middle of an
418  operation no attempt is made to recover it on the spot, instead the
419  error is returned up the stack and the caller can choose how it wishes
420  to respond.
421- libflash/ipmi-hiomap: Fix leak of msg in callback
422
423POWER8
424------
425- hw/phb3/naples: Disable D-states
426
  Putting "Mellanox Technologies MT27700 Family [ConnectX-4] [15b3:1013]"
  (more precisely, the second of its 2 PCI functions, no matter in what
  order) into the D3 state causes EEH with the "PCT timeout" error.
  This has been noticed on garrison machines only; firestones do not
  seem to have this issue.

  This disables D-state changes for devices on root buses on Naples by
  installing a config space access filter (copied from PHB4).
435- cpufeatures: Always advertise POWER8NVL as DD2
436
  Despite the major version of PVR being 1 (0x004c0100) for POWER8NVL,
  these chips are functionally equivalent to P8/P8E DD2 levels.

  This advertises POWER8NVL as DD2. As a result, skiboot adds
  ibm,powerpc-cpu-features/processor-control-facility for such CPUs and
  the Linux kernel can use hypervisor doorbell messages to wake secondary
  threads; otherwise "KVM: CPU %d seems to be stuck" would appear because
  of missing LPCR_PECEDH.
445
446p8dtu Platform
447^^^^^^^^^^^^^^
448- p8dtu: Configure BMC graphics
449
  We can no longer read the values from the BMC in the way we have in the
451  past. Values were provided by Eric Chen of SMC.
452- p8dtu: Enable HIOMAP support
453
454Vesnin Platform
455^^^^^^^^^^^^^^^
456- platforms/vesnin: Disable PCIe port bifurcation
457
458  PCIe ports connected to CPU1 and CPU3 now work as x16 instead of x8x8.
459
460- Fix hang in pnv_platform_error_reboot path due to TOD failure.
461
  On TOD failure, with the TB stuck, when Linux heads down the
  pnv_platform_error_reboot() path due to an unrecoverable HMI event, the
  panic CPU gets stuck in OPAL inside ipmi_queue_msg_sync(). At this time,
  all the other CPUs are in smp_handle_nmi_ipi() waiting for the panic CPU to
  proceed. But with the panic CPU stuck inside OPAL, Linux never
  recovers/reboots. ::
467
468    p0 c1 t0
469    NIA : 0x000000003001dd3c <.time_wait+0x64>
470    CFAR : 0x000000003001dce4 <.time_wait+0xc>
471    MSR : 0x9000000002803002
472    LR : 0x000000003002ecf8 <.ipmi_queue_msg_sync+0xec>
473
474    STACK: SP NIA
475    0x0000000031c236e0 0x0000000031c23760 (big-endian)
476    0x0000000031c23760 0x000000003002ecf8 <.ipmi_queue_msg_sync+0xec>
477    0x0000000031c237f0 0x00000000300aa5f8 <.hiomap_queue_msg_sync+0x7c>
478    0x0000000031c23880 0x00000000300aaadc <.hiomap_window_move+0x150>
479    0x0000000031c23950 0x00000000300ab1d8 <.ipmi_hiomap_write+0xcc>
480    0x0000000031c23a90 0x00000000300a7b18 <.blocklevel_raw_write+0xbc>
481    0x0000000031c23b30 0x00000000300a7c34 <.blocklevel_write+0xfc>
482    0x0000000031c23bf0 0x0000000030030be0 <.flash_nvram_write+0xd4>
483    0x0000000031c23c90 0x000000003002c128 <.opal_write_nvram+0xd0>
484    0x0000000031c23d20 0x00000000300051e4 <opal_entry+0x134>
485    0xc000001fea6e7870 0xc0000000000a9060 <opal_nvram_write+0x80>
486    0xc000001fea6e78c0 0xc000000000030b84 <nvram_write_os_partition+0x94>
487    0xc000001fea6e7960 0xc0000000000310b0 <nvram_pstore_write+0xb0>
488    0xc000001fea6e7990 0xc0000000004792d4 <pstore_dump+0x1d4>
489    0xc000001fea6e7ad0 0xc00000000018a570 <kmsg_dump+0x140>
490    0xc000001fea6e7b40 0xc000000000028e5c <panic_flush_kmsg_end+0x2c>
491    0xc000001fea6e7b60 0xc0000000000a7168 <pnv_platform_error_reboot+0x68>
492    0xc000001fea6e7bd0 0xc0000000000ac9b8 <hmi_event_handler+0x1d8>
493    0xc000001fea6e7c80 0xc00000000012d6c8 <process_one_work+0x1b8>
494    0xc000001fea6e7d20 0xc00000000012da28 <worker_thread+0x88>
495    0xc000001fea6e7db0 0xc0000000001366f4 <kthread+0x164>
496    0xc000001fea6e7e20 0xc00000000000b65c <ret_from_kernel_thread+0x5c>
497
  This is because there is a while loop towards the end of
  ipmi_queue_msg_sync() which keeps looping until "sync_msg" no longer
  matches "msg". It loops over time_wait_ms() until the exit condition is
  met. In the normal scenario time_wait_ms() runs the pollers so that the
  IPMI backend gets a chance to check for the IPMI response and set sync_msg
  to NULL. ::
503
504            while (sync_msg == msg)
505                    time_wait_ms(10);
506
  But when the TB is in a failed state, time_wait_ms()->time_wait_poll()
  returns immediately without calling the pollers, and hence we end up looping
  forever. This patch fixes the hang by calling opal_run_pollers() in the TB
  failed state as well.
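The hang and its fix can be modelled in a few lines of C. This is a simplified model rather than skiboot code: ``run_pollers()``, ``wait_poll()``, ``queue_msg_sync()`` and the ``poll_on_tb_failure`` switch are hypothetical stand-ins, and an iteration cap replaces the real unbounded loop so that the broken case terminates.

```c
#include <stdbool.h>
#include <stddef.h>

/* Stand-ins for firmware state; all names here are hypothetical. */
static bool tb_failed;			/* timebase is known to be stuck */
static bool poll_on_tb_failure = true;	/* false models the old behaviour */
static int pending_polls;		/* polls until the backend answers */
static const void *sync_msg;		/* message currently being waited on */

/* Poller: after enough calls the backend completes the message. */
static void run_pollers(void)
{
	if (pending_polls > 0 && --pending_polls == 0)
		sync_msg = NULL;
}

/* Wait helper: with a failed timebase no delay can be measured. */
static void wait_poll(void)
{
	if (tb_failed) {
		if (poll_on_tb_failure)
			run_pollers();	/* the fix: still run the pollers */
		return;
	}
	/* normal path would spin on the timebase while polling */
	run_pollers();
}

/* Returns the number of loop iterations, or -1 if max_iters was hit. */
static int queue_msg_sync(const void *msg, int polls_needed, int max_iters)
{
	int iters = 0;

	sync_msg = msg;
	pending_polls = polls_needed;
	while (sync_msg == msg) {
		if (iters++ >= max_iters)
			return -1;	/* would loop forever in firmware */
		wait_poll();
	}
	return iters;
}
```

With ``tb_failed`` set and ``poll_on_tb_failure`` cleared, the loop never sees ``sync_msg`` change and only the iteration cap ends it; with polling enabled it exits as soon as the backend answers.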
511
512
513.. _skiboot-6.3-rc1-power9:
514
515POWER9
516------
517
518- Retry link training at PCIe GEN1 if presence detected but training repeatedly failed
519
520  Certain older PCIe 1.0 devices will not train unless the training process starts at GEN1 speeds.
521  As a last resort when a device will not train, fall back to GEN1 speed for the last training attempt.
522
523  This is verified to fix devices based on the Conexant CX23888 on the Talos II platform.
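The fallback policy in the GEN1 retry entry above can be sketched as below. This is an assumption-laden illustration: ``train_link()``, the ``train_fn`` callback, and the example device are hypothetical, and the real PHB4 code drives link training through its reset state machine rather than a simple loop.

```c
#include <stdbool.h>

enum link_speed { GEN1 = 1, GEN2, GEN3, GEN4 };

/* Hypothetical trainer callback: true if the link trained at `speed`. */
typedef bool (*train_fn)(enum link_speed speed, void *ctx);

/*
 * Attempt training up to max_attempts times. Every attempt but the last
 * starts at max_speed; the last one falls back to GEN1.
 * Returns the speed that trained, or 0 if all attempts failed.
 */
static int train_link(train_fn train, void *ctx,
		      enum link_speed max_speed, int max_attempts)
{
	int attempt;

	for (attempt = 1; attempt <= max_attempts; attempt++) {
		enum link_speed speed =
			(attempt == max_attempts) ? GEN1 : max_speed;

		if (train(speed, ctx))
			return speed;
	}
	return 0;
}

/* Example device that only trains when training starts at GEN1. */
static bool gen1_only_device(enum link_speed speed, void *ctx)
{
	int *calls = ctx;

	(*calls)++;
	return speed == GEN1;
}
```

Keeping the fallback to the final attempt means devices that train normally are never slowed down by it.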
524- hw/phb4: Drop FRESET_DEASSERT_DELAY state
525
526  The delay between the ASSERT_DELAY and DEASSERT_DELAY states is set to
527  one timebase tick. This state seems to have been a hold over from PHB3
528  where it was used to add a 1s delay between de-asserting PERST and
529  polling the link for the CAPI FPGA. There's no requirement for that here
530  since the link polling on PHB4 is a bit smarter so we should be fine.
531- hw/phb4: Factor out PERST control
532
  Some time ago Mikey added some code to work around a bug we found where a
  certain RAID card wouldn't come back again after a fast-reboot. The
  workaround is setting the Link Disable bit before asserting PERST and
  clearing it after de-asserting PERST.

  Currently we do this in the FRESET path, but not in the CRESET path.
  This patch moves the PERST control into its own function to reduce
  duplication and to ensure the workaround is applied in all circumstances.
541- hw/phb4: Remove FRESET presence check
542
543  When we do an freset the first step is to check if a card is present in
544  the slot. However, this only occurs when we enter phb4_freset() with the
545  slot state set to SLOT_NORMAL. This occurs in:
546
547  a) The creset path, and
548  b) When the OS manually requests an FRESET via an OPAL call.
549
550  (a) is problematic because in the boot path the generic code will put the
551  slot into FRESET_START manually before calling into phb4_freset(). This
552  can result in a situation where a device is detected on boot, but not
553  after a CRESET.
554
555  I've noticed this occurring on systems where the PHB's slot presence
556  detect signal is not wired to an adapter. In this situation we can rely
557  on the in-band presence mechanism, but the presence check will make
558  us exit before that has a chance to work.
559
  Additionally, if we enter from the CRESET path this early exit leaves
  the slot's PERST signal asserted. This isn't currently an issue,
562  but if we want to support hotplug of devices into the root port it will
563  be.
564- hw/phb4: Skip FRESET PERST when coming from CRESET
565
566  PERST is asserted at the beginning of the CRESET process to prevent
567  the downstream device from interacting with the host while the PHB logic
568  is being reset and re-initialised. There is at least a 100ms wait during
569  the CRESET processing so it's not necessary to wait this time again
570  in the FRESET handler.
571
572  This patch extends the delay after re-setting the PHB logic to extend
573  to the 250ms PERST wait period that we typically use and sets the
574  skip_perst flag so that we don't wait this time again in the FRESET
575  handler.
576- hw/phb4: Look for the hub-id from in the PBCQ node
577
578  The hub-id is stored in the PBCQ node rather than the stack node so we
579  never add it to the PHB node. This breaks the lxvpd slot lookup code
580  since the hub-id is encoded in the VPD record that we need to find the
581  slot information.
582- hdata/iohub: Look for IOVPD on P9
583
584  P8 and P9 use the same IO VPD setup, so we need to load the IOHUB VPD on
585  P9 systems too.
586
587CAPI2
588^^^^^
589- capp/phb4: Prevent HMI from getting triggered when disabling CAPP
590
  While disabling CAPP, an HMI gets triggered as soon as the ETU is put in
  reset mode. This happens because, before we can disable CAPP, it detects
  the PHB link going down and triggers an HMI requesting OPAL to perform
  CAPP recovery. This has the unintended side effect of spamming the
  OPAL logs with malfunction alert messages and may also confuse the
  user.

  To prevent this we mask the CAPP FIR error 'PHB Link Down' Bit(31)
  when we are disabling CAPP, just before we put the ETU in reset in
  phb4_creset(). Since bringing down the PHB link now won't
  trigger an HMI and CAPP recovery, we manually set the
  PHB4_CAPP_RECOVERY flag on the PHB to force recovery during creset.
603
604- phb4/capp: Implement sequence to disable CAPP and enable fast-reset
605
  We implement the h/w sequence to disable CAPP in disable_capi_mode() and
  with it also enable fast-reset for CAPI mode in phb4_set_capi_mode().

  The sequence to disable CAPP is executed in three phases. The first two
  phases are implemented in disable_capi_mode(), where we reset the CAPP
  registers followed by the PEC registers to their init values. The third and
  final phase is to reset the PHB CAPI Compare/Mask Register and
  is done in phb4_init_ioda3(). The reason to move the PHB reset to
  phb4_init_ioda3() is that by the time the OPAL PCI reset state machine
  reaches this function the PHB is already un-fenced and its
  configuration registers are accessible via MMIO.
617- capp/phb4: Force CAPP to PCIe mode during kernel shutdown
618
619  This patch introduces a new opal syncer for PHB4 named
620  phb4_host_sync_reset(). We register this opal syncer when CAPP is
621  activated successfully in phb4_set_capi_mode() so that it will be
622  called at kernel shutdown during fast-reset.
623
  During kernel shutdown the function will then repeatedly call
  phb->ops->set_capi_mode() to switch CAPP to PCIe mode. If
  set_capi_mode() returns OPAL_BUSY, which indicates that CAPP is
  still transitioning to the new state, it calls slot->ops.run_sm() to
  ensure that the OPAL slot reset state machine makes forward progress.
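The busy-retry pattern described for phb4_host_sync_reset() can be sketched as below. This is a hedged model only: ``struct fake_phb``, ``set_capi_mode_pcie()``, ``run_sm()`` and the ``RC_*`` return codes are hypothetical stand-ins for the real ``phb->ops->set_capi_mode()`` and ``slot->ops.run_sm()`` callbacks, whose signatures differ.

```c
/* Illustrative return codes standing in for OPAL_SUCCESS / OPAL_BUSY. */
enum { RC_SUCCESS = 0, RC_BUSY = -2 };

/* Minimal model of a PHB whose mode switch needs several SM steps. */
struct fake_phb {
	int steps_left;	/* state-machine steps until the switch completes */
	int sm_runs;	/* how often the state machine was driven */
};

/* Stand-in for set_capi_mode(): busy while the transition is pending. */
static int set_capi_mode_pcie(struct fake_phb *phb)
{
	return phb->steps_left > 0 ? RC_BUSY : RC_SUCCESS;
}

/* Stand-in for run_sm(): advances the slot reset state machine. */
static void run_sm(struct fake_phb *phb)
{
	if (phb->steps_left > 0)
		phb->steps_left--;
	phb->sm_runs++;
}

/* Keep requesting PCIe mode; while busy, drive the state machine. */
static int host_sync_reset(struct fake_phb *phb)
{
	int rc;

	while ((rc = set_capi_mode_pcie(phb)) == RC_BUSY)
		run_sm(phb);
	return rc;
}
```

Driving the state machine inside the loop is what guarantees forward progress; simply re-issuing the request would spin on the busy result.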
629
630
631Witherspoon Platform
632^^^^^^^^^^^^^^^^^^^^
633- platforms/witherspoon: Make PCIe shared slot error message more informative
634
635  If we're missing chips for some reason, we print a warning when configuring
636  the PCIe shared slot.
637
638  The warning doesn't really make it clear what "shared slot" is, and if it's
639  printed, it'll come right after a bunch of messages about NPU setup, so
640  let's clarify the message to explicitly mention PCI.
641- witherspoon: Add nvlink2 interconnect information
642
643  See :ref:`skiboot-6.3-rc1-new-features` for details.
644
645Zaius Platform
646^^^^^^^^^^^^^^
647
648- zaius: Add BMC description
649
650  Frederic reported that Zaius was failing with a NULL dereference when
651  trying to initialise IPMI HIOMAP. It turns out that the BMC wasn't
652  described at all, so add a description.
653
p9dsu Platform
655^^^^^^^^^^^^^^
656- p9dsu: Fix p9dsu default variant
657
658  Add the default when no riser_id is returned from the ipmi query.
659
  Allow a little more time for BMC reply and clean up some label strings.
661
662
663PCIe
664----
665
666See :ref:`skiboot-6.3-rc1-power9` for POWER9 specific PCIe changes.
667
668- core/pcie-slot: Don't bail early in the power on case
669
670  Exiting early in the power off case makes sense since we can't disable
  slot power (or assert PERST) for surprise hotplug slots. However, we
672  should not exit early in the power-on case since it's possible slot
673  power may have been disabled (or just not enabled at boot time).
674- firenze-pci: Always init slot info from LXVPD
675
  We can get slot information from the LXVPD without having power control
677  information about that slot. This patch changes the init path so that
678  we always override the add_properties() call rather than only when we
679  have power control information about the slot.
680- fsp/lxvpd: Print more LXVPD slot information
681
682  Useful to know since it changes the behaviour of the slot core.
683- core/pcie-slot: Set power state from the PWRCTL flag
684
685  For some reason we look at the power control indicator and use that to
686  determine if the slot is "off" rather than the power control flag that
687  is used to power down the slot.
688
689  While we're here change the default behaviour so that the slot is
690  assumed to be powered on if there's no slot capability, or if there's
691  no power control available.
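
  A minimal sketch of the resulting decision, assuming that "the PWRCTL flag"
  refers to the Power Controller Control bit of the PCIe Slot Control
  register. ``slot_power_is_on()`` is a hypothetical helper, not a skiboot
  function; the bit values are those of the PCI Express specification.

  ```c
  #include <stdbool.h>
  #include <stdint.h>

  /* PCIe Slot Capabilities: Power Controller Present. */
  #define SLTCAP_PCP	0x00000002u
  /* PCIe Slot Control: Power Controller Control, 1 = power off. */
  #define SLTCTL_PCC	0x0400u

  /* Hypothetical helper: derive the slot power state as described above. */
  static bool slot_power_is_on(bool has_slot_cap, uint32_t sltcap,
  			     uint16_t sltctl)
  {
  	/* No slot capability: assume the slot is powered on. */
  	if (!has_slot_cap)
  		return true;
  	/* No power controller: nothing could have switched the slot off. */
  	if (!(sltcap & SLTCAP_PCP))
  		return true;
  	/* Otherwise trust the control bit, not the power indicator. */
  	return !(sltctl & SLTCTL_PCC);
  }
  ```

  Note that the power indicator bits (0x0300 in Slot Control) are ignored
  on purpose in this sketch.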
692- core/pci: Increase the max slot string size
693
694  The maximum string length for the slot label / device location code in
695  the PCI summary is currently 32 characters. This results in some IBM
696  location codes being truncated due to their length, e.g. ::
697
698    PHB#0001:02:11.0 [SWDN]  SLOT=C11  x8
699    PHB#0001:13:00.0 [EP  ] *snip* LOC_CODE=U78D3.ND1.WZS004A-P1-C
700    PHB#0001:13:00.1 [EP  ] *snip* LOC_CODE=U78D3.ND1.WZS004A-P1-C
701    PHB#0001:13:00.2 [EP  ] *snip* LOC_CODE=U78D3.ND1.WZS004A-P1-C
702    PHB#0001:13:00.3 [EP  ] *snip* LOC_CODE=U78D3.ND1.WZS004A-P1-C
703
  This obscures the actual location of the card, and it looks bad. This
705  patch increases the maximum length of the label string to 80 characters
706  since that's the maximum length for a location code.
707
708
709
710.. _skiboot-6.3-rc1-OpenCAPI:
711
712OpenCAPI
713--------
714- npu2/hw-procedures: Fix parallel zcal for opencapi
715
716  For opencapi, we currently do impedance calibration when initializing
717  the PHY for the device, which could run in parallel if we have
718  multiple opencapi devices. But if 2 devices are on the same
719  obus, the 2 calibration sequences could overlap, which likely yields
720  bad results and is useless anyway since it only needs to be done once
721  per obus.
722
723  This patch splits the opencapi PHY reset in 2 parts:
724
725  - a 'init' part called serially at boot. That's when zcal is done. If
726    we have 2 devices on the same socket, the zcal won't be redone,
    since we're called serially and we'll see it has already been done for
728    the obus
729  - a 'reset' part called during fundamental reset as a prereq for link
730    training. It does the PHY setup for a set of lanes and the dccal.
731
732  The PHY team confirmed there's no dependency between zcal and the
733  other reset steps and it can be moved earlier.
734- npu2-hw-procedures: Fix zcal in mixed opencapi and nvlink mode
735
736  The zcal procedure needs to be run once per obus. We keep track of
737  which obus is already calibrated in an array indexed by the obus
738  number. However, the obus number is inferred from the brick index,
739  which works well for nvlink but not for opencapi.
740
741  Create an obus_index() function, which, from a device, returns the
742  correct obus index, irrespective of the device type.
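
  The calibrate-once rule from the two entries above could be modelled with a
  per-obus flag array. This is only an illustration: ``zcal_tracker`` and
  ``phy_init_sketch()`` are hypothetical names, ``MAX_OBUS`` is an arbitrary
  bound, and the obus index is passed in directly instead of being derived
  from a device the way obus_index() does.

  ```c
  #include <stdbool.h>

  #define MAX_OBUS 4	/* hypothetical bound, for illustration only */

  struct zcal_tracker {
  	bool done[MAX_OBUS];	/* true once the obus has been calibrated */
  	int runs;		/* how many calibrations were actually run */
  };

  /*
   * Called serially at boot, once per device. The calibration itself is
   * represented by incrementing a counter.
   */
  static void phy_init_sketch(struct zcal_tracker *t, int obus)
  {
  	if (obus < 0 || obus >= MAX_OBUS)
  		return;
  	if (t->done[obus])
  		return;		/* another device on this obus already did it */
  	t->runs++;
  	t->done[obus] = true;
  }
  ```

  Two devices on the same obus then trigger a single calibration, while a
  device on another obus triggers its own.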
743- npu2-opencapi: Fix adapter reset when using 2 adapters
744
745  If two opencapi adapters are on the same obus, we may try to train the
746  two links in parallel at boot time, when all the PCI links are being
747  trained. Both links use the same i2c controller to handle the reset
748  signal, so some care is needed to make sure resetting one doesn't
749  interfere with the reset of the other. We need to keep track of the
750  current state of the i2c controller (and use locking).
751
752  This went mostly unnoticed as you need to have 2 opencapi cards on the
753  same socket and links tended to train anyway because of the retries.
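
  One way to picture the fix is a cached copy of the controller output that
  is only modified under a lock, so that changing one adapter's reset bit
  never clobbers the other's. The sketch below is an assumption-laden model:
  ``reset_ctrl``, ``reset_ctrl_init()`` and ``set_adapter_reset()`` are
  hypothetical, the bit layout and polarity are invented, and a C11 spin
  flag stands in for the skiboot lock.

  ```c
  #include <stdatomic.h>
  #include <stdint.h>

  struct reset_ctrl {
  	atomic_flag lock;	/* serialises access to the shared controller */
  	uint8_t state;		/* cached output; bit n set = adapter n in reset */
  	int writes;		/* counts the stand-in i2c writes */
  };

  static void reset_ctrl_init(struct reset_ctrl *c)
  {
  	atomic_flag_clear(&c->lock);
  	c->state = 0;
  	c->writes = 0;
  }

  /* Change one adapter's reset bit without disturbing the others. */
  static void set_adapter_reset(struct reset_ctrl *c, unsigned int adapter,
  			      int in_reset)
  {
  	uint8_t bit;

  	if (adapter >= 8)
  		return;
  	bit = (uint8_t)(1u << adapter);

  	while (atomic_flag_test_and_set(&c->lock))
  		;	/* spin until the controller is free */

  	if (in_reset)
  		c->state |= bit;
  	else
  		c->state &= (uint8_t)~bit;
  	c->writes++;	/* stand-in for writing the whole byte over i2c */

  	atomic_flag_clear(&c->lock);
  }
  ```

  Releasing adapter 0 from reset leaves adapter 1's bit untouched because
  every update starts from the cached state.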
754- npu2-opencapi: Extend delay after releasing reset on adapter
755
756  Give more time to the FPGA to process the reset signal. The previous
757  delay, 5ms, is too short for newer adapters with bigger FPGAs. Extend
758  it to 250ms.
759  Ultimately, that delay will likely end up being added to the opencapi
760  specification, but we are not there yet.
761- npu2-opencapi: ODL should be in reset when enabled
762
  We haven't hit any problem so far, but according to the ODL designer, the
  ODL should be in reset when it is enabled.
765
766  The ODL remains in reset until we start a fundamental reset to
767  initiate link training. We still assert and deassert the ODL reset
768  signal as part of the normal procedure just before training the
769  link. Asserting is therefore useless at boot, since the ODL is already
770  in reset, but we keep it as it's only a scom write and it's needed
771  when we reset/retrain from the OS.
772- npu2-opencapi: Keep ODL and adapter in reset at the same time
773
774  Split the function to assert and deassert the reset signal on the ODL,
775  so that we can keep the ODL in reset while we reset the adapter,
776  therefore having a window where both sides are in reset.
777
778  It is actually not required with our current DLx at boot time, but I
779  need to split the ODL reset function for the following patch and it
780  will become useful/required later when we introduce resetting an
781  opencapi link from the OS.
782- npu2-opencapi: Setup perf counters to detect CRC errors
783
784  It's possible to set up performance counters for the PLL to detect
785  various conditions for the links in nvlink or opencapi mode. Since
786  those counters are currently unused, let's configure them when an obus
787  is in opencapi mode to detect CRC errors on the link. Each link has
  two counters:

  - CRC error detected by the host
  - CRC error detected by the DLx (NAK received by the host)
791
792  We also dump the counters shortly after the link trains, but they can
793  be read multiple times through cronus, pdbg or linux. The counters are
794  configured to be reset after each read.
795
796NVLINK2
797-------
798- npu2: Allow ATSD for LPAR other than 0
799
800  Each XTS MMIO ATSD# register is accompanied by another register -
801  XTS MMIO ATSD0 LPARID# - which controls LPID filtering for ATSD
802  transactions.
803
804  When a host system passes a GPU through to a guest, we need to enable
805  some ATSD for an LPAR. At the moment the host assigns one ATSD to
806  a NVLink bridge and this maps it to an LPAR when GPU is assigned to
807  the LPAR. The link number is used for an ATSD index.
808
  ATSD6&7 stay mapped to the host (LPAR=0) all the time, which seems to be an
810  acceptable price for the simplicity.
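
  The mapping scheme could be sketched as below. The counts are assumptions
  inferred from the text (ATSD0 to ATSD7, with link numbers 0 to 5), and
  ``atsd_map`` and ``atsd_assign_link()`` are hypothetical names rather than
  skiboot code.

  ```c
  #include <stdint.h>

  #define NR_ATSD		8	/* assumption: ATSD0..ATSD7 */
  #define NR_LINKS	6	/* assumption: link numbers 0..5 */

  /* LPAR that each ATSD register is filtered to; 0 means the host. */
  struct atsd_map {
  	uint32_t lparid[NR_ATSD];
  };

  /* Use the link number as the ATSD index when a GPU goes to an LPAR. */
  static int atsd_assign_link(struct atsd_map *m, unsigned int link,
  			    uint32_t lparid)
  {
  	if (link >= NR_LINKS)
  		return -1;	/* the last two ATSDs always stay with the host */
  	m->lparid[link] = lparid;
  	return 0;
  }
  ```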
811- npu2: Add XTS_BDF_MAP wildcard refcount
812
813  Currently PID wildcard is programmed into the NPU once and never cleared
814  up. This works for the bare metal as MSR does not change while the host
815  OS is running.
816
  However, with device virtualization, we need to keep track of wildcard
  entry use and clear the entries before switching a GPU from a host to
  a guest or vice versa.
820
  This adds a refcount to the NPU2, one counter per wildcard entry. The index
822  is a short lparid (4 bits long) which is allocated in opal_npu_map_lpar()
823  and should be smaller than NPU2_XTS_BDF_MAP_SIZE (defined as 16).
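
  A rough sketch of such per-entry reference counting follows. Only
  ``NPU2_XTS_BDF_MAP_SIZE`` comes from the text; ``wildcard_refs``,
  ``wildcard_get()`` and ``wildcard_put()`` are hypothetical, and the
  ``programmed`` flag stands in for the actual NPU register write.

  ```c
  #include <stdbool.h>

  #define NPU2_XTS_BDF_MAP_SIZE 16

  struct wildcard_refs {
  	unsigned int count[NPU2_XTS_BDF_MAP_SIZE];
  	bool programmed[NPU2_XTS_BDF_MAP_SIZE];	/* stand-in for the HW entry */
  };

  /* Take a reference; program the wildcard entry on first use. */
  static int wildcard_get(struct wildcard_refs *r, unsigned int lparshort)
  {
  	if (lparshort >= NPU2_XTS_BDF_MAP_SIZE)
  		return -1;
  	if (r->count[lparshort]++ == 0)
  		r->programmed[lparshort] = true;
  	return 0;
  }

  /* Drop a reference; clear the wildcard entry when the last user goes. */
  static int wildcard_put(struct wildcard_refs *r, unsigned int lparshort)
  {
  	if (lparshort >= NPU2_XTS_BDF_MAP_SIZE || r->count[lparshort] == 0)
  		return -1;
  	if (--r->count[lparshort] == 0)
  		r->programmed[lparshort] = false;
  	return 0;
  }
  ```

  The entry stays programmed while at least one user holds it and is cleared
  only when the count drops back to zero.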
824
825
826
827Debugging and simulation
828------------------------
829
830- external/mambo: Error out if kernel is too large
831
832  If you're trying to boot a gigantic kernel in mambo (which you can
833  reproduce by building a kernel with CONFIG_MODULES=n) you'll get
834  misleading errors like: ::
835
836    WARNING: 0: (0): [0:0]: Invalid/unsupported instr 0x00000000[INVALID]
837    WARNING: 0: (0):  PC(EA): 0x0000000030000010 PC(RA):0x0000000030000010 MSR: 0x9000000000000000 LR: 0x0000000000000000
838    WARNING: 0: (0):  numInstructions = 0
839    WARNING: 1: (1): [0:0]: Invalid/unsupported instr 0x00000000[INVALID]
840    WARNING: 1: (1):  PC(EA): 0x0000000000000E40 PC(RA):0x0000000000000E40 MSR: 0x9000000000000000 LR: 0x0000000000000000
841    WARNING: 1: (1):  numInstructions = 1
842    WARNING: 1: (1): Interrupt to 0x0000000000000E40 from 0x0000000000000E40
843    INFO: 1: (2): ** Execution stopped: Continuous Interrupt, Instruction caused exception,  **
844
845  So add an error to skiboot.tcl to warn the user before this happens.
846  Making PAYLOAD_ADDR further back is one way to do this but if there's a
847  less gross way to generally work around this very niche problem, I can
848  suggest that instead.
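
  The kind of size check involved can be illustrated with a little address
  arithmetic. This is a guess at the logic rather than the check in
  skiboot.tcl (which is Tcl, not C): ``SKIBOOT_ADDR`` is an assumed load
  address for skiboot, chosen because the warnings above fault at
  0x30000010, and ``kernel_fits()`` is a hypothetical helper.

  ```c
  #include <stdbool.h>
  #include <stdint.h>

  #define PAYLOAD_ADDR	0x20000000ull	/* kernel load address in skiboot.tcl */
  #define SKIBOOT_ADDR	0x30000000ull	/* assumption: where skiboot is loaded */

  /* The kernel image must end before the region assumed to hold skiboot. */
  static bool kernel_fits(uint64_t kernel_size)
  {
  	return kernel_size <= SKIBOOT_ADDR - PAYLOAD_ADDR;
  }
  ```

  Under these assumptions the limit is 0x10000000 bytes (256MB); anything
  larger would overlap the firmware image.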
849- external/mambo: Populate kernel-base-address in the DT
850
  skiboot.tcl defines PAYLOAD_ADDR as 0x20000000, which is also the default
  in skiboot unless kernel-base-address is set in the device tree.
854
855  If you change PAYLOAD_ADDR to something else for mambo, skiboot won't
856  see it because it doesn't set that DT property, so fix it so that it does.
857- external/mambo: allow CPU targeting for most debug utils
858
  Debug util functions target CPU 0:0:0 by default. Some can be
  overridden explicitly per invocation, and others can't at all.
  Even for those that can be overridden, it is a pain to type
  them out when you're debugging a particular thread.

  Provide a new 'target' function that allows the default CPU
  target to be changed. Wire up that default to all other utils.
  Provide a new 'S' step command which only steps the target CPU.
867- qemu: bt device isn't always hanging off /
868
869  Just use the normal for_each_compatible instead.
870
871  Otherwise in the qemu model as executed by op-test,
872  we wouldn't go down the astbmc_init() path, thus not having flash.
873- devicetree: Add p9-simics.dts
874
875  Add a p9-based devicetree that's suitable for use with Simics.
876- devicetree: Move power9-phb4.dts
877
878  Clean up the formatting of power9-phb4.dts and move it to
879  external/devicetree/p9.dts. This sets us up to include it as the basis
880  for other trees.
881- devicetree: Add nx node to power9-phb4.dts
882
883  A (non-qemu) p9 without an nx node will assert in p9_darn_init(): ::
884
885      dt_for_each_compatible(dt_root, nx, "ibm,power9-nx")
886              break;
887      if (!nx) {
888              if (!dt_node_is_compatible(dt_root, "qemu,powernv"))
                      assert(nx);
890              return;
891      }
892
893  Since NX is this essential, add it to the device tree.
894- devicetree: Fix typo in power9-phb4.dts
895
896  Change "impi" to "ipmi".
897- devicetree: Fix syntax error in power9-phb4.dts
898
899  Remove the extra space causing this: ::
900
901      Error: power9-phb4.dts:156.15-16 syntax error
902      FATAL ERROR: Unable to parse input tree
903- core/init: enable machine check on secondaries
904
  Secondary CPUs currently run with MSR[ME]=0 during boot, which means
906  if they take a machine check, the system will checkstop.
907
908  Enable ME where possible and allow them to print registers.
909
910Utilities
911---------
912- pflash: Don't try update RO ToC
913
914  In the future it's likely the ToC will be marked as read-only. Don't
  error out by assuming it's writable.
916- pflash: Support encoding/decoding ECC'd partitions
917
918  With the new --ecc option, pflash can add/remove ECC when
919  reading/writing flash partitions protected by ECC.
920
921  This is *not* flawless with current PNORs out in the wild though, as
922  they do not typically fill the whole partition with valid ECC data, so
923  you have to know how big the valid ECC'd data is and specify the size
  manually. Note that for some partitions this is practically impossible
925  without knowing the details of the content of the partition.
926
927  A future patch is likely to introduce an option to "stop reading data
928  when ECC starts failing and assume everything is okay rather than error
929  out" to support reading the "valid" data from existing PNOR images.
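
  To give a feel for the size bookkeeping this requires, the sketch below
  converts between raw and ECC-protected lengths. It assumes the usual PNOR
  layout of one ECC byte following every 8 data bytes; ``with_ecc_len()`` and
  ``without_ecc_len()`` are hypothetical helpers, not pflash or libflash
  functions.

  ```c
  #include <stdint.h>

  #define DATA_BYTES_PER_ECC_BYTE 8	/* assumption: 8 data bytes + 1 ECC byte */

  /* Bytes occupied on flash once ECC is added (data rounded up to 8). */
  static uint64_t with_ecc_len(uint64_t data_len)
  {
  	uint64_t words = (data_len + DATA_BYTES_PER_ECC_BYTE - 1) /
  			 DATA_BYTES_PER_ECC_BYTE;

  	return words * (DATA_BYTES_PER_ECC_BYTE + 1);
  }

  /* Data bytes recoverable from a run of ECC-protected bytes. */
  static uint64_t without_ecc_len(uint64_t ecc_len)
  {
  	return ecc_len / (DATA_BYTES_PER_ECC_BYTE + 1) *
  	       DATA_BYTES_PER_ECC_BYTE;
  }
  ```

  Under this assumption, 4096 bytes of data occupy 4608 bytes on flash, which
  is why the size of the valid ECC'd data has to be known precisely.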
930
931